Understanding XGBoost, GridSearchCV, and R-squared in Machine Learning

Apr 11, 2024, 5:09 PM

Explain XGBoost, GridSearchCV, and R-squared.

XGBoost

XGBoost stands for eXtreme Gradient Boosting. It's a decision-tree-based ensemble machine learning algorithm that uses a gradient boosting framework. It's designed for speed and performance.

GridSearchCV

GridSearchCV is a method used to tune hyperparameters of a model. It exhaustively tries every combination of the supplied parameter values, cross-validating each one to determine which combination gives the best performance.
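GridSearchCV works with any scikit-learn-compatible estimator. A small illustrative sketch using ridge regression (the estimator and grid values here are arbitrary choices, not from the session above):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, 0.5, -1.0, 2.0]) + rng.normal(scale=0.2, size=100)

# Every value in param_grid is scored with 3-fold cross-validation
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}
grid = GridSearchCV(Ridge(), param_grid, cv=3, scoring="r2")
grid.fit(X, y)

print(grid.best_params_)  # the combination with the highest mean CV score
print(grid.best_score_)   # that combination's mean cross-validated R-squared
```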

R-squared

R-squared is a statistical measure that represents the proportion of the variance for a dependent variable that's explained by an independent variable or variables in a regression model. It indicates how well the data fit the regression model (the higher the value, the better the fit).
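Concretely, R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. A short sketch checking the hand computation against scikit-learn's `r2_score` (the sample values are arbitrary):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

assert np.isclose(r2_manual, r2_score(y_true, y_pred))
print(round(r2_manual, 4))  # → 0.9486
```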

Use xgboost to predict MEDV using the other features in the Boston house price data. Get the data from Sklearn. Then, use GridSearchCV to choose the max depth and the learning rate.

Summary of XGBoost Model Training with GridSearchCV

  • Data Preparation: Loaded the California housing dataset as a substitute for the Boston dataset, due to the latter's removal from sklearn. The dataset includes features like median income, house age, and average rooms, with the target variable being the median house value (the analogue of MEDV).
  • Data Splitting: The dataset was split into training (80%) and testing (20%) sets.
  • Model Training: Used XGBoost with GridSearchCV to optimize the hyperparameters 'max_depth' and 'learning_rate'. The GridSearchCV was set up with 5-fold cross-validation and focused on maximizing the R-squared score.
  • Results: The GridSearchCV process fitted 16 different combinations of parameters, totaling 80 fits. The best parameters and R-squared score will be retrieved and displayed next.

Best Model Parameters and Performance

  • Best Parameters: {'learning_rate': 0.2, 'max_depth': 6}
  • Best R-squared Score: 0.8343
    This indicates a strong predictive performance of the model on the training data.

What are the R-squared values on the training data and on the test data?

XGBoost Model Evaluation

  • R-squared on Training Data: 0.9285
  • R-squared on Testing Data: 0.8352
    These results indicate that the model fits the training data very well and also generalizes effectively to new, unseen data.