This article is part of our series about how different types of data scientists build similar models differently. No two people are the same, and neither are any two data scientists. The circumstances under which a data challenge has to be handled also change constantly. For these reasons, different approaches can and will be used to complete the task at hand. In our series we explore the four different approaches of our data scientists: Meta Oric, Aki Razzi, Andy Stand, and Eqaan Librium. They are given the task of building a model that predicts whether employees of a company, STARDATAPEPS, will look for a new job or not. Based on their distinct profiles, discussed in the first blog, you can already imagine that their approaches will be quite different.
In the previous article Meta Oric’s aim was to quickly create a default XGBoost, and therefore she stuck with the default settings. In this article Aki Razzi searches for better model performance by tuning the hyperparameters. Before we discuss how she does that, let me first remind you of who Aki Razzi is:

Aki has won multiple Kaggle competitions, since her models achieve the highest possible performance. Time and resources do not matter that much to her. Hail the almighty accuracy, precision and recall. She does not care whether a technique is easy to explain or not. Similarly, she is no stranger to using ensemble models to achieve the near-perfect performance as well as very convoluted feature engineering techniques.
First of all, Aki imports the necessary packages, among them ‘xgboost’, which enables her to create an XGBoost model:
pip install xgboost
# Importing packages and settings:
import warnings
warnings.filterwarnings(action= 'ignore')
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_curve
import joblib
import matplotlib.pyplot as plt  # used by the plotting code further below
Second, Aki loads the dataset. A bit of preparation on this data was done, as described here. The target variable ‘target’ indicates whether a data scientist in this historic dataset has left the company. All other columns in the dataset are possible predictors of whether a data scientist is likely to leave the company soon.
# Loading the data:
df_prep = pd.read_csv('https://bhciaaablob.blob.core.windows.net/featurenegineeringfiles/df_prepared.csv')
df = df_prep.drop(columns=['Unnamed: 0','city', 'experience', 'enrollee_id'])
df.head()
| | city_development_index | gender | enrolled_university | education_level | major_discipline | company_size | company_type | last_new_job | training_hours | target | ind_relevent_experience | experience_num | city name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.920 | Male | no_enrollment | Graduate | STEM | NaN | NaN | 1 | 36 | 1 | 1 | 22.0 | Denver-Aurora-Lakewood CO (Metro) |
| 1 | 0.776 | Male | no_enrollment | Graduate | STEM | 50-99 | Pvt Ltd | >4 | 47 | 0 | 0 | 15.0 | Odessa TX (Metro) |
| 2 | 0.624 | NaN | Full time course | Graduate | STEM | NaN | NaN | never | 83 | 0 | 0 | 5.0 | Auburn-Opelika AL (Metro) |
| 3 | 0.789 | NaN | NaN | Graduate | Business Degree | NaN | Pvt Ltd | never | 52 | 1 | 0 | 0.0 | Corvallis OR (Metro) |
| 4 | 0.767 | Male | no_enrollment | Masters | STEM | 50-99 | Funded Startup | 4 | 8 | 0 | 1 | 22.0 | Tulsa OK (Metro) |
Aki’s aim is to assess if she can improve Meta’s default XGBoost by tuning the hyperparameters. To make a fair comparison she performs the same data preparation steps as Meta did. She imputes the missing values, converts the categorical variables into dummies and standardizes the numerical variables. In addition, just like Meta did, Aki also uses pipelines to prep the data, to train her XGBoost, and to make predictions.
In the code below Aki separates the target from the features, creates a train and test dataset, and creates the pipelines to prep the features:
# Define the target vector y
y = df['target']
# Creating a dataset without the DV:
X = df.drop('target', axis = 1)
# Split X and y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1121218
)
# Creating an object with the column labels of only the categorical features and one with only the numeric features:
categorical_features = X.select_dtypes(exclude="number").columns.tolist()
numeric_features = X.select_dtypes(include="number").columns.tolist()
# Create the categorical pipeline: missing values are imputed with the constant 'unknown' and the variables are then one-hot encoded:
categorical_pipeline = Pipeline(
    steps=[
        ("impute", SimpleImputer(strategy= 'constant', fill_value= 'unknown')),
        # sparse_output is the argument name in scikit-learn >= 1.2; older versions use sparse=False
        ("one-hot", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
    ]
)
# Create the numeric pipeline: missing values are imputed with the column mean and the variables are then standardized, so that each feature has a mean of 0 and a variance of 1:
numeric_pipeline = Pipeline(
    steps=[
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", StandardScaler())
    ]
)
# Combining the two pipelines with a column transformer:
full_processor = ColumnTransformer(
    transformers=[
        ("numeric", numeric_pipeline, numeric_features),
        ("categorical", categorical_pipeline, categorical_features),
    ]
)
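To see what such a preprocessing step produces, here is a minimal sketch on a hypothetical four-row dataset. The toy values and the `toy_` names are assumptions made for illustration only; the structure mirrors the pipelines above, and `sparse_threshold=0` is used so that the column transformer returns a dense array regardless of the encoder's output type.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric and one categorical column, each with a missing value.
toy = pd.DataFrame({
    "training_hours": [10.0, 20.0, np.nan, 30.0],
    "gender": pd.Series(["Male", np.nan, "Female", "Male"], dtype=object),
})

# Numeric columns: impute with the mean, then standardize.
toy_numeric = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
# Categorical columns: impute with the constant 'unknown', then one-hot encode.
toy_categorical = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="constant", fill_value="unknown")),
    ("one-hot", OneHotEncoder(handle_unknown="ignore")),
])
toy_processor = ColumnTransformer(
    transformers=[
        ("numeric", toy_numeric, ["training_hours"]),
        ("categorical", toy_categorical, ["gender"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

toy_out = toy_processor.fit_transform(toy)
print(toy_out.shape)     # one scaled column plus one dummy per category
print(toy_out.round(3))
```

The missing category becomes its own ‘unknown’ dummy column, so no rows are dropped and the model can still learn from the fact that a value was missing.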
Next, Aki recreates Meta’s default XGBoost:
# Instantiate the XGBClassifier:
xgb_cl = xgb.XGBClassifier(eval_metric='logloss', seed=7)
# Create XGBoost pipeline:
xgb_pipeline = Pipeline(steps=[
    ('preprocess', full_processor),
    ('model', xgb_cl)
])
# Evaluate the model with the use of cv:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)
scores = cross_val_score(xgb_pipeline, X_train, y_train, cv=cv, scoring = 'roc_auc')
print("roc_auc = %f (%f)" % (scores.mean(), scores.std()))
With only the default parameters and no hyperparameter tuning, Meta’s XGBoost got a ROC AUC score of 0.7915. As you can see below, XGBoost has quite a lot of hyperparameters that Aki can tune to try to improve Meta’s default XGBoost.
# The default hyperparameters of the XGBoost:
xgb_cl
After this introduction you can imagine that Aki is not yet satisfied with an XGBoost that uses only the default parameters. She attempts to improve Meta’s default XGBoost with the GridSearchCV function from the scikit-learn package. GridSearchCV accepts a list of candidate values for each provided hyperparameter and fits a separate model on the given data for every combination of those values. The performance of each combination is evaluated, after which the best performing model can easily be selected. Thus, GridSearchCV enables Aki to tune multiple hyperparameters at once. It is not feasible, however, to tune all hyperparameters in a single search, because that results in far too many models. Aki optimizes eight hyperparameters with 5-fold cross validation. If she tuned them all in one grid search with, for example, five candidate values each, the grid would contain 5x5x5x5x5x5x5x5 = 390,625 combinations, and because each combination is fitted once per fold, a total of 1,953,125 models would be trained. Hence, Aki tunes the model in multiple steps.
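The growth in the number of model fits can be checked with a few lines of arithmetic. The helper below is a hypothetical illustration, not part of Aki’s code; it assumes every hyperparameter gets the same number of candidate values.

```python
def n_model_fits(values_per_param: int, n_params: int, n_folds: int) -> int:
    """Number of model fits for an exhaustive grid search with k-fold cross validation."""
    return values_per_param ** n_params * n_folds

# Five candidate values per hyperparameter and 5-fold cross validation:
for n_params in (2, 4, 7, 8):
    print(n_params, "hyperparameters ->", n_model_fits(5, n_params, 5), "fits")
```

Every additional hyperparameter multiplies the work by the number of candidate values, which is why a single exhaustive search quickly becomes impractical.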
As Meta did in the previous article, Aki evaluates each model created in the grid search based on ROC AUC score. She does so with the use of this function:
def print_results_gridsearch(gridsearch, list_param1, list_param2, name_param1, name_param2):
    # Checking the results from each run in the gridsearch:
    means = gridsearch.cv_results_['mean_test_score']
    stds = gridsearch.cv_results_['std_test_score']
    params = gridsearch.cv_results_['params']
    print("The results from each run in the gridsearch:")
    for mean, stdev, param in zip(means, stds, params):
        print("roc_auc = %f (%f) with: %r" % (mean, stdev, param))
    # Visualizing the results from each run in the gridsearch:
    scores = np.array(means).reshape(len(list_param1), len(list_param2))
    for i, value in enumerate(list_param1):
        plt.plot(list_param2, scores[i], label= str(name_param1) + ': ' + str(value))
    plt.legend()
    plt.xlabel(str(name_param2))
    plt.ylabel('ROC AUC')
    plt.show()
    # Checking the best performing model:
    print("\n")
    print("Best model: roc_auc = %f using %s" % (gridsearch.best_score_, gridsearch.best_params_))
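The reshape in this function relies on the order in which the grid search stores its results. scikit-learn enumerates a parameter grid with the parameter names sorted alphabetically and the last name varying fastest, so the first parameter passed to the function should be the one that comes first alphabetically. A small sketch with a hypothetical grid shows the ordering:

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Hypothetical grid, written in the same naming style as the grids in this article.
toy_grid = {
    "model__subsample": [0.8],
    "model__n_estimators": [50, 100, 150],
    "model__learning_rate": [0.01, 0.1],
}

combos = list(ParameterGrid(toy_grid))
for combo in combos:
    print(combo)

# Stand-in for the mean test scores: the position of each combination in the results.
positions = np.arange(len(combos)).reshape(2, 3)
print(positions)  # one row per learning_rate, one column per n_estimators
```

Here `model__learning_rate` sorts before `model__n_estimators`, so each row of the reshaped array belongs to one learning rate, regardless of the order in which the dictionary was written.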
Aki starts by searching for the optimum values of the learning rate and the number of estimators (n_estimators). She begins her search with the commonly used starting value of 0.8 for subsample and colsample_bytree and keeps all other parameters at their defaults.
In the code below you see that GridSearchCV takes a couple of parameters. First of all, it uses the created XGBoost pipeline: ‘xgb_pipeline’. Second, it uses the grid of values that will be tested in the grid search: ‘param_grid’. A third parameter is n_jobs, the number of jobs to run in parallel. Aki has set it to -1, meaning that the grid search will use all available processors. Fourth, Aki uses 5-fold cross validation to tune the hyperparameters. Finally, the scoring parameter sets the performance metric used in the cross validation to the ROC AUC score.
# Step 1: Searching for the optimum parameters for the learning rate and the number of estimators:
# Defining the parameter grid to be used in GridSearch:
param_grid = {"model__subsample": [0.8], "model__colsample_bytree": [0.8]
, "model__learning_rate": [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3]
, "model__n_estimators": range(50,500,50)
}
#instantiate the Grid Search:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)
grid_cv1 = GridSearchCV(xgb_pipeline
, param_grid
, n_jobs= -1
, cv = cv
, scoring="roc_auc")
# Fit
_ = grid_cv1.fit(X_train, y_train)
# Checking the results from each run in the gridsearch:
print_results_gridsearch(gridsearch=grid_cv1, list_param1 = param_grid["model__learning_rate"], list_param2 = param_grid["model__n_estimators"]
, name_param1 = 'learning_rate' , name_param2 = 'n_estimators')
By changing only the number of estimators and the learning rate, Aki already improves the ROC AUC score from 0.7915 for Meta’s XGBoost with default settings to 0.8037.
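The `model__` prefix in the parameter grid tells scikit-learn which pipeline step a hyperparameter belongs to: the step name, two underscores, then the parameter name. The sketch below illustrates the same pattern on a smaller scale. It is an assumption-laden stand-in: it uses synthetic data and a logistic regression instead of XGBoost, purely to keep the example light.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

toy_pipeline = Pipeline(steps=[
    ("preprocess", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),  # stand-in for the XGBoost step
])

# "model__C" addresses the parameter C of the pipeline step named "model".
toy_param_grid = {"model__C": [0.01, 0.1, 1.0, 10.0]}

toy_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)
toy_search = GridSearchCV(toy_pipeline, toy_param_grid, cv=toy_cv, scoring="roc_auc")
toy_search.fit(X_toy, y_toy)

print(toy_search.best_params_)
print(len(toy_search.cv_results_["params"]), "combinations evaluated")
```

Because the preprocessing sits inside the pipeline, it is refitted on the training part of every fold, so no information from the validation part leaks into the imputation or scaling.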
Next up, with the best values identified for the number of estimators and the learning rate, Aki continues with optimizing the parameters: max_depth and min_child_weight.
# Step 2: Searching for the optimum parameters for max_depth and min_child_weight:
# Defining the parameter grid to be used in GridSearch:
param_grid = {"model__subsample": [0.8], "model__colsample_bytree": [0.8], "model__learning_rate": [0.01], "model__n_estimators": [250]
, 'model__max_depth': range(3,10,2)
, 'model__min_child_weight': range(1,6,2)
}
#instantiate the Grid Search:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)
grid_cv2 = GridSearchCV(xgb_pipeline
, param_grid
, n_jobs=-1
, cv=cv
, scoring="roc_auc")
# Fit
_ = grid_cv2.fit(X_train, y_train)
# Checking the results from each run in the gridsearch:
print_results_gridsearch(gridsearch=grid_cv2, list_param1 = param_grid["model__max_depth"], list_param2 = param_grid["model__min_child_weight"]
, name_param1 = 'max_depth' , name_param2 = 'min_child_weight')
The search for optimal values for the maximum depth of a tree and the minimum child weight only changed the maximum depth from its default of six to five, and kept the minimum child weight at the default of 1. Because the parameters changed so little, the performance of the new best model is comparable to the previous one. The mean ROC AUC from the cross validation in this grid search is even slightly lower: it decreased from 0.8037 to 0.8036.
Third, Aki tries to improve the parameters: subsample and colsample_bytree.
# Step 3: Searching for the optimum parameters for subsample and colsample_bytree:
# Defining the parameter grid to be used in GridSearch:
param_grid = {"model__learning_rate": [0.01], "model__n_estimators": [250], 'model__max_depth': [5], 'model__min_child_weight': [1]
, 'model__subsample':[i/10.0 for i in range(4,10)]
, 'model__colsample_bytree':[i/10.0 for i in range(4,10)]
}
#instantiate the Grid Search:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)
grid_cv3 = GridSearchCV(xgb_pipeline
, param_grid
, n_jobs=-1
, cv=cv
, scoring="roc_auc")
# Fit
_ = grid_cv3.fit(X_train, y_train)
# Checking the results from each run in the gridsearch.
# The results are stored with the parameter names sorted alphabetically, so colsample_bytree is passed as the first parameter:
print_results_gridsearch(gridsearch=grid_cv3, list_param1 = param_grid["model__colsample_bytree"], list_param2 = param_grid["model__subsample"]
, name_param1 = 'colsample_bytree' , name_param2 = 'subsample')
The grid search above resulted in the optimum values of 0.5 for subsample and 0.9 for colsample_bytree. In other words, the grid search shows that the best model performance is achieved by constructing each tree on half of the records and 90% of the features. Tuning these parameters improves the model performance from a ROC AUC score of 0.8036 to 0.8039.
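What these two ratios mean for a single tree can be sketched with NumPy. This is only an illustration of the idea on a hypothetical dataset of 1000 rows and 20 features; it does not reproduce XGBoost’s internal sampling code.

```python
import numpy as np

rng = np.random.default_rng(7)

n_rows, n_cols = 1000, 20                 # hypothetical dataset size
subsample, colsample_bytree = 0.5, 0.9    # the values found in the grid search

X_demo = rng.normal(size=(n_rows, n_cols))

# Draw the rows and columns that one tree gets to see, without replacement.
row_idx = rng.choice(n_rows, size=round(subsample * n_rows), replace=False)
col_idx = rng.choice(n_cols, size=round(colsample_bytree * n_cols), replace=False)

X_tree = X_demo[np.ix_(row_idx, col_idx)]
print(X_tree.shape)
```

Since every tree is built on a different random subset, the individual trees are less correlated with each other, which tends to reduce overfitting.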
Finally, Aki tries to improve the model even further by tuning the gamma and the lambda.
# Step 4: Searching for the optimum parameters for gamma and lambda:
# Defining the parameter grid to be used in GridSearch:
param_grid = {"model__learning_rate": [0.01], "model__n_estimators": [250], 'model__max_depth': [5], 'model__min_child_weight': [1], 'model__subsample':[0.5], 'model__colsample_bytree':[0.9]
, "model__gamma": [i/10.0 for i in range(0,6)]
, "model__reg_lambda": [0, 0.5, 1, 1.5, 2, 3, 4.5]
}
#instantiate the Grid Search:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)
grid_cv4 = GridSearchCV(xgb_pipeline
, param_grid
, n_jobs=-1
, cv=cv
, scoring="roc_auc")
# Fit
_ = grid_cv4.fit(X_train, y_train)
# Checking the results from each run in the gridsearch:
print_results_gridsearch(gridsearch=grid_cv4, list_param1 = param_grid["model__gamma"], list_param2 = param_grid["model__reg_lambda"]
, name_param1 = 'Gamma' , name_param2 = 'Lambda')
This final grid search shows that the default values should not be changed. Therefore, Aki keeps gamma equal to 0 and lambda equal to 1, resulting in the same and final ROC AUC score of 0.8039. By tuning the model in four steps and searching for the optimal values of eight different hyperparameters, Aki manages to improve Meta’s default XGBoost from a ROC AUC score of 0.7915 to 0.8039. This results in the best set of hyperparameters, which are shown below.
grid_cv4.best_params_
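As a rough check on what the stepwise approach saves, the sketch below counts the combinations in the four grids used above and compares them with a single search over all eight parameters at once. The grid values are copied from the four steps; the dictionary layout and variable names are just for this illustration.

```python
from math import prod

# Candidate values per step, copied from the four grid searches above.
stage_grids = {
    "step 1": {"learning_rate": [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3],
               "n_estimators": list(range(50, 500, 50))},
    "step 2": {"max_depth": list(range(3, 10, 2)),
               "min_child_weight": list(range(1, 6, 2))},
    "step 3": {"subsample": [i / 10.0 for i in range(4, 10)],
               "colsample_bytree": [i / 10.0 for i in range(4, 10)]},
    "step 4": {"gamma": [i / 10.0 for i in range(0, 6)],
               "reg_lambda": [0, 0.5, 1, 1.5, 2, 3, 4.5]},
}
n_folds = 5

# Combinations per step, then totals for the stepwise and the exhaustive search.
combos_per_step = {name: prod(len(values) for values in grid.values())
                   for name, grid in stage_grids.items()}
stepwise_fits = sum(combos_per_step.values()) * n_folds
exhaustive_fits = prod(combos_per_step.values()) * n_folds

print(combos_per_step)
print("stepwise fits:  ", stepwise_fits)
print("exhaustive fits:", exhaustive_fits)
```

The price of this saving is that a stepwise search only explores a small slice of the full grid, so it is not guaranteed to find the combination an exhaustive search would find.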
Now that both Meta and Aki have found the final parameters for their XGBoost models, we can evaluate their models on the test set:
# Predict with Aki's final XGBoost with the best parameters resulting from the GridSearch:
y_pred_aki = grid_cv4.predict(X_test)
y_pred_prob_aki = grid_cv4.predict_proba(X_test)[::,1]
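# Evaluate. Note that y_pred_aki holds class labels from .predict(), so this score reflects a single decision threshold;
# passing y_pred_prob_aki instead would give the threshold-independent ROC AUC:
print("roc_auc_score:",metrics.roc_auc_score(y_test, y_pred_aki))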
# Fit Meta's default XGBoost pipeline:
xgb_pipeline.fit(X_train, y_train)
# Predict:
y_pred_meta = xgb_pipeline.predict(X_test)
y_pred_prob_meta = xgb_pipeline.predict_proba(X_test)[::,1]
# Evaluate:
print("roc_auc_score:",metrics.roc_auc_score(y_test, y_pred_meta))
# Compute the false positive rate and the true positive rate from the predicted probabilities:
fpr1 , tpr1, thresholds1 = roc_curve(y_test, y_pred_prob_aki)
fpr2 , tpr2, thresholds2 = roc_curve(y_test, y_pred_prob_meta)
# Calculate the area under the curve to display on the plot.
# Note: these areas are computed from the class labels, while the curves above are based on the probabilities:
roc_auc_aki = metrics.roc_auc_score(y_test, y_pred_aki)
roc_auc_meta = metrics.roc_auc_score(y_test, y_pred_meta)
# Now, plot the computed values
plt.plot([0,1],[0,1], 'k--')
plt.plot(fpr1, tpr1, label= '%s ROC (area = %0.2f)' % ("XGBoost Aki", roc_auc_aki))
plt.plot(fpr2, tpr2, label= '%s ROC (area = %0.2f)' % ("XGBoost Meta", roc_auc_meta))
plt.legend()
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.show()
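One detail of the evaluation code above deserves attention: `roc_auc_score` receives the class labels from `.predict()` rather than the probabilities from `.predict_proba()`. The two inputs generally give different scores, because class labels only describe the model at a single decision threshold. A small sketch with made-up values (all names and numbers below are hypothetical) shows the difference:

```python
from sklearn.metrics import roc_auc_score

# Made-up true labels and predicted probabilities for eight cases.
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1]
y_prob_demo = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]

# Class labels obtained with a threshold of 0.5 on the probabilities.
y_label_demo = [int(p >= 0.5) for p in y_prob_demo]

auc_from_probs = roc_auc_score(y_true_demo, y_prob_demo)
auc_from_labels = roc_auc_score(y_true_demo, y_label_demo)

print("ROC AUC from probabilities:", auc_from_probs)   # 0.9375
print("ROC AUC from class labels: ", auc_from_labels)  # 0.75
```

The `roc_auc` scorer used in the grid searches works on probabilities, so a test score computed from class labels cannot be compared one to one with the cross-validated scores.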
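As in the cross validation on the training set, Aki’s tuned XGBoost outperforms Meta’s default XGBoost on the test set. Aki’s tuning resulted in a ROC AUC score of 0.7149, compared to 0.6993 for Meta. Note that these test scores are computed from the predicted class labels rather than from the predicted probabilities, so they are not directly comparable to the cross-validated scores above. As we did for Meta in the previous article, we also save Aki’s model to be able to compare results later on. Just to be sure, we quickly test whether her model is saved correctly.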
#Saving Aki's final XGBoost pipeline:
best_pipe_aki = grid_cv4.best_estimator_
joblib.dump(best_pipe_aki, 'best_pipe_aki.joblib')
#Testing if Aki's model is correctly saved:
# Load the models:
upload_pipe_aki = joblib.load('best_pipe_aki.joblib')
# Use it to make the same predictions:
print(upload_pipe_aki.predict(X_test))
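Printing the predictions shows that the reloaded model works, but it does not prove that it behaves the same as the original. A stricter check compares the predictions of both objects directly. The sketch below assumes a small stand-in pipeline on synthetic data and a temporary file, so it can run on its own; with Aki’s objects the same comparison would use `best_pipe_aki` and `upload_pipe_aki`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in pipeline on synthetic data (not Aki's model).
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
demo_pipe = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
]).fit(X_demo, y_demo)

# Save the pipeline to a temporary file and load it again.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_pipe.joblib")
    joblib.dump(demo_pipe, path)
    reloaded_pipe = joblib.load(path)

# The reloaded pipeline should give exactly the same predictions.
same_predictions = np.array_equal(demo_pipe.predict(X_demo), reloaded_pipe.predict(X_demo))
print("Predictions identical after reload:", same_predictions)
```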
I hope you enjoyed reading this article and getting to know Aki. Her approach didn’t really focus on prepping the features; instead, she focused on improving Meta’s XGBoost by tuning the hyperparameters. In the upcoming articles we investigate whether we can improve her data preparation. Topics that we will look into are common data problems, dealing with high cardinality, and dealing with missing data. We will focus not only on improving model performance, but also on improving the interpretability and explainability of the models. This is something both Andy Stand and Eqaan Librium value very much when practicing data science.