Model Selection, Validation, and Hyperparameter Tuning

Edward Amor

In practice, a majority of the time dedicated to any data science project (unless you're lucky) is consumed by data cleaning and wrangling. However, once you've completed your data mining, cleaning, exploration, and feature engineering, the next step is generally some machine learning. The ML process is fairly standard regardless of the algorithm you choose: it will always require some model selection, model validation, and hyperparameter tuning. One of the easiest ways to wrap your mind around the process is through trial and error.

Logistic Regression

Logistic regression is a very well-known supervised binary classification algorithm. Unlike linear regression, where you predict a continuous value, logistic regression outputs either a 0/1 prediction or a probability (softmax regression extends it to more than two classes); a minimal sketch of this mapping follows the list below. Although logistic regression isn't as fancy as neural networks or natural language processing, it is still a significantly useful tool in any data scientist's toolbelt. Just like other classification algorithms, it can be used for:

  • Predicting Bank Loan Worthiness
  • Detecting Credit Card Fraud
  • Detecting Email Spam
  • And Many More
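For intuition, here is a minimal sketch (not part of the original workflow) of how logistic regression turns a linear combination of features into a probability via the sigmoid function; the weights, intercept, and feature vector below are made up purely for illustration.

# sketch: the sigmoid maps any real-valued score to the (0, 1) interval
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w = np.array([0.8, -1.2, 0.3])  # hypothetical learned coefficients
b = -0.5                        # hypothetical intercept
x = np.array([1.0, 2.0, 0.5])   # one hypothetical observation

p = sigmoid(np.dot(w, x) + b)   # predicted probability of class 1
label = int(p >= 0.5)           # threshold at 0.5 for a hard 0/1 prediction
print(f"probability={p:.3f}, predicted class={label}")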

To get a grasp of how model selection, model validation, and hyperparameter tuning work, we'll run through an example with a simple dataset from the UCI Machine Learning Repository.

Blood Transfusion Service Center Data Set

The dataset we'll be using is the Blood Transfusion Service Center Data Set, available from the UCI Machine Learning Repository. Below is some information about the data for those who would like to know; it was taken directly from the repository.

To demonstrate the RFMTC marketing model (a modified version of RFM), this study adopted the donor database of Blood Transfusion Service Center in Hsin-Chu City in Taiwan. The center passes its blood transfusion service bus to one university in Hsin-Chu City to gather blood donated about every three months. To build an FRMTC model, we selected 748 donors at random from the donor database. These 748 donor data, each one included R (Recency — months since the last donation), F (Frequency — total number of donation), M (Monetary — total blood donated in c.c.), T (Time — months since the first donation), and a binary variable representing whether he/she donated blood in March 2007 (1 stand for donating blood; 0 stands for not donating blood).

Attribute Information:

  • R (Recency — months since the last donation)
  • F (Frequency — total number of donation)
  • M (Monetary — total blood donated in c.c.)
  • T (Time — months since the first donation)
  • a binary variable representing whether he/she donated blood in March 2007 (1 stand for donating blood; 0 stands for not donating blood)

Original Owner and Donor
Prof. I-Cheng Yeh
Department of Information Management
Chung-Hua University,
Hsin Chu, Taiwan 30067, R.O.C.
e-mail: icyeh ‘at’ chu.edu.tw
TEL:886–3–5186511

Citation Request:

Yeh, I-Cheng, Yang, King-Jang, and Ting, Tao-Ming, "Knowledge discovery on RFM model using Bernoulli sequence," Expert Systems with Applications, 2008 (doi:10.1016/j.eswa.2008.07.018).

Exploratory Data Analysis

It's always a good habit, after extracting and cleaning your data, to perform some EDA to get a grasp of the main characteristics of the data you'll be working with. It also provides visuals that immensely assist with the preprocessing steps later on.

# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.metrics import classification_report, plot_roc_curve, plot_confusion_matrix, plot_precision_recall_curve
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline
%matplotlib inline
sns.set(font_scale=1.1)
SEED = 890432

# load data
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/blood-transfusion/transfusion.data"
df = pd.read_csv(url)
df.columns = ["r", "f", "m", "t", "y"] # rename the columns for brevity
df.info()
---------------------------- Output --------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 748 entries, 0 to 747
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   r       748 non-null    int64
 1   f       748 non-null    int64
 2   m       748 non-null    int64
 3   t       748 non-null    int64
 4   y       748 non-null    int64
dtypes: int64(5)
memory usage: 29.3 KB

In our case, the dataset is typed correctly and free of null values, since a majority of the processing has already been done, leaving just EDA and modeling to us. Since this is a classification problem, one of the first things to look at is the class distribution of the output; this tells us whether we should implement some downsampling/upsampling/hybrid sampling to adjust for imbalance.

# plot class distribution
plt.figure(figsize=(16, 9))
sns.countplot(x="y", data=df)
# adjust figure
ticks, _ = plt.xticks()
plt.xticks(ticks, ["Didn't donate in 2007 (0)", "Did donate in 2007 (1)"])
plt.xlabel("")
plt.ylabel("# of individuals")
plt.title("Class Distribution of Dependent Variable")
plt.show()

There is clearly a class imbalance, which we'll have to account for later in our modeling; this highlights why it's always important to visually explore your data. Next, we should determine whether any of our independent variables are correlated with each other; if so, we could apply some dimensionality reduction or remove redundant variables. We'll inspect this by plotting the pairwise relationships between the variables and viewing a heatmap of pairwise Pearson correlation values.

# view the correlation between variables
sns.pairplot(df, vars=["r", "f", "m", "t"], aspect=1.3, corner=True)
plt.show()

Simply looking at the histogram of each variable, we see they're all positively skewed, which is something else we'll adjust for before modeling by scaling our data. Along with that, the variables "f" and "m" appear to be strongly positively correlated, forming essentially a straight line. This makes sense: from our attribute information we know that frequency is "the total number of donation(s)" and monetary is "the total blood donated in c.c.", so as frequency increases, monetary should increase as well.
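As a quick numeric check on that skew (not in the original post), pandas can compute the sample skewness of each column directly; positive values confirm the right skew visible in the histograms.

# skewness of each independent variable; values well above 0 indicate positive skew
print(df[["r", "f", "m", "t"]].skew())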

# obtain the pairwise pearson correlation
correlation = df[["r", "f", "m", "t"]].corr()
mask = np.triu(np.ones_like(correlation)) # for removing upper diagonal of information
# plot heatmap of pearson correlation
plt.figure(figsize=(16, 9))
labels = ["Recency", "Frequency", "Monetary", "Time"]
sns.heatmap(correlation, annot=True, square=True, xticklabels=labels, yticklabels=labels, mask=mask)
plt.title("Pairwise Correlation Heatmap")
plt.show()

Our suspicions are confirmed: by the looks of it, our Frequency and Monetary variables are perfectly positively correlated. We'll remove one of these columns, as it would interfere with our modeling efforts down the line. It also appears that Frequency/Monetary is somewhat positively correlated with Time; my heuristic is to keep a variable as long as its absolute Pearson correlation is below .75, so in this case we'll simply stick to removing either frequency or monetary (a helper that automates this heuristic is sketched after the drop below).

# drop the monetary column
df.drop(columns=["m"], inplace=True, errors="ignore")
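For a dataset with many more features, a small helper along these lines could automate the same heuristic. This is only a sketch using the 0.75 threshold mentioned above, and the drop_correlated name is made up for illustration; it reuses the numpy import from earlier.

# sketch: drop one column from every pair whose absolute Pearson correlation exceeds the threshold
def drop_correlated(frame, threshold=0.75):
    corr = frame.corr().abs()
    # keep only the upper triangle so each pair is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop)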

Since this is a classification task, another good thing to do is look at the distribution of values within each class. To get a quick summary of this information, we can use box-and-whisker plots.

# plot categorical distribution of values
plt.figure(figsize=(16, 9))
plt.subplot(131)
sns.boxplot(data=df, x="y", y="r", showmeans=True)
plt.title("Recency")
plt.subplot(132)
sns.boxplot(data=df, x="y", y="f", showmeans=True)
plt.title("Frequency")
plt.subplot(133)
sns.boxplot(data=df, x="y", y="t", showmeans=True)
plt.title("Time")
plt.show()

As noted before, we will be scaling our data to a common range; it is quite noticeable that the ranges in the three plots above differ substantially. Other than that, what stands out is how similar the two classes look, particularly for the time variable. In the Recency and Frequency variables we see a fair number of outliers that could hamper our logit model's ability to distinguish between the two classes. After generating a vanilla model we'll assess its performance and decide whether to drop those outlier observations; one common rule for doing so is sketched below.
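If we do end up dropping outliers, one common option (shown here only as a sketch; we keep all observations for the baseline model) is the 1.5 × IQR rule, often called Tukey's fences.

# sketch: keep rows whose recency and frequency fall inside the Tukey fences
def iqr_mask(series, k=1.5):
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.between(q1 - k * iqr, q3 + k * iqr)

inliers = df[iqr_mask(df["r"]) & iqr_mask(df["f"])]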

Last but certainly not least, we'll look at the descriptive statistics of our variables. This is typically also helpful at the very beginning of any EDA, as it surfaces anything suspicious about your data right away.

# output descriptive statistics
df.describe()

Nothing unusual here, as we’ve uncovered the majority of our information from our previous visualizations.

+-------+------------+------------+------------+------------+
| | r | f | t | y |
+-------+------------+------------+------------+------------+
| count | 748.000000 | 748.000000 | 748.000000 | 748.000000 |
| mean | 9.506684 | 5.514706 | 34.282086 | 0.237968 |
| std | 8.095396 | 5.839307 | 24.376714 | 0.426124 |
| min | 0.000000 | 1.000000 | 2.000000 | 0.000000 |
| 25% | 2.750000 | 2.000000 | 16.000000 | 0.000000 |
| 50% | 7.000000 | 4.000000 | 28.000000 | 0.000000 |
| 75% | 14.000000 | 7.000000 | 50.000000 | 0.000000 |
| max | 74.000000 | 50.000000 | 98.000000 | 1.000000 |
+-------+------------+------------+------------+------------+

Preprocessing

This step is short and just involves setting our data up for modeling by splitting it into training and testing sets and doing any additional scaling/manipulation. It's typically best to use a pipeline for the scaling/manipulation, as it reduces headaches down the line and provides a simple interface for modeling.

# create our data pipeline
imbalanced_pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="liblinear", random_state=SEED)
)
skfold = StratifiedKFold(shuffle=True, random_state=SEED)

Our pipeline is quite simple, but given a different dataset with different characteristics we'd have to use something else. One thing to note: we are doing nothing to account for class imbalance in this vanilla pipeline; based on our assessment, we'll see whether we need to.

# separate our data into training and testing sets
X = df[["r", "f", "t"]]
y = df["y"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=SEED, stratify=y)

We use the stratify keyword argument to make sure a proportionate amount of each class is separated into the training and testing sets. So roughly 70% of class 0 and of class 1 will end up in the training set, and 30% of each in the test set.
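A quick sanity check (not in the original post) confirms that stratification preserved the roughly 76%/24% class split in both sets.

# class proportions should be nearly identical in the training and testing sets
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))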

Model Validation

To validate the results of our training, we'll be using cross-validation with stratified k-folds to get a general sense of the performance of our model.

# perform CV on our data and output f1 results
f1_score = cross_val_score(imbalanced_pipeline, X_train, y_train, scoring="f1", cv=skfold, n_jobs=-1).mean()
"%.4f" % f1_score
---------------------------- Output --------------------------------
'0.2470'

Our current model performs horribly. This may be due to several reasons; it could even be that the logistic regression algorithm just isn't suited to this task. However, we'll take this baseline score and try to improve on it.

# fit our model and output precision recall curve
imbalanced_pipeline.fit(X_train, y_train)
fig, ax = plt.subplots(figsize=(16, 9))
plot_precision_recall_curve(imbalanced_pipeline, X_train, y_train, ax=ax)
plt.title("Precision-Recall Curve Baseline Logit")
plt.show()

The precision-recall curve for our classifier tells us that this model is pretty terrible, and it definitely isn't something that would go into production.

# output classification report
print(classification_report(y_train, imbalanced_pipeline.predict(X_train)))
---------------------------- Output --------------------------------
              precision    recall  f1-score   support

           0       0.79      0.98      0.88       399
           1       0.72      0.17      0.27       124

    accuracy                           0.79       523
   macro avg       0.76      0.57      0.58       523
weighted avg       0.78      0.79      0.73       523

It appears we are being severely hurt by the imbalance in our classes, so next we'll use synthetic minority oversampling to improve our model. Although this isn't hyperparameter tuning, it's just as important to tune your data as it is to tune your model: what you get out is only as good as what you put in.

# make balanced pipeline
balanced_pipeline = make_pipeline(
    StandardScaler(),
    SMOTE(random_state=SEED),  # completely balance the two classes
    LogisticRegression(solver="liblinear", random_state=SEED)
)

Our new pipeline integrates the imbalanced-learn library's Synthetic Minority Oversampling Technique (SMOTE) to upsample our minority class. It does this by generating synthetic data points from our minority-class observations, resulting in a training set with a balanced class distribution. Since we'll be training on balanced data, we'll also be able to use a ROC curve and its AUC to assess performance.
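To see what SMOTE actually does to the class counts, we can resample the training data once outside the pipeline; this standalone call is purely illustrative, since inside the pipeline the resampling happens automatically on each cross-validation training fold.

# inspect class counts before and after oversampling (illustration only)
from collections import Counter
X_res, y_res = SMOTE(random_state=SEED).fit_resample(X_train, y_train)
print("before:", Counter(y_train))  # imbalanced: 399 vs 124
print("after: ", Counter(y_res))    # both classes equal after SMOTE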

# perform CV on our data and output f1 results
f1_score = cross_val_score(balanced_pipeline, X_train, y_train, scoring="f1", cv=skfold, n_jobs=-1).mean()
"%.4f" % f1_score
---------------------------- Output --------------------------------
'0.5153'

We've already doubled our baseline f1 score, which is an excellent sign; we clearly needed that upsampling.

# fit our model and output ROC curve, now that our pipeline is balancing our dataset
balanced_pipeline.fit(X_train, y_train)
fig, ax = plt.subplots(figsize=(16, 9))
plot_roc_curve(balanced_pipeline, X_train, y_train, ax=ax)
plt.title("Precision-Recall Curve Balanced Logit")
plt.plot(np.linspace(0, 1), np.linspace(0, 1), "--")
plt.show()

Here we can see that our model performs alright. It isn't anything special, but it does have some ability to distinguish between the two classes.

# output classification report
print(classification_report(y_train, balanced_pipeline.predict(X_train)))
---------------------------- Output --------------------------------
              precision    recall  f1-score   support

           0       0.90      0.63      0.74       399
           1       0.39      0.77      0.52       124

    accuracy                           0.66       523
   macro avg       0.64      0.70      0.63       523
weighted avg       0.78      0.66      0.69       523

Likewise, our classification report shows that the recall score for our minority class has gone up dramatically, by about .60, though this improvement came at the cost of some of the other metrics.

Next, we will be looking into some more model validation, and finally, hyperparameter tuning to optimize our model some more.

Tuning

Now that we have our data pipeline working the way we want, we'll look into tuning our logit classifier by altering some of its parameters, called hyperparameter tuning in the biz. The easiest way to run through a finite set of possible parameters is to use GridSearchCV; if you have a large set of parameters to search through, it's better to use RandomizedSearchCV, which only builds the number of models you specify.

# create parameter grid and grid search
param_grid = {
    "logisticregression__penalty": ["l1", "l2"],
    "logisticregression__C": np.linspace(1e-10, 1),
    "logisticregression__fit_intercept": [True, False],
}
gs = GridSearchCV(balanced_pipeline, param_grid, scoring="f1", n_jobs=-1, cv=skfold)
gs.fit(X_train, y_train)
"%.4f" % gs.best_score_
---------------------------- Output --------------------------------
'0.5153'

It looks like our grid search yielded no improvement. It's also a good example of what not to do when you have a very large parameter space to search through; in that case, reach for RandomizedSearchCV, which randomly samples a parameter set for each iteration (a sketch follows below).
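For completeness, a randomized search over the same space might look like the sketch below; the number of iterations and the sampling distribution are illustrative choices, not tuned values.

# sketch: sample 20 random parameter combinations instead of trying them all
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform

param_distributions = {
    "logisticregression__penalty": ["l1", "l2"],
    "logisticregression__C": uniform(loc=1e-10, scale=1),  # C drawn uniformly from (0, 1]
    "logisticregression__fit_intercept": [True, False],
}
rs = RandomizedSearchCV(balanced_pipeline, param_distributions, n_iter=20,
                        scoring="f1", n_jobs=-1, cv=skfold, random_state=SEED)
rs.fit(X_train, y_train)
"%.4f" % rs.best_score_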

# perform CV on our data and output f1 results
f1_score = cross_val_score(gs.best_estimator_, X_train, y_train, scoring="f1", cv=skfold, n_jobs=-1).mean()
"%.4f" % f1_score
---------------------------- Output --------------------------------
'0.5153'

# fit our model and output ROC curve using the best estimator from the grid search
gs.best_estimator_.fit(X_train, y_train)
fig, ax = plt.subplots(figsize=(16, 9))
plot_roc_curve(gs.best_estimator_, X_train, y_train, ax=ax)
plt.title("ROC Curve Grid Search Balanced Logit")
plt.plot(np.linspace(0, 1), np.linspace(0, 1), "--")
plt.show()
# output classification report
print(classification_report(y_train, gs.best_estimator_.predict(X_train)))
---------------------------- Output --------------------------------
              precision    recall  f1-score   support

           0       0.90      0.63      0.74       399
           1       0.39      0.77      0.52       124

    accuracy                           0.66       523
   macro avg       0.64      0.70      0.63       523
weighted avg       0.78      0.66      0.69       523
# test data predictions
test_pred = gs.best_estimator_.predict(X_test)
# output classification report
print(classification_report(y_test, test_pred))
---------------------------- Output --------------------------------
              precision    recall  f1-score   support

           0       0.90      0.57      0.70       171
           1       0.37      0.80      0.50        54

    accuracy                           0.62       225
   macro avg       0.63      0.68      0.60       225
weighted avg       0.77      0.62      0.65       225
# plot confusion matrix
fig, ax = plt.subplots(figsize=(16, 9))
plot_confusion_matrix(gs.best_estimator_, X_test, y_test, ax=ax)
plt.show()

By the looks of it, we've optimized a model that performed very poorly from the beginning. However, the lesson still stands, and the process of validation and tuning is very much the same regardless of the model.

Conclusion

When working on any machine learning problem, your data is invaluable; sometimes you have a lot, and sometimes you have very little. Regardless of the volume, you must always ensure there is no data leakage, which at best leads to false confidence and at worst leaves you unaware that your models are terrible. Ensuring there is no data leakage is often simple: just split your data into two groups, one for training and another for testing. Do make sure, however, that there is no order dependence or any other dependence in your data.

After splitting your data, it's always good to use cross-validation; it helps you verify, using only your training data, that you're going in the right direction. You only need cross-validation on your training data, and you should never let your model touch your testing/hold-out data (don't even look at it yourself) until you are reasonably sure you've settled on a final model. The problem that usually arises is that developers use their testing data to check whether their model is good, and then optimize the model on that same testing data, which defeats the purpose. You want your model to generalize well, and that means it can't see your testing data at all.

Lastly, hyperparameter tuning: not everyone remembers exactly what each parameter does for every function call. That's why documentation exists; make sure to check the docs for whichever algorithm you're using and determine how to manipulate its hyperparameters. You can often identify good values for your hyperparameters during EDA, but if you can't, something like GridSearchCV or RandomizedSearchCV can help you search through large parameter spaces.

Originally published at https://edwardamor.xyz on August 1, 2020.
