Using a Keras model in combination with sklearn preprocessing, pipelines, grid search and cross-validation


In this post I want to give a short intro on how to wrap a (sequential) Keras deep learning model into a sklearn estimator, so that we can use all the nice standard sklearn tools like pipelines and grid search.

Sklearn pipelines will allow us to easily use sklearn’s go-to tools for imputation and scaling, which most deep learning models need. Via sklearn’s grid search we can tune some of the neural network’s model parameters (like the number of units in a layer or the dropout rate) as well as some fitting parameters (like the number of epochs or the batch size).

As toy data we will use the popular Boston house-prices dataset, and our toy problem will be the regression of house prices from the given features.

Update from 2022: The Boston housing prices dataset has an ethical problem, which is discussed in more detail here, and which unfortunately is not uncommon in data science. Since we are not peeking into the data itself, I keep this post as it is, but I strongly suggest not using this dataset any more.
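
If you want to follow along without the Boston data, a possible drop-in alternative is sklearn’s California housing dataset; note that it has 8 features instead of 13, so the input_dim of the Keras model further below would need to be adjusted accordingly. Purely as an illustration:

## Optional alternative dataset (illustration only, not used in the rest of this post)
from sklearn.datasets import fetch_california_housing
X_alt, y_alt = fetch_california_housing(return_X_y=True)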

## Built with Python 3.7.13 and the following packages:
## keras==2.3.1
## tensorflow==1.15.0
## scikit-learn==0.22
from sklearn.datasets import load_boston

X, y = load_boston(return_X_y=True)

We will make heavy use of many sklearn functions.

from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LinearRegression

from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error

import numpy as np

For our final evaluation we will split the data into train and test sets:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

We set around 5% of the training values to NaN to make some preprocessing (imputation of missing values in our case) necessary and more interesting:

np.random.seed(12)
X_train[np.random.rand(*X_train.shape) > .95] = np.nan
X_train
## array([[1.07930e-01, 0.00000e+00, 8.56000e+00, ..., 2.09000e+01,
##         3.93490e+02, 1.30000e+01],
##        [6.29760e-01, 0.00000e+00, 8.14000e+00, ..., 2.10000e+01,
##         3.96900e+02, 8.26000e+00],
##        [1.30751e+01, 0.00000e+00, 1.81000e+01, ..., 2.02000e+01,
##         3.96900e+02, 1.47600e+01],
##        ...,
##        [1.50100e-02, 8.00000e+01, 2.01000e+00, ..., 1.70000e+01,
##         3.90940e+02, 5.99000e+00],
##        [1.13290e-01, 3.00000e+01, 4.93000e+00, ..., 1.66000e+01,
##         3.91250e+02, 1.13800e+01],
##        [        nan, 2.50000e+01, 4.86000e+00, ..., 1.90000e+01,
##         3.90640e+02, 7.51000e+00]])
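
A quick sanity check: the fraction of values that are now missing should be close to the 5% we aimed for.

## Fraction of missing values in the training data (illustration only)
np.isnan(X_train).mean()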

Now let us create a linear baseline model pipeline. It will consist of a simple imputation step using SimpleImputer, scaling via StandardScaler, and finally a LinearRegression. We use Pipeline, which allows us to name the individual steps:

pipeline_baseline = Pipeline([
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler()),
    ('model', LinearRegression())
])

We can now do a grid search over some of the parameters, for example the imputation strategy of the SimpleImputer (mean or median). Parameters of the individual steps are set using the common sklearn double-underscore syntax, i.e. imputer__strategy sets the strategy parameter of the imputer step. Moreover, we can grid search not only over parameters, but also over entire transformers within the pipeline, for example the scaling: no scaling (i.e. passthrough), standard scaling (centering and unit variance) or min-max scaling (scaling to the 0-to-1 range).

param_grid_baseline = {
    'imputer__strategy': ['mean', 'median'],
    'scaler': ['passthrough', StandardScaler(), MinMaxScaler()]
}
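
If you are unsure which parameter names are available for such a grid, the pipeline’s get_params() lists them all in exactly this double-underscore notation; purely as an illustration:

## List all tunable parameter names of the pipeline (illustration only)
sorted(pipeline_baseline.get_params().keys())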

gridsearch_baseline = GridSearchCV(
    pipeline_baseline,
    param_grid=param_grid_baseline,
    scoring='neg_mean_squared_error',
    cv=2,
    refit=True
)

gridsearch_baseline.fit(X_train, y_train);

Let us check the best parameters found in the grid search, and evaluate the refit model on the test set using mean squared error and mean absolute error.

y_pred_baseline = gridsearch_baseline.predict(X_test)

print(f'Best params: {gridsearch_baseline.best_params_}')
## Best params: {'imputer__strategy': 'median', 'scaler': StandardScaler(copy=True, with_mean=True, with_std=True)}
print(f'Best score: {gridsearch_baseline.best_score_}')
## Best score: -28.31580957064025
print(f'Test MSE: {mean_squared_error(y_true=y_test, y_pred=y_pred_baseline)}')
## Test MSE: 22.222365529533167
print(f'Test MAE: {mean_absolute_error(y_true=y_test, y_pred=y_pred_baseline)}')
## Test MAE: 3.273252059145886
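
Beyond the best score, the full cross-validation results are available in cv_results_; if you have pandas installed, a convenient way to inspect them (just as an illustration) is:

import pandas as pd

## One row per parameter combination, with mean CV score and rank (illustration only)
results_baseline = pd.DataFrame(gridsearch_baseline.cv_results_)
print(results_baseline[['params', 'mean_test_score', 'rank_test_score']])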

Ideally we would now simply replace our LinearRegression estimator with a KerasRegressor. Luckily that is exactly what Keras’ sklearn wrappers will allow us to do: we just need to create a factory function for Keras models, which can then be wrapped using keras.wrappers.scikit_learn.KerasRegressor.

Given that we have very little training data, we will create a very small deep learning model. Our sequential model will have two hidden layers with a default of 64 units each. To add some bells and whistles and to mitigate overfitting, we will add L2 regularization to some of the network weights as well as a dropout layer. These parameters can be tuned via grid search later on. The last layer will have one unit and a linear activation to do scalar regression. As loss function we will again use the mean squared error.

from keras.models import Sequential
## Using TensorFlow backend.
from keras.layers import Dense, Dropout
from keras.regularizers import l2

def create_keras_model(layer1_units=64, layer2_units=64, dropout_rate=0, l2_regularization=0):
    model = Sequential()
    model.add(Dense(units=layer1_units, activation='relu', input_dim=13))
    model.add(Dense(units=layer2_units, activation='relu', kernel_regularizer=l2(l2_regularization)))
    model.add(Dropout(rate=dropout_rate))
    model.add(Dense(units=1, activation='linear'))
    model.compile(optimizer='adam', loss='mse')
    return model
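
To check that the factory produces what we expect, we can build one model and print its summary (illustration only; KerasRegressor will call the factory itself during fitting):

## Build one model from the factory and show its layers and parameter counts
create_keras_model().summary()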

We can now use KerasRegressor to wrap our factory and enable sklearn API compatibility to use Pipeline and GridSearchCV. KerasRegressor simply receives our factory as a first argument and the model parameters (like number of units or dropout rate) and fitting parameters (like epochs) as further arguments.

from keras.wrappers.scikit_learn import KerasRegressor

regressor_keras = KerasRegressor(
    create_keras_model, 
    batch_size=32, 
    layer1_units=64,
    layer2_units=64,
    dropout_rate=0,
    l2_regularization=0,
    epochs=10, 
    verbose=False
)
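
The wrapper now behaves like any sklearn estimator; for example, get_params() exposes exactly the parameters we can later address in the grid search (illustration only):

## Show the parameters exposed through the sklearn API (illustration only)
regressor_keras.get_params()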

And there we are. We can now construct a new pipeline with our Keras model and set up our grid search:

pipeline_keras = Pipeline([
    ('imputer', SimpleImputer()), 
    ('scaler', StandardScaler()), 
    ('model', regressor_keras)
])

param_grid_keras = {
    'imputer__strategy': ['mean', 'median'],
    'scaler': [MinMaxScaler(), StandardScaler()],
    'model__layer1_units': [32, 64],
    'model__dropout_rate': [0, 0.2, 0.5],
    'model__epochs': [40, 60]
}
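
Note that this grid already contains 2 x 2 x 2 x 3 x 2 = 48 parameter combinations; with 2-fold cross-validation this means 96 Keras models are trained (plus one final refit on the full training data), so even this small search takes a noticeable amount of time.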

gridsearch_keras = GridSearchCV(
    pipeline_keras,
    param_grid=param_grid_keras,
    scoring='neg_mean_squared_error',
    cv=2, 
    refit=True,
    verbose=True
)
gridsearch_keras.fit(X_train, y_train)
y_pred_keras = gridsearch_keras.predict(X_test)

print(f'Best params: {gridsearch_keras.best_params_}')
## Best params: {'imputer__strategy': 'mean', 'model__dropout_rate': 0, 'model__epochs': 60, 'model__layer1_units': 64, 'scaler': StandardScaler(copy=True, with_mean=True, with_std=True)}
print(f'Best score: {gridsearch_keras.best_score_}')
## Best score: -23.78173567908125
print(f'Test MSE: {mean_squared_error(y_true=y_test, y_pred=y_pred_keras)}')
## Test MSE: 14.671843620506703
print(f'Test MAE: {mean_absolute_error(y_true=y_test, y_pred=y_pred_keras)}')
## Test MAE: 2.620115729204313
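
Since we set refit=True, the best pipeline is available as best_estimator_. If you want to inspect the underlying fitted Keras network, one way to do so (the wrapper stores it in its .model attribute) is:

## Pull the fitted Keras model out of the best pipeline (illustration only)
best_model = gridsearch_keras.best_estimator_.named_steps['model'].model
best_model.summary()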