Writing your own sklearn transformer: DataFrames, feature scaling and ColumnTransformer (writing your own sklearn functions, part 2)


Since scikit-learn added DataFrame support to its API a while ago, it has become much easier to modify existing transformers and to write your own.

Many of sklearn's built-in transformers still work with numpy arrays internally or return arrays, which often makes a lot of sense for performance. Performance can be especially important in pipelines, where a single slow transformer quickly becomes a bottleneck. This is especially bad when predictions are time-critical, for example when serving the model as an endpoint for live predictions. If performance is not critical, or has been carefully evaluated, many of the transformers can be adjusted to work with and return DataFrames, which have some advantages: they are a natural part of a data science workflow, they can contain different data types, and they store column names.

One example is feature standardization, which can be critical if you use linear models or neural networks:

## Built with Python 3.7.13 and the following packages:
## pandas==0.25.3
## scikit-learn==0.23.2
import pandas as pd
import numpy as np

data = pd.DataFrame({
  'num1': [1.0, 2.0, 10.0, 1.0, 3.0, 0.0],
  'num2': [2.0, 3.0, 20.0, -3.0, 5.0, 0.5],
})
data
##    num1  num2
## 0   1.0   2.0
## 1   2.0   3.0
## 2  10.0  20.0
## 3   1.0  -3.0
## 4   3.0   5.0
## 5   0.0   0.5

The StandardScaler can “standardize features by removing the mean and scaling to unit variance”, according to the docs. During fit it learns the mean and standard deviation of each column of the training data, which are then used to normalize the features during transform. By default the transformer coerces the transformed data to a numpy array, and therefore the column names are dropped:

from sklearn.preprocessing import StandardScaler

standard_scaler = StandardScaler()
standard_scaler.fit(data);
standard_scaler.transform(data)
## array([[-0.54931379, -0.35307151],
##        [-0.24968808, -0.21639867],
##        [ 2.14731753,  2.10703964],
##        [-0.54931379, -1.03643571],
##        [ 0.04993762,  0.05694702],
##        [-0.84893949, -0.55808077]])
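
For reference, the learned parameters are stored as attributes on the fitted scaler, so we can inspect what fit picked up:

## means and standard deviations learned from the training data
standard_scaler.mean_
standard_scaler.scale_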

We can easily change the transformer to return DataFrames, either by inheriting from the existing transformer or by encapsulating it:

from sklearn.preprocessing import StandardScaler

class AnotherStandardScaler(StandardScaler):
    def fit(self, X, y=None, **kwargs):
        self.feature_names_ = X.columns
        return super().fit(X, y, **kwargs)
    
    def transform(self, X, **kwargs):
        ## keep the original index alongside the stored column names
        return pd.DataFrame(data=super().transform(X, **kwargs),
                            columns=self.feature_names_,
                            index=X.index)
another_standard_scaler = AnotherStandardScaler()
another_standard_scaler.fit_transform(data)
##        num1      num2
## 0 -0.549314 -0.353072
## 1 -0.249688 -0.216399
## 2  2.147318  2.107040
## 3 -0.549314 -1.036436
## 4  0.049938  0.056947
## 5 -0.848939 -0.558081
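
The usual fit/transform split still works: once fitted, the scaler applies the stored training statistics to unseen rows and keeps the column names. A quick sketch, with a made-up new_data frame:

new_data = pd.DataFrame({'num1': [4.0], 'num2': [1.0]})
## the new row is scaled with the mean and std learned from data above
another_standard_scaler.transform(new_data)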

We can modify it further to accept a cols argument to only target specific columns:

from sklearn.preprocessing import StandardScaler

class ColumnStandardScaler(StandardScaler):
    def __init__(self, cols):
        self.cols = cols
        super().__init__()

    def fit(self, X, y=None, **kwargs):
        self.feature_names_ = X.columns
        return super().fit(X[self.cols], y, **kwargs)
    
    def transform(self, X, **kwargs):
        ## align the scaled part with the original index so concat cannot misalign rows
        x_scaled = pd.DataFrame(data=super().transform(X[self.cols], **kwargs),
                                columns=self.cols,
                                index=X.index)
        x_unscaled = X[[col for col in X.columns if col not in self.cols]]
        return pd.concat([x_scaled, x_unscaled], axis=1)
column_standard_scaler = ColumnStandardScaler(cols=['num1'])
column_standard_scaler.fit_transform(data)
##        num1  num2
## 0 -0.549314   2.0
## 1 -0.249688   3.0
## 2  2.147318  20.0
## 3 -0.549314  -3.0
## 4  0.049938   5.0
## 5 -0.848939   0.5
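
One caveat: transform concatenates the scaled columns first, so the original column order is only preserved if cols happens to come first. Scaling num2 instead shows the reordering:

column_standard_scaler = ColumnStandardScaler(cols=['num2'])
column_standard_scaler.fit_transform(data)
##        num2  num1
## 0 -0.353072   1.0
## 1 -0.216399   2.0
## 2  2.107040  10.0
## 3 -1.036436   1.0
## 4  0.056947   3.0
## 5 -0.558081   0.0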

The encapsulated version looks as follows. We still inherit from BaseEstimator and TransformerMixin: BaseEstimator gives us get_params and set_params for free, and TransformerMixin provides fit_transform. This version also restores the original column order by reindexing with the feature names remembered during fit. Excuse the stupid name AnotherColumnStandardScaler:

from sklearn.preprocessing import StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin
from typing import List

class AnotherColumnStandardScaler(BaseEstimator, TransformerMixin):
    def __init__(self, cols: List[str]):
        self.cols = cols
        self.standard_scaler = StandardScaler()
        
    def fit(self, X, y=None, **kwargs):
        self.feature_names_ = X.columns
        self.standard_scaler.fit(X[self.cols])
        return self
        
    def transform(self, X, **kwargs):
        x_scaled = pd.DataFrame(data=self.standard_scaler.transform(X[self.cols]),
                                columns=self.cols,
                                index=X.index)
        x_unscaled = X[[col for col in X.columns if col not in self.cols]]
        ## reindex with the stored feature names to restore the original column order
        return pd.concat([x_scaled, x_unscaled], axis=1)[self.feature_names_]
another_column_standard_scaler = AnotherColumnStandardScaler(cols=['num1'])
another_column_standard_scaler.fit_transform(data)
##        num1  num2
## 0 -0.549314   2.0
## 1 -0.249688   3.0
## 2  2.147318  20.0
## 3 -0.549314  -3.0
## 4  0.049938   5.0
## 5 -0.848939   0.5
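
Inheriting from BaseEstimator pays off right away: parameter introspection works out of the box, which is what sklearn's cloning and grid-search machinery relies on:

another_column_standard_scaler.get_params()
## {'cols': ['num1']}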

If we want to develop our own transformer instead of modifying or encapsulating the existing one, we can create it as follows:

from sklearn.base import BaseEstimator, TransformerMixin
from typing import List

class CustomStandardScaler(BaseEstimator, TransformerMixin):
    def __init__(self, cols: List[str]):
        self.cols = cols

    def fit(self, X: pd.DataFrame, y: pd.Series = None):
        ## np.std defaults to ddof=0 (population std), matching StandardScaler
        self.means_ = {col: np.mean(X[col]) for col in self.cols}
        self.stds_ = {col: np.std(X[col]) for col in self.cols}
        return self

    def transform(self, X: pd.DataFrame):
        X = X.copy()
        for col in self.cols:
            X[col] = (X[col] - self.means_[col]) / self.stds_[col]
        return X
custom_standard_scaler = CustomStandardScaler(cols=['num1'])
custom_standard_scaler.fit_transform(data)
##        num1  num2
## 0 -0.549314   2.0
## 1 -0.249688   3.0
## 2  2.147318  20.0
## 3 -0.549314  -3.0
## 4  0.049938   5.0
## 5 -0.848939   0.5

The result is the same as for the normal sklearn scaler (exact float equality happens to hold here; in general np.allclose is the safer comparison):

custom_standard_scaler.transform(data)['num1'].values == standard_scaler.transform(data)[:,0]
## array([ True,  True,  True,  True,  True,  True])
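
Since the custom scaler behaves like any other sklearn estimator, it drops straight into a Pipeline. A minimal sketch, with a made-up target y purely for illustration:

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

## hypothetical target, just to show the transformer inside a pipeline
y = pd.Series([1.0, 2.0, 30.0, -1.0, 4.0, 0.0])
pipeline = Pipeline([
    ('scale', CustomStandardScaler(cols=['num1'])),
    ('model', LinearRegression()),
])
pipeline.fit(data, y);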

Instead of writing our own transformer we could also use sklearn's ColumnTransformer to apply different transformers to different columns (and keep the others by passing remainder='passthrough'). But again, this one will return arrays and therefore drop the column names:

from sklearn.compose import ColumnTransformer

column_transformer = ColumnTransformer(
    transformers=[
        ('scaler', AnotherStandardScaler(), ['num1']),
    ],
    remainder='passthrough')

column_transformer.fit_transform(data)
## array([[-0.54931379,  2.        ],
##        [-0.24968808,  3.        ],
##        [ 2.14731753, 20.        ],
##        [-0.54931379, -3.        ],
##        [ 0.04993762,  5.        ],
##        [-0.84893949,  0.5       ]])
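
If you want the column names back, one simple (if manual) fix is to re-wrap the result. ColumnTransformer outputs the transformed columns first and the passthrough remainder last, so the column order is hard-coded here to match:

pd.DataFrame(column_transformer.fit_transform(data),
             columns=['num1', 'num2'],
             index=data.index)
##        num1  num2
## 0 -0.549314   2.0
## 1 -0.249688   3.0
## 2  2.147318  20.0
## 3 -0.549314  -3.0
## 4  0.049938   5.0
## 5 -0.848939   0.5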