Since scikit-learn added DataFrame support to its API a while ago, modifying existing transformers and writing your own has become a lot easier.
Many of sklearn's built-in transformers still work with numpy arrays internally or return arrays, which often makes a lot of sense in terms of performance. Performance is especially important in pipelines, where a single slow transformer can quickly become a bottleneck. This is especially bad when predictions are time-critical, for example when serving the model as an endpoint for live predictions. If performance is not critical, or has been carefully evaluated, many of the transformers can be adjusted to work with and return DataFrames instead, which has some advantages: DataFrames are a very natural part of a data science workflow, they can contain different data types and they store column names.
One example is feature standardization, which can be critical if you use linear models or neural networks:
## Built with Python 3.7.13 and the following packages:
## pandas==0.25.3
## scikit-learn==0.23.2
import pandas as pd
import numpy as np
data = pd.DataFrame({
    'num1': [1.0, 2.0, 10.0, 1.0, 3.0, 0.0],
    'num2': [2.0, 3.0, 20.0, -3.0, 5.0, 0.5],
})
data
##    num1  num2
## 0   1.0   2.0
## 1   2.0   3.0
## 2  10.0  20.0
## 3   1.0  -3.0
## 4   3.0   5.0
## 5   0.0   0.5
The StandardScaler can “standardize features by removing the mean and scaling to unit variance” according to the docs. During the fit it learns the mean and standard deviation of the training data, which are then used to standardize the features during the transform. By default the transformer coerces the transformed data to a numpy array, and therefore the column names are dropped:
from sklearn.preprocessing import StandardScaler
standard_scaler = StandardScaler()
standard_scaler.fit(data);
standard_scaler.transform(data)
## array([[-0.54931379, -0.35307151],
##        [-0.24968808, -0.21639867],
##        [ 2.14731753,  2.10703964],
##        [-0.54931379, -1.03643571],
##        [ 0.04993762,  0.05694702],
##        [-0.84893949, -0.55808077]])
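The statistics learned during the fit are stored on the fitted scaler in the mean_ and scale_ attributes, so we can sanity-check what the transform will do:
standard_scaler.mean_
## array([2.83333333, 4.58333333])
standard_scaler.scale_
## array([3.3374974, 7.3167426])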
We can easily change the transformer to return DataFrames, either by inheriting from the existing transformer or by encapsulating it:
from sklearn.preprocessing import StandardScaler
class AnotherStandardScaler(StandardScaler):
    def fit(self, X, y=None, **kwargs):
        # remember the column names seen during fit
        self.feature_names_ = X.columns
        return super().fit(X, y, **kwargs)

    def transform(self, X, **kwargs):
        # wrap the returned array in a DataFrame again, keeping the
        # original index so later joins stay aligned
        return pd.DataFrame(data=super().transform(X, **kwargs),
                            columns=self.feature_names_,
                            index=X.index)
another_standard_scaler = AnotherStandardScaler()
another_standard_scaler.fit_transform(data)
##        num1      num2
## 0 -0.549314 -0.353072
## 1 -0.249688 -0.216399
## 2  2.147318  2.107040
## 3 -0.549314 -1.036436
## 4  0.049938  0.056947
## 5 -0.848939 -0.558081
We can modify it further to accept a cols argument to only target specific columns:
from sklearn.preprocessing import StandardScaler
class ColumnStandardScaler(StandardScaler):
    def __init__(self, cols):
        self.cols = cols
        super().__init__()

    def fit(self, X, y=None, **kwargs):
        self.feature_names_ = X.columns
        # fit the scaler on the selected columns only
        return super().fit(X[self.cols], y, **kwargs)

    def transform(self, X, **kwargs):
        # scale the selected columns; keep the original index so the
        # concat with the unscaled columns stays aligned
        x_scaled = pd.DataFrame(data=super().transform(X[self.cols], **kwargs),
                                columns=self.cols,
                                index=X.index)
        x_unscaled = X[[col for col in X.columns if col not in self.cols]]
        return pd.concat([x_scaled, x_unscaled], axis=1)
column_standard_scaler = ColumnStandardScaler(cols=['num1'])
column_standard_scaler.fit_transform(data)
##        num1  num2
## 0 -0.549314   2.0
## 1 -0.249688   3.0
## 2  2.147318  20.0
## 3 -0.549314  -3.0
## 4  0.049938   5.0
## 5 -0.848939   0.5
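One caveat: pd.concat places the scaled columns first, so the column order changes as soon as we scale a column that is not already leading. Scaling num2 instead of num1 makes this visible:
ColumnStandardScaler(cols=['num2']).fit_transform(data)
##        num2  num1
## 0 -0.353072   1.0
## 1 -0.216399   2.0
## 2  2.107040  10.0
## 3 -1.036436   1.0
## 4  0.056947   3.0
## 5 -0.558081   0.0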
The encapsulated version looks as follows. We still inherit from BaseEstimator and TransformerMixin, as BaseEstimator gives us get_params and set_params for free and TransformerMixin provides fit_transform. This version also restores the original column order via the stored feature_names_. Excuse the stupid name AnotherColumnStandardScaler:
from sklearn.preprocessing import StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin
from typing import List
class AnotherColumnStandardScaler(BaseEstimator, TransformerMixin):
    def __init__(self, cols: List[str]):
        self.cols = cols
        self.standard_scaler = StandardScaler()

    def fit(self, X, y=None, **kwargs):
        self.feature_names_ = X.columns
        # fit the wrapped scaler on the selected columns only
        self.standard_scaler.fit(X[self.cols])
        return self

    def transform(self, X, **kwargs):
        x_scaled = pd.DataFrame(data=self.standard_scaler.transform(X[self.cols]),
                                columns=self.cols,
                                index=X.index)
        x_unscaled = X[[col for col in X.columns if col not in self.cols]]
        # select feature_names_ to restore the original column order
        return pd.concat([x_scaled, x_unscaled], axis=1)[self.feature_names_]
another_column_standard_scaler = AnotherColumnStandardScaler(cols=['num1'])
another_column_standard_scaler.fit_transform(data)
##        num1  num2
## 0 -0.549314   2.0
## 1 -0.249688   3.0
## 2  2.147318  20.0
## 3 -0.549314  -3.0
## 4  0.049938   5.0
## 5 -0.848939   0.5
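Thanks to BaseEstimator the constructor arguments are discoverable without any extra code, which is what lets tools like GridSearchCV tune the transformer. A quick check:
another_column_standard_scaler.get_params()
## {'cols': ['num1']}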
If we want to develop our own transformer instead of modifying or encapsulating the existing one, we can create it as follows:
from sklearn.base import BaseEstimator, TransformerMixin
from typing import List
class CustomStandardScaler(BaseEstimator, TransformerMixin):
    def __init__(self, cols: List[str]):
        self.cols = cols

    def fit(self, X: pd.DataFrame, y: pd.Series = None):
        # np.std defaults to ddof=0, the same biased standard deviation
        # that sklearn's StandardScaler uses
        self.means_ = {col: np.mean(X[col]) for col in self.cols}
        self.stds_ = {col: np.std(X[col]) for col in self.cols}
        return self

    def transform(self, X: pd.DataFrame):
        X = X.copy()
        for col in self.cols:
            X[col] = (X[col] - self.means_[col]) / self.stds_[col]
        return X
custom_standard_scaler = CustomStandardScaler(cols=['num1'])
custom_standard_scaler.fit_transform(data)
##        num1  num2
## 0 -0.549314   2.0
## 1 -0.249688   3.0
## 2  2.147318  20.0
## 3 -0.549314  -3.0
## 4  0.049938   5.0
## 5 -0.848939   0.5
The result is the same as for the normal sklearn scaler:
custom_standard_scaler.transform(data)['num1'].values == standard_scaler.transform(data)[:, 0]
## array([ True, True, True, True, True, True])
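Because these transformers consume and return DataFrames, they also chain naturally inside a regular Pipeline without losing the column names along the way. A minimal sketch reusing the custom scaler from above:
from sklearn.pipeline import Pipeline
pipeline = Pipeline(steps=[
    ('scale_num1', CustomStandardScaler(cols=['num1'])),
    ('scale_num2', CustomStandardScaler(cols=['num2'])),
])
pipeline.fit_transform(data)
##        num1      num2
## 0 -0.549314 -0.353072
## 1 -0.249688 -0.216399
## 2  2.147318  2.107040
## 3 -0.549314 -1.036436
## 4  0.049938  0.056947
## 5 -0.848939 -0.558081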
Instead of writing our own transformer we could also use sklearn's ColumnTransformer to apply different transformers to different columns (and keep the remaining ones by passing remainder='passthrough'). But again, this one will return arrays and therefore drop the column names:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
column_transformer = ColumnTransformer(
    transformers=[
        ('scaler', StandardScaler(), ['num1']),
    ],
    remainder='passthrough')
column_transformer.fit_transform(data)
## array([[-0.54931379,  2.        ],
##        [-0.24968808,  3.        ],
##        [ 2.14731753, 20.        ],
##        [-0.54931379, -3.        ],
##        [ 0.04993762,  5.        ],
##        [-0.84893949,  0.5       ]])
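If we want to stick with ColumnTransformer but keep the names, one workaround is to rebuild the DataFrame ourselves; the transformed columns come first in the output and the passthrough columns follow, so we have to reconstruct the order accordingly. (Newer scikit-learn releases, 1.2 and up, solve this with set_output(transform='pandas'), which is not available in the version pinned above.)
cols = ['num1'] + [col for col in data.columns if col != 'num1']
pd.DataFrame(column_transformer.fit_transform(data), columns=cols, index=data.index)
##        num1  num2
## 0 -0.549314   2.0
## 1 -0.249688   3.0
## 2  2.147318  20.0
## 3 -0.549314  -3.0
## 4  0.049938   5.0
## 5 -0.848939   0.5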