Estimating travel times using geospatial feature engineering and tree based models


In the following post I want to give some impressions on how historic travel time data can be used to build a model for travel time estimations. If you are collecting your own characteristic travel data, you might actually be able to beat third-party solutions in terms of accuracy as well as cost.

We will see why tree models might be suitable for this use case and we will do some feature engineering to improve the model performance.

Using tree based models

Let us look at some homemade toy data first to see how a tree based model might be able to learn travel times. We assume a simple 2-dimensional space of size 10 by 10, where travelling from one coordinate to another happens as the crow flies, so the distance in terms of time can be computed using the Pythagorean theorem (Euclidean distance).

We store all possible combinations (x1, y1) to (x2, y2) and compute their distances:

## Built with Python 3.6.15 and the following packages:
## scikit-learn==0.23.2
## pandas==0.25.3
## seaborn==0.8.1
## numpy==1.19.0
## matplotlib==1.5.3
## pyproj==2.6.1
## lightgbm==2.3.1
## xgboost==0.90
import pandas as pd
from itertools import product
import numpy as np


prod = product(range(0, 10), range(0, 10))
xy = pd.DataFrame([p for p in prod], columns=['x', 'y'])

# cross join
xy_copy = xy.copy()
xy['_key'] = 1
xy_copy['_key'] = 1
data = pd.merge(xy, xy_copy, on='_key', suffixes = ('1', '2')).drop('_key', axis=1)

data.head(10)
##    x1  y1  x2  y2
## 0   0   0   0   0
## 1   0   0   0   1
## 2   0   0   0   2
## 3   0   0   0   3
## 4   0   0   0   4
## 5   0   0   0   5
## 6   0   0   0   6
## 7   0   0   0   7
## 8   0   0   0   8
## 9   0   0   0   9
def calculate_euclidean_distance(x1, y1, x2, y2):
    return np.sqrt(np.power(x2-x1, 2) + np.power(y2-y1, 2))
    
data['t'] = calculate_euclidean_distance(data['x1'], data['y1'], data['x2'], data['y2'])

To be able to visualize the travel time let us look at all trips from (0, 0) to all other locations. We first extract the relevant rows and then create a pivot table to be able to plot the data as a heatmap:

data_0_0 = data.loc[(data['x1'] == 0) & (data['y1'] == 0),:].copy()
import matplotlib
matplotlib.use('TkAgg')

import seaborn as sns
import matplotlib.pyplot as plt

data_0_0_piv = data_0_0.pivot('y2','x2','t')

ax = sns.heatmap(data_0_0_piv, 
                 cmap='Blues',
                 robust=True)
ax = ax.invert_yaxis()

So if we want to travel from (0, 0) to (3, 3) it will cost around 4.2 (sqrt(3^2+3^2)), and so on. The pattern is essentially a quarter-circle around the origin.
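We can quickly confirm this value on the toy data itself (a small sanity check, nothing more):

data.loc[(data['x1'] == 0) & (data['y1'] == 0) &
         (data['x2'] == 3) & (data['y2'] == 3), 't']
# -> sqrt(18) ~ 4.243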

Now, as you know, decision trees are pretty much made to learn patterns like the one above, at least that is how they look at data: they partition the feature space (here x1, y1, x2 and y2) into rectangles (leaves). Let us train a decision tree on the data and see how well it picks up the pattern. For a resilient test error we would need to split our dataset into train and test in a spatial manner, but for the visualization we will skip this step:

from sklearn.tree import DecisionTreeRegressor
dtr = DecisionTreeRegressor(max_depth=7, min_samples_leaf=5, random_state=30)

x, y = data[['x1', 'y1', 'x2', 'y2']], data['t']
_ = dtr.fit(x, y)

data_0_0['t_pred'] = dtr.predict(data_0_0[['x1', 'y1', 'x2', 'y2']])
data_0_0_pred = data_0_0.pivot('y2','x2','t_pred')

ax = sns.heatmap(data_0_0_pred, 
                 cmap='Blues',
                 robust=True)
ax = ax.invert_yaxis()
plt.show()
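As a quick side check (a small sketch using the fitted dtr from above), we can map every training sample to its leaf via the tree's apply method and look at the smallest leaf size, which should respect min_samples_leaf:

# count how many samples end up in each leaf of the fitted tree
leaf_sizes = pd.Series(dtr.apply(x)).value_counts()
leaf_sizes.min()
# should be >= 5 (min_samples_leaf)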

You can see that all areas have at least size 5 (min_samples_leaf). Let us generate a few more trees with different parameters to get a better picture:

import itertools

params = {
    'max_depth': [1,5,10,20],
    'min_samples_leaf': [1,5,10,20],
}

keys = params.keys()
values = (params[key] for key in keys)

params_list = [dict(zip(keys, p)) for p in itertools.product(*values)]

params_list now looks like this:

## [{'max_depth': 1, 'min_samples_leaf': 1}, {'max_depth': 1, 'min_samples_leaf': 5}, {'max_depth': 1, 'min_samples_leaf': 10}, {'max_depth': 1, 'min_samples_leaf': 20}, {'max_depth': 5, 'min_samples_leaf': 1}, {'max_depth': 5, 'min_samples_leaf': 5}, {'max_depth': 5, 'min_samples_leaf': 10}, {'max_depth': 5, 'min_samples_leaf': 20}, {'max_depth': 10, 'min_samples_leaf': 1}, {'max_depth': 10, 'min_samples_leaf': 5}, {'max_depth': 10, 'min_samples_leaf': 10}, {'max_depth': 10, 'min_samples_leaf': 20}, {'max_depth': 20, 'min_samples_leaf': 1}, {'max_depth': 20, 'min_samples_leaf': 5}, {'max_depth': 20, 'min_samples_leaf': 10}, {'max_depth': 20, 'min_samples_leaf': 20}]
data_list = []

for params in params_list:
    dtr = DecisionTreeRegressor(**params, random_state=1234)
    dtr = dtr.fit(x, y)
    
    data_0_0['t_pred'] = dtr.predict(data_0_0[['x1', 'y1', 'x2', 'y2']])
    
    data_0_0_p = data_0_0.pivot('y2','x2','t_pred')
    
    for key, value in params.items():
        data_0_0_p[key] = value
    
    data_list.append(data_0_0_p)

data = pd.concat(data_list)

We can plot all heatmaps using FacetGrid and a helper function:

def heatmap_view(**kwargs):
    kwargs['data'] = kwargs['data'].iloc[0:10,0:10]
    ax = sns.heatmap(
        **kwargs, 
        cmap='Blues', 
        vmin=0, 
        vmax=13, 
        square=True, 
        cbar_kws={"shrink": .5}
    )
    return ax.invert_yaxis()

g = sns.FacetGrid(data, row='max_depth', col='min_samples_leaf')
g = g.map_dataframe(heatmap_view)

Geospatial features: haversine distance and azimuth

Two plausible geospatial features are the haversine distance (the shortest distance between two points on a sphere, here assuming the Earth's radius) and the azimuth (the direction from the first point to the second, measured as an angle clockwise from north).

We can theoretically compute both metrics using pyproj, but we will see that plain numpy implementations perform better for calls on arrays.

Let us first compute (forward) azimuth, back azimuth and distance between Berlin and a national park in the middle of Germany using pyproj.Geod:

from pyproj import Geod

geodesic = Geod(ellps='clrk66')
azimuth, backward_azimuth, haversine_distance = geodesic.inv(10.5, 51.1, 13.37, 52.5)
azimuth, haversine_distance
## (50.68448458832755, 251888.26670165634)

Azimuth and haversine distance


The numpy equivalents can be defined as follows, starting with the haversine distance:

def calculate_haversine_distance(lat1, lon1, lat2, lon2):
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))
    m = 6367 * 1000 * c
    return m

calculate_haversine_distance(51.1, 10.5, 52.5, 13.37) # = 251.xxx
## 251169.9510116781

The following function can be used to compute the azimuth (also called compass bearing) according to this gist:

def calculate_azimuth(lat1, lon1, lat2, lon2):
    lat1_rad = np.radians(lat1)
    lat2_rad = np.radians(lat2)

    lon_delta = np.radians(lon2 - lon1)
    x = np.sin(lon_delta) * np.cos(lat2_rad)
    y = np.cos(lat1_rad) * np.sin(lat2_rad) - (np.sin(lat1_rad)
            * np.cos(lat2_rad) * np.cos(lon_delta))

    initial_bearing_rad = np.arctan2(x, y)
    initial_bearing = np.degrees(initial_bearing_rad)
    compass_bearing = (initial_bearing + 360) % 360
    return compass_bearing
    
calculate_azimuth(51.1, 10.5, 52.5, 13.37) # = 50.6x
## 50.61201583089286

Let us now compare the pyproj and numpy variants for arrays of length 1, 10, 100 and 1000, using timeit. For a fairer comparison we will place the constructor call to Geod in the global scope:

import timeit
from pyproj import Geod

geodesic = Geod(ellps='clrk66')

def calculate_azimuth_haversine_distance_np(lat1, lon1, lat2, lon2):
    return calculate_azimuth(lat1, lon1, lat2, lon2), calculate_haversine_distance(lat1, lon1, lat2, lon2)

def calculate_azimuth_haversine_distance_pyproj(lat1, lon1, lat2, lon2):
    # note: pyproj's Geod.inv expects lon/lat order
    azimuth, _, haversine_distance = geodesic.inv(lon1, lat1, lon2, lat2)
    return azimuth, haversine_distance
    
def time_func(func, n):
    lat1, lon1 = np.repeat(51.1, n), np.repeat(10.5, n)
    lat2, lon2 = np.repeat(52.5, n), np.repeat(13.37, n)
    number = 100
    # timeit evaluates in s, we return ms
    return timeit.timeit(
        stmt=f'calculate_azimuth_haversine_distance_{func}(lat1, lon1, lat2, lon2)', 
        number=number, 
        setup='from __main__ import calculate_azimuth_haversine_distance_np, calculate_azimuth_haversine_distance_pyproj',
        globals=locals()
    ) / number * 1000
import matplotlib.pyplot as plt
from itertools import product
import pandas as pd
import seaborn as sns

n = np.power(10, range(4))
prod = product(['np', 'pyproj'], n)
df = pd.DataFrame([p for p in prod], columns=['func', 'n'])

df['t'] = df.apply(lambda row: time_func(func=row['func'], n=row['n']), axis=1)

plt.figure(figsize=(4,3))
## <matplotlib.figure.Figure object at 0x128342908>
ax = sns.barplot(x='n', y='t', hue='func', data=df)
plt.show()

If we want to add these computed features to our coordinates, we could for example create an sklearn transformer. We will base our transformers on pandas, although in general, if you want to productionize your code, it is good practice to use numpy arrays for performance reasons. Those are a bit more challenging to work with, because you need to keep track of the column names yourself if you want to extract things like feature importances.

from sklearn.base import BaseEstimator, TransformerMixin
from typing import Callable
import pandas as pd

class GeoTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, 
                 name, 
                 func: Callable[[float, float, float, float], float], 
                 lat1='pickup_latitude', lon1='pickup_longitude', lat2='dropoff_latitude', lon2='dropoff_longitude'):
        self.name = name
        self.func = func
        
        self.lat1 = lat1
        self.lon1 = lon1
        self.lat2 = lat2
        self.lon2 = lon2
    
    def fit(self, X: pd.DataFrame, y: pd.Series=None):
        # nothing to do here
        return self
    
    def transform(self, X: pd.DataFrame, y: pd.Series=None) -> pd.DataFrame:
        df = X.copy()
        df[self.name] = self.func(
            lat1=X[self.lat1],
            lon1=X[self.lon1], 
            lat2=X[self.lat2], 
            lon2=X[self.lon2], 
        )
        return df

Again, let us define some toy data to test the transformers:

data = pd.DataFrame({
  'pickup_location': ['National park', 'Museum of Photography', 'Museum of Photography', ],
  'dropoff_location': ['Random location in Berlin', 'Alexanderplatz', 'Treptow Harbour'],
  'pickup_latitude': [51.1, 52.5033067, 52.5033067],
  'pickup_longitude': [10.5, 13.3403941, 13.3403941],
  'dropoff_latitude': [52.5, 52.5236452, 52.4937243],
  'dropoff_longitude': [13.37, 13.4042092, 13.4472089],
  'pickup_datetime': pd.to_datetime(['2020-06-18 12:30:00', '2020-06-21 13:45:00', '2020-06-21 14:00:00'])
})
##          pickup_location  ...     pickup_datetime
## 0          National park  ... 2020-06-18 12:30:00
## 1  Museum of Photography  ... 2020-06-21 13:45:00
## 2  Museum of Photography  ... 2020-06-21 14:00:00
## 
## [3 rows x 7 columns]
hd_transformer = GeoTransformer(name='haversine_distance', func=calculate_haversine_distance)
hd_transformer.fit_transform(data)
##          pickup_location  ... haversine_distance
## 0          National park  ...      251169.951012
## 1  Museum of Photography  ...        4871.679462
## 2  Museum of Photography  ...        7304.160510
## 
## [3 rows x 8 columns]
a_transformer = GeoTransformer(name='azimuth', func=calculate_azimuth)
a_transformer.fit_transform(data)
##          pickup_location  ...    azimuth
## 0          National park  ...  50.612016
## 1  Museum of Photography  ...  62.333692
## 2  Museum of Photography  ...  98.340432
## 
## [3 rows x 8 columns]

In a similar fashion we could also add some datetime parts like hour, weekday, day and month, for example to capture traffic trends and seasonalities during the day or even the year.

from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd

class DatetimeTransformer(BaseEstimator, TransformerMixin):
    cols = ['hour', 'weekday', 'day', 'month']
    
    def __init__(self, col: str):
        self.col = col

    def fit(self, X: pd.DataFrame, y: pd.Series=None):
        # nothing to do here
        return self

    def transform(self, X: pd.DataFrame, y: pd.Series=None) -> pd.DataFrame:
        # work on a copy to avoid mutating the caller's DataFrame
        X = X.copy()
        X.loc[:, 'hour'] = X[self.col].dt.hour
        X.loc[:, 'weekday'] = X[self.col].dt.weekday
        X.loc[:, 'day'] = X[self.col].dt.day
        X.loc[:, 'month'] = X[self.col].dt.month
        return X
dt_transformer = DatetimeTransformer(col='pickup_datetime')
dt_transformer.fit_transform(data)
##          pickup_location           dropoff_location  ...  day  month
## 0          National park  Random location in Berlin  ...   18      6
## 1  Museum of Photography             Alexanderplatz  ...   21      6
## 2  Museum of Photography            Treptow Harbour  ...   21      6
## 
## [3 rows x 11 columns]

Lastly we will add a transformer to select specific columns to explore and compare the different features:

from typing import List
import pandas as pd

class ColumnSelector(BaseEstimator, TransformerMixin):
    def __init__(self, cols: List[str]):
        self.cols = cols
    
    def fit(self, X: pd.DataFrame, y: pd.Series=None):
        return self
        
    def transform(self, X: pd.DataFrame, y: pd.Series=None) -> pd.DataFrame:
        return X[self.cols]

Training a model

In a previous post we worked with New York taxi data, containing pickup and dropoff coordinates as well as trip durations.

To extract a few random trips we can use the following query, store the result as a CSV file and load it using pandas:

WITH raw_data AS
(
  SELECT 
    pickup_datetime,
    pickup_latitude,
    pickup_longitude,
    dropoff_latitude,
    dropoff_longitude,
    DATETIME_DIFF(dropoff_datetime, pickup_datetime, SECOND) AS duration
  FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2016`  
  WHERE 
    pickup_latitude BETWEEN -90 AND 90
    AND pickup_longitude IS NOT NULL
    AND dropoff_latitude BETWEEN -90 AND 90
    AND dropoff_longitude IS NOT NULL
),
filtered_data AS
(
  SELECT
    *
  FROM raw_data
  WHERE 
    duration BETWEEN 60 AND 60 * 60 * 6
    AND ST_DISTANCE(ST_GEOGPOINT(pickup_longitude, pickup_latitude), ST_GEOGPOINT(dropoff_longitude, dropoff_latitude)) BETWEEN 100 AND 25000
)
SELECT
  *
FROM filtered_data
WHERE 
  RAND() < 200000 / (SELECT COUNT(*) FROM filtered_data)
data = pd.read_csv("../../../static/data/bigquery/bigquery_public_data_new_york_taxi_trips_tlc_yellow_trips_2016_sample.csv.gz",
                   parse_dates=['pickup_datetime']).sample(200000)

To evaluate the model we are going to build, we should split the data. We will split it randomly, which is somewhat naive in our use case, since the model can learn from almost identical coordinates, which most likely exist in the data. If we want to evaluate how well the model performs on new areas, we could instead split the data geographically, to estimate how well the model abstracts geographies and applies them to unseen areas; for that use case the model will perform rather poorly.
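A grouped, spatial split could look roughly like this (a sketch only, not used for the evaluation below; the cell size of roughly 0.01 degrees is an arbitrary choice):

from sklearn.model_selection import GroupShuffleSplit

# bucket pickups into coarse cells and keep each cell entirely in train or test,
# so nearly identical pickup coordinates cannot leak across the split
cells = (data['pickup_latitude'].round(2).astype(str) + '_'
         + data['pickup_longitude'].round(2).astype(str))
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(data, groups=cells))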

To make the model more robust for edge and outlier cases, or in general for cases outside of the supported area, it can make sense to create a fallback model as proposed in one of my previous blog posts. In any case, features like haversine distance and azimuth (direction) will help the model to abstract estimates from or to unknown coordinates to some degree.

Moreover we will define root mean squared error as our error metric, to penalize large errors more, but we will also look at mean absolute error for improved interpretability.

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np

def rmse(y_true, y_pred):
    return np.round(np.sqrt(mean_squared_error(y_true=y_true, y_pred=y_pred)), 2)

def mae(y_true, y_pred):
    return np.round(mean_absolute_error(y_true=y_true, y_pred=y_pred), 2)

col_target = 'duration'
cols_features = ['pickup_latitude', 'pickup_longitude', 
                 'dropoff_latitude', 'dropoff_longitude', 
                 'pickup_datetime']

x_train, x_test, y_train, y_test = train_test_split(data[cols_features], data[col_target])

Let us now try to get an estimate of how well a linear model performs on the data, also to compare it to the nonlinear models later. Instead of simply using the pickup and dropoff coordinates as features, we can k-bins-discretize them to allow the model to learn some kind of grid pattern:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge

cols_features = ['pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude',
                 'hour', 'weekday', 'day', 'month',
                 'haversine_distance', 'azimuth']
cols_coordinates = ['pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude']

k_bins_discretizer = ColumnTransformer([
    ('k_bins_discretizer', KBinsDiscretizer(n_bins=16, encode='onehot-dense'), cols_coordinates),
], remainder='drop')

pipeline = Pipeline([
    ('haversine_distance', GeoTransformer(name='haversine_distance', func=calculate_haversine_distance)),
    ('azimuth', GeoTransformer(name='azimuth', func=calculate_azimuth)),
    ('datetime', DatetimeTransformer(col='pickup_datetime')),
    ('column_selector', ColumnSelector(cols=cols_features)),
    ('k_bins_discretizer', k_bins_discretizer),
    ('model', Ridge())
])

_ = pipeline.fit(x_train, y_train)
y_pred = pipeline.predict(x_test)
rmse(y_pred=y_pred, y_true=y_test), mae(y_pred=y_pred, y_true=y_test)
## (559.06, 398.98)

We can see that applying the KBinsDiscretizer actually makes the predictions worse:

from sklearn.model_selection import GridSearchCV

cv = GridSearchCV(pipeline, param_grid = {'k_bins_discretizer': ['passthrough', k_bins_discretizer]})
_ = cv.fit(x_train, y_train)
y_pred = cv.predict(x_test)

rmse(y_pred=y_pred, y_true=y_test), mae(y_pred=y_pred, y_true=y_test)
## (401.83, 276.81)
cv.best_params_
## {'k_bins_discretizer': 'passthrough'}

Let us now create a generic pipeline to compare different models: XGBoost, LightGBM, two native sklearn ensembles and the Ridge regression from above as a linear baseline.

from sklearn.pipeline import Pipeline

def create_pipeline(model):
    cols_features = ['pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude',
                     'hour', 'weekday', 'day', 'month',
                     'haversine_distance', 'azimuth']
    
    pipeline = Pipeline([
        ('haversine_distance', GeoTransformer(name='haversine_distance', func=calculate_haversine_distance)),
        ('azimuth', GeoTransformer(name='azimuth', func=calculate_azimuth)),
        ('datetime', DatetimeTransformer(col='pickup_datetime')),
        ('column_selector', ColumnSelector(cols=cols_features)),
        ('model', model)
    ])
    
    return pipeline

LightGBM has the advantage that it can handle categorical features quite well, without the need for one-hot-encoding. In general, one-hot-encoding should be applied carefully with tree models, especially for variables with higher cardinality - you can find a good explanation in this blog post. In our case we will not one-hot-encode categorical variables like weekday or hour. Alternatively, cyclical encoding could be applied to avoid the sparsity and the artificial independence of the dummy features introduced by naive one-hot-encoding.
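For illustration, a cyclical encoding could look like this (a sketch only; the helper name encode_cyclical is my own and it is not used in the pipelines below):

import numpy as np
import pandas as pd

def encode_cyclical(df: pd.DataFrame, col: str, period: int) -> pd.DataFrame:
    # map the feature onto a circle so that e.g. hour 23 and hour 0 end up close together
    df = df.copy()
    df[f'{col}_sin'] = np.sin(2 * np.pi * df[col] / period)
    df[f'{col}_cos'] = np.cos(2 * np.pi * df[col] / period)
    return df

# e.g. encode_cyclical(df_with_datetime_parts, col='hour', period=24)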

from lightgbm.sklearn import LGBMRegressor
from sklearn.linear_model import Ridge
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from xgboost import XGBRegressor

models = {
    'LGBMRegressor': LGBMRegressor(importance_type='gain'),
    'GradientBoostingRegressor': GradientBoostingRegressor(),
    'RandomForestRegressor': RandomForestRegressor(),
    'XGBRegressor': XGBRegressor(objective='reg:squarederror'),
    'Ridge': Ridge(),
}

Let us now create pipelines for each model and evaluate them using our test data:

import time

pipelines = {}

for model_name, model in models.items():
    pipeline = create_pipeline(model)
    
    t = time.time()
    _ = pipeline.fit(x_train, y_train)
    pipelines[model_name] = pipeline
    
    y_pred = pipeline.predict(x_test)
    
    metrics = {
      'rmse': rmse(y_pred=y_pred, y_true=y_test),
      'mae': mae(y_pred=y_pred, y_true=y_test),
      'fit_time': np.round(time.time() - t, 2)
    }
    
    print(f'{model_name}: {metrics}')
## LGBMRegressor: {'rmse': 299.21, 'mae': 196.23, 'fit_time': 1.18}
## GradientBoostingRegressor: {'rmse': 332.25, 'mae': 219.75, 'fit_time': 34.99}
## RandomForestRegressor: {'rmse': 303.28, 'mae': 197.26, 'fit_time': 140.97}
## XGBRegressor: {'rmse': 332.9, 'mae': 220.31, 'fit_time': 12.15}
## Ridge: {'rmse': 401.83, 'mae': 276.81, 'fit_time': 0.3}

Let us extract, normalize and compare the feature importances:

feature_importances = {
    model_name: pipeline.named_steps['model'].feature_importances_ / sum(pipeline.named_steps['model'].feature_importances_)
    for model_name, pipeline in pipelines.items()
    if hasattr(pipeline.named_steps['model'], 'feature_importances_')
}

feature_importances_df = pd.DataFrame(feature_importances, index=cols_features)
feature_importances_df.plot(kind='bar')
## <matplotlib.axes._subplots.AxesSubplot object at 0x127f229e8>

We can also take a quick look at the prediction latency of the full pipelines, to get a rough estimate:

import timeit

n = 100
rows = 10

for model_name, pipeline in pipelines.items():
    t = timeit.timeit(stmt='pipeline.predict(x_test.iloc[0:rows,:])', 
                      globals=globals(), 
                      number=n) / n
    print(f'{model_name}: {t * 1000:.1f}ms')
## LGBMRegressor: 56.1ms
## GradientBoostingRegressor: 31.8ms
## RandomForestRegressor: 61.4ms
## XGBRegressor: 28.9ms
## Ridge: 30.4ms

Now let us take a look at some predictions by visualizing them using Uber’s kepler.gl. We can for example pick a random pickup location and make predictions for many different dropoff locations and map these onto a choropleth map.

We can use pyproj again to project some corner coordinates into a flat space, generate a grid of equally spaced points there and then convert the grid points back to latitudes and longitudes.

from pyproj import Transformer
import numpy as np

# define a grid using lat longs of south-west and north-east corners and step size in meters
def create_grid(sw_lon, sw_lat, ne_lon, ne_lat, step_size):
    transformer = Transformer.from_crs("epsg:4326", "epsg:3857", always_xy=True)
    sw_lon_transformed, sw_lat_transformed = transformer.transform(sw_lon, sw_lat)
    ne_lon_transformed, ne_lat_transformed = transformer.transform(ne_lon, ne_lat) 

    # create actual grid
    x = np.arange(sw_lon_transformed, ne_lon_transformed, step_size)
    y = np.arange(sw_lat_transformed, ne_lat_transformed, step_size)
    xx, yy = np.meshgrid(x, y)
    
    lons, lats = transformer.transform(xx.ravel(), yy.ravel(), errcheck=True, direction='inverse')
    return lons, lats
lons, lats = create_grid(sw_lon=-74.031216,  sw_lat=40.741904,  ne_lon=-73.893833, ne_lat=40.807015,  step_size=200)

grid = pd.DataFrame({
    'pickup_datetime': pd.to_datetime('2020-06-20 12:30:00'),
    'pickup_latitude': 40.770880,
    'pickup_longitude': -73.949176,
    'dropoff_latitude': lats,
    'dropoff_longitude': lons,  
})

We can now get our pipeline using the LightGBM regressor and do some predictions:

pipeline = pipelines['LGBMRegressor']
grid['y_pred'] = pipeline.predict(grid)
grid.sample(10)
##          pickup_datetime  pickup_latitude  ...  dropoff_longitude       y_pred
## 3118 2020-06-20 12:30:00         40.77088  ...         -73.962944   701.220479
## 437  2020-06-20 12:30:00         40.77088  ...         -73.937791   755.670820
## 685  2020-06-20 12:30:00         40.77088  ...         -73.907248  1071.559425
## 2315 2020-06-20 12:30:00         40.77088  ...         -74.022233  1881.329286
## 2247 2020-06-20 12:30:00         40.77088  ...         -74.006063  1433.173588
## 488  2020-06-20 12:30:00         40.77088  ...         -73.984504  1193.817610
## 3173 2020-06-20 12:30:00         40.77088  ...         -74.002470  1384.050994
## 526  2020-06-20 12:30:00         40.77088  ...         -73.916232  1017.198240
## 301  2020-06-20 12:30:00         40.77088  ...         -73.905452  1123.225951
## 3564 2020-06-20 12:30:00         40.77088  ...         -73.991690  1354.784517
## 
## [10 rows x 6 columns]
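To get the grid predictions into kepler.gl we can dump them to a CSV file (a minimal sketch; the file name is just an example):

# export the grid with its predictions for upload to kepler.gl
grid.to_csv('grid_predictions.csv', index=False)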

Finally we can upload the exported grid predictions to kepler.gl. I've prepared an animated visualization of the grid, where the grid cells are colored depending on the predicted value (red to yellow, low to high):

New York model isochrones


We can see that the model learns a circular pattern, and also some street topology. But can we also incorporate bridges, highways, etc.? For example by using the taxi zones provided as another public dataset, or alternatively by applying clustering first and using the assigned cluster as a feature. Since our data originates from New York we could also use a Manhattan metric, as the streets are well structured and follow a grid layout. Some of these challenges will be tackled in a second part.
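A rough sketch of this Manhattan-style distance, reusing the calculate_haversine_distance function from above (my own approximation, not part of the trained pipelines):

def calculate_manhattan_distance(lat1, lon1, lat2, lon2):
    # approximate an L-shaped route: a north-south leg plus an east-west leg
    north_south = calculate_haversine_distance(lat1, lon1, lat2, lon1)
    east_west = calculate_haversine_distance(lat2, lon1, lat2, lon2)
    return north_south + east_west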

Update from 2020-06-25: In case you are interested, an excellent blog post on the same topic was published via Medium a few days ago. Using Gaussian mixtures for clustering is a very good approach, as the cluster assignment can be done very fast.