ds/dx - a data science & ml engineering blog

Managing cloud infrastructure in code with Terraform: spawning your own jupyterhub instance on AWS with mounted S3 bucket

Terraform is an open-source tool for managing cloud infrastructure as code. With just a few lines of code we can boot up instances, create buckets, provision databases and many other resources. At the same time Terraform allows us to change and evolve infrastructure dynamically in a reproducible way. In our case we will use Terraform to launch our own jupyterhub (using 'the littlest jupyterhub' distribution) on AWS, with a mounted S3 bucket for permanent storage.

Using BigQuery on hundreds of millions of events: deduplicate, analyse, visualize, find connected intervals and aggregate to timeseries

BigQuery allows us to work on hundreds of millions of data points with ease - deduplicating rows, finding connected intervals of events (the islands and gaps problem) or aggregating events to a timeseries (e.g. to feed them to a forecast model or use them in reporting). With a few tricks the computation can be done in a cheap, fast and simple manner.
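
To give a flavor of the islands and gaps logic (the post itself solves this in BigQuery SQL), here is a minimal pandas sketch on hypothetical event data: events that lie closer together than a chosen gap threshold are grouped into the same connected interval.

```python
import pandas as pd

# Hypothetical event data: one timestamp column, already deduplicated.
events = pd.DataFrame({
    "ts": pd.to_datetime([
        "2024-01-01 10:00", "2024-01-01 10:05", "2024-01-01 10:10",
        "2024-01-01 12:00", "2024-01-01 12:05",
    ])
})

# Islands & gaps: events closer than `max_gap` belong to the same interval.
max_gap = pd.Timedelta(minutes=15)
events = events.sort_values("ts")

# A new island starts whenever the gap to the previous event exceeds max_gap.
new_island = events["ts"].diff() > max_gap
events["island_id"] = new_island.cumsum()

# Aggregate each connected interval to start, end and number of events.
intervals = events.groupby("island_id")["ts"].agg(start="min", end="max", n_events="count")
print(intervals)
```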

Using neural networks with embedding layers to encode high cardinality categorical variables

or - how can we use categorical features with thousands of different values? There are multiple ways to encode categorical features. If no ordered relation between the categories exists, one-hot encoding is a popular candidate (i.e. adding a binary feature for every category), alongside many others. But one-hot encoding has some drawbacks - which can be tackled by using embeddings.
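
As a rough illustration (not the exact model from the post), here is a minimal Keras sketch of an embedding layer for a single high-cardinality categorical feature, assuming the categories are already integer-encoded and combined with a handful of numeric features:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical setup: one categorical feature with ~10,000 distinct values
# (already integer-encoded) plus a few numeric features.
n_categories = 10_000
embedding_dim = 16

cat_input = layers.Input(shape=(1,), name="category")
num_input = layers.Input(shape=(5,), name="numeric")

# The embedding layer maps every category id to a dense 16-dimensional vector
# that is learned together with the rest of the network.
embedded = layers.Embedding(input_dim=n_categories, output_dim=embedding_dim)(cat_input)
embedded = layers.Flatten()(embedded)

x = layers.Concatenate()([embedded, num_input])
x = layers.Dense(32, activation="relu")(x)
output = layers.Dense(1)(x)

model = tf.keras.Model(inputs=[cat_input, num_input], outputs=output)
model.compile(optimizer="adam", loss="mse")

# Dummy data just to show the expected input shapes.
X_cat = np.random.randint(0, n_categories, size=(256, 1))
X_num = np.random.rand(256, 5)
y = np.random.rand(256)
model.fit([X_cat, X_num], y, epochs=1, verbose=0)
```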

Writing your own sklearn transformer: DataFrames, feature scaling and ColumnTransformer (writing your own sklearn functions, part 2)

Since scikit-learn added DataFrame support to its API a while ago, modifying and writing your own transformers has become a lot easier. Many of sklearn's home remedies still work with numpy arrays internally or return arrays, which often makes a lot of sense when it comes to performance. Performance can be especially important in pipelines, where a single slow transformer can quickly become a bottleneck.
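
A minimal sketch of what such a transformer can look like - a hypothetical SimpleScaler that standard-scales selected columns and hands back a DataFrame instead of a bare array:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class SimpleScaler(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: standard-scales the given columns of a
    DataFrame and returns a DataFrame again (instead of a numpy array)."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Learn per-column means and standard deviations on the training data.
        self.means_ = X[self.columns].mean()
        self.stds_ = X[self.columns].std()
        return self

    def transform(self, X):
        X = X.copy()
        X[self.columns] = (X[self.columns] - self.means_) / self.stds_
        return X

# Usage on a toy DataFrame.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": ["x", "y", "z"]})
scaled = SimpleScaler(columns=["a"]).fit_transform(df)
print(scaled)
```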

Using a Keras model in combination with sklearn preprocessing, pipelines, grid search and cross-validation

In this post I want to give a short intro on how to wrap a (sequential) Keras deep learning model in a sklearn estimator to enable usage of all the nice standard sklearn tools like pipelines and grid search. Sklearn pipelines allow us to easily use sklearn's home remedies for imputation and scaling, a necessity for most deep learning models. Via sklearn's grid search we can tune some of the neural network's model parameters (like the number of units in a layer or the dropout rate) as well as some fitting parameters (like epochs or batch size).
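
A minimal sketch of the idea, assuming the scikeras package (the maintained successor of the old tf.keras scikit-learn wrappers) - the build function, layer sizes and parameter grid are purely illustrative:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from scikeras.wrappers import KerasRegressor  # assumption: scikeras is installed
from tensorflow import keras

def build_model(n_units=32, dropout=0.1, meta=None):
    # `meta` is filled in by scikeras, e.g. with the number of input features.
    n_features = meta["n_features_in_"]
    model = keras.Sequential([
        keras.layers.Input(shape=(n_features,)),
        keras.layers.Dense(n_units, activation="relu"),
        keras.layers.Dropout(dropout),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("nn", KerasRegressor(model=build_model, epochs=10, batch_size=32, verbose=0)),
])

# Tune model parameters (units, dropout) and fitting parameters (batch size).
param_grid = {
    "nn__model__n_units": [16, 32],
    "nn__model__dropout": [0.0, 0.2],
    "nn__batch_size": [16, 32],
}

X, y = np.random.rand(200, 5), np.random.rand(200)
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```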

Combining tree based models with a linear baseline model to improve extrapolation (writing your own sklearn functions, part 1)

This post is a short intro on combining different machine learning models for practical purposes, to find a good balance between their advantages and disadvantages. In our case we will ensemble a random forest, a very powerful non-linear, non-parametric tree-based all-rounder, with a classical linear regression model, which is very easy to interpret and can be verified using domain knowledge.
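
A minimal sketch of the idea with a hypothetical BlendedRegressor that averages the predictions of a linear baseline and a random forest with a fixed weight (sklearn's VotingRegressor offers similar functionality out of the box):

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

class BlendedRegressor(BaseEstimator, RegressorMixin):
    """Hypothetical ensemble: fit a linear baseline and a random forest on the
    same data and average their predictions with a fixed weight."""

    def __init__(self, weight=0.5):
        self.weight = weight  # share of the linear baseline in the blend

    def fit(self, X, y):
        self.linear_ = LinearRegression().fit(X, y)
        self.forest_ = RandomForestRegressor(random_state=0).fit(X, y)
        return self

    def predict(self, X):
        return (self.weight * self.linear_.predict(X)
                + (1 - self.weight) * self.forest_.predict(X))

# Toy data with a truly linear relation: outside the training range the forest
# alone cannot extrapolate, so the blend lands closer to the true value.
X_train = np.arange(0, 100).reshape(-1, 1).astype(float)
y_train = 2 * X_train.ravel() + 5
X_new = np.array([[150.0]])  # outside the training range
print(BlendedRegressor(weight=0.5).fit(X_train, y_train).predict(X_new))
```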

Caching multiple R functions to an S3 bucket via memoise and aws.s3

In practice it can make sense to memoize functions that are called multiple times with the same input. In R this can be done via the memoise package - and instead of simply caching to local disk or memory we can easily cache expensive function calls to cloud storage like S3 or GCS.