ds/dx - a data science & ml engineering blog

Scheduling a Google Cloud Function to periodically back up your personal Spotify Discover Weekly playlist

As an avid fan of Spotify’s Discover Weekly playlist, I always wanted a scheduled, automated, self-controlled, lean and cheap way of backing up the weekly generated tracks. This toy project uses Python (including the spotipy module), Google Cloud Functions, Cloud Scheduler and Secrets as well as Terraform and GitHub Actions to bring the code and infrastructure to life.

Using Terraform to build a serverless Airflow via Amazon Managed Workflows and automatic DAG sync using GitHub Actions

In this post we will once more set up serverless infrastructure via Terraform: an Airflow deployment using Amazon Managed Workflows, plus GitHub Actions to automatically sync the DAG code to S3. As a baseline, we will fork Claudio Bizzotto’s great repository claudiobizzotto/aws-mwaa-terraform.

Building a serverless, containerized batch prediction model using Google Cloud Run, Pub/Sub, Cloud Storage and Terraform

The goal of this post is to set up a serverless infrastructure, managed in code, to serve batch predictions of a machine learning model or any other lightweight computation asynchronously: a Google Cloud Run service will listen for new files in a Cloud Storage bucket via a Pub/Sub topic, trigger a computational process and put the resulting data into another bucket.
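
To make the bucket-to-service flow more concrete, here is a minimal sketch of how a service could unpack such a notification. It assumes a Pub/Sub push subscription and the Cloud Storage `JSON_API_V1` payload format; the function name `parse_gcs_notification` is hypothetical and not taken from the post.

```python
import base64
import json


def parse_gcs_notification(envelope):
    """Extract bucket and object name from a Pub/Sub push envelope.

    Assumption: the envelope is the JSON body that Pub/Sub POSTs to a push
    endpoint, and the message data is the base64-encoded object metadata
    sent by a Cloud Storage notification.
    """
    message = envelope["message"]
    attributes = message.get("attributes", {})
    # Pub/Sub delivers the message payload base64-encoded.
    payload = json.loads(base64.b64decode(message["data"]).decode("utf-8"))
    return {
        "event_type": attributes.get("eventType"),
        "bucket": payload["bucket"],
        "name": payload["name"],
    }
```

A service would typically act only on events of type `OBJECT_FINALIZE`, which signal a newly written file, and ignore the rest.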

Building a serverless, containerized machine learning model API using AWS Lambda & API Gateway and Terraform

The goal of this post is to set up a serverless infrastructure, managed in code, to serve predictions of a containerized machine learning model via a REST API. We will make use of Terraform to manage our infrastructure, including AWS ECR, S3, Lambda and API Gateway.

Simulating the effect of multicollinearity in linear modelling using R, purrr & parallel computing

Working with linear regression is a sometimes under-appreciated skill in data science. As a generalization of very fundamental statistical concepts like t-tests and analysis of variance, it has deep ties into the realm of statistics and can serve as a powerful tool to explain variance in data. Since linear models are only linear in their parameters, they can also describe polynomial and even multiplicative relationships. But due to their parametric nature, linear models are also more vulnerable to extreme values and multicollinearity in the data. We want to analyze the latter in more detail, using a simulation.
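
As a rough illustration of the effect, the following sketch repeatedly fits a two-predictor model and measures how much the first coefficient estimate varies across simulations. It is written in plain Python rather than R and purrr, and the function names, sample sizes and true coefficients (both set to 1) are assumptions made for this example, not the post's setup.

```python
import random
import statistics


def fit_two_predictors(x1, x2, y):
    """OLS without intercept for y = b1*x1 + b2*x2 via the 2x2 normal equations."""
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


def simulate_b1_spread(rho, n_obs=100, n_sims=300, seed=42):
    """Standard deviation of the b1 estimate when x1 and x2 correlate with rho."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_sims):
        x1 = [rng.gauss(0, 1) for _ in range(n_obs)]
        # x2 has unit variance and population correlation rho with x1.
        x2 = [rho * a + (1 - rho**2) ** 0.5 * rng.gauss(0, 1) for a in x1]
        # True model: y = 1*x1 + 1*x2 + standard normal noise.
        y = [a + b + rng.gauss(0, 1) for a, b in zip(x1, x2)]
        estimates.append(fit_two_predictors(x1, x2, y)[0])
    return statistics.stdev(estimates)


spread_uncorrelated = simulate_b1_spread(rho=0.0)
spread_collinear = simulate_b1_spread(rho=0.95)
```

With strongly correlated predictors the estimates should scatter noticeably more, even though the true coefficients are unchanged. In theory the variance is inflated by roughly a factor of 1 / (1 - rho²).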

Speeding up a sklearn model pipeline to serve single predictions with very low latency

a.k.a. writing your own sklearn functions, part 3. If you have worked with sklearn before, you have certainly come across the struggle between using dataframes or arrays as inputs to your transformers and estimators. Both have their advantages and disadvantages. But once you deploy your model, for example as a service, in many cases it will serve single predictions. Max Halford has shown some great examples of how to improve various sklearn transformers and estimators to serve single predictions with an extra performance boost and potential responses in the low millisecond range! In this short post we will advance these tricks and develop a full pipeline.
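
The general idea can be sketched in plain Python: for a single row, operating on a dict skips the overhead of building arrays or dataframes. The classes below (`DictStandardScaler`, `DictLinearModel`, `SinglePredictionPipeline`) are hypothetical stand-ins written for this example; they are neither part of sklearn nor the implementation from the post.

```python
import math


class DictStandardScaler:
    """Standardizes features of a single row given as a dict."""

    def fit(self, rows):
        keys = list(rows[0].keys())
        n = len(rows)
        self.means_ = {k: sum(r[k] for r in rows) / n for k in keys}
        # Population standard deviation; fall back to 1.0 for constant features.
        self.stds_ = {
            k: math.sqrt(sum((r[k] - self.means_[k]) ** 2 for r in rows) / n) or 1.0
            for k in keys
        }
        return self

    def transform_one(self, row):
        return {k: (v - self.means_[k]) / self.stds_[k] for k, v in row.items()}


class DictLinearModel:
    """Linear model with fixed weights, predicting from a dict of features."""

    def __init__(self, weights, intercept=0.0):
        self.weights = weights
        self.intercept = intercept

    def predict_one(self, row):
        return self.intercept + sum(self.weights[k] * v for k, v in row.items())


class SinglePredictionPipeline:
    """Chains the scaler and the model for one row at a time."""

    def __init__(self, scaler, model):
        self.scaler = scaler
        self.model = model

    def predict_one(self, row):
        return self.model.predict_one(self.scaler.transform_one(row))
```

In practice the means, scales and weights would be copied from an already fitted sklearn pipeline, so training stays in sklearn and only the serving path is replaced.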

Estimating travel times using geospatial feature engineering and tree based models

In the following post I want to give some impressions of how historic travel time data can be used to build a model for travel time estimation. If you are collecting your own characteristic travel data, then you might actually be able to beat third-party solutions in terms of accuracy as well as cost. We will see why tree models might be suitable for this use case, and we will do some feature engineering to improve the model performance.