Dockerizing an R machine learning model with an S3 connection and end-to-end tests on Travis


In this post I want to give a short introduction to setting up a dockerized R process or script which reads data from and writes data to AWS S3 and is tested from end to end via Travis. As an example process we will set up a simple random forest model which we will use to compute the importance of features for a classification problem.

For our end-to-end test we will use the iris dataset, i.e. the classification problem: species ~ sepal_length + sepal_width + petal_length + petal_width. For the sake of simplicity we will refrain from techniques like cross-validation and hyperparameter tuning. We will also serve the model as one big chunk, i.e. we will do the training and the “prediction” of feature importance in one step. In most scenarios a machine learning model is served in at least two steps: one to process the training data and train the model, and another one to generate predictions in batches or on demand using the previously trained model. This will be presented in another blog post, as it comes with a more complicated infrastructure.
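
For the end-to-end test later on we will need the iris data as a csv file with snake_case column names. A minimal sketch to generate such a test fixture (the path tests/data/input.csv is an assumption):

# write the iris data with snake_case column names as local test input
# (the path is an assumption, adjust it to your test setup)
input <- iris
names(input) <- c("sepal_length", "sepal_width", "petal_length", "petal_width", "species")
readr::write_csv(input, "tests/data/input.csv")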

Running a script within a Docker container offers multiple advantages: isolation, reproducibility & immutability of the code and all the used packages and dependencies, but also near-independence from the operating system. There are tons of hosted or locally deployable solutions to schedule and run Docker containers. AWS S3 is a simple & fast file storage which can be used as persistent file storage for the Docker container, and Travis offers a platform to build Docker images and enable proper testing and deployment (and much more!).

Instead of setting up the container as a batch prediction model, we set up a container which handles training and prediction in one step:

#' calculate feature importance of classification where one column (label) is used 
#' as response and all others as explanatory variables
#' @param data a data frame 
#' @param label a column of data used as label
#' @param verbose if TRUE, print the feature importances as messages
#' @return a data frame with feature importance 
calculate_feature_importance <- function(data, label, verbose=TRUE) {
  if (!label %in% colnames(data)) {
    stop("Label column missing in data: ", label)
  }
  
  # use label as response and all other columns as explanatory variables
  formula <- as.formula(sprintf("%s ~ .", label))
  
  # use impurity as variable importance mode to enable extraction of feature importance
  model <- ranger::ranger(formula, data = data, 
                          importance = "impurity", 
                          seed = 42)
  
  # tidy output
  variable_importance_list <- model$variable.importance
  variable_importance <- dplyr::tibble("name" = names(variable_importance_list), 
                                       "importance" = variable_importance_list)
  variable_importance <- dplyr::arrange(variable_importance, dplyr::desc(importance))
  
  if (verbose) {
    message("Feature importance:")
    for (i in 1:nrow(variable_importance)) {
      row <- variable_importance[i,]
      message(sprintf("%s: %s%%", row$name, round(row$importance, 4)))
    }
  }
  return(variable_importance)
}
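
A quick local check of this function could look as follows (a sketch, using the iris data with snake_case column names):

# minimal local check: compute the feature importances on the iris data
data <- iris
names(data) <- c("sepal_length", "sepal_width", "petal_length", "petal_width", "species")
feature_importance <- calculate_feature_importance(data, label = "species")
# returns a tibble with one row per feature, ordered by decreasing importance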

Around this function we will set up a script which takes care of reading the input data from and writing the output data to S3 (for example a data lake) or alternatively a local folder. Reading from local data sources enables us to write tests on local data. The S3 client from the aws.s3 package tries to read the AWS credentials from the environment (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and possibly AWS_SESSION_TOKEN); a small sanity check for this follows after the read and write helpers below.

#' read file from folder (if it exists) or s3 bucket and parse to data frame
#' @param file a file name
#' @param folder a folder name or s3 bucket name
#' @return parsed csv file from folder (if exists, else: s3 bucket) 
read_data <- function(file, folder) {
  if (dir.exists(folder)) {
    message("Reading from local folder: ", folder)
    path <- file.path(folder, file)
    data <- readr::read_csv(path)
  } else {
    message("Reading from s3 bucket: ", folder)
    data <- aws.s3::s3read_using(FUN = readr::read_csv, 
                                 object = file, 
                                 bucket = folder)
  }
  return(data)
}

#' write data frame to csv file in folder (if it exists) or s3 bucket
#' @param data a dataframe
#' @param file a file name
#' @param folder a folder name or s3 bucket name
#' @return written data frame invisibly
write_data <- function(data, file, folder) {
  if (dir.exists(folder)) {
    message("Writing to local folder: ", folder)
    path <- file.path(folder, file)
    readr::write_csv(data, path)
  } else {
    message("Writing to s3 bucket: ", folder)
    aws.s3::s3write_using(x = data, 
                          FUN = readr::write_csv,
                          object = file,
                          bucket = folder)
  }
  invisible(data)
}
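
Since aws.s3 reads the credentials from the environment, a simple sanity check before pointing read_data and write_data at a real bucket could look like this (a sketch; when running the container the variables can be passed via docker run -e):

# fail early if the AWS credentials are not set in the environment
stopifnot(
  nzchar(Sys.getenv("AWS_ACCESS_KEY_ID")),
  nzchar(Sys.getenv("AWS_SECRET_ACCESS_KEY"))
)
# AWS_SESSION_TOKEN may additionally be required for temporary credentials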

Having defined our simple process calculate_feature_importance and our data read and write functions, we can set up the script in the following way: the script reads the command line arguments, which we use to pass parameters like the input file and folder as well as the output file and folder. This can be achieved using commandArgs; the single argument is either a path to a JSON file or a JSON string containing the parameters. Afterwards the data is read via read_data, processed via calculate_feature_importance and the resulting data frame is saved again via write_data. Finally a simple response file with the elapsed processing time is written.

args <- commandArgs(TRUE)
arg1 <- args[1]

# if arg is file read from file & parse json, else just parse json
if (file.exists(arg1)) {
  params_str <- readr::read_file(arg1)
  params <- jsonlite::fromJSON(params_str)
} else {
  params <- jsonlite::fromJSON(arg1)
}

source("utils/utils.R")
source("utils/process.R")

start_time <- Sys.time()

input_data <- read_data(params$input_file, params$input_folder)

output_data <- calculate_feature_importance(input_data, label = params$label)

write_data(output_data, params$output_file, params$output_folder)

# write response file
processing_time <- as.numeric(difftime(Sys.time(), start_time, units = "secs"))
response <- jsonlite::toJSON(x = list(processing_time = processing_time), 
                             pretty = TRUE, 
                             auto_unbox = TRUE)
readr::write_file(x = response, 
                  path = file.path(params$output_folder, "response.json"))
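
The parameters JSON itself could be generated as follows (a sketch: the field names are exactly the ones read by the script, while the concrete values and the file name tests/params.json are assumptions):

# write a parameters file for a local test run (values are assumptions)
params <- list(
  input_file = "input.csv",
  input_folder = "tests/data",
  label = "species",
  output_file = "output.csv",
  output_folder = "tests/data/output"
)
readr::write_file(x = jsonlite::toJSON(params, pretty = TRUE, auto_unbox = TRUE),
                  path = "tests/params.json")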

Now we will define the Dockerfile to build our container. As a base image we can use rocker/tidyverse; in practice it makes sense to pin the version, i.e. rocker/tidyverse:3.6. We can add the additionally required packages aws.s3 (our R S3 client) and ranger (our random forest model). For these it also makes sense to pin the versions: we can use the CRAN read-only mirror on GitHub and install a specific version using remotes::install_github("user/repo@version"). In that case it makes sense to add a GitHub token to the build step to not hit the GitHub API limits from the Travis machines. Alternatively we could install specific versions using remotes::install_version.
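
As a sketch of that alternative, the two packages could be pinned directly from CRAN (with the same versions as in the Dockerfile below):

# alternative to the CRAN GitHub mirror: pin the package versions directly from CRAN
remotes::install_version("aws.s3", version = "0.3.12")
remotes::install_version("ranger", version = "0.11.2")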

In another layer we will add our utility functions in utils and the script script.R itself. Moreover we will add some tests to enable end-to-end testing of the container, for example on Travis. Finally we set script.R as the entrypoint of the image.

FROM rocker/tidyverse:3.6

ARG GITHUB_PAT

RUN Rscript -e 'remotes::install_github("cran/aws.s3@0.3.12", quiet=TRUE)' \
	&& Rscript -e 'remotes::install_github("cran/ranger@0.11.2", quiet=TRUE)'

ADD utils utils
ADD script.R script.R

ADD tests tests

ENTRYPOINT ["Rscript", "script.R"]

The image can be built, tagged and run via docker build and docker run, where we pass our parameters JSON as a command-line argument to Rscript:

docker build -t feature_importance .
docker run -t feature_importance '{"input_folder": "...", ...}'

If we want to access the output (or input) data we can mount the current working directory into the container:

docker run \
  -v $(pwd)/output:/tests/data/output \
  -t feature_importance \
  tests/params.json

This enables us to run the container on test data and evaluate the output, a test from end to end: we can check whether the container exits properly and whether the output data and the response file are as expected. By running the test on the container instead of on the plain R code we can make sure that the image is built properly. In theory we should also add unit tests for our functions read_data, calculate_feature_importance, etc., for example via testthat.
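
A minimal sketch of such a unit test with testthat could look like the following (assuming calculate_feature_importance lives in utils/process.R):

library(testthat)
source("utils/process.R")  # assumption: this file defines calculate_feature_importance

test_that("feature importance is computed for all explanatory variables", {
  data <- iris
  names(data) <- c("sepal_length", "sepal_width", "petal_length", "petal_width", "species")
  result <- calculate_feature_importance(data, label = "species", verbose = FALSE)
  expect_setequal(result$name, setdiff(names(data), "species"))
  expect_true(all(result$importance >= 0))
  expect_error(calculate_feature_importance(data, label = "not_a_column"))
})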

To build the image and run the tests we can use Travis in combination with pytest, using the following .travis.yml, which tells Travis what to do:

language: python
python:
  - "3.6"
services:
  - docker

install:
  - pip install -r requirements.txt

script:
  - docker build --build-arg GITHUB_PAT=${GITHUB_PAT} -t feature_importance .
  - pytest

In the tests folder we can now add the following tests, in our case written in Python, which will be picked up and run by pytest in the script step, right after the image is built:

  • run docker image with test params and data via subprocess
  • verify exit code
  • verify output data using pandas' assert_frame_equal
  • verify response file json

import subprocess
import pandas as pd
from pandas.testing import assert_frame_equal
import json


def test_end_to_end():
    cmd = "docker run -v $(pwd)/output:/tests/data/output -t feature_importance tests/params.json"
    exit_code = subprocess.call(cmd, shell=True)

    # check exit code
    assert exit_code == 0

    # check output data against the expected feature importances,
    # ordered by decreasing importance as written by the container
    output_file = 'output/output.csv'
    output_data = pd.read_csv(output_file)
    output_data_expected = pd.DataFrame({
        'name': ['petal_width', 'petal_length', 'sepal_length', 'sepal_width'],
        'importance': [43.81, 43.52, 9.57, 2.34]
    })
    assert_frame_equal(output_data, output_data_expected, check_exact=False, check_less_precise=1)

    # check response file
    response_file = 'output/response.json'
    with open(response_file) as json_file:
        response = json.load(json_file)
    assert response['processing_time'] < 10

The code can be found on GitHub, the builds on Travis. Travis can also be used to deploy the container to AWS ECR; you can find a nice post about this here. The remote execution of the container can then, for example, be done via Airflow or Kubernetes.