Caching multiple R functions to an S3 bucket via memoise and aws.s3


In practice it can make sense to memoize functions that are called multiple times with the same input. In R this can be done via the memoise package - and instead of simply caching to memory or local disk, we can easily cache expensive function calls to cloud storage like S3 or GCS.

In R it is as simple as this:

wait <- function(s = 0) {
  Sys.sleep(s)
  return(s)
}

mwait <- memoise::memoise(wait)

If we now call our wait function with s = 3 for the first time, it will be executed, as it is the first call with this argument.

system.time({mwait(3)})
##    user  system elapsed 
##   0.001   0.000   3.004

Calling the function again with the same argument now returns the result from the cache:

system.time({mwait(3)})
##    user  system elapsed 
##   0.032   0.002   0.034

We can also reset the cache by calling memoise::forget(mwait).
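
To illustrate, after a reset the next call is executed again instead of being served from the cache:

memoise::forget(mwait)

# the next call takes ~3 seconds again,
# since the cached result was dropped
system.time({mwait(3)})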

In the default case the results are cached in memory. To persist the cache it makes sense to cache to disk instead, and for distributed processes we could even cache to cloud storage like S3. Luckily this functionality already exists in the memoise package: we can simply cache to an S3 bucket:

cache <- memoise::cache_s3("our_bucket_name")
mwait <- memoise::memoise(wait, cache = cache)
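
For this to work aws.s3 needs valid AWS credentials. One option (among others, such as IAM roles or shared credentials files) is to provide them via environment variables before creating the cache - the values below are of course placeholders:

# hypothetical placeholder credentials - replace with your own
Sys.setenv(
  "AWS_ACCESS_KEY_ID" = "my-access-key-id",
  "AWS_SECRET_ACCESS_KEY" = "my-secret-access-key",
  "AWS_DEFAULT_REGION" = "eu-central-1"
)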

If we want to cache multiple functions, we would need to set up one bucket per function. In theory we could cache multiple functions to the same bucket, because hash key collisions are very, very rare, but then resetting the cache of one function would also reset the cache of the others.
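
With separate buckets this could look as follows (the bucket names are hypothetical):

cache_wait <- memoise::cache_s3("our-wait-cache-bucket")
mwait <- memoise::memoise(wait, cache = cache_wait)

cache_rnorm <- memoise::cache_s3("our-rnorm-cache-bucket")
mrnorm <- memoise::memoise(rnorm, cache = cache_rnorm)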

If we still want to cache our functions to the same bucket using subfolders (or prefixes, in S3 terminology), we have to slightly modify the original cache_s3 function, which provides the caching interface. The original cache_keys, for example, fetches all keys of the bucket via aws.s3::get_bucket, whereas in our case we only want to fetch keys with a specific prefix. This can easily be achieved via aws.s3::get_bucket's prefix argument. The changes to the other cache functions are analogous.

#' similar to memoise::cache_s3, but a folder (prefix) can be used to have a structure within the bucket and
#' enable caching of multiple functions to the same bucket without the caches interfering (for ex. when performing cache resets)
#' @param bucket a bucket name
#' @param folder a folder name, for ex. the name of the function this cache should be used for
cache_s3_folder <- function(bucket, folder) {
  if (!(requireNamespace("aws.s3"))) {
    stop("Package `aws.s3` must be installed for `cache_s3_folder()`.")
  }

  # construct the prefix from the folder name by adding a trailing slash if needed
  if (!endsWith(folder, "/")) {
    prefix <- paste0(folder, "/")
  } else {
    prefix <- folder
  }

  if (!(aws.s3::bucket_exists(bucket))) {
    aws.s3::put_bucket(bucket)
  }

  # instead of checking key directly we will check for prefix/key
  get_object_name <- function(key) paste0(prefix, key)

  path <- tempfile("memoise-")
  dir.create(path)

  cache_reset <- function() {
    # cache_keys() only returns keys under this cache's prefix,
    # so other functions' caches in the same bucket are untouched
    keys <- cache_keys()
    lapply(keys, aws.s3::delete_object, bucket = bucket)
  }

  cache_set <- function(key, value) {
    temp_file <- file.path(path, key)
    on.exit(unlink(temp_file))
    saveRDS(value, file = temp_file, compress = FALSE)
    aws.s3::put_object(temp_file, object = get_object_name(key), bucket = bucket)
  }

  cache_get <- function(key) {
    temp_file <- file.path(path, key)
    on.exit(unlink(temp_file))
    # download prefix/key to a temp file, then deserialize it
    httr::with_config(httr::write_disk(temp_file, overwrite = TRUE), {
      aws.s3::get_object(object = get_object_name(key), bucket = bucket)
    })
    readRDS(temp_file)
  }

  cache_has_key <- function(key) {
    suppressMessages(aws.s3::head_object(object = get_object_name(key), bucket = bucket))
  }

  cache_drop_key <- function(key) {
    aws.s3::delete_object(get_object_name(key), bucket = bucket)
  }

  cache_keys <- function() {
    items <- lapply(aws.s3::get_bucket(bucket = bucket, prefix = prefix), `[[`, "Key")
    unlist(Filter(Negate(is.null), items))
  }

  list(
    digest = function(...) digest::digest(..., algo = "sha512"),
    reset = cache_reset,
    set = cache_set,
    get = cache_get,
    has_key = cache_has_key,
    drop_key = cache_drop_key,
    keys = cache_keys
  )
}

Using this function we can now cache multiple functions to the same bucket under different subfolders, for ex. our_bucket_name/wait and our_bucket_name/rnorm:

cache1 <- cache_s3_folder("our_bucket_name", "wait")
mwait <- memoise::memoise(wait, cache = cache1)

cache2 <- cache_s3_folder("our_bucket_name", "rnorm")
mrnorm <- memoise::memoise(rnorm, cache = cache2)
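
Resetting one of the caches now only deletes the objects under its own prefix, so the other function's cache stays intact:

mwait(3)    # cached under our_bucket_name/wait/...
mrnorm(5)   # cached under our_bucket_name/rnorm/...

# resets only the wait cache, i.e. deletes the keys under the wait/ prefix
memoise::forget(mwait)

# the rnorm cache keys are still there
cache2$keys()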