B Cautionary notes

This chapter addresses drake’s known edge cases, pitfalls, and weaknesses that may not be fixed in future releases. For the most up-to-date information on unhandled edge cases, please visit the issue tracker, where you can submit your own bug reports as well. Be sure to search the closed issues too, especially if you are not using the most up-to-date development version of drake. For help with debugging and testing drake projects, please refer to the separate chapter on that topic.

B.1 Projects built with drake <= 4.4.0 are not compatible with drake > 4.4.0.

The cache internals changed between 4.4.0 and 5.0.0. Unfortunately, you will need to run make() on your project all over again.

B.2 Workflow plans

B.2.1 Externalizing commands in R script files

It is common practice to divide the work of a project into multiple R files, but if you do this, you will not get the most out of drake. Please see the chapter on organizing your files for more details.

B.2.2 Commands are NOT perfectly flexible.

In your workflow plan data frame (produced by drake_plan() and accepted by make()), your commands can usually be flexible R expressions.

drake_plan(
  target1 = 1 + 1 - sqrt(sqrt(3)),
  target2 = my_function(web_scraped_data) %>% my_tidy
)
## # A tibble: 2 x 2
##   target  command                                  
##   <chr>   <chr>                                    
## 1 target1 1 + 1 - sqrt(sqrt(3))                    
## 2 target2 my_function(web_scraped_data) %>% my_tidy

However, please try to avoid formulas and function definitions in your commands. You may be able to get away with drake_plan(f = function(x){x + 1}) or drake_plan(f = y ~ x) in some use cases, but be careful. It is generally better to define functions and formulas in your workspace and then let make() import them. (Alternatively, use the envir argument to make() to tightly control which imported functions are available.) Use check_plan() to screen and quality-control your workflow plan data frame, tracked() to see the items that are reproducibly tracked, and vis_drake_graph() and build_drake_graph() to see the dependency structure of your project.
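For example, here is a minimal sketch (the function and target names are hypothetical) of defining a function in your workspace and letting make() import it instead of embedding the definition in a command.

square_root_sum <- function(x){
  sqrt(sum(x)) # Defined in the workspace, so make() imports and tracks it.
}

plan <- drake_plan(result = square_root_sum(c(1, 2, 3)))
make(plan) # square_root_sum() is reproducibly tracked as an import.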

B.3 Execution

B.3.1 Install drake properly.

You must properly install drake using install.packages(), devtools::install_github(), or a similar approach. Functions like devtools::load_all() are insufficient, particularly for parallel computing functionality in which separate new R sessions try to require(drake).

B.3.2 Install all your packages.

Your workflow may depend on external packages such as ggplot2, dplyr, and MASS. Such packages must be formally installed with install.packages(), devtools::install_github(), devtools::install_local(), or a similar command. If you load uninstalled packages with devtools::load_all(), results may be unpredictable and incorrect.

B.3.3 A note on tidy evaluation

Running commands in your R console is not always exactly like running them with make(). That’s because make() uses tidy evaluation as implemented in the rlang package.

## This workflow plan uses rlang's quasiquotation operator `!!`.
my_plan <- drake_plan(list = c(
  little_b = "\"b\"",
  letter = "!!little_b"
))
my_plan
## # A tibble: 2 x 2
##   target   command   
##   <chr>    <chr>     
## 1 little_b "\"b\""   
## 2 letter   !!little_b
make(my_plan)
## target little_b
## target letter
readd(letter)
## [1] "b"

For the commands you specify in the free-form ... argument, drake_plan() also supports tidy evaluation. For example, it supports quasiquotation with the !! operator. Use tidy_evaluation = FALSE or the list argument to suppress this behavior.

my_variable <- 5

drake_plan(
  a = !!my_variable,
  b = !!my_variable + 1,
  list = c(d = "!!my_variable")
)
## # A tibble: 3 x 2
##   target command      
##   <chr>  <chr>        
## 1 a      5            
## 2 b      5 + 1        
## 3 d      !!my_variable

drake_plan(
  a = !!my_variable,
  b = !!my_variable + 1,
  list = c(d = "!!my_variable"),
  tidy_evaluation = FALSE
)
## # A tibble: 3 x 2
##   target command          
##   <chr>  <chr>            
## 1 a      !!my_variable    
## 2 b      !!my_variable + 1
## 3 d      !!my_variable

For instances of !! that remain in the workflow plan, make() will run these commands in tidy fashion, evaluating the !! operator using the environment you provided.

B.3.4 Find and diagnose your errors.

When make() fails, use failed() and diagnose() to debug. Try the following out yourself.

## Targets with available diagnostic metadata, including errors, warnings, etc.
diagnose()
## [1] "letter"   "little_b"

f <- function(){
  stop("unusual error")
}

bad_plan <- drake_plan(target = f())

withr::with_message_sink(
  stdout(),
  make(bad_plan)
)
## target target
## fail target
## Error: Target `target` failed. Call `diagnose(target)` for details. Error message:
##   unusual error
## Warning: No message sink to remove.

failed() # From the last make() only
## [1] "target"

error <- diagnose(target)$error # See also warnings and messages.

error$message
## [1] "unusual error"

error$call
## f()

error$calls # View the traceback.
## [[1]]
## local({
##     f()
## })
## 
## [[2]]
## eval.parent(substitute(eval(quote(expr), envir)))
## 
## [[3]]
## eval(expr, p)
## 
## [[4]]
## eval(expr, p)
## 
## [[5]]
## eval(quote({
##     f()
## }), new.env())
## 
## [[6]]
## eval(quote({
##     f()
## }), new.env())
## 
## [[7]]
## f()
## 
## [[8]]
## stop("unusual error")

B.3.5 Refresh the drake_config() list early and often.

The master configuration list returned by drake_config() is important to drake’s internals, and you will need it for functions like outdated() and vis_drake_graph(). The config list corresponds to a single call to make(), and you should not modify it by hand afterwards. For example, modifying the targets element post-hoc will have no effect because the graph element will remain the same. It is best to just call drake_config() again.
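For example, here is a minimal sketch (assuming my_plan is your workflow plan data frame) of regenerating the config after a change:

config <- drake_config(my_plan) # Rebuild the config after any change to the plan.
outdated(config)                # Which targets are out of date?
vis_drake_graph(config)         # Inspect the dependency structure with the fresh config.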

B.3.6 Workflows as R packages.

The R package structure is a great way to organize the files of your project. Writing your own package to contain your data science workflow is a good idea, but you will need to

  1. Use expose_imports() to properly account for all your nested function dependencies, and
  2. If you load the package with devtools::load_all(), set the prework argument of make(): e.g. make(prework = "devtools::load_all()").

See the file organization chapter and ?expose_imports for detailed explanations. Thanks to Jasper Clarkberg for the workaround.
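Here is a sketch of the workaround, assuming your workflow lives in a hypothetical package called yourWorkflowPackage:

devtools::load_all()                # Load the package under development.
expose_imports(yourWorkflowPackage) # Let drake see the package's nested functions.
make(
  my_plan,
  prework = "devtools::load_all()"  # Reload the package in any new worker sessions.
)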

B.3.7 The lazy_load flag does not work with "parLapply" parallelism.

Ordinarily, drake prunes the execution environment at every parallelizable stage. In other words, it loads all the dependencies and unloads anything superfluous for entire batches of targets. This approach may require too much memory for some use cases, so there is an option to delay the loading of dependencies using the lazy_load argument to make() (powered by delayedAssign()). There are two major risks.

  1. make(..., lazy_load = TRUE, parallelism = "parLapply", jobs = 2) does not work. If you want to use local multisession parallelism with multiple jobs and lazy loading, try "future_lapply" parallelism instead.

    library(future)
    future::plan(multisession)
    load_mtcars_example() # Get the code with drake_example("mtcars").
    make(my_plan, lazy_load = TRUE, parallelism = "future_lapply")
  2. Delayed evaluation may cause the same dependencies to be loaded multiple times, and these duplicated loads could be slow.

B.3.8 Timeouts may be unreliable.

You can call make(..., timeout = 10) to time out each target after 10 seconds. However, timeouts rely on R.utils::withTimeout(), which in turn relies on setTimeLimit(). These functions are the best that R can offer right now, but they have known issues, and timeouts may fail to take effect in certain environments.
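For example, here is a minimal sketch using the built-in mtcars example:

load_mtcars_example()       # Get the code with drake_example("mtcars").
make(my_plan, timeout = 10) # Each target must finish within 10 seconds.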

B.4 Dependencies

B.4.1 Objects that contain functions may rebuild too often

For example, an R6 class changes whenever a new R6 object of that class is created.

library(digest)
library(R6)
circle_class <- R6Class(
  "circle_class",
  private = list(radius = NULL),
  public = list(
    initialize = function(radius){
      private$radius <- radius
    },
    area = function(){
      pi * private$radius ^ 2
    }
  )
)
digest(circle_class)
## [1] "d98ba1a8435470978c8ba150fe36fe7e"
circle <- circle_class$new(radius = 5)
digest(circle_class) # circle_class changed
## [1] "f798207ca44525875bddce32e4c923ad"
rm(circle)

Ordinarily, drake overreacts to this change and builds targets repeatedly.

clean()
plan <- drake_plan(
  circle = circle_class$new(radius = 10),
  area = circle$area()
)
make(plan) # `circle_class` changes because it is referenced.
## target circle
## target area
make(plan) # Builds `circle` again because `circle_class` changed.
## target circle

The solution is to define your R6 class inside a function. drake does the right thing when it comes to tracking changes to functions.

clean()
new_circle <- function(radius){
  circle_class <- R6Class(
    "circle_class",
    private = list(radius = NULL),
    public = list(
      initialize = function(radius){
        private$radius <- radius
      },
      area = function(){
        pi * private$radius ^ 2
      }
    )
  )
  circle_class$new(radius = radius)
}
plan <- drake_plan(
  circle = new_circle(radius = 10),
  area = circle$area()
)
make(plan)
## target circle
## target area
make(plan)
## All targets are already up to date.

B.4.2 The magrittr dot (.) is always ignored.

It is common practice to use a literal dot (.) to carry an object through a magrittr pipeline. Due to some tricky limitations in static code analysis, drake never treats the dot (.) as a dependency, even if you use it as an ordinary variable outside of a tidyverse context.

deps_code("sqrt(x + y + .)")
## $globals
## [1] "sqrt" "x"    "y"
deps_code("dplyr::filter(complete.cases(.))")
## $namespaced
## [1] "dplyr::filter"
## 
## $globals
## [1] "complete.cases"

B.4.3 Dependencies are not tracked in some edge cases.

You should explicitly check which items drake tracks in your workflow and what it treats as the dependencies of your targets.

?deps
?tracked
?vis_drake_graph

drake can be fooled into skipping objects that should be treated as dependencies. For example:

f <- function(){
  b <- get("x", envir = globalenv()) # x is incorrectly ignored
  digest::digest(file_dependency)
}

deps_code(f)
## $globals
## [1] "get"             "globalenv"       "file_dependency"
## 
## $namespaced
## [1] "digest::digest"

command <- "x <- digest::digest(file_in(\"input_file.rds\")); assign(\"x\", 1); x" # nolint
deps_code(command)
## $globals
## [1] "assign"
## 
## $namespaced
## [1] "digest::digest"
## 
## $file_in
## [1] "\"input_file.rds\""

drake takes special precautions so that a target/import does not depend on itself. For example, deps_code(f) might return "f" if f() is a recursive function, but make() just ignores this conflict and runs as expected. In other words, make() automatically removes all self-referential loops in the dependency network.

B.4.4 Dependencies of knitr reports

If you have knitr reports, you can use knitr_in() in your commands so that your reports are refreshed every time one of their dependencies changes. See drake_example("mtcars") and the examples in the ?knitr_in help file for demonstrations. Dependencies are detected if you call loadd() or readd() in your code chunks. But beware: an empty call to loadd() does not account for any dependencies even though it loads all the available targets into your R session.
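Here is a minimal sketch, assuming report.Rmd reads specific targets with loadd() or readd() in its code chunks:

plan <- drake_plan(
  report = knitr::knit(knitr_in("report.Rmd"), file_out("report.md"), quiet = TRUE)
)
make(plan) # The report rebuilds when report.Rmd or the targets it loads change.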

B.4.5 File outputs in imported functions.

Do not call file_out() inside imported functions that you write. Only targets in your workflow plan data frame should have file outputs.

## totally_okay() will depend on the imported data.csv file.
## But make sure data.csv is an imported file and not a file target.
totally_okay <- function(x, y, z){
  read.csv(file_in("data.csv"))
}

## file_out() is for file targets,
## so `drake` will ignore it in imported functions.
avoid_this <- function(x, y, z){
  read.csv(file_out("data.csv"))
}

B.4.6 Functions produced by Vectorize()

With functions produced by Vectorize(), detecting dependencies is especially hard because the body of every such function is

args <- lapply(as.list(match.call())[-1L], eval, parent.frame())
names <- if (is.null(names(args)))
    character(length(args)) else names(args)
dovec <- names %in% vectorize.args
do.call("mapply", c(FUN = FUN, args[dovec], MoreArgs = list(args[!dovec]),
    SIMPLIFY = SIMPLIFY, USE.NAMES = USE.NAMES))

Thus, if f is constructed with Vectorize(g, ...), drake searches g() for dependencies, not f(). More precisely, whenever drake sees that environment(f)[["FUN"]] exists and is a function, it analyzes and reproducibly tracks environment(f)[["FUN"]] rather than f() itself. Consequently, if the configuration settings of vectorization change (such as which arguments are vectorized) but the core element-wise functionality stays the same, make() will not react. Also, if you hover over the f node in vis_drake_graph(hover = TRUE), you will see the body of environment(f)[["FUN"]], not the body of f().
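To see what drake actually analyzes, consider this small sketch (my_constant is a hypothetical import):

g <- function(x){
  sqrt(x) + my_constant # The element-wise logic drake inspects for dependencies.
}
f <- Vectorize(g, vectorize.args = "x")
identical(environment(f)[["FUN"]], g) # TRUE: drake tracks g(), not f().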

B.4.7 Compiled code is not reproducibly tracked.

Some R functions use .Call() to run compiled code in the backend. The R code in these functions is tracked, but not the compiled object called with .Call(), nor its C/C++/Fortran source.

B.4.8 Directories (folders) are not reproducibly tracked.

In your workflow plan, you can use file_in(), file_out(), and knitr_in() to assert that some targets/imports are external files. However, entire directories (i.e. folders) cannot be reproducibly tracked this way. Please see issue 12 for a discussion.
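A possible workaround, just a sketch with hypothetical file names and a hypothetical summarize_files() function, is to declare the individual files inside the folder rather than the folder itself:

plan <- drake_plan(
  summary = summarize_files(
    file_in("data/file1.csv"), # Track the individual files...
    file_in("data/file2.csv")  # ...instead of the whole data/ folder.
  )
)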

B.4.9 Packages are not tracked as dependencies.

drake may import functions from packages, but the packages themselves are not tracked as dependencies. For this, you will need other tools that support reproducibility beyond the scope of drake. Packrat creates a tightly-controlled local library of packages to extend the shelf life of your project. And with Docker, you can execute your project on a virtual machine to ensure platform independence. Together, packrat and Docker can help others reproduce your work even if they have different software and hardware.

B.5 High-performance computing

B.5.1 The practical utility of parallel computing

drake claims that it can

  1. Build and cache your targets in parallel (in stages).
  2. Build and cache your targets in the correct order, finishing dependencies before starting targets that depend on them.
  3. Deploy your targets to the parallel backend of your choice.

However, the practical efficiency of the parallel computing functionality remains to be verified rigorously; serious performance studies are left for future work. In addition, each project has its own best parallel computing setup, and you will need to optimize it on a case-by-case basis. Some general considerations include the following.

  • The high overhead and high scalability of distributed computing versus the low overhead and low scalability of local multicore computing.
  • The high memory usage of local multicore computing, especially "mclapply" parallelism, as opposed to distributed computing, which can spread the memory demands over the available nodes on a cluster.
  • The marginal gains of increasing the number of jobs indefinitely, especially in the case of local multicore computing if the number of cores is low.

B.5.2 Maximum number of simultaneous jobs

Be mindful of the maximum number of simultaneous parallel jobs you deploy. At best, too many jobs is poor etiquette on a system with many users and limited resources. At worst, too many jobs will crash a system. The jobs argument to make() sets the maximum number of simultaneous jobs in most cases, but not all.

For most of drake’s parallel backends, jobs sets the maximum number of simultaneous parallel jobs. However, there are ways to break the pattern. For example, make(..., parallelism = "Makefile", jobs = 2, args = "--jobs=4") uses at most 2 jobs for the imports and at most 4 jobs for the targets. (In make(), args overrides jobs for the targets.) For make(..., parallelism = "future_lapply_staged"), the jobs argument is ignored altogether. Instead, you should set the workers argument where it is available (for example, future::plan(multisession(workers = 2)) or future::plan(future.batchtools::batchtools_local(workers = 2))) in the preparations before make(parallelism = "future_lapply_staged"). Alternatively, you might limit the maximum number of jobs by setting options(mc.cores = 2) before calling make(parallelism = "future_lapply_staged"). Depending on the future backend you select with future::plan(), you might make use of one of the other environment variables listed in ?future::future.options.
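For example, here is a sketch of capping the workers for "future_lapply_staged" parallelism (assuming my_plan already exists):

library(future)
future::plan(future::multisession, workers = 2) # Cap the workers here, not with jobs.
make(my_plan, parallelism = "future_lapply_staged")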

B.5.3 Parallel computing on Windows

On Windows, do not use make(..., parallelism = "mclapply", jobs = n) with n greater than 1. You could try, but jobs will just be demoted to 1. Instead, please replace "mclapply" with one of the other parallelism_choices() or let drake choose the parallelism backend for you. For make(..., parallelism = "Makefile"), Windows users need to download and install Rtools.

B.5.4 Configuring future/batchtools parallelism for clusters

The "future_lapply" backend unlocks a large array of distributed computing options on serious computing clusters. However, it is your responsibility to configure your workflow for your specific job scheduler. In particular, special batchtools *.tmpl configuration files are required, and the technique is described in the documentation of batchtools. You can find some examples of these files in the inst/templates folders of the batchtools and future.batchtools GitHub repositories. drake has some prepackaged example workflows. See drake_examples() to view your options, and then drake_example() to write the files for an example.

drake_example("sge")    # Sun/Univa Grid Engine workflow and supporting files
drake_example("slurm")  # SLURM
drake_example("torque") # TORQUE

To write just *.tmpl files from these examples, see the drake_batchtools_tmpl_file() function.

Unfortunately, there is no one-size-fits-all *.tmpl configuration file for any job scheduler, so we cannot guarantee that the above examples will work for you out of the box. To learn how to configure the files to suit your needs, you should make sure you understand your job scheduler and batchtools.

B.5.5 Proper Makefiles are not standalone.

The Makefile generated by make(myplan, parallelism = "Makefile") is not standalone. Do not run it outside of drake::make(). drake uses dummy timestamp files to tell the Makefile what to do, and running make in the terminal will most likely give incorrect results.

B.5.6 Makefile-level parallelism for imported objects and files

Makefile-level parallelism is only used for targets in your workflow plan data frame, not imports. To process imported objects and files, drake selects the best local parallel backend for your system and uses the jobs argument to make(). To use at most 2 jobs for imports and at most 4 jobs for targets, run

make(..., parallelism = "Makefile", jobs = 2, args = "--jobs=4")

B.5.7 Zombie processes

Some parallel backends, particularly mclapply and future::multicore, may create zombie processes. Zombie children are not usually harmful, but you may wish to kill them yourself. The following function by Carl Boneri should work on Unix-like systems. For a discussion, see drake issue 116.

fork_kill_zombies <- function(){
  require(inline)
  includes <- "#include <sys/wait.h>"
  code <- "int wstat; while (waitpid(-1, &wstat, WNOHANG) > 0) {};"

  wait <- inline::cfunction(
    body = code,
    includes = includes,
    convention = ".C"
  )

  invisible(wait())
}
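Usage is straightforward: call it after a make() that used forked parallelism.

make(my_plan, parallelism = "mclapply", jobs = 2)
fork_kill_zombies() # Reap any leftover zombie child processes.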

B.6 Storage

B.6.1 Projects hosted on Dropbox and similar platforms

If you download a drake project from Dropbox, you may get an error like the one in issue 198:


cache pathto/.drake
connect 61 imports: ...
connect 200 targets: ...
Error in rawToChar(as.raw(x)) : 
  embedded nul in string: 'initial_drake_version\0\0\x9a\x9d\xdc\0J\xe9\0\0\0(\x9d\xf9brם\0\xca)\0\0\xb4\xd7\0\0\0\0\xb9'
In addition: Warning message:
In rawToChar(as.raw(x)) :
  out-of-range values treated as 0 in coercion to raw

This is probably because Dropbox generates a bunch of “conflicted copy” files when file transfers do not go smoothly. This confuses storr, drake’s caching backend.


keys/config/aG9vaw (Sandy Sum's conflicted copy 2018-01-31)
keys/config/am9icw (Sandy Sum's conflicted copy 2018-01-31)
keys/config/c2VlZA (Sandy Sum's conflicted copy 2018-01-31)

Just remove these files using drake_gc() and proceed with your work.

cache <- get_cache()
drake_gc(cache)

B.6.2 Cache customization is limited

The storage guide describes how storage works in drake. As explained near the end of that chapter, you can plug custom storr caches into make(). However, non-RDS caches such as storr_dbi() may not work with most forms of parallel computing. The storr::storr_dbi() cache and many others are not thread-safe. Either

  1. Set parallelism = "clustermq_staged" in make(), or
  2. Set parallelism = "future" with caching = "master" in make(), or
  3. Use no parallel computing at all.
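Here is a sketch of option 2, plugging a storr_dbi() cache into make() with caching handled by the master process (the file and table names are arbitrary):

library(future)
future::plan(future::multisession)
con <- DBI::dbConnect(RSQLite::SQLite(), "my-cache.sqlite")
cache <- storr::storr_dbi("datatbl", "keystbl", con)
load_mtcars_example() # Get the code with drake_example("mtcars").
make(
  my_plan,
  cache = cache,
  parallelism = "future",
  caching = "master", # Only the master process writes to the DBI cache.
  jobs = 2
)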

B.6.3 Runtime predictions

In predict_runtime() and rate_limiting_times(), drake only accounts for the targets with logged build times. If some targets have not been timed, drake throws a warning and prints the untimed targets.
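For example, here is a minimal sketch (assuming a previous make() already logged build times for my_plan):

config <- drake_config(my_plan)
predict_runtime(config)     # Estimate runtime; warns about untimed targets.
rate_limiting_times(config) # See which targets dominate the runtime.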
