B Cautionary notes

This chapter addresses drake’s known edge cases, pitfalls, and weaknesses that might not be fixed in future releases. For the most up-to-date information on unhandled edge cases, please visit the issue tracker, where you can also submit your own bug reports. Be sure to search the closed issues too, especially if you are not using the most up-to-date development version of drake. For help with debugging and testing drake projects, please refer to the separate guide on that topic.

B.1 Workflow plans

B.1.1 Externalizing commands in R script files

It is common practice to divide the work of a project into multiple R files, but if you do this, you will not get the most out of drake. Please see the chapter on organizing your files for more details.

B.1.2 Commands are NOT perfectly flexible.

In your drake plan (produced by drake_plan() and accepted by make()), your commands can usually be flexible R expressions.

However, please try to avoid formulas and function definitions in your commands. You may be able to get away with drake_plan(f = function(x){x + 1}) or drake_plan(f = y ~ x) in some use cases, but be careful. It is generally better to define functions and formulas in your workspace and then let make() import them. (Alternatively, use the envir argument to make() to tightly control which imported functions are available.) Use the check_plan() function to help screen and quality-control your drake plan, use tracked() to see the items that are reproducibly tracked, and use vis_drake_graph() and build_drake_graph() to see the dependency structure of your project.

B.2 Execution

B.2.1 Install drake properly.

You must properly install drake using install.packages(), devtools::install_github(), or a similar approach. Functions like devtools::load_all() are insufficient, particularly for parallel computing functionality in which separate new R sessions try to require(drake).

B.2.2 Install all your packages.

Your workflow may depend on external packages such as ggplot2, dplyr, and MASS. Such packages must be formally installed with install.packages(), devtools::install_github(), devtools::install_local(), or a similar command. If you load uninstalled packages with devtools::load_all(), results may be unpredictable and incorrect.

B.2.3 A note on tidy evaluation

Running commands in your R console is not always exactly like running them with make(). That’s because make() uses tidy evaluation as implemented in the rlang package.

## This `drake` plan uses rlang's quasiquotation operator `!!`.
my_plan <- drake_plan(list = c(
  little_b = "\"b\"",
  letter = "!!little_b"
))
#> Warning in drake_plan(list = c(little_b = "\"b\"", letter = "!!little_b")):
#> The `list` argument of `drake_plan()` is deprecated. Use the interface
#> described at https://ropenscilabs.github.io/drake-manual/plans.html#large-
#> plans.
#> Error in enexpr(expr): object 'little_b' not found

For the commands you specify in the free-form ... argument, drake_plan() also supports tidy evaluation. For example, it supports quasiquotation with the !! operator. Use tidy_evaluation = FALSE or the list argument to suppress this behavior.

For instances of !! that remain in the drake plan, make() will run these commands in tidy fashion, evaluating the !! operator using the environment you provided.

B.2.5 Refresh the drake_config() list early and often.

The master configuration list returned by drake_config() is important to drake’s internals, and you will need it for functions like outdated() and vis_drake_graph(). The config list corresponds to a single call to make(), and you should not modify it by hand afterwards. For example, modifying the targets element post-hoc will have no effect because the graph element will remain the same. It is best to just call drake_config() again.

B.2.6 Workflows as R packages.

The R package structure is a great way to organize the files of your project. Writing your own package to contain your data science workflow is a good idea, but you will need to

  1. Use expose_imports() to properly account for all your nested function dependencies, and
  2. If you load the package with devtools::load_all(), set the prework argument of make(): e.g. make(prework = "devtools::load_all()").

See the file organization chapter and ?expose_imports for detailed explanations. Thanks to Jasper Clarkberg for the workaround.

B.2.7 The lazy_load flag does not work with "parLapply" parallelism.

Ordinarily, drake prunes the execution environment at every parallelizable stage. In other words, it loads all the dependencies and unloads anything superfluous for entire batches of targets. This approach may require too much memory for some use cases, so there is an option to delay the loading of dependencies using the lazy_load argument to make() (powered by delayedAssign()). There are two major risks.

  1. make(..., lazy_load = TRUE, parallelism = "parLapply", jobs = 2) does not work. If you want to use local multisession parallelism with multiple jobs and lazy loading, try "future_lapply" parallelism instead.

  2. Delayed evaluation may cause the same dependencies to be loaded multiple times, and these duplicated loads could be slow.
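The second risk can be illustrated with delayedAssign() alone. The following is a minimal base-R sketch, not drake's implementation: load_dependency() and the counter are hypothetical stand-ins for an expensive read from the cache, and the two environments stand in for two targets that each set up their own lazy load.

```r
# Hypothetical stand-in for an expensive read from the cache.
counter <- new.env()
counter$loads <- 0
load_dependency <- function() {
  counter$loads <- counter$loads + 1
  "large cached object"
}

# Two separate environments, as if two targets each prepared a lazy load.
env_a <- new.env()
env_b <- new.env()
delayedAssign("dep", load_dependency(), assign.env = env_a)
delayedAssign("dep", load_dependency(), assign.env = env_b)

# Nothing has been loaded yet: the promises have not been forced.
loads_before <- counter$loads

# A promise is forced the first time its value is used.
value_a <- get("dep", envir = env_a)
value_a_again <- get("dep", envir = env_a) # reuses the forced value
value_b <- get("dep", envir = env_b)       # a second, duplicated load

loads_after <- counter$loads
```

In this sketch the dependency is loaded once per environment rather than once overall, which is the kind of duplication that can make lazy loading slow.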

B.2.8 Timeouts may be unreliable.

You can call make(..., timeout = 10) to time out each target after 10 seconds. However, timeouts rely on R.utils::withTimeout(), which in turn relies on setTimeLimit(). These functions are the best that R can offer right now, but they have known issues, and timeouts may fail to take effect for certain environments.
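To get a feel for the underlying mechanism, here is a minimal sketch that calls setTimeLimit() directly. It is an illustration only, not the code that drake or R.utils uses, and with_time_limit() is a hypothetical helper name.

```r
# Run `task` (a function of no arguments) under an elapsed-time limit.
with_time_limit <- function(task, seconds) {
  setTimeLimit(elapsed = seconds, transient = TRUE)
  # Always lift the limit again, whether or not the task finished.
  on.exit(setTimeLimit(cpu = Inf, elapsed = Inf, transient = FALSE))
  tryCatch(
    task(),
    error = function(e) {
      structure(list(message = conditionMessage(e)), class = "task_error")
    }
  )
}

# A quick task finishes well within its limit.
fast <- with_time_limit(function() sum(1:10), seconds = 5)

# Sys.sleep() checks for interrupts, so the limit is expected to fire here.
# Code that spends its time in compiled routines that never check for
# interrupts may run past the limit.
slow <- with_time_limit(
  function() {
    Sys.sleep(2)
    "finished"
  },
  seconds = 0.2
)
```

The limit is only checked at points where R could handle a user interrupt, which is one reason a timeout may not take effect for every kind of target.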

B.3 Dependencies

B.3.2 Dependencies are not tracked in some edge cases.

You should explicitly learn which items are in your workflow and which dependencies your targets have, for example with deps_code(), tracked(), and vis_drake_graph().

drake can be fooled into skipping objects that should be treated as dependencies. For example:
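As a rough illustration of how this can happen, the sketch below uses codetools::findGlobals() as a stand-in for static code analysis. This is an assumption made for illustration: drake's own detection (see deps_code()) may differ in detail, and the functions direct() and indirect() are hypothetical.

```r
library(codetools)

x <- 10

# `x` appears as a symbol, so static analysis can see it.
direct <- function() {
  x + 1
}

# `x` only appears inside a string, so static analysis cannot see it.
indirect <- function() {
  get("x", envir = globalenv()) + 1
}

direct_globals <- findGlobals(direct, merge = FALSE)$variables
indirect_globals <- findGlobals(indirect, merge = FALSE)$variables

direct_sees_x <- "x" %in% direct_globals
indirect_sees_x <- "x" %in% indirect_globals
```

Both functions depend on x at runtime, but only the first one reveals that dependency to a tool that reads the code without running it.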

drake takes special precautions so that a target/import does not depend on itself. For example, deps_code(f) might return "f" if f() is a recursive function, but make() just ignores this conflict and runs as expected. In other words, make() automatically removes all self-referential loops in the dependency network.

B.3.3 Dependencies of knitr reports

If you have knitr reports, you can use knitr_in() in your commands so that your reports are refreshed every time one of their dependencies changes. See drake_example("mtcars") and the ?knitr_in() help file examples for demonstrations. Dependencies are detected if you call loadd() or readd() in your code chunks. But beware: an empty call to loadd() does not account for any dependencies even though it loads all the available targets into your R session.

B.3.4 S3 and generic methods

If you reference S3 methods, only the generic method is tracked as a dependency.
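The sketch below shows the general idea with codetools::findGlobals(), used here as a stand-in for static code analysis (an assumption for illustration, since drake's own analysis may differ in detail). The function summarize_data() is hypothetical.

```r
library(codetools)

# A function that prints whatever it is given.
summarize_data <- function(d) {
  print(d)
}

# Function names that appear in the code of summarize_data().
detected <- findGlobals(summarize_data, merge = FALSE)$functions

generic_detected <- "print" %in% detected
method_detected <- "print.data.frame" %in% detected
```

Here the analysis finds the generic print(), while the method print.data.frame() never appears in the code and so is not detected.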

But print() itself is not actually very helpful. Because of S3, print.data.frame() is actually doing the work. If you were to write your own S3 system and change a specific method like print.data.frame(), changes would not be reproducibly tracked because drake only finds the generic function.

This is unavoidable because drake uses static code analysis to detect dependencies. It finds generics like print(), but it has no way of knowing in advance what method will actually be called.

B.3.6 Functions produced by Vectorize()

With functions produced by Vectorize(), detecting dependencies is especially hard because the body of every such function is the same generic wrapper around mapply(), regardless of the function being vectorized.

Thus, if f is constructed with Vectorize(g, ...), drake searches g() for dependencies, not f(). In fact, if drake sees that environment(f)[["FUN"]] exists and is a function, then environment(f)[["FUN"]] will be analyzed instead of f(). Furthermore, if f() is the output of Vectorize(), then drake reproducibly tracks environment(f)[["FUN"]] rather than f() itself. Thus, if the configuration settings of vectorization change (such as which arguments are vectorized), but the core element-wise functionality remains the same, then make() will not react. Also, if you hover over the f node in vis_drake_graph(hover = TRUE), then you will see the body of environment(f)[["FUN"]], not the body of f().
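The points above can be checked in base R. This is a small sketch with hypothetical functions g() and h(), and it only inspects the objects that Vectorize() returns. It does not call drake.

```r
g <- function(x, y) x + y
h <- function(x, y) x * y

f_g <- Vectorize(g, vectorize.args = "x")
f_h <- Vectorize(h, vectorize.args = "x")

# The wrappers share the same generic body even though g() and h() differ.
same_body <- identical(body(f_g), body(f_h))

# The original function is stored in the wrapper's enclosing environment.
inner_is_g <- identical(environment(f_g)[["FUN"]], g)

# Changing only the vectorization settings leaves FUN unchanged.
f_g_both <- Vectorize(g, vectorize.args = c("x", "y"))
inner_unchanged <- identical(environment(f_g_both)[["FUN"]], g)
```

Because the body of the wrapper says nothing about g(), the element stored as FUN is the only place where the element-wise logic can be found.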

B.3.7 Compiled code is not reproducibly tracked.

Some R functions use .Call() to run compiled code in the backend. The R code in these functions is tracked, but not the compiled object called with .Call(), nor its C/C++/Fortran source.

B.3.8 Directories (folders) are not reproducibly tracked.

In your drake plan, you can use file_in(), file_out(), and knitr_in() to assert that some targets/imports are external files. However, entire directories (i.e. folders) cannot be reproducibly tracked this way. Please see issue 12 for a discussion.

B.3.9 Packages are not tracked as dependencies.

drake may import functions from packages, but the packages themselves are not tracked as dependencies. For this, you will need other tools that support reproducibility beyond the scope of drake. Packrat creates a tightly-controlled local library of packages to extend the shelf life of your project. And with Docker, you can execute your project in a container to ensure platform independence. Together, packrat and Docker can help others reproduce your work even if they have different software and hardware.

B.4 High-performance computing

B.4.1 Calling mclapply() within targets

A workflow that calls mclapply() inside its targets can fail because make() locks your environment and mclapply() tries to add new variables to it.

But there are plenty of workarounds, including make(plan, lock_envir = FALSE) and other parallel computing functions like parLapply() or furrr::future_map(). See this comment and the ensuing discussion.
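The underlying mechanism can be reproduced in base R without drake or mclapply(). The sketch below is an illustration only: it locks a throwaway environment with lockEnvironment() and then tries to add a variable, much as code that quietly creates helper variables would.

```r
# A throwaway environment standing in for the one that make() locks.
envir <- new.env()
envir$a <- 1
lockEnvironment(envir)

# Reading existing variables still works.
current_a <- get("a", envir = envir)

# Adding a new variable to a locked environment is an error.
attempt <- tryCatch(
  {
    assign("new_variable", 1, envir = envir)
    "assigned"
  },
  error = function(e) "error"
)
```

With make(plan, lock_envir = FALSE), drake skips the locking step, which is why that workaround avoids the failure.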

B.4.2 Zombie processes

Some parallel backends, particularly make(parallelism = "future") with future::multicore, may create zombie processes. Zombie children are not usually harmful, but you may wish to kill them yourself. A function by Carl Boneri, discussed in drake issue 116, should work on Unix-like systems.

B.5 Storage

B.5.1 Projects hosted on Dropbox and similar platforms

If you download a drake project from Dropbox, you may get an error like the one in issue 198:

cache pathto/.drake
connect 61 imports: ...
connect 200 targets: ...
Error in rawToChar(as.raw(x)) : 
  embedded nul in string: 'initial_drake_version\0\0\x9a\x9d\xdc\0J\xe9\0\0\0(\x9d\xf9brם\0\xca)\0\0\xb4\xd7\0\0\0\0\xb9'
In addition: Warning message:
In rawToChar(as.raw(x)) :
  out-of-range values treated as 0 in coercion to raw

This is probably because Dropbox generates a bunch of “conflicted copy” files when file transfers do not go smoothly. This confuses storr, drake’s caching backend.

keys/config/aG9vaw (Sandy Sum's conflicted copy 2018-01-31)
keys/config/am9icw (Sandy Sum's conflicted copy 2018-01-31)
keys/config/c2VlZA (Sandy Sum's conflicted copy 2018-01-31)

Just remove these files using drake_gc() and proceed with your work.
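If you would like to see which files are affected first, a small base-R helper can list them. This is a sketch under assumptions: find_conflicted_copies() is a hypothetical helper, it assumes the files carry the "conflicted copy" marker shown above, and it only inspects the cache rather than replacing drake_gc(). The demonstration runs on a throwaway directory, not a real cache.

```r
# Hypothetical helper: list files under a cache directory whose names
# contain Dropbox's "conflicted copy" marker.
find_conflicted_copies <- function(cache_path) {
  list.files(
    cache_path,
    pattern = "conflicted copy",
    recursive = TRUE,
    full.names = TRUE
  )
}

# Throwaway directory that mimics the layout shown above.
cache_path <- file.path(tempdir(), "demo_cache")
config_dir <- file.path(cache_path, "keys", "config")
dir.create(config_dir, recursive = TRUE, showWarnings = FALSE)

good <- file.path(config_dir, "c2VlZA")
bad <- file.path(
  config_dir,
  "c2VlZA (Sandy Sum's conflicted copy 2018-01-31)"
)
created <- file.create(good, bad)

conflicted <- find_conflicted_copies(cache_path)
```

For a real project, you would pass the path to the .drake folder instead of the temporary directory.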

B.5.2 Cache customization is limited

The storage guide describes how storage works in drake. As explained near the end of that chapter, you can plug custom storr caches into make(). However, non-RDS caches such as storr_dbi() may not work with most forms of parallel computing. The storr::storr_dbi() cache and many others are not thread-safe. Either

  1. Set parallelism = "clustermq_staged" in make(), or
  2. Set parallelism = "future" with caching = "master" in make(), or
  3. Use no parallel computing at all.

B.5.3 Runtime predictions

In predict_runtime() and rate_limiting_times(), drake only accounts for the targets with logged build times. If some targets have not been timed, drake throws a warning and prints the untimed targets.

Copyright Eli Lilly and Company