Chapter 5 drake projects

drake’s design philosophy is extremely R-focused. It embraces in-memory configuration, in-memory dependencies, interactivity, and flexibility.

5.1 Code files

The names and locations of the files are entirely up to you, but this pattern is particularly useful to start with.

make.R
R/
├── packages.R
├── functions.R
└── plan.R

Here, make.R is a master script that

  1. Loads your packages, functions, and other in-memory data.
  2. Creates the drake plan.
  3. Calls make().

Let’s consider the main example, which you can download with drake_example("main"). Here, our master script is called make.R:
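A minimal sketch of such a master script, assuming the three supporting files from the layout above and a plan object named plan (the exact contents of the example's make.R may differ):

```r
# make.R: sketch of a master script (assumed contents).
source("R/packages.R")   # Load packages, e.g. library(drake).
source("R/functions.R")  # Define custom functions such as create_plot().
source("R/plan.R")       # Create the drake plan and store it in `plan`.
make(plan)               # Build the targets that are missing or out of date.
```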

We have an R folder containing our supporting files, including packages.R:
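A plausible packages.R simply attaches the packages the workflow uses. The package list in this sketch is an assumption; only drake itself is certain to be needed.

```r
# R/packages.R: sketch (the package list is an assumption).
library(drake)    # Workflow engine: drake_plan(), make(), ...
library(dplyr)    # Data manipulation used in the plan's commands.
library(ggplot2)  # Plotting used by the custom functions.
```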

functions.R:
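A sketch of functions.R, assuming that create_plot() (mentioned later in this chapter) draws a histogram with ggplot2. The column names and plot settings are illustrative assumptions.

```r
# R/functions.R: sketch (function body is an assumption).
# ggplot2 would normally be attached by R/packages.R; it is attached here
# only so that the snippet runs on its own.
library(ggplot2)

# Draw a histogram of petal widths, colored by species.
create_plot <- function(data) {
  ggplot(data, aes(x = Petal.Width, fill = Species)) +
    geom_histogram(binwidth = 0.25) +
    theme_gray(20)
}

# Try it on the built-in iris data, which has both columns.
p <- create_plot(iris)
```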

and plan.R:
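A sketch of plan.R. The targets hist and report and the function create_plot() are named later in this chapter; the remaining targets, file names, and commands are assumptions. Note that drake_plan() only records the commands, so nothing is read or rendered until make() runs.

```r
# R/plan.R: sketch (targets other than hist and report are assumptions).
# drake would normally be attached by R/packages.R.
library(drake)

plan <- drake_plan(
  # file_in() declares an input file that drake should track.
  raw_data = readxl::read_excel(file_in("raw_data.xlsx")),
  data = dplyr::mutate(raw_data, Species = forcats::fct_inorder(Species)),
  hist = create_plot(data),
  fit = lm(Sepal.Width ~ Petal.Width + Species, data),
  # knitr_in() and file_out() declare the report source and its output file.
  report = rmarkdown::render(
    knitr_in("report.Rmd"),
    output_file = file_out("report.html"),
    quiet = TRUE
  )
)
```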

To run the example project above,

  1. Start a clean new R session.
  2. Run the make.R script.

On Mac and Linux, you can do this by opening a terminal and entering R CMD BATCH --no-save make.R. On Windows, restart your R session and call source("make.R") in the R console.

Note: this part of drake does not depend on how you name or arrange your script files. There is nothing magical about the names make.R, packages.R, functions.R, or plan.R. Different projects may require different file structures.

drake has other functions to inspect your results and examine your workflow. Before invoking them interactively, it is best to start with a clean new R session.
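As a rough illustration of that kind of inspection, the sketch below builds a tiny stand-in plan in a throwaway cache and then queries it. The plan, its targets, and the temporary cache are assumptions made so that the snippet does not depend on the main example's files.

```r
library(drake)

# Tiny stand-in plan with hypothetical targets.
plan <- drake_plan(
  data = head(mtcars, 20),
  fit = lm(mpg ~ wt, data = data)
)

# Use a throwaway cache so the demonstration stays out of a real project.
cache <- new_cache(path = tempfile())
make(plan, cache = cache)

config <- drake_config(plan, cache = cache)
stale <- outdated(config)            # Targets the next make() would rebuild.
model <- readd(fit, cache = cache)   # Stored value of the target `fit`.
# vis_drake_graph(config)            # Interactive dependency graph.
```

In a real project you would omit the cache argument and let drake use the default cache in the project directory.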

5.2 Safer interactivity

5.2.1 Motivation

A serious drake workflow should be consistent and reliable, ideally with the help of a master R script. Before it builds your targets, this script should begin in a fresh R session and load your packages and functions in a dependable manner. Batch mode makes sure all this goes according to plan.

If you use a single persistent interactive R session to repeatedly invoke make() while you develop the workflow, then over time, your session could grow stale and accidentally invalidate targets. For example, if you interactively tinker with a new version of create_plot(), targets hist and report will fall out of date without warning, and the next make() will build them again. Even worse, the outputs from hist and report will be wrong if they depend on a half-finished create_plot().

The quickest workaround is to restart R and source() your setup scripts all over again. However, a better solution is to use r_make() and friends. r_make() runs make() in a new transient R session so that accidental changes to your interactive environment do not break your workflow.

5.2.2 Usage

To use r_make(), you need a configuration R script. Unless you supply a custom file path (e.g. r_make(source = "your_file.R") or options(drake_source = "your_file.R")), drake assumes this configuration script is called _drake.R, so the file name really is magical in this case. The suggested file structure becomes:

_drake.R
R/
├── packages.R
├── functions.R
└── plan.R

Like our previous make.R script, _drake.R runs all our pre-make() setup steps. But this time, rather than calling make(), it ends with a call to drake_config(). Example _drake.R:
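A minimal sketch of such a _drake.R, assuming the same three supporting files as before and a plan object named plan:

```r
# _drake.R: sketch of a configuration script for r_make() (assumed contents).
source("R/packages.R")   # Load packages.
source("R/functions.R")  # Define custom functions.
source("R/plan.R")       # Create the plan and store it in `plan`.
drake_config(plan)       # End with drake_config() instead of make().
```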

Here is what happens when you call r_make().

  1. drake launches a new transient R session using callr::r(). The remaining steps all happen within this transient session.
  2. It runs the configuration script (e.g. _drake.R) to
    1. load the packages, functions, global options, drake plan, etc. into the session’s environment, and
    2. run the call to drake_config() and store the results in a variable called config.
  3. It executes make(config = config).
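The steps above can be approximated in a few lines. This is only a sketch of the idea, not drake's actual implementation; r_make_sketch() is a hypothetical helper name.

```r
# Hypothetical helper that mimics the steps of r_make().
r_make_sketch <- function(source = "_drake.R") {
  callr::r(
    func = function(source) {
      # Steps 2.1 and 2.2: run the configuration script in the child session.
      # base::source() returns a list whose `value` element is the result of
      # the script's last expression, here the drake_config() call.
      config <- base::source(source)$value
      # Step 3: build the targets.
      drake::make(config = config)
    },
    args = list(source = source),
    show = TRUE  # Stream the child session's output to the console.
  )
}
```

Because the function passed to callr::r() runs in a separate process, it refers to base::source() and drake::make() explicitly rather than relying on packages attached in the calling session.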

The purpose of drake_config() is to collect and sanitize all the parameters and settings that make() needs to do its job. In fact, if you do not set the config argument explicitly, then make() invokes drake_config() behind the scenes. make(plan, parallelism = "clustermq", jobs = 2, verbose = 6) is equivalent to
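Spelled out, the two-step form passes the same arguments to drake_config() and then hands the result to make(). The fragment assumes that plan already exists and that the clustermq package is installed.

```r
# Assumes `plan` is an existing drake plan and clustermq is available.
config <- drake_config(plan, parallelism = "clustermq", jobs = 2, verbose = 6)
make(config = config)
```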

There are many more r_*() functions besides r_make(), each of which launches a fresh session and runs an inner drake function on the config object from _drake.R.

Outer function call          Inner function call
r_make()                     make(config = config)
r_drake_build(...)           drake_build(config, ...)
r_outdated(...)              outdated(config, ...)
r_missed(...)                missed(config, ...)
r_vis_drake_graph(...)       vis_drake_graph(config, ...)
r_sankey_drake_graph(...)    sankey_drake_graph(config, ...)
r_drake_ggraph(...)          drake_ggraph(config, ...)
r_drake_graph_info(...)      drake_graph_info(config, ...)
r_predict_runtime(...)       predict_runtime(config, ...)
r_predict_workers(...)       predict_workers(config, ...)

Remarks:

  • You can run r_make() in an interactive session, but the transient process it launches will not be interactive. Thus, any browser() statements in the commands in your drake plan will be ignored.
  • You can select and configure the underlying callr function using arguments r_fn and r_args, respectively.
  • For example code, you can download the updated main example (drake_example("main")) and experiment with files _drake.R and interactive.R.
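A short usage sketch of the r_*() functions, assuming a _drake.R file exists in the working directory. The log file name is an arbitrary choice.

```r
library(drake)

# Each call starts a fresh R session and reads _drake.R.
r_outdated()   # Which targets would the next build update?
r_make()       # Build them.

# Select a different callr function and pass arguments to it:
# callr::r_bg() runs the job in a background process, and `stdout`
# redirects that process's output to a file.
r_make(r_fn = callr::r_bg, r_args = list(stdout = "stdout.log"))
```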

5.3 Script file pitfalls

Despite the above discussion of R scripts, drake plans rely more on in-memory functions. You might be tempted to write a plan like the following, but then drake cannot tell that my_analysis depends on my_data.
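A script-based plan of that kind might look like the sketch below. The script names and the target my_summaries are hypothetical. Each command only mentions a file, so nothing in the command for my_analysis refers to my_data.

```r
library(drake)

# Problematic: every target just sources a script (hypothetical file names).
plan <- drake_plan(
  my_data = source(file_in("get_data.R")),
  my_analysis = source(file_in("analyze_data.R")),
  my_summaries = source(file_in("summarize_data.R"))
)
```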

When it comes to plans, use functions instead.
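A function-based plan might look like this sketch. The three functions are hypothetical stand-ins; what matters is that the command for my_analysis mentions my_data, so drake can detect the dependency from the code itself.

```r
library(drake)

# Hypothetical functions; in a project these would live in R/functions.R.
get_data <- function() head(mtcars, 20)
analyze_data <- function(data) lm(mpg ~ wt, data = data)
summarize_data <- function(data, analysis) {
  list(rows = nrow(data), coefficients = coef(analysis))
}

# Each command names the upstream targets it needs.
plan <- drake_plan(
  my_data = get_data(),
  my_analysis = analyze_data(my_data),
  my_summaries = summarize_data(my_data, my_analysis)
)
```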

5.4 Workflows as R packages

The R package structure is a great way to organize and quality-control a data analysis project. If you write a drake workflow as a package, you will need to

  1. Use expose_imports() to properly account for all your nested function dependencies, and
  2. If you load the package with devtools::load_all(), set the prework argument of make(): e.g. make(prework = "devtools::load_all()").
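As a sketch of both points, assume the workflow lives in a hypothetical package called yourPackage and that plan is already defined:

```r
library(drake)

# Variant 1: the package is installed and attached.
library(yourPackage)          # Hypothetical package name.
expose_imports(yourPackage)   # Let drake track the package's functions.
make(plan)

# Variant 2: the package is loaded from source during development.
# make(plan, prework = "devtools::load_all()")
```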

For a minimal example, see Tiernan Martin’s drakepkg.

5.5 Other tools

drake enhances reproducibility, but not in all respects. Local library managers, containerization, and session management tools offer more robust solutions in their respective domains. Reproducibility encompasses a wide variety of tools and techniques all working together. Comprehensive overviews:

Copyright Eli Lilly and Company