Chapter 4 drake plans

4.2 Plans are similar to R scripts.

And plan_to_notebook() turns plans into R notebooks.

4.3 So why do we use plans?

If you have ever waited more than 10 minutes for an R script to finish, then you know the frustration of having to rerun the whole thing every time you make a change. Plans make life easier.

4.3.1 Plans chop up the work into pieces.

Some targets may need an update while others may not. In the walkthrough, make() was smart enough to skip the data cleaning step and just rebuild the plot and report. drake and its plans compartmentalize the work, and this can save you from wasted effort in the long run.

4.3.2 drake uses plans to schedule your work.

make() automatically learns the build order of your targets and how to run them in parallel. The underlying magic is static code analysis, which automatically detects the dependencies of each target without having to run its command.
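You can see this static analysis in action with deps_code(), which reports the dependencies drake detects in a command without running it. In this sketch, analyze(), data, and mean stand in for your own functions and targets:

```r
library(drake)
# Static code analysis: list what a command depends on without executing it.
# analyze(), data, and mean are placeholders for your own functions and targets.
deps_code(quote(analyze(data, mean)))
```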

Because of the dependency relationships, row order does not matter once the plan is fully defined. The following plan declares file before plot.

But file actually depends on plot.

So make() builds plot first.
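A minimal sketch of that situation, with create_plot() as a placeholder for your own plotting function (ggsave() is from ggplot2, and file_out() is how drake tracks output files):

```r
library(drake)
library(ggplot2)
plan <- drake_plan(
  # Declared first, but built second: file depends on plot.
  file = ggsave(file_out("plot.png"), plot),
  plot = create_plot(mtcars)  # placeholder plotting function
)
```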

4.4 Special custom columns in your plan

You can add other columns besides the required target and command.

Within drake_plan(), target() lets you create any custom column except target, command, and transform, the last of which has a special meaning.

The following columns have special meanings for make().

  • format: set a storage format to save big targets more efficiently. Most formats are faster than ordinary storage, and they consume far less memory. Available formats:
    • "fst": save big data frames fast. Requirements:
      1. The fst package must be installed.
      2. The target’s value must be a plain data frame. If it is not a plain data frame (for example, a tibble or data.table) then drake will coerce it to a plain data frame. All non-data-frame-specific attributes are lost when drake saves the target.
    • "fst_dt": like the "fst" format, but for data.table objects. Requirements:
      1. The data.table and fst packages must be installed.
      2. The target’s value must be a data.table object. If it is not a data.table object (for example, a data frame or tibble) then drake will coerce it to a data.table object. All non-data-table-specific attributes are lost when drake saves the target.
    • "keras": save Keras models as HDF5 files. Requires the keras package.
    • "rds": save any object. This is similar to the default storage except that drake avoids creating a serialized copy of the entire target in memory. Requires R >= 3.5.0 so drake can use ALTREP.
  • trigger: rule to decide whether a target needs to run. See the trigger chapter to learn more.
  • elapsed and cpu: number of seconds to wait for the target to build before timing out (elapsed for elapsed time and cpu for CPU time).
  • hpc: logical values (TRUE/FALSE/NA) indicating whether to send each target to parallel workers. See the high-performance computing chapter to learn more.
  • resources: target-specific lists of resources for a computing cluster. See the advanced options in the parallel computing chapter for details.
  • caching: overrides the caching argument of make() for each target individually. Only supported in recent versions of drake. Possible values:
    • “master”: tell the master process to store the target in the cache.
    • “worker”: tell the HPC worker to store the target in the cache.
    • NA: default to the caching argument of make().
  • retries: number of times to retry building a target in the event of an error.
  • seed: pseudo-random number generator (RNG) seed for each target. drake usually computes its own unique reproducible target-specific seeds using the target name and the global seed (the seed argument of make() and drake_config()). Any non-missing seeds in the seed column override drake’s default target seeds.
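To tie these columns together, here is a hedged sketch of a plan that sets a few of them with target(). get_big_data() is a placeholder for your own data-loading function:

```r
library(drake)
plan <- drake_plan(
  big_data = target(
    get_big_data(),  # placeholder data-loading function
    format = "fst",  # store the data frame with the fst package
    retries = 2,     # retry up to 2 times if the build errors out
    elapsed = 600    # time out after 600 seconds of elapsed time
  )
)
```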

4.5 Large plans

drake version 7.0.0 introduced new syntax to make it easier to create plans. To try it out before the next CRAN release, install the current development version from GitHub.

4.5.1 How to create large plans

Ordinarily, drake_plan() requires you to write out all the targets one-by-one. This is a literal pain.

Transformations reduce typing, especially when combined with tidy evaluation (!!).
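For example, a sketch of map() combined with tidy evaluation, where analyze() is a placeholder for your own function:

```r
library(drake)
means <- c(1, 2, 3)
plan <- drake_plan(
  analysis = target(
    analyze(mean),                   # placeholder analysis function
    transform = map(mean = !!means)  # !! splices the whole vector in
  )
)
```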

Behind the scenes during a transformation, drake_plan() creates new columns to track what is happening. You can see them with trace = TRUE.

Because of those columns, you can chain transformations together in complex pipelines.

plan1 <- drake_plan(
  small = get_small_data(),
  large = get_large_data(),
  # Analyze each dataset once with a different mean.
  analysis = target(
    analyze(data, mean),
    transform = map(data = c(small, large), mean = c(1, 2))
  ),
  # Calculate 2 different performance metrics on every model fit.
  metric = target(
    # mse = mean squared error, mae = mean absolute error.
    # Assume these are functions you write.
    metric_fun(analysis),
    transform = cross(metric_fun = c(mse, mae), analysis)
  ),
  # Summarize the performance metrics for each dataset.
  summ_data = target(
    summary(metric),
    transform = combine(metric, .by = data)
  ),
  # Same, but for each metric type.
  summ_metric = target(
    summary(metric),
    transform = combine(metric, .by = metric_fun)
  )
)

plan1
#> # A tibble: 12 x 2
#>    target                 command                                          
#>    <chr>                  <expr>                                           
#>  1 small                  get_small_data()                                …
#>  2 large                  get_large_data()                                …
#>  3 analysis_small_1       analyze(small, 1)                               …
#>  4 analysis_large_2       analyze(large, 2)                               …
#>  5 metric_mse_analysis_s… mse(analysis_small_1)                           …
#>  6 metric_mae_analysis_s… mae(analysis_small_1)                           …
#>  7 metric_mse_analysis_l… mse(analysis_large_2)                           …
#>  8 metric_mae_analysis_l… mae(analysis_large_2)                           …
#>  9 summ_data_large        summary(metric_mse_analysis_large_2, metric_mae_…
#> 10 summ_data_small        summary(metric_mse_analysis_small_1, metric_mae_…
#> 11 summ_metric_mae        summary(metric_mae_analysis_small_1, metric_mae_…
#> 12 summ_metric_mse        summary(metric_mse_analysis_small_1, metric_mse_…

config1 <- drake_config(plan1)

And you can write the transformations in any order. The following plan is equivalent to plan1 despite the rearranged rows.

plan2 <- drake_plan(
  # Same, but for each metric type.
  summ_metric = target(
    summary(metric),
    transform = combine(metric, .by = metric_fun)
  ),
  # Calculate 2 different performance metrics on every model fit.
  metric = target(
    # mse = mean squared error, mae = mean absolute error.
    # Assume these are functions you write.
    metric_fun(analysis),
    transform = cross(metric_fun = c(mse, mae), analysis)
  ),
  small = get_small_data(),
  # Analyze each dataset once with a different mean.
  analysis = target(
    analyze(data, mean),
    transform = map(data = c(small, large), mean = c(1, 2))
  ),
  # Summarize the performance metrics for each dataset.
  summ_data = target(
    summary(metric),
    transform = combine(metric, .by = data)
  ),
  large = get_large_data()
)

plan2

#> # A tibble: 12 x 2
#>    target                 command                                          
#>    <chr>                  <expr>                                           
#>  1 summ_metric_mae        summary(metric_mae_analysis_small_1, metric_mae_…
#>  2 summ_metric_mse        summary(metric_mse_analysis_small_1, metric_mse_…
#>  3 metric_mse_analysis_s… mse(analysis_small_1)                           …
#>  4 metric_mae_analysis_s… mae(analysis_small_1)                           …
#>  5 metric_mse_analysis_l… mse(analysis_large_2)                           …
#>  6 metric_mae_analysis_l… mae(analysis_large_2)                           …
#>  7 small                  get_small_data()                                …
#>  8 analysis_small_1       analyze(small, 1)                               …
#>  9 analysis_large_2       analyze(large, 2)                               …
#> 10 summ_data_large        summary(metric_mse_analysis_large_2, metric_mae_…
#> 11 summ_data_small        summary(metric_mse_analysis_small_1, metric_mae_…
#> 12 large                  get_large_data()                                …

config2 <- drake_config(plan2)

4.5.2 Start small

Some plans are too large to deploy right away.

It is extremely difficult to understand, visualize, test, and debug a workflow with so many targets. make(), drake_config(), and vis_drake_graph() simply take too long if you are not ready for production.

To speed up initial testing and experimentation, you may want to limit the number of extra targets created by the map() and cross() transformations. Simply set max_expand in drake_plan().
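For example, assuming analyze() and the data targets are your own, max_expand = 2 keeps only the first couple of sub-targets from each transformation:

```r
library(drake)
plan <- drake_plan(
  max_expand = 2,   # cap the number of sub-targets per transformation
  analysis = target(
    analyze(data),  # placeholder analysis function
    transform = map(data = c(data1, data2, data3, data4))
  )
)
```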

With a downsized plan, we can inspect the graph to make sure the dependencies line up correctly.

We can even run make(plan) and check a strategic subset of targets.

When we are ready to scale back up, we simply remove max_expand from the call to drake_plan(). Nothing else needs to change.

4.5.3 The types of transformations

drake supports four types of transformations: map(), cross(), split() (unsupported in drake <= 7.3.0), and combine(). These are not actual functions, but you can treat them as functions when you use them in drake_plan(). Each transformation takes after a function from the Tidyverse.

drake       Tidyverse analogue
map()       pmap() from purrr
cross()     crossing() from tidyr
split()     group_map() from dplyr
combine()   summarize() from dplyr

Special considerations in map()

map() column-binds variables together to create a grid. The lengths of those variables need to be conformable just as with data.frame().

Sometimes, the results are sensible when grouping variable lengths are multiples of each other, but be careful.
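A sketch with equal-length grouping variables, where f() is a placeholder function:

```r
library(drake)
plan <- drake_plan(
  x = target(
    f(a, b),  # placeholder function
    # a and b line up element-wise, like columns of a data frame:
    # the commands become f(1, 3) and f(2, 4).
    transform = map(a = c(1, 2), b = c(3, 4))
  )
)
```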

Things get trickier when drake reuses grouping variables from previous transformations. For example, below, each x_* target has an associated nrow value. So if you write transform = map(x), then nrow goes along for the ride.

But if other targets have center values of their own, drake_plan() may not know what to do with them.

The problem is that there are 4 values of center and only two x_* targets (and two y_* targets). Even if you explicitly supply center to the transformation, map() can only take the first two values.

So please inspect the plan before you run it with make(). Once you have a drake_config() object, vis_drake_graph() and deps_target() can help.

split()

split() is not supported in drake 7.3.0 and below. It should reach the next CRAN release in June 2019.

The split() transformation distributes a dataset as uniformly as possible across multiple targets.

Here, drake_slice() takes a single subset of the data at runtime.
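A sketch of the idea, assuming get_data() and analyze() are your own functions:

```r
library(drake)
plan <- drake_plan(
  data = get_data(),  # placeholder data-loading function
  analysis = target(
    analyze(data),
    # Each sub-target analyzes one of 4 slices, e.g.
    # analyze(drake_slice(data, slices = 4, index = 1)).
    transform = split(data, slices = 4)
  )
)
```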

drake_slice() supports data frames, matrices, and arbitrary arrays, and you can subset on any margin (rows, columns, etc.). Even better, you can split up ordinary vectors and lists. Instead of taking slices of the actual dataset, you can split up a set of indices. Combined with high-performance computing, this should help you avoid loading an entire big data file into memory on a single compute node.

combine()

In combine(), you can insert multiple targets into individual commands. The closest comparison is the unquote-splice operator !!! from the Tidyverse.

You can combine different groups of targets in the same command.

And as with group_by() from dplyr, you can create a separate aggregate for each combination of levels of the arguments. Just pass a symbol or vector of symbols to the optional .by argument of combine().
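A sketch of .by, with make_data() as a placeholder function:

```r
library(drake)
plan <- drake_plan(
  data = target(
    make_data(x, y),  # placeholder function
    transform = map(x = c(1, 2), y = c(3, 4))
  ),
  summ = target(
    rbind(data),  # combine() splices the matching data targets in here
    transform = combine(data, .by = x)  # one aggregate per value of x
  )
)
```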

In your post-processing, you may need the values of x and y that underlie data_1_3 and data_2_4. Solution: get the trace and the target names. We define a new plan

and a new function

4.5.4 Grouping variables

A grouping variable is an argument to map(), cross(), or combine() that identifies a sub-collection of target names. Grouping variables can be either literals or symbols. Symbols can be scalars or vectors, and you can pass them to transformations with or without argument names.

Literal arguments

When you pass a grouping variable of literals, you must use an explicit argument name. One does not simply write map(c(1, 2)).

And if you supply integer sequences the usual way, you may notice some rows are missing.

Tidy evaluation and as.numeric() make sure all the data points show up.
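A sketch of the fix, where f() is a placeholder function:

```r
library(drake)
plan <- drake_plan(
  x = target(
    f(n),  # placeholder function
    # !!as.numeric() splices the full sequence in as numeric literals,
    # so no data points are dropped.
    transform = map(n = !!as.numeric(seq_len(4)))
  )
)
```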

Character vectors usually work without a hitch, and quotes are converted into dots to make valid target names.

Named symbol arguments

Symbols passed with explicit argument names define new groupings of existing targets on the fly, and only the map() and cross() transformations can accept them this way. To generate long symbol lists, use the syms() function from the rlang package. Remember to use the tidy evaluation operator !! inside the transformation.
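For instance, assuming summarize_data() and the data-loading functions are your own:

```r
library(drake)
library(rlang)
datasets <- syms(c("small", "large"))  # symbols naming existing targets
plan <- drake_plan(
  small = get_small_data(),  # placeholder
  large = get_large_data(),  # placeholder
  summ = target(
    summarize_data(data),    # placeholder summary function
    transform = map(data = !!datasets)
  )
)
```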

The new groupings carry over to downstream targets by default, which you can see with trace = TRUE. Below, the rows for targets w_x and w_y have entries in the z column.

However, this is incorrect because w does not depend on z_x or z_y. So for w, you should write map(val = c(x, y)) instead of map(val) to tell drake to clear the trace. Then, you will see NAs in the z column for w_x and w_y, which is right and proper.

4.5.5 Tags

Tags are special optional grouping variables. They are ignored while the transformation is happening and then added to the plan to help subsequent transformations. There are two types of tags:

  1. In-tags, which contain the target name you start with, and
  2. Out-tags, which contain the target names generated by the transformations.

Subsequent transformations can use tags as grouping variables and add to existing tags.
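A hedged sketch using an out-tag (.tag_out), with get_data() and summarize_data() as placeholder functions:

```r
library(drake)
plan <- drake_plan(
  x = target(
    get_data(index),  # placeholder function
    # Tag every generated x_* target with the grouping variable "dataset".
    transform = map(index = c(1, 2), .tag_out = dataset)
  ),
  summ = target(
    summarize_data(dataset),      # placeholder function
    transform = combine(dataset)  # the tag works as a grouping variable
  )
)
```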

4.5.6 Target names

All transformations have an optional .id argument to control the names of targets. Use it to select the grouping variables that go into the names, as well as the order they appear in the suffixes.

Set .id to FALSE to ignore the grouping variables altogether.

Finally, drake supports a special .id_chr symbol in commands to let you refer to the name of the current target as a character string.
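A sketch combining .id and .id_chr, where f() and save_result() are placeholder functions:

```r
library(drake)
plan <- drake_plan(
  x = target(
    # .id_chr becomes the current target's own name as a character string.
    save_result(.id_chr, f(a, b)),  # placeholder functions
    # .id = a: name suffixes use only the values of a, e.g. x_1 and x_2.
    transform = map(a = c(1, 2), b = c(3, 4), .id = a)
  )
)
```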

  1. Only supported in drake version 7.5.0 and above.

  2. Only supported in drake version 7.4.0 and above.

Copyright Eli Lilly and Company