Chapter 6 drake plans

6.1 What is a drake plan?

A drake plan is a data frame with columns named target and command. Each target is an R object, and each command is an expression to produce it. The drake_plan() function is the best way to set up plans. Recall the plan from our previous example:

plan <- drake_plan(
  raw_data = readxl::read_excel(file_in("raw_data.xlsx")),
  data = raw_data %>%
    mutate(Species = forcats::fct_inorder(Species)),
  hist = create_plot(data),
  fit = lm(Sepal.Width ~ Petal.Width + Species, data),
  report = rmarkdown::render(
    knitr_in("report.Rmd"),
    output_file = file_out("report.html"),
    quiet = TRUE
  )
)
plan
#> # A tibble: 5 x 2
#>   target   command                                                         
#>   <chr>    <expr>                                                          
#> 1 raw_data readxl::read_excel(file_in("raw_data.xlsx"))                   …
#> 2 data     raw_data %>% mutate(Species = forcats::fct_inorder(Species))   …
#> 3 hist     create_plot(data)                                              …
#> 4 fit      lm(Sepal.Width ~ Petal.Width + Species, data)                  …
#> 5 report   rmarkdown::render(knitr_in("report.Rmd"), output_file = file_ou…

drake_plan() does not run the workflow; it only creates the plan. To build the actual targets, we need to run make(). Creating the plan is like writing an R script, and running make(your_plan) is like calling source("your_script.R").
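The split between planning and building can be illustrated without drake. The sketch below is only a base R analogy, not how drake stores or runs plans internally; toy_plan and run_toy_plan() are hypothetical names invented for this illustration. Commands are kept as unevaluated expressions, and nothing is computed until they are evaluated explicitly.

```r
# A toy "plan": target names paired with unevaluated commands.
toy_plan <- list(
  numbers = quote(c(1, 2, 3, 4)),
  total = quote(sum(numbers))
)

# Nothing has been computed yet: each command is still a call object.
is.call(toy_plan$total)

# A hypothetical runner that evaluates the commands in the order given,
# storing each result so that later commands can use it.
run_toy_plan <- function(plan) {
  results <- new.env()
  for (target in names(plan)) {
    assign(target, eval(plan[[target]], envir = results), envir = results)
  }
  as.list(results)
}

built <- run_toy_plan(toy_plan)
built$total
```

Unlike make(), this toy runner relies on the commands already being listed in a valid order and reruns everything every time.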

6.2 Plans are similar to R scripts.

Your drake plan is like a top-level R script that runs everything from end to end. In fact, you can convert back and forth between plans and scripts using the functions plan_to_code() and code_to_plan() (with some caveats).

plan_to_code(plan, "new_script.R")
#> Loading required namespace: styler
cat(readLines("new_script.R"), sep = "\n")
#> raw_data <- readxl::read_excel(file_in("raw_data.xlsx"))
#> data <- raw_data %>% mutate(Species = forcats::fct_inorder(Species))
#> fit <- lm(Sepal.Width ~ Petal.Width + Species, data)
#> hist <- create_plot(data)
#> report <- rmarkdown::render(knitr_in("report.Rmd"),
#>   output_file = file_out("report.html"),
#>   quiet = TRUE
#> )

code_to_plan("new_script.R")
#> # A tibble: 5 x 2
#>   target   command                                                         
#>   <chr>    <expr>                                                          
#> 1 raw_data readxl::read_excel(file_in("raw_data.xlsx"))                   …
#> 2 data     raw_data %>% mutate(Species = forcats::fct_inorder(Species))   …
#> 3 fit      lm(Sepal.Width ~ Petal.Width + Species, data)                  …
#> 4 hist     create_plot(data)                                              …
#> 5 report   rmarkdown::render(knitr_in("report.Rmd"), output_file = file_ou…

And plan_to_notebook() turns plans into R notebooks.

plan_to_notebook(plan, "new_notebook.Rmd")
cat(readLines("new_notebook.Rmd"), sep = "\n")
#> ---
#> title: "My Notebook"
#> output: html_notebook
#> ---
#> 
#> ```{r my_code}
#> raw_data <- readxl::read_excel(file_in("raw_data.xlsx"))
#> data <- raw_data %>% mutate(Species = forcats::fct_inorder(Species))
#> fit <- lm(Sepal.Width ~ Petal.Width + Species, data)
#> hist <- create_plot(data)
#> report <- rmarkdown::render(knitr_in("report.Rmd"),
#>   output_file = file_out("report.html"),
#>   quiet = TRUE
#> )
#> ```

code_to_plan("new_notebook.Rmd")
#> # A tibble: 5 x 2
#>   target   command                                                         
#>   <chr>    <expr>                                                          
#> 1 raw_data readxl::read_excel(file_in("raw_data.xlsx"))                   …
#> 2 data     raw_data %>% mutate(Species = forcats::fct_inorder(Species))   …
#> 3 fit      lm(Sepal.Width ~ Petal.Width + Species, data)                  …
#> 4 hist     create_plot(data)                                              …
#> 5 report   rmarkdown::render(knitr_in("report.Rmd"), output_file = file_ou…

6.3 So why do we use plans?

If you have ever waited more than 10 minutes for an R script to finish, then you know the frustration of having to rerun the whole thing every time you make a change. Plans make life easier.

6.3.1 Plans chop up the work into pieces.

Some targets may need an update while others may not. In our first example, make() was smart enough to skip the data cleaning step and just rebuild the plot and report. drake and its plans compartmentalize the work, and this can save you from wasted effort in the long run.

6.3.2 drake uses plans to schedule your work.

make() automatically learns the build order of your targets and how to run them in parallel. The underlying magic is static code analysis, which automatically detects the dependencies of each target without having to run its command.

create_plot <- function(data) {
  ggplot(data, aes_string(x = "Petal.Width", fill = "Species")) +
    geom_histogram(bins = 20)
}

deps_code(create_plot)
#> # A tibble: 3 x 2
#>   name           type   
#>   <chr>          <chr>  
#> 1 geom_histogram globals
#> 2 ggplot         globals
#> 3 aes_string     globals

deps_code(quote(create_plot(datasets::iris)))
#> # A tibble: 2 x 2
#>   name           type      
#>   <chr>          <chr>     
#> 1 create_plot    globals   
#> 2 datasets::iris namespaced

Because of the dependency relationships, row order does not matter once the plan is fully defined. The following plan declares file before plot.

small_plan <- drake_plan(
  file = ggsave(file_out("plot.png"), plot, width = 7, height = 5),
  plot = create_plot(datasets::iris)
)

But file actually depends on plot.

small_config <- drake_config(small_plan)
vis_drake_graph(small_config)

So make() builds plot first.

library(ggplot2)
make(small_plan)
#> target plot
#> target file
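The reordering above can be mimicked in a few lines of base R. The sketch below assumes a simplified rule (a target depends on any other target whose name appears in its command) and a naive scheduling loop. The names commands, deps, and build_order are made up for this illustration, and drake's real scheduler is considerably more sophisticated.

```r
# Commands from small_plan, stored as unevaluated expressions.
commands <- list(
  file = quote(ggsave(file_out("plot.png"), plot, width = 7, height = 5)),
  plot = quote(create_plot(datasets::iris))
)

# Simplified rule: a target depends on every other target whose name
# appears as a variable in its command.
deps <- lapply(commands, function(cmd) {
  intersect(all.vars(cmd), names(commands))
})

# Repeatedly pick the targets whose dependencies are already built.
build_order <- character(0)
while (length(build_order) < length(commands)) {
  remaining <- setdiff(names(commands), build_order)
  ready <- Filter(function(name) all(deps[[name]] %in% build_order), remaining)
  if (length(ready) == 0) stop("circular dependency detected")
  build_order <- c(build_order, ready)
}
build_order
```

Even though file is listed first, it lands after plot in build_order, matching the order in which make() built the targets.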

6.4 Special custom columns in your plan.

You can add other columns besides the required target and command.

cbind(small_plan, cpu = c(1, 2))
#>   target                                                   command cpu
#> 1   file ggsave(file_out("plot.png"), plot, width = 7, height = 5)   1
#> 2   plot                               create_plot(datasets::iris)   2

Within drake_plan(), target() lets you create any custom column except target, command, and transform, the last of which has a special meaning.

drake_plan(
  file = target(
    ggsave(file_out("plot.png"), plot),
    elapsed = 10
  ),
  create_plot(datasets::iris)
)
#> # A tibble: 2 x 3
#>   target         command                            elapsed
#>   <chr>          <expr>                               <dbl>
#> 1 file           ggsave(file_out("plot.png"), plot)      10
#> 2 drake_target_1 create_plot(datasets::iris)             NA

The following columns have special meanings for make().

  • elapsed and cpu: number of seconds to wait for the target to build before timing out (elapsed for elapsed time and cpu for CPU time).
  • priority: for parallel computing, optionally rank the targets according to priority in the scheduler.
  • resources: target-specific lists of resources for a computing cluster. See the advanced options in the parallel computing chapter for details.
  • retries: number of times to retry building a target in the event of an error.
  • trigger: rule to decide whether a target needs to run. See the trigger chapter to learn more.

6.5 Large plans

drake version 7.0.0 will introduce new experimental syntax to make it easier to create plans. To try it out before the next CRAN release, install the current development version from GitHub.

install.packages("remotes")
library(remotes)
install_github("ropensci/drake")

6.5.1 How to create large plans

Ordinarily, drake_plan() requires you to write out all the targets one-by-one. This is a literal pain.

drake_plan(
  data = get_data(),
  analysis_1_1 = fit_model_x(data, mean = 1, sd = 1),
  analysis_2_1 = fit_model_x(data, mean = 2, sd = 1),
  analysis_5_1 = fit_model_x(data, mean = 5, sd = 1),
  analysis_10_1 = fit_model_x(data, mean = 10, sd = 1),
  analysis_100_1 = fit_model_x(data, mean = 100, sd = 1),
  analysis_1000_1 = fit_model_x(data, mean = 1000, sd = 1),
  analysis_1_2 = fit_model_x(data, mean = 1, sd = 2),
  analysis_2_2 = fit_model_x(data, mean = 2, sd = 2),
  analysis_5_2 = fit_model_x(data, mean = 5, sd = 2),
  analysis_10_2 = fit_model_x(data, mean = 10, sd = 2),
  analysis_100_2 = fit_model_x(data, mean = 100, sd = 2),
  analysis_1000_2 = fit_model_x(data, mean = 1000, sd = 2),
  # UUUGGGHH my wrists are cramping! :( ...
)

Transformations reduce typing, especially when combined with tidy evaluation (!!).

lots_of_sds <- as.numeric(1:1e3)

drake_plan(
  data = get_data(),
  analysis = target(
    fun(data, mean = mean_val, sd = sd_val),
    transform = cross(mean_val = c(2, 5, 10, 100, 1000), sd_val = !!lots_of_sds)
  )
)
#> # A tibble: 5,001 x 2
#>    target          command                       
#>    <chr>           <expr>                        
#>  1 data            get_data()                    
#>  2 analysis_2_1    fun(data, mean = 2, sd = 1)   
#>  3 analysis_5_1    fun(data, mean = 5, sd = 1)   
#>  4 analysis_10_1   fun(data, mean = 10, sd = 1)  
#>  5 analysis_100_1  fun(data, mean = 100, sd = 1) 
#>  6 analysis_1000_1 fun(data, mean = 1000, sd = 1)
#>  7 analysis_2_2    fun(data, mean = 2, sd = 2)   
#>  8 analysis_5_2    fun(data, mean = 5, sd = 2)   
#>  9 analysis_10_2   fun(data, mean = 10, sd = 2)  
#> 10 analysis_100_2  fun(data, mean = 100, sd = 2) 
#> # … with 4,991 more rows

Behind the scenes during a transformation, drake_plan() creates new columns to track what is happening. You can see them with trace = TRUE.

drake_plan(
  data = get_data(),
  analysis = target(
    analyze(data, mean, sd),
    transform = map(mean = c(3, 4), sd = c(1, 2))
  ),
  trace = TRUE
)
#> # A tibble: 3 x 5
#>   target       command             mean  sd    analysis    
#>   <chr>        <expr>              <chr> <chr> <chr>       
#> 1 data         get_data()          <NA>  <NA>  <NA>        
#> 2 analysis_3_1 analyze(data, 3, 1) 3     1     analysis_3_1
#> 3 analysis_4_2 analyze(data, 4, 2) 4     2     analysis_4_2

Because of those columns, you can chain transformations together in complex pipelines.

plan1 <- drake_plan(
  small = get_small_data(),
  large = get_large_data(),
  analysis = target( # Analyze each dataset once with a different mean.
    analyze(data, mean),
    transform = map(data = c(small, large), mean = c(1, 2))
  ),
  # Calculate 2 different performance metrics on every model fit.
  metric = target(
    metric_fun(analysis),
    # mse = mean squared error, mae = mean absolute error.
    # Assume these are functions you write.
    transform = cross(metric_fun = c(mse, mae), analysis)
  ),
  # Summarize the performance metrics for each dataset.
  summ_data = target(
    summary(metric),
    transform = combine(metric, .by = data)
  ),
  # Same, but for each metric type.
  summ_metric = target(
    summary(metric),
    transform = combine(metric, .by = metric_fun)
  )
)

plan1
#> # A tibble: 12 x 2
#>    target                 command                                          
#>    <chr>                  <expr>                                           
#>  1 small                  get_small_data()                                …
#>  2 large                  get_large_data()                                …
#>  3 analysis_small_1       analyze(small, 1)                               …
#>  4 analysis_large_2       analyze(large, 2)                               …
#>  5 metric_mse_analysis_l… mse(analysis_large_2)                           …
#>  6 metric_mae_analysis_l… mae(analysis_large_2)                           …
#>  7 metric_mse_analysis_s… mse(analysis_small_1)                           …
#>  8 metric_mae_analysis_s… mae(analysis_small_1)                           …
#>  9 summ_data_large        summary(metric_mse_analysis_large_2, metric_mae_…
#> 10 summ_data_small        summary(metric_mse_analysis_small_1, metric_mae_…
#> 11 summ_metric_mae        summary(metric_mae_analysis_large_2, metric_mae_…
#> 12 summ_metric_mse        summary(metric_mse_analysis_large_2, metric_mse_…

config1 <- drake_config(plan1)
vis_drake_graph(config1)

And you can write the transformations in any order. The following plan is equivalent to plan1 despite the rearranged rows.

plan2 <- drake_plan(
  # Summarize the performance metrics for each metric type.
  summ_metric = target(
    summary(metric),
    transform = combine(metric, .by = metric_fun)
  ),
  # Calculate 2 different performance metrics on every model fit.
  metric = target(
    metric_fun(analysis),
    # mse = mean squared error, mae = mean absolute error.
    # Assume these are functions you write.
    transform = cross(metric_fun = c(mse, mae), analysis)
  ),
  small = get_small_data(),
  analysis = target( # Analyze each dataset once with a different mean.
    analyze(data, mean),
    transform = map(data = c(small, large), mean = c(1, 2))
  ),
  # Summarize the performance metrics for each dataset.
  summ_data = target(
    summary(metric),
    transform = combine(metric, .by = data)
  ),
  large = get_large_data()
)

plan2
#> # A tibble: 12 x 2
#>    target                 command                                          
#>    <chr>                  <expr>                                           
#>  1 summ_metric_mae        summary(metric_mae_analysis_large_2, metric_mae_…
#>  2 summ_metric_mse        summary(metric_mse_analysis_large_2, metric_mse_…
#>  3 metric_mse_analysis_l… mse(analysis_large_2)                           …
#>  4 metric_mae_analysis_l… mae(analysis_large_2)                           …
#>  5 metric_mse_analysis_s… mse(analysis_small_1)                           …
#>  6 metric_mae_analysis_s… mae(analysis_small_1)                           …
#>  7 small                  get_small_data()                                …
#>  8 analysis_small_1       analyze(small, 1)                               …
#>  9 analysis_large_2       analyze(large, 2)                               …
#> 10 summ_data_large        summary(metric_mse_analysis_large_2, metric_mae_…
#> 11 summ_data_small        summary(metric_mse_analysis_small_1, metric_mae_…
#> 12 large                  get_large_data()                                …

config2 <- drake_config(plan2)
vis_drake_graph(config2)

6.5.2 The types of transformations

drake supports three types of transformations: map(), cross(), and combine(). These are not actual functions, but you can treat them as functions when you use them in drake_plan(). Each transformation takes after a function from the Tidyverse.

drake      Tidyverse analogue
map()      pmap() from purrr
cross()    crossing() from tidyr
combine()  summarize() from dplyr

6.5.2.1 map()

map() creates a new target for each row in a grid.

drake_plan(
  x = target(
    simulate_data(center, scale),
    transform = map(center = c(2, 1, 0), scale = c(3, 2, 1))
  )
)
#> # A tibble: 3 x 2
#>   target command            
#>   <chr>  <expr>             
#> 1 x_2_3  simulate_data(2, 3)
#> 2 x_1_2  simulate_data(1, 2)
#> 3 x_0_1  simulate_data(0, 1)

You can supply your own custom grid using the .data argument. Note the use of !! below.

my_grid <- tibble(
  sim_function = c("rnorm", "rt", "rcauchy"),
  title = c("Normal", "Student t", "Cauchy")
)
my_grid$sim_function <- rlang::syms(my_grid$sim_function)

drake_plan(
  x = target(
    simulate_data(sim_function, title, center, scale),
    transform = map(
      center = c(2, 1, 0),
      scale = c(3, 2, 1),
      .data = !!my_grid,
      .id = sim_function # for pretty target names
    )
  )
)
#> # A tibble: 3 x 2
#>   target    command                               
#>   <chr>     <expr>                                
#> 1 x_rnorm   simulate_data(rnorm, "Normal", 2, 3)  
#> 2 x_rt      simulate_data(rt, "Student t", 1, 2)  
#> 3 x_rcauchy simulate_data(rcauchy, "Cauchy", 0, 1)

6.5.2.2 Special considerations in map()

map() column-binds variables together to create a grid. The lengths of those variables need to be conformable just as with data.frame().

drake_plan(
  x = target(
    simulate_data(center, scale),
    transform = map(center = c(2, 1, 0), scale = c(3, 2))
  )
)
#> Error: Failed to make a grid of grouping variables for map().
#> Grouping variables in map() must have suitable lengths for coercion to a data frame.
#> Possibly uneven groupings detected in map(center = c(2, 1, 0), scale = c(3, 2)):
#>   c("2", "1", "0")
#>   c("3", "2")

Sometimes, the results are sensible when grouping variable lengths are multiples of each other, but be careful.

drake_plan(
  x = target(
    simulate_data(center, scale),
    transform = map(center = c(2, 1, 0), scale = 4)
  )
)
#> # A tibble: 3 x 2
#>   target command            
#>   <chr>  <expr>             
#> 1 x_2_4  simulate_data(2, 4)
#> 2 x_1_4  simulate_data(1, 4)
#> 3 x_0_4  simulate_data(0, 4)
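Since the text compares map() to data.frame(), it may help to see the recycling rule in plain base R. This only demonstrates how data.frame() behaves and makes no claim about drake's internals.

```r
# A length-1 variable is recycled to match the longer one.
grid <- data.frame(center = c(2, 1, 0), scale = 4)
grid

# Lengths 3 and 2 cannot be recycled evenly, so data.frame() fails.
# tryCatch() captures the error message instead of stopping the script.
uneven <- tryCatch(
  data.frame(center = c(2, 1, 0), scale = c(3, 2)),
  error = function(e) conditionMessage(e)
)
uneven
```

The first call yields three rows, one per value of center, each paired with scale = 4. The second call fails, which parallels the error map() reported for the same pair of lengths.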

Things get trickier when drake reuses grouping variables from previous transformations. For example, below, each x_* target has an associated center value. So if you write transform = map(x), then center goes along for the ride.

drake_plan(
  x = target(
    simulate_data(center),
    transform = map(center = c(1, 2))
  ),
  y = target(
    process_data(x, center),
    transform = map(x)
  ),
  trace = TRUE # Adds extra columns for the grouping variables.
)
#> # A tibble: 4 x 5
#>   target command              center x     y    
#>   <chr>  <expr>               <chr>  <chr> <chr>
#> 1 x_1    simulate_data(1)     1      x_1   <NA> 
#> 2 x_2    simulate_data(2)     2      x_2   <NA> 
#> 3 y_x_1  process_data(x_1, 1) 1      x_1   y_x_1
#> 4 y_x_2  process_data(x_2, 2) 2      x_2   y_x_2

But if other targets have center values of their own, drake_plan() may not know what to do with them.

drake_plan(
  w = target(
    simulate_data(center),
    transform = map(center = c(3, 4))
  ),
  x = target(
    simulate_data_2(center),
    transform = map(center = c(1, 2))
  ),
  y = target(
    process_data(w, x, center),
    transform = map(w, x)
  ),
  trace = TRUE
)
#> # A tibble: 6 x 6
#>   target    command                    center w     x     y        
#>   <chr>     <expr>                     <chr>  <chr> <chr> <chr>    
#> 1 w_3       simulate_data(3)           3      w_3   <NA>  <NA>     
#> 2 w_4       simulate_data(4)           4      w_4   <NA>  <NA>     
#> 3 x_1       simulate_data_2(1)         1      <NA>  x_1   <NA>     
#> 4 x_2       simulate_data_2(2)         2      <NA>  x_2   <NA>     
#> 5 y_w_3_x_1 process_data(w_3, x_1, NA) <NA>   w_3   x_1   y_w_3_x_1
#> 6 y_w_4_x_2 process_data(w_4, x_2, NA) <NA>   w_4   x_2   y_w_4_x_2

The problem is that there are 4 values of center and only two x_* targets (and two y_* targets). Even if you explicitly supply center to the transformation, map() can only take the first two values.

drake_plan(
  w = target(
    simulate_data(center),
    transform = map(center = c(3, 4))
  ),
  x = target(
    simulate_data_2(center),
    transform = map(center = c(1, 2))
  ),
  y = target(
    process_data(w, x, center),
    transform = map(w, x, center)
  ),
  trace = TRUE
)
#> # A tibble: 6 x 6
#>   target      command                   center w     x     y          
#>   <chr>       <expr>                    <chr>  <chr> <chr> <chr>      
#> 1 w_3         simulate_data(3)          3      w_3   <NA>  <NA>       
#> 2 w_4         simulate_data(4)          4      w_4   <NA>  <NA>       
#> 3 x_1         simulate_data_2(1)        1      <NA>  x_1   <NA>       
#> 4 x_2         simulate_data_2(2)        2      <NA>  x_2   <NA>       
#> 5 y_w_3_x_1_3 process_data(w_3, x_1, 3) 3      w_3   x_1   y_w_3_x_1_3
#> 6 y_w_4_x_2_4 process_data(w_4, x_2, 4) 4      w_4   x_2   y_w_4_x_2_4

So please inspect the plan before you run it with make(). Once you have a drake_config() object, vis_drake_graph() and deps_target() can help.

6.5.2.3 cross()

cross() creates a new target for each combination of argument values.

drake_plan(
  x = target(
    simulate_data(nrow, ncol),
    transform = cross(nrow = c(1, 2, 3), ncol = c(4, 5))
  )
)
#> # A tibble: 6 x 2
#>   target command            
#>   <chr>  <expr>             
#> 1 x_1_4  simulate_data(1, 4)
#> 2 x_2_4  simulate_data(2, 4)
#> 3 x_3_4  simulate_data(3, 4)
#> 4 x_1_5  simulate_data(1, 5)
#> 5 x_2_5  simulate_data(2, 5)
#> 6 x_3_5  simulate_data(3, 5)
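The grid behind cross() can be approximated in base R with expand.grid(), used here as a stand-in for tidyr's crossing(). The naming scheme below is an assumption that simply imitates the target names shown above.

```r
# All combinations of the two grouping variables; the first one varies fastest.
grid <- expand.grid(nrow = c(1, 2, 3), ncol = c(4, 5))

# Imitate the target names by pasting the values together.
target_names <- paste("x", grid$nrow, grid$ncol, sep = "_")
target_names

# The same arithmetic explains the 5,001-row plan earlier in the chapter:
# 5 means times 1000 standard deviations, plus the single data target.
big_grid <- expand.grid(mean_val = c(2, 5, 10, 100, 1000), sd_val = 1:1000)
nrow(big_grid) + 1
```

The number of targets is the product of the lengths of the grouping variables, so crossing even a few moderately long vectors produces a large plan quickly.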

6.5.2.4 combine()

In combine(), you can insert multiple targets into individual commands. The closest comparison is the unquote-splice operator !!! from the Tidyverse.

plan <- drake_plan(
  data = target(
    sim_data(mean = x, sd = y),
    transform = map(x = c(1, 2), y = c(3, 4))
  ),
  larger = target(
    bind_rows(data, .id = "id") %>%
      arrange(sd) %>%
      head(n = 400),
    transform = combine(data)
  )
)

plan
#> # A tibble: 3 x 2
#>   target   command                                                         
#>   <chr>    <expr>                                                          
#> 1 data_1_3 sim_data(mean = 1, sd = 3)                                     …
#> 2 data_2_4 sim_data(mean = 2, sd = 4)                                     …
#> 3 larger   bind_rows(data_1_3, data_2_4, .id = "id") %>% arrange(sd) %>%  …

drake_plan_source(plan)
#> drake_plan(
#>   data_1_3 = sim_data(mean = 1, sd = 3),
#>   data_2_4 = sim_data(mean = 2, sd = 4),
#>   larger = bind_rows(data_1_3, data_2_4, .id = "id") %>%
#>     arrange(sd) %>%
#>     head(n = 400)
#> )

config <- drake_config(plan)
vis_drake_graph(config)

You can combine different groups of targets in the same command.

plan <- drake_plan(
  data_group1 = target(
    sim_data(mean = x, sd = y),
    transform = map(x = c(1, 2), y = c(3, 4))
  ),
  data_group2 = target(
    pull_data(url),
    transform = map(url = c("example1.com", "example2.com"))
  ),
  larger = target(
    bind_rows(data_group1, data_group2, .id = "id") %>%
      arrange(sd) %>%
      head(n = 400),
    transform = combine(data_group1, data_group2)
  )
)

drake_plan_source(plan)
#> drake_plan(
#>   data_group1_1_3 = sim_data(mean = 1, sd = 3),
#>   data_group1_2_4 = sim_data(mean = 2, sd = 4),
#>   data_group2_.example1.com. = pull_data("example1.com"),
#>   data_group2_.example2.com. = pull_data("example2.com"),
#>   larger = bind_rows(data_group1_1_3, data_group1_2_4, data_group2_.example1.com.,
#>     data_group2_.example2.com.,
#>     .id = "id"
#>   ) %>%
#>     arrange(sd) %>%
#>     head(n = 400)
#> )

And as with group_by() from dplyr, you can create a separate aggregate for each combination of levels of the arguments. Just pass a symbol or vector of symbols to the optional .by argument of combine().

plan <- drake_plan(
  data = target(
    sim_data(mean = x, sd = y, skew = z),
    transform = cross(x = c(1, 2), y = c(3, 4), z = c(5, 6))
  ),
  combined = target(
    bind_rows(data, .id = "id") %>%
      arrange(sd) %>%
      head(n = 400),
    transform = combine(data, .by = c(x, y))
  )
)

drake_plan_source(plan)
#> drake_plan(
#>   data_1_3_5 = sim_data(mean = 1, sd = 3, skew = 5),
#>   data_2_3_5 = sim_data(mean = 2, sd = 3, skew = 5),
#>   data_1_4_5 = sim_data(mean = 1, sd = 4, skew = 5),
#>   data_2_4_5 = sim_data(mean = 2, sd = 4, skew = 5),
#>   data_1_3_6 = sim_data(mean = 1, sd = 3, skew = 6),
#>   data_2_3_6 = sim_data(mean = 2, sd = 3, skew = 6),
#>   data_1_4_6 = sim_data(mean = 1, sd = 4, skew = 6),
#>   data_2_4_6 = sim_data(mean = 2, sd = 4, skew = 6),
#>   combined_1_3 = bind_rows(data_1_3_5, data_1_3_6, .id = "id") %>%
#>     arrange(sd) %>%
#>     head(n = 400),
#>   combined_2_3 = bind_rows(data_2_3_5, data_2_3_6, .id = "id") %>%
#>     arrange(sd) %>%
#>     head(n = 400),
#>   combined_1_4 = bind_rows(data_1_4_5, data_1_4_6, .id = "id") %>%
#>     arrange(sd) %>%
#>     head(n = 400),
#>   combined_2_4 = bind_rows(data_2_4_5, data_2_4_6, .id = "id") %>%
#>     arrange(sd) %>%
#>     head(n = 400)
#> )
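To make the grouping concrete, here is a base R sketch of what combine(data, .by = c(x, y)) appears to do with the target names: split them by the .by variables, then splice each group into a call. This is a reading of the output above and not drake's actual implementation, and the names grid, groups, and calls are hypothetical. bind_rows() is never run here; only its name is used to build the expressions.

```r
# Rebuild the grid of grouping variables and the matching target names.
grid <- expand.grid(x = c(1, 2), y = c(3, 4), z = c(5, 6))
grid$target <- paste("data", grid$x, grid$y, grid$z, sep = "_")

# Group the target names by each combination of x and y.
groups <- split(grid$target, paste(grid$x, grid$y, sep = "_"))

# Splice each group of names into an unevaluated bind_rows() call.
calls <- lapply(groups, function(targets) {
  as.call(c(
    list(as.name("bind_rows")),
    lapply(targets, as.name),
    list(.id = "id")
  ))
})
calls[["1_3"]]
```

Each element of calls corresponds to one combined_* target, although split() orders the groups alphabetically and not in the order drake_plan() lists them.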

In your post-processing, you may need the values of x and y that underlie data_1_3 and data_2_4. Solution: get the trace and the target names. We define a new plan

plan <- drake_plan(
  data = target(
    sim_data(mean = x, sd = y),
    transform = map(x = c(1, 2), y = c(3, 4))
  ),
  larger = target(
    post_process(data, plan = ignore(plan)) %>%
      arrange(sd) %>%
      head(n = 400),
    transform = combine(data)
  ),
  trace = TRUE
)

drake_plan_source(plan)
#> drake_plan(
#>   data_1_3 = target(
#>     command = sim_data(mean = 1, sd = 3),
#>     x = "1",
#>     y = "3",
#>     data = "data_1_3"
#>   ),
#>   data_2_4 = target(
#>     command = sim_data(mean = 2, sd = 4),
#>     x = "2",
#>     y = "4",
#>     data = "data_2_4"
#>   ),
#>   larger = target(
#>     command = post_process(data_1_3, data_2_4, plan = ignore(plan)) %>%
#>       arrange(sd) %>%
#>       head(n = 400),
#>     larger = "larger"
#>   )
#> )

and a new function

post_process <- function(..., plan) {
  args <- list(...)
  names(args) <- all.vars(substitute(list(...)))
  trace <- filter(plan, target %in% names(args))
  # Do post-processing with args and trace.
}
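The key trick in post_process() is recovering the names of the objects passed through the dots. The following self-contained variant can be run without drake or dplyr. The toy_plan data frame, the placeholder datasets, and the name post_process_sketch() are all made up for illustration, and base R subsetting stands in for filter().

```r
# A stand-in for the traced plan: one row per target plus its x and y.
toy_plan <- data.frame(
  target = c("data_1_3", "data_2_4", "larger"),
  x = c("1", "2", NA),
  y = c("3", "4", NA),
  stringsAsFactors = FALSE
)

post_process_sketch <- function(..., plan) {
  args <- list(...)
  # Recover the symbols the caller typed for the ... arguments.
  names(args) <- all.vars(substitute(list(...)))
  # Keep only the plan rows that describe those targets.
  trace <- plan[plan$target %in% names(args), ]
  list(args = args, trace = trace)
}

# Placeholder datasets named after the targets in the plan.
data_1_3 <- data.frame(value = 1)
data_2_4 <- data.frame(value = 2)

result <- post_process_sketch(data_1_3, data_2_4, plan = toy_plan)
names(result$args)
result$trace
```

With the names attached, each dataset in args can be matched to its row of the trace, and therefore to the x and y values that produced it.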

6.5.3 Grouping variables

A grouping variable is an argument to map(), cross(), or combine() that identifies a sub-collection of target names. Grouping variables can be either literals or symbols. Symbols can be scalars or vectors, and you can pass them to transformations with or without argument names.

6.5.3.1 Literal arguments

When you pass a grouping variable of literals, you must use an explicit argument name. One does not simply write map(c(1, 2)).

drake_plan(x = target(sqrt(y), transform = map(y = c(1, 2))))
#> # A tibble: 2 x 2
#>   target command
#>   <chr>  <expr> 
#> 1 x_1    sqrt(1)
#> 2 x_2    sqrt(2)

And if you supply integer sequences the usual way, you may notice some rows are missing.

drake_plan(x = target(sqrt(y), transform = map(y = 1:3)))
#> # A tibble: 2 x 2
#>   target command
#>   <chr>  <expr> 
#> 1 x_1    sqrt(1)
#> 2 x_3    sqrt(3)

Tidy evaluation and as.numeric() make sure all the data points show up.

y_vals <- as.numeric(1:3)
drake_plan(x = target(sqrt(y), transform = map(y = !!y_vals)))
#> # A tibble: 3 x 2
#>   target command
#>   <chr>  <expr> 
#> 1 x_1    sqrt(1)
#> 2 x_2    sqrt(2)
#> 3 x_3    sqrt(3)
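One plausible explanation for the missing row, assuming the transformation inspects the unevaluated expression and does not compute its value, can be seen in base R. The sketch uses bquote() as a rough stand-in for !! and makes no claim about drake's internals.

```r
# c(1, 2, 3) is a call whose arguments are the three values themselves.
literal_args <- vapply(as.list(quote(c(1, 2, 3)))[-1], deparse, character(1))
literal_args

# 1:3 is a call to `:` with only two arguments, 1 and 3.
colon_args <- vapply(as.list(quote(1:3))[-1], deparse, character(1))
colon_args

# Integer sequences deparse back to the colon form; doubles do not.
deparse(1:3)
deparse(as.numeric(1:3))

# Splicing an evaluated numeric vector into the expression, here with
# bquote(), makes all three values visible.
y_vals <- as.numeric(1:3)
spliced <- bquote(map(y = .(y_vals)))
spliced
```

Under this reading, as.numeric() matters because an integer sequence would turn back into 1:3 when converted to code, bringing the original problem with it.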

Character vectors usually work without a hitch, and quotes are converted into dots to make valid target names.

drake_plan(x = target(get_data(y), transform = map(y = c("a", "b", "c"))))
#> # A tibble: 3 x 2
#>   target command      
#>   <chr>  <expr>       
#> 1 x_.a.  get_data("a")
#> 2 x_.b.  get_data("b")
#> 3 x_.c.  get_data("c")

y_vals <- letters
drake_plan(x = target(get_data(y), transform = map(y = !!y_vals)))
#> # A tibble: 26 x 2
#>    target command      
#>    <chr>  <expr>       
#>  1 x_.a.  get_data("a")
#>  2 x_.b.  get_data("b")
#>  3 x_.c.  get_data("c")
#>  4 x_.d.  get_data("d")
#>  5 x_.e.  get_data("e")
#>  6 x_.f.  get_data("f")
#>  7 x_.g.  get_data("g")
#>  8 x_.h.  get_data("h")
#>  9 x_.i.  get_data("i")
#> 10 x_.j.  get_data("j")
#> # … with 16 more rows

6.5.3.2 Named symbol arguments

Symbols passed with explicit argument names define new groupings of existing targets on the fly, and only the map() and cross() transformations can accept them this way. To generate long symbol lists, use the syms() function from the rlang package. Remember to use the tidy evaluation operator !! inside the transformation.

vals <- rlang::syms(letters)
drake_plan(x = target(get_data(y), transform = map(y = !!vals)))
#> # A tibble: 26 x 2
#>    target command    
#>    <chr>  <expr>     
#>  1 x_a    get_data(a)
#>  2 x_b    get_data(b)
#>  3 x_c    get_data(c)
#>  4 x_d    get_data(d)
#>  5 x_e    get_data(e)
#>  6 x_f    get_data(f)
#>  7 x_g    get_data(g)
#>  8 x_h    get_data(h)
#>  9 x_i    get_data(i)
#> 10 x_j    get_data(j)
#> # … with 16 more rows

The new groupings carry over to downstream targets by default, which you can see with trace = TRUE. Below, the rows for targets w_x and w_y have entries in the z column.

drake_plan(
  x = abs(mean(rnorm(10))),
  y = abs(mean(rnorm(100, 1))),
  z = target(sqrt(val), transform = map(val = c(x, y))),
  w = target(val + 1, transform = map(val)),
  trace = TRUE
)
#> # A tibble: 6 x 5
#>   target command                  val   z     w    
#>   <chr>  <expr>                   <chr> <chr> <chr>
#> 1 x      abs(mean(rnorm(10)))     <NA>  <NA>  <NA> 
#> 2 y      abs(mean(rnorm(100, 1))) <NA>  <NA>  <NA> 
#> 3 z_x    sqrt(x)                  x     z_x   <NA> 
#> 4 z_y    sqrt(y)                  y     z_y   <NA> 
#> 5 w_x    x + 1                    x     z_x   w_x  
#> 6 w_y    y + 1                    y     z_y   w_y

However, this is incorrect because w does not depend on z_x or z_y. So for w, you should write map(val = c(x, y)) instead of map(val) to tell drake to clear the trace. Then, you will see NAs in the z column for w_x and w_y, which is right and proper.

drake_plan(
  x = abs(mean(rnorm(10))),
  y = abs(mean(rnorm(100, 1))),
  z = target(sqrt(val), transform = map(val = c(x, y))),
  w = target(val + 1, transform = map(val = c(x, y))),
  trace = TRUE
)
#> # A tibble: 6 x 5
#>   target command                  val   z     w    
#>   <chr>  <expr>                   <chr> <chr> <chr>
#> 1 x      abs(mean(rnorm(10)))     <NA>  <NA>  <NA> 
#> 2 y      abs(mean(rnorm(100, 1))) <NA>  <NA>  <NA> 
#> 3 z_x    sqrt(x)                  x     z_x   <NA> 
#> 4 z_y    sqrt(y)                  y     z_y   <NA> 
#> 5 w_x    x + 1                    x     <NA>  w_x  
#> 6 w_y    y + 1                    y     <NA>  w_y

6.5.4 Tags

Tags are special optional grouping variables. They are ignored while the transformation is happening and then added to the plan to help subsequent transformations. There are two types of tags:

  1. In-tags, which contain the target name you start with, and
  2. Out-tags, which contain the target names generated by the transformations.

drake_plan(
  x = target(
    command,
    transform = map(y = c(1, 2), .tag_in = from, .tag_out = c(to, out))
  ),
  trace = TRUE
)
#> # A tibble: 2 x 7
#>   target command y     x     from  to    out  
#>   <chr>  <expr>  <chr> <chr> <chr> <chr> <chr>
#> 1 x_1    command 1     x_1   x     x_1   x_1  
#> 2 x_2    command 2     x_2   x     x_2   x_2

Subsequent transformations can use tags as grouping variables and add to existing tags.

plan <- drake_plan(
  prep_work = do_prep_work(),
  local = target(
    get_local_data(n, prep_work),
    transform = map(n = c(1, 2), .tag_in = data_source, .tag_out = data)
  ),
  online = target(
    get_online_data(n, prep_work, port = "8080"),
    transform = map(n = c(1, 2), .tag_in = data_source, .tag_out = data)
  ),
  summary = target(
    summarize(bind_rows(data, .id = "data")),
    transform = combine(data, .by = data_source)
  ),
  munged = target(
    munge(bind_rows(data, .id = "data")),
    transform = combine(data, .by = n)
  )
)

plan
#> # A tibble: 9 x 2
#>   target         command                                               
#>   <chr>          <expr>                                                
#> 1 prep_work      do_prep_work()                                        
#> 2 local_1        get_local_data(1, prep_work)                          
#> 3 local_2        get_local_data(2, prep_work)                          
#> 4 online_1       get_online_data(1, prep_work, port = "8080")          
#> 5 online_2       get_online_data(2, prep_work, port = "8080")          
#> 6 summary_local  summarize(bind_rows(local_1, local_2, .id = "data"))  
#> 7 summary_online summarize(bind_rows(online_1, online_2, .id = "data"))
#> 8 munged_1       munge(bind_rows(local_1, online_1, .id = "data"))     
#> 9 munged_2       munge(bind_rows(local_2, online_2, .id = "data"))

config <- drake_config(plan)
vis_drake_graph(config)


6.6 Create large plans the old way

drake provides several older utilities that increase the flexibility of plan creation.

  • drake_plan()
  • map_plan()
  • evaluate_plan()
  • expand_plan()
  • gather_by()
  • reduce_by()
  • gather_plan()
  • reduce_plan()

6.6.1 map_plan()

purrr-like functional programming is like looping, but cleaner. The idea is to iterate the same computation over multiple different data points. You write a function to do something once, and a map()-like helper invokes it on each point in your dataset. drake’s version of map() — or more precisely, pmap_df() — is map_plan().

In the following example, we want to know how well each pair of covariates in the mtcars dataset can predict fuel efficiency (in miles per gallon). We will try multiple pairs of covariates using the same statistical analysis, so it is a great time for drake-flavored functional programming with map_plan().

As with its cousin, pmap_df(), map_plan() needs

  1. A function.
  2. A grid of function arguments.

Our function fits a fuel efficiency model given a single pair of covariate names x1 and x2.

my_model_fit <- function(x1, x2, data){
  lm(as.formula(paste("mpg ~", x1, "+", x2)), data = data)
}
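Before building a plan, it can help to call the function once by hand. The snippet below repeats the definition so that it runs on its own, and it passes the covariate names as strings, which is how the grid of arguments will supply them.

```r
# Same definition as above, repeated so the snippet runs on its own.
my_model_fit <- function(x1, x2, data){
  lm(as.formula(paste("mpg ~", x1, "+", x2)), data = data)
}

# Fit one model directly, passing the covariate names as strings.
fit <- my_model_fit("cyl", "disp", data = mtcars)
coef(fit)
```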

Our grid of function arguments is a data frame of possible values for x1, x2, and data.

covariates <- setdiff(colnames(mtcars), "mpg") # Exclude the response variable.
args <- t(combn(covariates, 2)) # Take all possible pairs.
colnames(args) <- c("x1", "x2") # The column names must be the argument names of my_model_fit()
args <- tibble::as_tibble(args) # Tibbles are so nice.
args$data <- "mtcars"

args
#> # A tibble: 45 x 3
#>    x1    x2    data  
#>    <chr> <chr> <chr> 
#>  1 cyl   disp  mtcars
#>  2 cyl   hp    mtcars
#>  3 cyl   drat  mtcars
#>  4 cyl   wt    mtcars
#>  5 cyl   qsec  mtcars
#>  6 cyl   vs    mtcars
#>  7 cyl   am    mtcars
#>  8 cyl   gear  mtcars
#>  9 cyl   carb  mtcars
#> 10 disp  hp    mtcars
#> # … with 35 more rows
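The row count is easy to verify: mtcars has 10 columns other than mpg, and there are choose(10, 2) ways to pick an unordered pair. A quick sanity check in base R:

```r
covariates <- setdiff(colnames(mtcars), "mpg")
n_pairs <- nrow(t(combn(covariates, 2)))

# 10 covariates taken 2 at a time.
stopifnot(n_pairs == choose(length(covariates), 2))
n_pairs
#> [1] 45
```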

Each row of args corresponds to a call to my_model_fit(). To actually write out all those function calls, we use map_plan().

map_plan(args, my_model_fit)
#> # A tibble: 45 x 2
#>    target                command                                           
#>    <chr>                 <expr>                                            
#>  1 my_model_fit_501e051c my_model_fit(x1 = "cyl", x2 = "disp", data = "mtc…
#>  2 my_model_fit_d5de1d57 my_model_fit(x1 = "cyl", x2 = "hp", data = "mtcar…
#>  3 my_model_fit_eac6cd8b my_model_fit(x1 = "cyl", x2 = "drat", data = "mtc…
#>  4 my_model_fit_3900ef48 my_model_fit(x1 = "cyl", x2 = "wt", data = "mtcar…
#>  5 my_model_fit_a2d797f6 my_model_fit(x1 = "cyl", x2 = "qsec", data = "mtc…
#>  6 my_model_fit_f5c0ac7a my_model_fit(x1 = "cyl", x2 = "vs", data = "mtcar…
#>  7 my_model_fit_507d2929 my_model_fit(x1 = "cyl", x2 = "am", data = "mtcar…
#>  8 my_model_fit_b5f9a8a3 my_model_fit(x1 = "cyl", x2 = "gear", data = "mtc…
#>  9 my_model_fit_8c4c5d9d my_model_fit(x1 = "cyl", x2 = "carb", data = "mtc…
#> 10 my_model_fit_f9bb916e my_model_fit(x1 = "disp", x2 = "hp", data = "mtca…
#> # … with 35 more rows

We now have a plan, but it has a couple of issues.

  1. The data argument should be a symbol. In other words, we want my_model_fit(data = mtcars), not my_model_fit(data = "mtcars"). So we use the syms() function from the rlang package to turn args$data into a list of symbols.
  2. The default target names are ugly, so we can add a new "id" column to args (or select one with the id argument of map_plan()).

# Fixes (1)
args$data <- rlang::syms(args$data)

# Alternative if each element of `args$data` is code with multiple symbols:
# args$data <- purrr::map(args$data, rlang::parse_expr)

# Fixes (2)
args$id <- paste0("fit_", args$x1, "_", args$x2)

args
#> # A tibble: 45 x 4
#>    x1    x2    data   id          
#>    <chr> <chr> <list> <chr>       
#>  1 cyl   disp  <sym>  fit_cyl_disp
#>  2 cyl   hp    <sym>  fit_cyl_hp  
#>  3 cyl   drat  <sym>  fit_cyl_drat
#>  4 cyl   wt    <sym>  fit_cyl_wt  
#>  5 cyl   qsec  <sym>  fit_cyl_qsec
#>  6 cyl   vs    <sym>  fit_cyl_vs  
#>  7 cyl   am    <sym>  fit_cyl_am  
#>  8 cyl   gear  <sym>  fit_cyl_gear
#>  9 cyl   carb  <sym>  fit_cyl_carb
#> 10 disp  hp    <sym>  fit_disp_hp 
#> # … with 35 more rows

Much better.

plan <- map_plan(args, my_model_fit)
plan
#> # A tibble: 45 x 2
#>    target       command                                             
#>    <chr>        <expr>                                              
#>  1 fit_cyl_disp my_model_fit(x1 = "cyl", x2 = "disp", data = mtcars)
#>  2 fit_cyl_hp   my_model_fit(x1 = "cyl", x2 = "hp", data = mtcars)  
#>  3 fit_cyl_drat my_model_fit(x1 = "cyl", x2 = "drat", data = mtcars)
#>  4 fit_cyl_wt   my_model_fit(x1 = "cyl", x2 = "wt", data = mtcars)  
#>  5 fit_cyl_qsec my_model_fit(x1 = "cyl", x2 = "qsec", data = mtcars)
#>  6 fit_cyl_vs   my_model_fit(x1 = "cyl", x2 = "vs", data = mtcars)  
#>  7 fit_cyl_am   my_model_fit(x1 = "cyl", x2 = "am", data = mtcars)  
#>  8 fit_cyl_gear my_model_fit(x1 = "cyl", x2 = "gear", data = mtcars)
#>  9 fit_cyl_carb my_model_fit(x1 = "cyl", x2 = "carb", data = mtcars)
#> 10 fit_disp_hp  my_model_fit(x1 = "disp", x2 = "hp", data = mtcars) 
#> # … with 35 more rows
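The difference between a string and a symbol in the data argument can be seen with base R alone. The following sketch builds two calls with as.call(); f is a hypothetical placeholder function, not part of drake.

```r
# f is a hypothetical placeholder name; nothing is evaluated yet.
with_string <- as.call(list(quote(f), data = "mtcars"))
with_symbol <- as.call(list(quote(f), data = as.symbol("mtcars")))

deparse(with_string)
#> [1] "f(data = \"mtcars\")"
deparse(with_symbol)
#> [1] "f(data = mtcars)"

# Only the symbol version resolves to the dataset when evaluated.
f <- function(data) nrow(data)
eval(with_symbol)
#> [1] 32
```

With the string version, f() would receive the literal text "mtcars" instead of the data frame.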

We may also want to retain information about the constituent function arguments of each target. With map_plan(trace = TRUE), we can append the columns of args alongside the usual "target" and "command" columns of our plan.

map_plan(args, my_model_fit, trace = TRUE)
#> # A tibble: 45 x 6
#>    target     command                           x1    x2    data  id       
#>    <chr>      <expr>                            <chr> <chr> <exp> <chr>    
#>  1 fit_cyl_d… my_model_fit(x1 = "cyl", x2 = "d… cyl   disp  mtca… fit_cyl_…
#>  2 fit_cyl_hp my_model_fit(x1 = "cyl", x2 = "h… cyl   hp    mtca… fit_cyl_…
#>  3 fit_cyl_d… my_model_fit(x1 = "cyl", x2 = "d… cyl   drat  mtca… fit_cyl_…
#>  4 fit_cyl_wt my_model_fit(x1 = "cyl", x2 = "w… cyl   wt    mtca… fit_cyl_…
#>  5 fit_cyl_q… my_model_fit(x1 = "cyl", x2 = "q… cyl   qsec  mtca… fit_cyl_…
#>  6 fit_cyl_vs my_model_fit(x1 = "cyl", x2 = "v… cyl   vs    mtca… fit_cyl_…
#>  7 fit_cyl_am my_model_fit(x1 = "cyl", x2 = "a… cyl   am    mtca… fit_cyl_…
#>  8 fit_cyl_g… my_model_fit(x1 = "cyl", x2 = "g… cyl   gear  mtca… fit_cyl_…
#>  9 fit_cyl_c… my_model_fit(x1 = "cyl", x2 = "c… cyl   carb  mtca… fit_cyl_…
#> 10 fit_disp_… my_model_fit(x1 = "disp", x2 = "… disp  hp    mtca… fit_disp…
#> # … with 35 more rows

In any case, we can now fit our models.

make(plan, verbose = FALSE)

And inspect the output.

readd(fit_cyl_disp)
#> 
#> Call:
#> lm(formula = as.formula(paste("mpg ~", x1, "+", x2)), data = data)
#> 
#> Coefficients:
#> (Intercept)          cyl         disp  
#>    34.66099     -1.58728     -0.02058

6.6.2 Wildcard templating

In drake, you can write plans with wildcards. These wildcards are placeholders for text in commands. By iterating over the possible values of a wildcard, you can easily generate plans with thousands of targets. Let’s say you are running a simulation study, and you need to generate sets of random numbers from different distributions.

plan <- drake_plan(
  t  = rt(1000, df = 5),
  normal = rnorm(1000, mean = 0, sd = 1)
)

If you need to generate many datasets with different means, you may wish to write out each target individually.

drake_plan(
  t  = rt(1000, df = 5),
  normal_0 = rnorm(1000, mean = 0, sd = 1),
  normal_1 = rnorm(1000, mean = 1, sd = 1),
  normal_2 = rnorm(1000, mean = 2, sd = 1),
  normal_3 = rnorm(1000, mean = 3, sd = 1),
  normal_4 = rnorm(1000, mean = 4, sd = 1),
  normal_5 = rnorm(1000, mean = 5, sd = 1),
  normal_6 = rnorm(1000, mean = 6, sd = 1),
  normal_7 = rnorm(1000, mean = 7, sd = 1),
  normal_8 = rnorm(1000, mean = 8, sd = 1),
  normal_9 = rnorm(1000, mean = 9, sd = 1)
)

But writing all that code manually is a pain and prone to human error. Instead, use evaluate_plan().

plan <- drake_plan(
  t  = rt(1000, df = 5),
  normal = rnorm(1000, mean = mean__, sd = 1)
)
evaluate_plan(plan, wildcard = "mean__", values = 0:9)
#> # A tibble: 11 x 2
#>    target   command                      
#>    <chr>    <expr>                       
#>  1 t        rt(1000, df = 5)             
#>  2 normal_0 rnorm(1000, mean = 0, sd = 1)
#>  3 normal_1 rnorm(1000, mean = 1, sd = 1)
#>  4 normal_2 rnorm(1000, mean = 2, sd = 1)
#>  5 normal_3 rnorm(1000, mean = 3, sd = 1)
#>  6 normal_4 rnorm(1000, mean = 4, sd = 1)
#>  7 normal_5 rnorm(1000, mean = 5, sd = 1)
#>  8 normal_6 rnorm(1000, mean = 6, sd = 1)
#>  9 normal_7 rnorm(1000, mean = 7, sd = 1)
#> 10 normal_8 rnorm(1000, mean = 8, sd = 1)
#> 11 normal_9 rnorm(1000, mean = 9, sd = 1)
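Conceptually, a wildcard is replaced as text. The base R sketch below imitates that substitution for a single command template written as a string. It is an illustration under that assumption, not drake's internal code.

```r
template <- "rnorm(1000, mean = mean__, sd = 1)"
values <- 0:9

# Substitute each value for the wildcard text, one command per value.
commands <- vapply(
  values,
  function(v) gsub("mean__", as.character(v), template, fixed = TRUE),
  character(1)
)

# Append the value to the target name, as in the output above.
targets <- paste0("normal_", values)

data.frame(target = targets, command = commands, stringsAsFactors = FALSE)
```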

You can specify multiple wildcards at once. If multiple wildcards appear in the same command, you will get a new target for each unique combination of values.

plan <- drake_plan(
  t  = rt(1000, df = df__),
  normal = rnorm(1000, mean = mean__, sd = sd__)
)
evaluate_plan(
  plan,
  rules = list(
    mean__ = c(0, 1),
    sd__ = c(3, 4),
    df__ = 5:7
  )
)
#> # A tibble: 7 x 2
#>   target     command                      
#>   <chr>      <expr>                       
#> 1 t_5        rt(1000, df = 5)             
#> 2 t_6        rt(1000, df = 6)             
#> 3 t_7        rt(1000, df = 7)             
#> 4 normal_0_3 rnorm(1000, mean = 0, sd = 3)
#> 5 normal_0_4 rnorm(1000, mean = 0, sd = 4)
#> 6 normal_1_3 rnorm(1000, mean = 1, sd = 3)
#> 7 normal_1_4 rnorm(1000, mean = 1, sd = 4)
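The four normal targets correspond to the full grid of mean and sd values, while t only contains df__ and so gets three targets. As a rough base R picture of that grid (a sketch, not drake's implementation), expand.grid() enumerates the same combinations:

```r
# Vary sd fastest so the row order matches the output above.
grid <- expand.grid(sd = c(3, 4), mean = c(0, 1))

targets <- paste0("normal_", grid$mean, "_", grid$sd)
commands <- sprintf("rnorm(1000, mean = %s, sd = %s)", grid$mean, grid$sd)

data.frame(target = targets, command = commands, stringsAsFactors = FALSE)
```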

Wildcards for evaluate_plan() do not need to have the double-underscore suffix. Any valid symbol will do.

plan <- drake_plan(
  t  = rt(1000, df = .DF.),
  normal = rnorm(1000, mean = `{MEAN}`, sd = ..sd)
)
evaluate_plan(
  plan,
  rules = list(
    "`{MEAN}`" = c(0, 1),
    ..sd = c(3, 4),
    .DF. = 5:7
  )
)
#> # A tibble: 7 x 2
#>   target     command                      
#>   <chr>      <expr>                       
#> 1 t_5        rt(1000, df = 5)             
#> 2 t_6        rt(1000, df = 6)             
#> 3 t_7        rt(1000, df = 7)             
#> 4 normal_0_3 rnorm(1000, mean = 0, sd = 3)
#> 5 normal_0_4 rnorm(1000, mean = 0, sd = 4)
#> 6 normal_1_3 rnorm(1000, mean = 1, sd = 3)
#> 7 normal_1_4 rnorm(1000, mean = 1, sd = 4)

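Set expand to FALSE to disable expansion. The wildcard values are then substituted into the existing targets in order, and no new targets are created.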

plan <- drake_plan(
  t  = rpois(samples__, lambda = mean__),
  normal = rnorm(samples__, mean = mean__)
)
evaluate_plan(
  plan,
  rules = list(
    samples__ = c(50, 100),
    mean__ = c(1, 5)
  ),
  expand = FALSE
)
#> # A tibble: 2 x 2
#>   target command              
#>   <chr>  <expr>               
#> 1 t      rpois(50, lambda = 1)
#> 2 normal rnorm(100, mean = 5)
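In the output above, the first value of each wildcard goes to the first target and the second value to the second. A minimal base R sketch of that element-wise pairing follows. The command templates are written as strings here, and the loop only mirrors the observed output. It is not drake's internal logic.

```r
templates <- c(
  t = "rpois(samples__, lambda = mean__)",
  normal = "rnorm(samples__, mean = mean__)"
)
rules <- list(samples__ = c(50, 100), mean__ = c(1, 5))

commands <- templates
for (w in names(rules)) {
  # Pair the i-th command with the i-th value of wildcard w.
  commands <- mapply(
    function(cmd, v) gsub(w, as.character(v), cmd, fixed = TRUE),
    commands,
    rules[[w]]
  )
}
commands
```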

Wildcard templating can sometimes be tricky. For example, suppose your project is to analyze school data, and your workflow checks several metrics of several schools. The idea is to write a drake plan with your metrics and let the wildcard templating expand over the available schools.

hard_plan <- drake_plan(
  credits = check_credit_hours(school__),
  students = check_students(school__),
  grads = check_graduations(school__),
  public_funds = check_public_funding(school__)
)

evaluate_plan(
  hard_plan,
  rules = list(school__ = c("schoolA", "schoolB", "schoolC"))
)
#> # A tibble: 12 x 2
#>    target               command                      
#>    <chr>                <expr>                       
#>  1 credits_schoolA      check_credit_hours(schoolA)  
#>  2 credits_schoolB      check_credit_hours(schoolB)  
#>  3 credits_schoolC      check_credit_hours(schoolC)  
#>  4 students_schoolA     check_students(schoolA)      
#>  5 students_schoolB     check_students(schoolB)      
#>  6 students_schoolC     check_students(schoolC)      
#>  7 grads_schoolA        check_graduations(schoolA)   
#>  8 grads_schoolB        check_graduations(schoolB)   
#>  9 grads_schoolC        check_graduations(schoolC)   
#> 10 public_funds_schoolA check_public_funding(schoolA)
#> 11 public_funds_schoolB check_public_funding(schoolB)
#> 12 public_funds_schoolC check_public_funding(schoolC)

But what if some metrics do not make sense? For example, what if schoolC is a completely privately funded school? With no public funds, check_public_funding(schoolC) may quit in error if we are not careful. This is where setting up drake plans requires a little creativity. In this case, we recommend that you use two wildcards: one for all the schools and another for just the public schools. The new plan has no public_funds target for school C.

plan_template <- drake_plan(
  school = get_school_data("school__"),
  credits = check_credit_hours(all_schools__),
  students = check_students(all_schools__),
  grads = check_graduations(all_schools__),
  public_funds = check_public_funding(public_schools__)
)
evaluate_plan(
  plan = plan_template,
  rules = list(
    school__ = c("A", "B", "C"),
    all_schools__ =  c("school_A", "school_B", "school_C"),
    public_schools__ = c("school_A", "school_B")
  )
)
#> # A tibble: 14 x 2
#>    target                command                       
#>    <chr>                 <expr>                        
#>  1 school_A              get_school_data("A")          
#>  2 school_B              get_school_data("B")          
#>  3 school_C              get_school_data("C")          
#>  4 credits_school_A      check_credit_hours(school_A)  
#>  5 credits_school_B      check_credit_hours(school_B)  
#>  6 credits_school_C      check_credit_hours(school_C)  
#>  7 students_school_A     check_students(school_A)      
#>  8 students_school_B     check_students(school_B)      
#>  9 students_school_C     check_students(school_C)      
#> 10 grads_school_A        check_graduations(school_A)   
#> 11 grads_school_B        check_graduations(school_B)   
#> 12 grads_school_C        check_graduations(school_C)   
#> 13 public_funds_school_A check_public_funding(school_A)
#> 14 public_funds_school_B check_public_funding(school_B)

Thanks to Alex Axthelm for this use case in issue 235.

6.6.3 Wildcard clusters

With evaluate_plan(trace = TRUE), you can generate columns that show how the targets were generated from the wildcards.

plan_template <- drake_plan(
  school = get_school_data("school__"),
  credits = check_credit_hours(all_schools__),
  students = check_students(all_schools__),
  grads = check_graduations(all_schools__),
  public_funds = check_public_funding(public_schools__)
)
plan <- evaluate_plan(
  plan = plan_template,
  rules = list(
    school__ = c("A", "B", "C"),
    all_schools__ =  c("school_A", "school_B", "school_C"),
    public_schools__ = c("school_A", "school_B")
  ),
  trace = TRUE
)
plan
#> # A tibble: 14 x 8
#>    target command school__ school___from all_schools__ all_schools___f…
#>    <chr>  <expr>  <chr>    <chr>         <chr>         <chr>           
#>  1 schoo… get_sc… A        school        <NA>          <NA>            
#>  2 schoo… get_sc… B        school        <NA>          <NA>            
#>  3 schoo… get_sc… C        school        <NA>          <NA>            
#>  4 credi… check_… <NA>     <NA>          school_A      credits         
#>  5 credi… check_… <NA>     <NA>          school_B      credits         
#>  6 credi… check_… <NA>     <NA>          school_C      credits         
#>  7 stude… check_… <NA>     <NA>          school_A      students        
#>  8 stude… check_… <NA>     <NA>          school_B      students        
#>  9 stude… check_… <NA>     <NA>          school_C      students        
#> 10 grads… check_… <NA>     <NA>          school_A      grads           
#> 11 grads… check_… <NA>     <NA>          school_B      grads           
#> 12 grads… check_… <NA>     <NA>          school_C      grads           
#> 13 publi… check_… <NA>     <NA>          <NA>          <NA>            
#> 14 publi… check_… <NA>     <NA>          <NA>          <NA>            
#> # … with 2 more variables: public_schools__ <chr>,
#> #   public_schools___from <chr>

And then when you visualize the dependency graph, you can cluster nodes based on the wildcard info.

config <- drake_config(plan)
vis_drake_graph(
  config,
  group = "all_schools__",
  clusters = c("school_A", "school_B", "school_C")
)

See the visualization guide for more details.

6.6.4 Non-wildcard functions

6.6.4.1 expand_plan()

Sometimes, you just want multiple replicates of the same targets.

plan <- drake_plan(
  fake_data = simulate_from_model(),
  bootstrapped_data = bootstrap_from_real_data(real_data)
)
expand_plan(plan, values = 1:3)
#> # A tibble: 6 x 2
#>   target              command                            
#>   <chr>               <expr>                             
#> 1 fake_data_1         simulate_from_model()              
#> 2 fake_data_2         simulate_from_model()              
#> 3 fake_data_3         simulate_from_model()              
#> 4 bootstrapped_data_1 bootstrap_from_real_data(real_data)
#> 5 bootstrapped_data_2 bootstrap_from_real_data(real_data)
#> 6 bootstrapped_data_3 bootstrap_from_real_data(real_data)
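The naming pattern of the replicates is simple enough to reproduce by hand. This base R sketch generates the same target names as the output above. The commands are left out because they stay unchanged across replicates.

```r
targets <- c("fake_data", "bootstrapped_data")
values <- 1:3

# Repeat each target once per value, then append the value as a suffix.
expanded <- paste0(rep(targets, each = length(values)), "_", values)
expanded
```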

6.6.4.2 gather_plan() and gather_by()

Other times, you want to combine multiple targets into one.

plan <- drake_plan(
  small = data.frame(type = "small", x = rnorm(25), y = rnorm(25)),
  large = data.frame(type = "large", x = rnorm(1000), y = rnorm(1000))
)
gather_plan(plan, target = "combined")
#> # A tibble: 1 x 2
#>   target   command                           
#>   <chr>    <expr>                            
#> 1 combined list(small = small, large = large)

In this case, small and large are data frames, so it may be more convenient to combine the rows together.

gather_plan(plan, target = "combined", gather = "rbind")
#> # A tibble: 1 x 2
#>   target   command                            
#>   <chr>    <expr>                             
#> 1 combined rbind(small = small, large = large)

See also gather_by() to gather multiple groups of targets based on other columns in the plan (e.g. from evaluate_plan(trace = TRUE)).
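A gathered command is an ordinary R call whose arguments are the target names. As a rough illustration in base R (not how gather_plan() is implemented), the same call can be assembled with as.call() and evaluated on two small stand-in data frames:

```r
targets <- c("small", "large")

# Named list of symbols: list(small = small, large = large).
args <- setNames(lapply(targets, as.symbol), targets)
cmd <- as.call(c(list(quote(rbind)), args))
deparse(cmd)
#> [1] "rbind(small = small, large = large)"

# Stand-in data frames (much smaller than the ones in the plan above).
small <- data.frame(type = "small", x = 1:2)
large <- data.frame(type = "large", x = 1:3)
combined <- eval(cmd)
nrow(combined)
#> [1] 5
```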

6.6.4.3 reduce_plan() and reduce_by()

reduce_plan() is similar to gather_plan(), but it allows you to combine multiple targets together in pairs. This is useful if combining everything at once requires too much time or computer memory, or if you want to parallelize the aggregation.

plan <- drake_plan(
  a = 1,
  b = 2,
  c = 3,
  d = 4
)
reduce_plan(plan)
#> # A tibble: 3 x 2
#>   target   command            
#>   <chr>    <expr>             
#> 1 target_1 a + b              
#> 2 target_2 c + d              
#> 3 target   target_1 + target_2

You can control how each pair of targets gets combined.

reduce_plan(plan, begin = "c(", op = ", ", end = ")")
#> # A tibble: 3 x 2
#>   target   command              
#>   <chr>    <expr>               
#> 1 target_1 c(a, b)              
#> 2 target_2 c(c, d)              
#> 3 target   c(target_1, target_2)

See also reduce_by() to do reductions on multiple groups of targets based on other columns in the plan (e.g. from evaluate_plan(trace = TRUE)).
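The pairwise pattern can be sketched on ordinary values rather than targets. pairwise_reduce() below is a hypothetical helper written for this illustration. It combines neighbors two at a time, round by round, in the same shape as the plans above. It assumes a non-empty list.

```r
# Hypothetical helper: combine neighbors in pairs until one value remains.
pairwise_reduce <- function(values, f) {
  while (length(values) > 1) {
    idx <- seq(1, length(values), by = 2)
    values <- lapply(idx, function(i) {
      # An unpaired last element is carried to the next round unchanged.
      if (i < length(values)) f(values[[i]], values[[i + 1]]) else values[[i]]
    })
  }
  values[[1]]
}

pairwise_reduce(list(1, 2, 3, 4), `+`) # (1 + 2) + (3 + 4)
#> [1] 10
pairwise_reduce(list(1, 2, 3, 4), c)   # c(c(1, 2), c(3, 4))
#> [1] 1 2 3 4
```

In a drake plan, each intermediate pair is its own target, so the pairs within a round can be built in parallel.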

6.6.5 Custom metaprogramming

The drake plan is just a data frame. There is nothing magic about it, and you can create it any way you want. With your own custom metaprogramming, you don’t even need the drake_plan() function.

The following example could more easily be implemented with map_plan(), but we use other techniques to demonstrate the versatility of custom metaprogramming. Let’s consider a file-based example workflow. Here, our targets execute Linux commands to process input files and create output files.

cat in1.txt > out1.txt
cat in2.txt > out2.txt

The glue package can automatically generate these Linux commands.

library(glue)
glue_data(
  list(
    inputs = c("in1.txt", "in2.txt"), 
    outputs = c("out1.txt", "out2.txt")
  ),
  "cat {inputs} > {outputs}"
)
#> cat in1.txt > out1.txt
#> cat in2.txt > out2.txt

Our drake commands will use system() to execute the Linux commands that glue generates. Technically, we could use drake_plan() if we wanted.

library(tidyverse)
drake_plan(
  glue_data(
    list(
      inputs = file_in(c("in1.txt", "in2.txt")), 
      outputs = file_out(c("out1.txt", "out2.txt"))
    ),
    "cat {inputs} > {outputs}"
  ) %>%
    lapply(FUN = system)
)
#> # A tibble: 1 x 2
#>   target        command                                                    
#>   <chr>         <expr>                                                     
#> 1 drake_target… glue_data(list(inputs = file_in(c("in1.txt", "in2.txt")), …

But what if we want to generate these glue commands instead of writing them literally in our plan? This is a job for custom metaprogramming with tidy evaluation. First, we create a function to generate the drake command of an arbitrary target.

library(rlang) # for tidy evaluation
write_command <- function(cmd, inputs = NULL , outputs = NULL){
  inputs <- enexpr(inputs)
  outputs <- enexpr(outputs)
  expr({
    glue_data(
      list(
        inputs = file_in(!!inputs),
        outputs = file_out(!!outputs)
      ),
      !!cmd
    ) %>%
      lapply(FUN = system)
  }) %>%
    expr_text
}

write_command(
  cmd = "cat {inputs} > {outputs}",
  inputs = c("in1.txt", "in2.txt"),
  outputs = c("out1.txt", "out2.txt")
) %>%
  cat
#> {
#>     glue_data(list(inputs = file_in(c("in1.txt", "in2.txt")), 
#>         outputs = file_out(c("out1.txt", "out2.txt"))), "cat {inputs} > {outputs}") %>% 
#>         lapply(FUN = system)
#> }
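For readers less familiar with tidy evaluation, base R's bquote() offers a comparable unquoting mechanism: .() plays a role similar to !!. The sketch below splices an unevaluated expression into a call to file_in() without evaluating anything, so it runs without drake loaded.

```r
# An unevaluated expression, analogous to what enexpr() captures above.
inputs <- quote(c("in1.txt", "in2.txt"))

# .() inserts the captured expression into the quoted call.
cmd <- bquote(file_in(.(inputs)))
deparse(cmd)
#> [1] "file_in(c(\"in1.txt\", \"in2.txt\"))"
```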

Then, we lay out all the arguments we will pass to write_command(). Here, each row corresponds to a separate target.

meta_plan <- tribble(
  ~cmd, ~inputs, ~outputs,
  "cat {inputs} > {outputs}", c("in1.txt", "in2.txt"), c("out1.txt", "out2.txt"),
  "cat {inputs} {inputs} > {outputs}", c("out1.txt", "out2.txt"), c("out3.txt", "out4.txt")
) %>%
  print
#> # A tibble: 2 x 3
#>   cmd                               inputs    outputs  
#>   <chr>                             <list>    <list>   
#> 1 cat {inputs} > {outputs}          <chr [2]> <chr [2]>
#> 2 cat {inputs} {inputs} > {outputs} <chr [2]> <chr [2]>

Finally, we create our drake plan without any built-in drake functions.

plan <- tibble(
  target = paste0("target_", seq_len(nrow(meta_plan))),
  command = pmap_chr(meta_plan, write_command)
) %>%
  print
#> # A tibble: 2 x 2
#>   target   command                                                         
#>   <chr>    <chr>                                                           
#> 1 target_1 "{\n    glue_data(list(inputs = file_in(c(\"in1.txt\", \"in2.tx…
#> 2 target_2 "{\n    glue_data(list(inputs = file_in(c(\"out1.txt\", \"out2.…

# Create the input files referenced by file_in().
writeLines("in1", "in1.txt")
writeLines("in2", "in2.txt")
vis_drake_graph(drake_config(plan))

Alternatively, you could use as.call() instead of tidy evaluation to generate your plan. Use as.call() to construct calls to file_in(), file_out(), and custom functions in your commands.

library(purrr) # pmap_chr() is particularly useful here.

# A function that will be called in your commands.
command_function <- function(cmd, inputs, outputs){
  glue_data(
    list(
      inputs = inputs,
      outputs = outputs
    ),
    cmd
  ) %>%
    purrr::walk(system)
}

# A function to generate quoted calls to command_function(),
# which in turn contain quoted calls to file_in() and file_out().
write_command <- function(...){
  args <- list(...)
  args$inputs <- as.call(list(quote(file_in), args$inputs))
  args$outputs <- as.call(list(quote(file_out), args$outputs))
  c(quote(command_function), args) %>%
    as.call() %>%
    rlang::expr_text()
}

plan <- tibble(
  target = paste0("target_", seq_len(nrow(meta_plan))),
  command = pmap_chr(meta_plan, write_command)
) %>%
  print
#> # A tibble: 2 x 2
#>   target   command                                                         
#>   <chr>    <chr>                                                           
#> 1 target_1 "command_function(cmd = \"cat {inputs} > {outputs}\", inputs = …
#> 2 target_2 "command_function(cmd = \"cat {inputs} {inputs} > {outputs}\", …

Metaprogramming gets much simpler if you do not need to construct literal calls to file_in(), file_out(), etc. in your commands. The construction of model_plan in the gross state product example is one such case.

Thanks to Chris Hammill for presenting this scenario and contributing to the solution.


  1. You can turn the command column of your plan into a character vector (e.g. plan$command <- purrr::map_chr(plan$command, rlang::expr_text)) and drake will still understand you. However, the recommended format is a list of expressions. drake_plan() and friends always supply expression lists.

  2. drake_plan() is the best way to create plans, but you can create plans any way you like. drake will understand plans you create directly using data.frame() or tibble().
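Both notes can be illustrated with base R only. The sketch below builds a plan-shaped data frame by hand with character commands, converts the commands to a list of expressions, and evaluates them in order. The toy loop is an illustration only: it is not how make() works, and it skips drake's caching and dependency tracking.

```r
# A plan-shaped data frame written by hand, with character commands.
plan <- data.frame(
  target = c("fit", "coefs"),
  command = c("lm(mpg ~ wt, data = mtcars)", "coef(fit)"),
  stringsAsFactors = FALSE
)

# Convert the character commands into a list of expressions.
commands <- lapply(plan$command, function(x) parse(text = x)[[1]])

# Toy evaluation in plan order (an illustration only, not drake's make()).
env <- new.env()
for (i in seq_len(nrow(plan))) {
  assign(plan$target[i], eval(commands[[i]], envir = env), envir = env)
}
names(env$coefs)
#> [1] "(Intercept)" "wt"
```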

Copyright Eli Lilly and Company