Chapter 3 drake plans

3.1 What is a drake plan?

A drake plan is a data frame with columns named target and command. Each target is an R object, and each command is an expression to produce it. The drake_plan() function is the best way to set up plans. Recall the plan from the walkthrough:

plan <- drake_plan(
  raw_data = readxl::read_excel(file_in("raw_data.xlsx")),
  data = raw_data %>%
    mutate(Species = forcats::fct_inorder(Species)),
  hist = create_plot(data),
  fit = lm(Sepal.Width ~ Petal.Width + Species, data),
  report = rmarkdown::render(
    knitr_in("report.Rmd"),
    output_file = file_out("report.html"),
    quiet = TRUE
  )
)

plan
#> # A tibble: 5 x 2
#>   target   command                                                         
#>   <chr>    <expr>                                                          
#> 1 raw_data readxl::read_excel(file_in("raw_data.xlsx"))                   …
#> 2 data     raw_data %>% mutate(Species = forcats::fct_inorder(Species))   …
#> 3 hist     create_plot(data)                                              …
#> 4 fit      lm(Sepal.Width ~ Petal.Width + Species, data)                  …
#> 5 report   rmarkdown::render(knitr_in("report.Rmd"), output_file = file_ou…

drake_plan() does not run the workflow; it only creates the plan. To build the actual targets, we need to run make(). Creating the plan is like writing an R script, and running make(your_plan) is like calling source("your_script.R").
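
To make the distinction concrete, here is a minimal base R sketch of the idea. It is not how drake is implemented: toy_plan and toy_make() are hypothetical stand-ins for a plan and for make(). The plan only stores unevaluated commands, and nothing runs until toy_make() evaluates them.

```r
# A toy "plan": target names paired with unevaluated commands.
# (Hypothetical illustration, not a real drake plan.)
toy_plan <- data.frame(target = c("x", "y"), stringsAsFactors = FALSE)
toy_plan$command <- list(quote(1 + 1), quote(x * 2))

# Nothing has run yet: each command is still a language object.
is.call(toy_plan$command[[1]])

# A toy stand-in for make(): evaluate the commands row by row
# and store each result under its target name.
toy_make <- function(plan, envir = new.env()) {
  for (i in seq_len(nrow(plan))) {
    value <- eval(plan$command[[i]], envir)
    assign(plan$target[i], value, envir = envir)
  }
  envir
}

results <- toy_make(toy_plan)
results$y # 4
```

Unlike this toy, make() works out the build order for you and skips targets that are already up to date.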

3.2 Plans are similar to R scripts.

Your drake plan is like a top-level R script that runs everything from end to end. In fact, you can convert back and forth between plans and scripts using the functions plan_to_code() and code_to_plan() (with some caveats).

plan_to_code(plan, "new_script.R")
#> Loading required namespace: styler
cat(readLines("new_script.R"), sep = "\n")
#> raw_data <- readxl::read_excel(file_in("raw_data.xlsx"))
#> data <- raw_data %>% mutate(Species = forcats::fct_inorder(Species))
#> fit <- lm(Sepal.Width ~ Petal.Width + Species, data)
#> hist <- create_plot(data)
#> report <- rmarkdown::render(knitr_in("report.Rmd"),
#>   output_file = file_out("report.html"),
#>   quiet = TRUE
#> )

code_to_plan("new_script.R")
#> # A tibble: 5 x 2
#>   target   command                                                         
#>   <chr>    <expr>                                                          
#> 1 raw_data readxl::read_excel(file_in("raw_data.xlsx"))                   …
#> 2 data     raw_data %>% mutate(Species = forcats::fct_inorder(Species))   …
#> 3 fit      lm(Sepal.Width ~ Petal.Width + Species, data)                  …
#> 4 hist     create_plot(data)                                              …
#> 5 report   rmarkdown::render(knitr_in("report.Rmd"), output_file = file_ou…

And plan_to_notebook() turns plans into R notebooks.

plan_to_notebook(plan, "new_notebook.Rmd")
cat(readLines("new_notebook.Rmd"), sep = "\n")
#> ---
#> title: "My Notebook"
#> output: html_notebook
#> ---
#> 
#> ```{r my_code}
#> raw_data <- readxl::read_excel(file_in("raw_data.xlsx"))
#> data <- raw_data %>% mutate(Species = forcats::fct_inorder(Species))
#> fit <- lm(Sepal.Width ~ Petal.Width + Species, data)
#> hist <- create_plot(data)
#> report <- rmarkdown::render(knitr_in("report.Rmd"),
#>   output_file = file_out("report.html"),
#>   quiet = TRUE
#> )
#> ```

code_to_plan("new_notebook.Rmd")
#> # A tibble: 5 x 2
#>   target   command                                                         
#>   <chr>    <expr>                                                          
#> 1 raw_data readxl::read_excel(file_in("raw_data.xlsx"))                   …
#> 2 data     raw_data %>% mutate(Species = forcats::fct_inorder(Species))   …
#> 3 fit      lm(Sepal.Width ~ Petal.Width + Species, data)                  …
#> 4 hist     create_plot(data)                                              …
#> 5 report   rmarkdown::render(knitr_in("report.Rmd"), output_file = file_ou…

3.3 So why do we use plans?

If you have ever waited more than 10 minutes for an R script to finish, then you know the frustration of having to rerun the whole thing every time you make a change. Plans make life easier.

3.3.1 Plans chop up the work into pieces.

Some targets may need an update while others may not. In the walkthrough, make() was smart enough to skip the data cleaning step and just rebuild the plot and report. drake and its plans compartmentalize the work, and this can save you from wasted effort in the long run.
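
The skipping behavior can be pictured with a small base R sketch. It assumes a deliberately simplified rule (rebuild only when the text of the command changes), and toy_cache and toy_build() are hypothetical names. drake itself tracks much more than this, including the dependencies of each target.

```r
# Toy cache: remembers the command text and value of each target.
toy_cache <- new.env()

# Hypothetical helper: build a target only if its command changed.
toy_build <- function(name, command) {
  key <- paste(deparse(command), collapse = " ")
  old <- get0(name, envir = toy_cache, inherits = FALSE)
  if (!is.null(old) && identical(old$key, key)) {
    return("skipped")
  }
  assign(name, list(key = key, value = eval(command)), envir = toy_cache)
  "built"
}

first  <- toy_build("x", quote(1 + 1)) # "built"
second <- toy_build("x", quote(1 + 1)) # "skipped": command unchanged
third  <- toy_build("x", quote(1 + 2)) # "built": command changed
```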

3.3.2 drake uses plans to schedule your work.

make() automatically learns the build order of your targets and how to run them in parallel. The underlying magic is static code analysis, which automatically detects the dependencies of each target without having to run its command.

create_plot <- function(data) {
  ggplot(data, aes_string(x = "Petal.Width", fill = "Species")) +
    geom_histogram(bins = 20)
}

deps_code(create_plot)
#> # A tibble: 3 x 2
#>   name           type   
#>   <chr>          <chr>  
#> 1 geom_histogram globals
#> 2 ggplot         globals
#> 3 aes_string     globals

deps_code(quote(create_plot(datasets::iris)))
#> # A tibble: 2 x 2
#>   name           type      
#>   <chr>          <chr>     
#> 1 create_plot    globals   
#> 2 datasets::iris namespaced
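
If you are curious what "without having to run its command" can look like, base R offers a rough analogue. The sketch below is an assumption-laden illustration and not the algorithm behind deps_code(): it inspects the body of a hypothetical function my_step() with all.names() and all.vars(), and neither helper() nor offset ever needs to exist.

```r
# A function we never call in this example.
# helper() and offset are hypothetical and deliberately undefined.
my_step <- function(data) {
  helper(data) + offset
}

code <- body(my_step)

# Variables used in the body, minus the function's own arguments.
variables <- setdiff(all.vars(code), names(formals(my_step)))

# Names that appear only in function-call position.
functions <- setdiff(all.names(code), all.vars(code))

variables # "offset"
functions # "{" "+" "helper"
```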

Because of the dependency relationships, row order does not matter once the plan is fully defined. The following plan declares file before plot.

small_plan <- drake_plan(
  file = ggsave(file_out("plot.png"), plot, width = 7, height = 5),
  plot = create_plot(datasets::iris)
)

But file actually depends on plot.

small_config <- drake_config(small_plan)
vis_drake_graph(small_config)

So make() builds plot first.

library(ggplot2)
make(small_plan)
#> target plot
#> target file
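
Here is a toy version of that scheduling step in base R, assuming that a target depends on another whenever the other target's name appears as a variable in its command. The variable names are hypothetical and the loop is a sketch of the general idea, not drake's scheduler.

```r
# Commands listed in the "wrong" order, as in small_plan.
commands <- list(
  file = quote(ggsave("plot.png", plot)),
  plot = quote(create_plot(iris))
)
targets <- names(commands)

# For each target, find the other targets mentioned in its command.
deps <- lapply(commands, function(cmd) intersect(all.vars(cmd), targets))

# Repeatedly pick the targets whose dependencies are already built.
build_order <- character(0)
while (length(build_order) < length(targets)) {
  ready <- Filter(
    function(tgt) !(tgt %in% build_order) && all(deps[[tgt]] %in% build_order),
    targets
  )
  if (length(ready) == 0) stop("circular dependency")
  build_order <- c(build_order, ready)
}

build_order # "plot" "file"
```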

3.4 Special custom columns in your plan

You can add other columns besides the required target and command.

cbind(small_plan, cpu = c(1, 2))
#>   target                                                   command cpu
#> 1   file ggsave(file_out("plot.png"), plot, width = 7, height = 5)   1
#> 2   plot                               create_plot(datasets::iris)   2

Within drake_plan(), target() lets you create any custom column except target, command, and transform, the last of which has a special meaning.

drake_plan(
  file = target(
    ggsave(file_out("plot.png"), plot),
    elapsed = 10
  ),
  create_plot(datasets::iris)
)
#> # A tibble: 2 x 3
#>   target         command                            elapsed
#>   <chr>          <expr>                               <dbl>
#> 1 file           ggsave(file_out("plot.png"), plot)      10
#> 2 drake_target_1 create_plot(datasets::iris)             NA

The following columns have special meanings for make().

  • elapsed and cpu: number of seconds to wait for the target to build before timing out (elapsed for elapsed time and cpu for CPU time).
  • hpc: logical values (TRUE/FALSE/NA) indicating whether to send each target to parallel workers.
  • resources: target-specific lists of resources for a computing cluster. See the advanced options in the parallel computing chapter for details.
  • retries: number of times to retry building a target in the event of an error.
  • trigger: rule to decide whether a target needs to run. See the trigger chapter to learn more.

3.5 Large plans

drake version 7.0.0 introduced new syntax to make it easier to create plans. To try it out before the next CRAN release, install the current development version from GitHub.

install.packages("remotes")
library(remotes)
install_github("ropensci/drake")

3.5.1 How to create large plans

Ordinarily, drake_plan() requires you to write out all the targets one-by-one. This is a literal pain.

drake_plan(
  data = get_data(),
  analysis_1_1 = fit_model_x(data, mean = 1, sd = 1),
  analysis_2_1 = fit_model_x(data, mean = 2, sd = 1),
  analysis_5_1 = fit_model_x(data, mean = 5, sd = 1),
  analysis_10_1 = fit_model_x(data, mean = 10, sd = 1),
  analysis_100_1 = fit_model_x(data, mean = 100, sd = 1),
  analysis_1000_1 = fit_model_x(data, mean = 1000, sd = 1),
  analysis_1_2 = fit_model_x(data, mean = 1, sd = 2),
  analysis_2_2 = fit_model_x(data, mean = 2, sd = 2),
  analysis_5_2 = fit_model_x(data, mean = 5, sd = 2),
  analysis_10_2 = fit_model_x(data, mean = 10, sd = 2),
  analysis_100_2 = fit_model_x(data, mean = 100, sd = 2),
  analysis_1000_2 = fit_model_x(data, mean = 1000, sd = 2),
  # UUUGGGHH my wrists are cramping! :( ...
)

Transformations reduce typing, especially when combined with tidy evaluation (!!).

lots_of_sds <- as.numeric(1:1e3)

drake_plan(
  data = get_data(),
  analysis = target(
    fun(data, mean = mean_val, sd = sd_val),
    transform = cross(mean_val = c(2, 5, 10, 100, 1000), sd_val = !!lots_of_sds)
  )
)
#> # A tibble: 5,001 x 2
#>    target          command                       
#>    <chr>           <expr>                        
#>  1 data            get_data()                    
#>  2 analysis_2_1    fun(data, mean = 2, sd = 1)   
#>  3 analysis_5_1    fun(data, mean = 5, sd = 1)   
#>  4 analysis_10_1   fun(data, mean = 10, sd = 1)  
#>  5 analysis_100_1  fun(data, mean = 100, sd = 1) 
#>  6 analysis_1000_1 fun(data, mean = 1000, sd = 1)
#>  7 analysis_2_2    fun(data, mean = 2, sd = 2)   
#>  8 analysis_5_2    fun(data, mean = 5, sd = 2)   
#>  9 analysis_10_2   fun(data, mean = 10, sd = 2)  
#> 10 analysis_100_2  fun(data, mean = 100, sd = 2) 
#> # … with 4,991 more rows

Behind the scenes during a transformation, drake_plan() creates new columns to track what is happening. You can see them with trace = TRUE.

drake_plan(
  data = get_data(),
  analysis = target(
    analyze(data, mean, sd),
    transform = map(mean = c(3, 4), sd = c(1, 2))
  ),
  trace = TRUE
)
#> # A tibble: 3 x 5
#>   target       command             mean  sd    analysis    
#>   <chr>        <expr>              <chr> <chr> <chr>       
#> 1 data         get_data()          <NA>  <NA>  <NA>        
#> 2 analysis_3_1 analyze(data, 3, 1) 3     1     analysis_3_1
#> 3 analysis_4_2 analyze(data, 4, 2) 4     2     analysis_4_2

Because of those columns, you can chain transformations together in complex pipelines.

plan1 <- drake_plan(
  small = get_small_data(),
  large = get_large_data(),
  analysis = target( # Analyze each dataset once with a different mean.
    analyze(data, mean),
    transform = map(data = c(small, large), mean = c(1, 2))
  ),
  # Calculate 2 different performance metrics on every model fit.
  metric = target(
    metric_fun(analysis),
    # mse = mean squared error, mae = mean absolute error.
    # Assume these are functions you write.
    transform = cross(metric_fun = c(mse, mae), analysis)
  ),
  # Summarize the performance metrics for each dataset.
  summ_data = target(
    summary(metric),
    transform = combine(metric, .by = data)
  ),
  # Same, but for each metric type.
  summ_metric = target(
    summary(metric),
    transform = combine(metric, .by = metric_fun)
  )
)

plan1
#> # A tibble: 12 x 2
#>    target                 command                                          
#>    <chr>                  <expr>                                           
#>  1 small                  get_small_data()                                …
#>  2 large                  get_large_data()                                …
#>  3 analysis_small_1       analyze(small, 1)                               …
#>  4 analysis_large_2       analyze(large, 2)                               …
#>  5 metric_mse_analysis_s… mse(analysis_small_1)                           …
#>  6 metric_mae_analysis_s… mae(analysis_small_1)                           …
#>  7 metric_mse_analysis_l… mse(analysis_large_2)                           …
#>  8 metric_mae_analysis_l… mae(analysis_large_2)                           …
#>  9 summ_data_large        summary(metric_mse_analysis_large_2, metric_mae_…
#> 10 summ_data_small        summary(metric_mse_analysis_small_1, metric_mae_…
#> 11 summ_metric_mae        summary(metric_mae_analysis_small_1, metric_mae_…
#> 12 summ_metric_mse        summary(metric_mse_analysis_small_1, metric_mse_…

config1 <- drake_config(plan1)
vis_drake_graph(config1)

And you can write the transformations in any order. The following plan is equivalent to plan1 despite the rearranged rows.

plan2 <- drake_plan(
  # Summarize the performance metrics for each metric type.
  summ_metric = target(
    summary(metric),
    transform = combine(metric, .by = metric_fun)
  ),
  # Calculate 2 different performance metrics on every model fit.
  metric = target(
    metric_fun(analysis),
    # mse = mean squared error, mae = mean absolute error.
    # Assume these are functions you write.
    transform = cross(metric_fun = c(mse, mae), analysis)
  ),
  small = get_small_data(),
  analysis = target( # Analyze each dataset once with a different mean.
    analyze(data, mean),
    transform = map(data = c(small, large), mean = c(1, 2))
  ),
  # Summarize the performance metrics for each dataset.
  summ_data = target(
    summary(metric),
    transform = combine(metric, .by = data)
  ),
  large = get_large_data()
)

plan2
#> # A tibble: 12 x 2
#>    target                 command                                          
#>    <chr>                  <expr>                                           
#>  1 summ_metric_mae        summary(metric_mae_analysis_small_1, metric_mae_…
#>  2 summ_metric_mse        summary(metric_mse_analysis_small_1, metric_mse_…
#>  3 metric_mse_analysis_s… mse(analysis_small_1)                           …
#>  4 metric_mae_analysis_s… mae(analysis_small_1)                           …
#>  5 metric_mse_analysis_l… mse(analysis_large_2)                           …
#>  6 metric_mae_analysis_l… mae(analysis_large_2)                           …
#>  7 small                  get_small_data()                                …
#>  8 analysis_small_1       analyze(small, 1)                               …
#>  9 analysis_large_2       analyze(large, 2)                               …
#> 10 summ_data_large        summary(metric_mse_analysis_large_2, metric_mae_…
#> 11 summ_data_small        summary(metric_mse_analysis_small_1, metric_mae_…
#> 12 large                  get_large_data()                                …

config2 <- drake_config(plan2)
vis_drake_graph(config2)

3.5.2 Start small

To speed up initial testing and experimentation, you may want to limit the number of extra targets created by the map() and cross() transformations. Simply set max_expand in drake_plan().

plan <- drake_plan(
  data = target(
    get_data(source),
    transform = map(source = !!seq_len(25))
  ),
  analysis = target(
    fn(data, param),
    transform = cross(
      data,
      fn = !!letters,
      param = !!seq_len(25)
    )
  ),
  result = target(
    bind_rows(analysis),
    transform = combine(analysis, .by = fn)
  ),
  max_expand = 3
)

plan
#> # A tibble: 33 x 2
#>    target                 command        
#>    <chr>                  <expr>         
#>  1 data_1L                get_data(1L)   
#>  2 data_13L               get_data(13L)  
#>  3 data_25L               get_data(25L)  
#>  4 analysis_a_1L_data_1L  a(data_1L, 1L) 
#>  5 analysis_m_1L_data_1L  m(data_1L, 1L) 
#>  6 analysis_z_1L_data_1L  z(data_1L, 1L) 
#>  7 analysis_a_13L_data_1L a(data_1L, 13L)
#>  8 analysis_m_13L_data_1L m(data_1L, 13L)
#>  9 analysis_z_13L_data_1L z(data_1L, 13L)
#> 10 analysis_a_25L_data_1L a(data_1L, 25L)
#> # … with 23 more rows

We can more easily inspect the graph to get an idea of the shape of our workflow.

config <- drake_config(plan)
vis_drake_graph(config)

We can even run make(plan) and verify that the subset of targets in the plan turns out okay.

make(plan)                       # Run the small subset of targets we kept.
loadd()                          # Load all those targets into memory.
summary(analysis_z_25L_data_13L) # Look at the values of those targets
                                 # and decide if we are ready to scale up.

Afterwards, we can scale up to the full collection of targets.

plan <- drake_plan(
  data = target(
    get_data(source),
    transform = map(source = !!seq_len(25))
  ),
  analysis = target(
    fn(data, param),
    transform = cross(
      data,
      fn = !!letters,
      param = !!seq_len(25)
    )
  ),
  result = target(
    bind_rows(analysis),
    transform = combine(analysis, .by = fn)
  )
)

nrow(plan)
#> [1] 16301

The full plan is enormous! Here, vis_drake_graph() would be prohibitively slow and eat up too much memory to show anything useful. Good thing we checked the scaled-down version first.

# config <- drake_config(plan) # Takes a long time.
# vis_drake_graph(config)      # Takes too much time and too much memory.

The next make() will skip the targets we ran before in test mode if they are still up to date. In other words, experimenting on a downsized plan gave us a head start.

make(plan) # Full scaled-up workflow. Takes much longer.

3.5.3 The types of transformations

drake supports three types of transformations: map(), cross(), and combine(). These are not actual functions, but you can treat them as functions when you use them in drake_plan(). Each transformation takes after a function from the Tidyverse.

drake       Tidyverse analogue
map()       pmap() from purrr
cross()     crossing() from tidyr
combine()   summarize() from dplyr

3.5.3.1 map()

map() creates a new target for each row in a grid.

drake_plan(
  x = target(
    simulate_data(center, scale),
    transform = map(center = c(2, 1, 0), scale = c(3, 2, 1))
  )
)
#> # A tibble: 3 x 2
#>   target command            
#>   <chr>  <expr>             
#> 1 x_2_3  simulate_data(2, 3)
#> 2 x_1_2  simulate_data(1, 2)
#> 3 x_0_1  simulate_data(0, 1)

You can supply your own custom grid using the .data argument. Note the use of !! below.

my_grid <- tibble(
  sim_function = c("rnorm", "rt", "rcauchy"),
  title = c("Normal", "Student t", "Cauchy")
)
my_grid$sim_function <- rlang::syms(my_grid$sim_function)

drake_plan(
  x = target(
    simulate_data(sim_function, title, center, scale),
    transform = map(
      center = c(2, 1, 0),
      scale = c(3, 2, 1),
      .data = !!my_grid,
      # In `.id`, you can select one or more grouping variables
      # for pretty target names.
      # Set to FALSE to use short numeric suffixes.
      .id = sim_function # Try `.id = c(sim_function, center)` yourself.
    )
  )
)
#> # A tibble: 3 x 2
#>   target    command                               
#>   <chr>     <expr>                                
#> 1 x_rnorm   simulate_data(rnorm, "Normal", 2, 3)  
#> 2 x_rt      simulate_data(rt, "Student t", 1, 2)  
#> 3 x_rcauchy simulate_data(rcauchy, "Cauchy", 0, 1)

3.5.3.2 Special considerations in map()

map() column-binds variables together to create a grid. The lengths of those variables need to be conformable just as with data.frame().

drake_plan(
  x = target(
    simulate_data(center, scale),
    transform = map(center = c(2, 1, 0), scale = c(3, 2))
  )
)
#> Error: Failed to make a grid of grouping variables for map().
#> Grouping variables in map() must have suitable lengths for coercion to a data frame.
#> Possibly uneven groupings detected in map(center = c(2, 1, 0), scale = c(3, 2)):
#>   c("2", "1", "0")
#>   c("3", "2")

Sometimes, the results are sensible when grouping variable lengths are multiples of each other, but be careful.

drake_plan(
  x = target(
    simulate_data(center, scale),
    transform = map(center = c(2, 1, 0), scale = 4)
  )
)
#> # A tibble: 3 x 2
#>   target command            
#>   <chr>  <expr>             
#> 1 x_2_4  simulate_data(2, 4)
#> 2 x_1_4  simulate_data(1, 4)
#> 3 x_0_4  simulate_data(0, 4)
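
Since the conformability rule is the one base R applies in data.frame(), you can preview how a set of grouping variables will line up before you write the plan. The sketch below uses only base R. It mimics the grid, not drake's internals, and the target names are pasted together by hand for illustration.

```r
# Lengths 3 and 2 do not line up, so data.frame() refuses.
uneven <- tryCatch(
  data.frame(center = c(2, 1, 0), scale = c(3, 2)),
  error = function(e) conditionMessage(e)
)
is.character(uneven) # TRUE: we caught an error message, not a grid

# Length 1 is recycled to length 3, one row per target.
recycled <- data.frame(center = c(2, 1, 0), scale = 4)
preview <- paste("x", recycled$center, recycled$scale, sep = "_")
preview # "x_2_4" "x_1_4" "x_0_4"
```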

Things get trickier when drake reuses grouping variables from previous transformations. For example, below, each x_* target has an associated center value. So if you write transform = map(x), then center goes along for the ride.

drake_plan(
  x = target(
    simulate_data(center),
    transform = map(center = c(1, 2))
  ),
  y = target(
    process_data(x, center),
    transform = map(x)
  ),
  trace = TRUE # Adds extra columns for the grouping variables.
)
#> # A tibble: 4 x 5
#>   target command              center x     y    
#>   <chr>  <expr>               <chr>  <chr> <chr>
#> 1 x_1    simulate_data(1)     1      x_1   <NA> 
#> 2 x_2    simulate_data(2)     2      x_2   <NA> 
#> 3 y_x_1  process_data(x_1, 1) 1      x_1   y_x_1
#> 4 y_x_2  process_data(x_2, 2) 2      x_2   y_x_2

But if other targets have center values of their own, drake_plan() may not know what to do with them.

drake_plan(
  w = target(
    simulate_data(center),
    transform = map(center = c(3, 4))
  ),
  x = target(
    simulate_data_2(center),
    transform = map(center = c(1, 2))
  ),
  y = target(
    process_data(w, x, center),
    transform = map(w, x)
  ),
  trace = TRUE
)
#> # A tibble: 6 x 6
#>   target    command                    center w     x     y        
#>   <chr>     <expr>                     <chr>  <chr> <chr> <chr>    
#> 1 w_3       simulate_data(3)           3      w_3   <NA>  <NA>     
#> 2 w_4       simulate_data(4)           4      w_4   <NA>  <NA>     
#> 3 x_1       simulate_data_2(1)         1      <NA>  x_1   <NA>     
#> 4 x_2       simulate_data_2(2)         2      <NA>  x_2   <NA>     
#> 5 y_w_3_x_1 process_data(w_3, x_1, NA) <NA>   w_3   x_1   y_w_3_x_1
#> 6 y_w_4_x_2 process_data(w_4, x_2, NA) <NA>   w_4   x_2   y_w_4_x_2

The problem is that there are 4 values of center and only two x_* targets (and two y_* targets). Even if you explicitly supply center to the transformation, map() can only take the first two values.

drake_plan(
  w = target(
    simulate_data(center),
    transform = map(center = c(3, 4))
  ),
  x = target(
    simulate_data_2(center),
    transform = map(center = c(1, 2))
  ),
  y = target(
    process_data(w, x, center),
    transform = map(w, x, center)
  ),
  trace = TRUE
)
#> # A tibble: 6 x 6
#>   target      command                   center w     x     y          
#>   <chr>       <expr>                    <chr>  <chr> <chr> <chr>      
#> 1 w_3         simulate_data(3)          3      w_3   <NA>  <NA>       
#> 2 w_4         simulate_data(4)          4      w_4   <NA>  <NA>       
#> 3 x_1         simulate_data_2(1)        1      <NA>  x_1   <NA>       
#> 4 x_2         simulate_data_2(2)        2      <NA>  x_2   <NA>       
#> 5 y_w_3_x_1_3 process_data(w_3, x_1, 3) 3      w_3   x_1   y_w_3_x_1_3
#> 6 y_w_4_x_2_4 process_data(w_4, x_2, 4) 4      w_4   x_2   y_w_4_x_2_4

So please inspect the plan before you run it with make(). Once you have a drake_config() object, vis_drake_graph() and deps_target() can help.

3.5.3.3 cross()

cross() creates a new target for each combination of argument values.

drake_plan(
  x = target(
    simulate_data(nrow, ncol),
    transform = cross(nrow = c(1, 2, 3), ncol = c(4, 5))
  )
)
#> # A tibble: 6 x 2
#>   target command            
#>   <chr>  <expr>             
#> 1 x_1_4  simulate_data(1, 4)
#> 2 x_2_4  simulate_data(2, 4)
#> 3 x_3_4  simulate_data(3, 4)
#> 4 x_1_5  simulate_data(1, 5)
#> 5 x_2_5  simulate_data(2, 5)
#> 6 x_3_5  simulate_data(3, 5)
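
For intuition, expand.grid() in base R builds the same kind of grid. This is only an analogy under the assumption that target names are the grouping values joined by underscores, as in the output above. It does not call drake.

```r
# Every combination of nrow and ncol, with the first variable varying fastest.
grid <- expand.grid(nrow = c(1, 2, 3), ncol = c(4, 5))

# Hand-built previews of the target names and commands.
target_names <- paste("x", grid$nrow, grid$ncol, sep = "_")
commands <- sprintf("simulate_data(%s, %s)", grid$nrow, grid$ncol)

target_names # "x_1_4" "x_2_4" "x_3_4" "x_1_5" "x_2_5" "x_3_5"
commands[1]  # "simulate_data(1, 4)"
```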

3.5.3.4 combine()

In combine(), you can insert multiple targets into individual commands. The closest comparison is the unquote-splice operator !!! from the Tidyverse.

plan <- drake_plan(
  data = target(
    sim_data(mean = x, sd = y),
    transform = map(x = c(1, 2), y = c(3, 4))
  ),
  larger = target(
    bind_rows(data, .id = "id") %>%
      arrange(sd) %>%
      head(n = 400),
    transform = combine(data)
  )
)

plan
#> # A tibble: 3 x 2
#>   target   command                                                         
#>   <chr>    <expr>                                                          
#> 1 data_1_3 sim_data(mean = 1, sd = 3)                                     …
#> 2 data_2_4 sim_data(mean = 2, sd = 4)                                     …
#> 3 larger   bind_rows(data_1_3, data_2_4, .id = "id") %>% arrange(sd) %>%  …

drake_plan_source(plan)
#> drake_plan(
#>   data_1_3 = sim_data(mean = 1, sd = 3),
#>   data_2_4 = sim_data(mean = 2, sd = 4),
#>   larger = bind_rows(data_1_3, data_2_4, .id = "id") %>%
#>     arrange(sd) %>%
#>     head(n = 400)
#> )

config <- drake_config(plan)
vis_drake_graph(config)

You can combine different groups of targets in the same command.

plan <- drake_plan(
  data_group1 = target(
    sim_data(mean = x, sd = y),
    transform = map(x = c(1, 2), y = c(3, 4))
  ),
  data_group2 = target(
    pull_data(url),
    transform = map(url = c("example1.com", "example2.com"))
  ),
  larger = target(
    bind_rows(data_group1, data_group2, .id = "id") %>%
      arrange(sd) %>%
      head(n = 400),
    transform = combine(data_group1, data_group2)
  )
)

drake_plan_source(plan)
#> drake_plan(
#>   data_group1_1_3 = sim_data(mean = 1, sd = 3),
#>   data_group1_2_4 = sim_data(mean = 2, sd = 4),
#>   data_group2_example1.com = pull_data("example1.com"),
#>   data_group2_example2.com = pull_data("example2.com"),
#>   larger = bind_rows(data_group1_1_3, data_group1_2_4, data_group2_example1.com,
#>     data_group2_example2.com,
#>     .id = "id"
#>   ) %>%
#>     arrange(sd) %>%
#>     head(n = 400)
#> )

And as with group_by() from dplyr, you can create a separate aggregate for each combination of levels of the arguments. Just pass a symbol or vector of symbols to the optional .by argument of combine().

plan <- drake_plan(
  data = target(
    sim_data(mean = x, sd = y, skew = z),
    transform = cross(x = c(1, 2), y = c(3, 4), z = c(5, 6))
  ),
  combined = target(
    bind_rows(data, .id = "id") %>%
      arrange(sd) %>%
      head(n = 400),
    transform = combine(data, .by = c(x, y))
  )
)

drake_plan_source(plan)
#> drake_plan(
#>   data_1_3_5 = sim_data(mean = 1, sd = 3, skew = 5),
#>   data_2_3_5 = sim_data(mean = 2, sd = 3, skew = 5),
#>   data_1_4_5 = sim_data(mean = 1, sd = 4, skew = 5),
#>   data_2_4_5 = sim_data(mean = 2, sd = 4, skew = 5),
#>   data_1_3_6 = sim_data(mean = 1, sd = 3, skew = 6),
#>   data_2_3_6 = sim_data(mean = 2, sd = 3, skew = 6),
#>   data_1_4_6 = sim_data(mean = 1, sd = 4, skew = 6),
#>   data_2_4_6 = sim_data(mean = 2, sd = 4, skew = 6),
#>   combined_1_3 = bind_rows(data_1_3_5, data_1_3_6, .id = "id") %>%
#>     arrange(sd) %>%
#>     head(n = 400),
#>   combined_2_3 = bind_rows(data_2_3_5, data_2_3_6, .id = "id") %>%
#>     arrange(sd) %>%
#>     head(n = 400),
#>   combined_1_4 = bind_rows(data_1_4_5, data_1_4_6, .id = "id") %>%
#>     arrange(sd) %>%
#>     head(n = 400),
#>   combined_2_4 = bind_rows(data_2_4_5, data_2_4_6, .id = "id") %>%
#>     arrange(sd) %>%
#>     head(n = 400)
#> )
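
The grouping in the output above can be reproduced by hand with split() in base R. The sketch is only a way to check which targets should land in which aggregate. The names are built with paste() under the assumption that they follow the pattern shown above.

```r
# The same grid of grouping variables as in the cross() call.
grid <- expand.grid(x = c(1, 2), y = c(3, 4), z = c(5, 6))
grid$target <- paste("data", grid$x, grid$y, grid$z, sep = "_")

# One group per combination of x and y, like .by = c(x, y).
groups <- split(
  grid$target,
  paste("combined", grid$x, grid$y, sep = "_")
)

names(groups)            # "combined_1_3" "combined_1_4" "combined_2_3" "combined_2_4"
groups[["combined_1_3"]] # "data_1_3_5" "data_1_3_6"
```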

In your post-processing, you may need the values of x and y that underlie data_1_3 and data_2_4. Solution: get the trace and the target names. We define a new plan

plan <- drake_plan(
  data = target(
    sim_data(mean = x, sd = y),
    transform = map(x = c(1, 2), y = c(3, 4))
  ),
  larger = target(
    post_process(data, plan = ignore(plan)) %>%
      arrange(sd) %>%
      head(n = 400),
    transform = combine(data)
  ),
  trace = TRUE
)

drake_plan_source(plan)
#> drake_plan(
#>   data_1_3 = target(
#>     command = sim_data(mean = 1, sd = 3),
#>     x = "1",
#>     y = "3",
#>     data = "data_1_3"
#>   ),
#>   data_2_4 = target(
#>     command = sim_data(mean = 2, sd = 4),
#>     x = "2",
#>     y = "4",
#>     data = "data_2_4"
#>   ),
#>   larger = target(
#>     command = post_process(data_1_3, data_2_4, plan = ignore(plan)) %>%
#>       arrange(sd) %>%
#>       head(n = 400),
#>     larger = "larger"
#>   )
#> )

and a new function

post_process <- function(..., plan) {
  args <- list(...)
  names(args) <- all.vars(substitute(list(...)))
  trace <- filter(plan, target %in% names(args))
  # Do post-processing with args and trace.
}
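
The key trick in post_process() is recovering the target names from the unevaluated arguments. The standalone sketch below shows just that step in base R, with a hypothetical capture_names() function and a hand-written toy_trace data frame standing in for a real plan. It works when every argument is a bare symbol, as in the generated command above.

```r
# Hypothetical helper: name each argument after the symbol that supplied it.
capture_names <- function(...) {
  args <- list(...)
  names(args) <- all.vars(substitute(list(...)))
  args
}

data_1_3 <- 10
data_2_4 <- 20
out <- capture_names(data_1_3, data_2_4)
names(out) # "data_1_3" "data_2_4"

# A toy trace with the same columns as the traced plan above.
toy_trace <- data.frame(
  target = c("data_1_3", "data_2_4", "larger"),
  x = c("1", "2", NA),
  y = c("3", "4", NA),
  stringsAsFactors = FALSE
)

# Keep only the rows for the targets that were passed in.
trace <- toy_trace[toy_trace$target %in% names(out), ]
trace$x # "1" "2"
```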

3.5.4 Grouping variables

A grouping variable is an argument to map(), cross(), or combine() that identifies a sub-collection of target names. Grouping variables can be either literals or symbols. Symbols can be scalars or vectors, and you can pass them to transformations with or without argument names.

3.5.4.1 Literal arguments

When you pass a grouping variable of literals, you must use an explicit argument name. One does not simply write map(c(1, 2)).

drake_plan(x = target(sqrt(y), transform = map(y = c(1, 2))))
#> # A tibble: 2 x 2
#>   target command
#>   <chr>  <expr> 
#> 1 x_1    sqrt(1)
#> 2 x_2    sqrt(2)

And if you supply integer sequences the usual way, you may notice some rows are missing.

drake_plan(x = target(sqrt(y), transform = map(y = 1:3)))
#> # A tibble: 2 x 2
#>   target command
#>   <chr>  <expr> 
#> 1 x_1    sqrt(1)
#> 2 x_3    sqrt(3)

Tidy evaluation and as.numeric() make sure all the data points show up.

y_vals <- as.numeric(1:3)
drake_plan(x = target(sqrt(y), transform = map(y = !!y_vals)))
#> # A tibble: 3 x 2
#>   target command
#>   <chr>  <expr> 
#> 1 x_1    sqrt(1)
#> 2 x_2    sqrt(2)
#> 3 x_3    sqrt(3)
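
One plausible way to read the missing row, offered here as an assumption rather than a statement about drake's internals: inside a plan, 1:3 is an unevaluated call to `:` with just two arguments, while a spliced vector carries all three values. The base R sketch below uses bquote() as a stand-in for !!.

```r
# As written in the plan, 1:3 is a call with two arguments: 1 and 3.
literal <- quote(1:3)
literal_args <- as.list(literal)[-1]
unlist(literal_args) # 1 3

# Splicing the evaluated vector puts all three values into the expression.
y_vals <- as.numeric(1:3)
spliced <- bquote(map(y = .(y_vals)))
deparse(spliced) # "map(y = c(1, 2, 3))"
```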

Character vectors usually work without a hitch, and quotes are converted into dots to make valid target names.

drake_plan(x = target(get_data(y), transform = map(y = c("a", "b", "c"))))
#> # A tibble: 3 x 2
#>   target command      
#>   <chr>  <expr>       
#> 1 x_a    get_data("a")
#> 2 x_b    get_data("b")
#> 3 x_c    get_data("c")

y_vals <- letters
drake_plan(x = target(get_data(y), transform = map(y = !!y_vals)))
#> # A tibble: 26 x 2
#>    target command      
#>    <chr>  <expr>       
#>  1 x_a    get_data("a")
#>  2 x_b    get_data("b")
#>  3 x_c    get_data("c")
#>  4 x_d    get_data("d")
#>  5 x_e    get_data("e")
#>  6 x_f    get_data("f")
#>  7 x_g    get_data("g")
#>  8 x_h    get_data("h")
#>  9 x_i    get_data("i")
#> 10 x_j    get_data("j")
#> # … with 16 more rows

3.5.4.2 Named symbol arguments

Symbols passed with explicit argument names define new groupings of existing targets on the fly, and only the map() and cross() transformations can accept them this way. To generate long symbol lists, use the syms() function from the rlang package. Remember to use the tidy evaluation operator !! inside the transformation.

vals <- rlang::syms(letters)
drake_plan(x = target(get_data(y), transform = map(y = !!vals)))
#> # A tibble: 26 x 2
#>    target command    
#>    <chr>  <expr>     
#>  1 x_a    get_data(a)
#>  2 x_b    get_data(b)
#>  3 x_c    get_data(c)
#>  4 x_d    get_data(d)
#>  5 x_e    get_data(e)
#>  6 x_f    get_data(f)
#>  7 x_g    get_data(g)
#>  8 x_h    get_data(h)
#>  9 x_i    get_data(i)
#> 10 x_j    get_data(j)
#> # … with 16 more rows
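
The difference between a character literal and a symbol is easy to see in base R. In this sketch, bquote() plays the role of !! and as.name() plays the role of a single rlang::sym() call. Note that get_data() is never called, only quoted.

```r
# Inserting a string yields a quoted literal in the command.
as_string <- bquote(get_data(.("a")))
deparse(as_string) # "get_data(\"a\")"

# Inserting a symbol yields a reference to an object named a.
as_symbol <- bquote(get_data(.(as.name("a"))))
deparse(as_symbol) # "get_data(a)"

# A base R counterpart of rlang::syms(letters): a list of 26 symbols.
vals <- lapply(letters, as.name)
is.symbol(vals[[1]]) # TRUE
```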

The new groupings carry over to downstream targets by default, which you can see with trace = TRUE. Below, the rows for targets w_x and w_y have entries in the z column.

drake_plan(
  x = abs(mean(rnorm(10))),
  y = abs(mean(rnorm(100, 1))),
  z = target(sqrt(val), transform = map(val = c(x, y))),
  w = target(val + 1, transform = map(val)),
  trace = TRUE
)
#> # A tibble: 6 x 5
#>   target command                  val   z     w    
#>   <chr>  <expr>                   <chr> <chr> <chr>
#> 1 x      abs(mean(rnorm(10)))     <NA>  <NA>  <NA> 
#> 2 y      abs(mean(rnorm(100, 1))) <NA>  <NA>  <NA> 
#> 3 z_x    sqrt(x)                  x     z_x   <NA> 
#> 4 z_y    sqrt(y)                  y     z_y   <NA> 
#> 5 w_x    x + 1                    x     z_x   w_x  
#> 6 w_y    y + 1                    y     z_y   w_y

However, this is incorrect because w does not depend on z_x or z_y. So for w, you should write map(val = c(x, y)) instead of map(val) to tell drake to clear the trace. Then, you will see NAs in the z column for w_x and w_y, which is right and proper.

drake_plan(
  x = abs(mean(rnorm(10))),
  y = abs(mean(rnorm(100, 1))),
  z = target(sqrt(val), transform = map(val = c(x, y))),
  w = target(val + 1, transform = map(val = c(x, y))),
  trace = TRUE
)
#> # A tibble: 6 x 5
#>   target command                  val   z     w    
#>   <chr>  <expr>                   <chr> <chr> <chr>
#> 1 x      abs(mean(rnorm(10)))     <NA>  <NA>  <NA> 
#> 2 y      abs(mean(rnorm(100, 1))) <NA>  <NA>  <NA> 
#> 3 z_x    sqrt(x)                  x     z_x   <NA> 
#> 4 z_y    sqrt(y)                  y     z_y   <NA> 
#> 5 w_x    x + 1                    x     <NA>  w_x  
#> 6 w_y    y + 1                    y     <NA>  w_y
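
To compare the two traces without reading the printed tibbles, one option is to extract the z column for the w targets from each plan. This is a minimal sketch that rebuilds both plans from this section; the object names inherited and cleared are chosen here for illustration, and the values in the comments are taken from the printed plans above.

```r
library(drake)

# map(val) reuses the existing grouping, so the trace from z carries over.
inherited <- drake_plan(
  x = abs(mean(rnorm(10))),
  y = abs(mean(rnorm(100, 1))),
  z = target(sqrt(val), transform = map(val = c(x, y))),
  w = target(val + 1, transform = map(val)),
  trace = TRUE
)

# map(val = c(x, y)) defines the grouping afresh, which clears the trace.
cleared <- drake_plan(
  x = abs(mean(rnorm(10))),
  y = abs(mean(rnorm(100, 1))),
  z = target(sqrt(val), transform = map(val = c(x, y))),
  w = target(val + 1, transform = map(val = c(x, y))),
  trace = TRUE
)

# z entries for w_x and w_y in each plan.
inherited$z[grepl("^w_", inherited$target)]  # "z_x" "z_y" in the first plan above
cleared$z[grepl("^w_", cleared$target)]      # NA NA in the second plan above
```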

3.5.5 Tags

Tags are special optional grouping variables. They are ignored while the transformation is happening and then added to the plan to help subsequent transformations. There are two types of tags:

  1. In-tags, which contain the target name you start with, and
  2. Out-tags, which contain the target names generated by the transformations.

drake_plan(
  x = target(
    command,
    transform = map(y = c(1, 2), .tag_in = from, .tag_out = c(to, out))
  ),
  trace = TRUE
)
#> # A tibble: 2 x 7
#>   target command y     x     from  to    out  
#>   <chr>  <expr>  <chr> <chr> <chr> <chr> <chr>
#> 1 x_1    command 1     x_1   x     x_1   x_1  
#> 2 x_2    command 2     x_2   x     x_2   x_2

Subsequent transformations can use tags as grouping variables and add to existing tags.

plan <- drake_plan(
  prep_work = do_prep_work(),
  local = target(
    get_local_data(n, prep_work),
    transform = map(n = c(1, 2), .tag_in = data_source, .tag_out = data)
  ),
  online = target(
    get_online_data(n, prep_work, port = "8080"),
    transform = map(n = c(1, 2), .tag_in = data_source, .tag_out = data)
  ),
  summary = target(
    summarize(bind_rows(data, .id = "data")),
    transform = combine(data, .by = data_source)
  ),
  munged = target(
    munge(bind_rows(data, .id = "data")),
    transform = combine(data, .by = n)
  )
)

plan
#> # A tibble: 9 x 2
#>   target         command                                               
#>   <chr>          <expr>                                                
#> 1 prep_work      do_prep_work()                                        
#> 2 local_1        get_local_data(1, prep_work)                          
#> 3 local_2        get_local_data(2, prep_work)                          
#> 4 online_1       get_online_data(1, prep_work, port = "8080")          
#> 5 online_2       get_online_data(2, prep_work, port = "8080")          
#> 6 summary_local  summarize(bind_rows(local_1, local_2, .id = "data"))  
#> 7 summary_online summarize(bind_rows(online_1, online_2, .id = "data"))
#> 8 munged_1       munge(bind_rows(local_1, online_1, .id = "data"))     
#> 9 munged_2       munge(bind_rows(local_2, online_2, .id = "data"))

config <- drake_config(plan)
vis_drake_graph(config)


3.5.6 Target names

All transformations have an optional .id argument to control the names of targets. Use it to select the grouping variables that go into the names, as well as the order they appear in the suffixes.

drake_plan(
  data = target(
    get_data(param1, param2),
    transform = map(
      param1 = c(123, 456),
      param2 = c(7, 9),
      param3 = c("abc", "xyz"),
      .id = param2
    )
  )
)
#> # A tibble: 2 x 2
#>   target command         
#>   <chr>  <expr>          
#> 1 data_7 get_data(123, 7)
#> 2 data_9 get_data(456, 9)

drake_plan(
  data = target(
    get_data(param1, param2),
    transform = map(
      param1 = c(123, 456),
      param2 = c(7, 9),
      param3 = c("abc", "xyz"),
      .id = c(param2, param1)
    )
  )
)
#> # A tibble: 2 x 2
#>   target     command         
#>   <chr>      <expr>          
#> 1 data_7_123 get_data(123, 7)
#> 2 data_9_456 get_data(456, 9)

drake_plan(
  data = target(
    get_data(param1, param2),
    transform = map(
      param1 = c(123, 456),
      param2 = c(7, 9),
      param3 = c("abc", "xyz"),
      .id = c(param1, param2)
    )
  )
)
#> # A tibble: 2 x 2
#>   target     command         
#>   <chr>      <expr>          
#> 1 data_123_7 get_data(123, 7)
#> 2 data_456_9 get_data(456, 9)

Set .id to FALSE to ignore the grouping variables altogether.

drake_plan(
  data = target(
    get_data(param1, param2),
    transform = map(
      param1 = c(123, 456),
      param2 = c(7, 9),
      param3 = c("abc", "xyz"),
      .id = FALSE
    )
  )
)
#> # A tibble: 2 x 2
#>   target command         
#>   <chr>  <expr>          
#> 1 data   get_data(123, 7)
#> 2 data_2 get_data(456, 9)

Finally, drake supports a special .id_chr symbol in commands to let you refer to the name of the current target as a character string.

as_chr <- function(x) {
  deparse(substitute(x))
}
plan <- drake_plan(
  data = target(
    get_data(param),
    transform = map(param = c(123, 456))
  ),
  keras_model = target(
    save_model_hdf5(fit_model(data), file_out(!!sprintf("%s.h5", .id_chr))),
    transform = map(data, .id = param)
  ),
  result = target(
    predict(load_model_hdf5(file_in(!!sprintf("%s.h5", as_chr(keras_model))))),
    transform = map(keras_model, .id = param)
  )
)

plan
#> # A tibble: 6 x 2
#>   target         command                                                   
#>   <chr>          <expr>                                                    
#> 1 data_123       get_data(123)                                            …
#> 2 data_456       get_data(456)                                            …
#> 3 keras_model_1… save_model_hdf5(fit_model(data_123), file_out("keras_mode…
#> 4 keras_model_4… save_model_hdf5(fit_model(data_456), file_out("keras_mode…
#> 5 result_123     predict(load_model_hdf5(file_in("keras_model_123.h5")))  …
#> 6 result_456     predict(load_model_hdf5(file_in("keras_model_456.h5")))  …

drake_plan_source(plan)
#> drake_plan(
#>   data_123 = get_data(123),
#>   data_456 = get_data(456),
#>   keras_model_123 = save_model_hdf5(fit_model(data_123), file_out("keras_model_123.h5")),
#>   keras_model_456 = save_model_hdf5(fit_model(data_456), file_out("keras_model_456.h5")),
#>   result_123 = predict(load_model_hdf5(file_in("keras_model_123.h5"))),
#>   result_456 = predict(load_model_hdf5(file_in("keras_model_456.h5")))
#> )
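
A smaller sketch of the same idea, without the keras dependency, may make the substitution easier to see. Here write_report() is a hypothetical user function that is never called, and the .txt file names are an assumption made for illustration. As in the example above, .id_chr is replaced by the target name first, and !! then evaluates paste0() so that file_out() receives a literal string.

```r
library(drake)

plan <- drake_plan(
  report = target(
    # write_report() is a hypothetical placeholder; drake_plan() only records the call.
    write_report(n, file_out(!!paste0(.id_chr, ".txt"))),
    transform = map(n = c(1, 2))
  )
)

# Each command should now name an output file derived from its own target name.
plan$target
plan$command
```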

  1. You can turn the command column of your plan into a character vector (e.g. plan$command <- purrr::map_chr(plan$command, rlang::expr_text)) and drake will still understand you. However, the recommended format is a list of expressions. drake_plan() and friends always supply expression lists.

  2. drake_plan() is the best way to create plans, but you can create plans any way you like. drake will understand plans you create directly using data.frame() or tibble().
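
As a rough illustration of the two notes above, a plan can be assembled by hand as an ordinary data frame with character commands. This is a sketch under assumptions: the targets small and large and their commands are made up for the example, and only base R is used to build and sanity-check the plan.

```r
# A hand-built plan: one row per target, with commands stored as strings.
plan <- data.frame(
  target = c("small", "large"),
  command = c("head(mtcars, 5)", "head(mtcars, 20)"),
  stringsAsFactors = FALSE
)

# Sanity check: each command should parse as exactly one R expression.
parsed <- lapply(plan$command, function(cmd) parse(text = cmd))
lengths(parsed)

# With drake installed, drake::make(plan) would then build `small` and `large`.
```

Plans built with drake_plan() remain the recommended route, since they store commands as expression lists rather than strings.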

Copyright Eli Lilly and Company