Chapter 6 Workflow plan data frames

6.1 What is a workflow plan data frame?

Your workflow plan data frame is the object where you declare all the objects and files your project will produce when you run it. It lists each output R object, or target, along with the command that will produce it. Here is the workflow plan from our previous example.

plan <- drake_plan(
  raw_data = readxl::read_excel(file_in("raw_data.xlsx")),
  data = raw_data %>%
    mutate(Species = forcats::fct_inorder(Species)) %>%
    select(-X__1),
  hist = create_plot(data),
  fit = lm(Sepal.Width ~ Petal.Width + Species, data),
  report = rmarkdown::render(
    knitr_in("report.Rmd"),
    output_file = file_out("report.html"),
    quiet = TRUE
  )
)
plan
## # A tibble: 5 x 2
##   target   command                                                        
##   <chr>    <chr>                                                          
## 1 raw_data "readxl::read_excel(file_in(\"raw_data.xlsx\"))"               
## 2 data     "raw_data %>% mutate(Species = forcats::fct_inorder(Species)) …
## 3 hist     create_plot(data)                                              
## 4 fit      lm(Sepal.Width ~ Petal.Width + Species, data)                  
## 5 report   "rmarkdown::render(knitr_in(\"report.Rmd\"), output_file = fil…

When you run make(plan), drake will produce targets raw_data, data, hist, fit, and report.
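
Once make(plan) finishes, the results live in drake's cache, and you can retrieve them with readd() and loadd(). Here is a minimal sketch, assuming the functions and input files from the previous example are available.

library(drake)
make(plan)  # Build all five targets in dependency order.
readd(fit)  # Return the fitted model from the cache.
loadd(data) # Load the cleaned dataset into your environment.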

6.2 Plans are like R scripts.

Your workflow plan data frame is like your top-level “run everything” script in a project. In fact, you can convert back and forth between plans and scripts using functions plan_to_code() and code_to_plan() (please note the caveats here).

plan_to_code(plan, "new_script.R")
## Loading required namespace: styler
cat(readLines("new_script.R"), sep = "\n")
## raw_data <- readxl::read_excel(file_in("raw_data.xlsx"))
## data <- raw_data %>%
##   mutate(Species = forcats::fct_inorder(Species)) %>%
##   select(-X__1)
## fit <- lm(Sepal.Width ~ Petal.Width + Species, data)
## hist <- create_plot(data)
## report <- rmarkdown::render(knitr_in("report.Rmd"),
##   output_file = file_out("report.html"),
##   quiet = TRUE
## )

code_to_plan("new_script.R")
## # A tibble: 5 x 2
##   target   command                                                        
##   <chr>    <chr>                                                          
## 1 raw_data "readxl::read_excel(file_in(\"raw_data.xlsx\"))"               
## 2 data     "raw_data %>% mutate(Species = forcats::fct_inorder(Species)) …
## 3 fit      lm(Sepal.Width ~ Petal.Width + Species, data)                  
## 4 hist     create_plot(data)                                              
## 5 report   "rmarkdown::render(knitr_in(\"report.Rmd\"), output_file = fil…

And plan_to_notebook() turns plans into R notebooks.

plan_to_notebook(plan, "new_notebook.Rmd")
cat(readLines("new_notebook.Rmd"), sep = "\n")
## ---
## title: "My Notebook"
## output: html_notebook
## ---
## 
## ```{r my_code}
## raw_data <- readxl::read_excel(file_in("raw_data.xlsx"))
## data <- raw_data %>%
##   mutate(Species = forcats::fct_inorder(Species)) %>%
##   select(-X__1)
## fit <- lm(Sepal.Width ~ Petal.Width + Species, data)
## hist <- create_plot(data)
## report <- rmarkdown::render(knitr_in("report.Rmd"),
##   output_file = file_out("report.html"),
##   quiet = TRUE
## )
## ```

code_to_plan("new_notebook.Rmd")
## # A tibble: 5 x 2
##   target   command                                                        
##   <chr>    <chr>                                                          
## 1 raw_data "readxl::read_excel(file_in(\"raw_data.xlsx\"))"               
## 2 data     "raw_data %>% mutate(Species = forcats::fct_inorder(Species)) …
## 3 fit      lm(Sepal.Width ~ Petal.Width + Species, data)                  
## 4 hist     create_plot(data)                                              
## 5 report   "rmarkdown::render(knitr_in(\"report.Rmd\"), output_file = fil…

6.3 So why do we use plans?

The workflow plan may seem like a burden to set up, and the use of data frames may seem counterintuitive at first, but the rewards are worth the effort.

6.3.1 You can skip up-to-date work.

As we saw in our first example, subsequent make()s skip work that is already up to date. To skip steps of the workflow, we need to know what those steps actually are. Workflow plan data frames formally define skippable steps, whereas scripts and notebooks on their own do not.
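
For example, a second make() is a near-instant no-op when nothing has changed.

make(plan) # First run: builds all five targets.
make(plan) # Second run: skips them all ("All targets are already up to date.")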

This general approach of declaring targets in advance has stood the test of time. The idea dates at least as far back as GNU Make, which uses Makefiles to declare targets and dependencies. drake’s predecessor remake uses YAML files in a similar way.

6.3.2 Data frames scale well.

Makefiles are successful for Make because they accommodate software written in multiple languages. However, such external configuration files are not the best solution for R. Maintaining a Makefile or a remake YAML file requires a lot of manual typing. But with drake plans, you can use the usual data frame manipulation tools to expand, generate, and piece together large projects. The gsp example shows how to use expand.grid() and rbind() to automatically create plans with hundreds of targets. In addition, drake has a wildcard templating mechanism to generate large plans.
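
To sketch the idea (check_metric() is a hypothetical function), remember that a plan is just a data frame with target and command columns.

grid <- expand.grid(
  school = c("schoolA", "schoolB", "schoolC"),
  metric = c("credits", "students", "grads"),
  stringsAsFactors = FALSE
)
tibble::tibble(
  target = paste(grid$metric, grid$school, sep = "_"),
  command = sprintf("check_metric(\"%s\", \"%s\")", grid$metric, grid$school)
) # 9 rows: one target per metric/school combination.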

6.3.3 You do not need to worry about which targets run first.

When you call make() on the plan above, drake takes care of "raw_data.xlsx", then raw_data, and then data in sequence. Once data completes, fit and hist can start in any order, and then report begins once everything else is done. The execution does not depend on the order of the rows in your plan. In other words, the following plan is equivalent.

drake_plan(
  fit = lm(Sepal.Width ~ Petal.Width + Species, data),
  report = rmarkdown::render(
    knitr_in("report.Rmd"),
    output_file = file_out("report.html"),
    quiet = TRUE
  ),
  hist = create_plot(data),
  data = raw_data %>%
    mutate(Species = forcats::fct_inorder(Species)) %>%
    select(-X__1),
  raw_data = readxl::read_excel(file_in("raw_data.xlsx"))
)
## # A tibble: 5 x 2
##   target   command                                                        
##   <chr>    <chr>                                                          
## 1 fit      lm(Sepal.Width ~ Petal.Width + Species, data)                  
## 2 report   "rmarkdown::render(knitr_in(\"report.Rmd\"), output_file = fil…
## 3 hist     create_plot(data)                                              
## 4 data     "raw_data %>% mutate(Species = forcats::fct_inorder(Species)) …
## 5 raw_data "readxl::read_excel(file_in(\"raw_data.xlsx\"))"

6.4 Automatic dependency detection

Why can you safely scramble the rows of a drake plan? Why is row order irrelevant to execution order? Because drake analyzes commands and functions for dependencies, and make() processes those dependencies before moving on to downstream targets. To detect dependencies, drake walks through the abstract syntax tree of every piece of code to find the objects and files relevant to the workflow pipeline.

For more information on static code analysis, see this section of Advanced R and the CodeDepends package.

create_plot <- function(data) {
  ggplot(data, aes_string(x = "Petal.Width", fill = "Species")) +
    geom_histogram()
}

deps_code(create_plot)
## $globals
## [1] "ggplot"         "aes_string"     "geom_histogram"

deps_code(
  quote({
    some_function_i_wrote(data)
    rmarkdown::render(
      knitr_in("report.Rmd"),
      output_file = file_out("report.html"),
      quiet = TRUE
    )
  })
)
## $globals
## [1] "some_function_i_wrote" "data"                 
## 
## $namespaced
## [1] "rmarkdown::render"
## 
## $loadd
## [1] "fit"
## 
## $readd
## [1] "hist"
## 
## $knitr_in
## [1] "\"report.Rmd\""
## 
## $file_out
## [1] "\"report.html\""

drake detects dependencies without actually running the command.

file.exists("report.html")
## [1] FALSE

Automatically detected dependencies include:

  1. Other targets in the plan.
  2. Objects and functions in your environment.
  3. Objects and functions from packages that you reference with :: or ::: (namespaced objects).
  4. Input and output files declared in your commands with file_in(), knitr_in(), or file_out().
  5. Input files declared in imported functions ((2) or (3)) with file_in() or knitr_in() (see the sketch after this list).
  6. For knitr or R Markdown reports declared with knitr_in() (example here), drake scans active code chunks for objects mentioned with loadd() and readd(). So when fit or hist change, drake rebuilds the report target to produce the file report.html.
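
To illustrate (5), here is a sketch with a hypothetical settings file. drake finds the file_in() call inside the body of the imported function.

read_settings <- function() {
  jsonlite::read_json(file_in("settings.json"))
}

deps_code(read_settings) # "settings.json" appears under $file_in.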

Targets can depend on one another through file_in()/file_out() connections.

saveRDS(1, "start.rds")

write_files <- function(){
  x <- readRDS(file_in("start.rds"))
  for (file in letters[1:3]){
    saveRDS(x, file)
  }
}

small_plan <- drake_plan(
  x = {
    write_files()
    file_out("a", "b", "c")
  },
  y = readRDS(file_in("a"))
)

config <- drake_config(small_plan)
vis_drake_graph(config)

So when target x changes the output for files "a", "b", or "c", drake knows to rebuild target y. In addition, if you accidentally modify any of these output files by hand, drake will run the command of x to restore the files to a reproducible state.
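
Here is a sketch of that self-healing behavior, using small_plan and config from above.

make(small_plan) # Builds x (writing files "a", "b", and "c"), then y.
unlink("a")      # Accidentally delete one of x's output files.
outdated(config) # x shows up as outdated, so the next make() restores "a".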

6.5 Automatically generating workflow plans

drake provides many more utilities that increase the flexibility of workflow plan generation beyond expand.grid().

  • drake_plan()
  • map_plan()
  • evaluate_plan()
  • plan_analyses()
  • plan_summaries()
  • expand_plan()
  • gather_by()
  • reduce_by()
  • gather_plan()
  • reduce_plan()

6.5.1 map_plan()

purrr-like functional programming is like looping, but cleaner. The idea is to iterate the same computation over multiple different data points. You write a function to do something once, and a map()-like helper invokes it on each point in your dataset. drake’s version of map() — or more precisely, pmap_df() — is map_plan().

In the following example, we want to know how well each pair of covariates in the mtcars dataset can predict fuel efficiency (in miles per gallon). We will try multiple pairs of covariates using the same statistical analysis, so it is a great time for drake-flavored functional programming with map_plan().

As with its cousin, pmap_df(), map_plan() needs

  1. A function.
  2. A grid of function arguments.

Our function fits a fuel efficiency model given a single pair of covariate names x1 and x2.

my_model_fit <- function(x1, x2, data){
  lm(as.formula(paste("mpg ~", x1, "+", x2)), data = data)
}

Our grid of function arguments is a data frame of possible values for x1, x2, and data.

covariates <- setdiff(colnames(mtcars), "mpg") # Exclude the response variable.
args <- t(combn(covariates, 2)) # Take all possible pairs.
colnames(args) <- c("x1", "x2") # The column names must be the argument names of my_model_fit()
args <- tibble::as_tibble(args) # Tibbles are so nice.
args$data <- "mtcars"

args
## # A tibble: 45 x 3
##    x1    x2    data  
##    <chr> <chr> <chr> 
##  1 cyl   disp  mtcars
##  2 cyl   hp    mtcars
##  3 cyl   drat  mtcars
##  4 cyl   wt    mtcars
##  5 cyl   qsec  mtcars
##  6 cyl   vs    mtcars
##  7 cyl   am    mtcars
##  8 cyl   gear  mtcars
##  9 cyl   carb  mtcars
## 10 disp  hp    mtcars
## # ... with 35 more rows

Each row of args corresponds to a call to my_model_fit(). To actually write out all those function calls, we use map_plan().

map_plan(args, my_model_fit)
## # A tibble: 45 x 2
##    target              command                                            
##    <chr>               <chr>                                              
##  1 my_model_fit_501e0… "my_model_fit(x1 = \"cyl\", x2 = \"disp\", data = …
##  2 my_model_fit_d5de1… "my_model_fit(x1 = \"cyl\", x2 = \"hp\", data = \"…
##  3 my_model_fit_eac6c… "my_model_fit(x1 = \"cyl\", x2 = \"drat\", data = …
##  4 my_model_fit_3900e… "my_model_fit(x1 = \"cyl\", x2 = \"wt\", data = \"…
##  5 my_model_fit_a2d79… "my_model_fit(x1 = \"cyl\", x2 = \"qsec\", data = …
##  6 my_model_fit_f5c0a… "my_model_fit(x1 = \"cyl\", x2 = \"vs\", data = \"…
##  7 my_model_fit_507d2… "my_model_fit(x1 = \"cyl\", x2 = \"am\", data = \"…
##  8 my_model_fit_b5f9a… "my_model_fit(x1 = \"cyl\", x2 = \"gear\", data = …
##  9 my_model_fit_8c4c5… "my_model_fit(x1 = \"cyl\", x2 = \"carb\", data = …
## 10 my_model_fit_f9bb9… "my_model_fit(x1 = \"disp\", x2 = \"hp\", data = \…
## # ... with 35 more rows

We now have a plan, but it has a couple of issues.

  1. The data argument should be a symbol. In other words, we want my_model_fit(data = mtcars), not my_model_fit(data = "mtcars"). So we use the syms() function from the rlang package to turn args$data into a list of symbols.
  2. The default argument names are ugly, so we can add a new "id" column to args (or select one with the id argument of map_plan()).

# Fixes (1)
args$data <- rlang::syms(args$data)

# Alternative if each element of `args$data` is code with multiple symbols:
# args$data <- purrr::map(args$data, rlang::parse_expr)

# Fixes (2)
args$id <- paste0("fit_", args$x1, "_", args$x2)

args
## # A tibble: 45 x 4
##    x1    x2    data     id          
##    <chr> <chr> <list>   <chr>       
##  1 cyl   disp  <symbol> fit_cyl_disp
##  2 cyl   hp    <symbol> fit_cyl_hp  
##  3 cyl   drat  <symbol> fit_cyl_drat
##  4 cyl   wt    <symbol> fit_cyl_wt  
##  5 cyl   qsec  <symbol> fit_cyl_qsec
##  6 cyl   vs    <symbol> fit_cyl_vs  
##  7 cyl   am    <symbol> fit_cyl_am  
##  8 cyl   gear  <symbol> fit_cyl_gear
##  9 cyl   carb  <symbol> fit_cyl_carb
## 10 disp  hp    <symbol> fit_disp_hp 
## # ... with 35 more rows

Much better.

plan <- map_plan(args, my_model_fit)
plan
## # A tibble: 45 x 2
##    target       command                                                   
##    <chr>        <chr>                                                     
##  1 fit_cyl_disp "my_model_fit(x1 = \"cyl\", x2 = \"disp\", data = mtcars)"
##  2 fit_cyl_hp   "my_model_fit(x1 = \"cyl\", x2 = \"hp\", data = mtcars)"  
##  3 fit_cyl_drat "my_model_fit(x1 = \"cyl\", x2 = \"drat\", data = mtcars)"
##  4 fit_cyl_wt   "my_model_fit(x1 = \"cyl\", x2 = \"wt\", data = mtcars)"  
##  5 fit_cyl_qsec "my_model_fit(x1 = \"cyl\", x2 = \"qsec\", data = mtcars)"
##  6 fit_cyl_vs   "my_model_fit(x1 = \"cyl\", x2 = \"vs\", data = mtcars)"  
##  7 fit_cyl_am   "my_model_fit(x1 = \"cyl\", x2 = \"am\", data = mtcars)"  
##  8 fit_cyl_gear "my_model_fit(x1 = \"cyl\", x2 = \"gear\", data = mtcars)"
##  9 fit_cyl_carb "my_model_fit(x1 = \"cyl\", x2 = \"carb\", data = mtcars)"
## 10 fit_disp_hp  "my_model_fit(x1 = \"disp\", x2 = \"hp\", data = mtcars)" 
## # ... with 35 more rows

We may also want to retain information about the constituent function arguments of each target. With map_plan(trace = TRUE), we can append the columns of args alongside the usual "target" and "command" columns of our plan.

map_plan(args, my_model_fit, trace = TRUE)
## # A tibble: 45 x 6
##    target    command                           x1    x2    data   id      
##    <chr>     <chr>                             <chr> <chr> <list> <chr>   
##  1 fit_cyl_… "my_model_fit(x1 = \"cyl\", x2 =… cyl   disp  <symb… fit_cyl…
##  2 fit_cyl_… "my_model_fit(x1 = \"cyl\", x2 =… cyl   hp    <symb… fit_cyl…
##  3 fit_cyl_… "my_model_fit(x1 = \"cyl\", x2 =… cyl   drat  <symb… fit_cyl…
##  4 fit_cyl_… "my_model_fit(x1 = \"cyl\", x2 =… cyl   wt    <symb… fit_cyl…
##  5 fit_cyl_… "my_model_fit(x1 = \"cyl\", x2 =… cyl   qsec  <symb… fit_cyl…
##  6 fit_cyl_… "my_model_fit(x1 = \"cyl\", x2 =… cyl   vs    <symb… fit_cyl…
##  7 fit_cyl_… "my_model_fit(x1 = \"cyl\", x2 =… cyl   am    <symb… fit_cyl…
##  8 fit_cyl_… "my_model_fit(x1 = \"cyl\", x2 =… cyl   gear  <symb… fit_cyl…
##  9 fit_cyl_… "my_model_fit(x1 = \"cyl\", x2 =… cyl   carb  <symb… fit_cyl…
## 10 fit_disp… "my_model_fit(x1 = \"disp\", x2 … disp  hp    <symb… fit_dis…
## # ... with 35 more rows

In any case, we can now fit our models.

make(plan, verbose = FALSE)

And inspect the output.

readd(fit_cyl_disp)
## 
## Call:
## lm(formula = as.formula(paste("mpg ~", x1, "+", x2)), data = data)
## 
## Coefficients:
## (Intercept)          cyl         disp  
##    34.66099     -1.58728     -0.02058
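
From here, one way to compare all 45 fits is to sweep over the plan itself. A sketch using readd() with character_only = TRUE:

library(purrr)
comparison <- map_dfr(plan$target, function(target) {
  fit <- readd(target, character_only = TRUE) # Fetch each fitted model from the cache.
  tibble::tibble(target = target, r_squared = summary(fit)$r.squared)
})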

6.5.2 Wildcard templating

In drake, you can write plans with wildcards. These wildcards are placeholders for text in commands. By iterating over the possible values of a wildcard, you can easily generate plans with thousands of targets. Let’s say you are running a simulation study, and you need to generate sets of random numbers from different distributions.

plan <- drake_plan(
  t  = rt(1000, df = 5),
  normal = rnorm(1000, mean = 0, sd = 1)
)

If you need to generate many datasets with different means, you could write out each target individually.

drake_plan(
  t  = rt(1000, df = 5),
  normal_0 = rnorm(1000, mean = 0, sd = 1),
  normal_1 = rnorm(1000, mean = 1, sd = 1),
  normal_2 = rnorm(1000, mean = 2, sd = 1),
  normal_3 = rnorm(1000, mean = 3, sd = 1),
  normal_4 = rnorm(1000, mean = 4, sd = 1),
  normal_5 = rnorm(1000, mean = 5, sd = 1),
  normal_6 = rnorm(1000, mean = 6, sd = 1),
  normal_7 = rnorm(1000, mean = 7, sd = 1),
  normal_8 = rnorm(1000, mean = 8, sd = 1),
  normal_9 = rnorm(1000, mean = 9, sd = 1)
)

But writing all that code manually is a pain and prone to human error. Instead, use evaluate_plan().

plan <- drake_plan(
  t  = rt(1000, df = 5),
  normal = rnorm(1000, mean = mean__, sd = 1)
)
evaluate_plan(plan, wildcard = "mean__", values = 0:9)
## # A tibble: 11 x 2
##    target   command                      
##    <chr>    <chr>                        
##  1 t        rt(1000, df = 5)             
##  2 normal_0 rnorm(1000, mean = 0, sd = 1)
##  3 normal_1 rnorm(1000, mean = 1, sd = 1)
##  4 normal_2 rnorm(1000, mean = 2, sd = 1)
##  5 normal_3 rnorm(1000, mean = 3, sd = 1)
##  6 normal_4 rnorm(1000, mean = 4, sd = 1)
##  7 normal_5 rnorm(1000, mean = 5, sd = 1)
##  8 normal_6 rnorm(1000, mean = 6, sd = 1)
##  9 normal_7 rnorm(1000, mean = 7, sd = 1)
## 10 normal_8 rnorm(1000, mean = 8, sd = 1)
## 11 normal_9 rnorm(1000, mean = 9, sd = 1)

You can specify multiple wildcards at once. If multiple wildcards appear in the same command, you will get a new target for each unique combination of values.

plan <- drake_plan(
  t  = rt(1000, df = df__),
  normal = rnorm(1000, mean = mean__, sd = sd__)
)
evaluate_plan(
  plan,
  rules = list(
    mean__ = c(0, 1),
    sd__ = c(3, 4),
    df__ = 5:7
  )
)
## # A tibble: 7 x 2
##   target     command                      
##   <chr>      <chr>                        
## 1 t_5        rt(1000, df = 5)             
## 2 t_6        rt(1000, df = 6)             
## 3 t_7        rt(1000, df = 7)             
## 4 normal_0_3 rnorm(1000, mean = 0, sd = 3)
## 5 normal_0_4 rnorm(1000, mean = 0, sd = 4)
## 6 normal_1_3 rnorm(1000, mean = 1, sd = 3)
## 7 normal_1_4 rnorm(1000, mean = 1, sd = 4)

Wildcards for evaluate_plan() do not need to have the double-underscore suffix. Any valid symbol will do.

plan <- drake_plan(
  t  = rt(1000, df = .DF.),
  normal = rnorm(1000, mean = `{MEAN}`, sd = ..sd)
)
evaluate_plan(
  plan,
  rules = list(
    "`{MEAN}`" = c(0, 1),
    ..sd = c(3, 4),
    .DF. = 5:7
  )
)
## # A tibble: 7 x 2
##   target     command                      
##   <chr>      <chr>                        
## 1 t_5        rt(1000, df = 5)             
## 2 t_6        rt(1000, df = 6)             
## 3 t_7        rt(1000, df = 7)             
## 4 normal_0_3 rnorm(1000, mean = 0, sd = 3)
## 5 normal_0_4 rnorm(1000, mean = 0, sd = 4)
## 6 normal_1_3 rnorm(1000, mean = 1, sd = 3)
## 7 normal_1_4 rnorm(1000, mean = 1, sd = 4)

Set expand to FALSE to disable expansion. The wildcard values are then substituted in order, one value per target, without creating any new targets.

plan <- drake_plan(
  t  = rpois(samples__, lambda = mean__),
  normal = rnorm(samples__, mean = mean__)
)
evaluate_plan(
  plan,
  rules = list(
    samples__ = c(50, 100),
    mean__ = c(1, 5)
  ),
  expand = FALSE
)
## # A tibble: 2 x 2
##   target command              
##   <chr>  <chr>                
## 1 t      rpois(50, lambda = 1)
## 2 normal rnorm(100, mean = 5)

Wildcard templating can sometimes be tricky. For example, suppose your project is to analyze school data, and your workflow checks several metrics of several schools. The idea is to write a workflow plan with your metrics and let the wildcard templating expand over the available schools.

hard_plan <- drake_plan(
  credits = check_credit_hours(school__),
  students = check_students(school__),
  grads = check_graduations(school__),
  public_funds = check_public_funding(school__)
)

evaluate_plan(
  hard_plan,
  rules = list(school__ = c("schoolA", "schoolB", "schoolC"))
)
## # A tibble: 12 x 2
##    target               command                      
##    <chr>                <chr>                        
##  1 credits_schoolA      check_credit_hours(schoolA)  
##  2 credits_schoolB      check_credit_hours(schoolB)  
##  3 credits_schoolC      check_credit_hours(schoolC)  
##  4 students_schoolA     check_students(schoolA)      
##  5 students_schoolB     check_students(schoolB)      
##  6 students_schoolC     check_students(schoolC)      
##  7 grads_schoolA        check_graduations(schoolA)   
##  8 grads_schoolB        check_graduations(schoolB)   
##  9 grads_schoolC        check_graduations(schoolC)   
## 10 public_funds_schoolA check_public_funding(schoolA)
## 11 public_funds_schoolB check_public_funding(schoolB)
## 12 public_funds_schoolC check_public_funding(schoolC)

But what if some metrics do not make sense? For example, what if schoolC is a completely privately funded school? With no public funds, check_public_funding(schoolC) may quit in error if we are not careful. This is where setting up workflow plans requires a little creativity. In this case, we recommend that you use two wildcards: one for all the schools and another for just the public schools. The new plan omits the invalid public_funds target for schoolC.

plan_template <- drake_plan(
  school = get_school_data("school__"),
  credits = check_credit_hours(all_schools__),
  students = check_students(all_schools__),
  grads = check_graduations(all_schools__),
  public_funds = check_public_funding(public_schools__)
)
evaluate_plan(
  plan = plan_template,
  rules = list(
    school__ = c("A", "B", "C"),
    all_schools__ =  c("school_A", "school_B", "school_C"),
    public_schools__ = c("school_A", "school_B")
  )
)
## # A tibble: 14 x 2
##    target                command                       
##    <chr>                 <chr>                         
##  1 school_A              "get_school_data(\"A\")"      
##  2 school_B              "get_school_data(\"B\")"      
##  3 school_C              "get_school_data(\"C\")"      
##  4 credits_school_A      check_credit_hours(school_A)  
##  5 credits_school_B      check_credit_hours(school_B)  
##  6 credits_school_C      check_credit_hours(school_C)  
##  7 students_school_A     check_students(school_A)      
##  8 students_school_B     check_students(school_B)      
##  9 students_school_C     check_students(school_C)      
## 10 grads_school_A        check_graduations(school_A)   
## 11 grads_school_B        check_graduations(school_B)   
## 12 grads_school_C        check_graduations(school_C)   
## 13 public_funds_school_A check_public_funding(school_A)
## 14 public_funds_school_B check_public_funding(school_B)

Thanks to Alex Axthelm for this use case in issue 235.

6.5.3 Wildcard clusters

With evaluate_plan(trace = TRUE), you can generate columns that show how the targets were generated from the wildcards.

plan_template <- drake_plan(
  school = get_school_data("school__"),
  credits = check_credit_hours(all_schools__),
  students = check_students(all_schools__),
  grads = check_graduations(all_schools__),
  public_funds = check_public_funding(public_schools__)
)
plan <- evaluate_plan(
  plan = plan_template,
  rules = list(
    school__ = c("A", "B", "C"),
    all_schools__ =  c("school_A", "school_B", "school_C"),
    public_schools__ = c("school_A", "school_B")
  ),
  trace = TRUE
)
plan
## # A tibble: 14 x 8
##    target command school__ school___from all_schools__ all_schools___f…
##    <chr>  <chr>   <chr>    <chr>         <chr>         <chr>           
##  1 schoo… "get_s… A        school        <NA>          <NA>            
##  2 schoo… "get_s… B        school        <NA>          <NA>            
##  3 schoo… "get_s… C        school        <NA>          <NA>            
##  4 credi… check_… <NA>     <NA>          school_A      credits         
##  5 credi… check_… <NA>     <NA>          school_B      credits         
##  6 credi… check_… <NA>     <NA>          school_C      credits         
##  7 stude… check_… <NA>     <NA>          school_A      students        
##  8 stude… check_… <NA>     <NA>          school_B      students        
##  9 stude… check_… <NA>     <NA>          school_C      students        
## 10 grads… check_… <NA>     <NA>          school_A      grads           
## 11 grads… check_… <NA>     <NA>          school_B      grads           
## 12 grads… check_… <NA>     <NA>          school_C      grads           
## 13 publi… check_… <NA>     <NA>          <NA>          <NA>            
## 14 publi… check_… <NA>     <NA>          <NA>          <NA>            
## # ... with 2 more variables: public_schools__ <chr>,
## #   public_schools___from <chr>

And then when you visualize the dependency graph, you can cluster nodes based on the wildcard info.

config <- drake_config(plan)
vis_drake_graph(
  config,
  group = "all_schools__",
  clusters = c("school_A", "school_B", "school_C")
)

See the visualization guide for more details.

6.5.4 Specialized wildcard functionality

In the mtcars example, we will analyze bootstrapped versions of the mtcars dataset to look for an association between the weight and the fuel efficiency of cars. This example uses plan_analyses() and plan_summaries(), two specialized applications of evaluate_plan(). First, we generate the plan for the bootstrapped datasets.

my_datasets <- drake_plan(
  small = simulate(48),
  large = simulate(64))
my_datasets
## # A tibble: 2 x 2
##   target command     
##   <chr>  <chr>       
## 1 small  simulate(48)
## 2 large  simulate(64)

We want to analyze each dataset with one of two regression models.

methods <- drake_plan(
  regression1 = reg1(dataset__),
  regression2 = reg2(dataset__))
methods
## # A tibble: 2 x 2
##   target      command        
##   <chr>       <chr>          
## 1 regression1 reg1(dataset__)
## 2 regression2 reg2(dataset__)

We evaluate the dataset__ wildcard to generate all the regression commands we will need.

my_analyses <- plan_analyses(methods, datasets = my_datasets)
my_analyses
## # A tibble: 4 x 2
##   target            command    
##   <chr>             <chr>      
## 1 regression1_small reg1(small)
## 2 regression1_large reg1(large)
## 3 regression2_small reg2(small)
## 4 regression2_large reg2(large)

Next, we summarize each analysis of each dataset. We calculate descriptive statistics on the residuals, and we collect the regression coefficients and their p-values.

summary_types <- drake_plan(
  summ = suppressWarnings(summary(analysis__$residuals)),
  coef = suppressWarnings(summary(analysis__))$coefficients
)
summary_types
## # A tibble: 2 x 2
##   target command                                           
##   <chr>  <chr>                                             
## 1 summ   suppressWarnings(summary(analysis__$residuals))   
## 2 coef   suppressWarnings(summary(analysis__))$coefficients

results <- plan_summaries(summary_types, analyses = my_analyses,
  datasets = my_datasets, gather = NULL) # Gathering is suppressed here.
results
## # A tibble: 8 x 2
##   target                command                                           
##   <chr>                 <chr>                                             
## 1 summ_regression1_sma… suppressWarnings(summary(regression1_small$residu…
## 2 summ_regression1_lar… suppressWarnings(summary(regression1_large$residu…
## 3 summ_regression2_sma… suppressWarnings(summary(regression2_small$residu…
## 4 summ_regression2_lar… suppressWarnings(summary(regression2_large$residu…
## 5 coef_regression1_sma… suppressWarnings(summary(regression1_small))$coef…
## 6 coef_regression1_lar… suppressWarnings(summary(regression1_large))$coef…
## 7 coef_regression2_sma… suppressWarnings(summary(regression2_small))$coef…
## 8 coef_regression2_lar… suppressWarnings(summary(regression2_large))$coef…

Next, we bind all the rows together for a single plan that we can later supply to make().

my_plan <- rbind(my_datasets, my_analyses, results)
my_plan
## # A tibble: 14 x 2
##    target               command                                           
##    <chr>                <chr>                                             
##  1 small                simulate(48)                                      
##  2 large                simulate(64)                                      
##  3 regression1_small    reg1(small)                                       
##  4 regression1_large    reg1(large)                                       
##  5 regression2_small    reg2(small)                                       
##  6 regression2_large    reg2(large)                                       
##  7 summ_regression1_sm… suppressWarnings(summary(regression1_small$residu…
##  8 summ_regression1_la… suppressWarnings(summary(regression1_large$residu…
##  9 summ_regression2_sm… suppressWarnings(summary(regression2_small$residu…
## 10 summ_regression2_la… suppressWarnings(summary(regression2_large$residu…
## 11 coef_regression1_sm… suppressWarnings(summary(regression1_small))$coef…
## 12 coef_regression1_la… suppressWarnings(summary(regression1_large))$coef…
## 13 coef_regression2_sm… suppressWarnings(summary(regression2_small))$coef…
## 14 coef_regression2_la… suppressWarnings(summary(regression2_large))$coef…

6.5.5 Non-wildcard functions

6.5.5.1 expand_plan()

Sometimes, you just want multiple replicates of the same targets.

plan <- drake_plan(
  fake_data = simulate_from_model(),
  bootstrapped_data = bootstrap_from_real_data(real_data)
)
expand_plan(plan, values = 1:3)
## # A tibble: 6 x 2
##   target              command                            
##   <chr>               <chr>                              
## 1 fake_data_1         simulate_from_model()              
## 2 fake_data_2         simulate_from_model()              
## 3 fake_data_3         simulate_from_model()              
## 4 bootstrapped_data_1 bootstrap_from_real_data(real_data)
## 5 bootstrapped_data_2 bootstrap_from_real_data(real_data)
## 6 bootstrapped_data_3 bootstrap_from_real_data(real_data)
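
These replicates pair nicely with gather_plan() (described next). A sketch that appends an aggregation step:

replicates <- expand_plan(plan, values = 1:3)
gathered <- gather_plan(replicates, target = "all_data", gather = "rbind")
rbind(replicates, gathered) # Seven rows: six replicates plus the aggregation step.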

6.5.5.2 gather_plan() and gather_by()

Other times, you want to combine multiple targets into one.

plan <- drake_plan(
  small = data.frame(type = "small", x = rnorm(25), y = rnorm(25)),
  large = data.frame(type = "large", x = rnorm(1000), y = rnorm(1000))
)
gather_plan(plan, target = "combined")
## # A tibble: 1 x 2
##   target   command                           
##   <chr>    <chr>                             
## 1 combined list(small = small, large = large)

In this case, small and large are data frames, so it may be more convenient to combine the rows together.

gather_plan(plan, target = "combined", gather = "rbind")
## # A tibble: 1 x 2
##   target   command                            
##   <chr>    <chr>                              
## 1 combined rbind(small = small, large = large)

See also gather_by() to gather multiple groups of targets based on other columns in the plan (e.g. from evaluate_plan(trace = TRUE)).

6.5.5.3 reduce_plan() and reduce_by()

reduce_plan() is similar to gather_plan(), but it combines targets in successive pairs. This is useful if combining everything at once would take too much time or memory, or if you want to parallelize the aggregation.

plan <- drake_plan(
  a = 1,
  b = 2,
  c = 3,
  d = 4
)
reduce_plan(plan)
## # A tibble: 3 x 2
##   target   command            
##   <chr>    <chr>              
## 1 target_1 a + b              
## 2 target_2 c + d              
## 3 target   target_1 + target_2

You can control how each pair of targets gets combined.

reduce_plan(plan, begin = "c(", op = ", ", end = ")")
## # A tibble: 3 x 2
##   target   command              
##   <chr>    <chr>                
## 1 target_1 c(a, b)              
## 2 target_2 c(c, d)              
## 3 target   c(target_1, target_2)
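
If a single aggregation step is acceptable, you can disable the pairwise behavior. A sketch, assuming the same plan as above:

# With pairwise = FALSE, reduce_plan() emits one target whose
# command combines a, b, c, and d in a single step.
reduce_plan(plan, target = "total", begin = "c(", op = ", ", end = ")", pairwise = FALSE)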

See also reduce_by() to do reductions on multiple groups of targets based on other columns in the plan (e.g. from evaluate_plan(trace = TRUE)).

6.5.6 Custom metaprogramming

The workflow plan is just a data frame. There is nothing magic about it, and you can create it any way you want. With your own custom metaprogramming, you don’t even need the drake_plan() function.

The following example could more easily be implemented with map_plan(), but we use other techniques to demonstrate the versatility of custom metaprogramming. Let’s consider a file-based example workflow. Here, our targets execute Linux commands to process input files and create output files.

cat in1.txt > out1.txt
cat in2.txt > out2.txt

The glue package can automatically generate these Linux commands.

library(glue)
glue_data(
  list(
    inputs = c("in1.txt", "in2.txt"), 
    outputs = c("out1.txt", "out2.txt")
  ),
  "cat {inputs} > {outputs}"
)
## cat in1.txt > out1.txt
## cat in2.txt > out2.txt

Our drake commands will use system() to execute the Linux commands that glue generates. Technically, we could use drake_plan() if we wanted.

library(tidyverse)
drake_plan(
  glue_data(
    list(
      inputs = file_in(c("in1.txt", "in2.txt")), 
      outputs = file_out(c("out1.txt", "out2.txt"))
    ),
    "cat {inputs} > {outputs}"
  ) %>%
    lapply(FUN = system)
)
## # A tibble: 1 x 2
##   target        command                                                   
##   <chr>         <chr>                                                     
## 1 drake_target… "glue_data(list(inputs = file_in(c(\"in1.txt\", \"in2.txt…

But what if we want to generate these glue commands instead of writing them literally in our plan? This is a job for custom metaprogramming with tidy evaluation. First, we create a function to generate the drake command of an arbitrary target.

library(rlang) # for tidy evaluation
write_command <- function(cmd, inputs = NULL, outputs = NULL){
  inputs <- enexpr(inputs)
  outputs <- enexpr(outputs)
  expr({
    glue_data(
      list(
        inputs = file_in(!!inputs),
        outputs = file_out(!!outputs)
      ),
      !!cmd
    ) %>%
      lapply(FUN = system)
  }) %>%
    expr_text
}

write_command(
  cmd = "cat {inputs} > {outputs}",
  inputs = c("in1.txt", "in2.txt"),
  outputs = c("out1.txt", "out2.txt")
) %>%
  cat
## {
##     glue_data(list(inputs = file_in(c("in1.txt", "in2.txt")), 
##         outputs = file_out(c("out1.txt", "out2.txt"))), "cat {inputs} > {outputs}") %>% 
##         lapply(FUN = system)
## }

Then, we lay out all the arguments we will pass to write_command(). Here, each row corresponds to a separate target.

meta_plan <- tribble(
  ~cmd, ~inputs, ~outputs,
  "cat {inputs} > {outputs}", c("in1.txt", "in2.txt"), c("out1.txt", "out2.txt"),
  "cat {inputs} {inputs} > {outputs}", c("out1.txt", "out2.txt"), c("out3.txt", "out4.txt")
) %>%
  print
## # A tibble: 2 x 3
##   cmd                               inputs    outputs  
##   <chr>                             <list>    <list>   
## 1 cat {inputs} > {outputs}          <chr [2]> <chr [2]>
## 2 cat {inputs} {inputs} > {outputs} <chr [2]> <chr [2]>

Finally, we create our workflow plan without any built-in drake functions.

plan <- tibble(
  target = paste0("target_", seq_len(nrow(meta_plan))),
  command = pmap_chr(meta_plan, write_command)
) %>%
  print
## # A tibble: 2 x 2
##   target   command                                                        
##   <chr>    <chr>                                                          
## 1 target_1 "{\n    glue_data(list(inputs = file_in(c(\"in1.txt\", \"in2.t…
## 2 target_2 "{\n    glue_data(list(inputs = file_in(c(\"out1.txt\", \"out2…
writeLines("in1", "in1.txt")
writeLines("in2", "in2.txt")
vis_drake_graph(drake_config(plan))
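
With the input files in place, a sketch of running the pipeline end to end:

make(plan)
file.exists(c("out1.txt", "out2.txt", "out3.txt", "out4.txt")) # All TRUE after a clean run.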

Alternatively, you could use as.call() instead of tidy evaluation to generate your plan, constructing the calls to file_in(), file_out(), and custom functions in your commands.

library(purrr) # pmap_chr() is particularly useful here.

# A function that will be called in your commands.
command_function <- function(cmd, inputs, outputs){
  glue_data(
    list(
      inputs = inputs,
      outputs = outputs
    ),
    cmd
  ) %>%
    purrr::walk(system)
}

# A function to generate quoted calls to command_function(),
# which in turn contain quoted calls to file_in() and file_out().
write_command <- function(...){
  args <- list(...)
  args$inputs <- as.call(list(quote(file_in), args$inputs))
  args$outputs <- as.call(list(quote(file_out), args$outputs))
  c(quote(command_function), args) %>%
    as.call() %>%
    rlang::expr_text()
}

plan <- tibble(
  target = paste0("target_", seq_len(nrow(meta_plan))),
  command = pmap_chr(meta_plan, write_command)
) %>%
  print
## # A tibble: 2 x 2
##   target   command                                                        
##   <chr>    <chr>                                                          
## 1 target_1 "command_function(cmd = \"cat {inputs} > {outputs}\", inputs =…
## 2 target_2 "command_function(cmd = \"cat {inputs} {inputs} > {outputs}\",…

Metaprogramming gets much simpler if you do not need to construct literal calls to file_in(), file_out(), etc. in your commands. The construction of model_plan in the gross state product example demonstrates this.

Thanks to Chris Hammill for presenting this scenario and contributing to the solution.

6.6 Optional columns in your plan.

Besides the usual columns target and command, there are other columns you can add (see the sketch after this list).

  • elapsed and cpu: number of seconds to wait for the target to build before timing out (elapsed for elapsed time and cpu for CPU time).
  • priority: for parallel computing, optionally rank the targets according to priority. That way, when two targets become ready to build at the same time, drake will pick the one with the dominant priority first.
  • retries: number of times to retry building a target in the event of an error.
  • trigger: choose the criterion that drake uses to decide whether to build the target. See ?trigger or read the trigger chapter to learn more.
  • worker: for parallel computing, optionally name the preferred worker to assign to each target.
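
Since plans are data frames, one way to supply these columns is ordinary column assignment. A sketch with hypothetical targets:

plan <- drake_plan(
  fast_target = small_job(), # small_job() and big_job() are hypothetical.
  slow_target = big_job()
)
plan$elapsed <- c(30, 3600) # Time out after 30 seconds and 1 hour, respectively.
plan$retries <- 1           # Retry each failed target once.
plan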