Chapter 12 Time: logging, prediction, and strategy

Thanks to Jasper Clarkberg, drake records how long it takes to build each target. For large projects that take hours or days to run, this feature becomes important for planning and execution.

library(drake)
load_mtcars_example() # Get the code with drake_example("mtcars").
make(my_plan, jobs = 2)

build_times(digits = 8) # From the cache.
## # A tibble: 21 x 5
##    item                  type   elapsed        user          system       
##  * <chr>                 <chr>  <S4: Duration> <S4: Duratio> <S4: Duratio>
##  1 "\"report.md\""       target 0.124s         0.092s        0.032s       
##  2 "\"report.Rmd\""      import 0.002s         0s            0s           
##  3 coef_regression1_lar… target 0.01s          0.008s        0s           
##  4 coef_regression1_sma… target 0.009s         0.004s        0.004s       
##  5 coef_regression2_lar… target 0.012s         0.004s        0.004s       
##  6 coef_regression2_sma… target 0.009s         0.012s        0s           
##  7 large                 target 0.011s         0.008s        0.004s       
##  8 random_rows           import 0.003s         0.004s        0s           
##  9 reg1                  import 0.003s         0s            0s           
## 10 reg2                  import 0.003s         0.004s        0s           
## # ... with 11 more rows

# `dplyr`-style `tidyselect` commands
build_times(starts_with("coef"), digits = 8)
## # A tibble: 4 x 5
##   item                   type   elapsed        user          system       
## * <chr>                  <chr>  <S4: Duration> <S4: Duratio> <S4: Duratio>
## 1 coef_regression1_large target 0.01s          0.008s        0s           
## 2 coef_regression1_small target 0.009s         0.004s        0.004s       
## 3 coef_regression2_large target 0.012s         0.004s        0.004s       
## 4 coef_regression2_small target 0.009s         0.012s        0s

build_times(digits = 8, targets_only = TRUE)
## # A tibble: 16 x 5
##    item                  type   elapsed        user          system       
##  * <chr>                 <chr>  <S4: Duration> <S4: Duratio> <S4: Duratio>
##  1 "\"report.md\""       target 0.124s         0.092s        0.032s       
##  2 coef_regression1_lar… target 0.01s          0.008s        0s           
##  3 coef_regression1_sma… target 0.009s         0.004s        0.004s       
##  4 coef_regression2_lar… target 0.012s         0.004s        0.004s       
##  5 coef_regression2_sma… target 0.009s         0.012s        0s           
##  6 large                 target 0.011s         0.008s        0.004s       
##  7 regression1_large     target 0.012s         0.012s        0s           
##  8 regression1_small     target 0.011s         0.012s        0s           
##  9 regression2_large     target 0.011s         0.012s        0s           
## 10 regression2_small     target 0.022s         0.008s        0s           
## 11 report                target 0.135s         0.104s        0.032s       
## 12 small                 target 0.016s         0.012s        0s           
## 13 summ_regression1_lar… target 0.018s         0.004s        0.008s       
## 14 summ_regression1_sma… target 0.009s         0.008s        0s           
## 15 summ_regression2_lar… target 0.008s         0.008s        0s           
## 16 summ_regression2_sma… target 0.013s         0.008s        0s

For drake version 4.1.0 and earlier, build_times() measures only the elapsed runtime of each command in my_plan$command. In later versions, the build times also account for all the internal operations in drake:::build(), such as storage and hashing.
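Because the elapsed, user, and system columns are lubridate Duration objects (as shown in the output above), they coerce cleanly to numeric seconds. As a quick sketch, assuming the cache is already populated by make() and the dplyr package is available, you can rank targets by build time:

```r
library(drake)
library(dplyr)

# Sketch: rank targets by recorded build time, slowest first.
# Duration objects coerce to numeric seconds with as.numeric().
build_times(targets_only = TRUE) %>%
  mutate(seconds = as.numeric(elapsed)) %>%
  arrange(desc(seconds)) %>%
  head(3)
```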

12.1 Predict total runtime

drake uses these times to predict the runtime of the next make(). At the moment, everything in the current example is up to date, so the next make() should be fast. Below, we factor in only the times of the formal targets in the workflow plan, excluding any imports.

config <- drake_config(my_plan, verbose = FALSE)
predict_runtime(config, targets_only = TRUE)
## [1] "0s"

Suppose we change a dependency. If we modify the reg2() function, for example, every target downstream of reg2() falls out of date, so the next make() should take a little longer.

reg2 <- function(d){
  d$x3 <- d$x ^ 3
  lm(y ~ x3, data = d)
}

predict_runtime(config, targets_only = TRUE)
## [1] "0.21s"

But what if you plan on starting from scratch next time, either after clean() or with make(trigger = trigger(condition = TRUE))?

predict_runtime(config, from_scratch = TRUE, targets_only = TRUE)
## [1] "0.306s"
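As a rough sanity check (a sketch, not part of the original example), the from-scratch prediction with a single job should approximate the sum of the recorded target build times, since predict_runtime() draws on the same cached timing data:

```r
# Sum of recorded target build times, in seconds.
# With jobs = 1, this should be close to
# predict_runtime(config, from_scratch = TRUE, targets_only = TRUE).
times <- build_times(targets_only = TRUE)
sum(as.numeric(times$elapsed))
```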

12.2 Strategize your high-performance computing

Let’s say you are scaling up your workflow. You just put bigger data and heavier computation in your custom code, and the next time you run make(), your targets will take much longer to build. In fact, you estimate that every target except for your R Markdown report will take two hours to complete. Let’s write down these known times in seconds.

known_times <- c(5, rep(7200, nrow(my_plan) - 1))
names(known_times) <- c(file_store("report.md"), my_plan$target[-1])
known_times
##            "report.md"                  small                  large 
##                      5                   7200                   7200 
##      regression1_small      regression1_large      regression2_small 
##                   7200                   7200                   7200 
##      regression2_large summ_regression1_small summ_regression1_large 
##                   7200                   7200                   7200 
## summ_regression2_small summ_regression2_large coef_regression1_small 
##                   7200                   7200                   7200 
## coef_regression1_large coef_regression2_small coef_regression2_large 
##                   7200                   7200                   7200

How many parallel jobs should you use in the next make()? The predict_runtime() function can help you decide: predict_runtime(jobs = n) simulates persistent parallel workers and reports the estimated total runtime of make(jobs = n). (See also predict_load_balancing().)

# Preallocate the vector of predicted runtimes (in seconds).
time <- numeric(12)
for (jobs in seq_len(12)) {
  time[jobs] <- predict_runtime(
    config,
    jobs = jobs,
    from_scratch = TRUE,
    known_times = known_times
  )
}
library(ggplot2)
ggplot(data.frame(time = time / 3600, jobs = ordered(1:12), group = 1)) +
  geom_line(aes(x = jobs, y = time, group = group)) +
  scale_y_continuous(breaks = 0:10 * 4, limits = c(0, 29)) +
  theme_gray(16) +
  xlab("jobs argument of make()") +
  ylab("Predicted runtime of make() (hours)")

We see serious potential speed gains up to about 4 jobs, but beyond that point, we must double the number of jobs to shave off another 2 hours. Your choice of jobs for make() ultimately depends on the runtime you can tolerate and the computing resources at your disposal.
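One way to act on these predictions (a sketch using the `time` vector computed above) is to pick the smallest jobs value whose predicted runtime fits a budget:

```r
# Sketch: smallest number of jobs whose predicted runtime fits
# within an 8-hour budget. `time` is the vector from the loop above.
budget <- 8 * 3600 # seconds
candidates <- which(time <= budget)
if (length(candidates)) min(candidates) else NA
```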

A final note on predicting runtime: the output of predict_runtime() and predict_load_balancing() also depends on the optional workers column of your drake_plan(). If you micromanage which workers are allowed to build which targets, you may minimize reads from disk, but you could also slow down your workflow if you are not careful. See the high-performance computing guide for more.
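For illustration only, here is a hypothetical sketch of a plan with a workers column. The exact format of this column (and whether it is supported at all) varies across drake versions, so consult the high-performance computing guide for your version before relying on it.

```r
library(drake)

# Hypothetical sketch: pin targets to specific persistent workers.
# simulate() comes from the mtcars example loaded earlier.
plan <- drake_plan(
  large = simulate(64),
  small = simulate(48)
)
plan$workers <- c(1, 2) # e.g. build `large` on worker 1, `small` on worker 2
```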

Copyright Eli Lilly and Company