Chapter 12 Time: logging, prediction, and strategy

Thanks to Jasper Clarkberg, drake records how long it takes to build each target. For large projects that take hours or days to run, these recorded build times become important for planning and scheduling.

library(drake)
load_mtcars_example() # Get the code with drake_example("mtcars").
make(my_plan)
#> target large
#> target small
#> target regression1_large
#> target regression2_large
#> target regression1_small
#> target regression2_small
#> target summ_regression1_large
#> target coef_regression1_large
#> target summ_regression2_large
#> target coef_regression2_large
#> target summ_regression1_small
#> target coef_regression1_small
#> target coef_regression2_small
#> target summ_regression2_small
#> target report

build_times(digits = 8) # From the cache.
#> # A tibble: 15 x 4
#>    target                 elapsed        user           system        
#>    <chr>                  <S4: Duration> <S4: Duration> <S4: Duration>
#>  1 coef_regression1_large 0.004s         0.004s         0s            
#>  2 coef_regression1_small 0.004s         0.004s         0s            
#>  3 coef_regression2_large 0.003s         0s             0s            
#>  4 coef_regression2_small 0.005s         0.004s         0s            
#>  5 large                  0.005s         0.004s         0s            
#>  6 regression1_large      0.005s         0.004s         0s            
#>  7 regression1_small      0.005s         0.004s         0.004s        
#>  8 regression2_large      0.005s         0.004s         0s            
#>  9 regression2_small      0.004s         0.004s         0s            
#> 10 report                 0.066s         0.064s         0s            
#> 11 small                  0.004s         0.004s         0s            
#> 12 summ_regression1_large 0.004s         0s             0s            
#> 13 summ_regression1_small 0.004s         0s             0.004s        
#> 14 summ_regression2_large 0.004s         0.004s         0s            
#> 15 summ_regression2_small 0.004s         0.004s         0s

## `dplyr`-style `tidyselect` commands
build_times(starts_with("coef"), digits = 8)
#> # A tibble: 4 x 4
#>   target                 elapsed        user           system        
#>   <chr>                  <S4: Duration> <S4: Duration> <S4: Duration>
#> 1 coef_regression1_large 0.004s         0.004s         0s            
#> 2 coef_regression1_small 0.004s         0.004s         0s            
#> 3 coef_regression2_large 0.003s         0s             0s            
#> 4 coef_regression2_small 0.005s         0.004s         0s
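Because build_times() returns a tibble whose time columns are lubridate Duration objects, you can coerce them to numeric seconds and rank targets by cost. A minimal sketch (the seconds column is just an illustrative name):

times <- build_times()
times$seconds <- as.numeric(times$elapsed) # Durations coerce to seconds.
head(times[order(times$seconds, decreasing = TRUE), ], 3) # Slowest targets.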

12.1 Predict total runtime

drake uses these times to predict the runtime of the next make(). Right now, everything in the current example is up to date, so the next make() should take almost no time at all (apart from preprocessing overhead).

config <- drake_config(my_plan, verbose = FALSE)
predict_runtime(config)
#> [1] "0s"

Suppose we change a dependency to invalidate some targets. Now, the next make() should take longer because those targets need to rebuild.

reg2 <- function(d){
  d$x3 <- d$x ^ 3
  lm(y ~ x3, data = d)
}
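To confirm which targets the new reg2() invalidates, we could first check outdated() (output not shown; it should list the regression2 targets and their downstream dependents):

outdated(config) # Character vector of targets no longer up to date.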

predict_runtime(config)
#> [1] "0.091s"

And what if you plan to delete the cache and build all the targets from scratch?

predict_runtime(config, from_scratch = TRUE)
#> [1] "0.126s"

12.2 Strategize your high-performance computing

Let’s say you are scaling up your workflow. You just put bigger data and heavier computation in your custom code, and the next time you run make(), your targets will take much longer to build. In fact, you estimate that every target except for your R Markdown report will take two hours to complete. Let’s write down these known times in seconds.

known_times <- rep(7200, nrow(my_plan))
names(known_times) <- my_plan$target
known_times["report"] <- 5
known_times
#>                 report                  small                  large 
#>                      5                   7200                   7200 
#>      regression1_small      regression1_large      regression2_small 
#>                   7200                   7200                   7200 
#>      regression2_large summ_regression1_small summ_regression1_large 
#>                   7200                   7200                   7200 
#> summ_regression2_small summ_regression2_large coef_regression1_small 
#>                   7200                   7200                   7200 
#> coef_regression1_large coef_regression2_small coef_regression2_large 
#>                   7200                   7200                   7200
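You do not have to supply a time for every target. predict_runtime() falls back to the times recorded in the cache, and its default_time argument covers targets that have never been built. A hedged sketch, assuming only the large regressions slow down:

slow <- c(regression1_large = 7200, regression2_large = 7200)
predict_runtime(
  config,
  from_scratch = TRUE,
  known_times = slow, # Overrides cached times for these targets only.
  default_time = 1    # Assumed seconds for targets with no recorded time.
)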

How many parallel jobs should you use in the next make()? The predict_runtime() function can help you decide. predict_runtime(jobs = n) simulates persistent parallel workers and reports the estimated total runtime of make(jobs = n). (See also predict_workers().)

time <- c()
for (jobs in 1:12){
  time[jobs] <- predict_runtime(
    config,
    jobs = jobs,
    from_scratch = TRUE,
    known_times = known_times
  )
}
library(ggplot2)
ggplot(data.frame(time = time / 3600, jobs = ordered(1:12), group = 1)) +
  geom_line(aes(x = jobs, y = time, group = group)) +
  scale_y_continuous(breaks = 0:10 * 4, limits = c(0, 29)) +
  theme_gray(16) +
  xlab("jobs argument of make()") +
  ylab("Predicted runtime of make() (hours)")

We see serious potential speed gains up to 4 jobs, but beyond that point, we have to double the jobs to shave off another 2 hours. Your choice of jobs for make() ultimately depends on the runtime you can tolerate and the computing resources at your disposal.
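Once you settle on a number of jobs, predict_workers() shows how drake expects to assign targets to the persistent workers. A quick sketch with 4 jobs, assuming its arguments mirror predict_runtime():

predict_workers(
  config,
  jobs = 4,
  from_scratch = TRUE,
  known_times = known_times
) # A data frame assigning each target to a worker.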

A final note on predicting runtime: the output of predict_runtime() and predict_workers() also depends on the optional workers column of your drake_plan(). If you micromanage which workers are allowed to build which targets, you may minimize reads from disk, but you could also slow down your workflow if you are not careful. See the high-performance computing guide for more.
