Chapter 7 How to organize the files of drake projects

Unlike most workflow managers, drake focuses on your R session, and it does not care how you organize your files. This flexibility is great in the long run, but it leaves many new users wondering how to structure their projects. This chapter provides guidance, advice, and recommendations on structure and organization.

7.1 Examples

For examples of how to structure your code files, see the beginner-oriented example projects:

You can write their code files to your working directory with the drake_example() function.

drake_example("mtcars")
drake_example("gsp")
drake_example("packages")

In practice, you do not need to organize your files the way the examples do, but they demonstrate a reasonable approach.

7.2 Where do you put your code?

It is best to write your code as a bunch of functions. You can save those functions in R scripts and then source() them before doing anything else.

# Load functions get_data(), analyze_data(), and summarize_results()
source("my_functions.R")
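
For reference, my_functions.R might contain definitions along these lines. The function bodies below are hypothetical sketches; your real functions will differ.

get_data <- function(file) {
  read.csv(file) # read the raw data from disk
}

analyze_data <- function(data) {
  lm(y ~ x, data = data) # fit a hypothetical model
}

summarize_results <- function(data, analysis) {
  summary(analysis) # condense the fit into a summary
}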

Then, set up your workflow plan data frame.

good_plan <- drake_plan(
  my_data = get_data(file_in("data.csv")), # External files need to be in commands explicitly. # nolint
  my_analysis = analyze_data(my_data),
  my_summaries = summarize_results(my_data, my_analysis)
)

good_plan
## # A tibble: 3 x 2
##   target       command                                
##   <chr>        <chr>                                  
## 1 my_data      "get_data(file_in(\"data.csv\"))"      
## 2 my_analysis  analyze_data(my_data)                  
## 3 my_summaries summarize_results(my_data, my_analysis)

drake knows that my_analysis depends on my_data because my_data is an argument to analyze_data(), which is part of the command for my_analysis.
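
If you want to double-check what drake detects, deps_code() reports the dependencies of a piece of code (output omitted here; its exact format depends on your version of drake).

deps_code(quote(analyze_data(my_data)))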

config <- drake_config(good_plan)
## Unloading targets from environment:
##   my_summaries
vis_drake_graph(config)

Now, you can call make() to build the targets.

make(good_plan)

If your commands are really long, just put them in larger functions. drake analyzes imported functions for non-file dependencies.
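
For example, instead of one sprawling command, you might write something like this. The helper functions clean_data(), fit_model(), and postprocess() are hypothetical.

my_analysis_step <- function(data) {
  cleaned <- clean_data(data) # hypothetical helper
  model <- fit_model(cleaned) # hypothetical helper
  postprocess(model)          # hypothetical helper
}

plan <- drake_plan(
  my_data = get_data(file_in("data.csv")),
  my_analysis = my_analysis_step(my_data)
)

Because drake analyzes imported functions such as my_analysis_step() for dependencies, edits to clean_data(), fit_model(), or postprocess() will invalidate my_analysis.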

7.3 Your commands are code chunks, not R scripts

Some people are accustomed to dividing their work into R scripts and then calling source() to run each step of the analysis. For example, you might have the following files.

  • get_data.R
  • analyze_data.R
  • summarize_results.R

If you migrate to drake, you may be tempted to set up a workflow plan like this.

bad_plan <- drake_plan(
  my_data = source(file_in("get_data.R")),
  my_analysis = source(file_in("analyze_data.R")),
  my_summaries = source(file_in("summarize_results.R"))
)

bad_plan
## # A tibble: 3 x 2
##   target       command                                
##   <chr>        <chr>                                  
## 1 my_data      "source(file_in(\"get_data.R\"))"      
## 2 my_analysis  "source(file_in(\"analyze_data.R\"))"  
## 3 my_summaries "source(file_in(\"summarize_results.R\"))"

But now, the dependency structure of your work is broken. Your R scripts are file dependencies of the targets, but since my_data is not mentioned in any command or function, drake does not know that my_analysis depends on it.

config <- drake_config(bad_plan)
vis_drake_graph(config)

Dangers:

  1. In the first make(bad_plan, jobs = 2), drake will try to build my_data and my_analysis at the same time even though my_data must finish before my_analysis begins.
  2. drake is oblivious to data.csv since it is not explicitly mentioned in a workflow plan command. So when data.csv changes, make(bad_plan) will not rebuild my_data.
  3. my_analysis will not update when my_data changes.
  4. The return value of source() is formatted counter-intuitively. If source(file_in("get_data.R")) is the command for my_data, then my_data will always be a list with elements "value" and "visible". In other words, source(file_in("get_data.R"))$value is really what you would want. See the short demonstration after this list.
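
To see point 4 in action, suppose get_data.R ends with the expression read.csv("data.csv"). Then:

out <- source("get_data.R")
str(out)    # a list with elements $value and $visible
out$value   # the data frame you actually wanted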

In addition, this source()-based approach is simply inconvenient. drake rebuilds my_data every time get_data.R changes, even when those changes are just extra comments or blank lines. On the other hand, in the previous plan that uses my_data = get_data(), drake does not trigger rebuilds when comments or whitespace in get_data() are modified. drake is R-focused, not file-focused. If you embrace this viewpoint, your work will be easier.
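
As a hypothetical illustration, suppose you already ran make(good_plan) and then re-saved get_data() with only a cosmetic edit. Because drake tracks a standardized deparsed form of your functions, my_data should remain up to date.

get_data <- function(file) {
  # a new comment does not change the standardized function body
  read.csv(file)
}
outdated(drake_config(good_plan)) # my_data should not appear here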

7.4 Workflows as R packages

The R package structure is a great way to organize the files of your project. Writing your own package to contain your data science workflow is a good idea, but you will need to

  1. Use expose_imports() to properly account for all your nested function dependencies, and
  2. If you load the package with devtools::load_all(), set the prework argument of make(), e.g. make(prework = "devtools::load_all()"). A sketch combining both steps appears after this list.
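
Putting both steps together, a minimal sketch might look like this. Here, yourWorkflowPackage is a hypothetical package name; substitute your own.

devtools::load_all()                # load the package under development
expose_imports(yourWorkflowPackage) # expose nested functions to drake
make(plan, prework = "devtools::load_all()")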

Thanks to Jasper Clarkberg for the workaround behind expose_imports().

7.4.1 Advantages of putting workflows in R packages

7.4.2 The problem

For drake, there is one problem: nested functions. drake always looks for imported functions nested in other imported functions, but only in your environment. When it sees a function from a package, it does not look in its body for other imports.

To see this, consider the digest() function from the digest package. The digest package is a utility for computing hashes, not a data science workflow, but I will use it to demonstrate how drake treats imports from packages.

library(digest)
g <- function(x) {
  digest(x)
}
f <- function(x) {
  g(x)
}
plan <- drake_plan(x = f(1))

# Here are the reproducibly tracked objects in the workflow.
config <- drake_config(plan)
tracked(config)
## [1] "f" "g" "x"

# But the `digest()` function has dependencies too.
# Because `drake` knows `digest()` is from a package,
# it ignores these dependencies by default.
head(deps_code(digest), 10)
## $globals
##  [1] "match.arg"                "stop"                    
##  [3] "warning"                  "return"                  
##  [5] "invisible"                "is.infinite"             
##  [7] "is.character"             "missing"                 
##  [9] "names"                    "formals"                 
## [11] "any"                      "is.na"                   
## [13] "pmatch"                   "which"                   
## [15] "as.raw"                   "inherits"                
## [17] "paste"                    "switch"                  
## [19] "path.expand"              "file.exists"             
## [21] "isTRUE"                   "file.info"               
## [23] "file.access"              ".Call"                   
## [25] "digest_impl"              "as.integer"              
## [27] ".getCRC32PreferOldOutput" "sub"                     
## 
## $namespaced
## [1] "base::serialize"

7.4.3 The solution

To force drake to dive deeper into the nested functions in a package, you must use expose_imports(). Again, I demonstrate with the digest package, but you should really only do this with a package you write yourself to contain your workflow. For external packages, packrat is a much better solution for package reproducibility.

expose_imports(digest)
## <environment: R_GlobalEnv>
config <- drake_config(plan)
new_objects <- tracked(config)
head(new_objects, 10)
## [1] ".getCRC32PreferOldOutput" "base::serialize"         
## [3] "digest"                   "digest_impl"             
## [5] "f"                        "g"                       
## [7] "x"
length(new_objects)
## [1] 7

# Now when you call `make()`, `drake` will dive into `digest`
# to import dependencies.

cache <- storr::storr_environment() # just for examples
make(plan, cache = cache)
## target x
head(cached(cache = cache), 10)
## [1] "base::serialize" "digest"          "digest_impl"     "f"              
## [5] "g"               "x"
length(cached(cache = cache))
## [1] 6