# Chapter 8 Finding the best model of gross state product

The following data analysis workflow shows off drake’s ability to generate lots of reproducibly-tracked tasks with ease. The same technique would be cumbersome, even intractable, with GNU Make.

## 8.1 Get the code.

Write the code files to your workspace.

drake_example("gsp")

The new gsp folder now includes a file structure of a serious drake project, plus an interactive-tutorial.R to narrate the example. The code is also online here.

## 8.2 Objective and methods

The goal is to search for factors closely associated with the productivity of states in the USA around the 1970s and 1980s. For the sake of simplicity, we use gross state product as a metric of productivity, and we restrict ourselves to multiple linear regression models with three variables. For each of the 84 possible models, we fit the data and then evaluate the root mean squared prediction error (RMSPE).

\begin{aligned} \text{RMSPE} = \sqrt{(\text{y} - \widehat{y})^T(y - \widehat{y})} \end{aligned} Here, $$y$$ is the vector of observed gross state products in the data, and $$\widehat{y}$$ is the vector of predicted gross state products under one of the models. We take the best variables to be the triplet in the model with the lowest RMSPE.

## 8.3 Data

The Produc dataset from the Ecdat package contains data on the Gross State Product from 1970 to 1986. Each row is a single observation on a single state for a single year. The dataset has the following variables as columns. See the references later in this report for more details.

• gsp: gross state product.
• state: the state.
• year: the year.
• pcap: private capital stock.
• hwy: highway and streets.
• water: water and sewer facilities.
• util: other public buildings and structures.
• pc: public capital.
• emp: labor input measured by the employment in non-agricultural payrolls.
• unemp: state unemployment rate.
library(Ecdat)
data(Produc)
#>     state year     pcap     hwy   water    util       pc   gsp    emp
#> 1 ALABAMA 1970 15032.67 7325.80 1655.68 6051.20 35793.80 28418 1010.5
#> 2 ALABAMA 1971 15501.94 7525.94 1721.02 6254.98 37299.91 29375 1021.9
#> 3 ALABAMA 1972 15972.41 7765.42 1764.75 6442.23 38670.30 31303 1072.3
#> 4 ALABAMA 1973 16406.26 7907.66 1742.41 6756.19 40084.01 33430 1135.5
#> 5 ALABAMA 1974 16762.67 8025.52 1734.85 7002.29 42057.31 33749 1169.8
#> 6 ALABAMA 1975 17316.26 8158.23 1752.27 7405.76 43971.71 33604 1155.4
#>   unemp
#> 1   4.7
#> 2   5.2
#> 3   4.7
#> 4   3.9
#> 5   5.5
#> 6   7.7

## 8.4 Analysis

First, we load the required packages. drake is aware of all the packages you load with library() or require().

library(biglm) # lightweight models, easier to store than with lm()
library(drake)
library(Ecdat) # econometrics datasets
library(ggplot2)
library(knitr)
library(purrr)
library(tidyverse)

Next, we construct our plan. The following code uses drake’s special new language for generating plans (learn more here).

predictors <- setdiff(colnames(Produc), "gsp")

# We will try all combinations of three covariates.
combos <- combn(predictors, 3) %>%
t() %>%
as.data.frame(stringsAsFactors = FALSE) %>%
setNames(c("x1", "x2", "x3"))

#>      x1   x2    x3
#> 1 state year  pcap
#> 2 state year   hwy
#> 3 state year water
#> 4 state year  util
#> 5 state year    pc
#> 6 state year   emp

# We need to list each covariate as a symbol.
for (col in colnames(combos)) {
combos[[col]] <- rlang::syms(combos[[col]])
}

# Requires drake >= 7.0.0 or the development version
# at github.com/ropensci/drake.
# Install with remotes::install_github("ropensci/drake").
plan <- drake_plan(
model = target(
biglm(gsp ~ x1 + x2 + x3, data = Ecdat::Produc),
transform = map(.data = !!combos) # Remember the bang-bang!!
),
rmspe_i = target(
get_rmspe(model, Ecdat::Produc),
transform = map(model)
),
rmspe = target(
bind_rows(rmspe_i, .id = "model"),
transform = combine(rmspe_i)
),
plot = ggsave(
filename = file_out("rmspe.pdf"),
plot = plot_rmspe(rmspe),
width = 8,
height = 8
),
report = knit(knitr_in("report.Rmd"), file_out("report.md"), quiet = TRUE)
)

plan
#> # A tibble: 171 x 2
#>    target                command
#>    <chr>                 <expr>
#>  1 model_state_year_pcap biglm(gsp ~ state + year + pcap, data = Ecdat::Pr…
#>  2 model_state_year_hwy  biglm(gsp ~ state + year + hwy, data = Ecdat::Pro…
#>  3 model_state_year_wat… biglm(gsp ~ state + year + water, data = Ecdat::P…
#>  4 model_state_year_util biglm(gsp ~ state + year + util, data = Ecdat::Pr…
#>  5 model_state_year_pc   biglm(gsp ~ state + year + pc, data = Ecdat::Prod…
#>  6 model_state_year_emp  biglm(gsp ~ state + year + emp, data = Ecdat::Pro…
#>  7 model_state_year_une… biglm(gsp ~ state + year + unemp, data = Ecdat::P…
#>  8 model_state_pcap_hwy  biglm(gsp ~ state + pcap + hwy, data = Ecdat::Pro…
#>  9 model_state_pcap_wat… biglm(gsp ~ state + pcap + water, data = Ecdat::P…
#> 10 model_state_pcap_util biglm(gsp ~ state + pcap + util, data = Ecdat::Pr…
#> # … with 161 more rows

We also need to define functions for summaries and plots.

get_rmspe <- function(model_fit, data){
y <- data$gsp yhat <- as.numeric(predict(model_fit, newdata = data)) terms <- attr(model_fit$terms, "term.labels")
tibble(
rmspe = sqrt(mean((y - yhat)^2)), # nolint
X1 = terms,
X2 = terms,
X3 = terms
)
}

plot_rmspe <- function(rmspe){
ggplot(rmspe) +
geom_histogram(aes(x = rmspe), bins = 15)
}

We have a report.Rmd file to summarize our results at the end.

drake_example("gsp")
file.copy(from = "gsp/report.Rmd", to = ".", overwrite = TRUE)
#>  TRUE

We can inspect the project before we run it.

config <- drake_config(plan)
vis_drake_graph(config)

Now, we can run the project.

make(plan, verbose = 0L)

## 8.5 Results

Here are the root mean squared prediction errors of all the models.

results <- readd(rmspe)
library(ggplot2)
plot_rmspe(rmspe = results) And here are the best models. The best variables are in the top row under X1, X2, and X3.

head(results[order(results\$rmspe, decreasing = FALSE), ])
#> # A tibble: 6 x 5
#>   model rmspe X1    X2    X3
#>   <chr> <dbl> <chr> <chr> <chr>
#> 1 17    2614. state hwy   emp
#> 2 21    2665. state water emp
#> 3 24    2666. state util  emp
#> 4 26    2666. state pc    emp
#> 5 12    2675. state pcap  emp
#> 6 28    2693. state emp   unemp

## 8.6 Comparison with GNU Make

If we were using Make instead of drake with the same set of targets, the analogous Makefile would look something like this pseudo-code sketch.

models = model_state_year_pcap.rds model_state_year_hwy.rds ... # 84 of these

model_%
Rscript -e 'saveRDS(lm(...), ...)'

rmspe_%: model_%
Rscript -e 'saveRDS(get_rmspe(...), ...)'

rmspe.rds: rmspe_%
Rscript -e 'saveRDS(dplyr::bind_rows(...), ...)'

rmspe.pdf: rmspe.rds
Rscript -e 'ggplot2::ggsave(plot_rmspe(readRDS("rmspe.rds")), "rmspe.pdf")'

report.md: report.Rmd
Rscript -e 'knitr::knit("report.Rmd")'


There are three main disadvantages to this approach.

1. Every target requires a new call to Rscript, which means that more time is spent initializing R sessions than doing the actual work.
2. The user must micromanage nearly one hundred output files (in this case, *.rds files), which is cumbersome, messy, and inconvenient. drake, on the other hand, automatically manages storage using a storr cache.
3. The user needs to write the names of the 84 models near the top of the Makefile, which is less convenient than maintaining a data frame in R.
Copyright Eli Lilly and Company