Chapter 5 Standards [SEEKING FEEDBACK]

This Chapter is divided between:

  • “General Standards”, which may be applied to all software considered within this project, irrespective of how it may be categorized under the categories of statistical software listed above; and

  • “Specific Standards”, which apply to different degrees to statistical software depending on the software category.

It is likely that some standards initially developed under the first, general category may subsequently prove to belong more properly among the category-specific standards, and conversely that the development of category-specific standards will reveal aspects common to all categories which may subsequently be deemed general standards. We accordingly anticipate a degree of fluidity between these two broad categories.

There is also a necessary relationship between the Standards described here and the processes of Assessment described below in Chapter 8. We consider the latter to describe concrete and generally quantitative aspects of post hoc software assessment, while the present Standards provide guides and benchmarks against which to prospectively compare software during development. As this entire document is intended to serve as the defining reference for our Standards, that term may in turn be interpreted to refer to this entire document, with the current chapter explicitly describing aspects of Standards not covered elsewhere.

As described above, we anticipate the ongoing development of this document to employ a versioning system, with software reviewed and hosted under the system mandated to flag the latest version of these standards with which it complies.

5.1 Other Standards

Among the noteworthy instances of software standards which might be adapted for our purposes, and in addition to entries in our Annotated Bibliography, the following are particularly relevant:

  1. The Core Infrastructure Initiative’s Best Practices Badge, which is granted to software meeting an extensive list of criteria. This list of criteria provides a singularly useful reference for software standards.
  2. The Software Sustainability Institute’s Software Evaluation Guide, in particular their guide to Criteria-based software evaluation, which considers two primary categories, Usability and Sustainability and Maintainability, each of which is divided into numerous sub-categories. The guide identifies numerous concrete criteria for each sub-category, explicitly detailed below in order to provide an example of the kind of standards that might be adapted and developed for application to the present project.
  3. The Transparent Statistics Guidelines, by the “HCI (Human Computer Interaction) Working Group”. While currently only in its beginning phases, that document aims to provide concrete guidance on “transparent statistical communication.” If its development continues, it is likely to provide useful guidelines on best practices for how statistical software produces and reports results.
  4. The more technical considerations of the Object Management Group’s Automated Source Code CISQ Maintainability Measure (where CISQ refers to the Consortium for IT Software Quality). This guide describes a number of measures which can be automatically extracted and used to quantify the maintainability of source code. While all of these measures are already considered in one or both of the preceding two documents, the identification of measures particularly amenable to automated assessment makes this an especially useful reference.

There is also rOpenSci’s guide on package development, maintenance, and peer review, which provides standards of this type for R packages, primarily within its first chapter. Another notable example is the tidyverse design guide, along with the section on Conventions for R Modeling Packages, which provides guidance for model-fitting APIs.

Specific standards for neural network algorithms have also been developed as part of a Google Summer of Code project in 2019, resulting in a dedicated R package, NNbenchmark, and accompanying results (their so-called “notebooks”) of applying their benchmarks to a suite of neural network packages.

5.2 General Standards for Statistical Software


These standards refer to Data Types as the fundamental types defined by the R language itself, distinguishing between the following:

  • Logical
  • Integer
  • Continuous (class = "numeric" / typeof = "double")
  • Complex
  • String / character

The base R system also includes what are considered here to be direct extensions of fundamental types to include:

  • Factor
  • Ordered Factor
  • Date/Time

The continuous type has a typeof of “double” because that represents the storage mode in the C representation of such objects, while the class as defined within R is referred to as “numeric”. While typeof is not the same as class, with reference to continuous variables, “numeric” may be considered identical to “double” throughout.

The term “character” is interpreted here to refer to a vector each element of which is an individual “character” object. The term “string” does not relate to any official R nomenclature, but is used here to refer for convenience to a character vector of length one; in other words, a “string” is the sole element of a single-length “character” vector.
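These distinctions can be seen directly in base R:

class (1.0)
#> [1] "numeric"
typeof (1.0)
#> [1] "double"

s <- "hello" # a "string": the sole element of a length-one character vector
class (s)
#> [1] "character"
length (s)
#> [1] 1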



Examples of application of the following standards may be viewed as separate hackmd.io files.

Each of those files compares both general and category-specific standards against selected R packages within those categories. These comparisons are intended for illustrative purposes only, and are in no way intended to represent evaluations of the software. They are presented in the hope of demonstrating how the standards presented here may be applied to software, and what the results of such application may look like.

5.2.1 Documentation

  • G1.0 Statistical Software should list at least one primary reference from published academic literature.

We consider that statistical software submitted under our system will either (i) implement or extend prior methods, in which case the primary reference will be to the most relevant published version(s) of prior methods; or (ii) be an implementation of some new method. In the second case, it will be expected that the software will eventually form the basis of an academic publication. Until that time, the most suitable reference for equivalent algorithms or implementations should be provided.

5.2.1.1 Statistical Terminology

  • G1.1 All statistical terminology should be clarified and unambiguously defined.

Developers should not presume anywhere in the documentation of software that specific statistical terminology may be “generally understood”, and therefore not need explicit clarification. Even terms which many may consider sufficiently generic as to not require such clarification, such as “null hypotheses” or “confidence intervals”, will generally need explicit clarification. For example, both the estimation and interpretation of confidence intervals are dependent on distributional properties and associated assumptions. Any particular implementation of procedures to estimate or report on confidence intervals will accordingly reflect assumptions on distributional properties (among other aspects), both the nature and implications of which must be explicitly clarified.

Standards will include requirements for form and completeness of documentation. As with interfaces, several sources already provide starting points for reasonable documentation. Some documentation requirements will be specific to the statistical context. For instance, it is likely we will have requirements for referencing appropriate literature or references for theoretical support of implementations. Another area of importance is correctness and clarity of definitions of statistical quantities produced by the software, e.g., the definition of null hypotheses or confidence intervals. Data included in software – that used in examples or tests – will also have documentation requirements. It is worth noting that the roxygen2 system for documenting R packages is readily extensible, as exemplified through the roxytest package for specifying tests in-line.

5.2.1.2 Function-level Documentation

  • G1.2 Software should use roxygen2 to document all functions.
    • G1.2a All internal (non-exported) functions should also be documented in standard roxygen2 format, along with a final @noRd tag to suppress automatic generation of .Rd files.
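A minimal illustration of an internal function documented in this format, with the @noRd tag suppressing generation of an .Rd file; the function itself is purely illustrative.

#' Rescale a numeric vector to the unit interval
#'
#' @param x A numeric vector.
#' @return `x` rescaled to lie within [0, 1].
#' @noRd
rescale_unit <- function (x) {
    (x - min (x)) / diff (range (x))
}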

5.2.1.3 Supplementary Documentation

The following standards describe several forms of what might be considered “Supplementary Material”. While there are many places within an R package where such material may be included, common locations include vignettes, or in additional directories (such as data-raw) listed in .Rbuildignore to prevent inclusion within installed packages.

Where software supports a publication, and claims are made in that publication with regard to software performance (for example, claims of algorithmic scaling or efficiency, or claims of accuracy), the following standard applies:

  • G1.3 Software should include all code necessary to reproduce results which form the basis of performance claims made in associated publications.

Where claims regarding aspects of software performance are made with respect to other extant R packages, the following standard applies:

  • G1.4 Software should include code necessary to compare performance claims with alternative implementations in other R packages.

5.2.2 Input Structures

This section considers general standards for Input Structures. These standards may often effectively be addressed through implementing class structures, although this is not a general requirement. Developers are nevertheless encouraged to examine the guide to S3 vectors in the vctrs package as an example of the kind of assurances and validation checks that are possible with regard to input data. Systems like those demonstrated in that vignette provide a very effective way to ensure that software remains robust to diverse and unexpected classes and types of input data.

5.2.2.1 Uni-variate (Vector) Input

It is important to note for univariate data that single values in R are vectors with a length of one; R has no separate scalar type, so a single value such as 1 has the same fundamental structure as a longer vector such as 1:n. Given this, inputs expected to be univariate should:

  • G2.0 Implement assertions on lengths of inputs, particularly through asserting that inputs expected to be single- or multi-valued are indeed so.
    • G2.0a Provide explicit secondary documentation of any expectations on lengths of inputs
  • G2.1 Implement assertions on types of inputs (see the initial point on nomenclature above).
    • G2.1a Provide explicit secondary documentation of expectations on data types of all vector inputs.
  • G2.2 Appropriately prohibit or restrict submission of multivariate input to parameters expected to be univariate.
  • G2.3 For univariate character input:
    • G2.3a Use match.arg() or equivalent where applicable to only permit expected values.
    • G2.3b Either: use tolower() or equivalent to ensure input of character parameters is not case dependent; or explicitly document that parameters are strictly case-sensitive.
  • G2.4 Provide appropriate mechanisms to convert between different data types, potentially including:
    • G2.4a explicit conversion to integer via as.integer()
    • G2.4b explicit conversion to continuous via as.numeric()
    • G2.4c explicit conversion to character via as.character() (and not paste or paste0)
    • G2.4d explicit conversion to factor via as.factor()
    • G2.4e explicit conversion from factor via as...() functions
  • G2.5 Where inputs are expected to be of factor type, secondary documentation should explicitly state whether these should be ordered or not, and those inputs should provide appropriate error or other routines to ensure inputs follow these expectations.
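The following sketch illustrates how several of these standards (G2.0, G2.1, and G2.3) might be addressed in a hypothetical function accepting one single-valued numeric parameter and one character parameter; all names are illustrative only.

my_stat <- function (x, method = c ("exact", "approximate")) {
    if (length (x) != 1L) {            # G2.0: assert single-valued input
        stop ("'x' must be a single value")
    }
    if (!is.numeric (x)) {             # G2.1: assert input type
        stop ("'x' must be numeric")
    }
    # G2.3a: permit only expected values; G2.3b: tolower() renders matching case-independent
    method <- match.arg (tolower (method), c ("exact", "approximate"))
    list (x = x, method = method)      # placeholder for the actual computation
}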

A few packages implement R versions of “static type” forms common in other languages, whereby the type of a variable must be explicitly specified prior to assignment. Use of such approaches is encouraged, including but not restricted to approaches documented in packages such as vctrs, or the experimental package typed.

5.2.2.2 Tabular Input

This sub-section concerns input in “tabular data” forms, implying the two primary distinctions within R itself between array or matrix representations, and data.frame and associated representations. Among the important differences between these two forms is that array/matrix classes are restricted to storing data of a single uniform type (for example, all integer or all character values), whereas data.frame and associated representations store each column as a list item, allowing different columns to hold values of different types. Note also that a matrix may, as of R version 4.0, be considered as a strictly two-dimensional array.

Given this, tabular inputs for the purposes of these standards are considered to be data represented in one or more of the following forms:

  • matrix form when referring to specifically two-dimensional data of one uniform type
  • array form as a more general expression, or when referring to data that are not necessarily or strictly two-dimensional
  • data.frame
  • Extensions such as tibble or data.table classes

Both matrix and array forms are actually stored as vectors with a single storage.mode, and so all of the preceding standards G2.0–G2.5 apply. The other rectangular forms are not stored as vectors, and do not necessarily have a single storage.mode for all columns. These forms are referred to throughout these standards as “data.frame-type tabular forms”, which may be assumed to refer to data represented in the base::data.frame format and/or any of the classes listed in the final of the above points.

General Standards applicable to software which is intended to accept any one or more of these data.frame-type tabular inputs are then that:

  • G2.6 Software should accept as input as many of the above standard tabular forms as possible, including extension to domain-specific forms.
  • G2.7 Software should provide appropriate conversion or dispatch routines as part of initial pre-processing to ensure that all other sub-functions of a package receive inputs of a single defined class or type.
  • G2.8 Software should issue diagnostic messages for type conversion in which information is lost (such as conversion of variables from factor to character; standardisation of variable names; or removal of meta-data such as those associated with sf-format data) or added (such as insertion of variable or column names where none were provided).

The next standard concerns the following inconsistencies between three common tabular classes in regard to the column-extraction operator, [.

class (x) # x is any kind of `data.frame` object
#> [1] "data.frame"
class (x [, 1])
#> [1] "integer"
class (x [, 1, drop = TRUE]) # default
#> [1] "integer"
class (x [, 1, drop = FALSE])
#> [1] "data.frame"

x <- tibble::tibble (x)
class (x [, 1])
#> [1] "tbl_df"     "tbl"        "data.frame"
class (x [, 1, drop = TRUE])
#> [1] "integer"
class (x [, 1, drop = FALSE]) # default
#> [1] "tbl_df"     "tbl"        "data.frame"

x <- data.table::data.table (x)
class (x [, 1])
#> [1] "data.table" "data.frame"
class (x [, 1, drop = TRUE]) # no effect
#> [1] "data.table" "data.frame"
class (x [, 1, drop = FALSE]) # default
#> [1] "data.table" "data.frame"
  • Extracting a single column from a data.frame returns a vector by default, and a data.frame if drop = FALSE.
  • Extracting a single column from a tibble returns a single-column tibble by default, and a vector if drop = TRUE.
  • Extracting a single column from a data.table always returns a data.table, and the drop argument has no effect.

Given such inconsistencies,

  • G2.9 Software should ensure that extraction or filtering of single columns from tabular inputs does not presume any particular default behaviour, and that all column-extraction operations behave consistently regardless of the class of tabular data used as input.

Adherence to the above standard G2.9 will ensure that any implicitly or explicitly assumed default behaviour will yield consistent results regardless of input classes.
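One simple way to satisfy G2.9 is to avoid relying on the defaults of [ altogether, for example by extracting columns with [[, which returns the underlying vector for data.frame, tibble, and data.table objects alike. The following sketch (with an illustrative function name) demonstrates this approach:

col_as_vector <- function (x, i) {
    x [[i]]   # returns the column as a plain vector for all tabular classes above
}

x <- data.frame (a = 1:5, b = letters [1:5])
class (col_as_vector (x, 1))
#> [1] "integer"
class (col_as_vector (tibble::as_tibble (x), 1))
#> [1] "integer"
class (col_as_vector (data.table::as.data.table (x), 1))
#> [1] "integer"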

5.2.2.3 Missing or Undefined Values

  • G2.10 Statistical Software should implement appropriate checks for missing data as part of initial pre-processing prior to passing data to analytic algorithms.
  • G2.11 Where possible, all functions should provide options for users to specify how to handle missing (NA) data, with options minimally including:
    • G2.11a error on missing data
    • G2.11b ignore missing data with default warnings or messages issued
    • G2.11c replace missing data with appropriately imputed values
  • G2.12 Functions should never assume non-missingness, and should never pass data with potential missing values to any base routines with default na.rm = FALSE-type parameters (such as mean(), sd() or cor()).
  • G2.13 All functions should also provide options to handle undefined values (e.g., NaN, Inf and -Inf), including potentially ignoring or removing such values.
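A sketch of how the minimal options of G2.11 might be exposed through a single parameter; the function and parameter names are illustrative only, and the imputation shown is deliberately simplistic.

my_mean <- function (x, na_action = c ("error", "ignore", "impute")) {
    na_action <- match.arg (na_action)
    if (anyNA (x)) {
        if (na_action == "error") {
            stop ("'x' contains missing values")          # G2.11a
        } else if (na_action == "ignore") {
            warning ("missing values removed from 'x'")   # G2.11b
            x <- x [!is.na (x)]
        } else {
            x [is.na (x)] <- mean (x, na.rm = TRUE)       # G2.11c: simple mean imputation
        }
    }
    mean (x)   # G2.12: data passed on here can no longer contain missing values
}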

5.2.3 Output Structures

  • G3.0 Statistical Software which enables outputs to be written to local files should parse parameters specifying file names to ensure appropriate file suffixes are automatically generated where not provided.

5.2.4 Testing

All packages should follow rOpenSci standards on testing and continuous integration, including aiming for high test coverage. Extant R packages which may be useful for testing include testthat, tinytest, roxytest, and xpectr.

5.2.4.1 Test Data Sets

  • G4.0 Where applicable or practicable, tests should use standard data sets with known properties (for example, the NIST Standard Reference Datasets, or data sets provided by other widely-used R packages).
  • G4.1 Data sets created within, and used to test, a package should be exported (or otherwise made generally available) so that users can confirm tests and run examples.

5.2.4.2 Responses to Unexpected Input

  • G4.2 Appropriate error and warning behaviour of all functions should be explicitly demonstrated through tests. In particular,
    • G4.2a Every message produced within R code by stop(), warning(), message(), or equivalent should be unique
    • G4.2b Explicit tests should demonstrate conditions which trigger every one of those messages, and should compare the result with expected values.
  • G4.3 For functions which are expected to return objects containing no missing (NA) or undefined (NaN, Inf) values, the absence of any such values in return objects should be explicitly tested.
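The following testthat sketch illustrates these standards using the illustrative my_mean() function sketched above (under G2.11): expected message texts are compared explicitly (G4.2b), and return values are checked for missing or undefined values (G4.3).

library (testthat)

test_that ("error and warning conditions are triggered", {
    x <- c (1, 2, NA)
    expect_error (my_mean (x, na_action = "error"),
                  "'x' contains missing values")
    expect_warning (my_mean (x, na_action = "ignore"),
                    "missing values removed from 'x'")
})

test_that ("return values contain no missing or undefined values", {
    res <- my_mean (c (1, 2, NA), na_action = "impute")
    expect_false (any (is.na (res)))      # no NA values in return object
    expect_true (all (is.finite (res)))   # no NaN, Inf, or -Inf values
})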

5.2.4.3 Algorithm Tests

For testing statistical algorithms, tests should include tests of the following types:

  • G4.4 Correctness tests to test that statistical algorithms produce expected results to some fixed test data sets (potentially through comparisons using binding frameworks such as RStata).
    • G4.4a For new methods, it can be difficult to separate out correctness of the method from correctness of the implementation, as there may not be a reference for comparison. In this case, testing may be implemented against simple, trivial cases, or against multiple implementations, such as an initial R implementation compared with results from a C/C++ implementation.
    • G4.4b For new implementations of existing methods, correctness tests should include tests against previous implementations. Such testing may explicitly call those implementations in testing, preferably from fixed-versions of other software, or use stored outputs from those where that is not possible.
    • G4.4c Where applicable, stored values may be drawn from published paper outputs where code from original implementations is not available
  • G4.5 Correctness tests should be run with a fixed random seed
  • G4.6 Parameter recovery tests to test that the implementation produces expected results given data with known properties. For instance, a linear regression algorithm should return expected coefficient values for a simulated data set generated from a linear model.
    • G4.6a Parameter recovery tests should generally be expected to succeed within a defined tolerance rather than recovering exact values.
    • G4.6b Parameter recovery tests should be run with multiple random seeds when either data simulation or the algorithm contains a random component. (When long-running, such tests may be part of an extended, rather than regular, test suite; see G4.10-4.12, below).
  • G4.7 Algorithm performance tests to test that implementation performs as expected as properties of data change. For instance, a test may show that parameters approach correct estimates within tolerance as data size increases, or that convergence times decrease for higher convergence thresholds.
  • G4.8 Edge condition tests to test that these conditions produce expected behaviour such as clear warnings or errors when confronted with data with extreme properties including but not limited to:
    • G4.8a Zero-length data
    • G4.8b Data of unsupported types (e.g., character or complex numbers for functions designed only for numeric data)
    • G4.8c Data with all-NA fields or columns or all identical fields or columns
    • G4.8d Data outside the scope of the algorithm (for example, data with more fields (columns) than observations (rows) for some regression algorithms)
  • G4.9 Noise susceptibility tests Packages should test for expected stochastic behaviour, such as through the following conditions:
    • G4.9a Adding trivial noise (for example, at the scale of .Machine$double.eps) to data does not meaningfully change results
    • G4.9b Running under different random seeds or initial conditions does not meaningfully change results
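As an illustration of correctness (G4.4), parameter recovery within tolerance (G4.6a), and noise susceptibility (G4.9a) tests, the following testthat sketch fits a simple linear model to simulated data; the data and tolerances are illustrative only.

library (testthat)

test_that ("coefficients are recovered and robust to trivial noise", {
    set.seed (1)                               # G4.5: fixed random seed
    x <- seq (0, 1, length.out = 100)
    y <- 2 + 3 * x + rnorm (100, sd = 0.01)

    coefs <- coef (lm (y ~ x))
    # G4.6a: recovery within tolerance rather than exact values
    expect_equal (unname (coefs), c (2, 3), tolerance = 0.1)

    # G4.9a: noise at the scale of machine precision should not change results
    y_noisy <- y + rnorm (100, sd = .Machine$double.eps)
    coefs_noisy <- coef (lm (y_noisy ~ x))
    expect_equal (coefs, coefs_noisy, tolerance = 1e-8)
})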

5.2.4.4 Extended tests

Thorough testing of statistical software may require tests on large data sets, tests with many permutations, or other conditions leading to long-running tests. In such cases it may be neither possible nor advisable to execute tests continuously, or with every code change. Software should nevertheless test any and all conditions regardless of how long tests may take, and in doing so should adhere to the following standards:

  • G4.10 Extended tests should be included and run under a common framework with other tests, but be switched on by flags such as a <MYPKG>_EXTENDED_TESTS=1 environment variable.
  • G4.11 Where extended tests require large data sets or other assets, these should be provided for downloading and fetched as part of the testing workflow.
    • G4.11a When any downloads of additional data necessary for extended tests fail, the tests themselves should not fail, rather be skipped and implicitly succeed with an appropriate diagnostic message.
  • G4.12 Any conditions necessary to run extended tests such as platform requirements, memory, expected runtime, and artefacts produced that may need manual inspection, should be described in developer documentation such as a CONTRIBUTING.md or tests/README.md file.
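A sketch of how an extended test might be switched on via an environment variable (G4.10), and gracefully skipped rather than failed when a required download is not available (G4.11a); the environment variable name and URL are illustrative only.

library (testthat)

test_that ("extended test on large reference data set", {
    # G4.10: only run when the extended-tests flag is set
    skip_if (Sys.getenv ("MYPKG_EXTENDED_TESTS") != "1",
             "extended tests not enabled")

    # G4.11a: skip, rather than fail, if the data can not be downloaded
    f <- tempfile (fileext = ".csv")
    res <- tryCatch (
        suppressWarnings (download.file ("https://example.org/large-test-data.csv",
                                         f, quiet = TRUE)),
        error = function (e) 1L
    )
    skip_if (res != 0L, "could not download extended test data")

    dat <- read.csv (f)
    expect_true (nrow (dat) > 0L)
})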

5.3 Bayesian and Monte Carlo Software

Click on the following link to view a demonstration Application of Bayesian and Monte Carlo Standards.

Bayesian and Monte Carlo Software (hereafter referred to for simplicity as “Bayesian Software”) is presumed to perform one or more of the following steps:

  1. Document how to specify inputs including:
    • 1.1 Data
    • 1.2 Hyperparameters determining prior distributions
    • 1.3 Parameters determining the computational processes
  2. Accept and validate all of those forms of input
  3. Apply data transformation and pre-processing steps
  4. Apply one or more analytic algorithms, generally sampling algorithms used to generate estimates of posterior distributions
  5. Return the result of that algorithmic application
  6. Offer additional functionality such as printing or summarising return results

This document details standards for each of these steps, each prefixed with “BS”.

5.3.1 Documentation of Inputs

Prior to actual standards for documentation of inputs, we note one terminological standard for Bayesian software:

  • BS1.0 Bayesian software should use the term “hyperparameter” exclusively to refer to parameters determining the form of prior distributions, and should use either the generic term “parameter” or some conditional variant(s) such as “computation parameters” to refer to all other parameters.

Bayesian Software should provide the following documentation of how to specify inputs:

  • BS1.1 Description of how to enter data, both in textual form and via code examples. Both of these should consider the simplest cases of single objects representing independent and dependent data, and potentially more complicated cases of multiple independent data inputs.
  • BS1.2 Description of how to specify prior distributions, both in textual form describing the general principles of specifying prior distributions, along with more applied descriptions and examples, within:
    • BS1.2a The main package README, either as textual description or example code
    • BS1.2b At least one package vignette, both as general and applied textual descriptions, and example code
    • BS1.2c Function-level documentation, preferably with code included in examples
  • BS1.3 Description of all parameters which control the computational process (typically those determining aspects such as numbers and lengths of sampling processes, seeds used to start them, thinning parameters determining post-hoc sampling from simulated values, and convergence criteria). In particular:
    • BS1.3a Bayesian Software should document, both in text and examples, how to use the output of previous simulations as starting points of subsequent simulations.
    • BS1.3b Where applicable, Bayesian software should document, both in text and examples, how to use different sampling algorithms for a given model.
  • BS1.4 For Bayesian Software which implements or otherwise enables convergence checkers, documentation should explicitly describe and provide examples of use with and without convergence checkers.
  • BS1.5 For Bayesian Software which implements or otherwise enables multiple convergence checkers, differences between these should be explicitly tested.

5.3.2 Input Data Structures and Validation

This section contains standards primarily intended to ensure that input data, including model specifications, are validated prior to passing through to the main computational algorithms.

5.3.2.1 Input Data

Bayesian Software is commonly designed to accept generic one- or two-dimensional forms such as vector, matrix, or data.frame objects. The first standard concerns the range of possible generic forms for input data:

  • BS2.0 Bayesian Software which accepts one-dimensional input should ensure values are appropriately pre-processed regardless of class structures. The units package provides a good example, in creating objects that may be treated as vectors, yet which have a class structure that does not inherit from the vector class. Using these objects as input often causes software to fail. The storage.mode of the underlying objects may nevertheless be examined, and the objects transformed or processed accordingly to ensure such inputs do not lead to errors.
  • BS2.1 Bayesian Software which accepts two-dimensional input should implement pre-processing routines to ensure conversion of as many forms as possible to some standard format which is then passed to all analytic functions. In particular, tests should demonstrate that:
    • BS2.1a data.frame or equivalent objects which have columns which do not themselves have standard class attributes (typically, vector) are appropriately processed, and do not error without reason. This behaviour should be tested. Again, columns created by the units package provide a good test case.
    • BS2.1b data.frame or equivalent objects which have list columns should ensure that those columns are appropriately pre-processed either through being removed, converted to equivalent vector columns where appropriate, or some other appropriate treatment. This behaviour should be tested.
  • BS2.2 Bayesian Software should implement pre-processing routines to ensure all input data is dimensionally commensurate, for example by ensuring commensurate lengths of vectors or numbers of rows of tabular inputs.
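The following sketch illustrates one way to satisfy BS2.0 for one-dimensional input, by examining the storage.mode of the underlying data rather than relying on class inheritance; the function name is illustrative, and objects from the units package are used only as an example of vector-like input which does not inherit from the vector class.

preprocess_vector <- function (x) {
    # objects such as those created by the 'units' package are not vectors in
    # the sense of is.vector(), yet store integer or double data underneath
    if (!is.vector (x)) {
        if (storage.mode (x) %in% c ("integer", "double")) {
            x <- as.numeric (unclass (x))   # strip class and attributes, retain values
        } else {
            stop ("input must be a numeric vector or equivalent")
        }
    }
    x
}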

5.3.2.2 Prior Distributions, Model Specifications, and Hyperparameters

The second set of standards in this section concern specification of prior distributions, model structures, or other equivalent ways of specifying hypothesised relationships among input data structures. R already has a diverse range of Bayesian Software with distinct approaches to this task, commonly either through specifying a model as a character vector representing an R function, or an external file either as R code, or encoded according to some alternative system (such as for rstan).

As explicated above, the term “hyperparameters” is interpreted here to refer to parameters which define prior distributions, while a “model specification”, or simply “model”, is an encoded description of how those hyperparameters are hypothesised to transform to a posterior distribution.

Bayesian Software should:

  • BS2.3 Ensure that all appropriate validation and pre-processing of hyperparameters are implemented as distinct pre-processing steps prior to submitting to analytic routines, and especially prior to submitting to multiple parallel computational chains.
  • BS2.4 Ensure that lengths of hyperparameter vectors are checked, with no excess values silently discarded (unless such output is explicitly suppressed, as detailed below).
  • BS2.5 Ensure that lengths of hyperparameter vectors are commensurate with expected model input (see example immediately below)
  • BS2.6 Where possible, implement pre-processing checks to validate appropriateness of numeric values submitted for hyperparameters; for example, by ensuring that hyperparameters defining second-order moments such as distributional variance or shape parameters, or any parameters which are logarithmically transformed, are non-negative.

The following example demonstrates how standards like the above (BS2.5-2.6) might be addressed. Consider the following function which defines a log-likelihood estimator for a linear regression, controlled via a vector of three hyperparameters, p:

ll <- function (x, y, p) dnorm (y - (p[1] + x * p[2]), sd = p[3], log = TRUE)

Pre-processing stages should be used to determine:

  1. That the dimensions of the input data, x and y, are commensurate (BS2.2); non-commensurate inputs should error by default.
  2. The length of the vector p (BS2.4)

The latter task is not necessarily straightforward, because the definition of the function, ll(), will itself generally be part of the input to an actual Bayesian Software function. This functional input thus needs to be examined to determine expected lengths of hyperparameter vectors. The following code illustrates one way to achieve this, relying on utilities for parsing function calls in R, primarily through the getParseData function from the utils package. The parse data for a function can be extracted with the following line:

x <- getParseData (parse (text = deparse (ll)))

The object x is a data.frame of every R token (such as an expression, symbol, or operator) parsed from the function ll. The following section illustrates how this data can be used to determine the expected lengths of vector inputs to the function, ll().


Input arguments used to define parameter vectors in any R software are accessed through R’s standard vector access syntax of vec[i], for some element i of a vector vec. The parse data for such begins with the SYMBOL of vec, the [, a NUM_CONST for the value of i, and a closing ]. The following code can be used to extract elements of the parse data which match this pattern, and ultimately to extract the various values of i used to access members of vec.

vector_length <- function (x, i) {
    xn <- x [which (x$token %in% c ("SYMBOL", "NUM_CONST", "'['", "']'")), ]
    # split resultant data.frame at first "SYMBOL" entry
    xn <- split (xn, cumsum (xn$token == "SYMBOL"))
    # reduce to only those matching the above pattern
    xn <- xn [which (vapply (xn, function (j)
                             j$text [1] == i & nrow (j) > 3,
                             logical (1)))]
    ret <- NA_integer_ # default return value
    if (length (xn) > 0) {
        # get all values of NUM_CONST as integers
        n <- vapply (xn, function (j)
                         as.integer (j$text [j$token == "NUM_CONST"] [1]),
                         integer (1), USE.NAMES = FALSE)
        # and return max of these
        ret <- max (n)
    }
    return (ret)
}

That function can then be used to determine the length of any inputs which are used as hyperparameter vectors:

ll <- function (p, x, y) dnorm (y - (p[1] + x * p[2]), sd = p[3], log = TRUE)
p <- parse (text = deparse (ll))
x <- utils::getParseData (p)

# extract the names of the parameters:
params <- unique (x$text [x$token == "SYMBOL"])
lens <- vapply (params, function (i) vector_length (x, i), integer (1))
lens
#>  y  p  x 
#> NA  3 NA

This reveals that the vector p is used as a hyperparameter vector containing three parameters. Any initial value vectors can then be examined to ensure that they have this same length.



Not all Bayesian Software is designed to accept model inputs expressed as R code. The rstan package, for example, implements its own model specification language, and only allows hyperparameters to be named, and not addressed by index. While this largely avoids problems of mismatched lengths of parameter vectors, the software (at v2.21.1) does not ensure the existence of named parameters prior to starting the computational chains. This ultimately results in each chain generating an error when a model specification refers to a non-existent or undefined hyperparameter. Such controls should be part of a single pre-processing stage, and so should only generate a single error.

5.3.2.3 Computational Parameters

Computational parameters are considered here as those passed to Bayesian functions other than hyperparameters determining the forms of prior distributions. They typically include parameters controlling lengths of runs, lengths of burn-in periods, numbers of parallel computations, other parameters controlling how samples are to be generated, or convergence criteria. All Computational Parameters should be checked for general “sanity” prior to calling primary computational algorithms. The standards for such sanity checks include that Bayesian Software should:

  • BS2.7 Check that values for parameters are positive (except where negative values may be accepted)
  • BS2.8 Check lengths and/or dimensions of inputs, and either automatically reject or provide appropriate diagnostic messaging for parameters of inappropriate length or dimension; for example passing a vector of length > 1 to a parameter presumed to define a single value (unless such output is explicitly suppressed, as detailed below)
  • BS2.9 Check that arguments are of expected classes or types (for example, check that integer-type arguments are indeed integer, with explicit conversion via as.integer where not)
  • BS2.10 Automatically reject parameters of inappropriate type (for example character values passed for integer-type parameters that are unable to be appropriately converted).

The following two sub-sections consider particular cases of computational parameters.

5.3.2.4 Seed Parameters

Bayesian software should:

  • BS2.11 Enable seeds to be passed as a parameter (through a direct seed argument or similar), or as a vector of parameters, one for each chain.
  • BS2.12 Enable results of previous runs to be used as starting points for subsequent runs

Bayesian Software which implements parallel processing should:

  • BS2.13 Ensure each chain is started with a different seed by default
  • BS2.14 Issue diagnostic messages when identical seeds are passed to distinct computational chains
  • BS2.15 Explicitly document advice not to use set.seed()
  • BS2.16 Provide the parameter with a plural name: for example, “starting_values” and not “starting_value”

To avoid potential confusion between separate parameters to control random seeds and starting values, we recommend a single “starting values” argument rather than a “seeds” argument, with appropriate translation of these parameters into seeds where necessary.
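A minimal sketch of how distinct per-chain seeds might be handled (BS2.13–BS2.14); all names are illustrative only.

check_seeds <- function (seeds = NULL, n_chains = 4L) {
    if (is.null (seeds)) {
        # BS2.13: default to a different seed for each chain
        seeds <- sample.int (.Machine$integer.max, n_chains)
    } else if (anyDuplicated (seeds) > 0L) {
        # BS2.14: diagnose identical seeds passed to distinct chains
        message ("identical seeds passed to distinct chains; ",
                 "those chains will produce identical samples")
    }
    seeds
}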

5.3.2.5 Output Verbosity

All Bayesian Software should implement computational parameters to control output verbosity. Bayesian computations are often time-consuming, and often performed as batch computations. The following standards should be adhered to in regard to output verbosity:

  • BS2.17 Bayesian Software should implement at least one parameter controlling the verbosity of output, defaulting to verbose output of all appropriate messages, warnings, errors, and progress indicators.
  • BS2.18 Bayesian Software should enable suppression of messages and progress indicators, while retaining verbosity of warnings and errors. This should be tested.
  • BS2.19 Bayesian Software should enable suppression of warnings where appropriate. This should be tested.
  • BS2.20 Bayesian Software should explicitly enable errors to be caught, and appropriately processed either through conversion to warnings, or otherwise captured in return values. This should be tested.
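A sketch of a single verbosity parameter in the spirit of BS2.17–BS2.18; the function is illustrative only, with rnorm() standing in for an actual sampling algorithm.

run_sampler <- function (n_samples, verbose = TRUE) {
    if (verbose) {
        message ("starting sampling ...")        # suppressed when verbose = FALSE
    }
    if (n_samples < 10) {
        warning ("very few samples requested")   # issued regardless of 'verbose' (BS2.18)
    }
    rnorm (n_samples)   # placeholder for an actual sampling algorithm
}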

5.3.3 Pre-processing and Data Transformation

5.3.3.1 Missing Values

Bayesian Software should:

  • BS3.0 Explicitly document assumptions made in regard to missing values; for example that data is assumed to contain no missing (NA, Inf) values, and that such values, or entire rows including any such values, will be automatically removed from input data.
  • BS3.1 Implement appropriate routines to pre-process missing values prior to passing data through to main computational algorithms.

5.3.3.2 Perfect Collinearity

Where appropriate, Bayesian Software should:

  • BS3.2 Implement pre-processing routines to diagnose perfect collinearity, and provide appropriate diagnostic messages or warnings
  • BS3.3 Provide distinct routines for processing perfectly collinear data, potentially bypassing sampling algorithms

An appropriate test for BS3.3 would confirm that system.time() or equivalent timing expressions yield shorter times for perfectly collinear data than for equivalent non-collinear data. Alternatively, a test could ensure that perfectly collinear data passed to a function with a stopping criterion generate no results, while specifying a fixed number of iterations may generate results.
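An illustrative testthat sketch of the timing comparison described above, assuming a hypothetical fit_model() function which bypasses sampling for perfectly collinear data:

library (testthat)

test_that ("collinear data are processed faster than non-collinear data", {
    set.seed (1)
    x <- rnorm (1e3)
    y_collinear <- 2 * x                # perfectly collinear with x
    y_noisy <- 2 * x + rnorm (1e3)

    t_collinear <- system.time (fit_model (x, y_collinear)) [["elapsed"]]
    t_noisy <- system.time (fit_model (x, y_noisy)) [["elapsed"]]
    expect_lt (t_collinear, t_noisy)
})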

5.3.4 Analytic Algorithms

As mentioned, analytic algorithms for Bayesian Software are commonly algorithms to simulate posterior distributions, and to draw samples from those simulations. Numerous extant R packages implement and offer sampling algorithms, and not all Bayesian Software will internally implement sampling algorithms. The following standards apply to packages which do implement internal sampling algorithms:

  • BS4.0 Packages should document sampling algorithms (generally via literary citation, or reference to other software)
  • BS4.1 Packages should provide explicit comparisons with external samplers which demonstrate intended advantage of implementation (generally via tests, vignettes, or both).

Regardless of whether or not Bayesian Software implements internal sampling algorithms, it should:

  • BS4.2 Implement at least one means to validate posterior estimates (for example through the functionality of the BayesValidate package, noting that that package has not been updated for almost 15 years, and such approaches may need adapting; or the Simulation Based Calibration approach implemented in the rstan function sbc).

Where possible or applicable, Bayesian Software should:

  • BS4.3 Implement at least one type of convergence checker, and provide a documented reference for that implementation.
  • BS4.4 Enable computations to be stopped on convergence (although not necessarily by default).
  • BS4.5 Ensure that appropriate mechanisms are provided for models which do not converge. This is often achieved by having default behaviour to stop after specified numbers of iterations regardless of convergence.
  • BS4.6 Implement tests to confirm that results with a convergence checker are statistically equivalent to results from an equivalent fixed number of samples run without convergence checking.
  • BS4.7 Where convergence checkers are themselves parametrised, the effects of such parameters should also be tested. For threshold parameters, for example, lower values should result in longer sequence lengths.

5.3.5 Return Values

Unlike software in many other categories, Bayesian Software should generally return several kinds of distinct data: both the raw data derived from statistical algorithms, and associated metadata. Such distinct and generally disparate forms of data will generally be best combined into a single object through implementing a defined class structure, although other options are possible, including (re-)using extant class structures (see the CRAN Task View on Bayesian Inference, https://cran.r-project.org/web/views/Bayesian.html, for reference to other packages and class systems). Regardless of the precise form of return object, and whether or not defined class structures are used or implemented, the objects returned from Bayesian Software should include:

  • BS5.0 Seed(s) or starting value(s), including values for each sequence where multiple sequences are included
  • BS5.1 Appropriate metadata on types (or classes) and dimensions of input data

With regard to the input function, or alternative means of specifying prior distributions:

  • BS5.2 Bayesian Software should either:
    • BS5.2a Return the input function or prior distributional specification in the return object; or
    • BS5.2b Enable direct access to such via additional functions which accept the return object as single argument.

Where convergence checkers are implemented or provided, Bayesian Software should:

  • BS5.3 Return convergence statistics or equivalent
  • BS5.4 Where multiple checkers are enabled, return details of convergence checker used
  • BS5.5 Return, or enable immediate access to, appropriate diagnostic statistics to indicate absence of convergence

5.3.6 Additional Functionality

Bayesian Software should:

  • BS6.0 Implement a default print method for return objects
  • BS6.1 Implement a default plot method for return objects
  • BS6.2 Provide and document straightforward abilities to plot sequences of posterior samples, with burn-in periods clearly distinguished
  • BS6.3 Provide and document straightforward abilities to plot posterior distributional estimates

Bayesian Software may:

  • BS6.4 Provide summary methods for return objects
  • BS6.5 Provide abilities to plot both sequences of posterior samples and distributional estimates together in single graphic

5.3.7 Tests

5.3.7.1 Parameter Recovery Tests

Bayesian software should implement the following tests which demonstrate and confirm an ability to recover parameters:

  • BS7.0 Recovery of parametric estimates of a prior distribution
  • BS7.1 Recovery of a prior distribution in the absence of any additional data or information
  • BS7.2 Recovery of an expected posterior distribution given a specified prior and some input data

5.3.7.2 Algorithmic Scaling Tests

  • BS7.3 Bayesian software should include tests which demonstrate and confirm the scaling of algorithmic efficiency with sizes of input data; for example, that computation times increase approximately logarithmically with increasing sizes of input data.

5.3.7.3 Scaling of Input to Output Data

  • BS7.4 Bayesian software should implement tests which confirm that predicted or fitted values are on (approximately) the same scale as input values.
    • BS7.4a The implications of any assumptions on scales on input objects should be explicitly tested in this context; for example that the scales of inputs which do not have means of zero will not be able to be recovered.

5.4 Regression and Supervised Learning

Click on the following link to view a demonstration Application of Regression and Supervised Learning Standards.

This document details standards for Regression and Supervised Learning Software – referred to from here on for simplicity as “Regression Software”. Regression Software implements algorithms which aim to construct or analyse one or more mappings between two defined data sets (for example, a set of “independent” data, \(X\), and a set of “dependent” data, \(Y\)). In contrast, the analogous category of Unsupervised Learning Software aims to construct or analyse one or more mappings between a defined set of input or independent data, and a second set of “output” data which are not necessarily known or given prior to the analysis.

Common purposes of Regression Software are to fit models in order to estimate relationships, or to make predictions, between specified inputs and outputs. Regression Software includes tools with inferential or predictive foci, Bayesian, frequentist, or probability-free Machine Learning (ML) approaches, parametric or non-parametric approaches, discrete outputs (such as in classification tasks) or continuous outputs, and models and algorithms specific to applications or data such as time series or spatial data. In many cases other standards specific to these subcategories may apply.

The following standards are divided among several sub-categories, with each standard prefixed with “RE”.

5.4.1 Input data structures and validation

  • RE1.0 Regression Software should enable models to be specified via a formula interface, unless reasons for not doing so are explicitly documented.
  • RE1.1 Regression Software should document how formula interfaces are converted to matrix representations of input data. See Max Kuhn’s RStudio blog post for examples.
  • RE1.2 Regression Software should document expected format (types or classes) for inputting predictor variables, including descriptions of types or classes which are not accepted; for example, specification that software accepts only numeric inputs in vector or matrix form, or that all inputs must be in data.frame form with both column and row names.
  • RE1.3 Regression Software should transfer all relevant aspects of input data, notably including row and column names, and potentially information from other attributes(), to corresponding aspects of return objects (see RE4, below).
    • RE1.3a Where otherwise relevant information is not transferred, this should be explicitly documented.
  • RE1.4 Regression Software should document any assumptions made with regard to input data; for example distributional assumptions, or assumptions that predictor data have mean values of zero. Implications of violations of these assumptions should be both documented and tested.

5.4.2 Pre-processing and Variable Transformation

  • RE2.0 Regression Software should document any transformations applied to input data, for example conversion of label-values to factor, and should provide ways to explicitly avoid any default transformations (with error or warning conditions where appropriate).
  • RE2.1 Regression Software should implement explicit parameters controlling the processing of missing values, ideally distinguishing NA or NaN values from Inf values (for example, through use of na.omit() and related functions from the stats package).
  • RE2.2 Regression Software should provide different options for processing missing values in predictor and response data. For example, it should be possible to fit a model with no missing predictor data in order to generate values for all associated response points, even where submitted response values may be missing.
  • RE2.3 Where applicable, Regression Software should enable data to be centred (for example, through converting to zero-mean equivalent values; or to z-scores) or offset (for example, to zero-intercept equivalent values) via additional parameters, with the effects of any such parameters clearly documented and tested.
  • RE2.4 Regression Software should implement pre-processing routines to identify whether aspects of input data are perfectly collinear, notably including:
    • RE2.4a Perfect collinearity among predictor variables
    • RE2.4b Perfect collinearity between independent and dependent variables

These pre-processing routines should also be tested as described below.

5.4.3 Algorithms

The following standards apply to Regression Software which implements or relies on iterative algorithms which are expected to converge in order to generate model statistics. Such software should:

  • RE3.0 Issue appropriate warnings or other diagnostic messages for models which fail to converge.
  • RE3.1 Enable such messages to be optionally suppressed, yet should ensure that the resultant model object nevertheless includes sufficient data to identify lack of convergence.
  • RE3.2 Ensure that convergence thresholds have sensible default values, demonstrated through explicit documentation.
  • RE3.3 Allow explicit setting of convergence thresholds, unless reasons against doing so are explicitly documented.

5.4.4 Return Results

  • RE4.0 Regression Software should return some form of “model” object, generally through using or modifying existing class structures for model objects (such as lm, glm, or model objects from other packages), or creating a new class of model objects.
  • RE4.1 Regression Software may enable an ability to generate a model object without actually fitting values. This may be useful for controlling batch processing of computationally intensive fitting algorithms.

5.4.4.1 Accessor Methods

Regression Software should provide functions to access or extract as much of the following kinds of model data as possible or practicable. Access should ideally rely on class-specific methods which extend, or implement otherwise equivalent versions of, the methods from the stats package which are named in parentheses in each of the following standards.

Model objects should include, or otherwise enable effectively immediate access to, the following descriptors. It is acknowledged that not all regression models can sensibly provide access to all of these descriptors, yet model objects should provide access to all those that are applicable.

  • RE4.2 Model coefficients (via coef() / coefficients())
  • RE4.3 Confidence intervals on those coefficients (via confint())
  • RE4.4 The specification of the model, generally as a formula (via formula())
  • RE4.5 Numbers of observations submitted to model (via nobs())
  • RE4.6 The variance-covariance matrix of the model parameters (via vcov())
  • RE4.7 Where appropriate, convergence statistics
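A minimal sketch of how such accessor methods might be implemented for a hypothetical model class, “myreg”; all names are illustrative only.

# hypothetical constructor returning an object of class 'myreg'
new_myreg <- function (coefficients, vcov, formula, nobs) {
    structure (list (coefficients = coefficients,
                     vcov = vcov,
                     formula = formula,
                     nobs = nobs),
               class = "myreg")
}

coef.myreg <- function (object, ...) object$coefficients   # RE4.2
nobs.myreg <- function (object, ...) object$nobs            # RE4.5
vcov.myreg <- function (object, ...) object$vcov            # RE4.6
formula.myreg <- function (x, ...) x$formula                # RE4.4

Because stats::confint.default() derives intervals from coef() and vcov(), defining those two methods may often suffice to satisfy RE4.3 as well, although class-specific confint() methods remain preferable where interval estimation is non-standard.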

Regression Software should provide simple and direct methods to return or otherwise access the following forms of data and metadata, where the latter includes information on any transformations which may have been applied to the data prior to submission to modelling routines.

  • RE4.8 Response variables, and associated “metadata” where applicable.
  • RE4.9 Modelled values of response variables.
  • RE4.10 Model Residuals, including sufficient documentation to enable interpretation of residuals, and to enable users to submit residuals to their own tests.
  • RE4.11 Goodness-of-fit and other statistics associated with model coefficients, such as effect sizes.
  • RE4.12 Where appropriate, functions used to transform input data, and associated inverse transform functions.

Regression software may provide simple and direct methods to return or otherwise access the following:

  • RE4.13 Predictor variables, and associated “metadata” where applicable.

5.4.4.2 Prediction, Extrapolation, and Forecasting

Not all regression software is intended to, or can, provide distinct abilities to extrapolate or forecast. Moreover, identifying cases in which a regression model is used to extrapolate or forecast may often be a non-trivial exercise. It may nevertheless be possible, for example when input data used to construct a model are unidimensional, and data on which a prediction is to be based extend beyond the range used to construct the model. Where reasonably unambiguous identification of extrapolation or forecasting using a model is possible, the following standards apply:

  • RE4.14 Where possible, values should also be provided for extrapolation or forecast errors.
  • RE4.15 Sufficient documentation and/or testing should be provided to demonstrate that forecast errors, confidence intervals, or equivalent values increase with forecast horizons.

Distinct from extrapolation or forecasting abilities, the following standard applies to regression software which relies on, or otherwise provides abilities to process, categorical grouping variables:

  • RE4.16 Regression Software which models distinct responses for different categorical groups should include the ability to submit new groups to predict() methods.

5.4.4.3 Reporting Return Results

  • RE4.17 Model objects returned by Regression Software should implement or appropriately extend a default print method which provides an on-screen summary of model (input) parameters and (output) coefficients.
  • RE4.18 Regression Software may also implement summary methods for model objects, and in particular should implement distinct summary methods for any cases in which calculation of summary statistics is computationally non-trivial (for example, for bootstrapped estimates of confidence intervals).

5.4.5 Documentation

Beyond the general standards for documentation, Regression Software should explicitly describe the following aspects, and ideally provide extended documentation including summary graphical reports of:

  • RE5.0 Scaling relationships between sizes of input data (numbers of observations, with potential extension to numbers of variables/columns) and speed of algorithm.

5.4.6 Visualization

  • RE6.0 Model objects returned by Regression Software (see RE4.0) should have default plot methods, either through explicit implementation, extension of methods for existing model objects, or through ensuring default methods work appropriately.
  • RE6.1 Where the default plot method is NOT a generic plot method dispatched on the class of return objects (that is, through a plot.<myclass> function), that method dispatch should nevertheless exist in order to explicitly direct users to the appropriate function.
  • RE6.2 The default plot method should produce a plot of the fitted values of the model, with optional visualisation of confidence intervals or equivalent.

The following standard applies only to software fulfilling RE4.14-4.15, and the conditions described prior to those standards.

  • RE6.3 Where a model object is used to generate a forecast (for example, through a predict() method), the default plot method should provide clear visual distinction between modelled (interpolated) and forecast (extrapolated) values.

5.4.7 Testing

5.4.7.1 Input Data

Tests for Regression Software should include the following conditions and cases:

  • RE7.0 Tests with noiseless, exact relationships between predictor (independent) data.
    • RE7.0a In particular, these tests should confirm ability to reject perfectly noiseless input data.
  • RE7.1 Tests with noiseless, exact relationships between predictor (independent) and response (dependent) data.
    • RE7.1a In particular, these tests should confirm that model fitting is at least as fast or (preferably) faster than testing with equivalent noisy data (see RE2.4b).

5.4.7.2 Diagnostic Messages

  • RE7.2 All error and warning messages should be explicitly triggered in tests, including explicit testing for the content of those diagnostic messages.

5.4.7.3 Return Results

Tests for Regression Software should:

  • RE7.3 Demonstrate that output objects retain aspects of input data such as row or case names (see RE1.3).
  • RE7.4 Demonstrate and test expected behaviour when objects returned from regression software are submitted to the accessor methods of RE4.2–RE4.7.
  • RE7.5 Extending directly from RE4.15, where appropriate, tests should demonstrate and confirm that forecast errors, confidence intervals, or equivalent values increase with forecast horizons.

5.5 Dimensionality Reduction, Clustering, and Unsupervised Learning

Click on the following link to view a demonstration Application of Dimensionality Reduction, Clustering, and Unsupervised Learning Standards.

This document details standards for Dimensionality Reduction, Clustering, and Unsupervised Learning Software – referred to from here on for simplicity as “Unsupervised Learning Software”. Software in this category is distinguished from Regression Software in that the latter aims to construct or analyse one or more mappings between two defined data sets (for example, a set of “independent” data, \(X\), and a set of “dependent” data, \(Y\)), whereas Unsupervised Learning Software aims to construct or analyse one or more mappings between a defined set of input or independent data and a second set of “output” data which are not necessarily known or given prior to the analysis. A key distinction within Unsupervised Learning Software and Algorithms is between software for which output data represent (generally numerical) transformations of the input data set, and software for which output data are discrete labels applied to the input data. Examples of the former type include dimensionality reduction and ordination software and algorithms; examples of the latter include clustering and discrete partitioning software and algorithms.

5.5.1 Input Data Structures and Validation

  • UL1.0 Unsupervised Learning Software should explicitly document expected format (types or classes) for input data, including descriptions of types or classes which are not accepted; for example, specification that software accepts only numeric inputs in vector or matrix form, or that all inputs must be in data.frame form with both column and row names.
  • UL1.1 Unsupervised Learning Software should provide distinct sub-routines to assert that all input data is of the expected form, and issue informative error messages when incompatible data are submitted.

The following code demonstrates an example of a routine from the base stats package which fails to meet this standard.

d <- dist (USArrests) # example from help file for 'hclust' function
hc <- hclust (d) # okay
hc <- hclust (as.matrix (d))
## Error in if (is.na(n) || n > 65536L) stop("size cannot be NA nor exceed 65536"): missing value where TRUE/FALSE needed

The latter call fails, and the uninformative error message indicates a failure to provide sufficient checks on the class of input data.

  • UL1.2 Unsupervised learning which uses row or column names to label output objects should assert that input data have non-default row or column names, and issue an informative message when these are not provided. (Such messages need not necessarily be provided by default, but should at least be optionally available.)

The following code provides simple examples of checks of whether row and column names appear to have generic default values.

x <- data.frame (matrix (1:10, ncol = 2))
x
##   X1 X2
## 1  1  6
## 2  2  7
## 3  3  8
## 4  4  9
## 5  5 10

Generic row names are almost always simple integer sequences, which the following condition confirms.

identical (rownames (x), as.character (seq (nrow (x))))
## [1] TRUE

Generic column names may come in a variety of formats. The following code uses a grep expression to match an alphabetic character followed by an optional zero and the column number, appropriate for matching column names produced by default construction of data.frame objects (such as “X1”, “X2”, and so on).

all (vapply (seq (ncol (x)), function (i)
             grepl (paste0 ("[[:alpha:]]0?", i), colnames (x) [i]), logical (1)))
## [1] TRUE

Messages should be issued in both of these cases. The following code illustrates that the hclust function does not implement any such checks or assertions; rather, it silently returns an object with default labels.

u <- USArrests
rownames (u) <- seq (nrow (u))
hc <- hclust (dist (u))
head (hc$labels)
## [1] "1" "2" "3" "4" "5" "6"

  • UL1.3 Unsupervised Learning Software should transfer all relevant aspects of input data, notably including row and column names, and potentially information from other attributes(), to corresponding aspects of return objects.
    • UL1.3a Where otherwise relevant information is not transferred, this should be explicitly documented.

An example of a function which accords with UL1.3 is stats::cutree():

hc <- hclust (dist (USArrests))
head (cutree (hc, 10))
##    Alabama     Alaska    Arizona   Arkansas California   Colorado 
##          1          2          3          4          5          4

The row names of USArrests are transferred to the output object. In contrast, some routines from the cluster package do not comply with this standard:

library (cluster)
ac <- agnes (USArrests) # agglomerative nesting
head (cutree (ac, 10))
## [1] 1 2 3 4 3 4

The case labels are transferred to the object returned by agnes(), but not in a way that enables cutree() to inherit them, and so they are not appropriately carried through to the final clustering output.

  • UL1.4 Unsupervised Learning Software should explicitly document whether input data may include missing values.
  • UL1.5 Functions in Unsupervised Learning Software which do not admit input data with missing values should provide informative error messages when data with missing values are submitted.
  • UL1.6 Unsupervised Learning Software should document any assumptions made with regard to input data; for example assumptions about distributional forms or locations (such as that data are centred or on approximately equivalent distributional scales). Implications of violations of these assumptions should be both documented and tested, in particular:
    • UL1.6a Software which responds qualitatively differently to input data which has components on markedly different scales should explicitly document such differences, and implications of submitting such data.
    • UL1.6b Examples or other documentation should not use scale() or equivalent transformations without explaining why scale is applied, and explicitly illustrating and contrasting the consequences of not applying such transformations.

5.5.2 Pre-processing and Variable Transformation

  • UL2.0 Routines likely to give unreliable or irreproducible results in response to violations of assumptions regarding input data (see UL1.6) should implement pre-processing steps to diagnose potential violations, and issue appropriately informative messages, and/or include parameters to enable suitable transformations to be applied (such as the center and scale. parameters of the stats::prcomp() function).
  • UL2.1 Unsupervised Learning Software should document any transformations applied to input data, for example conversion of label-values to factor, and should provide ways to explicitly avoid any default transformations (with error or warning conditions where appropriate).
  • UL2.2 For Unsupervised Learning Software which accepts missing values in input data, functions should implement explicit parameters controlling the processing of missing values, ideally distinguishing NA or NaN values from Inf values (for example, through use of na.omit() and related functions from the stats package).
  • UL2.3 Unsupervised Learning Software should implement pre-processing routines to identify whether aspects of input data are perfectly collinear.
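
As one illustration of the kind of pre-processing check envisaged by UL2.3, the following sketch compares the rank of a QR decomposition with the number of columns; it uses base R only, and the variable names are purely illustrative.

x <- data.frame (a = 1:10, b = runif (10))
x$c <- 2 * x$a # 'c' is perfectly collinear with 'a'
qr (as.matrix (x))$rank < ncol (x) # TRUE indicates perfect collinearity
## [1] TRUE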

5.5.3 Algorithms

5.5.3.1 Labelling

  • UL3.1 Algorithms which apply sequential labels to input data (such as clustering or partitioning algorithms) should ensure that the sequence follows decreasing group sizes (so labels of “1”, “a”, or “A” describe the largest group, “2”, “b”, or “B” the second largest, and so on.)

Note that the stats::cutree() function does not accord with this standard:

hc <- hclust (dist (USArrests))
table (cutree (hc, k = 10))
## 
##  1  2  3  4  5  6  7  8  9 10 
##  3  3  3  6  5 10  2  5  5  8

The cutree() function applies arbitrary integer labels to the groups, yet the order of labels is not related to the order of group sizes.
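
The following sketch illustrates one of many possible ways in which such labels could be re-assigned to accord with UL3.1; this re-labelling is not part of the stats package.

cl <- cutree (hclust (dist (USArrests)), k = 10)
sizes <- table (cl) # group sizes under the original, arbitrary labels
relabel <- order (sizes, decreasing = TRUE) # original labels in decreasing size order
cl_new <- match (cl, relabel) # new label "1" now denotes the largest group
table (cl_new) # group sizes now decrease with increasing label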

  • UL3.2 Dimensionality reduction or equivalent algorithms which label dimensions should ensure that that sequences of labels follows decreasing “importance” (for example, eigenvalues or variance contributions).

The stats::prcomp function accords with this standard:

z <- prcomp (eurodist, rank = 5) # return maximum of 5 components
summary (z)
## Importance of first k=5 (out of 21) components:
##                              PC1       PC2       PC3       PC4       PC5
## Standard deviation     2529.6298 2157.3434 1459.4839 551.68183 369.10901
## Proportion of Variance    0.4591    0.3339    0.1528   0.02184   0.00977
## Cumulative Proportion     0.4591    0.7930    0.9458   0.96764   0.97741

The proportion of variance explained by each component decreases with increasing numeric labelling of the components.

  • UL3.3 Unsupervised Learning Software for which input data does not generally include labels (such as array-like data with no row names) should provide an additional parameter to enable cases to be labelled.

5.5.3.2 Prediction

  • UL3.4 Where applicable, Unsupervised Learning Software should implement routines to predict the properties (such as numerical ordinates, or cluster memberships) of additional new data without re-running the entire algorithm.

While many algorithms such as hierarchical clustering can not (readily) be used to predict memberships of new data, other algorithms can nevertheless be applied to perform this task. The following demonstrates how the output of stats::hclust can be used to predict membership of new data using the class::knn() function. (This is intended to illustrate only one of many possible approaches.)

library (class)
set.seed (1)
hc <- hclust (dist (iris [, -5]))
groups <- cutree (hc, k = 3)
# function to randomly select part of a data.frame and add some randomness
sample_df <- function (x, n = 5) {
    x [sample (nrow (x), size = n), ] + runif (ncol (x) * n)
}
iris_new <- sample_df (iris [, -5], n = 5)
# use knn to predict membership of those new points:
knnClust <- knn (train = iris [, -5], test = iris_new , k = 1, cl = groups)
knnClust
## [1] 2 2 1 1 2
## Levels: 1 2 3

The stats::prcomp() function implements its own predict() method which conforms to this standard:

res <- prcomp (USArrests)
arrests_new <- sample_df (USArrests, n = 5)
predict (res, newdata = arrests_new)
##                      PC1        PC2        PC3       PC4
## North Carolina 165.17494 -30.693263 -11.682811  1.304563
## Maryland       129.44401  -4.132644  -2.161693  1.258237
## Ohio           -49.51994  12.748248   2.104966 -2.777463
## Colorado        35.78896  14.023774  12.869816  1.233391
## Georgia         41.28054  -7.203986   3.987152 -7.818416

5.5.3.3 Group Distributions and Associated Statistics

Many unsupervised learning algorithms serve to label, categorise, or partition data. Software which performs any of these tasks will commonly output some kind of labelling or grouping scheme. The above example of principal components illustrates that the return object records the standard deviations associated with each component:

res <- prcomp (USArrests)
print(res)
## Standard deviations (1, .., p=4):
## [1] 83.732400 14.212402  6.489426  2.482790
## 
## Rotation (n x k) = (4 x 4):
##                 PC1         PC2         PC3         PC4
## Murder   0.04170432 -0.04482166  0.07989066 -0.99492173
## Assault  0.99522128 -0.05876003 -0.06756974  0.03893830
## UrbanPop 0.04633575  0.97685748 -0.20054629 -0.05816914
## Rape     0.07515550  0.20071807  0.97408059  0.07232502
summary (res)
## Importance of components:
##                            PC1      PC2    PC3     PC4
## Standard deviation     83.7324 14.21240 6.4894 2.48279
## Proportion of Variance  0.9655  0.02782 0.0058 0.00085
## Cumulative Proportion   0.9655  0.99335 0.9991 1.00000

Such output accords with the following standard:

  • UL3.5 Objects returned from Unsupervised Learning Software which labels, categorises, or partitions data into discrete groups should include, or provide immediate access to, quantitative information on intra-group variances or equivalent, as well as on inter-group relationships where applicable.

The above example of principal components is one where there are no inter-group relationships, and so that standard is fulfilled by providing information on intra-group variances alone. Discrete clustering algorithms, in contrast, yield results for which inter-group relationships are meaningful, and such relationships can generally be meaningfully provided. The hclust() routine, like many clustering routines, simply returns a scheme for devising an arbitrary number of clusters, and so can not meaningfully provide variances of, or relationships between, such clusters. The cutree() function, however, does yield a defined number of clusters, yet its output is devoid of any quantitative information on variances or equivalent.

res <- hclust (dist (USArrests))
str (cutree (res, k = 5))
##  Named int [1:50] 1 1 1 2 1 2 3 1 4 2 ...
##  - attr(*, "names")= chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...

Compare that with the output of a largely equivalent routine, the clara() function from the cluster package.

library (cluster)
cl <- clara (USArrests, k = 10) # direct clustering into specified number of clusters
cl$clusinfo
##       size  max_diss   av_diss isolation
##  [1,]    4 24.708298 14.284874 1.4837745
##  [2,]    6 28.857755 16.759943 1.7329563
##  [3,]    6 44.640565 23.718040 0.9677229
##  [4,]    6 28.005892 17.382196 0.8442061
##  [5,]    6 15.901258  9.363471 1.1037219
##  [6,]    7 29.407822 14.817031 0.9080598
##  [7,]    4 11.764353  6.781659 0.8165753
##  [8,]    3  8.766984  5.768183 0.3547323
##  [9,]    3 18.848077 10.101505 0.7176276
## [10,]    5 16.477257  8.468541 0.6273603

That object contains information on dissimilarities between each observation and cluster medoids, which in the context of UL3.5 is “information on intra-group variances or equivalent”. Inter-group information is also available via the “silhouette” of the clustering scheme.

5.5.4 Return Results

  • UL4.0 Unsupervised Learning Software should return some form of “model” object, generally through using or modifying existing class structures for model objects, or creating a new class of model objects.
  • UL4.1 Unsupervised Learning Software may enable an ability to generate a model object without actually fitting values. This may be useful for controlling batch processing of computationally intensive fitting algorithms.
  • UL4.2 The return object from Unsupervised Learning Software should include, or otherwise enable immediate extraction of, all parameters used to control the algorithm used.

5.5.4.1 Reporting Return Results

  • UL4.3 Model objects returned by Unsupervised Learning Software should implement or appropriately extend a default print method which provides an on-screen summary of model (input) parameters and methods used to generate results. The print method may also summarise statistical aspects of the output data or results.
    • UL4.3a The default print method should always ensure only a restricted number of rows of any result matrices or equivalent are printed to the screen.

The prcomp objects returned from the function of the same name include potentially large matrices of component coordinates which are by default printed in their entirety to the screen. This is because the default print behaviour for most tabular objects in R (matrix, data.frame, and objects from the Matrix package, for example) is to print objects in their entirety (limited only by such options as getOption("max.print"), which determines maximal numbers of printed objects, such as lines of data.frame objects). Such default behaviour ought to be avoided, particularly in Unsupervised Learning Software which commonly returns objects containing large numbers of numeric entries.

  • UL4.4 Unsupervised Learning Software should also implement summary methods for model objects which should summarise the primary statistics used in generating the model (such as numbers of observations, parameters of methods applied). The summary method may also provide summary statistics from the resultant model.

5.5.5 Documentation

(There are no category-specific standards for documentation of Unsupervised Learning Software beyond the general standards for documentation.)

5.5.6 Visualization

  • UL6.0 Objects returned by Unsupervised Learning Software should have default plot methods, either through explicit implementation, extension of methods for existing model objects, through ensuring default methods work appropriately, or through explicit reference to helper packages such as factoextra and associated functions.
  • UL6.1 Where the default plot method is NOT a generic plot method dispatched on the class of return objects (that is, through a plot.<myclass> function), that method dispatch should nevertheless exist in order to explicitly direct users to the appropriate function.
  • UL6.2 Where default plot methods include labelling components of return objects (such as cluster labels), routines should ensure that labels are automatically placed to ensure readability, and/or that appropriate diagnostic messages are issued where readability is likely to be compromised (for example, through attempting to place too many labels).

5.5.7 Testing

Unsupervised Learning Software should test the following properties and behaviours:

  • UL7.0 Inappropriate types of input data are rejected with expected error messages.

5.5.7.1 Input Scaling

The following tests should be implemented for Unsupervised Learning Software for which inputs are presumed or required to be scaled in any particular ways (such as having mean values of zero).

  • UL7.1 Tests should demonstrate that violations of assumed input properties yield unreliable or invalid outputs, and should clarify how such unreliability or invalidity is manifest through the properties of returned objects.

5.5.7.2 Output Labelling

With regard to labelling of output data, tests for Unsupervised Learning Software should:

  • UL7.2 Demonstrate that labels placed on output data follow decreasing group sizes (UL3.1)
  • UL7.3 Demonstrate that labels on input data are propagated to, or may be recovered from, output data (see UL3.3).
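
The following sketch, using the testthat package and a hypothetical clustering function my_cluster() presumed to comply with UL3.1, illustrates one way in which UL7.2 might be tested.

library (testthat)
test_that ("labels follow decreasing group sizes", {
    cl <- my_cluster (USArrests, k = 10) # hypothetical compliant routine
    sizes <- as.integer (table (cl))
    expect_true (all (diff (sizes) <= 0)) # group sizes never increase with label
})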

5.5.7.3 Prediction

With regard to prediction, tests for Unsupervised Learning Software should:

  • UL7.4 Demonstrate that submission of new data to a previously fitted model can generate results more efficiently than initial model fitting.

5.5.7.4 Batch Processing

For Unsupervised Learning Software which implements batch processing routines:

  • UL7.5 Batch processing routines should be explicitly tested, commonly via extended tests (see G4.10–G4.12).
    • UL7.5a Tests of batch processing routines should demonstrate that equivalent results are obtained from direct (non-batch) processing.

5.6 Exploratory Data Analysis

Click on the following link to view a demonstration Application of Exploratory Data Analysis Standards.

Exploration is a part of all data analyses, and Exploratory Data Analysis (EDA) is not something that is entered into and exited from at some point prior to “real” analysis. Exploratory Analyses are also not strictly limited to Data, but may extend to exploration of Models of those data. The category could thus equally be termed, “Exploratory Data and Model Analysis”, yet we opt to utilise the standard acronym of EDA in this document.

EDA is nevertheless somewhat different to many other categories included within rOpenSci’s program for peer-reviewing statistical software. Primary differences include:

  • EDA software often has a strong focus upon visualization, which is a category which we have otherwise explicitly excluded from the scope of the project at the present stage.
  • The assessment of EDA software requires addressing more general questions than software in most other categories, notably including the important question of intended audience(s).

The following standards are accordingly somewhat differently structured than equivalent standards developed to date for other categories, particularly through being more qualitative and abstract. In particular, while documentation is an important component of standards for all categories, clear and instructive documentation is of paramount importance for EDA Software, and so warrants its own sub-section within this document.

5.6.1 Documentation Standards

The following refer to Primary Documentation, implying the main package README or vignette(s), and Secondary Documentation, implying function-level documentation.

The Primary Documentation (README and/or vignette(s)) of EDA software should:

  • EA1.0 Identify one or more target audiences for whom the software is intended
  • EA1.1 Identify the kinds of data the software is capable of analysing (see Kinds of Data below).
  • EA1.2 Identify the kinds of questions the software is intended to help explore; for example, are these questions:
    • inferential?
    • predictive?
    • associative?
    • causal?
    • (or other modes of statistical enquiry?)

The Secondary Documentation (within individual functions) of EDA software should:

  • EA1.3 Identify the kinds of data each function is intended to accept as input

5.6.2 Input Data

A further primary difference of EDA software from that of our other categories is that input data for statistical software may generally be presumed to be of one or more specific types, whereas EDA software often accepts data of more general and varied types. EDA software should aim to accept and appropriately transform as many diverse kinds of input data as possible, through addressing the following standards, considered in terms of the two cases of input data in uni- and multi-variate form. All of the general standards for kinds of input (G2.0 - G2.7) apply to input data for EDA Software.

5.6.2.1 Index Columns

The following standards refer to an index column, which is understood to imply an explicitly named or identified column which can be used to provide a unique index into any and all rows of that table. Index columns ensure the universal applicability of standard table join operations, such as those implemented via the dplyr package.

  • EA2.0 EDA Software which accepts standard tabular data and implements or relies upon extensive table filter and join operations should utilise an index column system
  • EA2.1 All values in an index column must be unique, and this uniqueness should be affirmed as a pre-processing step for all input data.
  • EA2.2 Index columns should be explicitly identified, either:
    • EA2.2a by using an appropriate class system, or
    • EA2.2b through setting an attribute on a table, x, of attr(x, "index") <- <index_col_name>.
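
The following sketch illustrates how uniqueness of an index column might be asserted (EA2.1), and how that column might be identified through an attribute (EA2.2b); the column names are purely illustrative.

x <- data.frame (id = c ("a", "b", "c"), value = runif (3))
stopifnot (!any (duplicated (x$id))) # assert uniqueness of index values
attr (x, "index") <- "id" # explicitly identify the index column
attributes (x)$index
#> [1] "id"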

For EDA software which either implements custom classes or explicitly sets attributes specifying index columns, these attributes should be used as the basis of all table join operations, and in particular:

  • EA2.3 Table join operations should not be based on any assumed variable or column names

5.6.2.2 Multi-tabular input

EDA software designed to accept multi-tabular input should:

  • EA2.4 Use and demand an explicit class system for such input (for example, via the DM package).
  • EA2.5 Ensure all individual tables follow the above standards for Index Columns

5.6.2.3 Classes and Sub-Classes

Classes are understood here to be the classes which define single input objects, while Sub-Classes refer to the class definitions of components of input objects (for example, of columns of an input data.frame). EDA software which is intended to receive input in general vector formats (see Uni-variate Input section of General Standards) should ensure:

  • EA2.6 Routines appropriately process vector input of custom classes, including those which do not inherit from the vector class
  • EA2.7 Routines should appropriately process vector data regardless of additional attributes

The following code illustrates some ways by which “metadata” defining classes and additional attributes associated with a standard vector object may be modified.

x <- 1:10
class (x) <- "notvector"
attr (x, "extra_attribute") <- "another attribute"
attr (x, "vector attribute") <- runif (5)
attributes (x)
#> $class
#> [1] "notvector"
#> 
#> $extra_attribute
#> [1] "another attribute"
#> 
#> $`vector attribute`
#> [1] 0.03521663 0.49418081 0.60129563 0.75804346 0.16073301

All statistical software should appropriately deal with such input data, as exemplified by the storage.mode(), length(), and sum() functions of the base package, which return the appropriate values regardless of redefinition of class or additional attributes.

storage.mode (x)
#> [1] "integer"
length (x)
#> [1] 10
sum (x)
#> [1] 55
storage.mode (sum (x))
#> [1] "integer"

Tabular inputs in data.frame class may contain columns which are themselves defined by custom classes, and which possess additional attributes. EDA Software which accepts tabular inputs should accordingly ensure:

  • EA2.8 EDA routines appropriately process tabular input of custom classes, ideally by means of a single pre-processing routine which converts tabular input to some standard form subsequently passed to all analytic routines.
  • EA2.9 EDA routines accept and appropriately process tabular input in which individual columns may be of custom sub-classes including additional attributes.

5.6.3 Analytic Algorithms

(There are no specific standards for analytic algorithms in EDA Software.)

5.6.4 Return Results / Output Data

  • EA4.0 EDA Software should ensure all return results have types which are consistent with input types. For example, sum, min, or max values applied to integer-type vectors should return integer values, while mean or var will generally return numeric types.
  • EA4.1 EDA Software should implement parameters to enable explicit control of numeric precision
  • EA4.2 The primary routines of EDA Software should return objects for which default print and plot methods give sensible results. Default summary methods may also be implemented.
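
The following sketch, with hypothetical function and parameter names, illustrates the kind of explicit control of numeric precision envisaged by EA4.1.

eda_summary <- function (x, digits = 3) {
    # round all summary statistics to a user-controlled precision
    round (c (mean = mean (x), sd = sd (x)), digits = digits)
}
eda_summary (runif (100), digits = 2)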

5.6.5 Visualization and Summary Output

Visualization commonly represents one of the primary functions of EDA Software, and thus visualization output is given greater consideration in this category than in other categories in which visualization may nevertheless play an important role. In particular, one component of this sub-category is Summary Output, taken to refer to all forms of screen-based output beyond conventional graphical output, including tabular and other text-based forms. Standards for visualization itself are considered in the two primary sub-categories of static and dynamic visualization, where the latter includes interactive visualization.

Prior to these individual sub-categories, we consider a few standards applicable to visualization in general, whether static or dynamic.

  • EA5.0 Graphical presentation in EDA software should be as accessible as possible or practicable. In particular, EDA software should consider accessibility in terms of:
    • EA5.0a Typeface sizes should default to sizes which explicitly enhance accessibility.
    • EA5.0b Default colour schemes should be carefully constructed to ensure accessibility.
  • EA5.1 Any explicit specifications of typefaces which override default values should consider accessibility

5.6.5.1 Summary and Screen-based Output

  • EA5.2 Screen-based output should never rely on default print formatting of numeric types; rather, it should use some version of round(., digits), formatC, sprintf, or similar functions for numeric formatting, according to the parameter described in EA4.1.
  • EA5.3 Column-based summary statistics should always indicate the storage.mode, class, or equivalent defining attribute of each column (as, for example, implemented in the default print.tibble method).
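
As a brief illustration of EA5.2, the following code formats a numeric value for screen output with an explicitly controlled number of digits, rather than relying on default print formatting.

formatC (pi, format = "f", digits = 3)
#> [1] "3.142"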

5.6.5.2 General Standards for Visualization (Static and Dynamic)

  • EA5.4 All visualisations should include units on all axes, with sensibly rounded values (for example, as produced by the pretty() function).

5.6.5.3 Dynamic Visualization

Dynamic visualization routines are commonly implemented as interfaces to javascript routines. Unless routines have been explicitly developed as an internal part of an R package, standards shall not be considered to apply to the code itself, rather only to decisions present as user-controlled parameters exposed within the R environment. That said, one standard may nevertheless be applied, with an aim to minimise redundant bundling of external libraries:

  • EA5.5 Any packages which internally bundle libraries used for dynamic visualization and which are also bundled in other, pre-existing R packages, should explain the necessity and advantage of re-bundling that library.

5.6.6 Testing

5.6.6.1 Return Values

  • EA6.0 Return values from all functions should be tested, including tests for the following characteristics:
    • EA6.0a Classes and types of objects
    • EA6.0b Dimensions of tabular objects
    • EA6.0c Column names (or equivalent) of tabular objects
    • EA6.0d Classes or types of all columns contained within data.frame-type tabular objects
    • EA6.0e Values of single-valued objects; for numeric values either using testthat::expect_equal() or equivalent with a defined value for the tolerance parameter, or using round(..., digits = x) with some defined value of x prior to testing equality.
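
The following sketch, using the testthat package with a hypothetical function eda_fn() and hypothetical expected values, illustrates tests of some of these return-value characteristics.

library (testthat)
res <- eda_fn (USArrests) # hypothetical EDA routine
expect_s3_class (res, "data.frame") # EA6.0a: class of return object
expect_identical (dim (res), c (50L, 4L)) # EA6.0b: dimensions (hypothetical values)
expect_identical (names (res), names (USArrests)) # EA6.0c: column names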

5.6.6.2 Graphical Output

  • EA6.1 The properties of graphical output from EDA software should be explicitly tested, for example via the vdiffr package or equivalent.

Tests for graphical output are frequently only run as part of an extended test suite.

5.7 Time Series Software

Time series software is presumed to perform one or more of the following steps:

  1. Accept and validate input data
  2. Apply data transformation and pre-processing steps
  3. Apply one or more analytic algorithms
  4. Return the result of that algorithmic application
  5. Offer additional functionality such as printing or summarising return results

This document details standards for each of these steps, each prefixed with “TS”.

5.7.1 Input data structures and validation

Input validation is an important software task, and an important part of our standards. While there are many ways to approach validation, the class systems of R offer a particularly convenient and effective means. For Time Series Software in particular, a range of class systems have been developed, for which we refer to the section “Time Series Classes” in the CRAN Task View on Time Series Analysis, and the class-conversion package tsbox. Software which uses and relies on defined classes can often validate input through affirming appropriate class(es). Software which does not use or rely on class systems will generally need specific routines to validate input data structures. In particular, because of the long history of time series software in R, and the variety of class systems for representing time series data, new time series packages should accept as many different classes of input as possible by according with the following standards:

  • TS1.0 Time Series Software should use and rely on explicit class systems developed for representing time series data, and should not permit generic, non-time-series input

The core algorithms of time-series software are often ultimately applied to simple vector objects, and some time series software accepts simple vector inputs, assuming these to represent temporally sequential data. Permitting such generic inputs nevertheless prevents any such assumptions from being asserted or tested. Missing values pose particular problems in this regard. A simple na.omit() call or similar will shorten the length of the vector by removing any NA values, and will change the explicit temporal relationship between elements. The use of explicit classes for time series generally ensures an ability to explicitly assert properties such as strict temporal regularity, and to control for any deviation from expected properties.

  • TS1.1 Time Series Software should explicitly document the types and classes of input data able to be passed to each function.
  • TS1.2 Time Series Software should accept input data in as many time series specific classes as possible.
  • TS1.3 Time Series Software should implement validation routines to confirm that inputs are of acceptable classes (or represented in otherwise appropriate ways for software which does not use class systems).
  • TS1.4 Time Series Software should implement a single pre-processing routine to validate input data, and to appropriately transform it to a single uniform type to be passed to all subsequent data-processing functions (the tsbox package provides one convenient approach for this).
  • TS1.5 The pre-processing function described above should maintain all time- or date-based components or attributes of input data.
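
The following sketch, with a hypothetical function name, illustrates a single pre-processing routine in the spirit of TS1.3 and TS1.4, under the assumption that the base ts class is used as the single uniform internal type (the tsbox package offers a more general alternative).

validate_ts_input <- function (x) {
    # accept several common time series classes (TS1.2) ...
    if (!inherits (x, c ("ts", "zoo", "xts"))) {
        stop ("Input must be a time series object ('ts', 'zoo', or 'xts'), ",
              "not '", class (x) [1], "'.")
    }
    # ... and convert to a single uniform internal type (TS1.4);
    # conversion of 'zoo' or 'xts' objects requires the 'zoo' package.
    stats::as.ts (x)
}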

For Time Series Software which relies on or implements custom classes or types for representing time-series data, the following standards should be adhered to:

  • TS1.6 The software should ensure strict ordering of the time, frequency, or equivalent ordering index variable.
  • TS1.7 Any violations of ordering should be caught in the pre-processing stages of all functions.

5.7.1.1 Time Intervals and Relative Time

While most common packages and classes for time series data assume absolute temporal scales such as those represented in POSIX classes for dates or times, time series may also be quantified on relative scales where the temporal index variable quantifies intervals rather than absolute times or dates. Many analytic routines which accept time series inputs in absolute form are also appropriately applied to analogous data in relative form, and thus many packages should accept time series inputs both in absolute and relative forms. Software which can or should accept times series inputs in relative form should:

  • TS1.8 Accept inputs defined via the units package for attributing SI units to R vectors.
  • TS1.9 Where time intervals or periods may be days or months, be explicit about the system used to represent such, particularly regarding whether a calendar system is used, or whether a year is presumed to have 365 days, 365.2422 days, or some other value.

5.7.2 Pre-processing and Variable Transformation

5.7.2.1 Missing Data

One critical pre-processing step for Time Series Software is the appropriate handling of missing data. It is convenient to distinguish between implicit and explicit missing data. For regular time series, explicit missing data may be represented by NA values, while for irregular time series, implicit missing data may be represented by missing rows. The difference is demonstrated in the following table.

Missing Values

Time    value
08:43   0.71
08:44   NA
08:45   0.28
08:47   0.34
08:48   0.07

The value for 08:46 is implicitly missing, while the value for 08:44 is explicitly missing. These two forms of missingness may connote different things, and may require different forms of pre-processing. With this in mind, the following standards apply:

  • TS2.0 Appropriate checks for missing data, and associated transformation routines, should be performed as part of initial pre-processing prior to passing data to analytic algorithms.
  • TS2.1 Time Series Software which presumes or requires regular data should only allow explicit missing values, and should issue appropriate diagnostic messages, potentially including errors, in response to any implicit missing values.
  • TS2.2 Where possible, all functions should provide options for users to specify how to handle missing data, with options minimally including:
    • TS2.2a error on missing data.
    • TS2.2b warn or ignore missing data, and proceed to analyse irregular data, ensuring that results from function calls with regular yet missing data return identical values to submitting equivalent irregular data with no missing values.
    • TS2.2c replace missing data with appropriately imputed values.
  • TS2.3 Functions should never assume non-missingness, and should never pass data with potential missing values to any base routines with default na.rm = FALSE-type parameters (such as mean(), sd() or var()).
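
As a brief illustration of the checks envisaged by TS2.0 and TS2.1, the following code detects both the explicitly and the implicitly missing values from the example table above, using successive differences of the time index for the latter (an arbitrary date is assumed for illustration).

times <- as.POSIXct (c ("2020-01-01 08:43", "2020-01-01 08:44",
                        "2020-01-01 08:45", "2020-01-01 08:47",
                        "2020-01-01 08:48"))
values <- c (0.71, NA, 0.28, 0.34, 0.07)
any (is.na (values)) # explicit missing value at 08:44
#> [1] TRUE
any (diff (times) > min (diff (times))) # implicit missing value at 08:46
#> [1] TRUE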

5.7.2.2 Stationarity

Time Series Software should explicitly document assumptions or requirements made with respect to the stationarity or otherwise of all input data. In particular, any (sub-)functions which assume or rely on stationarity should:

  • TS2.4 Consider stationarity of all relevant moments - typically first (mean) and second (variance) order, or otherwise document why such consideration may be restricted to lower orders only.
  • TS2.5 Explicitly document all assumptions and/or requirements of stationarity
  • TS2.6 Implement appropriate checks for all relevant forms of stationarity, and either:
    • TS2.6a issue diagnostic messages or warnings; or
    • TS2.6b enable or advise on appropriate transformations to ensure stationarity.

The two options in the last point (TS2.6b) respectively translate to enabling transformations to ensure stationarity by providing appropriate routines, generally triggered by some function parameter, or advising on appropriate transformations, for example by directing users to additional functions able to implement appropriate transformations.

5.7.2.3 Covariance Matrices

Where covariance matrices are constructed or otherwise used within or as input to functions, they should:

  • TS2.7 Incorporate a system to ensure that both row and column orders follow the same ordering as the underlying time series data. This may, for example, be done by including the index attribute of the time series data as an attribute of the covariance matrix.
  • TS2.8 Where applicable, covariance matrices should also include specification of appropriate units.

5.7.3 Analytic Algorithms

Analytic algorithms are considered here to reflect the core analytic components of Time Series Software. These may be many and varied, and we explicitly consider only a small subset here.

5.7.3.1 Forecasting

Statistical software which implements forecasting routines should:

  • TS3.0 Provide tests to demonstrate at least one case in which errors widen appropriately with forecast horizon.
  • TS3.1 If possible, provide at least one test which violates TS3.0
  • TS3.2 Document the general drivers of forecast errors or horizons, as demonstrated via the particular cases of TS3.0 and TS3.1
  • TS3.3 Either:
    • TS3.3a Document, preferably via an example, how to trim forecast values based on a specified error margin or equivalent; or
    • TS3.3b Provide an explicit mechanism to trim forecast values to a specified error margin, either via an explicit post-processing function, or via an input parameter to a primary analytic function.

5.7.4 Return Results

For (functions within) Time Series Software which return time series data:

  • TS4.0 Return values should either:
    • TS4.0a Be in the same class as input data, for example by using the tsbox package to re-convert from standard internal format (see 1.4, above); or
    • TS4.0b Be in a unique, preferably class-defined, format.
  • TS4.1 Any units included as attributes of input data should also be included within return values.
  • TS4.2 The type and class of all return values should be explicitly documented.

For (functions within) Time Series Software which return data other than direct series:

  • TS4.3 Return values should explicitly include all appropriate units and/or time scales

5.7.4.1 Data Transformation

Time Series Software which internally implements routines for transforming data to achieve stationarity and which returns forecast values should:

  • TS4.4 Document the effect of any such transformations on forecast data, including potential effects on both first- and second-order estimates.
  • TS4.5 In decreasing order of preference, either:
    • TS4.5a Provide explicit routines or options to back-transform data commensurate with original, non-stationary input data
    • TS4.5b Demonstrate how data may be back-transformed to a form commensurate with original, non-stationary input data.
    • TS4.5c Document associated limitations on forecast values

5.7.4.2 Forecasting

Where Time Series Software implements or otherwise enables forecasting abilities, it should return one of the following three kinds of information. These are presented in decreasing order of preference, such that software should strive to return the first kind of object, failing that the second, and only the third as a last resort.

  • TS4.6 Time Series Software which implements or otherwise enables forecasting should return either:
    • TS4.6a A distribution object, for example via one of the many packages described in the CRAN Task View on Probability Distributions (or the new distributional package as used in the fable package for time-series forecasting).
    • TS4.6b For each variable to be forecast, predicted values equivalent to first- and second-order moments (for example, mean and standard error values).
    • TS4.6c Some more general indication of error involved with forecast estimates.

Beyond these particular standards for return objects, Time Series Software which implements or otherwise enables forecasting should:

  • TS4.7 Ensure that forecast (modelled) values are clearly distinguished from observed (model or input) values, either (in this case in no order of preference) by
    • TS4.7a Returning forecast values alone
    • TS4.7b Returning distinct list items for model and forecast values
    • TS4.7c Combining model and forecast values into a single return object with an appropriate additional column clearly distinguishing the two kinds of data.

5.7.5 Visualization

Time Series Software should:

  • TS5.0 Implement default plot methods for any implemented class system.
  • TS5.1 When representing results in temporal domain(s), ensure that one axis is clearly labelled “time” (or equivalent), with continuous units.
  • TS5.2 Default to placing the “time” (or equivalent) variable on the horizontal axis.
  • TS5.3 Ensure that units of the time, frequency, or index variable are printed by default on the axis.
  • TS5.4 For frequency visualization, abscissa spanning \([-\pi, \pi]\) should be avoided in favour of positive units of \([0, 2\pi]\) or \([0, 0.5]\), in all cases with appropriate additional explanation of units.
  • TS5.5 Provide options to determine whether plots of data with missing values should generate continuous or broken lines.

For the results of forecast operations, Time Series Software should

  • TS5.6 By default indicate distributional limits of forecast on plot
  • TS5.7 By default include model (input) values in plot, as well as forecast (output) values
  • TS5.8 By default provide clear visual distinction between model (input) values and forecast (output) values.

5.8 Machine Learning Software

Click on the following link to view a demonstration Application of Machine Learning Software Standards.

R has an extensive and diverse ecosystem of Machine Learning (ML) software which is very well described in the corresponding CRAN Task View. Unlike most other categories of statistical software considered here, the primary distinguishing feature of ML software is not (necessarily or directly) algorithmic, but rather pertains to a workflow typical of machine learning tasks. In particular, we consider ML software to approach data analysis via the two primary steps of:

  1. Passing a set of training data to an algorithm in order to generate a candidate mapping between that data and some form of pre-specified output or response variable. Such mappings will be referred to here as “models”, with a single analysis of a single set of training data generating one model.
  2. Passing a set of test data to the model(s) generated by the first step in order to derive some measure of predictive accuracy for that model.

A single ML task generally yields two distinct outputs:

  1. The model derived in the first of the previous steps; and
  2. Associated statistics of model performance (as evaluated within the context of the test data used to assess that performance).

A Machine Learning Workflow

Given those initial considerations, we now attempt the difficult task of envisioning a typical standard workflow for inherently diverse ML software. The following workflow ought to be considered an “extensive” workflow, with shorter versions, and correspondingly more restricted sets of standards, possible depending upon envisioned areas of application. For example, the workflow presumes input data to be too large to be stored as a single entity in local memory. Adaptation to situations in which all training data can be loaded into memory may mean that some of the following workflow stages, and therefore corresponding standards, may not apply.

Just as typical workflows are potentially very diverse, so are outputs of ML software, which depend on areas of application and intended purpose of software. The following refers to the “desired output” of ML software, a phrase which is intentionally left non-specific, but which is intended to connote any and all forms of “response variable” and other “pre-specified outputs” such as categorical labels or validation data, along with outputs which may not necessarily be able to be pre-specified in simple uni- or multi-variate form, such as measures of distance between sets of training and validation data.

Such “desired outputs” are presumed to be quantified in terms of a “loss” or “cost” function (hereafter, simply “loss function”) quantifying some measure of distance between a model estimate (resulting from applying the model to one or more components of a training data set) and a pre-defined “valid” output (during training), or a test data set (following training).

Given the foregoing considerations, we consider a typical ML workflow to progress through (at least some of) the following steps:

  1. Input Data Specification Obtain a local copy of input data, often as multiple objects (either on-disk or in memory) in some suitably structured form such as in a series of sub-directories or accompanied by additional data defining the structural properties of input objects. Regardless of form, multiple objects are commonly given generic labels which distinguish between training and test data, along with optional additional categories and labels such as validation data used, for example, to determine accuracy of models applied to training data yet prior to testing.
  2. Pre-Processing Define transformations of input data, including but not restricted to, broadcasting dimensions (as defined below) and standardising data ranges (typically to defined values of mean and standard deviation).
  3. Model and Algorithm Specification Specify the model and associated processes which will be applied to map the input data on to the desired output. This step minimally includes the following distinct stages (generally in no particular order):
    1. Specify the kind of model which will be applied to the training data. ML software often allows the use of pre-trained models, in which case this step includes downloading or otherwise obtaining a pre-trained model, along with specification of which aspects of those models are to be modified through application to a particular set of training and validation data.
    2. Specify the kind of algorithm which will be used to explore the search space (for example some kind of gradient descent algorithm), along with parameters controlling how that algorithm will be applied (for example a learning rate, as defined above).
    3. Specify the kind of loss function which will be used to quantify distance between model estimates and desired output.
  4. Model Training Apply the specified model to the training data to generate a series of estimates from the specified loss function. This stage may also include specifying parameters such as stopping or exit criteria, and parameters controlling batch processing of input data. Moreover, this stage may involve retaining some of the following additional data:
    1. Potential “pre-processing” stages such as initial estimates of optimal learning rates (see above).
    2. Details of summaries of actual paths taken through the search space towards convergence on local or global minimum.
  5. Model Output and Performance Measure the performance of the trained model when applied to the test data set, generally requiring the specification of a metric of model performance or accuracy.

Importantly, ML workflows may be partly iterative. This may in turn potentially confound distinctions between training and test data, and accordingly confound expectations commonly placed upon statistical analyses of statistical independence of response variables. ML routines such as cross-validation repeatedly (re-)partition data between training and test sets. Resultant models can then not be considered to have been developed through application to any single set of truly “independent” data. In the context of the standards that follow, these considerations admit a potential lack of clarity in any notional categorical distinction between training and test data, and between model specification and training.

The preceding workflow mentioned a couple of concepts the definitions of which may be seen by clicking on the corresponding items below. Following that, we proceed to standards for ML software, enumerated and developed with reference to the preceding workflow steps. As described above, these steps may not be applicable to all ML software, and so all of the following standards should be considered to be conditioned on “where applicable.” In order that the following standards initially adhere to the enumeration of workflow steps given above, more general standards pertaining to aspects such as documentation and testing are given following the initial five “workflow” standards.

The following provides a definition of broadcasting, referred to in Step 2, above.

The following definition comes from a vignette for the rray package named Broadcasting.

  • Broadcasting is, “repeating the dimensions of one object to match the dimensions of another.”

This concept runs counter to aspects of standards in other categories, which often suggest that functions should error when passed input objects which do not have commensurate dimensions. Broadcasting is a pre-processing step which enables objects with incommensurate dimensions to be dimensionally reconciled.

The following demonstration is taken directly from the rray package (which is not currently on CRAN).

library (rray)
a <- array(c(1, 2), dim = c(2, 1))
b <- array(c(3, 4), dim = c(1, 2))
# rbind (a, b) # error!
rray_bind (a, b, .axis = 1)
#>      [,1] [,2]
#> [1,]    1    1
#> [2,]    2    2
#> [3,]    3    4
rray_bind (a, b, .axis = 2)
#>      [,1] [,2] [,3]
#> [1,]    1    3    4
#> [2,]    2    3    4

Broadcasting is commonly employed in ML software because it enables ML operations to be implemented on objects with incommensurate dimensions. One example is image analysis, in which training data may all be dimensionally commensurate, yet test images may have different dimensions. Broadcasting allows data to be submitted to ML routines regardless of potentially incommensurate dimensions.

The following provides a definition of learning rate, referred to in Steps 3 and 4, above.

  • Learning Rate (generally) determines the step size used to search for local optima as a fraction of the local gradient.

This parameter is particularly important for training ML algorithms like neural networks, the results of which can be very sensitive to variations in learning rates. A useful overview of the importance of learning rates, and a useful approach to automatically determining appropriate values, is given in this blog post.

Partly because of widespread and current relevance, the category of Machine Learning software is one for which there have been other notable attempts to develop standards. A particularly useful reference is the MLPerf organization which, among other activities, hosts several github repositories providing reference datasets and benchmark conditions for comparing performance aspects of ML software. While such reference or benchmark standards are not explicitly referred to in the current version of the following standards, we expect them to be gradually adapted and incorporated as we start to apply and refine our standards in application to software submitted to our review system.

5.8.1 Input Data Specification

Many of the following standards refer to the labelling of input data as “testing” or “training” data, along with potentially additional labels such as “validation” data. In regard to such labelling, the following two standards apply:

  • ML1.0 Documentation should make a clear conceptual distinction between training and test data (even where such may ultimately be confounded as described above.)
    • ML1.0a Where these terms are ultimately eschewed, these should nevertheless be used in initial documentation, along with clear explanation of, and justification for, alternative terminology.
  • ML1.1 Absent clear justification for alternative design decisions, input data should be expected to be labelled “test”, “training”, and, where applicable, “validation” data.
    • ML1.1a The presence and use of these labels should be explicitly confirmed via pre-processing steps (and tested in accordance with ML7.0, below).
    • ML1.1b Matches to expected labels should be case-insensitive and based on partial matching such that, for example, “Test”, “test”, or “testing” should all suffice.
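
The following sketch illustrates case-insensitive, partial matching of data labels as envisaged by ML1.1b; the labels are illustrative only.

labels <- c ("Training", "test", "Validation")
grepl ("^train", labels, ignore.case = TRUE) # identifies training data
#> [1]  TRUE FALSE FALSE
grepl ("^test", labels, ignore.case = TRUE) # identifies test data
#> [1] FALSE  TRUE FALSE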

The following three standards (ML1.2–ML1.4) represent three possible design intentions for ML software. Only one of these three will generally be applicable to any one piece of software, although it is nevertheless possible that more than one of these standards may apply. The first of these three standards applies to ML software which is intended to process, or capable of processing, input data as a single (generally tabular) object.

  • ML1.2 Training and test data sets for ML software should be able to be input as a single, generally tabular, data object, with the training and test data distinguished either by
    • A specified variable containing, for example, TRUE/FALSE or 0/1 values, or which uses some other system (such as missing (NA) values) to denote test data; and/or
    • An additional parameter designating case or row numbers, or labels of test data.
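
The following sketch illustrates this first design: a single tabular object with a logical column distinguishing training from test data. All names are purely illustrative.

dat <- data.frame (x = runif (10), y = runif (10),
                   train = rep (c (TRUE, FALSE), c (7, 3)))
training <- dat [dat$train, ] # rows used to train the model
test <- dat [!dat$train, ] # rows held out for testing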

The second of these three standards applies to ML software which is intended to process, or capable of processing, input data represented as multiple objects which exist in local memory.

  • ML1.3 Input data should be clearly partitioned between training and test data (for example, through having each passed as a distinct list item), or should enable an additional means of categorically distinguishing training from test data (such as via an additional parameter which provides explicit labels). Where applicable, distinction of validation and any other data should also accord with this standard.

The third of these three standards for data input applies to ML software for which data are expected to be input as references to multiple external objects, generally expected to be read from either local or remote connections.

  • ML1.4 Training and test data sets, along with other necessary components such as validation data sets, should be stored in their own distinctly labelled sub-directories (for distinct files), or according to an explicit and distinct labelling scheme (for example, for database connections). Labelling should in all cases adhere to ML1.1, above.

The following standard applies to all ML software regardless of the applicability or otherwise of the preceding three standards.

  • ML1.5 ML software should implement a single function which summarises the contents of test and training (and other) data sets, minimally including counts of numbers of cases, records, or files, and potentially extending to tables or summaries of file or data types, sizes, and other information (such as unique hashes for each component).

5.8.1.1 Missing Values

Missing data are handled differently by different ML routines, and it is also difficult to suggest generally applicable standards for pre-processing missing values in ML software. The following standards attempt to cover a practical range of typical approaches and applications.

  • ML1.6 ML software which does not admit missing values, and which expects no missing values, should implement explicit pre-processing routines to identify whether data has any missing values, and should generally error appropriately and informatively when passed data with missing values. In addition, ML software which does not admit missing values should:
    • ML1.6a Explain why missing values are not admitted.
    • ML1.6b Provide explicit examples (in function documentation, vignettes, or both) for how missing values may be imputed, rather than simply discarded.
  • ML1.7 ML software which admits missing values should clearly document how such values are processed.
    • ML1.7a Where missing values are imputed, software should offer multiple user-defined ways to impute missing data.
    • ML1.7b Where missing values are imputed, the precise imputation steps should also be explicitly documented, either in tests (see ML7.2 below), function documentation, or vignettes.
  • ML1.8 ML software should enable equal treatment of missing values for both training and test data, with optional user ability to control application to either one or both.
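
An explicit pre-processing check of the kind required by ML1.6, which errors informatively when passed data with missing values, might look like the following sketch (the function name is illustrative only):

assert_no_missing <- function (x) {
    # count missing values in each column
    n_na <- vapply (x, function (i) sum (is.na (i)), integer (1L))
    if (any (n_na > 0L)) {
        stop ("Data contain missing values in column(s) [",
              paste (names (n_na) [n_na > 0L], collapse = ", "),
              "]; please impute or remove these values before proceeding.",
              call. = FALSE)
    }
    invisible (x)
}
assert_no_missing (airquality) # errors, because 'Ozone' and 'Solar.R' contain NA values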

5.8.2 Pre-processing

As reflected in the workflow envisioned at the outset, ML software operates somewhat differently to statistical software in many other categories. In particular, ML software often requires explicit specification of a workflow, including specification of input data (as per the standards of the preceding sub-section), and of both transformations and statistical models to be applied to those data. This section of standards refers exclusively to the transformation of input data as a pre-processing step prior to any specification of, or submission to, actual models.

  • ML2.0 A dedicated function should enable pre-processing steps to be defined and parametrized.
    • ML2.0a That function should return an object which can be directly submitted to a specified model (see section 3, below).
    • ML2.0b Absent explicit justification otherwise, that return object should have a defined class minimally intended to implement a default print method which summarizes the input data set (as per ML1.5 above) and associated transformations (see the following standard).
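
The following sketch illustrates one possible form for such a function (all names are hypothetical), with the returned object given its own class and a default print method as envisaged by ML2.0b:

ml_preprocess <- function (data, centre = TRUE, scale = TRUE) {
    res <- list (data = data,
                 transforms = list (centre = centre, scale = scale))
    class (res) <- "ml_input"
    res
}
print.ml_input <- function (x, ...) {
    # summarise the input data set (ML1.5) and associated transformations
    cat ("<ml_input>: ", nrow (x$data), " cases of ",
         ncol (x$data), " variables\n", sep = "")
    cat ("  transformations: ",
         paste (names (which (unlist (x$transforms))), collapse = ", "),
         "\n", sep = "")
    invisible (x)
}
ml_preprocess (mtcars)
## <ml_input>: 32 cases of 11 variables
##   transformations: centre, scale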

Standards for most other categories of statistical software suggest that pre-processing routines should ensure that input data sets are commensurate, for example, through having equal numbers of cases or rows. In contrast, ML software is commonly intended to accept input data which cannot be guaranteed to be dimensionally commensurate, such as software intended to process rectangular image files which may be of different sizes.

  • ML2.1 ML software which uses broadcasting to reconcile dimensionally incommensurate input data should offer an ability to at least optionally record transformations applied to each input file.

Beyond broadcasting and dimensional transformations, the following standards apply to the pre-processing stages of ML software.

  • ML2.2 ML software which requires or relies upon numeric transformations of input data (such as changes in mean values or variances) should allow optional explicit specification of target values, rather than restricting transformations to default generic values only (such as transformations to z-scores).
    • ML2.2a Where the parameters have default values, reasons for those particular defaults should be explicitly described.
    • ML2.2b Any extended documentation (such as vignettes) which demonstrates the use of explicit values for numeric transformations should explicitly describe why particular values are used.

For all transformations applied to input data, whether of dimension (ML2.1) or scale (ML2.2),

  • ML2.3 The values associated with all transformations should be recorded in the object returned by the function described in the preceding standard (ML2.0).
  • ML2.4 Default values of all transformations should be explicitly documented, both in documentation of parameters where appropriate (such as for numeric transformations), and in extended documentation such as vignettes.
  • ML2.5 ML software should provide options to bypass or otherwise switch off all default transformations.
  • ML2.6 Where transformations are implemented via distinct functions, these should be exported to a package’s namespace so they can be applied in other contexts.
  • ML2.7 Where possible, documentation should be provided for how transformations may be reversed. For example, documentation may demonstrate how the values retained via ML2.3, above, can be used along with transformations either exported via ML2.6 or otherwise exemplified in demonstration code to independently transform data, and then to reverse those transformations.
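
For example, documentation of a simple z-score transformation might record the values of that transformation and demonstrate its reversal along the following lines (the structure used to record the values is illustrative only):

x <- mtcars$mpg
transform_values <- list (mean = mean (x), sd = sd (x))       # recorded values (ML2.3)
z <- (x - transform_values$mean) / transform_values$sd        # forward transformation
x_restored <- z * transform_values$sd + transform_values$mean # reversed (ML2.7)
all.equal (x, x_restored)
## [1] TRUE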

5.8.3 Model and Algorithm Specification

A “model” in the context of ML software is understood to be a means of specifying a mapping between input and output data, generally applied to training and validation data. Model specification is the step of specifying how such a mapping is to be constructed. The specification of what the values of such a model actually are occurs through training the model, and is described in the following sub-section. These standards also refer to control parameters which specify how models are trained. These parameters commonly include values specifying numbers of iterations, training rates, and parameters controlling algorithmic processes such as re-sampling or cross-validation.

  • ML3.0 Model specification should be implemented as a distinct stage subsequent to specification of pre-processing routines (see Section 2, above) and prior to actual model fitting or training (see Section 4, below). In particular,
    • ML3.0a A dedicated function should enable models to be specified without actually fitting or training them, or, if this (ML3) and the following (ML4) stages are controlled by a single function, that function should have a parameter enabling models to be specified yet not fitted (for example, a nofit parameter with a default of nofit = FALSE, which can be set to TRUE to return a specified yet unfitted model).
    • ML3.0b That function should accept as input the objects produced by the previous Input Data Specification stage, and defined according to ML2.0, above.
    • ML3.0c The function described above (ML3.0a) should return an object which can be directly trained as described in the following sub-section (ML4).
    • ML3.0d That return object should have a defined class minimally intended to implement a default print method which summarises the model specification, including values of all relevant parameters.
  • ML3.1 ML software should allow the use both of untrained models, specified through model parameters only, and of pre-trained models. Use of the latter commonly entails an ability to submit a previously-trained model object to the function defined according to ML3.0a, above.
  • ML3.2 ML software should enable different models to be applied to the object specifying data inputs and transformations (see sub-sections 1–2, above) without needing to re-define those preceding steps.

A function fulfilling ML3.0–3.2 might, for example, permit the following arguments:

  1. data: Input data specification constructed according to ML1
  2. model: An optional previously-trained model
  3. control: A list of parameters controlling how the model algorithm is to be applied during the subsequent training phase (ML4).

A function with the arguments defined above would fulfil the preceding three standards, because the data argument would represent the output of ML1, while the model argument would allow different pre-trained models to be submitted using the same data and associated specifications (ML3.1). The provision of a separate data argument would also fulfil ML3.2, by allowing either or both of the model and control parameters to be re-defined while submitting the same data object.
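
In schematic form, such a function might look like the following sketch, in which all names and default values are purely illustrative:

ml_model_spec <- function (data,
                           model = NULL,
                           control = list (optimizer = "sgd", loss = "mse")) {
    # 'data' is the object produced by the pre-processing stage (ML2.0);
    # 'model' may be a previously-trained model (ML3.1)
    spec <- list (data = data, model = model, control = control)
    class (spec) <- "ml_model_spec"  # defined class intended to have a print method (ML3.0d)
    spec
}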

  • ML3.3 Where ML software implements its own distinct classes of model objects, the properties and behaviours of those specific classes of objects should be explicitly compared with objects produced by other ML software. In particular, where possible, ML software should provide extended documentation (as vignettes or equivalent) comparing model objects with those from other ML software, noting both unique abilities and restrictions of any implemented classes.
  • ML3.4 Where training rates are used, ML software should provide explicit documentation both in all functions which use training rates, and in extended form such as vignettes, of the importance of, and/or sensitivity to, different values of training rates. In particular,
    • ML3.4a Unless explicitly justified otherwise, ML software should offer abilities to automatically determine appropriate or optimal training rates, either as distinct pre-processing stages, or as implicit stages of model training.
    • ML3.4b ML software which provides default values for training rates should clearly document anticipated restrictions of validity of those default values; for example through clear suggestions that user-determined and -specified values may generally be necessary or preferable.

5.8.3.1 Control Parameters

Control parameters are considered here to specify how a model is to be applied to a set of training data. These are generally distinct from parameters specifying the actual model (such as model architecture). While we recommend that control parameters be submitted as items of a single named list, this is neither a firm expectation nor an explicit part of the current standards.

  • ML3.5 Parameters controlling optimization algorithms should minimally include:
    • ML3.5a Specification of the type of algorithm used to explore the search space (commonly, for example, some kind of gradient descent algorithm)
    • ML3.5b The kind of loss function used to assess distance between model estimates and desired output.
  • ML3.6 Unless explicitly justified otherwise (for example because ML software under consideration is an implementation of one specific algorithm), ML software should:
    • ML3.6a Implement or otherwise permit usage of multiple ways of exploring search space
    • ML3.6b Implement or otherwise permit usage of multiple loss functions.
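
A control list conforming with ML3.5 might, for example, look like the following (all names and values are purely illustrative):

control <- list (
    optimizer = "gradient_descent",  # ML3.5a: algorithm used to explore the search space
    loss = "mean_squared_error",     # ML3.5b: loss function
    max_iter = 1000L,                # further algorithmic controls
    learn_rate = 0.01
)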

5.8.3.2 CPU and GPU processing

ML software often involves manipulation of large numbers of rectangular arrays for which graphics processing units (GPUs) are often more efficient than central processing units (CPUs). ML software thus commonly offers options to train models using either CPUs or GPUs. While these standards do not currently suggest any particular design choice in this regard, we do note the following:

  • ML3.7 For ML software in which algorithms are coded in C++, user-controlled use of either CPUs or GPUs (on NVIDIA processors at least) should be implemented through direct use of libcudacxx.

That library can be enabled simply by including a single C++ header file, allowing code to be switched between CPU and GPU processing.

5.8.4 Model Training

Model training is the stage of the ML workflow envisioned here in which the actual computation is performed by applying a model specified according to ML3 to data specified according to ML1 and ML2.

  • ML4.0 ML software should generally implement a unified single-function interface to model training, able to receive as input a model specified according to all preceding standards. In particular, models with categorically different specifications, such as different model architectures or optimization algorithms, should be able to be submitted to the same model training function.
  • ML4.1 ML software should at least optionally retain explicit information on paths taken as an optimizer advances towards minimal loss. Such information should minimally include:
    • ML4.1a Specification of all model-internal parameters, or equivalent hashed representation.
    • ML4.1b The value of the loss function at each point.
    • ML4.1c Information used to advance to the next point, for example quantification of the local gradient.
  • ML4.2 The subsequent extraction of information retained according to the preceding standard should be explicitly documented, including through example code.
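
A minimal illustration of the kind of information envisaged by ML4.1, here recorded for a simple one-parameter gradient descent on a quadratic loss function, might be:

loss <- function (b) (b - 2) ^ 2     # quadratic loss with its minimum at b = 2
gradient <- function (b) 2 * (b - 2)
b <- 0       # initial parameter value
rate <- 0.1  # training rate
path <- data.frame (iteration = integer (0), b = numeric (0),
                    loss = numeric (0), gradient = numeric (0))
for (i in seq_len (50)) {
    g <- gradient (b)
    path [i, ] <- c (i, b, loss (b), g)  # ML4.1a-c: parameter, loss, and gradient values
    b <- b - rate * g                    # gradient-descent update
}
head (path, 3)  # first three steps of the retained optimizer path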

5.8.4.1 Batch Processing

The following standards apply to ML software which implements batch processing, commonly to train models on data sets too large to be loaded in their entirety into memory.

  • ML4.3 All parameters controlling batch processing and associated terminology should be explicitly documented, and it should not, for example, be presumed that users will understand the definition of “epoch” as implemented in any particular ML software.

According to that standard, it would for example be inappropriate to have a parameter, nepochs, described as “Number of epochs used in model training”. Rather, the definition and particular implementation of “epoch” must be explicitly defined.

  • ML4.4 Explicit guidance should be provided on the selection of appropriate values for parameters controlling batch processing, for example, on trade-offs between batch sizes and numbers of epochs (with both terms provided as Control Parameters in accordance with the preceding standard, ML3).
  • ML4.5 ML software may optionally include a function to estimate likely time to train a specified model, through estimating initial timings from a small sample of the full batch.
  • ML4.6 ML software should by default provide explicit information on the progress of batch jobs (even where those jobs may be implemented in parallel on GPUs). That information may be optionally suppressed through additional parameters.
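
A function of the kind suggested by ML4.5 might, for example, time a small sample of batches and linearly extrapolate, as in the following sketch (all names are hypothetical, and actual training times will rarely scale exactly linearly):

estimate_training_time <- function (train_one_batch, n_batches, n_sample = 5L) {
    t_sample <- system.time (
        for (i in seq_len (n_sample)) train_one_batch ()
    ) [["elapsed"]]
    t_sample * n_batches / n_sample  # linear extrapolation to all batches
}
# 'train_one_batch' here merely stands in for training a model on one batch of data:
estimate_training_time (function () Sys.sleep (0.01), n_batches = 1000)
## approximately 10 (seconds)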

5.8.4.2 Re-sampling

As described at the outset, ML software does not always rely on pre-specified and categorical distinctions between training and test data. For example, models may be fit to what is effectively one single data set in which specified cases or rows are used as training data, and the remainder as test data. Re-sampling generally refers to the practice of re-defining categorical distinctions between training and test data. One training run accordingly connotes training a model on one particular set of training data and then applying that model to the specified set of test data. Re-sampling starts that process anew, through constructing an alternative categorical partition between test and training data.

Even where test and training data are distinguished by more than a simple data-internal category (such as a labelling column), for example, by being stored in distinctly-named sub-directories, re-sampling may be implemented by effectively shuffling data between training and test sub-directories.

  • ML4.7 ML software should provide an ability to combine results from multiple re-sampling iterations using a single parameter specifying numbers of iterations.
  • ML4.8 Absent any additional specification, re-sampling algorithms should by default partition data according to proportions of original test and training data.
    • ML4.8a Re-sampling routines of ML software should nevertheless offer an ability to explicitly control or override such default proportions of test and training data.
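
For data distinguished by a simple logical labelling column, for example, a default re-sampling routine respecting original proportions (ML4.8) could be as simple as a random permutation of that column (illustrative only):

train <- rep (c (TRUE, FALSE), times = c (70, 30))  # original 70:30 partition
resample_partition <- function (train) {
    sample (train)  # a random permutation preserves the original proportions
}
table (resample_partition (train))
##
## FALSE  TRUE
##    30    70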

5.8.5 Model Output and Performance

Model output is considered here as a stage distinct from model performance. Model output refers to the end result of model training (ML4), while model performance involves the assessment of a trained model against a test data set. The present section first describes standards for model output, which are standards guiding the form of a model trained according to the preceding standards (ML4). Model Performance is then considered as a separate stage.

5.8.5.1 Model Output

  • ML5.0 The result of applying the training processes described above should be contained within a single model object returned by the function defined according to ML4.0, above. Even where the output reflects application to a test data set, the resultant object need not include any information on model performance (see ML5.3–ML5.4, below).
    • ML5.0a That object should either have its own class, or extend some previously-defined class.
    • ML5.0b That class should have a defined print method which summarises important aspects of the model object, including but not limited to summaries of input data and algorithmic control parameters.
  • ML5.1 As for the untrained model objects produced according to the above standards, and in particular as a direct extension of ML3.3, the properties and behaviours of trained models produced by ML software should be explicitly compared with equivalent objects produced by other ML software. (Such comparison will generally be done in terms of model performance, as described in the following standards ML5.3–ML5.4).
  • ML5.2 The structure and functionality of objects representing trained ML models should be thoroughly documented. In particular,
    • ML5.2a Either all functionality extending from the class of model object should be explicitly documented, or a method for listing or otherwise accessing all associated functionality explicitly documented and demonstrated in example code.
    • ML5.2b Documentation should include examples of how to save and re-load trained model objects for their re-use in accordance with ML3.1, above.
    • ML5.2c Where general functions for saving or serializing objects, such as saveRDS, are not appropriate for storing local copies of trained models, an explicit function should be provided for that purpose, and should be demonstrated with example code.

The R6 system for representing classes in R is an example of a system with explicit functionality, all components of which are accessible by a simple ls() call. Adherence to ML5.2a would nevertheless require explicit description of the ability of ls() to supply a list of all functions associated with an object. The mlr package, for example, uses R6 classes, yet neither explicitly describes the use of ls() to list all associated functions, nor explicitly lists those functions.
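
For example, the following minimal R6 class (requiring the R6 package) illustrates how ls() reveals all public functionality of such an object:

library (R6)
# minimal R6 class with one public field and one public method
Model <- R6Class ("Model",
    public = list (
        weights = NULL,
        train = function (data) invisible (self)
    )
)
m <- Model$new ()
ls (m)  # lists all public functionality, here including "train" and "weights"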

5.8.5.2 Model Performance

Model performance refers to the quantitative assessment of a trained model when applied to a set of test data.

  • ML5.3 Assessment of model performance should be implemented as one or more functions distinct from model training.
  • ML5.4 Model performance should be able to be assessed according to a variety of metrics.
    • ML5.4a All model performance metrics represented by functions internal to a package must be clearly and distinctly documented.
    • ML5.4b It should be possible to submit custom metrics to a model assessment function, and the ability to do so should be clearly documented including through example code.
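
An interface fulfilling ML5.4b might, for example, accept metrics as functions of observed and predicted values, as in the following sketch (the function assess_model and its interface are hypothetical):

assess_model <- function (observed, predicted, metric = NULL) {
    if (is.null (metric)) {
        metric <- function (obs, pred) mean ((obs - pred) ^ 2)  # default: mean squared error
    }
    metric (observed, predicted)
}
# submit a custom metric, here mean absolute error:
assess_model (1:10, 1:10 + 0.5, metric = function (obs, pred) mean (abs (obs - pred)))
## [1] 0.5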

The remaining sub-sections specify general standards beyond the preceding workflow-specific ones.

5.8.6 Documentation

  • ML6.0 Descriptions of ML software should make explicit reference to a workflow which separates training and testing stages, and which clearly indicates a need for distinct training and test data sets.

The following standard applies to packages which are intended, or otherwise only able, to encompass a restricted subset of the six primary workflow steps enumerated at the outset. Envisioned here are packages explicitly intended to aid one particular aspect of that general workflow, such as implementations of ML optimization functions, or specific loss measures.

  • ML6.1 ML software intentionally designed to address only a restricted subset of the workflow described here should clearly document how it can be embedded within a typical full ML workflow in the sense considered here.
    • ML6.1a Such demonstrations should include and contrast embedding within a full workflow using at least two other packages to implement that workflow.

5.8.7 Testing

5.8.7.1 Input Data

  • ML7.0 Tests should explicitly confirm partial and case-insensitive matching of “test”, “train”, and, where applicable, “validation” data labels.
  • ML7.1 Tests should demonstrate effects of different numeric scaling of input data (see ML2.2).
  • ML7.2 For software which imputes missing data, tests should compare internal imputation with explicit code which directly implements imputation steps (even where such imputation is a single step implemented via some external package). Such tests serve as an explicit reference for how imputation is performed.

5.8.7.2 Model Classes

The following standard applies to models in both untrained and trained forms, considered to be the respective outputs of the preceding standards ML3 and ML4.

  • ML7.3 Where model objects are implemented as distinct classes, tests should explicitly compare the functionality of these classes with functionality of equivalent classes for ML model objects from other packages.
    • ML7.3a These tests should explicitly identify restrictions on the functionality of model objects in comparison with those of other packages.
    • ML7.3b These tests should explicitly identify functional advantages and unique abilities of the model objects in comparison with those of other packages.

5.8.7.3 Model Training

  • ML7.4 ML software should explicitly document the effects of different training rates, and in particular should demonstrate divergence from optima under inappropriate training rates.
  • ML7.5 ML software which implements routines to determine optimal training rates (see ML3.4, above) should implement tests to confirm the optimality of resultant values.
  • ML7.6 ML software which implements independent training “epochs” should demonstrate in tests the effects of lesser versus greater numbers of epochs.
  • ML7.7 ML software should explicitly test different optimization algorithms, even where software is intended to implement one specific algorithm.
  • ML7.8 ML software should explicitly test different loss functions, even where software is intended to implement one specific measure of loss.
  • ML7.9 Tests should explicitly compare all possible combinations of categorical differences in model specification, such as different model architectures with the same optimization algorithm, the same model architecture with different optimization algorithms, and differences in both.
    • ML7.9a Such combinations will generally be formed from multiple categorical factors, for which explicit use of functions such as expand.grid() is recommended.

The following example illustrates one such approach:

architecture <- c ("archA", "archB")
optimizers <- c ("optA", "optB", "optC")
cost_fns <- c ("costA", "costB", "costC")
expand.grid (architecture, optimizers, cost_fns)
##     Var1 Var2  Var3
## 1  archA optA costA
## 2  archB optA costA
## 3  archA optB costA
## 4  archB optB costA
## 5  archA optC costA
## 6  archB optC costA
## 7  archA optA costB
## 8  archB optA costB
## 9  archA optB costB
## 10 archB optB costB
## 11 archA optC costB
## 12 archB optC costB
## 13 archA optA costC
## 14 archB optA costC
## 15 archA optB costC
## 16 archB optB costC
## 17 archA optC costC
## 18 archB optC costC

All possible combinations of these categorical parameters could then be tested by iterating over the rows of that output.
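
For example, continuing from the output above, and assuming a hypothetical function fit_and_test() which trains and assesses one model for a single combination of categorical parameters, those combinations might be iterated as:

combinations <- expand.grid (architecture = architecture,
                             optimizer = optimizers,
                             cost_fn = cost_fns)
for (i in seq_len (nrow (combinations))) {
    # 'fit_and_test' is hypothetical, standing in for whatever test code
    # trains and assesses a model for one combination of parameters
    fit_and_test (architecture = combinations$architecture [i],
                  optimizer = combinations$optimizer [i],
                  cost_fn = combinations$cost_fn [i])
}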

  • ML7.10 The successful extraction of information on paths taken by optimizers (see ML4.1–ML4.2, above) should be tested, including testing the general properties, though not necessarily the actual values, of such data.

5.8.7.4 Model Performance

  • ML7.11 All performance metrics available for a given class of trained model should be thoroughly tested and compared.
    • ML7.11a Tests which compare metrics should do so over a range of inputs (generally implying differently trained models) to demonstrate relative advantages and disadvantages of different metrics.