# Chapter 5 Standards [SEEKING FEEDBACK]

This Chapter is divided between:

- “*General Standards*”, which may be applied to all software considered within this project, irrespective of how it may be categorized under the categories of statistical software listed above; and
- “*Specific Standards*”, which apply to different degrees to statistical software depending on the software category.

It is likely that standards developed under the first category may subsequently
be deemed to be genuinely *Statistical Standards* which are nevertheless
applicable across all categories. Conversely, the development of
category-specific standards is likely to reveal aspects which are common across
all categories, and which may subsequently be deemed general standards. We
accordingly anticipate a degree of fluidity between these two broad categories.

There is also a necessary relationship between the Standards described here,
and processes of Assessment described below in Chapter 8. We
consider the latter to describe concrete *and generally quantitative* aspects
of *post hoc* software assessment, while the present Standards provide guides
and benchmarks against which to *prospectively* compare software during
development. As this entire document is intended to serve as the defining
reference for our Standards, that term may in turn be interpreted to reflect
this entire document, with the current section explicitly describing aspects of
Standards not covered elsewhere.

As described above, we anticipate the ongoing development of this current document to employ a versioning system, with software reviewed and hosted under the system mandated to flag the latest version of these standards with which it complies.

## 5.1 Other Standards

Among the noteworthy instances of software standards which might be adapted for
our purposes, and in addition to entries in our *Annotated
Bibliography*, the following are particularly relevant:

- The Core Infrastructure Initiative’s Best Practices Badge, which is granted to software meeting an extensive list of criteria. This list of criteria provides a singularly useful reference for software standards.
- The Software Sustainability Institute’s *Software Evaluation Guide*, in particular their guide to *Criteria-based software evaluation*, which considers two primary categories of *Usability* and *Sustainability and Maintainability*, each of which is divided into numerous sub-categories. The guide identifies numerous concrete criteria for each sub-category, explicitly detailed below in order to provide an example of the kind of standards that might be adapted and developed for application to the present project.
- The *Transparent Statistics Guidelines*, by the “HCI (Human Computer Interaction) Working Group”. While currently only in its beginning phases, that document aims to provide concrete guidance on “transparent statistical communication.” If its development continues, it is likely to provide useful guidelines on best practices for how statistical software produces and reports results.
- The more technical considerations of the Object Management Group’s *Automated Source Code CISQ Maintainability Measure* (where CISQ refers to the *Consortium for IT Software Quality*). This guide describes a number of measures which can be automatically extracted and used to quantify the maintainability of source code. All of these measures are already considered in one or both of the preceding two documents, but the identification of measures particularly amenable to automated assessment provides a particularly useful reference.

There is also rOpenSci’s guide on package development, maintenance, and peer review, which provides standards of this type for R packages, primarily within its first chapter. Another notable example is the tidyverse design guide, and the section on Conventions for R Modeling Packages which provides guidance for model-fitting APIs.

Specific standards for neural network algorithms have also been developed as
part of a Google Summer of Code 2019 project, resulting in a dedicated
R package, `NNbenchmark`, and accompanying results (their so-called
“notebooks”) of applying their benchmarks to a suite of neural network packages.

## 5.2 General Standards for Statistical Software

**Notes on nomenclature of Data Types**

These standards refer to **Data Types** as the fundamental types defined by the
R language itself, namely the following:

- Logical
- Integer
- Continuous (`class = "numeric"` / `typeof = "double"`)
- Complex
- String / character

The base R system also includes what are considered here to be direct extensions of fundamental types to include:

- Factor
- Ordered Factor
- Date/Time

The continuous type has a `typeof` of “double” because that represents the
storage mode in the C representation of such objects, while the `class` as
defined within R is referred to as “numeric”. While `typeof` is not the same as
`class`, with reference to continuous variables, “numeric” may be considered
identical to “double” throughout.

The term “character” is interpreted here to refer to a vector each element of which is an individual “character” object. The term “string” does not relate to any official R nomenclature, but is used here to refer for convenience to a character vector of length one; in other words, a “string” is the sole element of a single-length “character” vector.
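The distinctions drawn above can be checked directly in an R session. The following is a minimal illustration only; the object names `x` and `s` are arbitrary.

```r
x <- c (1.5, 2, 3.25) # a continuous vector
typeof (x) # "double": the storage mode of the underlying C representation
class (x) # "numeric": the class as defined within R

s <- "a string" # a "string" in the sense used in these standards
is.character (s) && length (s) == 1L # TRUE: a character vector of length one
```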

Examples of application of the following standards may be viewed as separate
`hackmd.io` files by clicking on the following links:

- Application of Bayesian and Monte Carlo Standards
- Application of Regression and Supervised Learning Standards
- Application of Dimensionality Reduction, Clustering, and Unsupervised Learning Standards
- Application of Exploratory Data Analysis Standards
- Application of Machine Learning Software Standards

Each of those files compares both general and category-specific standards against selected R packages within those categories. These comparisons are intended for illustrative purposes only, and are in no way intended to represent evaluations of the software. They are presented in the hope of demonstrating how the standards presented here may be applied to software, and what the results of such application may look like.

### 5.2.1 Documentation

**G1.0** *Statistical Software should list at least one primary reference from published academic literature.*

We consider that statistical software submitted under our system will either
(i) implement or extend prior methods, in which case the *primary reference*
will be to the most relevant published version(s) of prior methods; or (ii) be
an implementation of some new method. In the second case, it will be expected
that the software will eventually form the basis of an academic publication.
Until that time, the most suitable reference for equivalent algorithms or
implementations should be provided.

#### 5.2.1.1 Statistical Terminology

**G1.1** *All statistical terminology should be clarified and unambiguously defined.*

Developers should not presume anywhere in the documentation of software that specific statistical terminology may be “generally understood”, and therefore not need explicit clarification. Even terms which many may consider sufficiently generic as to not require such clarification, such as “null hypotheses” or “confidence intervals”, will generally need explicit clarification. For example, both the estimation and interpretation of confidence intervals are dependent on distributional properties and associated assumptions. Any particular implementation of procedures to estimate or report on confidence intervals will accordingly reflect assumptions on distributional properties (among other aspects), both the nature and implications of which must be explicitly clarified.

Standards will include requirements for form and completeness of documentation.
As with interface, several sources already provide starting points for
reasonable documentation. Some documentation requirements will be specific to
the statistical context. For instance, it is likely we will have requirements
for referencing appropriate literature or references for theoretical support of
implementations. Another area of importance is correctness and clarity of
definitions of statistical quantities produced by the software, e.g., the
definition of null hypotheses or confidence intervals. Data included in
software – that used in examples or tests – will also have documentation
requirements. It is worth noting that the `roxygen2` system for documenting R
packages is readily extensible, as exemplified through the `roxytest` package
for specifying tests *in-line*.

#### 5.2.1.2 Function-level Documentation

#### 5.2.1.3 Supplementary Documentation

The following standards describe several forms of what might be considered
“Supplementary Material”. While there are many places within an R package where
such material may be included, common locations include vignettes, or in
additional directories (such as `data-raw`) listed in `.Rbuildignore` to
prevent inclusion within installed packages.

Where software supports a publication, the following standard applies to all claims made in the publication with regard to software performance (for example, claims of algorithmic scaling or efficiency; or claims of accuracy):

**G1.3** *Software should include all code necessary to reproduce results which form the basis of performance claims made in associated publications.*

Where claims regarding aspects of software performance are made with respect to other extant R packages, the following standard applies:

**G1.4** *Software should include code necessary to compare performance claims with alternative implementations in other R packages.*

### 5.2.2 Input Structures

This section considers general standards for *Input Structures*. These
standards may often effectively be addressed through implementing class
structures, although this is not a general requirement. Developers are
nevertheless encouraged to examine the guide to S3 vectors in the `vctrs`
package as an example of the kind of
assurances and validation checks that are possible with regard to input data.
Systems like those demonstrated in that vignette provide a very effective way
to ensure that software remains robust to diverse and unexpected classes and
types of input data.

#### 5.2.2.1 Uni-variate (Vector) Input

It is important to note for univariate data that single values in R are vectors
with a length of one, and that `1L` is of exactly the same *data type* as `1:n`.
Given this, inputs expected to be univariate should:

**G2.0** *Implement assertions on lengths of inputs, particularly through asserting that inputs expected to be single- or multi-valued are indeed so.*

- **G2.0a** *Provide explicit secondary documentation of any expectations on lengths of inputs.*

**G2.1** *Implement assertions on types of inputs (see the initial point on nomenclature above).*

- **G2.1a** *Provide explicit secondary documentation of expectations on data types of all vector inputs.*

**G2.2** *Appropriately prohibit or restrict submission of multivariate input to parameters expected to be univariate.*

**G2.3** *For univariate character input:*

- **G2.3a** *Use `match.arg()` or equivalent where applicable to only permit expected values.*
- **G2.3b** *Either: use `tolower()` or equivalent to ensure input of character parameters is not case dependent; or explicitly document that parameters are strictly case-sensitive.*

**G2.4** *Provide appropriate mechanisms to convert between different data types, potentially including:*

- **G2.4a** *explicit conversion to `integer` via `as.integer()`*
- **G2.4b** *explicit conversion to continuous via `as.numeric()`*
- **G2.4c** *explicit conversion to character via `as.character()` (and not `paste` or `paste0`)*
- **G2.4d** *explicit conversion to factor via `as.factor()`*
- **G2.4e** *explicit conversion from factor via `as...()` functions*

**G2.5** *Where inputs are expected to be of `factor` type, secondary documentation should explicitly state whether these should be `ordered` or not, and those inputs should provide appropriate error or other routines to ensure inputs follow these expectations.*
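As a minimal sketch of how standards G2.0 to G2.3 might be addressed with base R only, consider the following function. The function `scale_values()` and its arguments `centre` and `method` are hypothetical names invented for this illustration, and are not part of any package.

```r
# `scale_values()` is a hypothetical function used for illustration only
scale_values <- function (x, centre = 0, method = "mean") {
    # G2.1: assert types of inputs
    stopifnot (is.numeric (x), is.numeric (centre))
    # G2.0 / G2.2: `centre` is expected to be single-valued
    if (length (centre) != 1L) {
        stop ("'centre' must be a single value, not a vector of length ",
              length (centre))
    }
    # G2.3: case-independent matching against permitted values only
    stopifnot (is.character (method), length (method) == 1L)
    method <- match.arg (tolower (method), c ("mean", "median"))
    scale <- if (method == "mean") mean (x) else stats::median (x)
    (x - centre) / scale
}

scale_values (c (2, 4, 6), method = "MEAN")
#> [1] 0.5 1.0 1.5
```

Submitting a multi-valued `centre`, or a `method` other than the two permitted values, produces an error rather than a silently recycled or ignored input.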

A few packages implement R versions of “static type” forms common in other
languages, whereby the type of a variable must be explicitly specified prior to
assignment. Use of such approaches is encouraged, including but not restricted
to approaches documented in packages such as `vctrs`, or the experimental
package `typed`.

#### 5.2.2.2 Tabular Input

This sub-section concerns input in “tabular data” forms, implying the two
primary distinctions within R itself between `array` or `matrix`
representations, and `data.frame` and associated representations. Among
important differences between these two forms are that `array`/`matrix` classes
are restricted to storing data of a single uniform type (for example, all
`integer` or all `character` values), whereas `data.frame` and associated
representations store each column as a list item, allowing different columns to
hold values of different types. Further noting that a `matrix` may, as of R
version 4.0, be considered as a strictly two-dimensional array, tabular inputs
for the purposes of these standards are considered to imply data represented in
one or more of the following forms:

- `matrix` form when referring to specifically two-dimensional data of one uniform type
- `array` form as a more general expression, or when referring to data that are not necessarily or strictly two-dimensional
- `data.frame`
- Extensions such as
    - `tibble`
    - `data.table`
    - domain-specific classes such as `tsibble` for time series, or `sf` for spatial data.

Both `matrix` and `array` forms are actually stored as vectors with a single
`storage.mode`, and so all of the preceding standards **G2.0**–**G2.5** apply.
The other rectangular forms are not stored as vectors, and do not necessarily
have a single `storage.mode` for all columns. These forms are referred to
throughout these standards as “`data.frame`-type tabular forms”, which may be
assumed to refer to data represented in either the `base::data.frame` format,
and/or any of the classes listed in the final point above.

General Standards applicable to software which is intended to accept any one or
more of these `data.frame`-type tabular inputs are then that:

**G2.6** *Software should accept as input as many of the above standard tabular forms as possible, including extension to domain-specific forms.*

**G2.7** *Software should provide appropriate conversion or dispatch routines as part of initial pre-processing to ensure that all other sub-functions of a package receive inputs of a single defined class or type.*

**G2.8** *Software should issue diagnostic messages for type conversion in which information is lost (such as conversion of variables from factor to character; standardisation of variable names; or removal of meta-data such as those associated with `sf`-format data) or added (such as insertion of variable or column names where none were provided).*
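One way to approach G2.7 and G2.8 is a single pre-processing function which converts every accepted tabular form to one defined class, and which reports any conversion that loses or adds information. The following is a sketch only, using base R; the helper name `as_standard_df()` and the wording of its messages are assumptions made for this illustration.

```r
# `as_standard_df()` is a hypothetical pre-processing helper
as_standard_df <- function (x) {
    if (is.matrix (x)) {
        if (is.null (colnames (x))) {
            # G2.8: information is added, so report it
            message ("No column names provided; inserting default names")
        }
        x <- as.data.frame (x)
    } else if (inherits (x, "data.frame")) {
        if (!identical (class (x), "data.frame")) {
            # G2.8: class-specific meta-data may be lost
            message ("Converting input of class [",
                     paste (class (x), collapse = ", "),
                     "] to a plain 'data.frame'")
        }
        x <- as.data.frame (x)
    } else {
        stop ("Input must be a matrix or data.frame-type object")
    }
    is_fac <- vapply (x, is.factor, logical (1))
    if (any (is_fac)) {
        message ("Converting factor column(s) to character: ",
                 paste (names (x) [is_fac], collapse = ", "))
        x [is_fac] <- lapply (x [is_fac], as.character)
    }
    x # G2.7: always a plain data.frame without factor columns
}
```

All other functions of a package could then assume input which has passed through this one routine, rather than repeating class checks throughout.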

The next standard concerns the following inconsistencies between three common
tabular classes with regard to the column extraction operator, `[`.

```
x <- data.frame (a = 1:3, b = 4:6) # for example; x may be any kind of `data.frame` object
class (x)
#> [1] "data.frame"
class (x [, 1])
#> [1] "integer"
class (x [, 1, drop = TRUE]) # default
#> [1] "integer"
class (x [, 1, drop = FALSE])
#> [1] "data.frame"
x <- tibble::tibble (x)
class (x [, 1])
#> [1] "tbl_df" "tbl" "data.frame"
class (x [, 1, drop = TRUE])
#> [1] "integer"
class (x [, 1, drop = FALSE]) # default
#> [1] "tbl_df" "tbl" "data.frame"
x <- data.table::data.table (x)
class (x [, 1])
#> [1] "data.table" "data.frame"
class (x [, 1, drop = TRUE]) # no effect
#> [1] "data.table" "data.frame"
class (x [, 1, drop = FALSE]) # default
#> [1] "data.table" "data.frame"
```

- Extracting a single column from a `data.frame` returns a `vector` by default, and a `data.frame` if `drop = FALSE`.
- Extracting a single column from a `tibble` returns a single-column `tibble` by default, and a `vector` if `drop = TRUE`.
- Extracting a single column from a `data.table` always returns a `data.table`, and the `drop` argument has no effect.

Given such inconsistencies,

**G2.9** *Software should ensure that extraction or filtering of single columns from tabular inputs should not presume any particular default behaviour, and should ensure all column-extraction operations behave consistently regardless of the class of tabular data used as input.*

Adherence to the above standard G2.7 will ensure that any implicitly or explicitly assumed default behaviour will yield consistent results regardless of input classes.
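Where conversion to a single class is not desired, one alternative is to avoid `[` for single-column extraction altogether. The list-extraction operator, `[[`, returns the column itself for `data.frame`, `tibble`, and `data.table` objects alike. The sketch below demonstrates this on a base `data.frame` only, so that it runs without additional packages; the helper name `get_column()` is hypothetical.

```r
# `get_column()` is a hypothetical helper used for illustration only
get_column <- function (x, i) {
    stopifnot (inherits (x, "data.frame"), length (i) == 1L)
    x [[i]] # does not depend on any `drop` default
}

x <- data.frame (a = 1:3, b = c (1.5, 2.5, 3.5))
class (get_column (x, 1))
#> [1] "integer"
class (get_column (x, "b"))
#> [1] "numeric"
```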

#### 5.2.2.3 Missing or Undefined Values

**G2.10** *Statistical Software should implement appropriate checks for missing data as part of initial pre-processing prior to passing data to analytic algorithms.*

**G2.11** *Where possible, all functions should provide options for users to specify how to handle missing (`NA`) data, with options minimally including:*

- **G2.11a** *error on missing data*
- **G2.11b** *ignore missing data with default warnings or messages issued*
- **G2.11c** *replace missing data with appropriately imputed values*

**G2.12** *Functions should never assume non-missingness, and should never pass data with potential missing values to any base routines with default `na.rm = FALSE`-type parameters (such as `mean()`, `sd()` or `cor()`).*

**G2.13** *All functions should also provide options to handle undefined values (e.g., `NaN`, `Inf` and `-Inf`), including potentially ignoring or removing such values.*
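The three options of G2.11 might be exposed through a single argument, as in the following sketch. The function `safe_mean()` and its argument `na_action` are hypothetical, and the mean imputation shown is deliberately simplistic and intended for illustration only. For brevity, this sketch treats undefined values (`NaN`, `Inf`, `-Inf`) in the same way as `NA`.

```r
# `safe_mean()` and `na_action` are hypothetical names used for illustration
safe_mean <- function (x, na_action = c ("error", "ignore", "impute")) {
    na_action <- match.arg (na_action)
    stopifnot (is.numeric (x))
    # is.finite() is FALSE for NA, NaN, Inf, and -Inf (G2.10, G2.13)
    bad <- !is.finite (x)
    if (any (bad)) {
        if (na_action == "error") { # G2.11a
            stop (sum (bad), " missing or undefined value(s) in 'x'")
        } else if (na_action == "ignore") { # G2.11b
            warning ("Ignoring ", sum (bad), " missing or undefined value(s)")
            x <- x [!bad]
        } else { # G2.11c
            message ("Imputing ", sum (bad), " value(s) with the mean")
            x [bad] <- mean (x [!bad])
        }
    }
    mean (x) # G2.12: no missing values reach the base routine
}
```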

### 5.2.3 Output Structures

**G3.0** *Statistical Software which enables outputs to be written to local files should parse parameters specifying file names to ensure appropriate file suffixes are automatically generated where not provided.*
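A minimal sketch of such parsing, using `tools::file_ext()` from the base R distribution, might look like the following. The helper name `ensure_suffix()` and the default suffix of `"csv"` are assumptions made for this illustration.

```r
# `ensure_suffix()` is a hypothetical helper used for illustration only
ensure_suffix <- function (filename, suffix = "csv") {
    stopifnot (is.character (filename), length (filename) == 1L)
    # append the default suffix only where none was provided
    if (!nzchar (tools::file_ext (filename))) {
        filename <- paste0 (filename, ".", suffix)
    }
    filename
}

ensure_suffix ("results")
#> [1] "results.csv"
ensure_suffix ("results.txt")
#> [1] "results.txt"
```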

### 5.2.4 Testing

All packages should follow rOpenSci standards on
testing and continuous
integration, including aiming for high
test coverage. Extant R packages which may be useful for testing include
`testthat`, `tinytest`, `roxytest`, and `xpectr`.

#### 5.2.4.1 Test Data Sets

**G4.0** *Where applicable or practicable, tests should use standard data sets with known properties (for example, the NIST Standard Reference Datasets, or data sets provided by other widely-used R packages).*

**G4.1** *Data sets created within, and used to test, a package should be exported (or otherwise made generally available) so that users can confirm tests and run examples.*

#### 5.2.4.2 Responses to Unexpected Input

**G4.2** *Appropriate error and warning behaviour of all functions should be explicitly demonstrated through tests. In particular,*

- **G4.2a** *Every message produced within R code by `stop()`, `warning()`, `message()`, or equivalent should be unique*
- **G4.2b** *Explicit tests should demonstrate conditions which trigger every one of those messages, and should compare the result with expected values.*

**G4.3** *For functions which are expected to return objects containing no missing (`NA`) or undefined (`NaN`, `Inf`) values, the absence of any such values in return objects should be explicitly tested.*
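The following sketch shows the kind of tests implied by G4.2 and G4.3, written with base R only so that it runs without a testing framework. In a package these checks would more commonly be expressed through a framework such as `testthat`. The function `check_positive()` is a hypothetical function under test.

```r
# `check_positive()` is a hypothetical function under test
check_positive <- function (x) {
    if (!is.numeric (x)) {
        stop ("'x' must be numeric")
    }
    if (any (x <= 0)) {
        warning ("'x' contains non-positive values, which will be removed")
    }
    x [x > 0]
}

# G4.2b: trigger each condition and compare the message with expected values
err <- tryCatch (check_positive ("a"),
                 error = function (e) conditionMessage (e))
wrn <- tryCatch (check_positive (c (-1, 2)),
                 warning = function (w) conditionMessage (w))
stopifnot (identical (err, "'x' must be numeric"))
stopifnot (grepl ("non-positive", wrn))

# G4.3: explicitly test absence of missing or undefined values in the result
res <- suppressWarnings (check_positive (c (-1, 2, 3)))
stopifnot (!anyNA (res), all (is.finite (res)))
```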

#### 5.2.4.3 Algorithm Tests

For testing *statistical algorithms*, tests should include tests of the
following types:

**G4.4** **Correctness tests** *to test that statistical algorithms produce expected results to some fixed test data sets (potentially through comparisons using binding frameworks such as RStata).*

- **G4.4a** *For new methods, it can be difficult to separate out correctness of the method from the correctness of the implementation, as there may not be a reference for comparison. In this case, testing may be implemented against simple, trivial cases or against multiple implementations such as an initial R implementation compared with results from a C/C++ implementation.*
- **G4.4b** *For new implementations of existing methods, correctness tests should include tests against previous implementations. Such testing may explicitly call those implementations in testing, preferably from fixed-versions of other software, or use stored outputs from those where that is not possible.*
- **G4.4c** *Where applicable, stored values may be drawn from published paper outputs where code from original implementations is not available.*

**G4.5** *Correctness tests should be run with a fixed random seed.*

**G4.6** **Parameter recovery tests** *to test that the implementation produces expected results given data with known properties. For instance, a linear regression algorithm should return expected coefficient values for a simulated data set generated from a linear model.*

- **G4.6a** *Parameter recovery tests should generally be expected to succeed within a defined tolerance rather than recovering exact values.*
- **G4.6b** *Parameter recovery tests should be run with multiple random seeds when either data simulation or the algorithm contains a random component. (When long-running, such tests may be part of an extended, rather than regular, test suite; see G4.10-4.12, below).*

**G4.7** **Algorithm performance tests** *to test that implementation performs as expected as properties of data change. For instance, a test may show that parameters approach correct estimates within tolerance as data size increases, or that convergence times decrease for higher convergence thresholds.*

**G4.8** **Edge condition tests** *to test that these conditions produce expected behaviour such as clear warnings or errors when confronted with data with extreme properties including but not limited to:*

- **G4.8a** *Zero-length data*
- **G4.8b** *Data of unsupported types (e.g., character or complex numbers for functions designed only for numeric data)*
- **G4.8c** *Data with all-`NA` fields or columns or all identical fields or columns*
- **G4.8d** *Data outside the scope of the algorithm (for example, data with more fields (columns) than observations (rows) for some regression algorithms)*

**G4.9** **Noise susceptibility tests** *Packages should test for expected stochastic behaviour, such as through the following conditions:*

- **G4.9a** *Adding trivial noise (for example, at the scale of `.Machine$double.eps`) to data does not meaningfully change results*
- **G4.9b** *Running under different random seeds or initial conditions does not meaningfully change results*
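The linear regression example mentioned under G4.6 can be sketched as follows, together with a noise susceptibility test in the sense of G4.9a. This is an illustration only: the simulation function `simulate_lm()`, the true coefficients of 1 and 2, the sample size, and the tolerance of 0.05 are all assumptions chosen for this example, with `stats::lm()` standing in for the algorithm under test.

```r
# `simulate_lm()` is a hypothetical helper which simulates data from a known
# linear model, y = 1 + 2 * x + noise
simulate_lm <- function (seed, n = 1000) {
    set.seed (seed)
    x <- runif (n)
    data.frame (x = x, y = 1 + 2 * x + rnorm (n, sd = 0.1))
}

# G4.6a, G4.6b: recover known parameters within a defined tolerance, and
# under multiple random seeds
for (seed in 1:5) {
    fit <- lm (y ~ x, data = simulate_lm (seed))
    stopifnot (all (abs (coef (fit) - c (1, 2)) < 0.05))
}

# G4.9a: trivial noise should not meaningfully change results
dat <- simulate_lm (1)
noisy <- dat
noisy$y <- noisy$y + runif (nrow (dat), -1, 1) * .Machine$double.eps
c0 <- coef (lm (y ~ x, data = dat))
c1 <- coef (lm (y ~ x, data = noisy))
stopifnot (isTRUE (all.equal (c0, c1)))
```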

#### 5.2.4.4 Extended tests

Thorough testing of statistical software may require tests on large data sets, tests with many permutations, or other conditions leading to long-running tests. In such cases it may be neither possible nor advisable to execute tests continuously, or with every code change. Software should nevertheless test any and all conditions regardless of how long tests may take, and in doing so should adhere to the following standards:

**G4.10** *Extended tests should be included and run under a common framework with other tests but be switched on by flags such as a `<MYPKG>_EXTENDED_TESTS=1` environment variable.*

**G4.11** *Where extended tests require large data sets or other assets, these should be provided for downloading and fetched as part of the testing workflow.*

- **G4.11a** *When any downloads of additional data necessary for extended tests fail, the tests themselves should not fail, rather be skipped and implicitly succeed with an appropriate diagnostic message.*

**G4.12** *Any conditions necessary to run extended tests such as platform requirements, memory, expected runtime, and artefacts produced that may need manual inspection, should be described in developer documentation such as a `CONTRIBUTING.md` or `tests/README.md` file.*
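A flag of the kind described in G4.10 can be read with `Sys.getenv()`. In the following sketch, the variable name `MYPKG_EXTENDED_TESTS` stands in for `<MYPKG>_EXTENDED_TESTS`, and the helper `run_extended()` is a hypothetical name; neither is prescribed by these standards.

```r
# `run_extended()` is a hypothetical helper; "MYPKG" stands in for a package name
run_extended <- function () {
    identical (Sys.getenv ("MYPKG_EXTENDED_TESTS"), "1")
}

if (run_extended ()) {
    # long-running tests go here, within the same framework as regular tests
    x <- rnorm (1e6)
    stopifnot (abs (mean (x)) < 0.01)
} else {
    message ("Skipping extended tests; set MYPKG_EXTENDED_TESTS=1 to run them")
}
```

Testing frameworks generally provide their own skip mechanisms, which could be driven by the same flag.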

## 5.3 Bayesian and Monte Carlo Software

Click on the following link to view a demonstration Application of Bayesian and Monte Carlo Standards.

Bayesian and Monte Carlo Software (hereafter referred to for simplicity as “Bayesian Software”) is presumed to perform one or more of the following steps:

1. Document how to specify inputs including:
    - 1.1 Data
    - 1.2 Hyperparameters determining prior distributions
    - 1.3 Parameters determining the computational processes
2. Accept and validate all forms of input
3. Apply data transformation and pre-processing steps
4. Apply one or more analytic algorithms, generally sampling algorithms used to generate estimates of posterior distributions
5. Return the result of that algorithmic application
6. Offer additional functionality such as printing or summarising return results

This document details standards for each of these steps, each prefixed with “BS”.

### 5.3.1 Documentation of Inputs

Prior to actual standards for documentation of inputs, we note one terminological standard for Bayesian software:

**BS1.0** *Bayesian software should use the term “hyperparameter” exclusively to refer to parameters determining the form of prior distributions, and should use either the generic term “parameter” or some conditional variant(s) such as “computation parameters” to refer to all other parameters.*

Bayesian Software should provide the following documentation of how to specify inputs:

**BS1.1** *Description of how to enter data, both in textual form and via code examples. Both of these should consider the simplest cases of single objects representing independent and dependent data, and potentially more complicated cases of multiple independent data inputs.*

**BS1.2** *Description of how to specify prior distributions, both in textual form describing the general principles of specifying prior distributions, along with more applied descriptions and examples, within:*

- **BS1.2a** *The main package `README`, either as textual description or example code*
- **BS1.2b** *At least one package vignette, both as general and applied textual descriptions, and example code*
- **BS1.2c** *Function-level documentation, preferably with code included in examples*

**BS1.3** *Description of all parameters which control the computational process (typically those determining aspects such as numbers and lengths of sampling processes, seeds used to start them, thinning parameters determining post-hoc sampling from simulated values, and convergence criteria). In particular:*

- **BS1.3a** *Bayesian Software should document, both in text and examples, how to use the output of previous simulations as starting points of subsequent simulations.*
- **BS1.3b** *Where applicable, Bayesian software should document, both in text and examples, how to use different sampling algorithms for a given model.*

**BS1.4** *For Bayesian Software which implements or otherwise enables convergence checkers, documentation should explicitly describe and provide examples of use with and without convergence checkers.*

**BS1.5** *For Bayesian Software which implements or otherwise enables multiple convergence checkers, differences between these should be explicitly tested.*

### 5.3.2 Input Data Structures and Validation

This section contains standards primarily intended to ensure that input data, including model specifications, are validated prior to passing through to the main computational algorithms.

#### 5.3.2.1 Input Data

Bayesian Software is commonly designed to accept generic one- or
two-dimensional forms such as vector, matrix, or `data.frame` objects. The
first standards concern the range of possible generic forms for input *data*:

**BS2.0** *Bayesian Software which accepts one-dimensional input should ensure values are appropriately pre-processed regardless of class structures. The `units` package provides a good example, in creating objects that may be treated as vectors, yet which have a class structure that does not inherit from the `vector` class. Using these objects as input often causes software to fail. The `storage.mode` of the underlying objects may nevertheless be examined, and the objects transformed or processed accordingly to ensure such inputs do not lead to errors.*

**BS2.1** *Bayesian Software which accepts two-dimensional input should implement pre-processing routines to ensure conversion of as many forms as possible to some standard format which is then passed to all analytic functions. In particular, tests should demonstrate that:*

- **BS2.1a** *`data.frame` or equivalent objects which have columns which do not themselves have standard class attributes (typically, `vector`) are appropriately processed, and do not error without reason. This behaviour should be tested. Again, columns created by the `units` package provide a good test case.*
- **BS2.1b** *`data.frame` or equivalent objects which have list columns should ensure that those columns are appropriately pre-processed either through being removed, converted to equivalent vector columns where appropriate, or some other appropriate treatment. This behaviour should be tested.*

**BS2.2** *Bayesian Software should implement pre-processing routines to ensure all input data is dimensionally commensurate, for example by ensuring commensurate lengths of vectors or numbers of rows of tabular inputs.*
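A sketch of pre-processing along the lines of BS2.0 and BS2.2 follows. The helper `prepare_inputs()` is hypothetical, and the class `"my_units"` used in the usage example is an invented stand-in for classed vectors such as those created by the `units` package, so that the sketch runs with base R alone.

```r
# `prepare_inputs()` is a hypothetical pre-processing helper
prepare_inputs <- function (x, y) {
    # BS2.0: examine storage mode rather than class, then strip attributes
    for (v in list (x, y)) {
        if (!storage.mode (v) %in% c ("integer", "double")) {
            stop ("Inputs must have numeric storage mode")
        }
    }
    x <- as.numeric (unclass (x))
    y <- as.numeric (unclass (y))
    # BS2.2: inputs must be dimensionally commensurate
    if (length (x) != length (y)) {
        stop ("'x' and 'y' must have the same length; found ",
              length (x), " and ", length (y))
    }
    list (x = x, y = y)
}

# "my_units" is an invented class used only to mimic a classed vector
x <- structure (c (1, 2, 3), class = "my_units", units = "m")
dat <- prepare_inputs (x, 4:6)
```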

#### 5.3.2.2 Prior Distributions, Model Specifications, and Hyperparameters

The second set of standards in this section concern specification of prior
distributions, model structures, or other equivalent ways of specifying
hypothesised relationships among input data structures. R already has a diverse
range of Bayesian Software with distinct approaches to this task, commonly
either through specifying a model as a character vector representing an R
function, or an external file either as R code, or encoded according to some
alternative system (such as for `rstan`).

As explicated above, the term “hyperparameters” is interpreted here to refer to parameters which define prior distributions, while a “model specification”, or simply “model”, is an encoded description of how those hyperparameters are hypothesised to transform to a posterior distribution.

Bayesian Software should:

**BS2.3** *Ensure that all appropriate validation and pre-processing of hyperparameters are implemented as distinct pre-processing steps prior to submitting to analytic routines, and especially prior to submitting to multiple parallel computational chains.*

**BS2.4** *Ensure that lengths of hyperparameter vectors are checked, with no excess values silently discarded (unless such output is explicitly suppressed, as detailed below).*

**BS2.5** *Ensure that lengths of hyperparameter vectors are commensurate with expected model input (see example immediately below)*

**BS2.6** *Where possible, implement pre-processing checks to validate appropriateness of numeric values submitted for hyperparameters; for example, by ensuring that hyperparameters defining second-order moments such as distributional variance or shape parameters, or any parameters which are logarithmically transformed, are non-negative.*
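For the simplest case in which the expected form of the hyperparameters is known in advance, checks of length and value might be sketched as follows. The function `validate_normal_prior()` is hypothetical, and assumes for illustration a normal prior defined by two hyperparameters, a mean and a standard deviation.

```r
# `validate_normal_prior()` is a hypothetical helper; it assumes a normal
# prior specified as c (mean, sd)
validate_normal_prior <- function (hyper) {
    stopifnot (is.numeric (hyper))
    # BS2.4, BS2.5: check length; excess values are never silently discarded
    if (length (hyper) != 2L) {
        stop ("Expected 2 hyperparameters (mean, sd) but received ",
              length (hyper))
    }
    # BS2.6: the standard deviation must be finite and non-negative
    if (!is.finite (hyper [2]) || hyper [2] < 0) {
        stop ("The standard deviation hyperparameter must be non-negative")
    }
    stats::setNames (hyper, c ("mean", "sd"))
}
```

Such a function would be called once during pre-processing, before any values are passed to computational chains (BS2.3).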

The following example demonstrates how standards like the above (BS2.5-2.6)
might be addressed. Consider the following function which defines a
log-likelihood estimator for a linear regression, controlled via a vector of
three hyperparameters, `p`:
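The definition below is the function `ll()` as it is used later in this section. The simulated data and the values passed as `p` are illustrative assumptions added here only to show a call to the function.

```r
ll <- function (p, x, y) dnorm (y - (p [1] + x * p [2]), sd = p [3], log = TRUE)

# simulated data for illustration only
set.seed (1)
x <- runif (10)
y <- 1 + 2 * x + rnorm (10, sd = 0.5)
# log-likelihood values for each observation, and their sum
vals <- ll (p = c (1, 2, 0.5), x = x, y = y)
sum (vals)
```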

Pre-processing stages should be used to determine:

- That the dimensions of the input data, `x` and `y`, are commensurate (BS2.2); non-commensurate inputs should error by default.
- The length of the vector `p` (BS2.4)

The latter task is not necessarily straightforward, because the definition of
the function, `ll()`, will itself generally be part of the input to an actual
Bayesian Software function. This functional input thus needs to be examined to
determine expected lengths of hyperparameter vectors. The following code
illustrates one way to achieve this, relying on utilities for parsing function
calls in R, primarily through the `getParseData` function from the `utils`
package. The parse data for a function can be
extracted with the following line:
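The extraction step, as it appears in the complete example later in this section, is reproduced here in self-contained form. The argument `keep.source = TRUE` is an addition made for this sketch: without it, parse data are only retained by default in interactive sessions.

```r
ll <- function (p, x, y) dnorm (y - (p [1] + x * p [2]), sd = p [3], log = TRUE)
# `keep.source = TRUE` ensures parse data are retained in non-interactive use
p <- parse (text = deparse (ll), keep.source = TRUE)
x <- utils::getParseData (p)
head (x [, c ("token", "text")])
```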

The object `x` is a `data.frame` of every R token (such as an expression,
symbol, or operator) parsed from the function `ll`. The following section
illustrates how this data can be used to determine the expected lengths of
vector inputs to the function, `ll()`.


Input arguments used to define parameter vectors in any R software are accessed
through R’s standard vector access syntax of `vec[i]`, for some element `i` of
a vector `vec`. The parse data for such begins with the `SYMBOL` of `vec`, the
`[`, a `NUM_CONST` for the value of `i`, and a closing `]`. The following code
can be used to extract elements of the parse data which match this pattern, and
ultimately to extract the various values of `i` used to access members of
`vec`.

```
vector_length <- function (x, i) {
    xn <- x [which (x$token %in% c ("SYMBOL", "NUM_CONST", "'['", "']'")), ]
    # split resultant data.frame at first "SYMBOL" entry
    xn <- split (xn, cumsum (xn$token == "SYMBOL"))
    # reduce to only those matching the above pattern
    xn <- xn [which (vapply (xn, function (j)
                             j$text [1] == i & nrow (j) > 3,
                             logical (1)))]
    ret <- NA_integer_ # default return value
    if (length (xn) > 0) {
        # get all values of NUM_CONST as integers
        n <- vapply (xn, function (j)
                     as.integer (j$text [j$token == "NUM_CONST"] [1]),
                     integer (1), USE.NAMES = FALSE)
        # and return max of these
        ret <- max (n)
    }
    return (ret)
}
```

That function can then be used to determine the length of any inputs which are used as hyperparameter vectors:

```
ll <- function (p, x, y) dnorm (y - (p[1] + x * p[2]), sd = p[3], log = TRUE)
# keep.source = TRUE retains the parse data in non-interactive sessions
p <- parse (text = deparse (ll), keep.source = TRUE)
x <- utils::getParseData (p)
# extract the names of the parameters:
params <- unique (x$text [x$token == "SYMBOL"])
lens <- vapply (params, function (i) vector_length (x, i), integer (1))
lens
#> y p x
#> NA 3 NA
```

The vector `p` is thus identified as a hyperparameter vector containing three
parameters. Any initial value vectors can then be examined to ensure that they
have this same length.

Not all Bayesian Software is designed to accept model inputs expressed as R
code. The `rstan` package, for example, implements its own model specification
language, and only allows hyperparameters to be named, and not addressed by
index. While this largely avoids problems of mismatched lengths of parameter
vectors, the software (at v2.21.1) does not ensure the existence of named
parameters prior to starting the computational chains. This ultimately results
in each chain generating an error when a model specification refers to a
non-existent or undefined hyperparameter. Such controls should be part of a
single pre-processing stage, and so should only generate a single error.

#### 5.3.2.3 Computational Parameters

Computational parameters are considered here as those passed to Bayesian functions other than hyperparameters determining the forms of prior distributions. They typically include parameters controlling lengths of runs, lengths of burn-in periods, numbers of parallel computations, other parameters controlling how samples are to be generated, or convergence criteria. All Computational Parameters should be checked for general “sanity” prior to calling primary computational algorithms. The standards for such sanity checks include that Bayesian Software should:

- **BS2.7** *Check that values for parameters are positive (except where negative values may be accepted)*
- **BS2.8** *Check lengths and/or dimensions of inputs, and either automatically reject or provide appropriate diagnostic messaging for parameters of inappropriate length or dimension; for example passing a vector of length > 1 to a parameter presumed to define a single value (unless such output is explicitly suppressed, as detailed below)*
- **BS2.9** *Check that arguments are of expected classes or types (for example, check that* `integer`*-type arguments are indeed* `integer`*, with explicit conversion via* `as.integer` *where not)*
- **BS2.10** *Automatically reject parameters of inappropriate type (for example* `character` *values passed for* `integer`*-type parameters that are unable to be appropriately converted).*
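The following sketch illustrates how such sanity checks might be collected in a single routine for a single-valued, integer-type parameter. The function `check_count()` and the parameter names `n_iterations` and `n_burnin` are hypothetical, and are not taken from any particular package.

```r
# hypothetical checker for a single-valued, integer-type computational parameter
check_count <- function (value, name) {
    # BS2.8: reject vectors passed where a single value is expected
    if (length (value) != 1L) {
        stop ("'", name, "' must be a single value, not a vector of length ",
              length (value))
    }
    # BS2.10: reject values which cannot be appropriately converted
    converted <- suppressWarnings (as.integer (value))
    if (is.na (converted)) {
        stop ("'", name, "' could not be converted to an integer")
    }
    # BS2.9: explicit conversion, with a message where a value is changed
    if (is.numeric (value) && converted != value) {
        message ("'", name, "' converted from ", value, " to ", converted)
    }
    # BS2.7: values must be positive
    if (converted <= 0L) {
        stop ("'", name, "' must be positive")
    }
    converted
}

n_iterations <- check_count (1000, "n_iterations")
n_burnin <- check_count ("200", "n_burnin")
```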

The following two sub-sections consider particular cases of computational parameters.

#### 5.3.2.4 Seed Parameters

Bayesian software should:

- **BS2.11** *Enable seeds to be passed as a parameter (through a direct* `seed` *argument or similar), or as a vector of parameters, one for each chain.*
- **BS2.12** *Enable results of previous runs to be used as starting points for subsequent runs*

Bayesian Software which implements parallel processing should:

- **BS2.13** *Ensure each chain is started with a different seed by default*
- **BS2.14** *Issue diagnostic messages when identical seeds are passed to distinct computational chains*
- **BS2.15** *Explicitly document advice* not *to use* `set.seed()`
- **BS2.16** *Provide the parameter with a* plural *name: for example, “starting_values” and not “starting_value”*

To avoid potential confusion between separate parameters to control random seeds and starting values, we recommend a single “starting values” argument rather than a “seeds” argument, with appropriate translation of these parameters into seeds where necessary.
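One way in which per-chain seeds might be handled is sketched here. The helper `chain_seeds()` and its arguments are hypothetical; the sketch assumes that starting values are supplied directly as integer seeds.

```r
# hypothetical helper: one seed per chain from a 'starting_values' argument
chain_seeds <- function (n_chains, starting_values = NULL) {
    if (is.null (starting_values)) {
        # BS2.13: sampling without replacement gives distinct default seeds
        starting_values <- sample.int (.Machine$integer.max, n_chains)
    }
    if (length (starting_values) != n_chains) {
        stop ("'starting_values' must have one value for each chain")
    }
    # BS2.14: diagnostic message when distinct chains share a seed
    if (anyDuplicated (starting_values) > 0L) {
        message ("Identical seeds passed to distinct chains; ",
                 "those chains will generate identical samples")
    }
    as.integer (starting_values)
}

seeds <- chain_seeds (n_chains = 4L)
```

The plural argument name follows BS2.16, and the returned vector could be stored in the return object so that runs can be reproduced or continued.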

#### 5.3.2.5 Output Verbosity

All Bayesian Software should implement computational parameters to control output verbosity. Bayesian computations are often time-consuming, and often performed as batch computations. The following standards should be adhered to in regard to output verbosity:

- **BS2.17** *Bayesian Software should implement at least one parameter controlling the verbosity of output, defaulting to verbose output of all appropriate messages, warnings, errors, and progress indicators.*
- **BS2.18** *Bayesian Software should enable suppression of messages and progress indicators, while retaining verbosity of warnings and errors. This should be tested.*
- **BS2.19** *Bayesian Software should enable suppression of warnings where appropriate. This should be tested.*
- **BS2.20** *Bayesian Software should explicitly enable errors to be caught, and appropriately processed either through conversion to warnings, or otherwise captured in return values. This should be tested.*
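A minimal sketch of a verbosity parameter follows, covering only BS2.17 and BS2.18. The function `run_chains()` and its `quiet` argument are hypothetical stand-ins for a package's main routine.

```r
# hypothetical fitting routine with a single verbosity parameter
run_chains <- function (n_iterations, quiet = FALSE) {
    if (!quiet) {
        # BS2.17: verbose by default
        message ("Running ", n_iterations, " iterations")
    }
    if (n_iterations < 100) {
        # BS2.18: warnings are retained even when messages are suppressed
        warning ("fewer than 100 iterations; results may be unreliable")
    }
    invisible (n_iterations)
}

run_chains (1000)                # message issued by default
run_chains (1000, quiet = TRUE)  # message suppressed
```

Tests of these standards can capture conditions with `tryCatch()` and confirm both that the message disappears when `quiet = TRUE`, and that the warning does not.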

### 5.3.3 Pre-processing and Data Transformation

#### 5.3.3.1 Missing Values

Bayesian Software should:

- **BS3.0** *Explicitly document assumptions made in regard to missing values; for example that data is assumed to contain no missing (*`NA`*,* `Inf`*) values, and that such values, or entire rows including any such values, will be automatically removed from input data.*
- **BS3.1** *Implement appropriate routines to pre-process missing values prior to passing data through to main computational algorithms.*

#### 5.3.3.2 Perfect Collinearity

Where appropriate, Bayesian Software should:

- **BS3.2** *Implement pre-processing routines to diagnose perfect collinearity, and provide appropriate diagnostic messages or warnings*
- **BS3.3** *Provide distinct routines for processing perfectly collinear data, potentially bypassing sampling algorithms*

An appropriate test for BS3.3 would confirm that timings from `system.time()`
or equivalent timing expressions for perfectly collinear data are *less* than
those for equivalent routines called with non-collinear data. Alternatively, a
test could ensure that perfectly collinear data passed to a function with a
stopping criterion generate no results, while specifying a fixed number of
iterations may generate results.
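A pre-processing diagnostic for perfect collinearity (BS3.2) might be sketched as follows, here using the rank of a QR decomposition. The function name `is_collinear()` is hypothetical, and other approaches are equally possible.

```r
# hypothetical check: TRUE if columns of 'x' are perfectly collinear
is_collinear <- function (x) {
    x <- as.matrix (x)
    # the column-centred matrix is rank-deficient under perfect collinearity
    qr (scale (x, center = TRUE, scale = FALSE))$rank < ncol (x)
}

set.seed (1)
x1 <- rnorm (20)
x_ok <- cbind (x1, x2 = rnorm (20))
x_bad <- cbind (x1, x2 = 2 * x1 + 1)   # exact linear function of x1
is_collinear (x_ok)
is_collinear (x_bad)
```

Centring the columns first means that an exact linear relationship with a non-zero offset, as in `x_bad`, is also detected.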

### 5.3.4 Analytic Algorithms

As mentioned, analytic algorithms for Bayesian Software are commonly algorithms to simulate posterior distributions, and to draw samples from those simulations. Numerous extant R packages implement and offer sampling algorithms, and not all Bayesian Software will internally implement sampling algorithms. The following standards apply to packages which do implement internal sampling algorithms:

- **BS4.0** *Packages should document sampling algorithms (generally via literary citation, or reference to other software)*
- **BS4.1** *Packages should provide explicit comparisons with external samplers which demonstrate intended advantage of implementation (generally via tests, vignettes, or both).*

Regardless of whether or not Bayesian Software implements internal sampling algorithms, it should:

- **BS4.2** *Implement at least one means to validate posterior estimates (for example through the functionality of the* `BayesValidate` *package, noting that that package has not been updated for almost 15 years, and such approaches may need adapting; or the Simulation Based Calibration approach implemented in the* `rstan` *function* `sbc`*).*

Where possible or applicable, Bayesian Software should:

- **BS4.3** *Implement at least one type of convergence checker, and provide a documented reference for that implementation.*
- **BS4.4** *Enable computations to be stopped on convergence (although not necessarily by default).*
- **BS4.5** *Ensure that appropriate mechanisms are provided for models which do not converge. This is often achieved by having default behaviour to stop after specified numbers of iterations regardless of convergence.*
- **BS4.6** *Implement tests to confirm that results with convergence checker are statistically equivalent to results from equivalent fixed number of samples without convergence checking.*
- **BS4.7** *Where convergence checkers are themselves parametrised, the effects of such parameters should also be tested. For threshold parameters, for example, lower values should result in longer sequence lengths.*
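As an illustration of a convergence checker, the following sketch computes the basic potential scale reduction factor of Gelman and Rubin (1992) from a matrix of samples with one column per chain. This is a simplified form, not the split or rank-normalised variants used in current software, and the function names and the default threshold of 1.1 are assumptions made for this example.

```r
# potential scale reduction factor; 'chains' has one column per chain
rhat <- function (chains) {
    n <- nrow (chains)
    w <- mean (apply (chains, 2, var))   # mean within-chain variance
    b <- n * var (colMeans (chains))     # between-chain variance
    sqrt (((n - 1) / n * w + b / n) / w)
}

# hypothetical stopping rule with a parametrised threshold (see BS4.7)
has_converged <- function (chains, threshold = 1.1) {
    rhat (chains) < threshold
}

set.seed (1)
mixed <- matrix (rnorm (4000), ncol = 4)             # four chains, same target
stuck <- mixed + rep (c (0, 0, 5, 5), each = 1000)   # two chains offset by 5
has_converged (mixed)
has_converged (stuck)
```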

### 5.3.5 Return Values

Unlike software in many other categories, Bayesian Software should generally return several kinds of distinct data, both the raw data derived from statistical algorithms, and associated metadata. Such distinct and generally disparate forms of data will generally be best combined into a single object through implementing a defined class structure, although other options are possible, including (re-)using extant class structures (see the CRAN Task View on Bayesian Inference, https://cran.r-project.org/web/views/Bayesian.html, for reference to other packages and class systems). Regardless of the precise form of return object, and whether or not defined class structures are used or implemented, the objects returned from Bayesian Software should include:

- **BS5.0** *Seed(s) or starting value(s), including values for each sequence where multiple sequences are included*
- **BS5.1** *Appropriate metadata on types (or classes) and dimensions of input data*

With regard to the input function, or alternative means of specifying prior distributions:

- **BS5.2** *Bayesian Software should either:*
    - **BS5.2a** *Return the input function or prior distributional specification in the return object; or*
    - **BS5.2b** *Enable direct access to such via additional functions which accept the return object as single argument.*

Where convergence checkers are implemented or provided, Bayesian Software should:

- **BS5.3** *Return convergence statistics or equivalent*
- **BS5.4** *Where multiple checkers are enabled, return details of convergence checker used*
- **BS5.5** *Appropriate diagnostic statistics to indicate absence of convergence are either returned or immediately able to be accessed.*

### 5.3.6 Additional Functionality

Bayesian Software should:

- **BS6.0** *Implement a default* `print` *method for return objects*
- **BS6.1** *Implement a default* `plot` *method for return objects*
- **BS6.2** *Provide and document straightforward abilities to plot sequences of posterior samples, with burn-in periods clearly distinguished*
- **BS6.3** *Provide and document straightforward abilities to plot posterior distributional estimates*

Bayesian Software may:

- **BS6.4** *Provide* `summary` *methods for return objects*
- **BS6.5** *Provide abilities to plot both sequences of posterior samples and distributional estimates together in single graphic*

### 5.3.7 Tests

#### 5.3.7.1 Parameter Recovery Tests

Bayesian software should implement the following tests which demonstrate and confirm an ability to recover parameters:

- **BS7.0** Recovery of parametric estimates of a prior distribution
- **BS7.1** Recovery of a prior distribution in the absence of any additional data or information
- **BS7.2** Recovery of an expected posterior distribution given a specified prior and some input data
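The form of such tests can be sketched with a conjugate Beta-binomial model, for which the posterior is known analytically. The function `posterior_draws()` is a hypothetical stand-in for a package's own sampler; actual tests would call the software being tested instead.

```r
# stand-in for a package's sampler: draws from the conjugate Beta posterior
# for a binomial likelihood with 'k' successes in 'n' trials
posterior_draws <- function (n_draws, a, b, k = 0, n = 0) {
    rbeta (n_draws, shape1 = a + k, shape2 = b + n - k)
}

set.seed (1)
a <- 2
b <- 5
# BS7.1: with no data, draws should recover the Beta (a, b) prior
prior_draws <- posterior_draws (1e5, a, b)
prior_error <- abs (mean (prior_draws) - a / (a + b))
# BS7.2: with data, draws should recover the analytic posterior mean
post_draws <- posterior_draws (1e5, a, b, k = 7, n = 10)
post_error <- abs (mean (post_draws) - (a + 7) / (a + b + 10))
```

A test would then assert that both errors lie within a tolerance chosen with regard to the Monte Carlo error for the number of draws used.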

#### 5.3.7.2 Algorithmic Scaling Tests

- **BS7.3** Bayesian software should include tests which demonstrate and confirm the scaling of algorithmic efficiency with sizes of input data; for example, that computation times increase approximately logarithmically with increasing sizes of input data.

#### 5.3.7.3 Scaling of Input to Output Data

- **BS7.4** Bayesian software should implement tests which confirm that predicted or fitted values are on (approximately) the same scale as input values.
    - **BS7.4a** The implications of any assumptions on scales on input objects should be explicitly tested in this context; for example that the scales of inputs which do not have means of zero will not be able to be recovered.

## 5.4 Regression and Supervised Learning

Click on the following link to view a demonstration Application of Regression and Supervised Learning Standards.

This document details standards for Regression and Supervised Learning Software – referred to from here on for simplicity as “Regression Software”. Regression Software implements algorithms which aim to construct or analyse one or more mappings between two defined data sets (for example, a set of “independent” data, \(X\), and a set of “dependent” data, \(Y\)). In contrast, the analogous category of Unsupervised Learning Software aims to construct or analyse one or more mappings between a defined set of input or independent data, and a second set of “output” data which are not necessarily known or given prior to the analysis.

Common purposes of Regression Software are to fit models to estimate relationships or to make predictions between specified inputs and outputs. Regression Software includes tools with inferential or predictive foci, Bayesian, frequentist, or probability-free Machine Learning (ML) approaches, parametric or non-parametric approaches, discrete outputs (such as in classification tasks) or continuous outputs, and models and algorithms specific to applications or data such as time series or spatial data. In many cases other standards specific to these subcategories may apply.

The following standards are divided among several sub-categories, with each standard prefixed with “RE”.

### 5.4.1 Input data structures and validation

- **RE1.0** *Regression Software should enable models to be specified via a formula interface, unless reasons for not doing so are explicitly documented.*
- **RE1.1** *Regression Software should document how formula interfaces are converted to matrix representations of input data. See Max Kuhn’s RStudio blog post for examples.*
- **RE1.2** *Regression Software should document expected format (types or classes) for inputting predictor variables, including descriptions of types or classes which are not accepted; for example, specification that software accepts only numeric inputs in* `vector` *or* `matrix` *form, or that all inputs must be in* `data.frame` *form with both column and row names.*
- **RE1.3** *Regression Software should transfer all relevant aspects of input data, notably including row and column names, and potentially information from other* `attributes()`*, to corresponding aspects of return objects (see RE4, below).*
    - **RE1.3a** *Where otherwise relevant information is not transferred, this should be explicitly documented.*
- **RE1.4** *Regression Software should document any assumptions made with regard to input data; for example distributional assumptions, or assumptions that predictor data have mean values of zero. Implications of violations of these assumptions should be both documented and tested.*
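The conversion from a formula interface to a matrix representation (RE1.1) can be illustrated with the `model.frame()` and `model.matrix()` functions of the `stats` package. The data in this sketch are invented for illustration only.

```r
dat <- data.frame (
    y = c (1.2, 2.3, 3.1, 4.8, 5.1, 6.4),
    x = 1:6,
    group = factor (c ("a", "b", "c", "a", "b", "c"))
)
# the model frame holds the variables named in the formula
mf <- stats::model.frame (y ~ x + group, data = dat)
# the design matrix adds an intercept and treatment contrasts for 'group'
mm <- stats::model.matrix (y ~ x + group, data = mf)
colnames (mm)
```

Documentation fulfilling RE1.1 might show such a design matrix directly, so that users can see how factors, interactions, and intercept terms are encoded.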

### 5.4.2 Pre-processing and Variable Transformation

- **RE2.0** *Regression Software should document any transformations applied to input data, for example conversion of label-values to* `factor`*, and should provide ways to explicitly avoid any default transformations (with error or warning conditions where appropriate).*
- **RE2.1** *Regression Software should implement explicit parameters controlling the processing of missing values, ideally distinguishing* `NA` *or* `NaN` *values from* `Inf` *values (for example, through use of* `na.omit()` *and related functions from the* `stats` *package).*
- **RE2.2** *Regression Software should provide different options for processing missing values in predictor and response data. For example, it should be possible to fit a model with no missing predictor data in order to generate values for all associated response points, even where submitted response values may be missing.*
- **RE2.3** *Where applicable, Regression Software should enable data to be centred (for example, through converting to zero-mean equivalent values; or to z-scores) or offset (for example, to zero-intercept equivalent values) via additional parameters, with the effects of any such parameters clearly documented and tested.*
- **RE2.4** *Regression Software should implement pre-processing routines to identify whether aspects of input data are perfectly collinear, notably including:*
    - **RE2.4a** *Perfect collinearity among predictor variables*
    - **RE2.4b** *Perfect collinearity between independent and dependent variables*

These pre-processing routines should also be tested as described below.
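The handling of missing values described in RE2.1 and RE2.2 can be illustrated with the `na.action` argument of `stats::lm()`. The data in this sketch are invented, with missing values in the response only.

```r
noise <- c (0.1, -0.2, 0.3, -0.1, 0.2, -0.3, 0.1, -0.1, 0.2, -0.2)
dat <- data.frame (x = 1:10, y = 2 * (1:10) + noise)
dat$y [c (3, 7)] <- NA   # missing response values only

# 'na.omit' drops incomplete rows from fitted values ...
mod_omit <- lm (y ~ x, data = dat, na.action = na.omit)
# ... while 'na.exclude' pads them with NA to match the input rows
mod_excl <- lm (y ~ x, data = dat, na.action = na.exclude)
length (fitted (mod_omit))
length (fitted (mod_excl))

# RE2.2: modelled values for all rows, including those with missing responses
pred <- predict (mod_omit, newdata = dat)
```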

### 5.4.3 Algorithms

The following standards apply to the model fitting algorithms of Regression Software which implements or relies on iterative algorithms which are expected to converge to generate model statistics. Regression Software which implements or relies on iterative convergence algorithms should:

- **RE3.0** *Issue appropriate warnings or other diagnostic messages for models which fail to converge.*
- **RE3.1** *Enable such messages to be optionally suppressed, yet should ensure that the resultant model object nevertheless includes sufficient data to identify lack of convergence.*
- **RE3.2** *Ensure that convergence thresholds have sensible default values, demonstrated through explicit documentation.*
- **RE3.3** *Allow explicit setting of convergence thresholds, unless reasons against doing so are explicitly documented.*
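The `stats::glm()` function offers one illustration of these standards, with convergence controlled through `glm.control()` and recorded in the `converged` element of the returned object. The simulated data here are for illustration only.

```r
set.seed (1)
x <- runif (100)
y <- rpois (100, lambda = exp (1 + 2 * x))

# default convergence control is glm.control (epsilon = 1e-8, maxit = 25)
mod <- glm (y ~ x, family = poisson)
# a single iteration is insufficient; the resultant warning is suppressed here
mod1 <- suppressWarnings (
    glm (y ~ x, family = poisson, control = glm.control (maxit = 1))
)
# the model object nevertheless records the lack of convergence (RE3.1)
c (default = mod$converged, maxit1 = mod1$converged)
```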

### 5.4.4 Return Results

- **RE4.0** *Regression Software should return some form of “model” object, generally through using or modifying existing class structures for model objects (such as* `lm`*,* `glm`*, or model objects from other packages), or creating a new class of model objects.*
- **RE4.1** *Regression Software may enable an ability to generate a model object without actually fitting values. This may be useful for controlling batch processing of computationally intensive fitting algorithms.*

#### 5.4.4.1 Accessor Methods

Regression Software should provide functions to access or extract as much of
the following kinds of model data as possible or practicable. Access should
ideally rely on class-specific methods which extend, or implement otherwise
equivalent versions of, the methods from the `stats` package which are named in
parentheses in each of the following standards.

Model objects should include, or otherwise enable effectively immediate access to, the following descriptors. It is acknowledged that not all regression models can sensibly provide access to these descriptors, yet software should provide access to all those that are applicable.

- **RE4.2** *Model coefficients (via* `coef()` */* `coefficients()`*)*
- **RE4.3** *Confidence intervals on those coefficients (via* `confint()`*)*
- **RE4.4** *The specification of the model, generally as a formula (via* `formula()`*)*
- **RE4.5** *Numbers of observations submitted to model (via* `nobs()`*)*
- **RE4.6** *The variance-covariance matrix of the model parameters (via* `vcov()`*)*
- **RE4.7** *Where appropriate, convergence statistics*
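A sketch of how class-specific accessor methods might be provided for a new model class follows. The class `my_reg` and its constructor are hypothetical, and only three of the accessors are shown; `coef()` works here through its default method, which extracts the `coefficients` element.

```r
# hypothetical constructor for a minimal class wrapping a least-squares fit
my_reg <- function (formula, data) {
    mf <- model.frame (formula, data)
    fit <- lm.fit (model.matrix (formula, mf), model.response (mf))
    structure (
        list (coefficients = fit$coefficients, formula = formula,
              n = nrow (mf)),
        class = "my_reg"
    )
}
# class-specific methods extending generics from the 'stats' package
formula.my_reg <- function (x, ...) x$formula
nobs.my_reg <- function (object, ...) object$n

mod <- my_reg (dist ~ speed, data = cars)
coef (mod)      # RE4.2, via the default method
formula (mod)   # RE4.4
nobs (mod)      # RE4.5
```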

Regression Software *should* provide simple and direct methods to return or
otherwise access the following form of data and metadata, where the latter
includes information on any transformations which may have been applied to the
data prior to submission to modelling routines.

- **RE4.8** *Response variables, and associated “metadata” where applicable.*
- **RE4.9** *Modelled values of response variables.*
- **RE4.10** *Model Residuals, including sufficient documentation to enable interpretation of residuals, and to enable users to submit residuals to their own tests.*
- **RE4.11** *Goodness-of-fit and other statistics associated with model coefficients, such as effect sizes.*
- **RE4.12** *Where appropriate, functions used to transform input data, and associated inverse transform functions.*

Regression software *may* provide simple and direct methods to return or
otherwise access the following:

- **RE4.13** *Predictor variables, and associated “metadata” where applicable.*

#### 5.4.4.2 Prediction, Extrapolation, and Forecasting

Not all regression software is intended to, or can, provide distinct abilities to extrapolate or forecast. Moreover, identifying cases in which a regression model is used to extrapolate or forecast may often be a non-trivial exercise. It may nevertheless be possible, for example when input data used to construct a model are unidimensional, and data on which a prediction is to be based extend beyond the range used to construct the model. Where reasonably unambiguous identification of extrapolation or forecasting using a model is possible, the following standards apply:

- **RE4.14** *Where possible, values should also be provided for extrapolation or forecast* errors.
- **RE4.15** *Sufficient documentation and/or testing should be provided to demonstrate that forecast errors, confidence intervals, or equivalent values increase with forecast horizons.*
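For a unidimensional linear model, these two standards can be illustrated with the prediction intervals returned by `predict()` for objects of class `lm`. The sketch uses the `cars` data from the `datasets` package, in which values of `speed` range from 4 to 25.

```r
mod <- lm (dist ~ speed, data = cars)
# new values extending beyond the observed range of 'speed'
new_data <- data.frame (speed = c (25, 30, 35, 40))
pred <- predict (mod, newdata = new_data, interval = "prediction")
# widths of the prediction intervals at each forecast horizon
widths <- pred [, "upr"] - pred [, "lwr"]
# RE4.15: widths should increase with distance beyond the observed data
all (diff (widths) > 0)
```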

Distinct from extrapolation or forecasting abilities, the following standard applies to regression software which relies on, or otherwise provides abilities to process, categorical grouping variables:

- **RE4.16** *Regression Software which models distinct responses for different categorical groups should include the ability to submit new groups to* `predict()` *methods.*

#### 5.4.4.3 Reporting Return Results

- **RE4.17** *Model objects returned by Regression Software should implement or appropriately extend a default* `print` *method which provides an on-screen summary of model (input) parameters and (output) coefficients.*
- **RE4.18** *Regression Software may also implement* `summary` *methods for model objects, and in particular should implement distinct* `summary` *methods for any cases in which calculation of summary statistics is computationally non-trivial (for example, for bootstrapped estimates of confidence intervals).*

### 5.4.5 Documentation

Beyond the general standards for documentation, Regression Software should explicitly describe the following aspects, and ideally provide extended documentation including summary graphical reports of:

- **RE5.0** *Scaling relationships between sizes of input data (numbers of observations, with potential extension to numbers of variables/columns) and speed of algorithm.*

### 5.4.6 Visualization

- **RE6.0** *Model objects returned by Regression Software (see RE4.0) should have default* `plot` *methods, either through explicit implementation, extension of methods for existing model objects, or through ensuring default methods work appropriately.*
- **RE6.1** *Where the default* `plot` *method is* **NOT** *a generic* `plot` *method dispatched on the class of return objects (that is, through a* `plot.<myclass>` *function), that method dispatch should nevertheless exist in order to explicitly direct users to the appropriate function.*
- **RE6.2** *The default* `plot` *method should produce a plot of the* `fitted` *values of the model, with optional visualisation of confidence intervals or equivalent.*

The following standard applies only to software fulfilling RE4.14-4.15, and the conditions described prior to those standards.

- **RE6.3** *Where a model object is used to generate a forecast (for example, through a* `predict()` *method), the default* `plot` *method should provide clear visual distinction between modelled (interpolated) and forecast (extrapolated) values.*

### 5.4.7 Testing

#### 5.4.7.1 Input Data

Tests for Regression Software should include the following conditions and cases:

- **RE7.0** *Tests with noiseless, exact relationships between predictor (independent) data.*
    - **RE7.0a** In particular, these tests should confirm ability to reject perfectly noiseless input data.
- **RE7.1** *Tests with noiseless, exact relationships between predictor (independent) and response (dependent) data.*
    - **RE7.1a** *In particular, these tests should confirm that model fitting is at least as fast or (preferably) faster than testing with equivalent noisy data (see RE2.4b).*

#### 5.4.7.2 Diagnostic Messages

- **RE7.2** All error and warning messages should be explicitly triggered in tests, including explicit testing for the content of those diagnostic messages.
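Testing the content of diagnostic messages might look like the following sketch, written in base R. The functions `check_response()` and `condition_text()` are hypothetical; packages using `testthat` would typically rely instead on `expect_error()` and `expect_warning()` with a regular expression.

```r
# hypothetical input check whose messages are to be tested
check_response <- function (y) {
    if (!is.numeric (y)) {
        stop ("response 'y' must be numeric")
    }
    if (anyNA (y)) {
        warning ("response 'y' contains missing values")
    }
    invisible (y)
}

# capture the text of the first condition raised by an expression
condition_text <- function (expr) {
    tryCatch (
        {
            expr
            NA_character_   # returned only if no condition is raised
        },
        error = function (e) conditionMessage (e),
        warning = function (w) conditionMessage (w)
    )
}

# each message is explicitly triggered, and its content checked
stopifnot (grepl ("must be numeric", condition_text (check_response ("a"))))
stopifnot (grepl ("missing values", condition_text (check_response (c (1, NA)))))
```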

#### 5.4.7.3 Return Results

Tests for Regression Software should:

- **RE7.3** Demonstrate that output objects retain aspects of input data such as row or case names (see **RE1.3**).
- **RE7.4** Demonstrate and test expected behaviour when objects returned from regression software are submitted to the accessor methods of **RE4.2**–**RE4.7**.
- **RE7.5** Extending directly from **RE4.15**, where appropriate, tests should demonstrate and confirm that forecast errors, confidence intervals, or equivalent values increase with forecast horizons.

## 5.5 Dimensionality Reduction, Clustering, and Unsupervised Learning

Click on the following link to view a demonstration Application of Dimensionality Reduction, Clustering, and Unsupervised Learning Standards.

This document details standards for Dimensionality Reduction, Clustering, and Unsupervised Learning Software – referred to from here on for simplicity as “Unsupervised Learning Software”. Software in this category is distinguished from Regression Software through the latter aiming to construct or analyse one or more mappings between two defined data sets (for example, a set of “independent” data, \(X\), and a set of “dependent” data, \(Y\)), whereas Unsupervised Learning Software aims to construct or analyse one or more mappings between a defined set of input or independent data, and a second set of “output” data which are not necessarily known or given prior to the analysis. A key distinction in Unsupervised Learning Software and Algorithms is between that for which output data represent (generally numerical) transformations of the input data set, and that for which output data are discrete labels applied to the input data. Examples of the former type include dimensionality reduction and ordination software and algorithms, and examples of the latter include clustering and discrete partitioning software and algorithms.

### 5.5.1 Input Data Structures and Validation

- **UL1.0** *Unsupervised Learning Software should explicitly document expected format (types or classes) for input data, including descriptions of types or classes which are not accepted; for example, specification that software accepts only numeric inputs in* `vector` *or* `matrix` *form, or that all inputs must be in* `data.frame` *form with both column and row names.*
- **UL1.1** *Unsupervised Learning Software should provide distinct sub-routines to assert that all input data is of the expected form, and issue informative error messages when incompatible data are submitted.*

The following code demonstrates an example of a routine from the base `stats`
package which fails to meet this standard.

```
d <- dist (USArrests) # example from help file for 'hclust' function
hc <- hclust (d) # okay
hc <- hclust (as.matrix (d))
```

`## Error in if (is.na(n) || n > 65536L) stop("size cannot be NA nor exceed 65536"): missing value where TRUE/FALSE needed`

The latter fails, yet issues an uninformative error message that clearly indicates a failure to provide sufficient checks on the class of input data.
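An assertion of the kind required by UL1.1 might be sketched as follows. The wrapper `safe_hclust()` is hypothetical, and simply checks the class of its input before passing it on to `stats::hclust()`.

```r
# hypothetical wrapper asserting the class of input before clustering
safe_hclust <- function (d, ...) {
    if (!inherits (d, "dist")) {
        stop ("'d' must be a 'dist' object as produced by dist(), not an ",
              "object of class '", class (d) [1], "'")
    }
    stats::hclust (d, ...)
}

d <- dist (USArrests)
hc <- safe_hclust (d)
# submitting a matrix now generates an informative error message
msg <- tryCatch (safe_hclust (as.matrix (d)),
                 error = function (e) conditionMessage (e))
msg
```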

- **UL1.2** *Unsupervised learning which uses row or column names to label output objects should assert that input data have non-default row or column names, and issue an informative message when these are not provided. (Such messages need not necessarily be provided by default, but should at least be optionally available.)*

The following code provides simple examples of checks for whether row and column names appear to have generic default values.

```
## X1 X2
## 1 1 6
## 2 2 7
## 3 3 8
## 4 4 9
## 5 5 10
```

Generic row names are almost always simple integer sequences, which the following condition confirms.

`## [1] TRUE`

Generic column names may come in a variety of formats. The following code uses
a `grep` expression to match an alphabetic character plus an optional leading
zero followed by a generic sequence of column numbers, appropriate for matching
column names produced by generic construction of `data.frame` objects.

```
all (vapply (seq (ncol (x)), function (i)
             grepl (paste0 ("[[:alpha:]]0?", i), colnames (x) [i]),
             logical (1)))
```

`## [1] TRUE`

Messages should be issued in both of these cases. The following code
illustrates that the `hclust` function does not implement any such checks or
assertions; rather, it silently returns an object with default labels.

`## [1] "1" "2" "3" "4" "5" "6"`
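The two checks for default names can be combined in a single routine, as in the following sketch. The function `has_default_names()` is hypothetical, the construction of `x` is an assumption consistent with the printed output shown earlier, and the regular expression is anchored here so that only complete column names are matched.

```r
# hypothetical check for default row and column names
has_default_names <- function (x) {
    # default row names are a simple integer sequence
    rows <- identical (rownames (x), as.character (seq_len (nrow (x))))
    # default column names: a letter, an optional zero, then the column number
    cols <- all (vapply (seq_len (ncol (x)), function (i) {
        grepl (paste0 ("^[[:alpha:]]0?", i, "$"), colnames (x) [i])
    }, logical (1)))
    c (rows = rows, cols = cols)
}

x <- data.frame (matrix (1:10, ncol = 2))   # assumed construction
has_default_names (x)
has_default_names (USArrests)
```

A message could then be issued whenever either element of the result is `TRUE`.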

- **UL1.3** *Unsupervised Learning Software should transfer all relevant aspects of input data, notably including row and column names, and potentially information from other* `attributes()`*, to corresponding aspects of return objects.*
    - **UL1.3a** *Where otherwise relevant information is not transferred, this should be explicitly documented.*

An example of a function according with UL1.3 is `stats::cutree()`:

```
## Alabama Alaska Arizona Arkansas California Colorado
## 1 2 3 4 5 4
```

The row names of `USArrests` are transferred to the output object. In contrast,
some routines from the `cluster` package do not comply with this standard:

`## [1] 1 2 3 4 3 4`

The case labels are not appropriately carried through to the object returned by
`agnes()` to enable them to be transferred within `cutree()`.
(The labels are transferred to the object returned by `agnes`, just not in
a way that enables `cutree` to inherit them.)
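A test of UL1.3 for the `stats` routines might be sketched as follows, confirming that the case labels of the input data are carried through to the final result. The choice of `k = 10` is arbitrary.

```r
hc <- hclust (dist (USArrests))
cl <- cutree (hc, k = 10)
# case labels of the input data are carried through to the result
head (names (cl))
labels_retained <- identical (names (cl), rownames (USArrests))
```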

- **UL1.4** *Unsupervised Learning Software should explicitly document whether input data may include missing values.*
- **UL1.5** *Functions in Unsupervised Learning Software which do not admit input data with missing values should provide informative error messages when data with missing values are submitted.*
- **UL1.6** *Unsupervised Learning Software should document any assumptions made with regard to input data; for example assumptions about distributional forms or locations (such as that data are centred or on approximately equivalent distributional scales). Implications of violations of these assumptions should be both documented and tested, in particular:*
    - **UL1.6a** *Software which responds qualitatively differently to input data which has components on markedly different scales should explicitly document such differences, and implications of submitting such data.*
    - **UL1.6b** *Examples or other documentation should not use* `scale()` *or equivalent transformations without explaining why scale is applied, and explicitly illustrating and contrasting the consequences of not applying such transformations.*

### 5.5.2 Pre-processing and Variable Transformation

**UL2.0** *Routines likely to give unreliable or irreproducible results in response to violations of assumptions regarding input data (see UL1.6) should implement pre-processing steps to diagnose potential violations, and issue appropriately informative messages, and/or include parameters to enable suitable transformations to be applied (such as the* `center` *and* `scale.` *parameters of the* `stats::prcomp()` *function).*

**UL2.1** *Unsupervised Learning Software should document any transformations applied to input data, for example conversion of label-values to* `factor`*, and should provide ways to explicitly avoid any default transformations (with error or warning conditions where appropriate).*

**UL2.2** *For Unsupervised Learning Software which accepts missing values in input data, functions should implement explicit parameters controlling the processing of missing values, ideally distinguishing* `NA` *or* `NaN` *values from* `Inf` *values (for example, through use of* `na.omit()` *and related functions from the* `stats` *package).*

**UL2.3** *Unsupervised Learning Software should implement pre-processing routines to identify whether aspects of input data are perfectly collinear.*
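The following is a minimal sketch of the kind of pre-processing check described in UL2.2 and UL2.3. It counts missing (`NA` or `NaN`) values separately from infinite values, and uses the rank of a QR decomposition of the centred data to flag perfectly collinear columns. The function name `check_input()`, its return structure, and the default tolerance are all assumptions made for illustration, not part of any existing package.

```r
# Hypothetical pre-processing check; names and defaults are illustrative only
check_input <- function (x, tol = 1e-7) {
    x <- as.matrix (x)
    stopifnot (is.numeric (x))
    n_missing <- sum (is.na (x)) # is.na() is TRUE for both NA and NaN
    n_infinite <- sum (is.infinite (x)) # TRUE for Inf and -Inf only
    collinear <- NA # left undetermined if data are incomplete
    if (n_missing == 0 && n_infinite == 0) {
        # centre columns so that relations like y = a + b * x are detected
        xc <- sweep (x, 2, colMeans (x))
        collinear <- qr (xc, tol = tol)$rank < ncol (xc)
    }
    list (n_missing = n_missing, n_infinite = n_infinite, collinear = collinear)
}

check_input (USArrests)$collinear # no perfectly collinear columns
# add a column which is an exact linear function of an existing column:
x2 <- cbind (USArrests, Murder2 = 2 * USArrests$Murder + 1)
check_input (x2)$collinear
```

A routine of this kind would typically be called once at the start of each exported function, with the results used to issue the messages or errors described in UL1.5 and UL2.0.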

### 5.5.3 Algorithms

#### 5.5.3.1 Labelling

**UL3.1***Algorithms which apply sequential labels to input data (such as clustering or partitioning algorithms) should ensure that the sequence follows decreasing group sizes (so labels of “1”, “a”, or “A” describe the largest group, “2”, “b”, or “B” the second largest, and so on.)*

Note that the `stats::cutree()` function does not accord with this standard:

```
##
## 1 2 3 4 5 6 7 8 9 10
## 3 3 3 6 5 10 2 5 5 8
```

The `cutree()` function applies arbitrary integer labels to the groups, and the
order of those labels is not related to the order of group sizes.
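Labels such as those returned by `cutree()` can be re-mapped to accord with UL3.1. The following sketch assumes a hierarchical clustering of the `USArrests` data cut into ten groups; the variable names are illustrative only.

```r
hc <- hclust (dist (USArrests))
groups <- cutree (hc, k = 10)

# Old labels arranged in order of decreasing group size:
size_order <- order (tabulate (groups), decreasing = TRUE)
# New label for each observation is the position of its old label in that order:
relabelled <- match (groups, size_order)
names (relabelled) <- names (groups)

table (relabelled) # group sizes now decrease with increasing label
```

Group memberships are unchanged by this step; only the labels are permuted, so that label 1 denotes the largest group.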

**UL3.2** *Dimensionality reduction or equivalent algorithms which label dimensions should ensure that sequences of labels follow decreasing “importance” (for example, eigenvalues or variance contributions).*

The `stats::prcomp` function accords with this standard:

```
## Importance of first k=5 (out of 21) components:
## PC1 PC2 PC3 PC4 PC5
## Standard deviation 2529.6298 2157.3434 1459.4839 551.68183 369.10901
## Proportion of Variance 0.4591 0.3339 0.1528 0.02184 0.00977
## Cumulative Proportion 0.4591 0.7930 0.9458 0.96764 0.97741
```

The proportion of variance explained by each component decreases with increasing numeric labelling of the components.

**UL3.3** *Unsupervised Learning Software for which input data does not generally include labels (such as* `array`*-like data with no row names) should provide an additional parameter to enable cases to be labelled.*

#### 5.5.3.2 Prediction

**UL3.4***Where applicable, Unsupervised Learning Software should implement routines to predict the properties (such as numerical ordinates, or cluster memberships) of additional new data without re-running the entire algorithm.*

While many algorithms such as hierarchical clustering cannot (readily) be used
to predict memberships of new data, other algorithms can nevertheless be
applied to perform this task. The following demonstrates how the output of
`stats::hclust` can be used to predict membership of new data using the
`class::knn()` function. (This is intended to illustrate only one of many
possible approaches.)

```
library (class) # provides knn()
set.seed (1)
hc <- hclust (dist (iris [, -5]))
groups <- cutree (hc, k = 3)
# function to randomly select part of a data.frame and
# add some randomness
sample_df <- function (x, n = 5) {
    x [sample (nrow (x), size = n), ] + runif (ncol (x) * n)
}
iris_new <- sample_df (iris [, -5], n = 5)
# use knn to predict membership of those new points:
knnClust <- knn (train = iris [, -5], test = iris_new, k = 1, cl = groups)
knnClust
```

```
## [1] 2 2 1 1 2
## Levels: 1 2 3
```

The `stats::prcomp()` function implements its own `predict()` method which
conforms to this standard:

```
res <- prcomp (USArrests)
arrests_new <- sample_df (USArrests, n = 5)
predict (res, newdata = arrests_new)
```

```
## PC1 PC2 PC3 PC4
## North Carolina 165.17494 -30.693263 -11.682811 1.304563
## Maryland 129.44401 -4.132644 -2.161693 1.258237
## Ohio -49.51994 12.748248 2.104966 -2.777463
## Colorado 35.78896 14.023774 12.869816 1.233391
## Georgia 41.28054 -7.203986 3.987152 -7.818416
```

#### 5.5.3.3 Group Distributions and Associated Statistics

Many unsupervised learning algorithms serve to label, categorise, or partition data. Software which performs any of these tasks will commonly output some kind of labelling or grouping scheme. The above example of principal components illustrates that the return object records the standard deviations associated with each component:

```
## Standard deviations (1, .., p=4):
## [1] 83.732400 14.212402 6.489426 2.482790
##
## Rotation (n x k) = (4 x 4):
## PC1 PC2 PC3 PC4
## Murder 0.04170432 -0.04482166 0.07989066 -0.99492173
## Assault 0.99522128 -0.05876003 -0.06756974 0.03893830
## UrbanPop 0.04633575 0.97685748 -0.20054629 -0.05816914
## Rape 0.07515550 0.20071807 0.97408059 0.07232502
```

```
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 83.7324 14.21240 6.4894 2.48279
## Proportion of Variance 0.9655 0.02782 0.0058 0.00085
## Cumulative Proportion 0.9655 0.99335 0.9991 1.00000
```

Such output accords with the following standard:

**UL3.5** *Objects returned from Unsupervised Learning Software which labels, categorises, or partitions data into discrete groups should include, or provide immediate access to, quantitative information on intra-group variances or equivalent, as well as on inter-group relationships where applicable.*

The above example of principal components is one where there are no inter-group
relationships, and so that standard is fulfilled by providing information on
intra-group variances alone. Discrete clustering algorithms, in contrast, yield
results for which inter-group relationships are meaningful, and such
relationships can generally be meaningfully provided. The `hclust()` routine,
like many clustering routines, simply returns a *scheme* for devising an
arbitrary number of clusters, and so cannot meaningfully provide variances or
relationships between such. The `cutree()` function, however, does yield
defined numbers of clusters, yet devoid of any quantitative information on
variances or equivalent.

```
## Named int [1:50] 1 1 1 2 1 2 3 1 4 2 ...
## - attr(*, "names")= chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
```

Compare that with the output of a largely equivalent routine, the `clara()`
function from the `cluster` package.

```
library (cluster)
cl <- clara (USArrests, k = 10) # direct clustering into specified number of clusters
cl$clusinfo
```

```
## size max_diss av_diss isolation
## [1,] 4 24.708298 14.284874 1.4837745
## [2,] 6 28.857755 16.759943 1.7329563
## [3,] 6 44.640565 23.718040 0.9677229
## [4,] 6 28.005892 17.382196 0.8442061
## [5,] 6 15.901258 9.363471 1.1037219
## [6,] 7 29.407822 14.817031 0.9080598
## [7,] 4 11.764353 6.781659 0.8165753
## [8,] 3 8.766984 5.768183 0.3547323
## [9,] 3 18.848077 10.101505 0.7176276
## [10,] 5 16.477257 8.468541 0.6273603
```

That object contains information on dissimilarities between each observation and cluster medoids, which in the context of UL3.5 is “information on intra-group variances or equivalent”. Moreover, inter-group information is also available as the “silhouette” of the clustering scheme.
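As a brief sketch of how both kinds of information might be extracted, the following uses the `clusinfo` and `silinfo` components of a `clara` object. The choice of `k = 3` is arbitrary and made only to keep the output short; note that `clara()` computes silhouette information on its best sub-sample of the data.

```r
library (cluster)
cl <- clara (USArrests, k = 3)

# Intra-group information: sizes and dissimilarities to cluster medoids
cl$clusinfo

# Inter-group information: average silhouette width for each cluster,
# and for the clustering scheme overall
cl$silinfo$clus.avg.widths
cl$silinfo$avg.width
```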

### 5.5.4 Return Results

**UL4.0** *Unsupervised Learning Software should return some form of “model” object, generally through using or modifying existing class structures for model objects, or creating a new class of model objects.*

**UL4.1** *Unsupervised Learning Software may enable an ability to generate a model object without actually fitting values. This may be useful for controlling batch processing of computationally intensive fitting algorithms.*

**UL4.2** *The return object from Unsupervised Learning Software should include, or otherwise enable immediate extraction of, all parameters used to control the algorithm used.*

#### 5.5.4.1 Reporting Return Results

**UL4.2** *Model objects returned by Unsupervised Learning Software should implement or appropriately extend a default* `print` *method which provides an on-screen summary of model (input) parameters and methods used to generate results. The* `print` *method may also summarise statistical aspects of the output data or results.*

- **UL4.2a** *The default* `print` *method should always ensure only a restricted number of rows of any result matrices or equivalent are printed to the screen.*

The `prcomp` objects returned from the function of the same name include
potentially large matrices of component coordinates which are by default
printed in their entirety to the screen. This is because the default print
behaviour for most tabular objects in R (`matrix`, `data.frame`, and objects
from the `Matrix` package, for example) is to print objects in their entirety
(limited only by such options as `getOption("max.print")`, which determines
maximal numbers of printed objects, such as lines of `data.frame` objects).
Such default behaviour ought to be avoided, particularly in Unsupervised
Learning Software which commonly returns objects containing large numbers of
numeric entries.
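One way to restrict printed output is sketched below. The class name `ul_model`, the constructor `new_ul_model()`, and the parameter `n` are all hypothetical, and serve only to illustrate a `print` method which summarises input parameters and prints a restricted number of rows.

```r
# Hypothetical constructor for a model object holding a matrix of scores
# together with the parameters used to generate it
new_ul_model <- function (scores, params) {
    structure (list (scores = scores, params = params), class = "ul_model")
}

# print method which shows parameters, dimensions, and only the first n rows
print.ul_model <- function (x, n = 6L, ...) {
    n_show <- min (n, nrow (x$scores))
    cat ("<ul_model>\n")
    cat ("Parameters: ", paste (names (x$params), unlist (x$params),
        sep = " = ", collapse = ", "), "\n", sep = "")
    cat ("Scores: ", nrow (x$scores), " rows x ", ncol (x$scores),
        " columns (first ", n_show, " shown)\n", sep = "")
    print (x$scores [seq_len (n_show), , drop = FALSE], ...)
    if (n_show < nrow (x$scores)) {
        cat ("# ... with ", nrow (x$scores) - n_show, " more rows\n", sep = "")
    }
    invisible (x)
}

pca <- prcomp (USArrests)
m <- new_ul_model (pca$x, list (center = TRUE, scale. = FALSE))
print (m)
```

The full matrix remains available as `m$scores`, so restricting the printed rows loses no information.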

**UL4.3** *Unsupervised Learning Software should also implement* `summary` *methods for model objects which should summarise the primary statistics used in generating the model (such as numbers of observations, parameters of methods applied). The* `summary` *method may also provide summary statistics from the resultant model.*

### 5.5.5 Documentation

### 5.5.6 Visualization

**UL6.0** *Objects returned by Unsupervised Learning Software should have default* `plot` *methods, either through explicit implementation, extension of methods for existing model objects, through ensuring default methods work appropriately, or through explicit reference to helper packages such as* `factoextra` *and associated functions.*

**UL6.1** *Where the default* `plot` *method is* **NOT** *a generic* `plot` *method dispatched on the class of return objects (that is, through a* `plot.<myclass>` *function), that method dispatch should nevertheless exist in order to explicitly direct users to the appropriate function.*

**UL6.2** *Where default plot methods include labelling components of return objects (such as cluster labels), routines should ensure that labels are automatically placed to ensure readability, and/or that appropriate diagnostic messages are issued where readability is likely to be compromised (for example, through attempting to place too many labels).*

### 5.5.7 Testing

Unsupervised Learning Software should test the following properties and behaviours:

**UL7.0***Inappropriate types of input data are rejected with expected error messages.*

#### 5.5.7.1 Input Scaling

The following tests should be implemented for Unsupervised Learning Software for which inputs are presumed or required to be scaled in any particular ways (such as having mean values of zero).

**UL7.1***Tests should demonstrate that violations of assumed input properties yield unreliable or invalid outputs, and should clarify how such unreliability or invalidity is manifest through the properties of returned objects.*
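A test of this kind might be sketched as follows, using `stats::prcomp()` applied to the `USArrests` data with and without scaling. The thresholds used in the assertions are assumptions chosen for this particular data set, and base `stopifnot()` is used in place of a testing framework only to keep the example self-contained.

```r
pca_raw <- prcomp (USArrests) # variables on markedly different scales
pca_scaled <- prcomp (USArrests, scale. = TRUE)

# proportion of variance explained by each component
prop_var <- function (p) p$sdev^2 / sum (p$sdev^2)

load_raw <- abs (pca_raw$rotation [, "PC1"])
load_scaled <- abs (pca_scaled$rotation [, "PC1"])

stopifnot (
    # Without scaling, the first component is dominated by the single
    # variable with the largest variance:
    prop_var (pca_raw) [1] > 0.9,
    load_raw ["Assault"] > 0.9,
    # With scaling, variance and loadings are spread more evenly:
    prop_var (pca_scaled) [1] < prop_var (pca_raw) [1],
    max (load_scaled) < 0.7
)
```

The manifestation of unscaled input is here visible directly in the returned object, through the `rotation` and `sdev` components.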

#### 5.5.7.2 Output Labelling

With regard to labelling of output data, tests for Unsupervised Learning Software should:

**UL7.2** *Demonstrate that labels placed on output data follow decreasing group sizes* (**UL3.1**)

**UL7.3** *Demonstrate that labels on input data are propagated to, or may be recovered from, output data* (see **UL3.3**).

#### 5.5.7.3 Prediction

With regard to prediction, tests for Unsupervised Learning Software should:

**UL7.4***Demonstrate that submission of new data to a previously fitted model can generate results more efficiently than initial model fitting.*

#### 5.5.7.4 Batch Processing

For Unsupervised Learning Software which implements batch processing routines:

**UL7.5** *Batch processing routines should be explicitly tested, commonly via extended tests* (see **G4.10**–**G4.12**).

- **UL7.5a** *Tests of batch processing routines should demonstrate that equivalent results are obtained from direct (non-batch) processing.*

## 5.6 Exploratory Data Analysis

A demonstration application of these Exploratory Data Analysis Standards is also available.

Exploration is a part of all data analyses, and Exploratory Data Analysis (EDA)
is not something that is entered into and exited from at some point prior to
“real” analysis. Exploratory Analyses are also not strictly limited to *Data*,
but may extend to exploration of *Models* of those data. The category could
thus equally be termed, “*Exploratory Data and Model Analysis*”, yet we opt to
utilise the standard acronym of EDA in this document.

EDA is nevertheless somewhat different to many other categories included within rOpenSci’s program for peer-reviewing statistical software. Primary differences include:

- EDA software often has a strong focus upon visualization, which is a category which we have otherwise explicitly excluded from the scope of the project at the present stage.
- The assessment of EDA software requires addressing more general questions than software in most other categories, notably including the important question of intended audience(s).

The following standards are accordingly somewhat differently structured than equivalent standards developed to date for other categories, particularly through being more qualitative and abstract. In particular, while documentation is an important component of standards for all categories, clear and instructive documentation is of paramount importance for EDA Software, and so warrants its own sub-section within this document.

### 5.6.1 Documentation Standards

The following refer to *Primary Documentation*, meaning the main package
`README` or vignette(s), and *Secondary Documentation*, meaning function-level
documentation.

The *Primary Documentation* (`README` and/or vignette(s)) of EDA software
should:

**EA1.0** *Identify one or more target audiences for whom the software is intended*

**EA1.1** *Identify the kinds of data the software is capable of analysing (see* Kinds of Data *below).*

**EA1.2** *Identify the kinds of questions the software is intended to help explore; for example, are these questions:*

- *inferential?*
- *predictive?*
- *associative?*
- *causal?*
- *(or other modes of statistical enquiry?)*

The *Secondary Documentation* (within individual functions) of EDA software
should:

**EA1.3***Identify the kinds of data each function is intended to accept as input*

### 5.6.2 Input Data

A further primary difference of EDA software from that of our other categories is that input data for statistical software may be generally presumed of one or more specific types, whereas EDA software often accepts data of more general and varied types. EDA software should aim to accept and appropriately transform as many diverse kinds of input data as possible, through addressing the following standards, considered in terms of the two cases of input data in uni- and multi-variate form. All of the general standards for kinds of input (G2.0 - G2.7) apply to input data for EDA Software.

#### 5.6.2.1 Index Columns

The following standards refer to an *index column*, which is understood to
imply an explicitly named or identified column which can be used to provide a
unique index into any and all rows of that table. Index columns ensure
the universal applicability of standard table join operations, such as those
implemented via the `dplyr` package.

**EA2.0** *EDA Software which accepts standard tabular data and implements or relies upon extensive table filter and join operations should utilise an* **index column** *system*

**EA2.1** *All values in an index column must be unique, and this uniqueness should be affirmed as a pre-processing step for all input data.*

**EA2.2** *Index columns should be explicitly identified, either:*

- **EA2.2a** *by using an appropriate class system, or*
- **EA2.2b** *through setting an* `attribute` *on a table,* `x`*, of* `attr(x, "index") <- <index_col_name>`*.*

For EDA software which either implements custom classes or explicitly sets attributes specifying index columns, these attributes should be used as the basis of all table join operations, and in particular:

**EA2.3***Table join operations should not be based on any assumed variable or column names*
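The attribute-based approach of EA2.2b, together with EA2.1 and EA2.3, might be sketched in base R as follows. The helper functions `set_index()` and `join_on_index()` are hypothetical names introduced only for this illustration; the join itself uses `merge()`, although equivalent `dplyr` joins could be substituted.

```r
# Hypothetical helper: affirm uniqueness and record the index column
set_index <- function (x, col) {
    stopifnot (is.data.frame (x), col %in% names (x))
    if (anyDuplicated (x [[col]]) > 0) {
        stop ("index column '", col, "' contains duplicated values")
    }
    attr (x, "index") <- col
    x
}

# Hypothetical helper: join two tables using only their index attributes,
# with no assumptions made about column names
join_on_index <- function (x, y) {
    ix <- attr (x, "index")
    iy <- attr (y, "index")
    if (is.null (ix) || is.null (iy)) {
        stop ("both tables must have an 'index' attribute")
    }
    merge (x, y, by.x = ix, by.y = iy)
}

a <- set_index (data.frame (id = 1:3, height = c (1.6, 1.7, 1.8)), "id")
b <- set_index (data.frame (key = c (3L, 1L, 2L), weight = c (80, 60, 70)), "key")
joined <- join_on_index (a, b)
joined
```

Note that attributes set in this way may be dropped by some table operations, which is one reason a class system (EA2.2a) can be more robust.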

#### 5.6.2.2 Multi-tabular input

EDA software designed to accept multi-tabular input should:

**EA2.4** *Use and demand an explicit class system for such input (for example, via the* `dm` *package).*

**EA2.5** *Ensure all individual tables follow the above standards for Index Columns*

#### 5.6.2.3 Classes and Sub-Classes

*Classes* are understood here to be the classes which define single input
objects, while *Sub-Classes* refer to the class definitions of components of
input objects (for example, of columns of an input `data.frame`). EDA software
which is intended to receive input in general vector formats (see *Uni-variate
Input* section of *General Standards*) should ensure:

**EA2.6** *Routines appropriately process vector input of custom classes, including those which do not inherit from the* `vector` *class*

**EA2.7** *Routines should appropriately process vector data regardless of additional attributes*

The following code illustrates some ways by which “metadata” defining classes and additional attributes associated with a standard vector object may be modified.

```
x <- 1:10
class (x) <- "notvector"
attr (x, "extra_attribute") <- "another attribute"
attr (x, "vector attribute") <- runif (5)
attributes (x)
#> $class
#> [1] "notvector"
#>
#> $extra_attribute
#> [1] "another attribute"
#>
#> $`vector attribute`
#> [1] 0.03521663 0.49418081 0.60129563 0.75804346 0.16073301
```

All statistical software should appropriately deal with such input
data, as exemplified by the `storage.mode()`, `length()`, and `sum()`
functions of the `base` package, which return the appropriate values
regardless of redefinition of class or additional attributes.

```
storage.mode (x)
#> [1] "integer"
length (x)
#> [1] 10
sum (x)
#> [1] 55
storage.mode (sum (x))
#> [1] "integer"
```

Tabular inputs in `data.frame` class may contain columns which are
themselves defined by custom classes, and which possess additional attributes.
EDA Software which accepts tabular inputs should accordingly ensure:

**EA2.8** *EDA routines appropriately process tabular input of custom classes, ideally by means of a single pre-processing routine which converts tabular input to some standard form subsequently passed to all analytic routines.*

**EA2.9** *EDA routines accept and appropriately process tabular input in which individual columns may be of custom sub-classes including additional attributes.*

### 5.6.3 Analytic Algorithms

(There are no specific standards for analytic algorithms in EDA Software.)

### 5.6.4 Return Results / Output Data

**EA4.0** *EDA Software should ensure all return results have types which are consistent with input types. For example,* `sum`*,* `min`*, or* `max` *values applied to* `integer`*-type vectors should return* `integer` *values, while* `mean` *or* `var` *will generally return* `numeric` *types.*

**EA4.1** *EDA Software should implement parameters to enable explicit control of numeric precision*

**EA4.2** *The primary routines of EDA Software should return objects for which default* `print` *and* `plot` *methods give sensible results. Default* `summary` *methods may also be implemented.*

### 5.6.5 Visualization and Summary Output

Visualization commonly represents one of the primary functions of EDA Software,
and thus visualization output is given greater consideration in this category
than in other categories in which visualization may nevertheless play an
important role. In particular, one component of this sub-category is *Summary
Output*, taken to refer to all forms of screen-based output beyond conventional
graphical output, including tabular and other text-based forms. Standards for
visualization itself are considered in the two primary sub-categories of static
and dynamic visualization, where the latter includes interactive visualization.

Prior to these individual sub-categories, we consider a few standards applicable to visualization in general, whether static or dynamic.

**EA5.0** *Graphical presentation in EDA software should be as accessible as possible or practicable. In particular, EDA software should consider accessibility in terms of:*

- **EA5.0a** *Typeface sizes should default to sizes which explicitly enhance accessibility*
- **EA5.0b** *Default colour schemes should be carefully constructed to ensure accessibility.*

**EA5.1***Any explicit specifications of typefaces which override default values should consider accessibility*

#### 5.6.5.1 Summary and Screen-based Output

**EA5.2** *Screen-based output should never rely on default print formatting of* `numeric` *types, but should instead use some version of* `round(., digits)`*,* `formatC`*,* `sprintf`*, or similar functions for numeric formatting according to the parameter described in EA4.1.*

**EA5.3** *Column-based summary statistics should always indicate the* `storage.mode`*,* `class`*, or equivalent defining attribute of each column (as, for example, implemented in the default* `print.tibble` *method).*
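The following sketch illustrates one way in which EA5.2 and EA5.3 might be combined: a column-wise summary which reports the class and storage mode of each column, and which formats numeric values through an explicit `digits` parameter. The function `summarise_columns()` and its output layout are hypothetical, and intended only as an illustration.

```r
# Hypothetical column summary with explicit control of numeric precision
summarise_columns <- function (x, digits = 3L) {
    stopifnot (is.data.frame (x))
    # means of numeric columns only; NA for all other columns
    means <- vapply (x, function (col) {
        if (is.numeric (col)) mean (col, na.rm = TRUE) else NA_real_
    }, numeric (1))
    data.frame (
        column = names (x),
        class = unname (vapply (x, function (col) class (col) [1], character (1))),
        storage_mode = unname (vapply (x, storage.mode, character (1))),
        # explicit formatting, not default print formatting of numeric types
        mean = unname (ifelse (is.na (means), NA_character_,
            formatC (means, digits = digits, format = "f"))),
        stringsAsFactors = FALSE
    )
}

summarise_columns (iris, digits = 2L)
```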

#### 5.6.5.2 General Standards for Visualization (Static and Dynamic)

**EA5.4** *All visualisations should include units on all axes, with sensibly rounded values (for example, as produced by the* `pretty()` *function).*

#### 5.6.5.3 Dynamic Visualization

Dynamic visualization routines are commonly implemented as interfaces to
`javascript` routines. Unless routines have been explicitly developed as an
internal part of an R package, standards shall not be considered to apply to
the code itself, rather only to decisions present as user-controlled parameters
exposed within the R environment. That said, one standard may nevertheless be
applied, with an aim to minimise unnecessary duplication of bundled libraries:

**EA5.5***Any packages which internally bundle libraries used for dynamic visualization and which are also bundled in other, pre-existing R packages, should explain the necessity and advantage of re-bundling that library.*

### 5.6.6 Testing

#### 5.6.6.1 Return Values

**EA6.0** *Return values from all functions should be tested, including tests for the following characteristics:*

- **EA6.0a** *Classes and types of objects*
- **EA6.0b** *Dimensions of tabular objects*
- **EA6.0c** *Column names (or equivalent) of tabular objects*
- **EA6.0d** *Classes or types of all columns contained within* `data.frame`*-type tabular objects*
- **EA6.0e** *Values of single-valued objects; for* `numeric` *values either using* `testthat::expect_equal()` *or equivalent with a defined value for the* `tolerance` *parameter, or using* `round(..., digits = x)` *with some defined value of* `x` *prior to testing equality.*
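The characteristics listed in EA6.0 can be illustrated with the following sketch, which tests the result of a simple aggregation of the `iris` data. Base R assertions are used here only so that the example is self-contained; in a package these would more commonly be written with `testthat` expectations, with `all.equal(..., tolerance = )` corresponding to the `tolerance` parameter of `testthat::expect_equal()`.

```r
res <- aggregate (Sepal.Length ~ Species, data = iris, FUN = mean)

stopifnot (
    inherits (res, "data.frame"), # EA6.0a: class of object
    identical (dim (res), c (3L, 2L)), # EA6.0b: dimensions
    identical (names (res), c ("Species", "Sepal.Length")), # EA6.0c: column names
    is.factor (res$Species), # EA6.0d: classes of columns
    is.double (res$Sepal.Length),
    # EA6.0e: single values compared with an explicitly defined tolerance
    isTRUE (all.equal (res$Sepal.Length [1], 5.006, tolerance = 1e-6))
)
```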

#### 5.6.6.2 Graphical Output

**EA6.1** *The properties of graphical output from EDA software should be explicitly tested, for example via the* `vdiffr` *package or equivalent.*

Tests for graphical output are frequently only run as part of an extended test suite.

## 5.7 Time Series Software

Time series software is presumed to perform one or more of the following steps:

- Accept and validate input data
- Apply data transformation and pre-processing steps
- Apply one or more analytic algorithms
- Return the result of that algorithmic application
- Offer additional functionality such as printing or summarising return results

This document details standards for each of these steps, each prefixed with “TS”.

### 5.7.1 Input data structures and validation

Input validation is an important software task, and an important part of our
standards. While there are many ways to approach validation, the class systems
of R offer a particularly convenient and effective means. For Time Series
Software in particular, a range of class systems have been developed, for which
we refer to the section “Time Series Classes” in the CRAN Task View on “Time
Series Analysis”, and the class-conversion package `tsbox`. Software which
uses and relies on defined classes can often validate input through affirming
appropriate class(es). Software which does not use or rely on class systems
will generally need specific routines to validate input data structures. In
particular, because of the long history of time series software in R, and the
variety of class systems for representing time series data, new time series
packages should accept as many different classes of input as possible by
according with the following standards:

**TS1.0***Time Series Software should use and rely on explicit class systems developed for representing time series data, and should not permit generic, non-time-series input*

The core algorithms of time-series software are often ultimately applied to
simple vector objects, and some time series software accepts simple vector
inputs, assuming these to represent temporally sequential data. Permitting such
generic inputs nevertheless prevents any such assumptions from being asserted
or tested. Missing values pose particular problems in this regard. A simple
`na.omit()` call or similar will shorten the length of the vector by removing
any `NA` values, and will change the explicit temporal relationship between
elements. The use of explicit classes for time series generally ensures an
ability to explicitly assert properties such as strict temporal regularity, and
to control for any deviation from expected properties.
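The difference can be illustrated with a short sketch contrasting a plain vector with the same values stored as a `ts` object. The values are arbitrary, and the behaviour shown for `ts` objects is that of the `na.omit()` method provided by the `stats` package.

```r
x <- c (0.71, NA, 0.28, 0.34, 0.07)

# For a plain vector, the missing value is silently removed, so that the
# first and third values become adjacent:
x_omit <- na.omit (x)
length (x_omit)
attr (x_omit, "na.action") # the removed position is recorded only here

# For a regular time series, removing an internal value would break
# temporal regularity, and the method for 'ts' objects refuses to do so:
x_ts <- ts (x, start = 1, frequency = 1)
res <- tryCatch (na.omit (x_ts), error = function (e) conditionMessage (e))
res
```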

**TS1.1** *Time Series Software should explicitly document the types and classes of input data able to be passed to each function.*

**TS1.2** *Time Series Software should accept input data in as many time series specific classes as possible.*

**TS1.3** *Time Series Software should implement validation routines to confirm that inputs are of acceptable classes (or represented in otherwise appropriate ways for software which does not use class systems).*

**TS1.4** *Time Series Software should implement a single pre-processing routine to validate input data, and to appropriately transform it to a single uniform type to be passed to all subsequent data-processing functions (the* `tsbox` *package provides one convenient approach for this).*

**TS1.5** *The pre-processing function described above should maintain all time- or date-based components or attributes of input data.*

For Time Series Software which relies on or implements custom classes or types for representing time-series data, the following standards should be adhered to:

**TS1.6** *The software should ensure strict ordering of the time, frequency, or equivalent ordering index variable.*

**TS1.7** *Any violations of ordering should be caught in the pre-processing stages of all functions.*
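A check for strict ordering might be sketched as follows. The function name `validate_time_index()` is hypothetical, and the sketch assumes an index which can be converted to numeric values, such as `POSIXct` or `Date` objects.

```r
# Hypothetical pre-processing check for strict ordering of a time index
validate_time_index <- function (index) {
    if (anyNA (index)) {
        stop ("time index contains missing values")
    }
    # strict ordering: each value must be greater than the preceding one
    if (any (diff (as.numeric (index)) <= 0)) {
        stop ("time index is not strictly increasing")
    }
    invisible (index)
}

t0 <- as.POSIXct ("2024-01-01 08:43:00", tz = "UTC")
ordered <- t0 + 60 * c (0, 1, 2, 4, 5)
validate_time_index (ordered) # passes silently

shuffled <- ordered [c (1, 3, 2, 4, 5)]
msg <- tryCatch (
    validate_time_index (shuffled),
    error = function (e) conditionMessage (e)
)
msg
```

Calling such a routine at the start of every exported function is one way to ensure that violations are caught in pre-processing, as required by TS1.7.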

#### 5.7.1.1 Time Intervals and Relative Time

While most common packages and classes for time series data assume *absolute*
temporal scales such as those represented in `POSIX` classes
for dates or times, time series may also be quantified on *relative* scales
where the temporal index variable quantifies intervals rather than absolute
times or dates. Many analytic routines which accept time series inputs in
absolute form are also appropriately applied to analogous data in relative
form, and thus many packages should accept time series inputs both in absolute
and relative forms. Software which can or should accept time series inputs in
relative form should:

**TS1.8** *Accept inputs defined via the* `units` *package for attributing SI units to R vectors.*

**TS1.9** *Where time intervals or periods may be days or months, be explicit about the system used to represent such, particularly regarding whether a calendar system is used, or whether a year is presumed to have 365 days, 365.2422 days, or some other value.*

### 5.7.2 Pre-processing and Variable Transformation

#### 5.7.2.1 Missing Data

One critical pre-processing step for Time Series Software is the appropriate
handling of missing data. It is convenient to distinguish between *implicit*
and *explicit* missing data. For regular time series, explicit missing data may
be represented by `NA` values, while for irregular time series, implicit
missing data may be represented by missing rows. The difference is demonstrated
in the following table.

| Time  | value |
|-------|-------|
| 08:43 | 0.71  |
| 08:44 | NA    |
| 08:45 | 0.28  |
| 08:47 | 0.34  |
| 08:48 | 0.07  |

The value for 08:46 is *implicitly missing*, while the value for 08:44 is
*explicitly missing*. These two forms of missingness may connote different
things, and may require different forms of pre-processing. With this in mind,
the following standards apply:

**TS2.0** *Appropriate checks for missing data, and associated transformation routines, should be performed as part of initial pre-processing prior to passing data to analytic algorithms.*

**TS2.1** *Time Series Software which presumes or requires regular data should only allow* explicit *missing values, and should issue appropriate diagnostic messages, potentially including errors, in response to any* implicit *missing values.*

**TS2.2** *Where possible, all functions should provide options for users to specify how to handle missing data, with options minimally including:*

- **TS2.2a** *error on missing data.*
- **TS2.2b** *warn or ignore missing data, and proceed to analyse irregular data, ensuring that results from function calls with regular yet missing data return identical values to submitting equivalent irregular data with no missing values.*
- **TS2.2c** *replace missing data with appropriately imputed values.*

**TS2.3** *Functions should never assume non-missingness, and should never pass data with potential missing values to any base routines with default* `na.rm = FALSE`*-type parameters (such as* `mean()`*,* `sd()` *or* `var()`*).*
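The distinction between implicit and explicit missing values can be made concrete using the values from the table above. The following sketch assumes an arbitrary date and a regular interval of one minute; it identifies both kinds of missing value, and converts implicit to explicit missing values by joining onto a complete, regular sequence of times.

```r
x <- data.frame (
    time = as.POSIXct (
        paste ("2024-01-01", c ("08:43", "08:44", "08:45", "08:47", "08:48")),
        tz = "UTC"
    ),
    value = c (0.71, NA, 0.28, 0.34, 0.07)
)

# The complete regular sequence which the data are presumed to follow:
full <- data.frame (time = seq (min (x$time), max (x$time), by = "1 min"))

explicit <- x$time [is.na (x$value)] # present in the data, but NA
implicit <- full$time [!full$time %in% x$time] # absent from the data

# Convert implicit to explicit missing values:
regular <- merge (full, x, by = "time", all.x = TRUE)

format (explicit, "%H:%M")
format (implicit, "%H:%M")
regular
```

Following TS2.1, software requiring regular data could issue a diagnostic message or error whenever `implicit` has non-zero length.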

#### 5.7.2.2 Stationarity

Time Series Software should explicitly document assumptions or requirements made with respect to the stationarity or otherwise of all input data. In particular, any (sub-)functions which assume or rely on stationarity should:

**TS2.4** *Consider stationarity of all relevant moments - typically first (mean) and second (variance) order, or otherwise document why such consideration may be restricted to lower orders only.*

**TS2.5** *Explicitly document all assumptions and/or requirements of stationarity*

**TS2.6** *Implement appropriate checks for all relevant forms of stationarity, and either:*

- **TS2.6a** *issue diagnostic messages or warnings; or*
- **TS2.6b** *enable or advise on appropriate transformations to ensure stationarity.*

The two options in the last point (TS2.6b) respectively translate to *enabling*
transformations to ensure stationarity by providing appropriate routines,
generally triggered by some function parameter, or *advising* on appropriate
transformations, for example by directing users to additional functions able to
implement appropriate transformations.
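One possible form of such a check is sketched below, using the Phillips-Perron unit root test implemented in `stats::PP.test()`. The helper function `check_unit_root()` and the threshold `alpha` are assumptions made for illustration. Note that a unit root test addresses only one form of non-stationarity, and software would generally also need to consider second-order properties (see TS2.4).

```r
# Hypothetical check which issues a warning, and advises on a transformation,
# where the null hypothesis of a unit root cannot be rejected
check_unit_root <- function (x, alpha = 0.05) {
    p <- stats::PP.test (x)$p.value
    if (p > alpha) {
        warning (
            "unit root not rejected (p = ", signif (p, 2),
            "); consider differencing the series, for example with diff()"
        )
    }
    invisible (p)
}

set.seed (1)
x <- cumsum (rnorm (500)) # a random walk, which is non-stationary

# The random walk will typically trigger the warning (suppressed here):
p_walk <- suppressWarnings (check_unit_root (x))
# The differenced series is white noise, for which a unit root is rejected:
p_diff <- check_unit_root (diff (x))
```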

#### 5.7.2.3 Covariance Matrices

Where covariance matrices are constructed or otherwise used within or as input to functions, they should:

**TS2.7** *Incorporate a system to ensure that both row and column orders follow the same ordering as the underlying time series data. This may, for example, be done by including the* `index` *attribute of the time series data as an attribute of the covariance matrix.*

**TS2.8** *Where applicable, covariance matrices should also include specification of appropriate units.*

### 5.7.3 Analytic Algorithms

Analytic algorithms are considered here to reflect the core analytic components of Time Series Software. These may be many and varied, and we explicitly consider only a small subset here.

#### 5.7.3.1 Forecasting

Statistical software which implements forecasting routines should:

**TS3.0** *Provide tests to demonstrate at least one case in which errors widen appropriately with forecast horizon.*

**TS3.1** *If possible, provide at least one test which violates TS3.0*

**TS3.2** *Document the general drivers of forecast errors or horizons, as demonstrated via the particular cases of TS3.0 and TS3.1*

**TS3.3** *Either:*

- **TS3.3a** *Document, preferably via an example, how to trim forecast values based on a specified error margin or equivalent; or*
- **TS3.3b** *Provide an explicit mechanism to trim forecast values to a specified error margin, either via an explicit post-processing function, or via an input parameter to a primary analytic function.*
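A test of the kind described in TS3.0 might be sketched as follows, using a first-order autoregressive model fitted with `stats::arima()` to the `lh` data set included with R. The model and forecast horizon are arbitrary choices made for illustration, and `stopifnot()` stands in for the expectations of a testing framework.

```r
fit <- arima (lh, order = c (1, 0, 0))
fc <- predict (fit, n.ahead = 10)

# Standard errors of forecast values should increase with forecast horizon:
stopifnot (all (diff (fc$se) > 0))

# Trimming forecast values to a specified error margin (TS3.3a), here
# arbitrarily taken as 1.1 times the standard error of the first forecast:
margin <- 1.1 * fc$se [1]
fc_trimmed <- fc$pred [fc$se <= margin]
length (fc_trimmed)
```

For a stationary autoregressive model such as this, forecast errors converge towards the unconditional standard deviation of the series, so that trimming to an error margin below that value retains only a finite number of forecast values.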

### 5.7.4 Return Results

For (functions within) Time Series Software which return time series data:

**TS4.0** *Return values should either:*

- **TS4.0a** *Be in the same class as input data, for example by using the* `tsbox` *package to re-convert from standard internal format (see TS1.4, above); or*
- **TS4.0b** *Be in a unique, preferably class-defined, format.*

**TS4.1***Any units included as attributes of input data should also be included within return values.***TS4.2***The type and class of all return values should be explicitly documented.*

For (functions within) Time Series Software which return data other than direct series:

- **TS4.3** *Return values should explicitly include all appropriate units and/or time scales.*

#### 5.7.4.1 Data Transformation

Time Series Software which internally implements routines for transforming data to achieve stationarity and which returns forecast values should:

- **TS4.4** *Document the effect of any such transformations on forecast data, including potential effects on both first- and second-order estimates.*
- **TS4.5** *In decreasing order of preference, either:*
    - **TS4.5a** *Provide explicit routines or options to back-transform data commensurate with original, non-stationary input data.*
    - **TS4.5b** *Demonstrate how data may be back-transformed to a form commensurate with original, non-stationary input data.*
    - **TS4.5c** *Document associated limitations on forecast values.*
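As an illustration of back-transformation, the following sketch transforms a simulated non-stationary series via log-differences, forecasts on the transformed scale, and then reverses each transformation in turn. The function `back_transform` is a hypothetical name, and the log-difference transformation is assumed here only as a common example.

```r
set.seed (1)
# Non-stationary input: a positive series with stochastic growth
x <- exp (cumsum (rnorm (100, mean = 0.01, sd = 0.05)))

# Transform to (approximate) stationarity via log-differences
y <- diff (log (x))
fit <- arima (y, order = c (1, 0, 0))
fc <- predict (fit, n.ahead = 5)

# Reverse each step in turn: undo differencing by cumulative summation
# from the last observed log value, then undo the logarithm
back_transform <- function (pred, x_last) {
    exp (log (x_last) + cumsum (as.numeric (pred)))
}
x_fc <- back_transform (fc$pred, x_last = x [length (x)])
```

Documentation accompanying such a routine should note effects on first- and second-order estimates. Under an assumption of normally distributed errors on the log scale, for example, exponentiated forecasts estimate medians rather than means of the original series.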

#### 5.7.4.2 Forecasting

Where Time Series Software implements or otherwise enables forecasting abilities, it should return one of the following three kinds of information. These are presented in decreasing order of preference, such that software should strive to return the first kind of object, failing that the second, and only the third as a last resort.

- **TS4.6** *Time Series Software which implements or otherwise enables forecasting should return either:*
    - **TS4.6a** *A distribution object, for example via one of the many packages described in the CRAN Task View on Probability Distributions (or the new* `distributional` *package as used in the* `fable` *package for time-series forecasting).*
    - **TS4.6b** *For each variable to be forecast, predicted values equivalent to first- and second-order moments (for example, mean and standard error values).*
    - **TS4.6c** *Some more general indication of error involved with forecast estimates.*

Beyond these particular standards for return objects, Time Series Software which implements or otherwise enables forecasting should:

- **TS4.7** *Ensure that forecast (modelled) values are clearly distinguished from observed (model or input) values, either (in this case in no order of preference) by*
    - **TS4.7a** *Returning forecast values alone*
    - **TS4.7b** *Returning distinct list items for model and forecast values*
    - **TS4.7c** *Combining model and forecast values into a single return object with an appropriate additional column clearly distinguishing the two kinds of data.*
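One possible form of return object, combining first- and second-order forecast estimates with a column distinguishing observed from forecast values, is sketched below. The column names `time`, `value`, `se`, and `type` are assumptions for illustration only.

```r
set.seed (1)
x <- arima.sim (model = list (ar = 0.7), n = 50)
fit <- arima (x, order = c (1, 0, 0))
fc <- predict (fit, n.ahead = 5)

n <- length (x)
res <- data.frame (
    time = seq_len (n + 5),
    value = c (as.numeric (x), as.numeric (fc$pred)),
    # standard errors exist for forecast values only
    se = c (rep (NA_real_, n), as.numeric (fc$se)),
    # additional column distinguishing the two kinds of data
    type = rep (c ("observed", "forecast"), times = c (n, 5))
)
tail (res)
```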

### 5.7.5 Visualization

Time Series Software should:

- **TS5.0** *Implement default* `plot` *methods for any implemented class system.*
- **TS5.1** *When representing results in temporal domain(s), ensure that one axis is clearly labelled “time” (or equivalent), with continuous units.*
- **TS5.2** *Default to placing the “time” (or equivalent) variable on the horizontal axis.*
- **TS5.3** *Ensure that units of the time, frequency, or index variable are printed by default on the axis.*
- **TS5.4** *For frequency visualization, abscissa spanning \([-\pi, \pi]\) should be avoided in favour of positive units of \([0, 2\pi]\) or \([0, 0.5]\), in all cases with appropriate additional explanation of units.*
- **TS5.5** *Provide options to determine whether plots of data with missing values should generate continuous or broken lines.*

For the results of forecast operations, Time Series Software should:

- **TS5.6** *By default indicate distributional limits of forecast on plot.*
- **TS5.7** *By default include model (input) values in plot, as well as forecast (output) values.*
- **TS5.8** *By default provide clear visual distinction between model (input) values and forecast (output) values.*
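A plotting function respecting these defaults might be sketched in base graphics as follows. The function name `plot_forecast`, the expected column names, and the use of normal 95% limits (`1.96 * se`) are all assumptions made for illustration.

```r
# Hypothetical plot function for a forecast result with columns
# `time`, `value`, `se`, and `type` ("observed" or "forecast")
plot_forecast <- function (res, time_units = "days") {
    fc <- res [res$type == "forecast", ]
    obs <- res [res$type == "observed", ]
    lower <- fc$value - 1.96 * fc$se # assumed normal 95% limits
    upper <- fc$value + 1.96 * fc$se
    # time on the horizontal axis, labelled with units
    plot (res$time, res$value, type = "n",
        ylim = range (c (res$value, lower, upper)),
        xlab = paste0 ("time (", time_units, ")"), ylab = "value")
    # distributional limits of the forecast
    polygon (c (fc$time, rev (fc$time)), c (lower, rev (upper)),
        col = "grey85", border = NA)
    lines (obs$time, obs$value, lty = 1) # model (input) values
    lines (fc$time, fc$value, lty = 2) # forecast values, visually distinct
    legend ("topleft", legend = c ("observed", "forecast"),
        lty = 1:2, bty = "n")
    invisible (list (lower = lower, upper = upper))
}

res <- data.frame (
    time = 1:8,
    value = c (1.0, 1.4, 1.1, 1.6, 1.5, 1.7, 1.8, 1.9),
    se = c (rep (NA, 5), 0.2, 0.3, 0.4),
    type = rep (c ("observed", "forecast"), times = c (5, 3))
)
```

Calling `plot_forecast (res)` draws observed values as a solid line, forecast values as a dashed line, and forecast limits as a shaded band. In a package, such a function would more typically be registered as a `plot` method for the class of the return object.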

## 5.8 Machine Learning Software


R has an extensive and diverse ecosystem of Machine Learning (ML) software
which is very well described in the corresponding CRAN Task
View. Unlike most
other categories of statistical software considered here, the primary
distinguishing feature of ML software is not (necessarily or directly)
algorithmic, but rather pertains to a *workflow* typical of machine learning tasks.
In particular, we consider ML software to approach data analysis via the two
primary steps of:

- Passing a set of *training* data to an algorithm in order to generate a candidate mapping between that data and some form of pre-specified output or response variable. Such mappings will be referred to here as “models”, with a single analysis of a single set of training data generating one model.
- Passing a set of test data to the model(s) generated by the first step in order to derive some measure of predictive accuracy for that model.

A single ML task generally yields two distinct outputs:

- The model derived in the first of the previous steps; and
- Associated statistics of model performance (as evaluated within the context of the test data used to assess that performance).
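These two steps and two outputs can be illustrated with a deliberately simple sketch. The use of the built-in `mtcars` data, a linear model via `lm`, and root-mean-square error as the performance statistic are assumptions made only to keep the example self-contained; actual ML software will generally use very different models and metrics.

```r
set.seed (1)
is_test <- seq_len (nrow (mtcars)) %in% sample (nrow (mtcars), size = 8)
training <- mtcars [!is_test, ]
test <- mtcars [is_test, ]

# Step 1: pass training data to an algorithm to generate a model
model <- lm (mpg ~ wt + hp, data = training)

# Step 2: pass test data to the model to measure predictive accuracy
pred <- predict (model, newdata = test)
rmse <- sqrt (mean ((test$mpg - pred)^2))

# The two outputs are the model itself and its performance statistic
list (model_class = class (model), rmse = rmse)
```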

**A Machine Learning Workflow**

Given those initial considerations, we now attempt the difficult task of envisioning a typical standard workflow for inherently diverse ML software. The following workflow ought to be considered an “extensive” workflow, with shorter versions, and correspondingly more restricted sets of standards, possible depending upon envisioned areas of application. For example, the workflow presumes input data to be too large to be stored as a single entity in local memory. Adaptation to situations in which all training data can be loaded into memory may mean that some of the following workflow stages, and therefore corresponding standards, may not apply.

Just as typical workflows are potentially very diverse, so are outputs of ML software, which depend on areas of application and intended purpose of software. The following refers to the “desired output” of ML software, a phrase which is intentionally left non-specific, but which is intended to connote any and all forms of “response variable” and other “pre-specified outputs” such as categorical labels or validation data, along with outputs which may not necessarily be able to be pre-specified in simple uni- or multi-variate form, such as measures of distance between sets of training and validation data.

Such “desired outputs” are presumed to be quantified in terms of a “loss” or “cost” function (hereafter, simply “loss function”) quantifying some measure of distance between a model estimate (resulting from applying the model to one or more components of a training data set) and a pre-defined “valid” output (during training), or a test data set (following training).

Given the foregoing considerations, we consider a typical ML workflow to progress through (at least some of) the following steps:

1. *Input Data Specification*: Obtain a local copy of input data, often as multiple *objects* (either on-disk or in memory) in some suitably structured form such as in a series of sub-directories or accompanied by additional data defining the structural properties of input objects. Regardless of form, multiple objects are commonly given generic labels which distinguish between `training` and `test` data, along with optional additional categories and labels such as `validation` data used, for example, to determine accuracy of models applied to training data yet prior to testing.
2. *Pre-Processing*: Define transformations of input data, including but not restricted to, broadcasting dimensions (as defined below) and standardising data ranges (typically to defined values of mean and standard deviation).
3. *Model and Algorithm Specification*: Specify the model and associated processes which will be applied to map the input data on to the desired output. This step minimally includes the following distinct stages (generally in no particular order):
    - Specify the kind of model which will be applied to the training data. ML software often allows the use of pre-trained models, in which case this step includes downloading or otherwise obtaining a pre-trained model, along with specification of which aspects of those models are to be modified through application to a particular set of training and validation data.
    - Specify the kind of algorithm which will be used to explore the search space (for example some kind of gradient descent algorithm), along with parameters controlling how that algorithm will be applied (for example a learning rate, as defined below).
    - Specify the kind of loss function which will be used to quantify distance between model estimates and desired output.
4. *Model Training*: Apply the specified model to the training data to generate a series of estimates from the specified loss function. This stage may also include specifying parameters such as stopping or exit criteria, and parameters controlling batch processing of input data. Moreover, this stage may involve retaining some of the following additional data:
    - Potential “pre-processing” stages such as initial estimates of optimal learning rates (see below).
    - Details or summaries of actual paths taken through the search space towards convergence on local or global minima.
5. *Model Output and Performance*: Measure the performance of the trained model when applied to the test data set, generally requiring the specification of a metric of model performance or accuracy.

Importantly, ML workflows may be partly iterative. This may in turn potentially confound distinctions between training and test data, and accordingly confound expectations commonly placed upon statistical analyses of statistical independence of response variables. ML routines such as cross-validation repeatedly (re-)partition data between training and test sets. Resultant models can then not be considered to have been developed through application to any single set of truly “independent” data. In the context of the standards that follow, these considerations admit a potential lack of clarity in any notional categorical distinction between training and test data, and between model specification and training.

The preceding workflow mentioned a couple of concepts, definitions of which are given immediately below. Following that, we proceed to standards for ML software, enumerated and developed with reference to the preceding workflow steps. As described above, these steps may not be applicable to all ML software, and so all of the following standards should be considered to be conditioned on “where applicable.” In order that the following standards initially adhere to the enumeration of workflow steps given above, more general standards pertaining to aspects such as documentation and testing are given following the initial five “workflow” standards.

**Definition of broadcasting** (referred to in Step 2, above)

The following definition comes from a vignette for the `rray` package named *Broadcasting*.

*Broadcasting* is “repeating the dimensions of one object to match the dimensions of another.”

This concept runs counter to aspects of standards in other categories, which often suggest that functions should error when passed input objects which do not have commensurate dimensions. Broadcasting is a pre-processing step which enables objects with incommensurate dimensions to be dimensionally reconciled.

The following demonstration is taken directly from the `rray` package (which is not currently on CRAN).

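```
library (rray)
a <- array(c(1, 2), dim = c(2, 1)) # 2 rows, 1 column
b <- array(c(3, 4), dim = c(1, 2)) # 1 row, 2 columns
# rbind (a, b) # error!
# Binding along rows broadcasts `a` to two columns:
rray_bind (a, b, .axis = 1)
#>      [,1] [,2]
#> [1,]    1    1
#> [2,]    2    2
#> [3,]    3    4
# Binding along columns broadcasts `b` to two rows:
rray_bind (a, b, .axis = 2)
#>      [,1] [,2] [,3]
#> [1,]    1    3    4
#> [2,]    2    3    4
```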

Broadcasting is commonly employed in ML software because it enables ML operations to be implemented on objects with incommensurate dimensions. One example is image analysis, in which training data may all be dimensionally commensurate, yet test images may have different dimensions. Broadcasting allows data to be submitted to ML routines regardless of potentially incommensurate dimensions.
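Because `rray` is not on CRAN, the same two results can also be reproduced in base R by explicitly repeating dimensions prior to binding. The following is only a sketch of the underlying idea, and is not how `rray` itself is implemented.

```r
a <- array (c (1, 2), dim = c (2, 1))
b <- array (c (3, 4), dim = c (1, 2))

# Repeat the single column of `a` to match the two columns of `b`,
# then bind along rows
a_wide <- a [, rep (1, ncol (b)), drop = FALSE]
bound_rows <- rbind (a_wide, b)

# Repeat the single row of `b` to match the two rows of `a`,
# then bind along columns
b_tall <- b [rep (1, nrow (a)), , drop = FALSE]
bound_cols <- cbind (a, b_tall)
```

Here `bound_rows` is a 3-by-2 matrix and `bound_cols` a 2-by-3 matrix, with values matching the two `rray_bind` results shown above.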

**Definition of learning rate** (referred to in Steps 3 and 4, above)

*Learning Rate* (generally) determines the step size used to search for local optima as a fraction of the local gradient.

This parameter is particularly important for training ML algorithms like neural networks, the results of which can be very sensitive to variations in learning rates. A useful overview of the importance of learning rates, and a useful approach to automatically determining appropriate values, is given in this blog post.
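The sensitivity of results to learning rates can be seen even in a one-dimensional toy problem. The following sketch applies gradient descent to the function f(x) = x^2 with two different learning rates. The function name `gradient_descent` and the particular rates are chosen purely for illustration.

```r
# Gradient descent on f(x) = x^2, for which the gradient is 2 * x
gradient_descent <- function (x0, learning_rate, n_steps = 50) {
    x <- x0
    for (i in seq_len (n_steps)) {
        # each step is the local gradient scaled by the learning rate
        x <- x - learning_rate * 2 * x
    }
    x
}

x_small <- gradient_descent (x0 = 1, learning_rate = 0.1) # approaches 0
x_large <- gradient_descent (x0 = 1, learning_rate = 1.1) # diverges
```

Each step here multiplies `x` by `1 - 2 * learning_rate`, so a rate of 0.1 shrinks the estimate towards the minimum at zero, while a rate of 1.1 overshoots by an ever-increasing amount.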

Partly because of widespread and current relevance, the category of Machine Learning software is one for which there have been other notable attempts to develop standards. A particularly useful reference is the MLPerf organization which, among other activities, hosts several GitHub repositories providing reference datasets and benchmark conditions for comparing performance aspects of ML software. While such reference or benchmark standards are not explicitly referred to in the current version of the following standards, we expect them to be gradually adapted and incorporated as we start to apply and refine our standards in application to software submitted to our review system.

### 5.8.1 Input Data Specification

Many of the following standards refer to the labelling of input data as “testing” or “training” data, along with potentially additional labels such as “validation” data. In regard to such labelling, the following two standards apply:

- **ML1.0** *Documentation should make a clear conceptual distinction between training and test data (even where such may ultimately be confounded as described above).*
    - **ML1.0a** *Where these terms are ultimately eschewed, these should nevertheless be used in initial documentation, along with clear explanation of, and justification for, alternative terminology.*
- **ML1.1** *Absent clear justification for alternative design decisions, input data should be expected to be labelled “test”, “training”, and, where applicable, “validation” data.*
    - **ML1.1a** *The presence and use of these labels should be explicitly confirmed via pre-processing steps (and tested in accordance with* **ML7.0**, *below).*
    - **ML1.1b** *Matches to expected labels should be case-insensitive and based on partial matching such that, for example, “Test”, “test”, or “testing” should all suffice.*
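Case-insensitive partial matching of data labels could be implemented along the following lines. The function name `match_data_label` and the choice to match on the stems “train”, “test”, and “valid” are assumptions for illustration only.

```r
# Hypothetical helper implementing case-insensitive partial matching
# of labels against the three expected kinds of data
match_data_label <- function (label) {
    expected <- c ("training", "test", "validation")
    stems <- c ("train", "test", "valid")
    hits <- vapply (stems, function (s) {
        startsWith (tolower (label), s)
    }, logical (1))
    if (sum (hits) != 1L) {
        stop ("Label '", label, "' does not match one of: ",
            paste (expected, collapse = ", "), call. = FALSE)
    }
    expected [hits]
}

vapply (c ("Test", "test", "testing", "TRAIN", "Validation"),
    match_data_label, character (1))
```

Labels which match none of the expected kinds generate an informative error, so that mis-labelled input is identified during pre-processing rather than during training.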

The following three standards (**ML1.2**–**ML1.4**) represent three possible
design intentions for ML software. Only one of these three will generally be
applicable to any one piece of software, although it is nevertheless possible
that more than one of these standards may apply. The first of these three
standards applies to ML software which is intended to process, or capable of
processing, input data as a single (generally tabular) object.

- **ML1.2** *Training and test data sets for ML software should be able to be input as a single, generally tabular, data object, with the training and test data distinguished either by*
    - *A specified variable containing, for example,* `TRUE`/`FALSE` *or* `0`/`1` *values, or which uses some other system such as missing* (`NA`) *values to denote test data; and/or*
    - *An additional parameter designating case or row numbers, or labels of test data.*
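Both ways of distinguishing training from test data within a single tabular object are sketched below. The function `split_train_test` and its parameters `test_var` and `test_rows` are hypothetical names used only for illustration.

```r
# Hypothetical helper: split a single data frame into training and test
# data, either by a named indicator variable or by row numbers or labels
split_train_test <- function (x, test_var = NULL, test_rows = NULL) {
    if (!is.null (test_var)) {
        is_test <- as.logical (x [[test_var]]) # TRUE/FALSE or 0/1 values
        x [[test_var]] <- NULL
    } else if (is.character (test_rows)) {
        is_test <- rownames (x) %in% test_rows # row labels
    } else {
        is_test <- seq_len (nrow (x)) %in% test_rows # row numbers
    }
    list (
        training = x [!is_test, , drop = FALSE],
        test = x [is_test, , drop = FALSE]
    )
}

dat <- data.frame (
    x = 1:6,
    y = c (2, 4, 6, 8, 10, 12),
    is_test = c (0, 0, 0, 0, 1, 1)
)
s1 <- split_train_test (dat, test_var = "is_test")
s2 <- split_train_test (dat [, c ("x", "y")], test_rows = 5:6)
```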

The second of these three standards applies to ML software which is intended to process, or capable of processing, input data represented as multiple objects which exist in local memory.

- **ML1.3** *Input data should be clearly partitioned between training and test data (for example, through having each passed as a distinct* `list` *item), or should enable an additional means of categorically distinguishing training from test data (such as via an additional parameter which provides explicit labels). Where applicable, distinction of validation and any other data should also accord with this standard.*

The third of these three standards for data input applies to ML software for which data are expected to be input as references to multiple external objects, generally expected to be read from either local or remote connections.

- **ML1.4** *Training and test data sets, along with other necessary components such as validation data sets, should be stored in their own distinctly labelled sub-directories (for distinct files), or according to an explicit and distinct labelling scheme (for example, for database connections). Labelling should in all cases adhere to* **ML1.1**, *above.*

The following standard applies to all ML software regardless of the applicability or otherwise of the preceding three standards.

- **ML1.5** *ML software should implement a single function which summarises the contents of test and training (and other) data sets, minimally including counts of numbers of cases, records, or files, and potentially extending to tables or summaries of file or data types, sizes, and other information (such as unique hashes for each component).*
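A minimal version of such a summary function, for data held in memory as a named list of data frames, might look like the following. The function name `summarise_ml_data` and the particular summary columns are assumptions, and software reading from files or databases would need to summarise those sources instead.

```r
# Hypothetical summary of a named list of in-memory data sets
summarise_ml_data <- function (data) {
    data.frame (
        label = names (data),
        n_cases = vapply (data, nrow, integer (1)),
        n_variables = vapply (data, ncol, integer (1)),
        n_missing = vapply (data, function (d) sum (is.na (d)), integer (1)),
        row.names = NULL
    )
}

data <- list (training = mtcars [1:24, ], test = mtcars [25:32, ])
data_summary <- summarise_ml_data (data)
data_summary
```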

#### 5.8.1.1 Missing Values

Missing data are handled differently by different ML routines, and it is also difficult to suggest generally applicable standards for pre-processing missing values in ML software. The following standards attempt to cover a practical range of typical approaches and applications.

- **ML1.6** *ML software which does not admit missing values, and which expects no missing values, should implement explicit pre-processing routines to identify whether data has any missing values, and should generally error appropriately and informatively when passed data with missing values. In addition, ML software which does not admit missing values should:*
    - **ML1.6a** *Explain why missing values are not admitted.*
    - **ML1.6b** *Provide explicit examples (in function documentation, vignettes, or both) for how missing values may be imputed, rather than simply discarded.*
- **ML1.7** *ML software which admits missing values should clearly document how such values are processed.*
    - **ML1.7a** *Where missing values are imputed, software should offer multiple user-defined ways to impute missing data.*
    - **ML1.7b** *Where missing values are imputed, the precise imputation steps should also be explicitly documented, either in tests (see* **ML7.2** *below), function documentation, or vignettes.*
- **ML1.8** *ML software should enable equal treatment of missing values for both training and test data, with optional user ability to control application to either one or both.*
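The following sketch shows both an informative error for data with missing values, and a simple imputation function offering a choice of methods. The function names `check_no_missing` and `impute` are hypothetical, and the imputation shown assumes all columns to be numeric.

```r
# Hypothetical pre-processing check which errors informatively
check_no_missing <- function (x) {
    n_missing <- colSums (is.na (x))
    if (any (n_missing > 0)) {
        stop ("Input data contain missing values in column(s): ",
            paste (names (n_missing) [n_missing > 0], collapse = ", "),
            ". Impute or remove these before model training.",
            call. = FALSE)
    }
    invisible (x)
}

# Hypothetical imputation with user-selectable method
# (assumes numeric columns only)
impute <- function (x, method = c ("mean", "median")) {
    method <- match.arg (method)
    fn <- switch (method, mean = mean, median = stats::median)
    x [] <- lapply (x, function (col) {
        col [is.na (col)] <- fn (col, na.rm = TRUE)
        col
    })
    x
}

dat <- data.frame (a = c (1, NA, 3), b = c (4, 5, 6))
imputed <- impute (dat, method = "median")
```

Applying the same `impute` call to both training and test data gives the equal treatment described above, while calling it on only one of the two gives users control over where imputation is applied.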

### 5.8.2 Pre-processing

As reflected in the workflow envisioned at the outset, ML software operates somewhat differently to statistical software in many other categories. In particular, ML software often requires explicit specification of a workflow, including specification of input data (as per the standards of the preceding sub-section), and of both transformations and statistical models to be applied to those data. This section of standards refers exclusively to the transformation of input data as a pre-processing step prior to any specification of, or submission to, actual models.

- **ML2.0** *A dedicated function should enable pre-processing steps to be defined and parametrized.*
    - **ML2.0a** *That function should return an object which can be directly submitted to a specified model (see section 3, below).*
    - **ML2.0b** *Absent explicit justification otherwise, that return object should have a defined class minimally intended to implement a default* `print` *method which summarizes the input data set (as per* **ML1.5** *above) and associated transformations (see the following standard).*

Standards for most other categories of statistical software suggest that pre-processing routines should ensure that input data sets are commensurate, for example, through having equal numbers of cases or rows. In contrast, ML software is commonly intended to accept input data which cannot be guaranteed to be dimensionally commensurate, such as software intended to process rectangular image files which may be of different sizes.

- **ML2.1** *ML software which uses broadcasting to reconcile dimensionally incommensurate input data should offer an ability to at least optionally record transformations applied to each input file.*

Beyond broadcasting and dimensional transformations, the following standards apply to the pre-processing stages of ML software.

- **ML2.2** *ML software which requires or relies upon numeric transformations of input data (such as change in mean values or variances) should allow optional explicit specification of target values, rather than restricting transformations to default generic values only (such as transformations to z-scores).*
    - **ML2.2a** *Where the parameters have default values, reasons for those particular defaults should be explicitly described.*
    - **ML2.2b** *Any extended documentation (such as vignettes) which demonstrates the use of explicit values for numeric transformations should explicitly describe why particular values are used.*

For all transformations applied to input data, whether of dimension (**ML2.1**)
or scale (**ML2.2**),

- **ML2.3** *The values associated with all transformations should be recorded in the object returned by the function described in the preceding standard* (**ML2.0**).
- **ML2.4** *Default values of all transformations should be explicitly documented, both in documentation of parameters where appropriate (such as for numeric transformations), and in extended documentation such as vignettes.*
- **ML2.5** *ML software should provide options to bypass or otherwise switch off all default transformations.*
- **ML2.6** *Where transformations are implemented via distinct functions, these should be exported to a package’s namespace so they can be applied in other contexts.*
- **ML2.7** *Where possible, documentation should be provided for how transformations may be reversed. For example, documentation may demonstrate how the values retained via* **ML2.3**, *above, can be used along with transformations either exported via* **ML2.6** *or otherwise exemplified in demonstration code to independently transform data, and then to reverse those transformations.*
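A pre-processing function which records its transformations, allows them to be switched off, and enables them to be reversed, might be sketched as follows. The names `ml_preprocess` and `ml_reverse`, and the class name `ml_preprocess`, are assumptions for illustration, and only centering and scaling of numeric data are considered.

```r
# Hypothetical constructor for a pre-processing specification
ml_preprocess <- function (data, center = TRUE, scale = TRUE) {
    x <- as.matrix (data)
    # setting `center` or `scale` to FALSE bypasses that transformation
    centers <- if (center) colMeans (x) else rep (0, ncol (x))
    scales <- if (scale) apply (x, 2, stats::sd) else rep (1, ncol (x))
    x <- sweep (sweep (x, 2, centers, "-"), 2, scales, "/")
    # transformation values are recorded in the return object
    structure (
        list (data = x, centers = centers, scales = scales),
        class = "ml_preprocess"
    )
}

# Default print method summarising data and transformations
print.ml_preprocess <- function (x, ...) {
    cat ("<ml_preprocess>", nrow (x$data), "cases,",
        ncol (x$data), "variables\n")
    cat ("  centers:", format (x$centers, digits = 3), "\n")
    cat ("  scales: ", format (x$scales, digits = 3), "\n")
    invisible (x)
}

# Reverse the recorded transformations
ml_reverse <- function (p) {
    sweep (sweep (p$data, 2, p$scales, "*"), 2, p$centers, "+")
}

p <- ml_preprocess (mtcars [, c ("mpg", "wt")])
p_raw <- ml_preprocess (mtcars [, c ("mpg", "wt")],
    center = FALSE, scale = FALSE)
print (p)
```

Applying `ml_reverse (p)` recovers the original values, which is the kind of round trip that documentation on reversing transformations could demonstrate.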

### 5.8.3 Model and Algorithm Specification

A “model” in the context of ML software is understood to be a means of
specifying a mapping between input and output data, generally applied to
training and validation data. Model specification is the step of specifying
*how* such a mapping is to be constructed. The specification of *what* the
values of such a model actually are occurs through training the model, and is
described in the following sub-section. These standards also refer to *control
parameters* which specify how models are trained. These parameters commonly
include values specifying numbers of iterations, training rates, and parameters
controlling algorithmic processes such as re-sampling or cross-validation.

- **ML3.0** *Model specification should be implemented as a distinct stage subsequent to specification of pre-processing routines (see Section 2, above) and prior to actual model fitting or training (see Section 4, below). In particular,*
    - **ML3.0a** *A dedicated function should enable models to be specified without actually fitting or training them, or if this* (**ML3**) *and the following* (**ML4**) *stages are controlled by a single function, that function should have a parameter enabling models to be specified yet not fitted* (for example, `nofit = FALSE`).
    - **ML3.0b** *That function should accept as input the objects produced by the previous Input Data Specification stage, and defined according to* **ML2.0**, *above.*
    - **ML3.0c** *The function described above* (**ML3.0a**) *should return an object which can be directly trained as described in the following sub-section* (**ML4**).
    - **ML3.0d** *That return object should have a defined class minimally intended to implement a default* `print` *method which summarises the model specification, including values of all relevant parameters.*

- **ML3.1** *ML software should allow the use of both untrained models, specified through model parameters only, as well as pre-trained models. Use of the latter commonly entails an ability to submit a previously-trained model object to the function defined according to* **ML3.0a**, *above.*
- **ML3.2** *ML software should enable different models to be applied to the object specifying data inputs and transformations (see sub-sections 1–2, above) without needing to re-define those preceding steps.*

A function fulfilling **ML3.0–3.2** might, for example, permit the following
arguments:

- `data`: Input data specification constructed according to **ML1**
- `model`: An optional previously-trained model
- `control`: A list of parameters controlling how the model algorithm is to be applied during the subsequent training phase (**ML4**).

A function with the arguments defined above would fulfil the preceding three
standards, because the `data` argument would represent the output of **ML1**,
while the `model` argument would allow for different pre-trained models to be
submitted using the same data and associated specifications (**ML3.1**). The
provision of a separate `data` argument would fulfil **ML3.2** by allowing one
or both of the `model` or `control` parameters to be re-defined while
submitting the same `data` object.
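A function with these three arguments might be sketched as follows. The function name `ml_model_spec`, the class name, and the particular default control parameters are all hypothetical. The sketch only constructs a specification, and performs no fitting or training.

```r
# Hypothetical model specification function; specifies but does not train
ml_model_spec <- function (data, model = NULL, control = list ()) {
    defaults <- list (
        optimizer = "gradient_descent", # how to explore the search space
        loss = "mse", # loss function
        learning_rate = 0.01,
        n_iterations = 100L
    )
    control <- utils::modifyList (defaults, control)
    structure (
        list (data = data, model = model, control = control),
        class = "ml_model_spec"
    )
}

# Default print method summarising the specification
print.ml_model_spec <- function (x, ...) {
    cat ("<ml_model_spec>",
        if (is.null (x$model)) "untrained" else "pre-trained", "\n")
    for (n in names (x$control)) {
        cat ("  ", n, ": ", format (x$control [[n]]), "\n", sep = "")
    }
    invisible (x)
}

dat <- list (training = mtcars [1:24, ], test = mtcars [25:32, ])
spec_a <- ml_model_spec (dat)
# Same data object, different control parameters
spec_b <- ml_model_spec (dat,
    control = list (learning_rate = 0.1, loss = "mae"))
print (spec_b)
```

Because `data` is passed separately, `spec_a` and `spec_b` share the same input specification while differing only in how training is to be controlled.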

- **ML3.3** *Where ML software implements its own distinct classes of model objects, the properties and behaviours of those specific classes of objects should be explicitly compared with objects produced by other ML software. In particular, where possible, ML software should provide extended documentation (as vignettes or equivalent) comparing model objects with those from other ML software, noting both unique abilities and restrictions of any implemented classes.*
- **ML3.4** *Where training rates are used, ML software should provide explicit documentation both in all functions which use training rates, and in extended form such as vignettes, of the importance of, and/or sensitivity to, different values of training rates. In particular,*
    - **ML3.4a** *Unless explicitly justified otherwise, ML software should offer abilities to automatically determine appropriate or optimal training rates, either as distinct pre-processing stages, or as implicit stages of model training.*
    - **ML3.4b** *ML software which provides default values for training rates should clearly document anticipated restrictions of validity of those default values; for example through clear suggestions that user-determined and -specified values may generally be necessary or preferable.*

#### 5.8.3.1 Control Parameters

Control parameters are considered here to specify how a model is to be applied to a set of training data. These are generally distinct from parameters specifying the actual model (such as model architecture). While we recommend that control parameters be submitted as items of a single named list, this is neither a firm expectation nor an explicit part of the current standards.

- **ML3.5** *Parameters controlling optimization algorithms should minimally include:*
    - **ML3.5a** *Specification of the type of algorithm used to explore the search space (commonly, for example, some kind of gradient descent algorithm)*
    - **ML3.5b** *The kind of loss function used to assess distance between model estimates and desired output.*
- **ML3.6** *Unless explicitly justified otherwise (for example because ML software under consideration is an implementation of one specific algorithm), ML software should:*
    - **ML3.6a** *Implement or otherwise permit usage of multiple ways of exploring search space*
    - **ML3.6b** *Implement or otherwise permit usage of multiple loss functions.*

#### 5.8.3.2 CPU and GPU processing

ML software often involves manipulation of large numbers of rectangular arrays for which graphics processing units (GPUs) are often more efficient than central processing units (CPUs). ML software thus commonly offers options to train models using either CPUs or GPUs. While these standards do not currently suggest any particular design choice in this regard, we do note the following:

- **ML3.7** *For ML software in which algorithms are coded in C++, user-controlled use of either CPUs or GPUs (on NVIDIA processors at least) should be implemented through direct use of* `libcudacxx`.

This library can be “switched on” through activating a single C++ header file to switch from CPU to GPU.

### 5.8.4 Model Training

Model training is the stage of the ML workflow envisioned here in which the
actual computation is performed by applying a model specified according to
**ML3** to data specified according to **ML1** and **ML2**.

- **ML4.0** *ML software should generally implement a unified single-function interface to model training, able to receive as input a model specified according to all preceding standards. In particular, models with categorically different specifications, such as different model architectures or optimization algorithms, should be able to be submitted to the same model training function.*
- **ML4.1** *ML software should at least optionally retain explicit information on paths taken as an optimizer advances towards minimal loss. Such information should minimally include:*
    - **ML4.1a** *Specification of all model-internal parameters, or equivalent hashed representation.*
    - **ML4.1b** *The value of the loss function at each point*
    - **ML4.1c** *Information used to advance to next point, for example quantification of local gradient.*
- **ML4.2** *The subsequent extraction of information retained according to the preceding standard should be explicitly documented, including through example code.*
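Retention of optimizer paths can be illustrated with a small gradient descent routine for simple linear regression. The function name `train_linear`, its `keep_path` parameter, and the columns of the retained path are assumptions for illustration. The loss function is mean squared error.

```r
# Hypothetical training function which optionally retains the optimizer path
train_linear <- function (x, y, learning_rate = 0.1,
                          n_iterations = 200L, keep_path = TRUE) {
    beta <- c (0, 0) # intercept and slope
    path <- vector ("list", n_iterations)
    for (i in seq_len (n_iterations)) {
        resid <- beta [1] + beta [2] * x - y
        loss <- mean (resid^2)
        grad <- 2 * c (mean (resid), mean (resid * x))
        if (keep_path) {
            path [[i]] <- data.frame (
                iteration = i,
                intercept = beta [1], # model-internal parameters
                slope = beta [2],
                loss = loss, # value of loss function
                grad_norm = sqrt (sum (grad^2)) # local gradient
            )
        }
        beta <- beta - learning_rate * grad
    }
    list (
        coefficients = beta,
        path = if (keep_path) do.call (rbind, path) else NULL
    )
}

set.seed (1)
x <- rnorm (100)
y <- 1 + 2 * x + rnorm (100, sd = 0.1)
fit <- train_linear (x, y)
# Extract retained information, for example loss at selected iterations:
fit$path [c (1, 10, 200), c ("iteration", "loss")]
```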

#### 5.8.4.1 Batch Processing

The following standards apply to ML software which implements batch processing, commonly to train models on data sets too large to be loaded in their entirety into memory.

- **ML4.3** *All parameters controlling batch processing and associated terminology should be explicitly documented, and it should not, for example, be presumed that users will understand the definition of “epoch” as implemented in any particular ML software.*

According to that standard, it would for example be inappropriate to have
a parameter, `nepochs`, described as “Number of epochs used in model training”.
Rather, the definition and particular implementation of “epoch” must be
explicitly defined.

- **ML4.4** *Explicit guidance should be provided on selection of appropriate values for parameters controlling batch processing, for example, on trade-offs between batch sizes and numbers of epochs (with both terms provided as Control Parameters in accordance with the preceding standard,* **ML3**).
- **ML4.5** *ML software may optionally include a function to estimate likely time to train a specified model, through estimating initial timings from a small sample of the full batch.*
- **ML4.6** *ML software should by default provide explicit information on the progress of batch jobs (even where those jobs may be implemented in parallel on GPUs). That information may be optionally suppressed through additional parameters.*
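The following sketch shows one way in which batch-processing parameters might be documented and progress reported by default. The function `train_batches` is hypothetical and performs no actual training. The definition of “epoch” given in its documentation is one common definition, assumed here for illustration.

```r
#' @param batch_size Number of cases (rows) passed to the model in each
#'   single update of model parameters.
#' @param nepochs Number of epochs, where one epoch is defined here as one
#'   complete pass through all training cases, and so comprises
#'   `ceiling (n / batch_size)` batches for `n` training cases.
#' @param quiet If `FALSE` (default), report progress after each epoch.
train_batches <- function (x, batch_size = 10L, nepochs = 3L, quiet = FALSE) {
    n <- nrow (x)
    n_batches <- ceiling (n / batch_size)
    n_updates <- 0L
    for (e in seq_len (nepochs)) {
        index <- sample (n) # re-shuffle cases at the start of each epoch
        batches <- split (index, ceiling (seq_along (index) / batch_size))
        for (b in batches) {
            # a real implementation would update the model from x [b, ]
            n_updates <- n_updates + 1L
        }
        if (!quiet) {
            message ("Epoch ", e, " / ", nepochs, ": ",
                n_batches, " batches processed")
        }
    }
    invisible (n_updates)
}

n_updates <- train_batches (mtcars, batch_size = 10L, nepochs = 3L)
```

With 32 cases and a batch size of 10, each epoch comprises four batches, so three epochs yield twelve parameter updates in total.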

#### 5.8.4.2 Re-sampling

As described at the outset, ML software does not always rely on pre-specified and categorical distinctions between training and test data. For example, models may be fit to what is effectively one single data set in which specified cases or rows are used as training data, and the remainder as test data. Re-sampling generally refers to the practice of re-defining categorical distinctions between training and test data. One training run accordingly connotes training a model on one particular set of training data and then applying that model to the specified set of test data. Re-sampling starts that process anew, through constructing an alternative categorical partition between test and training data.

Even where test and training data are distinguished by more than a simple data-internal category (such as a labelling column), for example, by being stored in distinctly-named sub-directories, re-sampling may be implemented by effectively shuffling data between training and test sub-directories.

- **ML4.7** *ML software should provide an ability to combine results from multiple re-sampling iterations using a single parameter specifying numbers of iterations.*
- **ML4.8** *Absent any additional specification, re-sampling algorithms should by default partition data according to proportions of original test and training data.*
    - **ML4.8a** *Re-sampling routines of ML software should nevertheless offer an ability to explicitly control or override such default proportions of test and training data.*
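The re-sampling standards above can be sketched as follows. This is an illustration under stated assumptions rather than a recommended implementation: the labelling column `set`, the functions `resample_partition()` and `resample_fit()`, and the parameters `niter` and `prop_train` are all hypothetical, and a linear model stands in for an ML model.

```r
# Re-partition a data frame whose column `set` labels rows as "train" or "test".
resample_partition <- function (data, prop_train = NULL) {
    if (is.null (prop_train)) {
        # ML4.8: default to the proportion in the original partition
        prop_train <- mean (data$set == "train")
    }
    n_train <- round (prop_train * nrow (data))
    index <- sample (nrow (data), size = n_train)
    data$set <- "test"
    data$set [index] <- "train"
    data
}

# ML4.7: a single parameter, `niter`, specifies the number of re-sampling
# iterations; results are combined into one numeric vector.
resample_fit <- function (data, niter = 10L, prop_train = NULL) {
    vapply (seq_len (niter), function (i) {
        d <- resample_partition (data, prop_train)
        train <- d [d$set == "train", ]
        test <- d [d$set == "test", ]
        mod <- lm (y ~ x, data = train)
        # root-mean-square error on the test partition
        sqrt (mean ((predict (mod, newdata = test) - test$y) ^ 2))
    }, numeric (1))
}

set.seed (1)
dat <- data.frame (x = rnorm (100))
dat$y <- 2 * dat$x + rnorm (100, sd = 0.1)
dat$set <- rep (c ("train", "test"), times = c (80, 20))

rmse_default <- resample_fit (dat, niter = 5)                   # 80 / 20 split
rmse_custom <- resample_fit (dat, niter = 5, prop_train = 0.5)  # ML4.8a
```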

### 5.8.5 Model Output and Performance

Model output is considered here as a stage distinct from model performance.
Model output refers to the end result of model training (**ML4**), while model
performance involves the assessment of a trained model against a test data set.
The present section first describes standards for model output, which are
standards guiding the form of a model trained according to the preceding
standards (**ML4**). Model Performance is then considered as a separate stage.

#### 5.8.5.1 Model Output

- **ML5.0** *The result of applying the training processes described above should be contained within a single model object returned by the function defined according to **ML4.0**, above. Even where the output reflects application to a test data set, the resultant object need not include any information on model performance (see **ML5.3**–**ML5.4**, below).*
    - **ML5.0a** *That object should either have its own class, or extend some previously-defined class.*
    - **ML5.0b** *That class should have a defined* `print` *method which summarises important aspects of the model object, including but not limited to summaries of input data and algorithmic control parameters.*
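A minimal sketch of a classed model object with a `print` method, in the sense of **ML5.0a** and **ML5.0b**, might look like the following. The function `ml_train()`, the class name `ml_model`, and the control parameters shown are hypothetical, and a least-squares fit is used purely as a placeholder for model training.

```r
# Hypothetical training function returning a single classed object (ML5.0a).
ml_train <- function (x, y, learning_rate = 0.01, nepochs = 10L) {
    fit <- lm.fit (cbind (1, x), y) # placeholder for a real training routine
    structure (
        list (
            coefficients = fit$coefficients,
            nobs = length (y),
            control = list (learning_rate = learning_rate, nepochs = nepochs)
        ),
        class = "ml_model"
    )
}

# ML5.0b: print method summarising input data and control parameters.
print.ml_model <- function (x, ...) {
    cat ("<ml_model>\n")
    cat ("  observations :", x$nobs, "\n")
    cat ("  learning rate:", x$control$learning_rate, "\n")
    cat ("  epochs       :", x$control$nepochs, "\n")
    invisible (x)
}

mod <- ml_train (x = 1:10, y = 2 * (1:10) + 1)
print (mod)
```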

- **ML5.1** *As for the untrained model objects produced according to the above standards, and in particular as a direct extension of **ML3.3**, the properties and behaviours of trained models produced by ML software should be explicitly compared with equivalent objects produced by other ML software. (Such comparison will generally be done in terms of comparing model performance, as described in the following standards **ML5.3**–**ML5.4**).*
- **ML5.2** *The structure and functionality of objects representing trained ML models should be thoroughly documented. In particular,*
    - **ML5.2a** *Either all functionality extending from the class of model object should be explicitly documented, or a method for listing or otherwise accessing all associated functionality explicitly documented and demonstrated in example code.*
    - **ML5.2b** *Documentation should include examples of how to save and re-load trained model objects for their re-use in accordance with **ML3.1**, above.*
    - **ML5.2c** *Where general functions for saving or serializing objects, such as* `saveRDS` *are not appropriate for storing local copies of trained models, an explicit function should be provided for that purpose, and should be demonstrated with example code.*
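Example code of the kind required by **ML5.2b** could be as simple as the following sketch, in which a fitted `lm` object stands in for a trained ML model (an assumption made purely for illustration). This approach only applies where `saveRDS` is appropriate; models for which it is not would need the explicit function described in **ML5.2c**.

```r
# A fitted linear model stands in for a trained ML model.
mod <- lm (mpg ~ wt, data = mtcars)

path <- tempfile (fileext = ".Rds")
saveRDS (mod, path)     # save a local copy of the trained model
mod2 <- readRDS (path)  # re-load it, for example in a later session

# Confirm that the re-loaded model gives the same predictions as the original.
newdata <- data.frame (wt = c (2.5, 3.5))
same <- all.equal (predict (mod, newdata), predict (mod2, newdata))
stopifnot (isTRUE (same))
```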

The `R6` system for representing classes in R is an
example of a system with explicit functionality, all components of which are
accessible by a simple `ls()` call.
Adherence to **ML5.2a** would nevertheless
require explicit description of the ability of `ls()` to
supply a list of all functions associated with an object. The `mlr`
package, for example, uses `R6` classes, yet neither explicitly describes the use of `ls()` to
list all associated functions, nor explicitly lists those functions.

#### 5.8.5.2 Model Performance

Model performance refers to the quantitative assessment of a trained model when applied to a set of test data.

- **ML5.3** *Assessment of model performance should be implemented as one or more functions distinct from model training.*
- **ML5.4** *Model performance should be able to be assessed according to a variety of metrics.*
    - **ML5.4a** *All model performance metrics represented by functions internal to a package must be clearly and distinctly documented.*
    - **ML5.4b** *It should be possible to submit custom metrics to a model assessment function, and the ability to do so should be clearly documented including through example code.*
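One possible shape for an assessment function which accepts custom metrics is sketched below. The names `ml_assess()`, `metric_rmse()`, `metric_mae()`, and `max_err()` are hypothetical, and the convention that a metric is any function of observed and predicted values is an assumption of this sketch rather than a requirement of the standards.

```r
# Internal metrics, each a function of observed and predicted values (ML5.4a
# would require each of these to be documented separately).
metric_rmse <- function (obs, pred) sqrt (mean ((obs - pred) ^ 2))
metric_mae <- function (obs, pred) mean (abs (obs - pred))

# ML5.3: assessment is a function distinct from model training.
# ML5.4b: `metrics` accepts any named list of functions of (obs, pred).
ml_assess <- function (model, test_data, response,
                       metrics = list (rmse = metric_rmse, mae = metric_mae)) {
    pred <- predict (model, newdata = test_data)
    obs <- test_data [[response]]
    vapply (metrics, function (f) f (obs, pred), numeric (1))
}

train <- mtcars [1:24, ]
test <- mtcars [25:32, ]
mod <- lm (mpg ~ wt, data = train) # stands in for a trained ML model

# A user-defined metric: maximum absolute error
max_err <- function (obs, pred) max (abs (obs - pred))
res <- ml_assess (mod, test, "mpg",
                  metrics = list (rmse = metric_rmse, max_err = max_err))
```

The result is a named numeric vector with one value per metric, so internal and custom metrics can be compared directly.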

The remaining sub-sections specify general standards beyond the preceding workflow-specific ones.

### 5.8.6 Documentation

- **ML6.0** *Descriptions of ML software should make explicit reference to a workflow which separates training and testing stages, and which clearly indicates a need for distinct training and test data sets.*

The following standard applies to packages which are intended, or otherwise able, to encompass only a restricted subset of the six primary workflow steps enumerated at the outset. Envisioned here are packages explicitly intended to aid one particular aspect of the general workflow, such as implementations of ML optimization functions or specific loss measures.

- **ML6.1** *ML software intentionally designed to address only a restricted subset of the workflow described here should clearly document how it can be embedded within a typical full ML workflow in the sense considered here.*
    - **ML6.1a** *Such demonstrations should include and contrast embedding within a full workflow using at least two other packages to implement that workflow.*

### 5.8.7 Testing

#### 5.8.7.1 Input Data

- **ML7.0** *Tests should explicitly confirm partial and case-insensitive matching of “test”, “train”, and, where applicable, “validation” data.*
- **ML7.1** *Tests should demonstrate effects of different numeric scaling of input data (see **ML2.2**).*
- **ML7.2** *For software which imputes missing data, tests should compare internal imputation with explicit code which directly implements imputation steps (even where such imputation is a single step implemented via some external package). These tests serve as an explicit reference for how imputation is performed.*
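A test of the kind described in **ML7.0** might be sketched as follows. The helper `match_data_label()` is hypothetical, and the choice of `pmatch()` (which accepts unambiguous abbreviations of the canonical labels) is only one of several ways partial matching could be implemented.

```r
# Hypothetical helper mapping user-supplied labels onto canonical names of
# data sets, through partial and case-insensitive matching.
match_data_label <- function (label) {
    sets <- c ("train", "test", "validation")
    sets [pmatch (tolower (label), sets, duplicates.ok = TRUE)]
}

# Tests confirming the matching behaviour:
stopifnot (
    match_data_label ("TRAIN") == "train",     # case-insensitive
    match_data_label ("Te") == "test",         # partial
    match_data_label ("Valid") == "validation",
    identical (
        match_data_label (c ("tr", "TR", "tes")),
        c ("train", "train", "test")
    ),
    is.na (match_data_label ("t"))             # ambiguous, so not matched
)
```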

#### 5.8.7.2 Model Classes

The following standard applies to models in both untrained and trained forms,
considered to be the respective outputs of the preceding standards **ML3** and
**ML4**.

- **ML7.3** *Where model objects are implemented as distinct classes, tests should explicitly compare the functionality of these classes with functionality of equivalent classes for ML model objects from other packages.*
    - **ML7.3a** *These tests should explicitly identify restrictions on the functionality of model objects in comparison with those of other packages.*
    - **ML7.3b** *These tests should explicitly identify functional advantages and unique abilities of the model objects in comparison with those of other packages.*

#### 5.8.7.3 Model Training

- **ML7.4** *ML software should explicitly document the effects of different training rates, and in particular should demonstrate divergence from optima with inappropriate training rates.*
- **ML7.5** *ML software which implements routines to determine optimal training rates (see **ML3.4**, above) should implement tests to confirm the optimality of resultant values.*
- **ML7.6** *ML software which implements independent training “epochs” should demonstrate in tests the effects of lesser versus greater numbers of epochs.*
- **ML7.7** *ML software should explicitly test different optimization algorithms, even where software is intended to implement one specific algorithm.*
- **ML7.8** *ML software should explicitly test different loss functions, even where software is intended to implement one specific measure of loss.*
- **ML7.9** *Tests should explicitly compare all possible combinations of categorical differences in model architecture, such as different model architectures with same optimization algorithms, same model architectures with different optimization algorithms, and differences in both.*
    - **ML7.9a** *Such combinations will generally be formed from multiple categorical factors, for which explicit use of functions such as* `expand.grid()` *is recommended.*

The following example illustrates:

```
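# Three categorical factors, with 2, 3, and 3 levels respectively
architecture <- c ("archA", "archB")
optimizers <- c ("optA", "optB", "optC")
cost_fns <- c ("costA", "costB", "costC")
# One row for each of the 2 * 3 * 3 = 18 possible combinations
expand.grid (architecture, optimizers, cost_fns)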
```

```
##     Var1 Var2  Var3
## 1  archA optA costA
## 2  archB optA costA
## 3  archA optB costA
## 4  archB optB costA
## 5  archA optC costA
## 6  archB optC costA
## 7  archA optA costB
## 8  archB optA costB
## 9  archA optB costB
## 10 archB optB costB
## 11 archA optC costB
## 12 archB optC costB
## 13 archA optA costC
## 14 archB optA costC
## 15 archA optB costC
## 16 archB optB costC
## 17 archA optC costC
## 18 archB optC costC
```

All possible combinations of these categorical parameters could then be tested by iterating over the rows of that output.
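Such iteration might be sketched as follows. The function `fit_model()` is a hypothetical placeholder for whatever routine constructs and trains a model, and the `converged` flag is an invented property used only to give the tests something to check. Naming the arguments to `expand.grid()` replaces the default `Var1`, `Var2`, `Var3` column names.

```r
pars <- expand.grid (
    architecture = c ("archA", "archB"),
    optimizer = c ("optA", "optB", "optC"),
    cost_fn = c ("costA", "costB", "costC"),
    stringsAsFactors = FALSE
)

# Hypothetical placeholder for a function which trains a model with the
# specified combination of categorical parameters.
fit_model <- function (architecture, optimizer, cost_fn) {
    list (
        architecture = architecture,
        optimizer = optimizer,
        cost_fn = cost_fn,
        converged = TRUE
    )
}

# Fit one model for each row of the grid
results <- lapply (seq_len (nrow (pars)), function (i) {
    fit_model (pars$architecture [i], pars$optimizer [i], pars$cost_fn [i])
})

# One test for each combination
for (r in results) {
    stopifnot (isTRUE (r$converged))
}
```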

- **ML7.10** *The successful extraction of information on paths taken by optimizers (see **ML5.1**, above) should be tested, including tests of the general properties, but not necessarily the actual values, of such data.*
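As a minimal sketch of testing general properties of an optimizer path, the following uses plain gradient descent on the function f(x) = x^2. The function `gradient_descent()` and its arguments are hypothetical, and real ML software would expose optimizer paths through its own model objects. The same sketch also illustrates the divergence with inappropriate training rates referred to in **ML7.4**.

```r
# Gradient descent on f(x) = x^2, retaining the full path of values and losses.
gradient_descent <- function (x0, rate, niter = 50L) {
    path <- numeric (niter + 1L)
    path [1] <- x0
    for (i in seq_len (niter)) {
        path [i + 1] <- path [i] - rate * 2 * path [i] # gradient of x^2 is 2x
    }
    data.frame (iter = 0:niter, x = path, loss = path ^ 2)
}

good <- gradient_descent (x0 = 1, rate = 0.1) # suitable training rate
bad <- gradient_descent (x0 = 1, rate = 1.1)  # inappropriately high rate

# Tests of general properties of the paths, not of actual values:
stopifnot (
    nrow (good) == 51L,           # starting value plus one entry per iteration
    all (is.finite (good$loss)),
    all (diff (good$loss) <= 0),  # loss never increases with a suitable rate
    all (diff (bad$loss) > 0)     # loss grows with an inappropriate rate
)
```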

#### 5.8.7.4 Model Performance

- **ML7.11** *All performance metrics available for a given class of trained model should be thoroughly tested and compared.*
    - **ML7.11a** *Tests which compare metrics should do so over a range of inputs (generally implying differently trained models) to demonstrate relative advantages and disadvantages of different metrics.*