3.5 Software structure and design decisions

3.5.1 Data first

The primary force that drives the software’s design choices is “data”. All functions in the tsibble package take data or its variants as the first argument, a principle referred to as “data first”. This provides a consistent interface and emphasizes the significance of the data throughout the software.

Beyond the tools, the print display provides a quick and comprehensive glimpse of data in temporal context, which is particularly useful when handling a large collection of data. The contextual summary provided by the print function, shown below on the data from Table 3.1, contains (1) the data dimension with its shorthand time interval, alongside the time zone if the index is a date-time, and (2) the variables that constitute the “key”, with the number of series. These details help users understand their data better.

#> # A tsibble: 12 x 5 [1Y]
#> # Key:       country, gender [6]
#>   country     continent gender  year count
#>   <chr>       <chr>     <chr>  <dbl> <dbl>
#> 1 Australia   Oceania   Female  2011   120
#> 2 Australia   Oceania   Female  2012   125
#> 3 Australia   Oceania   Male    2011   176
#> 4 Australia   Oceania   Male    2012   161
#> 5 New Zealand Oceania   Female  2011    36
#> # … with 7 more rows
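A minimal sketch of how such a display might be produced, assuming the tsibble package is installed. It uses only the five rows visible in the display above, not the full data of Table 3.1, so the printed dimension and number of series will differ. The `key = c(country, gender)` form is an assumption about the installed version; older releases specified keys differently.

```r
library(tsibble)

# Only the five rows visible in the printed display; the remaining rows of
# Table 3.1 are not reproduced here.
tb <- tsibble(
  country = c("Australia", "Australia", "Australia", "Australia", "New Zealand"),
  continent = rep("Oceania", 5),
  gender = c("Female", "Female", "Male", "Male", "Female"),
  year = c(2011, 2012, 2011, 2012, 2011),
  count = c(120, 125, 176, 161, 36),
  key = c(country, gender), # variables that identify each series
  index = year              # variable that carries time
)

print(tb)            # header shows dimension, interval and key
keys <- key_vars(tb) # names of the key variables
n_series <- n_keys(tb) # number of distinct series
idx <- index_var(tb) # name of the index variable
```

Because the data is the first argument throughout, the same object can be passed on to further verbs without restating its key or index.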

3.5.2 Functional programming

Rolling window calculations are widely used in time series analysis, and often apply in other settings as well. These operations depend on having an ordering, in particular time ordering for temporal data. Three common variations of rolling window operations are:

  1. slide: sliding window with overlapping observations.
  2. tile: tiling window without overlapping observations.
  3. stretch: fixing an initial window and expanding to include more observations.
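The three window types above can be sketched in base R as follows. The helper names are hypothetical and written only for illustration; they are not the tsibble functions, and unlike a full implementation they return complete sliding windows only. The input is a toy sequence, not the tuberculosis data.

```r
# Hypothetical helpers for illustration only; they are not the tsibble API.
# Each returns a list of windows.

# slide: consecutive windows that overlap, moving one step at a time.
slide_windows <- function(x, size) {
  stopifnot(size >= 1, size <= length(x))
  starts <- seq_len(length(x) - size + 1)
  lapply(starts, function(i) x[i:(i + size - 1)])
}

# tile: consecutive windows that do not overlap.
tile_windows <- function(x, size) {
  stopifnot(size >= 1, length(x) >= 1)
  starts <- seq(1, length(x), by = size)
  # The last tile may be shorter when length(x) is not a multiple of size.
  lapply(starts, function(i) x[i:min(i + size - 1, length(x))])
}

# stretch: fix an initial window and grow it by one observation each step.
stretch_windows <- function(x, init) {
  stopifnot(init >= 1, init <= length(x))
  ends <- seq(init, length(x))
  lapply(ends, function(j) x[seq_len(j)])
}

x <- 1:7
slide_means <- sapply(slide_windows(x, 3), mean)     # 2 3 4 5 6
tile_means <- sapply(tile_windows(x, 3), mean)       # 2 5 7
stretch_means <- sapply(stretch_windows(x, 3), mean) # 2.0 2.5 3.0 3.5 4.0
```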

Figure 3.4 shows animations of rolling windows for sliding, tiling and stretching, respectively, on annual tuberculosis cases for Australia. A window of 5 consecutive elements is initialized and then rolls sequentially to the end of the series, computing the average count at each step.

Figure 3.4: An illustration of a window of size 5 computing rolling averages over annual tuberculosis cases in Australia, for sliding, tiling and stretching. (Animation needs to be viewed with Adobe Acrobat Reader.)

Rolling window functions follow the functional programming paradigm, which sets them apart from the table verbs listed in Table 3.2. Table verbs expect and return a tsibble, and do what the function name suggests. In contrast, rolling window functions can accept arbitrary input types and return arbitrary output types, depending on the function supplied to the rolling window. For example, computing moving averages requires numerics and a function like mean(), and produces averaged numerics. Rolling window regression, however, takes a data frame and a linear regression method like lm(), and generates a complex object that contains coefficients, fitted values, and so on.
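The contrast between the two examples can be sketched in base R. The helper `roll_apply()` is hypothetical, written for this illustration rather than taken from tsibble, and the data are made-up toy values. The same rolling mechanism yields plain numbers in one case and fitted model objects in the other, purely because a different function is supplied.

```r
# Hypothetical helper (not the tsibble API): apply f to the row indices of
# each complete window of the given size.
roll_apply <- function(n, size, f) {
  stopifnot(size >= 1, size <= n)
  lapply(seq_len(n - size + 1), function(i) f(i:(i + size - 1)))
}

# Toy data, made up for illustration.
dat <- data.frame(t = 1:6, y = c(2, 4, 6, 8, 10, 12))

# Numeric input with mean(): each window produces a single number.
ma <- unlist(roll_apply(nrow(dat), 3, function(idx) mean(dat$y[idx])))

# Data frame input with lm(): each window produces a fitted model object.
fits <- roll_apply(nrow(dat), 3, function(idx) lm(y ~ t, data = dat[idx, ]))

# Coefficients can be extracted from the list of models afterwards.
slopes <- vapply(fits, function(m) unname(coef(m)[["t"]]), numeric(1))
```

Here `ma` is a numeric vector of window averages, while `fits` is a list of `lm` objects from which coefficients or fitted values can be pulled as needed.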

The purrr package (Henry and Wickham 2018) provides a good example of functional programming in R. It offers a complete and consistent set of tools to iterate over each element of a vector with a function. The rolling window functions, namely slide(), tile() and stretch(), do not just iterate over single elements but roll over a sequence of elements. slide() expects one input, slide2() two inputs, and pslide() multiple inputs. For type stability, these functions always return lists. Other variants, including *_lgl(), *_int(), *_dbl() and *_chr(), return vectors of the corresponding type, while *_dfr() and *_dfc() row-bind and column-bind data frames respectively. Their multiprocessing equivalents, prefixed by future_*(), enable rolling in parallel (Bengtsson 2019; Vaughan and Dancho 2018a). This family of functions empowers users to incorporate window-related operations in their workflows.
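The naming scheme described above can be illustrated with a base R sketch. The `my_*` functions are hypothetical stand-ins written for this example; they mimic the list-returning, typed and two-input variants but are not the package's implementations, and they return complete windows only, whereas the real functions may treat incomplete windows differently.

```r
# Hypothetical stand-ins for illustration; not the tsibble implementations.

# One input, always returns a list (type stable regardless of what .f returns).
my_slide <- function(.x, .f, ..., .size = 1) {
  stopifnot(.size >= 1, .size <= length(.x))
  starts <- seq_len(length(.x) - .size + 1)
  lapply(starts, function(i) .f(.x[i:(i + .size - 1)], ...))
}

# Typed variant: returns a double vector, and fails if .f does not return
# a single number for some window.
my_slide_dbl <- function(.x, .f, ..., .size = 1) {
  vapply(my_slide(.x, .f, ..., .size = .size), identity, numeric(1))
}

# Two inputs rolled in parallel over the same window positions.
my_slide2 <- function(.x, .y, .f, ..., .size = 1) {
  stopifnot(length(.x) == length(.y), .size >= 1, .size <= length(.x))
  starts <- seq_len(length(.x) - .size + 1)
  lapply(starts, function(i) {
    idx <- i:(i + .size - 1)
    .f(.x[idx], .y[idx], ...)
  })
}

x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 10)

res_list <- my_slide(x, mean, .size = 2)    # list of 4 numbers
res_dbl <- my_slide_dbl(x, mean, .size = 2) # 1.5 2.5 3.5 4.5
res_cor <- my_slide2(x, y, cor, .size = 3)  # rolling correlation, list of 3
```

Returning a list by default keeps the output type predictable whatever the supplied function returns; the typed variant trades that generality for a plain vector and an early error on unexpected results.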

3.5.3 Modularity

Modular programming is adopted in the design of the tsibble package. Modularity benefits users by providing small, focused and manageable chunks, and benefits developers through simpler maintenance.

All user-facing functions can be roughly organized into three major chunks according to their functionality: vector functions (1d), table verbs (2d), and the window family. Each chunk is an independent module, but the modules work together. Vector functions in the package mostly deal with time. Atomic functions (such as yearmonth() and yearquarter()) embedded in the index_by() table verb collapse a tsibble to a less granular interval. Substituting another time function in index_by() aggregates to a different time resolution. Since these time functions are not exclusive to a tsibble, they can be used in a variety of applications in conjunction with other packages. Conversely, the tsibble verbs can incorporate many third-party vector functions, extending beyond what tsibble itself provides. Separating 1d and 2d functions also generally makes it easier to trace the errors that users encounter, and improves code readability.
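A minimal sketch of this interplay between vector functions and table verbs, assuming the tsibble and dplyr packages are installed. The daily series is made up for illustration (a constant value of 1 per day over the first quarter of 2019), so the aggregated totals simply count days.

```r
library(tsibble)
library(dplyr)

# Toy daily data: 90 days from 2019-01-01 to 2019-03-31, one unit per day.
daily <- tsibble(
  date = as.Date("2019-01-01") + 0:89,
  value = rep(1, 90),
  index = date
)

# The vector function yearmonth() inside the table verb index_by()
# collapses daily data to monthly totals.
monthly <- daily %>%
  index_by(month = yearmonth(date)) %>%
  summarise(total = sum(value))

# Swapping in yearquarter() changes the resolution; nothing else changes.
quarterly <- daily %>%
  index_by(quarter = yearquarter(date)) %>%
  summarise(total = sum(value))
```

The only difference between the two pipelines is the time function passed to `index_by()`, which is the practical payoff of keeping 1d and 2d functions separate.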

3.5.4 Extensibility

Because tsibble serves as fundamental infrastructure, extensibility was a design decision from the start of its development. In contrast to the “data first” principle for end users, extensibility is developer focused and would mostly be used in dependent packages; it relies heavily on S3 classes and methods in R (H. Wickham 2018). The package can be extended in two major respects: custom indexes and new tsibble classes.

Time can be represented in arbitrary ways, for example R’s native POSIXct and Date for versatile date-times, nano time for nanosecond resolution as implemented in nanotime (Eddelbuettel and Silvestri 2018), and pure numbers in simulations. Ordered factors can also be a source of time, such as month names (January to December) and weekdays (Monday to Sunday). The tsibble package supports an extensive range of index types, from numerics to nano time, but some applications call for custom indexes, for example school semesters. These academic terms vary from one institution to another, and fall within an academic year that is defined differently from a calendar year. A new index is recognized by the software once index_valid() is defined for it, as long as it can be ordered from past to future. The interval for semesters is then specified through pull_interval(). As a result, other methods in the software, such as has_gaps() and fill_gaps(), immediately support data that contains this new index.
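The S3 mechanism behind this kind of extension can be sketched in base R without depending on tsibble. Everything below is hypothetical: `index_ok()` is a stand-in generic playing the role that index_valid() plays in the package, and the `semester` class assumes two terms per academic year purely for illustration. It shows the opt-in pattern only, not the package's actual method signatures.

```r
# Stand-in generic (hypothetical), playing the role of index_valid().
index_ok <- function(x) UseMethod("index_ok")
index_ok.default <- function(x) FALSE # unknown types are rejected
index_ok.Date <- function(x) TRUE
index_ok.numeric <- function(x) TRUE

# Hypothetical semester class, assuming two terms per academic year.
# Stored as a running count of semesters so that it orders from past to future.
new_semester <- function(year, term) {
  stopifnot(all(term %in% c(1, 2)))
  structure(year * 2 + (term - 1), class = "semester")
}

format.semester <- function(x, ...) {
  n <- unclass(x)
  paste0(n %/% 2, " S", n %% 2 + 1)
}

# A downstream package opts in by defining one method for its class.
index_ok.semester <- function(x) TRUE

sem <- new_semester(c(2018, 2018, 2019), c(1, 2, 1))
ok_sem <- index_ok(sem)       # TRUE: recognized after opting in
ok_chr <- index_ok("2018 S1") # FALSE: plain strings fall to the default
labels <- format(sem)         # "2018 S1" "2018 S2" "2019 S1"
steps <- diff(unclass(sem))   # 1 1: consecutive semesters are one unit apart
```

The constant step between consecutive values is what an interval method would report for such an index.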

The tsibble class is a foundation for temporal data, and there is demand for sub-classing a tsibble. For example, a fable, mentioned in Section 3.4.3, is an extension of a tsibble. The low-level constructor new_tsibble() provides a vehicle to easily create a new subclass. The new object is itself a tsibble, but it may need more metadata than a tsibble carries, which gives rise to a new data extension, such as the prediction distributions in a fable. Tsibble verbs are also S3 generics, so developers can implement these verbs for the new class if necessary.
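A minimal sketch of sub-classing, assuming the tsibble package is installed and that new_tsibble() accepts extra attributes as name-value pairs plus a `class` argument, as in the releases I am aware of. The class name `my_forecast` and the `model` attribute are hypothetical, and the data are toy values.

```r
library(tsibble)

# Toy data, made up for illustration.
base <- tsibble(
  year = 2011:2013,
  value = c(10, 12, 11),
  index = year
)

# Create a subclass that carries one extra piece of metadata.
# "my_forecast" and "model" are hypothetical names.
fc <- new_tsibble(base, model = "naive", class = "my_forecast")

cls <- class(fc)                   # subclass first, then the tsibble classes
still_tsibble <- is_tsibble(fc)    # the new object is itself a tsibble
model_used <- attr(fc, "model")    # extra metadata travels with the object
```

Because the object still inherits from the tsibble class, existing verbs continue to apply, and a developer only needs to write methods where the subclass should behave differently.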

3.5.5 Tidy evaluation

The tsibble package leverages the tidyverse grammars and pipelines through tidy evaluation (Henry and Wickham 2019b) via the rlang package (Henry and Wickham 2019a). In particular, the table verbs make extensive use of tidy evaluation to evaluate computation in the context of tsibble data, and spotlight the “tidy” interface that is compatible with the tidyverse. This not only saves keystrokes by avoiding repeated explicit references to the data source, but also typically makes the resulting code cleaner and more expressive.
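The effect of tidy evaluation can be illustrated with dplyr on a plain data frame, assuming dplyr and a version of rlang that supports the `{{ }}` operator are installed. The rows are taken from the display in Section 3.5.1; the wrapper `total_by()` is a hypothetical helper written for this example.

```r
library(dplyr)

# Three rows taken from the printed display in Section 3.5.1.
tb <- data.frame(
  country = c("Australia", "Australia", "New Zealand"),
  year = c(2011, 2012, 2011),
  count = c(120, 125, 36)
)

# Without tidy evaluation: the data source is repeated for every column.
base_way <- tb[tb$count > 100 & tb$year == 2011, ]

# With tidy evaluation: column names are evaluated in the context of the data.
tidy_way <- tb %>% filter(count > 100, year == 2011)

# Hypothetical wrapper: {{ }} forwards the caller's column names to the verbs.
total_by <- function(data, group, value) {
  data %>%
    group_by({{ group }}) %>%
    summarise(total = sum({{ value }}))
}

totals <- total_by(tb, country, count)
```

Both filtering approaches select the same row; the tidy version reads closer to the question being asked of the data.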