3.3 Contextual semantics

The choice of a tidy representation of temporal data arises from a data-centric perspective, which accommodates all of the operations that are to be performed on the data. Figure 3.1 marks where this new abstraction, which we refer to as a “tsibble”, is placed in the tidy model. The tsibble structure is an extension of a data frame (a two-dimensional tabular structure in R) with additional time series semantics: index and key, as shown in Figure 3.2.

To demonstrate the concept of the tsibble, Table 3.1 presents a subset of tuberculosis cases estimated by the World Health Organization (2018). It contains 12 observations and 5 variables arranged in a “long” tabular form. Each observation comprises the number of people diagnosed with tuberculosis for each gender in three selected countries in 2011 and 2012. To turn this data into a tsibble: (1) column year is declared as the index variable; (2) the key is specified to consist of columns country and gender. The column count is the only measured variable in this data, but the structure is sufficiently flexible to hold other measured variables; for example, the corresponding population size (if known) could be added in order to normalize the count later.

Table 3.1: A small subset of estimates of tuberculosis burden generated by the World Health Organization for 2011 and 2012, with 12 observations and 5 variables. The index refers to column year, the key to multiple columns (country and gender), and the measured variable to column count.
country                   continent  gender  year  count
Australia                 Oceania    Female  2011    120
Australia                 Oceania    Female  2012    125
Australia                 Oceania    Male    2011    176
Australia                 Oceania    Male    2012    161
New Zealand               Oceania    Female  2011     36
New Zealand               Oceania    Female  2012     23
New Zealand               Oceania    Male    2011     47
New Zealand               Oceania    Male    2012     42
United States of America  Americas   Female  2011   1170
United States of America  Americas   Female  2012   1158
United States of America  Americas   Male    2011   2489
United States of America  Americas   Male    2012   2380
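
The declaration of index and key for Table 3.1 can be illustrated with a minimal, language-agnostic sketch. It is written in plain Python rather than with the tsibble package, and the names INDEX, KEY and split_into_series are hypothetical, introduced only for this illustration.

```python
# Rows of Table 3.1, in the column order given by COLUMNS.
COLUMNS = ("country", "continent", "gender", "year", "count")
DATA = [
    ("Australia", "Oceania", "Female", 2011, 120),
    ("Australia", "Oceania", "Female", 2012, 125),
    ("Australia", "Oceania", "Male", 2011, 176),
    ("Australia", "Oceania", "Male", 2012, 161),
    ("New Zealand", "Oceania", "Female", 2011, 36),
    ("New Zealand", "Oceania", "Female", 2012, 23),
    ("New Zealand", "Oceania", "Male", 2011, 47),
    ("New Zealand", "Oceania", "Male", 2012, 42),
    ("United States of America", "Americas", "Female", 2011, 1170),
    ("United States of America", "Americas", "Female", 2012, 1158),
    ("United States of America", "Americas", "Male", 2011, 2489),
    ("United States of America", "Americas", "Male", 2012, 2380),
]
rows = [dict(zip(COLUMNS, values)) for values in DATA]

# Hypothetical declarations mirroring the text: one index column and a
# key made up of two columns.
INDEX = "year"
KEY = ("country", "gender")


def split_into_series(rows, key, index):
    """Group rows by their key values and order each group by the index."""
    series = {}
    for row in rows:
        key_value = tuple(row[name] for name in key)
        series.setdefault(key_value, []).append(row)
    for group in series.values():
        group.sort(key=lambda row: row[index])
    return series


series = split_into_series(rows, KEY, INDEX)
print(len(series))  # 3 countries x 2 genders = 6 series
```

Each of the six groups is one time series with two yearly observations, which is the view of the table that the index and key declarations imply.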

The new data structure, tsibble, bridges the gap between raw temporal data and model inputs. Contextual semantics are introduced to tidy data in order to support more intuitive time-related manipulations and to open up new perspectives on time series model inputs. Index, key and time interval are the three pillars of this new semantically structured temporal data. Each is now described in more detail.

3.3.1 Index

Time provides a contextual basis for temporal data. A variable representing time is essential for a tsibble, and is referred to as an “index”. The “index” is an intact data column rather than a masked attribute, which makes time visible and accessible to users. This is highly advantageous when manipulating time. For example, one could easily extract time components, such as time of day and day of week, from the index to visualize seasonal effects of response variables. One could also join other data sources to the tsibble based on common time indexes. The accessibility of the tsibble index encourages transparent and human-readable data analysis. When the “index” is available only as meta information (such as in the ts class), analysts face an obstacle to writing these simple queries in a programmatic manner; from an analytic point of view, such a design should be discouraged.
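
The two manipulations mentioned above, extracting a time component and joining on a common time index, can be sketched as follows. This is a plain Python illustration, not the tsibble API; the readings and the notes lookup are made-up values used only to show the mechanics.

```python
from collections import defaultdict
from datetime import date

WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")

# Made-up readings, for illustration only: (index, measured value).
readings = [
    (date(2024, 1, 1), 10.0),
    (date(2024, 1, 2), 12.0),
    (date(2024, 1, 8), 14.0),
    (date(2024, 1, 9), 16.0),
]

# Because the index is an ordinary column, a time component such as the
# day of week can be derived from it directly and used for grouping.
by_weekday = defaultdict(list)
for day, value in readings:
    by_weekday[WEEKDAYS[day.weekday()]].append(value)
weekday_means = {name: sum(vals) / len(vals) for name, vals in by_weekday.items()}

# A second, hypothetical data source sharing the same time index can be
# joined by looking up each index value.
notes = {date(2024, 1, 1): "public holiday"}
joined = [(day, value, notes.get(day)) for day, value in readings]

print(weekday_means)
```

Neither step needs access to hidden metadata, which is the practical benefit of keeping the index as a visible column.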

Many different time representations are found in the wild. A date-time object, universally accepted across computing systems, is the most commonly used type for representing time. A date-time is also typically associated with a time zone, including adjustments such as summer time. This diversity, including time zones, is acknowledged and accommodated by tsibble’s index. When creating a tsibble, time indexes are arranged from past to future within each series, giving the strict temporal ordering that is assumed by time series operations.

3.3.2 Key

The “key” specification is the second essential ingredient for a tsibble. The “key” uniquely identifies observations that are recorded over time in a data table. It is similar to a primary key (Codd 1970) defining each observation in a relational database. Multiple time series are often structured in wide format, where each column holds one series of values, so that the column name implicitly identifies the series. In long format, these columns are melted, with their names converted to “key” values. However, the “key” provides much more flexibility. It is not constrained to a single field, but can be composed from multiple fields. The identifying variables from which the “key” is constituted remain the same as in the original table with no further tweaks.
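
The relationship between wide and long format can be sketched in a few lines of plain Python. The values are the Australian counts from Table 3.1; the melt helper and its argument names are hypothetical, written only for this illustration.

```python
# Wide format: the index column plus one column per series.
wide = {
    "year": [2011, 2012],
    "Female": [120, 125],
    "Male": [176, 161],
}


def melt(wide, index, key_name, value_name):
    """Turn each non-index column into rows, with its name as the key value."""
    long_rows = []
    for column, values in wide.items():
        if column == index:
            continue
        for time, value in zip(wide[index], values):
            long_rows.append({key_name: column, index: time, value_name: value})
    return long_rows


long_rows = melt(wide, index="year", key_name="gender", value_name="count")
for row in long_rows:
    print(row)
```

The column names Female and Male, which identified the series implicitly in the wide table, become explicit values of the gender key in the long table.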

The “key” is usually known a priori by analysts. For example, Table 3.1 describes the number of tuberculosis cases for each gender across the countries every year. This data description suggests that columns gender and country have to be declared as the key, similar to a panel variable for longitudinal data. Omitting either of the two is inadequate, because the observations would no longer be uniquely identified, and thus the tsibble construction would fail. An alternative specification of the key for this data is to include a third variable, continent. Since country is nested within continent, continent can be freely added to the key without affecting uniqueness. This variable brings additional information that can be used for forecast reconciliation (Hyndman and Athanasopoulos 2017). The key needs to be explicit when multiple units exist in the data. The key can be implicit when the table contains a single univariate series, but it cannot be absent from a tsibble.
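
The uniqueness requirement can be made concrete with a small check on the identifying columns of Table 3.1. This is a sketch in plain Python under the assumption that validity means each combination of key values and index value occurs at most once; duplicated_identifiers is a hypothetical helper, not a tsibble function.

```python
from collections import Counter
from itertools import product

# Identifying columns of Table 3.1 (the measured variable is not needed here).
CONTINENT = {
    "Australia": "Oceania",
    "New Zealand": "Oceania",
    "United States of America": "Americas",
}
rows = [
    {"country": c, "continent": CONTINENT[c], "gender": g, "year": y}
    for c, g, y in product(CONTINENT, ("Female", "Male"), (2011, 2012))
]


def duplicated_identifiers(rows, key, index):
    """Return the (key..., index) combinations that occur more than once."""
    counts = Counter(tuple(row[col] for col in (*key, index)) for row in rows)
    return [ident for ident, n in counts.items() if n > 1]


# The full key identifies every observation uniquely.
print(duplicated_identifiers(rows, ("country", "gender"), "year"))
# Dropping gender leaves each country-year pair appearing twice.
print(duplicated_identifiers(rows, ("country",), "year"))
# Adding continent keeps the key valid, since country is nested within it.
print(duplicated_identifiers(rows, ("continent", "country", "gender"), "year"))
```

A key that yields a non-empty list of duplicates is one for which construction should fail, as described above.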

The “key” also provides a link between the data, models, and forecasts. This neatly decouples the data from models and forecasts, leaving more room for necessary model components, such as coefficients, fitted values and residuals. More details are given in Section 3.4.3.

3.3.3 Interval

One of the cornerstones of time series data, and hence of a tsibble, is the time interval. This information plays a critical role in computing statistics (e.g. seasonal unit root tests) and building models (e.g. seasonal ARIMA). The principal divide is between regularly and irregularly spaced observations in time. Because a tsibble permits implicit missing time, regularity is difficult to determine from the index alone. A tsibble therefore relies on the user’s specification: the regular argument is switched off when the data involve irregular intervals. This type of data can flow into event-based data modeling, but would need to be processed or regularized to fit models that expect time series.

For data indexed at regular time intervals, the time interval is automatically calculated by first computing the absolute differences of the time indexes and then finding their greatest common divisor. This covers all conceivable cases, assuming that all observations in a tsibble share a single interval. Data collected at different intervals should be organized in separate tsibbles, encouraging well-tailored analysis and models, because each set of observations may have a different underlying data generating process.
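
The interval calculation described above can be sketched as follows, assuming the index of one series has already been mapped to integers (for example years, or seconds since some origin). This is a plain Python illustration of the greatest-common-divisor idea, not tsibble's implementation, and common_interval is a hypothetical name.

```python
from functools import reduce
from math import gcd


def common_interval(index_values):
    """Greatest common divisor of absolute differences between neighbours.

    Returns 0 when fewer than two distinct index values are present, in
    which case the interval cannot be determined.
    """
    diffs = [abs(b - a) for a, b in zip(index_values, index_values[1:])]
    return reduce(gcd, diffs, 0)


print(common_interval([2011, 2012]))        # yearly data
print(common_interval([2010, 2012, 2016]))  # implicit gap at 2014
print(common_interval([0, 1800, 5400]))     # seconds, with a missing point
```

The second and third calls show why the divisor is used rather than the smallest difference alone: implicit gaps produce differences that are multiples of the interval. The same property explains why regularity cannot be inferred from the index, since an irregular index such as 1, 2, 4, 8 also yields a common divisor of 1.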