3.2 Data structures
3.2.1 Time series and longitudinal data
Temporal data problems are typically grouped into two types of analysis: time series and longitudinal. Although the data input is exactly the same, the representations of time series and longitudinal data diverge because the modeling approaches differ.
Time series can be univariate or multivariate, and modeling them requires relatively long series (i.e., large \(T\)). Time series researchers and analysts who work with this large \(T\) property are mostly concerned with stochastic processes, with the primary purposes of forecasting and characterizing temporal dynamics. Most statistical software represents such time series as vectors or matrices. Multivariate time series are typically assumed to be in a format where each row holds the observations at one time point and each column contains a single time series. (The tidy data name for this would be wide format.) This implies that the data are columns of a homogeneous type, numeric or non-numeric, although supporting methods for non-numeric variables are limited. In addition, time indexes are stripped off the data and implicitly inferred from attributes or meta-information. There is a strict requirement that the number of observations be the same across all the series. Wrangling data from the form in which they arrive into this specialist format can be frustrating and difficult, inhibiting the performance of downstream tasks.
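As a minimal illustration of the wide, matrix-backed representation described above, the following base R sketch builds a two-column multivariate `ts` object. The series names `sales` and `costs` and the simulated values are hypothetical and serve only to show the structure.

```r
# Two hypothetical monthly series of equal length, stored column-wise.
set.seed(1)
wide <- ts(
  cbind(sales = rnorm(24, mean = 100), costs = rnorm(24, mean = 80)),
  start = c(2020, 1), frequency = 12
)

dim(wide)        # one row per time point, one column per series
is.matrix(wide)  # the values sit in a single matrix
typeof(wide)     # every cell shares one storage type ("double")
colnames(wide)   # only series names appear; there is no time column
```

Because the values live in one matrix, both series must cover the same time points and share the same storage type.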
For longitudinal analysis, researchers and analysts are primarily interested in explaining trends across individuals and variations among them, and in making inferences about a broader population. Longitudinal data, or panel data, typically comprise fewer measurements (small \(T\)) over a large number of individuals (large \(N\)). Measurements for different individuals are often taken at different time points, resulting in an unbalanced panel. Thus, the primary format required for modeling such data is stacked series: blocks of measurements for each individual, with columns indicating the individual, the times of measurement and the measurements themselves. (The tidy data name for this would be long format.) Compared with the wide format, which would contain missing values in many cells, this organization saves storage space when the data are sparse. A drawback of this format is that information unique to each individual is often repeated for all time points. An appealing feature is that the data are structured in a semantic manner with reference to observations and variables, with the time index stated explicitly. This opens the door to easily operating on time to make calculations and extract different temporal components, such as month and day of the week. It is conducive to examining the data in many different ways and leads to more comprehensive exploration and forecasting.
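The long format described above can be sketched in base R as follows. The data frame `long`, its column names and its values are hypothetical; the point is that an unbalanced panel fits naturally, and that an explicit index column makes temporal components easy to derive.

```r
# Hypothetical unbalanced panel: individual "a" has three visits, "b" has two.
long <- data.frame(
  id    = c("a", "a", "a", "b", "b"),
  date  = as.Date(c("2021-01-04", "2021-02-01", "2021-03-01",
                    "2021-01-11", "2021-03-15")),
  value = c(5.1, 5.4, 5.0, 7.2, 6.9)
)

# The index is an ordinary column, so temporal components are easy to extract.
long$month   <- as.integer(format(long$date, "%m"))
long$weekday <- as.POSIXlt(long$date)$wday  # 0 = Sunday, 1 = Monday, ...

# Spreading the same data into wide form (dates by individuals) leaves
# empty cells wherever an individual was not measured on a given date.
wide_cells <- tapply(long$value, list(long$date, long$id), mean)
sum(is.na(wide_cells))  # number of empty cells in the wide layout
```

Here the five stacked rows would occupy a five-by-two grid in wide form, half of which is empty.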
3.2.2 Tidy data and the grammar of data manipulation
Wickham (2014) coined the term “tidy data”, which is a rephrasing of the second and third normal forms in relational databases, but in a way that makes more sense to data scientists by mapping rows to observations and columns to variables. The principles of “tidy data” attempt to standardize the mapping of the semantics of a dataset to its physical representation. This data structure is the fundamental unit of the tidyverse, which is a collection of R packages designed for data science. The ubiquitous use of the tidyverse is testament to the simplicity, practicality and general applicability of the tools. The tidyverse provides abstract yet functional grammars to manipulate and visualize data in an easier-to-comprehend form. One of the tidyverse packages, dplyr (H. Wickham, François, et al. 2018), showcases the value of a grammar as a principled vehicle to transform data for a wide range of data challenges, providing a consistent set of verbs: filter(), select(), mutate(), summarise() and arrange(). Each verb focuses on a single task. Most common data tasks can be rephrased and tackled with these five key verbs, by composing them sequentially.
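The sequential composition of verbs can be sketched with a small pipeline. The tibble `visits` and its columns are hypothetical; the verbs themselves, along with `group_by()`, `n()` and `desc()`, are standard dplyr functions.

```r
library(dplyr)

# Hypothetical long-format measurements for two individuals.
visits <- tibble(
  id    = c("a", "a", "a", "b", "b"),
  date  = as.Date(c("2021-01-04", "2021-02-01", "2021-03-01",
                    "2021-01-11", "2021-03-15")),
  value = c(5.1, 5.4, 5.0, 7.2, 6.9),
  note  = c("ok", "ok", "late", "ok", "ok")
)

result <- visits %>%
  filter(date >= as.Date("2021-02-01")) %>%          # keep rows by condition
  select(id, date, value) %>%                        # keep columns by name
  mutate(month = as.integer(format(date, "%m"))) %>% # derive a new column
  group_by(id) %>%
  summarise(mean_value = mean(value),                # collapse to one row per id
            last_month = max(month),
            n = n()) %>%
  arrange(desc(mean_value))                          # reorder rows

result
```

Each step takes a data frame and returns a data frame, which is what allows the verbs to be chained in any order the task requires.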
The tidyverse largely formalizes exploratory data analysis. Many in the R community have adopted the tidyverse way of thinking and extended it to broader domains, such as simple features for spatial data in the sf package (Pebesma 2018) and missing value handling in the naniar package (Tierney and Cook 2018). Temporal data tools need to catch up.
3.2.3 Existing time series standards in R
Current standards, provided by the native ts object in R, and extended by zoo (Zeileis and Grothendieck 2005) and xts (Ryan and Ulrich 2018), assemble temporal data into matrices with implicit time indexes. These objects were designed for modeling methods. A diagram in the style of Figure 3.1 would place the model at the center of the analytical universe, and all the transformations and visualizations would hinge on that format. This is contrary to the tidyverse conceptualization, which holistically captures the full data workflow.
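To make the notion of an implicit time index concrete, the base R sketch below inspects a small quarterly `ts` object and then rebuilds the index as explicit columns. The values and the data frame name `explicit` are hypothetical.

```r
# A hypothetical quarterly series starting in the first quarter of 2021.
x <- ts(c(12, 15, 11, 18, 14, 16), start = c(2021, 1), frequency = 4)

names(attributes(x))  # "tsp" and "class": no time values are stored
tsp(x)                # start, end and frequency, from which time is inferred

# Making the index explicit requires reconstructing it from the attribute.
explicit <- data.frame(
  year    = as.integer(floor(time(x))),
  quarter = as.integer(cycle(x)),
  value   = as.numeric(x)
)
explicit
```

The object stores only the start, end and frequency, so any operation that needs actual dates must first recompute them, as done here with `time()` and `cycle()`.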
A new temporal data class is needed upstream in the workflow, one that can feed all the downstream modules. A relatively new R package, tibbletime (Vaughan and Dancho 2018b), proposed a time tibble data class to represent temporal data in a heterogeneous tabular format. It requires only an index variable to declare a temporal data object, thus placing it at the import stage. However, as proposed in Section 3.3, a more rigid data structure is required for time series analytics and models.
This paper describes a new tidy representation for temporal data, and a unified framework to streamline the workflow from data preprocessing to visualization and forecasting, as an integral part of a tidy data analysis.