3.2 Data structures

3.2.1 Comparing time series and longitudinal data

Temporal data problems often fall into two types of analysis, time series and longitudinal. Both of these may have similar data input, but the representation for modeling is typically different. Time series analysis tends to focus on the dependency within series, and the cross-correlation between series. Longitudinal analysis tends to focus on overall temporal patterns across demographic or experimental treatment strata, that incorporates within subject dependency.

Time series can be univariate or multivariate, and require relatively long lengths (i.e., large $T$ ) for modeling. With this large $T$ property, the series can be handled as stochastic processes for the primary purposes of forecasting and characterizing temporal dynamics. Due to an expectation of regularly spaced time, and equal lengths across series, multivariate time series are typically assumed to be in the format where each column contains a single time series, and time is specified implicitly. This also implies that data are columns of homogeneous types: either all numeric or all non-numeric. It can be frustrating to wrestle data from its original format to this modeling format. The format could be considered to be model-centric, rather than data-centric, and thus throws the analyst into the deep end of the pool, rather than allowing them to gently wade to the modeling stage from the shallow end. The expectation is that the “model” is at the center of the analytical universe. This is contrary to the tidyverse conceptualization (Figure 3.1), which holistically captures the data workflow. More support needs to be provided, in the form of consistent tools and data structures, to transform the data into the analytical cycle.

Longitudinal data (or panel data) typically assumes fewer measurements (small $T$ ) over a large number of individuals (large $N$ ). It often occurs that measurements for individuals are taken at different time points, and in different quantities. The primary format required for modeling is stacked data, blocks of measurements for each individual, with columns indicating panels, times of measurement and the measurements themselves. An appealing feature is that data is structured in a semantic manner with reference to observations and variables, with panel and time explicitly stated.

3.2.2 Existing data standards

In R (R Core Team 2018), time series and longitudinal data are of different representations. The native ts object and the enhancements by zoo (Zeileis and Grothendieck 2005) and xts (Ryan and Ulrich 2018), assemble time series into wide matrices with implicit time indexes. If there are multiple sub-groups, such as country or product type, these would be kept in different data objects. A relatively new R package tibbletime (Vaughan and Dancho 2018 b) proposed a data class of time tibble to represent time series in heterogeneous long format. It only requires an index variable to be declared. However, this is insufficient, and a more rigid data structure is required for temporal analytics and modeling. The plm (Croissant and Millo 2008) and panelr (Long 2019) packages both manage longitudinal data in long format.

Stata (StataCorp 2017) provides two commands, tsset and xtset, to declare time series and panels respectively, both of which require explicit panel id and time index specification. Different variables would be stored in multiple columns. The underlying data arrangement is only long form, for both types of data. Both groups of functions can be applied interchangeably to whether the data is declared for time series or longitudinal data. The SAS software (SAS Institute Inc. 2018) also handles both types of data in the same way as Stata.

3.2.3 Tidy data

Wickham (2014) coined the term “tidy data”, to standardize the mapping of the semantics of a dataset to its physical representation. In tidy form, rows correspond to observations and columns to variables. Tidy data is a rephrasing of the second and third normal forms from relational databases, but the explanation in terms of observations and variables is easier to understand because it uses statistical terminology.

Multiple time series, with each column corresponding to a measurement is tidy data when the time index is explicitly stored in a column. The stacked data format used in longitudinal data is tidy, and accommodates explicit identification of sub-groups.

The tidy data structure is the fundamental unit of the tidyverse, which is a collection of R packages designed for data science. The ubiquitous use of the tidyverse is testament to the simplicity, practicality and general applicability of the tools. The tidyverse provides abstract yet functional grammars to manipulate and visualize data in easier-to-comprehend form. One of the tidyverse packages, dplyr (H. Wickham, François, et al. 2019), showcases the value of a grammar as a principled vehicle to transform data for a wide range of data challenges, providing a consistent set of verbs: mutate(), select(), filter(), summarize(), and arrange(). Each verb focuses on a singular task. Most common data tasks can be rephrased and tackled with these five key verbs, in conjunction with group_by() to perform grouped operations.

The tidyverse largely formalizes exploratory data analysis. Many in the R community have adopted the tidyverse way of thinking and extended it to broader domains, such as simple features for spatial data in the sf package (Pebesma 2018) and missing value handling in the naniar package (Tierney and Cook 2018). This paper with the associated tsibble R package (Wang, Cook, and Hyndman 2019 c) extends the tidy way of thinking to temporal data.

For temporal data, the tidy definition needs additional criteria, that assist in handling the time context. This is addressed in the next section, and encompasses both time series and longitudinal data. It provides a unified framework to streamline the workflow from data preprocessing to visualization and modeling, as an integral part of a tidy data analysis.