3.4 Temporal data pipelines
A data pipeline describes the flow of data through an analysis, and can generally assist in conceptualizing the process for a stream of problems. McIlroy, Pinson, and Tague (1978) coined the term “pipelines” in software development while developing Unix at Bell Labs. In Unix-based computer operating systems, a pipeline chains together a series of operations based on their standard streams, so that the output of each program becomes the input to the next. The Extract, Transform, and Load (ETL) process, described in the data warehousing literature (Kimball and Caserta 2011), outlines the workflow to prepare data for analysis, and can also be considered a data pipeline. Buja et al. (1988) describe a viewing pipeline for interactive statistical graphics, which takes control of the transformation from data to plot. Deborah F. Swayne, Cook, and Buja (1998), Swayne et al. (2003), Sutherland et al. (2000), Wickham et al. (2010), and Xie, Hofmann, and Cheng (2014) implemented data pipelines for the interactive statistical software XGobi, GGobi, Orca, plumbr and cranvas, respectively.
A fluent data pipeline anticipates a standard data structure. The tsibble data abstraction lays the plumbing for data analysis modules of transformation, visualization and modeling in temporal contexts. It provides a data infrastructure to a new ecosystem, tidyverts (Tidyverts Team 2019). (The name “tidyverts” is a play on the term “tidyverse” that acknowledges the time series analysis purpose.)
3.4.1 Transformation
The tsibble package not only provides the tsibble data object but also a domain-specific language in R for transforming temporal data. It takes advantage of the wrangling verbs implemented in the dplyr package, and develops a suite of new tools for temporal manipulation, primarily easing two aspects: handling implicit missingness and time-aware aggregation.
Implicit missing values are values that should be present but are absent. In regularly spaced temporal data, these are data entries that should be available at certain time points but are missing, leaving gaps in time. They can be detected when computing the interval estimate, and are a problem for temporal models and operations like lag/lead, which expect consecutive time. A family of verbs is provided to help explore implicit missing values and convert them into an explicit state, as follows:
has_gaps() checks the existence of time gaps.
scan_gaps() reveals all implicit missing observations.
count_gaps() summarizes the time ranges that are absent from the data.
fill_gaps() turns them into explicit ones, along with imputing by values or functions.
These verbs are evocative and have a simple interface. By default, they look for gaps within each individual's own time period. Switching on the option .full = TRUE fills in the full-length time span, creating fully balanced panels in longitudinal data when possible, as sketched below.
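As a minimal sketch, using the pedestrian dataset that ships with tsibble (introduced later in this section); the replacement value 0L for Count is an arbitrary choice to illustrate imputing by value:

library(tsibble)
library(dplyr)

# Does any sensor's series contain time gaps over the full common span?
pedestrian %>% has_gaps(.full = TRUE)

# Summarize the time ranges that are absent, per sensor
pedestrian %>% count_gaps(.full = TRUE)

# Turn implicit gaps into explicit rows, either left as NA or imputed
pedestrian %>% fill_gaps(.full = TRUE)
pedestrian %>% fill_gaps(Count = 0L, .full = TRUE)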
The other important function is an adverb, index_by(), which is the counterpart of group_by() in dplyr, grouping and partitioning by the index only. It is most often used in conjunction with summarize(), creating aggregations to higher-level time resolutions. This combination automatically produces a new index and interval, and can also be used to regularize data of irregular interval.
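For instance, the hourly pedestrian data could be aggregated to monthly totals per sensor along the following lines (a sketch; the names Year_Month and Total are illustrative choices):

library(tsibble)
library(dplyr)

# Collapse the hourly series to monthly totals for each sensor;
# the result carries a new index (Year_Month) and a one-month interval
pedestrian %>%
  group_by_key() %>%
  index_by(Year_Month = yearmonth(Date_Time)) %>%
  summarise(Total = sum(Count))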
In addition to the new verbs, the dplyr vocabulary has been adapted and expanded to facilitate temporal transformations. The dplyr suite showcases general-purpose verbs for effectively manipulating tabular data, but these verbs need handling with care due to the context switch. A perceivable difference is summarizing variables with summarize(): on normal data it gives a single summary for the whole table, whereas on a tsibble it provides the corresponding summary for each index value.
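The contrast can be sketched as follows (the average count is an arbitrary summary chosen for illustration):

library(tsibble)
library(dplyr)

# On an ordinary tibble: a single overall summary
pedestrian %>%
  as_tibble() %>%
  summarise(avg_count = mean(Count))

# On a tsibble: one summary per index value (here, per hour)
pedestrian %>%
  summarise(avg_count = mean(Count))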
Attention has been paid to warning and error handling. The principle that underpins most verbs is a tsibble in and a tsibble out, thereby striving to maintain a valid tsibble over the course of the transformation pipeline. If the desired temporal ordering is changed by row-wise verbs (such as arrange() and slice()), a warning is broadcast. If a tsibble cannot be maintained in the output of a pipeline module (most likely with column-wise verbs), for example when the index is dropped by select(), an error informs users of the problem and suggests alternatives. This avoids surprising users and reminds them of the time context. In general, users who are already familiar with the tidyverse should meet less resistance in learning the new semantics and verbs.
3.4.2 Visualization
The ggplot2 package (Wickham 2009), as an implementation of the grammar of graphics, builds a powerful graphical system to declaratively visualize data. ggplot2 is underpinned by tidy data, and in turn a tsibble integrates well with ggplot2. The integration encourages more flexible graphics for exploring temporal structures via the index, and individual or group differences via the key.
Line charts are universally accepted for ordered data, whether called time series plots or spaghetti plots depending on the field. Both end up with exactly the same grammar: chronological time mapped to the horizontal axis, and the measurement of interest on the vertical axis, for each unit. Many specialist plots centering around time series or longitudinal data can hence be described and re-created under the umbrella of the grammar and ggplot2.
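A conventional time series plot of the pedestrian data included in tsibble (introduced later in this section) can be expressed directly in this grammar; the aesthetic mappings are illustrative:

library(tsibble)
library(ggplot2)

# Time on the x-axis, the measurement on the y-axis, one line per key (sensor)
ggplot(pedestrian, aes(x = Date_Time, y = Count, colour = Sensor)) +
  geom_line()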
3.4.3 Model
Modeling is crucial to explanatory and predictive analytics, where time series and longitudinal data analysis diverge. The tsibble, as a model-oriented object, can flow into both types of modeling, and the new semantics (index and key) can be internally utilized to accelerate modeling.
Most time series models are univariate, such as ARIMA and Exponential Smoothing, modeling temporal dynamics for each series independently. The fable package (O’Hara-Wild, Hyndman, and Wang 2019), currently under development, provides a tidy forecasting framework built on top of tsibble, with the goal of promoting transparent and human-centered forecasting practices. With the presence of the key, a tsibble can hold many series. Since models are fundamentally scalable, the model() and forecast() generics will take care of fitting and forecasting univariate models to each series across time in a tsibble at once.
Panel data models, however, emphasize overall, within, and between variation across both individuals and time. Fixed and random effects models could be developed in line with the fable design.
3.4.4 Summary
To sum up, the tsibble abstraction provides a formal organization for carrying tidy data forward into model-oriented temporal data analysis. The supporting operations can be chained to sequence an analysis, articulating a data pipeline. As Friedman and Wand (2008) stated, “No matter how complex and polished the individual operations are, it is often the quality of the glue that most directly determines the power of the system.” The mini snippet below illustrates how transformation and forecasting are glued together to realize a fluent pipeline.
library(tsibble)
library(dplyr)
library(lubridate)
library(fable)
pedestrian %>%
  fill_gaps() %>%                      # turn implicit missingness to explicit
  filter(year(Date_Time) == 2016) %>%  # subset data of year 2016
  model(arima = ARIMA(Count)) %>%      # fit ARIMA to each sensor
  forecast(h = days(2))                # forecast 2 days ahead
Here, the pedestrian dataset (City of Melbourne 2017), available in the tsibble package, is used. It contains hourly tallies of pedestrians at four counting sensors in 2015 and 2016 in inner Melbourne. The pipe operator %>%, introduced in the magrittr package (Bache and Wickham 2014), chains the verbs and is read as “then”. A sequence of functions is composed in a way that can be naturally read from left to right, which improves code readability. This code can be read as: “take the pedestrian data, fill the temporal gaps, filter to the 2016 measurements, then apply an ARIMA model and forecast two days ahead.”
Piping coordinates a user’s analysis, making it cleaner to follow, and permits a wider audience to follow the data analysis from the code without getting lost in a jungle of computational intricacies. It helps to (1) break up a big problem into more manageable blocks, (2) generate a human-readable analysis workflow, and (3) forestall introducing mistakes or, at least, make it possible to track and fix them upstream through the pipeline.