3.4 Data pipelines
A data pipeline describes the flow of data through an analysis, and helps to conceptualize the process when it is applied to a variety of problems. Mcilroy, Pinson, and Tague (1978) coined the term “pipelines” in software development while developing Unix at Bell Labs. In Unix-based operating systems, a pipeline chains together a series of operations on the basis of their standard streams, so that the output of each program becomes the input to the next. The Extract, Transform, and Load (ETL) process from the data warehousing literature, dating back to Kimball and Caserta (2011), outlines the workflow to prepare data for analysis, and can also be considered a data pipeline. Buja et al. (1988) describes a viewing pipeline for interactive statistical graphics, which takes control of the transformation from data to plot. Swayne, Cook, and Buja (1998), Swayne et al. (2003), Sutherland et al. (2000), Wickham et al. (2010) and Xie, Hofmann, and Cheng (2014) implemented data pipelines for the interactive statistical software XGobi, GGobi, Orca, plumbr and cranvas, respectively. The pipeline is typically described as a one-way flow, from data to plot. For interactive graphics, where all plots need to be updated when a user interacts with one of them, such events typically trigger the data pipeline to be rerun. Xie, Hofmann, and Cheng (2014) uses a reactive programming framework to implement the pipeline, in which a user’s interactions trigger a sequence of modules to update their views, which is practically the same as rerunning the data pipeline that produces each plot.
Building a data pipeline is technically difficult: many implementation decisions have to be made about the interface, the input and output objects, and the functionality. The tidy data abstraction lays the plumbing for the data analysis modules of transformation, visualization and modeling. Each module communicates with the others, requiring tidy input and producing tidy output, so that a series of operations can be chained together to accomplish the analytic tasks.
What is notable about an effective implementation of a data pipeline is that it coordinates a user’s analysis, making it cleaner to follow, and permits a wider audience to focus on the data analysis without getting lost in a jungle of computational intricacies. A fluent pipeline glues tidy data and the grammar of data manipulation together. It helps to (1) break up a big problem into manageable blocks, (2) generate a human-readable analysis workflow, and (3) avoid introducing mistakes, or at least make it possible to trace them through the pipeline. New data tools developed in the R package tsibble (Wang, Cook, and Hyndman 2019) articulate the time series data pipeline, which shepherds raw temporal data through to time series analysis and plots. More detailed explanations are given in the following sections and examples.
3.4.1 Time series transformation
Figure 3.3 illustrates how a time series pipeline differs from a regular data pipeline. It is highly recommended to check for identical entries of key and index before constructing a tsibble. Duplicates signal a data quality issue, which would likely affect subsequent analyses and hence decision making. Analysts are encouraged to look at the data early and reason about the process of data cleaning. When the data meets the tsibble standard, it flows neatly into the analysis stage and takes full advantage of the tsibble infrastructure.
Many time operations, such as lags/leads and time series models, assume an intact vector input ordered in time. Since a tsibble permits time gaps in the index, it is good practice to check and inspect any gaps in time following the creation of a tsibble, in order to prevent these avoidable errors from entering the analysis. The first suite of verbs (rephrasing actions performed on the object) is provided to understand and tackle implicit missing values: (1) `has_gaps()` checks if there exist time gaps; (2) `scan_gaps()` reveals all implicit missing observations; (3) `count_gaps()` summarizes the time ranges that are absent from the data; (4) `fill_gaps()` turns implicit missing values into explicit ones, optionally imputing with values or functions. The common argument `.full` in these functions gives an option to easily switch between looking for gaps over each key’s own time period or over the full-length time span; in other words, specifying `.full = TRUE` results in fully balanced panels.
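As a minimal sketch of these gap verbs, consider a toy tsibble (the data here are invented for illustration): the “kiwi” series skips 2020, leaving one implicit gap, while “cherry” is complete.

```r
library(tsibble)

harvest <- tsibble(
  year  = c(2018, 2019, 2021, 2018, 2019, 2020, 2021),
  fruit = c(rep("kiwi", 3), rep("cherry", 4)),
  kilo  = c(42, 51, 64, 14, 15, 16, 17),
  key   = fruit,
  index = year
)

has_gaps(harvest)                 # one row per key: TRUE for kiwi
scan_gaps(harvest)                # the implicit missing entries themselves
count_gaps(harvest)               # absent time ranges, summarized per key
fill_gaps(harvest, kilo = 0)      # turn gaps into explicit rows, imputing 0
fill_gaps(harvest, .full = TRUE)  # pad every key to the full time span
```

With `.full = TRUE`, both fruits are padded to the common 2018–2021 span, giving a balanced panel.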
| Verb | Description |
|------|-------------|
| `has_gaps()` | Test if a tsibble has gaps in time |
| `scan_gaps()` | Reveal implicit missing entries |
| `count_gaps()` | Summarize time gaps |
| `fill_gaps()` | Fill in gaps by values and functions |
| `filter()` | Pick rows based on conditions |
| `filter_index()` | Provide a shorthand for time subsetting |
| `slice()` | Select rows based on row positions |
| `arrange()` | Sort rows by variables |
| `select()` | Pick columns by variables |
| `mutate()` | Add new variables |
| `transmute()` | Add new variables, dropping existing ones |
| `summarize()` | Aggregate values over time |
| `index_by()` | Group by index candidate |
| `group_by()` | Group by one or more variables |
| `group_by_key()` | Group by key variables |
| `gather()` | Gather columns into long form |
| `spread()` | Spread columns into wide form |
| `nest()` | Nest values in a list-variable |
| `unnest()` | Unnest a list-variable |
| `left_join()` | Join two tables, keeping all rows of the first |
| `right_join()` | Join two tables, keeping all rows of the second |
| `full_join()` | Join two tables, keeping all rows of both |
| `inner_join()` | Join two tables, keeping matched rows only |
| `semi_join()` | Filter rows of the first table with a match in the second |
| `anti_join()` | Filter rows of the first table without a match in the second |
Besides the time gap verbs, the tidyverse vocabulary is adapted and expanded to facilitate time series transformations, as listed in Table 3.2. The tidyverse suite showcases general-purpose verbs for effectively manipulating tabular data: for example, `filter()` picks observations, `select()` picks variables, and `left_join()` joins two tables. But these verbs need handling with care when used in the time series domain. A perceivable difference is summarizing variables with `summarize()`: applied to a data frame, it reduces the data to a single summary, whereas applied to a tsibble, it computes one summary for each index value. Users who are already familiar with the tidyverse will experience a gentle learning curve in mastering these verbs, and glide into time series analysis with low cognitive load.
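The contrast can be seen with the `pedestrian` data bundled in the tsibble package (hourly counts at Melbourne sensors, which carry a `Date` column alongside the `Date_Time` index):

```r
library(dplyr)
library(tsibble)

# As a plain tibble: reduced to a single overall summary row.
pedestrian %>%
  as_tibble() %>%
  summarize(Total = sum(Count))

# As a tsibble: one summary per index value (here, per hour).
pedestrian %>%
  summarize(Total = sum(Count))

# index_by() groups the index, lowering the frequency from hourly to daily.
pedestrian %>%
  index_by(Date) %>%
  summarize(Total = sum(Count))
```

The last pipeline returns a new tsibble indexed by `Date`, so downstream operations remain time-aware.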
Attention has been paid to warning and error handling. The principle that underpins most verbs is a tsibble in and a tsibble out, thereby striving to maintain a valid tsibble by automatically updating the index and key under the hood. If the desired temporal ordering is changed by row-wise verbs, a warning is broadcast. If a tsibble cannot be maintained in the output of a pipeline module (most likely with column-wise verbs), for example when the index is removed by selection, an error informs users of the problem and suggests alternatives. This avoids surprising users and reminds them of the time context.
The tsibble structure and operations support data pipelines for sequencing analysis. Friedman and Wand (2008) asserted, “No matter how complex and polished the individual operations are, it is often the quality of the glue that most directly determines the power of the system.” Each verb works in harmony with the other members of the transformation family, and this set of verbs can be combined in many ways to prepare a tsibble for a broad range of visualization and modeling problems. Chaining operations is achieved with the pipe operator `%>%`, introduced in the magrittr package (Bache and Wickham 2014) and read as “then”. A sequence of functions is composed in a way that can be naturally read from left to right, which improves the readability of the code. It consequently generates a block of code without saving intermediate values.
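For example, a pipeline on the bundled `pedestrian` data reads naturally as a sequence of “then”s (the particular month and aggregation chosen here are only illustrative):

```r
library(dplyr)
library(tsibble)

# Take pedestrian, then keep March 2015, then group by sensor,
# then aggregate the hourly counts to daily totals per sensor.
pedestrian %>%
  filter_index("2015-03") %>%
  group_by_key() %>%
  index_by(Date) %>%
  summarize(Total = sum(Count))
```

No intermediate objects are saved: each verb hands a tidy tsibble to the next.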
Most importantly, a new ecosystem for tidy time series analysis is being built on the tsibble framework, called “tidyverts”, a play on tidyverse that acknowledges the time series analysis purpose.
3.4.2 Time series visualization
As a tsibble is a subclass of data frame, it integrates well with the grammar of graphics. It is easy to create and extend specialist time series plotting methods based on the tsibble structure, for example autocorrelation plots and calendar-based graphics (Wang, Cook, and Hyndman 2018a).
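For instance, a tsibble drops straight into ggplot2 with no conversion step (the subsetting and aesthetic mappings below are only illustrative):

```r
library(dplyr)
library(ggplot2)
library(tsibble)

# Plot hourly counts for March 2015, one coloured line per sensor.
p <- pedestrian %>%
  filter_index("2015-03") %>%
  ggplot(aes(x = Date_Time, y = Count, colour = Sensor)) +
  geom_line()
p
```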
3.4.3 Time series models
Modeling is crucial to explanatory and predictive analytics, but often imposes stricter assumptions on tsibble data. The verbs listed in Table 3.2 ease the transition to a tsibble that suits modeling. A tidy forecasting framework built on top of tsibble is under development, which aims at promoting transparent forecasting practices and concise model representation. A tsibble usually contains multiple time series. Batch forecasting is enabled by applying a univariate model, such as ARIMA or exponential smoothing, to each time series independently. This yields a “mable” (short for model table), in which each model corresponds to one “key” value of the tsibble. This avoids expensive data copying and reduces model storage. The mable is further supplied to forecasting methods to produce a “fable” (short for forecasting table), in which each “key”, along with its future time, holds the predictions. This also underlines the advantage of tsibble’s “key” in acting as the linkage between data inputs, models and forecasts. Advanced forecasting techniques, such as vector autoregression, hierarchical reconciliation, and ensembles, can be developed in a similar spirit. The modeling module is an ongoing endeavor.
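A sketch of how this workflow might look, assuming the interface of the in-development fable package (its `model()`, `ETS()`, `ARIMA()` and `forecast()` functions) applied to the `tourism` tsibble bundled with tsibble; the dataset, filter, and model choices are illustrative only:

```r
library(dplyr)
library(fable)    # in-development tidy forecasting, built on tsibble
library(tsibble)

# Fit one ETS and one ARIMA model per key: the result is a "mable",
# with one row per series and one column per model definition.
fit <- tourism %>%
  filter(Purpose == "Holiday", State == "Victoria") %>%
  model(ets = ETS(Trips), arima = ARIMA(Trips))

# Forecast every fitted model two years ahead: the result is a "fable",
# keyed by the original series keys together with the model name.
fc <- forecast(fit, h = "2 years")
```

The “key” columns flow unchanged from the input tsibble through the mable to the fable, linking data, models and forecasts.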