4.5 Scaling up to large collections of temporal data
Section 4.4 discussed graphics for revealing and understanding missingness patterns in a handful of series. However, these plots do not scale well to a large collection of temporal data involving many series and many measures in a table. This section proposes and describes a solution for dealing with missing values at scale, referred to as "missing data polish". The name echoes Tukey (1977), who coined the term "median polish" for an iterative procedure that obtains an additive-fit model for data. Here, a new analytical technique strategically removes observations and variables to reduce the proportion of missing values in the data. The polishing process yields numerical summaries that facilitate understanding, and in turn produces a reasonable subset to work with, especially when missings are scattered across many variables and observations.
4.5.1 Polishing missing data by variables and observations
The polishing procedure assumes that the incoming data is a “tsibble” (Wang, Cook, and Hyndman 2019a). The tsibble is a modern re-imagining of temporal data, which formally organizes a collection of related observational units and measurements over time in a tabular form. A tsibble consists of index, key, and measured variables in a long format. The index variable contains time in chronological order; the key uniquely defines each observational unit over time; columns other than index and key are classified as measures.
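As a concrete illustration, the sketch below builds a small tsibble from scratch using the tsibble package; the data values are made up for demonstration.

```r
library(tsibble)

# Toy temporal data (invented for illustration): two sensors measured
# daily, with missings scattered across both measures.
weather <- data.frame(
  date     = rep(as.Date("2020-01-01") + 0:2, times = 2),
  sensor   = rep(c("a", "b"), each = 3),
  temp     = c(20.1, NA, 19.8, 22.3, 21.9, NA),
  humidity = c(55, 57, NA, 60, NA, 58)
)

# `date` is the index, `sensor` the key; `temp` and `humidity` are measures.
weather_tsbl <- as_tsibble(weather, key = sensor, index = date)
```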
This data structure suggests polishing procedures in two directions (by rows and by columns), resulting in four polishers:
na_polish_measures()
: A column polisher for removing measured variables.

na_polish_key()
: A row polisher for removing a whole chunk of units across measures.

na_polish_index()
: A row polisher for removing leading indexed observations within each unit across measures.

na_polish_index2()
: A row polisher for removing trailing indexed observations within each unit across measures.
This set of polishers covers the basics of missing values occurring in a tsibble. The decision rule for deleting certain rows or columns is controlled by a constant cutoff value (\(0 \le c \le 1\)). Each polisher first computes \(p_{i} = \text{proportion of overall missings}\), where \(i\) indexes a partition of the data (each column for na_polish_measures(), and each chunk of rows for the other polishers). If \(p_{i} \ge c\), the \(i\)th column or chunk of rows is removed; otherwise it is kept as is.
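To make the decision rule concrete, the following base R sketch mimics the logic of a column polisher; it is an illustration only, not the actual implementation, and the helper name and arguments are hypothetical.

```r
# Illustrative sketch of a column polisher's decision rule (not the
# actual implementation): compute p_i for each measured variable and
# drop those whose proportion of missings reaches the cutoff.
polish_measures_sketch <- function(data, measures, cutoff) {
  p <- vapply(data[measures], function(x) mean(is.na(x)), numeric(1))
  to_drop <- names(p)[p >= cutoff]
  data[setdiff(names(data), to_drop)]
}
```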
However, an ideal choice of \(c\) is not clear-cut. Missing data polishing is an upstream module relative to other analytical tasks, from data visualization to modeling, and these tasks have various degrees of tolerance for missing values. For example, data plots are largely unaffected by missing data, implying a higher tolerance; for these, specifying a higher \(c\) removes little data. On the other hand, (time series) models are likely to complain about the existence of any missings, and some will even decline the job, requiring a lower tolerance. A lower \(c\) is likely to produce a complete dataset for such downstream analyses, but may remove too much data.
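For instance, reusing the hypothetical sketch and toy data from above, both measures are one-third missing, so a lenient cutoff keeps everything while a strict one prunes aggressively:

```r
# Lenient cutoff for plotting-oriented work: 0.33 < 0.8, nothing removed.
polish_measures_sketch(weather, c("temp", "humidity"), cutoff = 0.8)
#> returns `weather` unchanged

# Strict cutoff ahead of modelling: 0.33 >= 0.2, both measures dropped.
polish_measures_sketch(weather, c("temp", "humidity"), cutoff = 0.2)
#> returns only `date` and `sensor`
```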
4.5.2 Formulating polishing strategies
The polishers described in the previous section are elementary tools provided to analysts for brushing missings away. A few iterations of these functions, with considerable manual effort, are often required to achieve a desired result, and the polished data can be influenced by the ordering of polishers and the choice of cutoffs. The polishing goal, in general, is to maximize the proportion of missings in the removed data slices while minimizing the number of observations removed. An automated polishing strategy, implemented in na_polish_auto(), is formulated to refine the procedure with less human involvement. It takes care of the sequence of polishers and the number of iterations, but leaves the cutoff in the user's hands. The automation relies on a loss metric, computed in a loop, to determine the order of the polishers and when to stop iterating. This loss metric is defined as
\[\begin{equation}
l_{i} = (1 - p_{i}) \times \frac{r_{i}}{N},
\tag{4.1}
\end{equation}\]
where \(p_{i}\) is the proportion of missings and \(r_{i}\) the number of removed observations for each data slice \(i = 1, \dots\), and \(N\) is the total number of observations. The loss is small when a removed slice is mostly missing (for example, a slice with \(p_{i} = 1\) incurs zero loss) or when few observations are removed. Minimizing the loss \(l\) guides the polishing procedure (a code sketch follows the steps below):
1. Run the four polishers independently to obtain \(l_{i}\).
2. Re-run the polishers sequentially, ordered by \(l_{i}\) from high to low, and obtain \(l_{I}\), where \(I\) denotes an iteration.
3. Repeat steps 1 and 2 until \(l_{I} \le \tau\), where \(\tau\) is a pre-specified tolerance value close to \(0\). (A higher \(\tau\) gives an earlier exit.)
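A rough sketch of this loop is given below. It assumes, hypothetically, that each polisher returns a list holding the polished data along with the \(p_{i}\) and \(r_{i}\) of the removed slice; the actual internals of na_polish_auto() may differ.

```r
# Illustrative sketch of the automated loop (not the actual source).
# `polishers` is assumed to be a list of functions, each returning
# list(data = <polished data>, p = <prop. missing in removed slice>,
#      r = <number of observations removed>).
na_polish_auto_sketch <- function(data, polishers, cutoff, tol = 0.01) {
  N <- nrow(data)
  repeat {
    n_before <- nrow(data)
    # Step 1: run each polisher independently and score its loss l_i.
    trials <- lapply(polishers, function(f) f(data, cutoff))
    loss <- vapply(trials, function(t) (1 - t$p) * t$r / N, numeric(1))
    # Step 2: re-run the polishers sequentially, highest loss first.
    for (f in polishers[order(loss, decreasing = TRUE)]) {
      data <- f(data, cutoff)$data
    }
    # Step 3: stop once an iteration removes a negligible share of rows,
    # a conservative proxy for l_I <= tau (since l_I <= r_I / N).
    if ((n_before - nrow(data)) / N <= tol) break
  }
  data
}
```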
The companion function na_polish_autotrace() documents the entire polishing process above, tracing \(p_{i}\), \(r_{i}\), and \(l_{i}\) along the way. These quantities provide useful visual summaries of missing data patterns, and in turn an aid for choosing the cutoffs.
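For instance, assuming the trace comes back as a tidy table with one row per polisher per iteration (the actual return structure may differ), the traced loss could be charted like this:

```r
library(ggplot2)

# A mocked-up trace (values invented for illustration): one row per
# polisher per iteration, recording p_i and r_i, from which l_i follows.
trace <- data.frame(
  iteration = rep(1:3, each = 4),
  polisher  = rep(c("measures", "key", "index", "index2"), times = 3),
  p         = c(0.9, 0.7, 0.95, 0.85, 0.6, 0.5, 0.8, 0.7, 0.2, 0.1, 0.3, 0.25),
  r         = c(40, 120, 60, 55, 20, 35, 15, 10, 5, 8, 2, 1)
)
trace$l <- (1 - trace$p) * trace$r / 500  # assuming N = 500 in total

# Loss per polisher across iterations: lines flattening near zero suggest
# the polishing has converged and further iterations do little work.
ggplot(trace, aes(iteration, l, colour = polisher)) +
  geom_line() +
  geom_point()
```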