4.3 New data abstraction and operations for missing data in time
Figure 4.1 is a typical time series plot: it draws the observed data and leaves gaps to indicate what is not available. As a result, missing values receive little attention because they lack visual emphasis. This exposes a need to better represent and display missing data in time. A convenient starting point is an appropriate computer representation. In R, missing values are encoded as NA. However, the notion of an ordered NA does not exist. A new abstraction for ordered NA provides the scope for conveying the temporal locations and dependencies in missing data.
4.3.1 New encoding for indexing missing data by time
Inspired by run-length encoding (RLE), a new encoding is proposed that extracts only the NAs from time-indexed data and compresses them into a simpler form, namely “RLE <NA>”. It comprises three components to locate the missings and mark their corresponding runs: (1) the positions where each run of NA starts, (2) the run lengths (the number of NAs in a row), and (3) the interval (for example, hourly or yearly). This implies that time indices should be unique.
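The three components can be illustrated with a minimal base R sketch. It is not the implementation behind RLE <NA>; the helper name encode_na_runs() and the toy yearly series are assumptions made for illustration only.

```r
# Hypothetical helper (not from the source): encode the NA runs of a
# regularly spaced series as start indices, run lengths, and an interval.
encode_na_runs <- function(x, index, interval = 1L) {
  # Time indices must match the data and be unique.
  stopifnot(length(x) == length(index), !anyDuplicated(index))
  ord <- order(index)
  x <- x[ord]
  index <- index[ord]
  runs <- rle(is.na(x))                  # runs of TRUE (missing) and FALSE
  ends <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1L     # position where each run begins
  list(
    indices  = index[starts[runs$values]],   # (1) where each NA run starts
    lengths  = runs$lengths[runs$values],    # (2) how many NAs in a row
    interval = interval                      # (3) spacing of the time index
  )
}

# Toy yearly series, invented for illustration.
x <- c(5, NA, NA, NA, 7, 8, NA, 9, NA, NA)
index <- 2001:2010
enc <- encode_na_runs(x, index, interval = 1L)
enc$indices
enc$lengths
```

Here ten observations with six missing values reduce to three start indices and three run lengths; the saving grows with the length of the runs.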
This new encoding focuses purely on indexed missing values, separated from the input data. It is partially lossless, because its reverse operation can recover the original positions of missing values, but not the whole data. It is most useful and compact for indexing runs of missing data, requiring less storage than the original lengthy form. However, when the missings are mostly runs of length one, it offers little advantage. Consider the missingness types of Missing at Occasions and Missing at Runs in Figure 4.1: the former occupies 14 positions to store its NAs, while the latter, as a sparser representation, uses only 7 positions to store more NAs than the former. The RLE <NA> is easy to interpret. For the latter, it reads as a sequence of 12 NAs beginning at 1949 March, followed by 13 NAs from 1950 August, and so on.
#> <list_of<Run Length Encoding <NA>>[2]>
#> [[1]]
#> <Run Length Encoding <NA>[13]>
#> $lengths: <int> 1 1 1 1 1 1 1 1 1 1 ...
#> $indices: <date> 1951 Apr 1952 May 1953 May 1954 Apr 1955 Feb 1955 Nov 1956 Jul 1957 Feb 1957 Aug 1958 Jan ...
#> [[2]]
#> <Run Length Encoding <NA>[7]>
#> $lengths: <int> 12 13 5 3 4 8 4
#> $indices: <date> 1949 Mar 1950 Aug 1951 Dec 1953 Jan 1955 Jun 1956 Apr 1958 Dec
An instance of RLE <NA> is a reduced form for representing NA in time, built on top of the new vctrs framework (H. Wickham, Henry, and Vaughan 2019).
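The partially lossless property noted earlier in this section can be sketched in base R. The helper decode_na_runs() and the toy encoding below are hypothetical and only illustrate the idea: the reverse operation restores where values are missing, not what the observed values were.

```r
# Hypothetical helper (not from the source): expand start indices and run
# lengths back into the full set of missing time indices.
decode_na_runs <- function(enc) {
  unlist(Map(
    function(start, len) seq(start, by = enc$interval, length.out = len),
    enc$indices, enc$lengths
  ))
}

# Toy encoding, invented for illustration: three runs on a yearly index.
enc <- list(
  indices  = c(2002L, 2007L, 2009L),
  lengths  = c(3L, 1L, 2L),
  interval = 1L
)
missing_at <- decode_na_runs(enc)

# The positions of the missings are recoverable on the original index ...
index <- 2001:2010
is_missing <- index %in% missing_at
# ... but the observed values at the remaining positions are not stored.
```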
4.3.2 Supporting functions operating on RLE <NA>
The RLE <NA> treats indexed missing data as the raw data itself, which provides the opportunity to manipulate the missings with many useful operations.
It is computationally more efficient to sum (sum()) and count (length()) the run lengths over a standalone RLE <NA> than to work directly with the original form, and the results are identical. Other mathematical functions, such as mean() and median(), make it straightforward to compute run-related statistics. For example, mean() gives the average number of missings per run. Without the RLE route, these statistics would be cumbersome to compute.
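As a rough illustration of these run statistics, the sketch below uses only base R's rle() on a toy vector. It stands in for a standalone RLE <NA> and is an assumption for illustration, not the methods defined on the new class.

```r
# Toy series, invented for illustration.
x <- c(5, NA, NA, NA, 7, 8, NA, 9, NA, NA)

runs <- rle(is.na(x))
na_lengths <- runs$lengths[runs$values]   # run lengths of the NA runs only

total_missing <- sum(na_lengths)      # total number of missing values
n_runs        <- length(na_lengths)   # number of separate runs
mean_run      <- mean(na_lengths)     # average number of missings per run
median_run    <- median(na_lengths)   # typical run length

# The total agrees with counting NAs directly on the original form.
total_missing == sum(is.na(x))
```

Once the run lengths are isolated, each statistic is a single call on a short integer vector rather than a pass over the full series.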
These math operations take a single RLE <NA> at a time. The other set of functions is the set operators, which perform set union (union()), intersection (intersect()), and asymmetric difference (setdiff()) on a pair of RLE <NA>. They are useful for exploring the association between multiple sets of missing data. For example, the intersect() operator can tell whether two sets overlap with each other and by how much, which powers one of the plots in the next section. Since set operators are binary functions, they can be applied successively across a collection of series to give an overall picture of all of them.
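The set operators can be sketched with base R on expanded missing indices. The vectors below are hypothetical yearly indices, and the plain integer vectors stand in for RLE <NA> objects, so this shows the idea rather than the actual methods.

```r
# Hypothetical missing time indices (years) for three series.
miss_a <- c(2002:2004, 2007L, 2009:2010)
miss_b <- c(2003:2006, 2010L)
miss_c <- c(2004L, 2010L, 2011L)

both   <- intersect(miss_a, miss_b)   # missing in both series
either <- union(miss_a, miss_b)       # missing in at least one series
only_a <- setdiff(miss_a, miss_b)     # missing in a but observed in b

# How much the two sets overlap, as a share of all missing time points.
overlap_share <- length(both) / length(either)

# Binary operators applied successively across a collection of series.
common <- Reduce(intersect, list(miss_a, miss_b, miss_c))
```

Reduce() folds the binary operator over the list, which is one way to combine a collection of series pair by pair.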