Introduction
The ongoing rapid deployment of IoT and edge-device systems generates large volumes of time series data [1]. Due to incomplete sensor monitoring, these time series often contain many missing values, which makes them difficult to use directly [2]. Effectively modeling time series with missing values for prediction is a challenging problem [3]. In this context, Multivariate Time Series Forecasting (MTSF) offers a more comprehensive approach by simultaneously considering the interdependencies among multiple time series. However, when missing values are prevalent, traditional MTSF models struggle to effectively capture temporal and spatial dependencies from past to future [4]. We refer to this task as Multivariate Time Series Forecasting with Missing Values (MTSFMV); modeling it involves two steps: (i) performing imputation around missing values to alleviate data sparsity, and (ii) extracting temporal and spatial dependencies from both known and imputed values. Ignoring either step leads to degraded performance [5].
Given the limited attention paid to MTSFMV, existing methods can be roughly divided into two categories: imputation models and forecasting models. The former focuses on reconstructing missing values using various techniques, such as bidirectional recurrent units [6], spatial attention [7], and graph neural networks [8]–[10]. However, these methods lack secondary modeling of the relationship between known and imputed values, resulting in poor performance when applied directly to MTSFMV. The latter is designed for abundant data and tends to collapse when faced with a large number of missing values [11]. For example, GCN-M [12] proposes a memory network that considers local and global spatiotemporal features and focuses on traffic flow prediction with missing values. However, it does not consider the variation of missing patterns when modeling spatiotemporal correlations, and it relies on a predefined graph structure, which can lead to suboptimal solutions and limited generalization. BiTGraph [13] proposes a temporal convolutional network with a bias term to incorporate missing patterns into spatiotemporal relationship modeling. However, the missing pattern changes over time and is an uncertain factor, and simply incorporating partial convolution [14] into a 1D-CNN leads to unstable and non-robust interpolation. Additionally, the design of the Biased GCN ignores spatial relationships that change over time, which limits it in scenes with complex spatiotemporal dynamics.
Inspired by the above observations, we propose Spatiotemporal Missing Pattern Awareness Networks (STMPANets), which explicitly consider time-varying missing patterns when capturing temporal correlations and spatial dependencies. We first decompose the series into seasonal and trend components to highlight its inherent properties. Subsequently, we carefully design two key modules: MGCPT and ADGCN. MGCPT performs partial convolution with conditional restrictions to mine temporal correlations at different granularities by adjusting the imputation rate of missing values. ADGCN captures dynamic spatial dependencies by constructing dynamic graphs and incorporating enhanced missing patterns. We integrate these modules into a multi-branch hierarchical framework. The main contributions can be summarized as follows:
We propose STMPANets, which can simultaneously capture temporal correlation and dynamic spatial dependence of time series forecasting with missing values.
We design MGCPT to perceive missing patterns from different time granularities and propose ADGCN to incorporate missing patterns for dynamic feature interaction.
Experimental results on three real-world datasets demonstrate the significant improvement of STMPANets over other baselines.
Fig. 1. The overall framework of STMPANets and the detail of the Multi-granularity Conditional Partial TCN.
Preliminaries
Multivariate Time Series Forecasting. Given a historical observation sequence ${\mathcal{X}} = \left\{ {{x_{t - P + 1}}, \ldots ,{x_t}} \right\} \in {\mathbb{R}^{N \times P}}$ of $N$ variables over the past $P$ steps, the goal is to predict the future sequence ${\mathcal{Y}} = \left\{ {{x_{t + 1}}, \ldots ,{x_{t + Q}}} \right\} \in {\mathbb{R}^{N \times Q}}$ over the next $Q$ steps.
Multivariate Time Series Forecasting with Missing Value. Not all historical variables have observed values. We define the mask matrix ${\mathcal{M}} \in {\{ 0,1\} ^{N \times P}}$, where $m_t^{(n)} = 1$ if the value of variable $n$ at time $t$ is observed and $m_t^{(n)} = 0$ otherwise. MTSFMV aims to predict ${\mathcal{Y}}$ from the partially observed ${\mathcal{X}}$ together with its mask ${\mathcal{M}}$.
Methodology
The overall STMPANets framework is shown in Fig. 1(a). Details of each model component are described below.
A. Sequence Decomposition
Inspired by the traditional time series decomposition algorithm [15], we introduce a sequence decomposition module that separates the complex patterns of the input sequence, decoupling the temporal pattern and highlighting the inherent properties of the sequence [16]. Specifically, the input sequence ${\mathcal{X}}$ is split into a trend component ${\mathcal{X}}_t$, obtained by average pooling over a padded copy of the input with a fixed kernel size, and a seasonal component ${\mathcal{X}}_s$, the residual: \begin{align*} & {{\mathcal{X}}_t} = \operatorname{AvgPool} {(\operatorname{Padding} ({\mathcal{X}}))_{kernel}}\tag{1} \\ & {{\mathcal{X}}_s} = {\mathcal{X}} - {{\mathcal{X}}_t}\tag{2}\end{align*}
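Eqs. (1)–(2) amount to a moving average and its residual. A minimal NumPy sketch for a single univariate series follows; the function name, the `kernel` hyperparameter, and the replicate padding scheme are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def series_decompose(x, kernel=5):
    """Split a series into trend and seasonal parts via a moving average,
    in the spirit of Eqs. (1)-(2): the trend X_t is an average-pooled,
    edge-padded copy of the input; the seasonal part is the residual
    X_s = X - X_t. Padding keeps the output length equal to the input."""
    pad = kernel // 2
    # Replicate-pad both ends so AvgPool with stride 1 preserves length.
    padded = np.concatenate(
        [np.full(pad, x[0]), x, np.full(kernel - 1 - pad, x[-1])]
    )
    # AvgPool with stride 1 == moving average over the kernel window.
    trend = np.convolve(padded, np.ones(kernel) / kernel, mode="valid")
    seasonal = x - trend
    return trend, seasonal
```

By construction the two components sum back to the input exactly, which is what makes the decomposition lossless.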
B. Multi-granularity Conditional Partial TCN
Because TCNs have shown better sequence-modeling capability than RNNs across a variety of time series tasks, we adopt an improved TCN as the primary backbone to capture temporal correlation. As illustrated in Fig. 1(b), MGCPT has two main layers: the dilated inception layer and the conditional partial layer. To simplify the description, we omit the superscripts below.
Dilated Inception Layer. We adopt the dilated inception layer structure proposed in [17], which obtains a wide receptive field at low computational cost and thereby extracts high-level temporal features. Given an input ${\mathcal{X}} \in {\mathbb{R}^{N \times C \times T}}$ and a filter ${\mathcal{F}} \in {\mathbb{R}^{C \times K}}$ with dilation factor $d$, the dilated convolution at step $t$ is: \begin{equation*}{{\mathcal{X}}^o}(n,c,t) = \sum\limits_{s = 0}^{K - 1} {\mathcal{F}} (c,s)\cdot {\mathcal{X}}(n,c,t - d\cdot s)\tag{3}\end{equation*}
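Eq. (3) is a depthwise causal convolution with gaps of size d between taps. A small NumPy sketch, with the shape convention (batch N, channels C, time T) assumed from the equation's indices:

```python
import numpy as np

def dilated_causal_conv(x, f, d):
    """Depthwise dilated causal convolution per Eq. (3).

    x: (N, C, T) input; f: (C, K) per-channel filter; d: dilation factor.
    Each output step t aggregates K past samples spaced d apart; the first
    (K-1)*d steps are dropped because they lack a full receptive field.
    """
    N, C, T = x.shape
    C2, K = f.shape
    assert C == C2, "filter must have one row per channel"
    T_out = T - d * (K - 1)
    out = np.zeros((N, C, T_out))
    for s in range(K):
        # Term F(c, s) * X(n, c, t - d*s), vectorised over n, c and t.
        out += f[:, s][None, :, None] * x[:, :, d * (K - 1) - d * s : T - d * s]
    return out
```

With d = 1 this reduces to an ordinary causal convolution; larger d widens the receptive field to (K−1)·d + 1 without extra parameters.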
Conditional Partial Layer. Motivated by the successful application of partial convolution in computer vision tasks [18]–[20], we introduce conditional partial convolution to model temporal correlation with partial observations. Unlike ordinary partial convolution [14], we propose aggregating observable information only from a certain conditional proportion of neighbors for each time series variable with missing values. Specifically, for a specific time series variable, let $x$ denote the values in the current convolution window of size $K$ and ${\mathbf{m}}$ the corresponding binary mask; the conditional partial convolution with threshold $\tau$ is: \begin{equation*}{x^\prime } = {\begin{cases} {{{\mathbf{W}}^T}(x \odot {\mathbf{m}})\frac{K}{{\sum {({\mathbf{m}})} }} + {\mathbf{b}},} & {\text{if }\frac{{\sum {({\mathbf{m}})} }}{{\sum {({\mathbf{1}})} }} \geq \tau } \\ {{\mathbf{0}},} & {\text{otherwise}} \end{cases}}\tag{4}\end{equation*}
The mask is updated accordingly: \begin{equation*}{{\mathcal{M}}^\prime }(x) = {\begin{cases} {1,} & {\text{if }\frac{{\sum {({\mathbf{m}})} }}{{\sum {({\mathbf{1}})} }} \geq \tau } \\ {0,} & {\text{otherwise}} \end{cases}}\tag{5}\end{equation*}
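Eqs. (4)–(5) can be sketched for one sliding window as follows. The function and argument names are illustrative; the paper's layer applies this at every position of the dilated convolution, whereas this sketch treats a single window of one variable.

```python
import numpy as np

def conditional_partial_conv(x, m, W, b, tau=0.5):
    """Conditional partial convolution over one window, per Eqs. (4)-(5).

    x: (K,) window of a single variable; m: (K,) binary mask (1 = observed);
    W: (K,) weights; b: scalar bias; tau: observation-ratio threshold.
    Returns the filled value x' and the updated mask entry M'(x)."""
    K = x.shape[0]
    ratio = m.sum() / K  # fraction of observed entries in the window
    if ratio >= tau:
        # Aggregate only observed neighbours, rescaled by K / sum(m)
        # to compensate for the entries zeroed out by the mask.
        x_new = W @ (x * m) * (K / m.sum()) + b
        m_new = 1
    else:
        # Too few observations: emit zero and keep the position masked,
        # deferring the fill to a later layer.
        x_new = 0.0
        m_new = 0
    return x_new, m_new
```

Note the rescaling keeps the output magnitude comparable across windows with different numbers of observed entries, which is what makes the interpolation stable.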
To capture the temporal correlations at multiple granularities hidden in the input sequence, we aggregate multi-granularity convolutions with different scales into the Dilated Inception Layer and the Conditional Partial Layer. Specifically, we use kernel sizes of 1 × 2, 1 × 3, 1 × 5, and 1 × 7 and aggregate temporal information of the different granularities through maximum pooling. Subsequently, two gate functions, tanh(•) and sigmoid(•), control the amount of information passed to the next module.
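The branch aggregation and gating described above can be sketched as two small helpers. Both are assumptions about the fusion details (the paper only names max pooling and the tanh/sigmoid gates): branches of different kernel sizes yield different output lengths, so this sketch aligns them by truncating to the shortest before pooling.

```python
import numpy as np

def aggregate_branches(branches):
    """Fuse multi-granularity branch outputs by element-wise max pooling.
    Larger kernels yield shorter sequences, so all branches are first
    truncated (from the left) to the shortest length to align timestamps."""
    t_min = min(b.shape[-1] for b in branches)
    stacked = np.stack([b[..., -t_min:] for b in branches])
    return stacked.max(axis=0)

def gated_output(h_filter, h_gate):
    """tanh/sigmoid gating: tanh bounds the information content while the
    sigmoid gate controls how much of it flows to the next module."""
    return np.tanh(h_filter) / (1.0 + np.exp(-h_gate)) * (1.0 + np.exp(-h_gate)) * (1.0 / (1.0 + np.exp(-h_gate)))
```

In practice `h_filter` and `h_gate` come from two parallel convolution stacks over the same input, as in gated TCN architectures.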
C. Adaptive Dynamic GCN
The MGCPT module captures temporal features but ignores dynamic spatial relations between sequences. Unlike existing methods [21] –[23], we construct an adaptive dynamic graph to learn spatial node relations at each timestamp and incorporate the enhanced missing pattern into the learning process of graph structure to explain missing values.
Adaptive Dynamic Graph Construction. We first initialize two learnable random node embeddings ${E_1},{E_2} \in {\mathbb{R}^{N \times d}}$ to represent static node features, as shown in Fig. 2(a). In the absence of a predefined graph structure, we use a Gaussian kernel and a dynamic node filter to learn dynamic node relations. This process is described as:
\begin{equation*}{\mathcal{G}} = exp\left( { - \frac{{{{\left\| {{E_1} - {E_2}} \right\|}^2}}}{{2{\sigma ^2}}}} \right)\tag{6}\end{equation*}
In particular, we aim for dynamic changes in one node to affect another, with the learned graph structure being unidirectional. Thus, we fuse the random node embeddings with the dynamic node filter ${\mathcal{F}}_t$ and construct the dynamic adjacency matrix ${\mathcal{A}}_t$ at timestamp $t$ as: \begin{align*} & \hat E_1^t = tanh\left( {\alpha \left( {{{\mathcal{F}}_t}{E_1}} \right)} \right),\hat E_2^t = tanh\left( {\alpha \left( {{{\mathcal{F}}_t}{E_2}} \right)} \right)\tag{7} \\ & {{\mathcal{A}}_t} = ReLU\left( {tanh\left( {\alpha \left( {\hat E_1^t\hat E_2^{tT} - \hat E_2^t\hat E_1^{tT}} \right)} \right)} \right)\tag{8}\end{align*}
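Eqs. (7)–(8) can be sketched directly in NumPy. The shape of the dynamic node filter ${\mathcal{F}}_t$ is an assumption here (taken as N × N so it mixes node embeddings); the antisymmetric construction is what enforces unidirectionality: if an edge i → j survives the ReLU, the reverse edge j → i is zeroed out.

```python
import numpy as np

def dynamic_adjacency(E1, E2, F_t, alpha=3.0):
    """Adaptive dynamic graph per Eqs. (7)-(8).

    E1, E2: (N, d) static node embeddings; F_t: (N, N) dynamic node filter
    at timestamp t (shape assumed); alpha: saturation rate of tanh.
    The pre-activation E1'E2'^T - E2'E1'^T is antisymmetric, so after
    ReLU(tanh(.)) at most one direction of each edge pair is nonzero."""
    e1 = np.tanh(alpha * (F_t @ E1))                       # Eq. (7)
    e2 = np.tanh(alpha * (F_t @ E2))
    pre = alpha * (e1 @ e2.T - e2 @ e1.T)                  # antisymmetric
    return np.maximum(0.0, np.tanh(pre))                   # Eq. (8)
```

Unidirectionality also implies a zero diagonal, i.e. no self-loops; a self-connection is added back later via the identity term in Eq. (12).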
Missing Pattern Enhancement. To enhance the mask matrix, we introduce a self-attention mechanism to better capture missing patterns, as shown in Fig. 2(b). We first compute the self-attention weight matrix ${\mathcal{W}}_{att}$ from the query, key, and value projections ${\mathbf{Q}}$, ${\mathbf{K}}$, ${\mathbf{V}}$ of the mask features: \begin{equation*}{{\mathcal{W}}_{att}} = softmax\left( {\frac{{{\mathbf{Q}}{{\mathbf{K}}^T}}}{{\sqrt {{d_k}} }}} \right){\mathbf{V}}\tag{9}\end{equation*}
The enhanced mask at layer $l$ is then obtained by combining the attention weights with the mask from the conditional partial layer: \begin{equation*}{{\mathcal{M}}^{\prime \prime (l)}} = sigmoid\left( {{{\mathcal{M}}^{\prime (l)}} \oplus {{\mathcal{W}}_{att}} \odot {{\mathcal{M}}^{\prime (l)}}} \right)\tag{10}\end{equation*}
Finally, the dynamic adjacency matrix is corrected with the enhanced missing pattern, where $\beta$ is a learnable coefficient controlling the correction strength: \begin{equation*}{\mathcal{A}}_t^{\prime (l)} = {\mathcal{A}}_t^{(l)} \oplus \beta sigmoid\left( {{{\mathcal{S}}^{(l)}} \odot {{\mathcal{M}}^{\prime \prime (l)}}{{\mathcal{M}}^{\prime \prime (l)T}}} \right)\tag{11}\end{equation*}
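Eqs. (9)–(10) can be sketched as follows. This is a minimal illustration, not the paper's implementation: Q, K, V are assumed to be linear projections of the mask features supplied by the caller, and ⊕/⊙ are read as element-wise addition/multiplication.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable row-wise softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def enhance_mask(M, Q, K, V):
    """Missing-pattern enhancement per Eqs. (9)-(10): a self-attention map
    over mask features re-weights the coarse mask M, and a sigmoid squashes
    the enhanced mask into (0, 1) so it acts as a soft observation weight."""
    d_k = K.shape[-1]
    W_att = softmax(Q @ K.T / np.sqrt(d_k)) @ V   # Eq. (9)
    return sigmoid(M + W_att * M)                 # Eq. (10)
```

The enhanced mask is soft rather than binary, which lets partially reliable (imputed) positions contribute with reduced weight when correcting the adjacency matrix in Eq. (11).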
Dynamic Feature Interaction. Combined with the updated dynamic adjacency matrix, we apply graph convolution to perform dynamic feature interaction:
\begin{equation*}{{\mathcal{X}}^{(l + 1)}} = \left( {I + {\mathcal{D}}_o^{ - 1}{{\mathcal{A}}^{\prime (l)}} + {\mathcal{D}}_i^{ - 1}{{\mathcal{A}}^{\prime (l)T}}} \right){{\mathcal{X}}^{\prime (l)}}{{\mathbf{W}}^{(l)}} + {b^{(l)}}\tag{12}\end{equation*}
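Eq. (12) propagates features along both edge directions of the asymmetric dynamic graph, each normalised by the corresponding degree, plus a self-loop via the identity term. A minimal sketch for a single layer, with a small epsilon added as an implementation assumption to guard isolated nodes:

```python
import numpy as np

def dynamic_feature_interaction(X, A, W, b, eps=1e-8):
    """One graph-convolution step per Eq. (12).

    X: (N, C_in) node features; A: (N, N) (possibly asymmetric) adjacency;
    W: (C_in, C_out) weights; b: (C_out,) bias.
    D_o^{-1} A normalises outgoing edges, D_i^{-1} A^T incoming ones,
    and the identity keeps each node's own features."""
    d_out = A.sum(axis=1) + eps   # out-degree (row sums of A)
    d_in = A.sum(axis=0) + eps    # in-degree (column sums of A)
    P = np.eye(A.shape[0]) + A / d_out[:, None] + A.T / d_in[:, None]
    return P @ X @ W + b
```

Using both A and its transpose lets information flow against edge direction too, which matters because the learned graph of Eq. (8) is unidirectional.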
D. Loss and Training
To train our model, we select Mean Absolute Error (MAE) as the training objective, and the loss function is defined as:
\begin{equation*}{\mathcal{L}}\left( {{\mathcal{Y}},\widehat {\mathcal{Y}},{{\mathcal{M}}_{t:t + Q}}} \right) = \frac{{\sum\nolimits_{h = t}^{t + Q - 1} {\sum\nolimits_{n = 1}^N {m_h^{(n)}\left| {y_h^{(n)} - \hat y_h^{(n)}} \right|} } }}{{\sum\nolimits_{h = t}^{t + Q - 1} {\sum\nolimits_{n = 1}^N {m_h^{(n)}} } }}\tag{13}\end{equation*}
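Eq. (13) is a masked MAE: the absolute error is averaged only over positions actually observed in the forecasting horizon, so discarded ground-truth values do not distort training. A one-function NumPy sketch:

```python
import numpy as np

def masked_mae(y_true, y_pred, mask):
    """Masked MAE per Eq. (13). mask = 1 marks observed targets; positions
    with mask = 0 contribute neither to the numerator nor the denominator."""
    mask = mask.astype(float)
    return (mask * np.abs(y_true - y_pred)).sum() / mask.sum()
```

Normalising by the mask sum rather than the tensor size keeps the loss scale comparable across batches with different missing rates.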
Experiment
A. Experimental Setting
Datasets. To evaluate the proposed STMPANets, we conduct extensive experiments on three popular real-world datasets from different domains: PEMS-BAY [24], Weather [15], and BeijingAir [6]. We randomly discard data according to the missing rate r, ranging from 0.2 to 0.8.
Baselines. We compare STMPANets with several classical imputation-forecasting methods and the latest state-of-the-art baselines, including BRITS [6], SPIN [7], GRIN [8], GCN-M [12], AGCRN [25], MTGNN [17], Autoformer [15], GinAR [26], and BiTGraph [13]. For models requiring complete input, we fill the missing parts with zeros.
Configurations and Evaluation Metrics. The datasets are split into training, validation, and test sets in a 6:2:2 ratio. All methods use a historical window P = 24 and a forecasting horizon Q = 24. The learning rate is set to 0.001, and the batch size is 32. Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) are used as evaluation metrics, with lower values indicating better performance.
B. Experimental Results
TABLE I shows the forecasting performance of STMPANets and other baselines across three datasets at missing rates of 0.2 and 0.8. The results indicate that: (i) SPIN and GRIN excel in handling time series imputation but are limited by insufficient modeling of the relationship between observed and missing data. (ii) In multivariate time series models explicitly considering missing values, GinAR may experience error accumulation due to variable missing, while BiTGraph, despite its bias correction for missing patterns, might overlook time-varying spatial relationships. (iii) Our proposed STMPANets outperforms all other models, achieving nearly an 8.92% improvement over the best baseline, thanks to its ability to integrate spatiotemporal missing patterns and capture temporal and spatial dependencies. Notably, its performance gain is more pronounced at a missing rate of 0.8.
C. Ablation Study
To validate the effectiveness of the key components in STMPANets, we conduct ablation experiments. We design four model variants: (i) w/o DIL: removing the Dilated Inception Layer, (ii) w/o CPL: removing the Conditional Partial Layer, (iii) w/o MPE: replacing the enhanced mask with the original mask, and (iv) w/o DFI: replacing the updated dynamic adjacency matrix with the original matrix. As shown in Fig. 3, it is evident that all key components of STMPANets play crucial roles. The most influential components are the Conditional Partial Layer and Missing Pattern Enhancement, suggesting that incorporating missing patterns into spatiotemporal modeling effectively captures sequence dependencies. Meanwhile, the introduction of a dynamic graph effectively captures dynamic spatial relationships.
D. Hyperparameter Sensitivity
We investigate the hyperparameter sensitivity of STMPANets on the PEMS-BAY dataset, as shown in Fig. 4. The results indicate that optimal performance is achieved with 3 layers: too many layers may lead to overfitting, while too few may fail to capture complex spatiotemporal relationships. In particular, τ controls the size of the observable masked area and strongly affects model performance, with τ = 0.5 yielding the best imputation results. This is because a smaller τ enlarges the missing-value area that the partial convolution observes, meaning it is more likely to focus on the local area around the predicted part at this stage; a larger τ means more masked areas are retained for the next layer to fill. The similar values of the learnable β across different missing rates suggest that β is independent of the missing rate and serves to control the correction strength.
Conclusion
In this paper, we propose a novel spatiotemporal missing pattern awareness network that simultaneously captures temporal correlations and dynamic spatial dependencies for time series forecasting with missing values. The STMPANets framework starts from the sequence decomposition perspective and combines the carefully designed MGCPT and ADGCN modules to perceive missing patterns along the time dimension and spatial dimension respectively. Extensive experiments on three real-world datasets demonstrate its superior performance across various missing value scenarios. In the future, we will improve the applicability of STMPANets to identify more complex dynamic missing value patterns in time series data.