Introduction
Multivariate time series forecasting is a primary machine learning task in both scientific research and industrial applications [1], [2]. The interactions and dependencies among the individual time series govern how they evolve, and these can range from simple linear correlations to complex relationships such as the traffic flows underlying intelligent transportation systems [3], [4], [5], [6] or the physical forces affecting the trajectories of objects in space [7], [8], [9].
Accurately predicting future values of the time series may require understanding their true relationships, which can provide valuable insights into the system represented by the time series. Recent studies aim to jointly infer these relationships and learn to forecast in an end-to-end manner, even without prior knowledge of the underlying graph [5], [10]. However, inferring the graph from numerous time series data has a quadratic computational complexity, making it prohibitively expensive to scale to a large number of time signals.
Another important aspect of time series forecasting is the presence of non-stationary properties, such as seasonal effects, trends, and other structures that depend on the time index [11]. Such properties may need to be eliminated before modeling, and a recent line of work aims to incorporate trend and seasonality decomposition into the model architecture to simplify the prediction process [12], [13].
It is therefore natural to ask whether deep neural networks can combine the strengths of both worlds: 1) a latent graph structure that aids time series forecasting, with each signal represented as a node and the interactions between signals as edges, and 2) end-to-end training that decomposes the time series into multiple levels, models the distinct patterns at each level separately, and then recombines them to make accurate predictions. Existing works have not addressed both of these strengths in a unified framework, and this is precisely the research question we address in the current study.
To address this, we propose the use of graph neural networks (GNN) and a self-attention mechanism that efficiently infers latent graph structures with a time complexity and memory usage of
We introduce a novel approach that extends hierarchical signal decomposition and merges it with the concurrent learning of hierarchical latent graphs, which we term hierarchical joint graph learning and multivariate time series forecasting (HGMTS).
Our method incorporates a sparse self-attention mechanism, which we establish as a good inductive bias for learning on graphs and for addressing long sequence time series forecasting (LSTF) challenges.
Through our experimental findings, it is evident that our proposed model outperforms traditional transformer networks in time series forecasting. The design not only sets a superior standard for direct multi-step forecasting but also establishes itself as a promising spatio-temporal GNN benchmark for subsequent studies bridging latent graph learning and time series forecasting.
Related Work
Until recently, deep learning methods for time series forecasting have primarily focused on utilizing recurrent neural networks (RNNs) and their variants to develop sequence-to-sequence prediction approaches [14], [15], [16], [17], which have shown remarkable outcomes. Despite significant progress, however, these methods have yet to achieve accurate predictions for long sequence time series forecasting (LSTF) due to challenges such as error accumulation over many unrolling steps, as well as vanishing gradients and memory limitations [18].
Self-attention based transformer models proposed recently for LSTF tasks have revolutionized time series prediction and attained remarkable success. In contrast to traditional RNN models, transformers have exhibited superior capability in capturing long-range temporal dependencies. Still, recent advancements in this domain, as illustrated by LongFormer [19], Reformer [20], Informer [21], AutoFormer [22], and ETSformer [23], have predominantly focused on improving the efficiency of the self-attention mechanism, particularly for handling long input and output sequences. Concurrently, there has been a rise in attention-free architectures, as seen in Oreshkin et al. [12] and Challu et al. [13], which offer a computationally efficient alternative for modeling long input-output relationships by using deep stacks of fully connected layers. However, such models often overlook the intricate interactions between signals in multivariate time series data, tending to process each time series independently.
Spatio-temporal graph neural networks (ST-GNNs) are a specific type of GNNs that are tailored to handle both time series data and their interactions. They have been used in a wide range of applications such as action recognition [24], [25] and traffic forecasting [26], [27], [28]. These networks integrate sequential models for capturing temporal dependencies with GNNs employed to encapsulate spatial correlations among distinct nodes. However, a caveat with ST-GNNs is that they necessitate prior information regarding structural connectivity to depict the interrelations in time series data. This can be a limitation in cases where the structural information is not available.
Accordingly, GNNs that include structure learning components have been developed to learn effective graph structures suitable for time series forecasting. Two such models, NRI [8] and GTS [6], calculate the probability of an edge between nodes using pairwise scores, resulting in a discrete adjacency matrix. Nonetheless, this approach can be computationally intensive with a growing number of nodes. In contrast, MTGNN [5] and GDN [10] utilize a randomly initialized node embedding matrix to infer the latent graph structure. While this approach is less taxing on computational resources, it might compromise the accuracy of predictions.
Methods
In this section, we detail our proposed method, HGMTS. The overarching framework and core operational principles of this approach can be viewed in Figures 1 and 2.
Figure 1. Overview of the latent graph structure learning (L-GSL). (a) Key nodes chosen at random (depicted as gray circles) are used to measure the significance of a query node (shown as a blue circle). (b) Top-
Figure 2. Overview of the proposed HGMTS model architecture. The hierarchical residual block is marked by signal decomposition and GNN-centric L-GSL modules (left). The combination of multiple blocks forms a stack (middle), culminating in the entire model design (right) to ultimately produce a global forecasting output.
Figure 3. Ablation study overview. Displayed are four distinct model architectures explored to understand the impact of specific components on overall LSTF performance.
A. Preliminaries
Let $\mathbf{x}_{i,t-L:t} \in \mathbb{R}^{L}$ denote the past $L$ observations of the $i$-th of $N$ time series at time $t$, and let $\mathbf{y}_{i,t+1:t+K}$ denote its future values over a forecast horizon of $K$ steps. Given the lookback windows of all $N$ series, the goal of multivariate time series forecasting is to produce the predictions $\hat{\mathbf{y}}_{i,t+1:t+K}$ for every series $i$.
B. Latent Graph Structure Learning (L-GSL)
We embrace the concept of self-attention (introduced by [29]) and employ the attention scores as edge weights. The adjacency matrix of the graph, denoted as $\mathcal{A}$, is obtained from the node representation matrix $\mathbf{H}$ as \begin{equation*} {\mathbf {Q}}= {\mathbf {H}}\mathbf {W}^{Q},\quad {\mathbf {K}} = {\mathbf {H}} {\mathbf {W}} ^{K},\quad \mathcal {A}=\mathrm {softmax}\left ({\frac { {\mathbf {Q}} {\mathbf {K}} ^{T}}{\sqrt {D}}}\right) \tag{1}\end{equation*}
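As a concrete illustration, the following PyTorch sketch computes the dense attention-based adjacency of Eq. (1) from a matrix of node representations; the class and variable names are ours, and keeping the projection dimension equal to the input dimension is an assumption.

```python
import torch
import torch.nn as nn

class DenseGraphAttention(nn.Module):
    """Illustrative sketch of Eq. (1): attention scores used as edge weights."""
    def __init__(self, d_model: int):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model, bias=False)  # W^Q
        self.w_k = nn.Linear(d_model, d_model, bias=False)  # W^K

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (N, D) node representations, one row per time series
        q, k = self.w_q(h), self.w_k(h)
        scores = q @ k.transpose(-2, -1) / (h.size(-1) ** 0.5)
        return torch.softmax(scores, dim=-1)  # dense adjacency A of shape (N, N)
```

Computing this full matrix costs quadratic time and memory in the number of nodes, which motivates the sparse selection procedure described next.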
1) Identifying Pivotal Query Nodes
For the purpose of determining which query nodes will establish connections with other nodes, our initial step involves evaluating the significance of queries. Recent studies [19], [21], [30], [31] have highlighted the existence of sparsity in the distribution of self-attention probabilities. Drawing inspiration from these findings, we establish the importance of queries based on the Kullback-Leibler (KL) divergence between a uniform distribution and the attention probability distribution of query nodes.
Let $p(\mathbf{k}_{j} \mid \mathbf{q}_{i})$ denote the attention probability that query node $i$ assigns to key node $j$, and let $q(\mathbf{k}_{j} \mid \mathbf{q}_{i}) = 1/N$ be the uniform distribution over keys. The importance of the $i$-th query node is measured by the divergence between these two distributions: queries whose attention distribution deviates strongly from uniform are considered dominant and are retained.
The traversal of all query nodes for this measurement, however, still entails a quadratic computational requirement. It is worth noting that a recent study demonstrated that the relative magnitudes of query importance remain unchanged even when the divergence metric is calculated using randomly sampled keys [21]. Building on this idea, we determine the importance of query nodes through the computation of
2) Identifying Associated Key Nodes
Using the selected set of pivotal query nodes, we then identify the key nodes to which they should be connected and compute the sparse adjacency matrix restricted to these nodes:\begin{equation*} \bar {\mathcal {A}}=\mathrm {softmax}\left ({\frac {\bar {\mathbf {Q}} \bar {\mathbf {K}} ^{T}}{\sqrt {D}}}\right) \tag{2}\end{equation*}
In this equation, $\bar{\mathbf{Q}}$ and $\bar{\mathbf{K}}$ denote the reduced query and key matrices formed from the rows of $\mathbf{Q}$ and $\mathbf{K}$ that correspond to the selected query and key nodes, respectively.
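The sketch below puts the two selection steps together under stated assumptions: query importance is scored with the sampled-key divergence surrogate of Informer [21], and keys are ranked by the attention mass they receive from the chosen queries; the exact selection rule and the number of retained nodes are not fixed by the text, so those details are ours, as are all names.

```python
import torch
import torch.nn as nn

class SparseLGSL(nn.Module):
    """Hedged sketch of the L-GSL procedure (Eqs. 1-2)."""
    def __init__(self, d_model: int, n_selected: int, n_sampled_keys: int):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.n_selected = n_selected          # number of pivotal queries / keys kept
        self.n_sampled_keys = n_sampled_keys  # random keys used to score the queries

    def forward(self, h: torch.Tensor):
        n, d = h.shape
        q, k = self.w_q(h), self.w_k(h)

        # 1) Score every query against a random subset of keys (avoids O(N^2));
        #    the log-sum-exp minus mean surrogate follows Informer [21].
        key_idx = torch.randperm(n, device=h.device)[: self.n_sampled_keys]
        s_sample = q @ k[key_idx].T / d ** 0.5                       # (N, n_sampled)
        importance = torch.logsumexp(s_sample, dim=-1) - s_sample.mean(dim=-1)
        q_idx = importance.topk(self.n_selected).indices             # pivotal queries

        # 2) Keep the keys that receive the most attention from those queries
        #    (an assumed selection rule; the paper's exact criterion is elided).
        s_full = q[q_idx] @ k.T / d ** 0.5                           # (u, N)
        k_idx = s_full.softmax(dim=-1).sum(dim=0).topk(self.n_selected).indices

        # 3) Sparse adjacency over the selected query/key nodes (Eq. 2).
        a_bar = torch.softmax(q[q_idx] @ k[k_idx].T / d ** 0.5, dim=-1)
        return a_bar, q_idx, k_idx
```

Since attention is only materialized between the selected nodes, the cost scales with the number of retained queries and keys rather than with all pairwise combinations.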
C. Hierarchical Signal Decomposition
This section provides an overview of the proposed approach shown in Figure 2 and discusses the overall design principles. Our approach builds upon N-BEATS [12], significantly enhancing its key elements. Our methodology comprises three primary elements: signal decomposition, latent graph structure learning, and constructing forecasts and backcasts in a hierarchical manner. Much like the N-BEATS approach, every block is trained to generate both backcast and forecast signals. The backcast output is subtracted from the input of the subsequent block, whereas the forecasts are summed to produce the final prediction (Figure 2a). These blocks are arranged in stacks, each focusing on a distinct spatial dependency through a unique set of graph structures.
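This doubly residual composition can be sketched as follows; `block` is a placeholder for the hierarchical residual block detailed in the next subsections, and the tensor shapes are assumptions.

```python
import torch
import torch.nn as nn

class ResidualStack(nn.Module):
    """Sketch of the backcast/forecast composition described above. Each block
    is assumed to map an input window to a (backcast, forecast) pair; the
    concrete block is defined in the following subsections."""
    def __init__(self, blocks: nn.ModuleList, horizon: int):
        super().__init__()
        self.blocks = blocks
        self.horizon = horizon

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, L) lookback windows for N series
        residual = x
        forecast = x.new_zeros(x.size(0), self.horizon)
        for block in self.blocks:
            backcast, block_forecast = block(residual)
            residual = residual - backcast      # pass the unexplained part onward
            forecast = forecast + block_forecast  # accumulate partial forecasts
        return forecast
```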
1) Signal Decomposition Module
Recent research has witnessed a surging interest in disentangling time series data into its trend and seasonal components. These components respectively represent the overall long-term pattern and the seasonal fluctuations within the time signals. However, when it comes to future time series, directly performing this decomposition becomes impractical due to the inherent uncertainty of the future. To address this challenge, we propose the incorporation of a signal decomposition module within a single block (Figure 2a). This module enables the gradual extraction of the consistent, long-term trend from intermediate forecasting signals. Specifically, we employ the moving average technique to smooth out recurring fluctuations and uncover the underlying long-term trends as outlined below:\begin{align*} {\mathbf {X}}^{\texttt {trend}} &=\mathrm {AvgPool}(\mathrm {Padding}({\mathbf {X}})) \\ {\mathbf {X}}^{{\texttt {seas}}} &= {\mathbf {X}}- {\mathbf {X}}^{\texttt {trend}} \tag{3}\end{align*}
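A minimal PyTorch sketch of the decomposition in Eq. (3) follows, assuming the padding replicates the boundary values so that the smoothed trend keeps the input length; the kernel size is a free hyperparameter not specified here.

```python
import torch
import torch.nn as nn

class SeriesDecomposition(nn.Module):
    """Moving-average decomposition of Eq. (3)."""
    def __init__(self, kernel_size: int = 25):
        super().__init__()
        self.kernel_size = kernel_size
        self.avg = nn.AvgPool1d(kernel_size, stride=1)

    def forward(self, x: torch.Tensor):
        # x: (N, L); replicate the boundary values so the output length stays L.
        left = x[:, :1].repeat(1, (self.kernel_size - 1) // 2)
        right = x[:, -1:].repeat(1, self.kernel_size // 2)
        padded = torch.cat([left, x, right], dim=-1).unsqueeze(1)  # (N, 1, L + k - 1)
        trend = self.avg(padded).squeeze(1)                        # X^trend, (N, L)
        seasonal = x - trend                                       # X^seas = X - X^trend
        return seasonal, trend
```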
2) Message-Passing Module
The message-passing module receives as input the past $L$ observations of each node and iteratively refines the node representations as follows:\begin{align*} {\mathbf {h}}_{i}^{(0)} &=\; f({\mathbf {x}}_{i,t-L:t}) \tag{4}\\ {\mathbf {m}}_{ij}^{(r)} &=\; g({\mathbf {h}}_{i}^{(r)}- {\mathbf {h}}_{j}^{(r)}) \tag{5}\\ \mathcal {\bar A} &=\; \text {L-GSL}({\mathbf {H}}) \tag{6}\\ {\mathbf {h}}_{i}^{(r+1)} &=\; \text {GRU}\left({{\mathbf {h}}_{i}^{(r)}, \textstyle \sum \nolimits _{j \in \mathcal {N}(i)} \bar a_{ij} \cdot {\mathbf {m}} _{i j}^{(r)}}\right) \tag{7}\end{align*}
To enhance both the model’s expressivity and its capacity for generalization, we employ a multi-module GNN framework [32]. More specifically, the next hidden state $\mathbf{h}_{i}^{(r+1)}$ is computed as a gated combination of the updates $\mathbf{h}_{i,1}^{(r)}$ and $\mathbf{h}_{i,2}^{(r)}$ produced by two parallel GRU modules, weighted by a coefficient $\beta_{i}^{(r)}$:\begin{equation*} {\mathbf {h}}_{i}^{(r+1)} =\; \beta _{i}^{(r)} {\mathbf {h}}_{i,1}^{(r)} + (1-\beta _{i}^{(r)}) {\mathbf {h}}_{i,2}^{(r)} \tag{8}\end{equation*}
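The following sketch traces the dataflow of Eqs. (4)-(8) under stated assumptions: the encoders $f$ and $g$ are taken to be small MLPs, the two modules in Eq. (8) are GRU cells (consistent with the ablation variant HGMTS6), the gate $\beta$ is produced by a sigmoid layer, the selected query nodes are treated as message receivers, and the number of rounds is illustrative; none of these details are fixed by the text. The L-GSL module is the earlier `SparseLGSL` sketch.

```python
import torch
import torch.nn as nn

class MessagePassingModule(nn.Module):
    """Hedged sketch of Eqs. (4)-(8); only the overall dataflow is from the text."""
    def __init__(self, lookback: int, d_model: int, lgsl: nn.Module):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(lookback, d_model), nn.ReLU(),
                               nn.Linear(d_model, d_model))             # Eq. (4)
        self.g = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                               nn.Linear(d_model, d_model))             # Eq. (5)
        self.lgsl = lgsl                                                 # Eq. (6)
        self.gru1 = nn.GRUCell(d_model, d_model)                        # Eq. (7), module 1
        self.gru2 = nn.GRUCell(d_model, d_model)                        # Eq. (7), module 2
        self.gate = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())  # beta in Eq. (8)

    def forward(self, x: torch.Tensor, rounds: int = 2) -> torch.Tensor:
        h = self.f(x)                                   # (N, D) initial node states
        for _ in range(rounds):
            a_bar, q_idx, k_idx = self.lgsl(h)          # sparse adjacency over (u, u) nodes
            # Messages m_ij = g(h_i - h_j) between the selected nodes, edge-weighted.
            diff = h[q_idx].unsqueeze(1) - h[k_idx].unsqueeze(0)        # (u, u, D)
            msg = (a_bar.unsqueeze(-1) * self.g(diff)).sum(dim=1)       # (u, D)
            agg = torch.zeros_like(h).index_copy(0, q_idx, msg)         # others get zero input
            h1, h2 = self.gru1(agg, h), self.gru2(agg, h)
            beta = self.gate(h)                          # (N, 1) gating coefficient
            h = beta * h1 + (1 - beta) * h2              # Eq. (8)
        return h
```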
3) Forecast and Backcast Module
Following the completion of the last message-passing round $R$, the final node representations of the seasonal and trend channels are mapped to backcast and forecast signals:\begin{align*} \hat {\mathbf {x}} _{i,t-L:t}^{\texttt {seas}} &=\; \phi _{\texttt {seas}}({\mathbf {h}}_{i,\texttt {seas}}^{(R)}) \tag{9}\\ \hat {\mathbf {x}} _{i,t-L:t}^{\texttt {trend}} &=\; \phi _{\texttt {trend}}({\mathbf {h}}_{i,\texttt {trend}}^{(R)}) \tag{10}\\ \hat {\mathbf {x}} _{i,t-L:t} &=\; \hat {\mathbf {x}} _{i,t-L:t}^{\texttt {seas}} + \hat {\mathbf {x}} _{i,t-L:t}^{\texttt {trend}} \tag{11}\\ \hat {\mathbf {y}} _{i,t+1:t+K}^{\texttt {seas}} &=\; \psi _{\texttt {seas}}({\mathbf {h}}_{i,\texttt {seas}}^{(R)}) \tag{12}\\ \hat {\mathbf {y}} _{i,t+1:t+K}^{\texttt {trend}} &=\; \psi _{\texttt {trend}}({\mathbf {h}}_{i,\texttt {trend}}^{(R)}) \tag{13}\\ \hat {\mathbf {y}} _{i,t+1:t+K} &=\; \hat {\mathbf {y}} _{i,t+1:t+K}^{\texttt {seas}} + \hat {\mathbf {y}} _{i,t+1:t+K}^{\texttt {trend}} \tag{14}\end{align*}
Here, $\phi_{\texttt{seas}}$ and $\phi_{\texttt{trend}}$ are projection functions that produce the backcast outputs over the lookback window, while $\psi_{\texttt{seas}}$ and $\psi_{\texttt{trend}}$ produce the forecast outputs over the horizon of $K$ future steps; the seasonal and trend components are summed to form the block-level backcast $\hat{\mathbf{x}}_{i,t-L:t}$ and forecast $\hat{\mathbf{y}}_{i,t+1:t+K}$.
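A minimal sketch of these heads follows, assuming the mappings $\phi$ and $\psi$ are linear layers over the final node representations; the text does not fix their exact form, so this is illustrative only.

```python
import torch.nn as nn

class ForecastBackcastHead(nn.Module):
    """Sketch of Eqs. (9)-(14): heads mapping the final seasonal and trend node
    states to a backcast (length L) and a forecast (length K)."""
    def __init__(self, d_model: int, lookback: int, horizon: int):
        super().__init__()
        self.phi_seas = nn.Linear(d_model, lookback)    # Eq. (9)
        self.phi_trend = nn.Linear(d_model, lookback)   # Eq. (10)
        self.psi_seas = nn.Linear(d_model, horizon)     # Eq. (12)
        self.psi_trend = nn.Linear(d_model, horizon)    # Eq. (13)

    def forward(self, h_seas, h_trend):
        backcast = self.phi_seas(h_seas) + self.phi_trend(h_trend)   # Eq. (11)
        forecast = self.psi_seas(h_seas) + self.psi_trend(h_trend)   # Eq. (14)
        return backcast, forecast
```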
Experimental Setup
We first provide an overview of the datasets (Table 1), evaluation metrics, and baselines employed to quantitatively assess our model’s performance. The main results are summarized in Table 2, demonstrating the competitive predictive performance of our approach in comparison to existing works. We then elaborate on the specifics of our training and evaluation setups followed by detailing the ablation studies.
A. Datasets
Our experimentation extensively covers six real-world benchmark datasets. Conforming to the standard protocol [21], [33], the split of all datasets into training, validation, and test sets has been conducted chronologically, following a split ratio of 60:20:20 for the ETTm2 dataset and a split ratio of 70:10:20 for the remaining datasets.
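For concreteness, a small helper illustrating the chronological split described above; the function name and signature are ours, and the default ratios correspond to the non-ETTm2 datasets.

```python
def chronological_split(data, train_ratio=0.7, val_ratio=0.1):
    """Split a time-ordered array of shape (T, N) into chronological
    train/val/test segments. ETTm2 would use train_ratio=0.6, val_ratio=0.2."""
    t = len(data)
    train_end = int(t * train_ratio)
    val_end = int(t * (train_ratio + val_ratio))
    return data[:train_end], data[train_end:val_end], data[val_end:]
```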
ETTm2 (Electricity Transformer Temperature): This dataset encompasses data obtained from electricity transformers, featuring load and oil temperatures recorded every 15 minutes during the period from July 2016 to July 2018.
ECL (Electricity Consuming Load): The ECL dataset compiles hourly electricity consumption (in kWh) data from 321 customers, spanning the years 2012 to 2014.
Exchange: This dataset aggregates daily exchange rates of eight different countries relative to the US dollar. The data spans from 1990 to 2016.
Traffic: The Traffic dataset is a collection of road occupancy rates from 862 sensors situated along San Francisco Bay area freeways. These rates are recorded every hour, spanning from January 2015 to December 2016.
Weather: This dataset comprises 21 meteorological measurements, including air temperature and humidity. These measurements are recorded every 10 minutes throughout the entirety of the year 2020 in Germany.
ILI (Influenza-Like Illness): This dataset provides a record of weekly influenza-like illness (ILI) patients and the total patient count, sourced from the Centers for Disease Control and Prevention of the United States, covering the period from 2002 to 2021. Each weekly value represents the ratio of ILI patients to the total patient count.
B. Evaluation Metrics
We evaluate the effectiveness of our approach by measuring its accuracy using the mean squared error (MSE) and mean absolute error (MAE) metrics. These evaluations are conducted over various prediction horizon lengths $K$:\begin{align*} \mathrm {MSE} &=\frac {1}{NK} \sum _{i=1}^{N}\sum _{\tau =t+1}^{t+K}\left ({\mathbf {y}_{i,\tau }-\hat {\mathbf {y}}_{i,\tau }}\right)^{2} \tag{15}\\ \mathrm {MAE} &=\frac {1}{NK} \sum _{i=1}^{N}\sum _{\tau =t+1}^{t+K}\left |{\mathbf {y}_{i,\tau }-\hat {\mathbf {y}}_{i,\tau }}\right | \tag{16}\end{align*}
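A short helper mirroring Eqs. (15)-(16); the tensor shapes and the function name are assumptions.

```python
import torch

def mse_mae(y_true: torch.Tensor, y_pred: torch.Tensor):
    """Compute MSE and MAE over a batch of N series and K forecast steps.
    y_true, y_pred: tensors of shape (N, K)."""
    err = y_true - y_pred
    return (err ** 2).mean().item(), err.abs().mean().item()
```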
C. Baselines
We evaluate our proposed model by comparing it with seven baseline models: (1) N-BEATS [12], which aligns with the external structure of our model; (2) Autoformer [33], (3) Informer [21], (4) Reformer [20], and (5) LogTrans [31], which are recent transformer-based models; and two conventional RNN-based models, (6) LSTNet [34] and (7) LSTM [35].
D. Hyperparameters
Our model is trained using the Adam optimizer, starting with a learning rate of $10^{-4}$ that is halved every two epochs. We employ early stopping, terminating training if there is no improvement for 10 epochs. Training is carried out with a batch size of 32. Our model is configured with 3 stacks, each containing 1 block. All experiments are run three times using the PyTorch framework on a single NVIDIA RTX 3090 GPU with 24 GB of memory.
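The optimization setup can be sketched as follows; `train_one_epoch` and `evaluate` are hypothetical placeholder callables for the data and evaluation routines, and the maximum epoch count is an assumption.

```python
import torch
import torch.nn as nn

def train(model: nn.Module, train_one_epoch, evaluate, max_epochs: int = 100):
    """Hedged sketch of the training setup: Adam at 1e-4, LR halved every two
    epochs, early stopping after 10 epochs without improvement."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.5)
    best_val, patience, bad_epochs = float("inf"), 10, 0
    for epoch in range(max_epochs):
        train_one_epoch(model, optimizer)   # one pass over batches of size 32
        val_loss = evaluate(model)
        scheduler.step()                    # halve the learning rate every 2 epochs
        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:      # early stopping
                break
    return best_val
```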
Experimental Results
A. Multivariate Time Series Forecasting
In the multivariate setting, our proposed model, HGMTS, consistently achieves state-of-the-art performance across all benchmark datasets and prediction length configurations (Table 2). Notably, under the input-96-predict-192 setting, HGMTS demonstrates significant improvements over previous state-of-the-art results, with a 34% (
B. Effect of Sparsity in Graphs on Forecasting
Within the HGMTS model framework, a key hyperparameter is the sampling factor in L-GSL. This factor determines how many query nodes are selected and subsequently linked to key nodes. For the sake of simplicity, we ensure that the number of chosen query and key nodes remains the same. We then measure the sparsity of the latent graphs by computing the proportion of selected pivotal query or key nodes relative to the total time series count. This proportion is denoted as
To understand the impact of sparsity in the learned graphs, we modify
C. Ablation Studies
We posit that the strengths of the HGMTS architecture stem from its ability to hierarchically model the interplay between time series, particularly in the realms of trend and seasonality components. To delve deeper into this proposition, we present a series of control models for a comparative analysis:
HGMTS1: The model as showcased in Figure 2.
HGMTS2: A model that has shared latent graphs between trend and seasonality channels, but not across different blocks and stacks.
HGMTS3: A model where latent graphs are shared throughout all blocks and stacks but remain distinct between trend and seasonality channels.
HGMTS4: This model omits the L-GSL and MPNN modules.
HGMTS5: A model focusing solely on either the trend or seasonality channel, essentially lacking the signal decomposition module.
HGMTS6: A model that has used a single GRU module in Eq (8).
Under the same multivariate setting, the evaluation metrics for each control model, averaged over all benchmark datasets excluding ILI, are detailed in Table 4. The HGMTS4, which forgoes the L-GSL and MPNN modules, experiences a noticeable average MSE surge of 30% (
The information presented in Table 4 robustly supports the idea that best performance is achieved by integrating both suggested components: the latent graph structure and hierarchical signal decomposition. This emphasizes their synergistic role in enhancing the accuracy of long sequence time series predictions. Furthermore, it is confirmed that crafting distinct latent associations between time series hierarchically, spanning both trend and seasonal channels, is instrumental in attaining improved prediction outcomes.
Conclusion
In this paper, we addressed the challenge of long-term multivariate time series forecasting, an area that has seen notable progress recently but where intricate temporal patterns often prevent models from learning reliable dependencies. In response, we introduced HGMTS, a spatio-temporal multivariate time series forecasting model that incorporates a signal decomposition module and latent graph structure learning as intrinsic operators. This approach allows for the hierarchical aggregation of long-term trend and seasonal information from intermediate predictions. Furthermore, we adopted a multi-module message-passing framework to enhance the model’s capacity to capture diverse time series data from a range of heterogeneous sensors, which distinctly sets our work apart from previous neural forecasting models. Notably, HGMTS naturally achieves a computational complexity of
Learning a latent graph typically poses considerable challenges. Although our model leverages a top-$k$ pooling strategy to infer the latent graph, many other deep learning techniques could be investigated in future studies to uncover hidden structural patterns. Improvements in both representational capacity and computational efficiency could further broaden its adoption.