Introduction
Multivariate time series forecasting is a primary machine learning task in both scientific research and industrial applications [1], [2]. The interactions and dependencies among the individual time series govern how they evolve, and these can range from simple linear correlations to complex relationships such as the traffic flows underlying intelligent transportation systems [3], [4], [5], [6] or the physical forces affecting the trajectories of objects in space [7], [8], [9].
Accurately predicting future values of the time series may require understanding their true relationships, which can provide valuable insights into the system represented by the time series. Recent studies aim to jointly infer these relationships and learn to forecast in an end-to-end manner, even without prior knowledge of the underlying graph [5], [10]. However, inferring the graph from numerous time series data has a quadratic computational complexity, making it prohibitively expensive to scale to a large number of time signals.
Another important aspect of time series forecasting is the presence of non-stationary properties, such as seasonal effects, trends, and other structures that depend on the time index [11]. Such properties may need to be eliminated before modeling, and a recent line of work aims to incorporate trend and seasonality decomposition into the model architecture to simplify the prediction process [12], [13].
It is therefore natural to ask whether deep neural networks can combine the strengths of both worlds: 1) a latent graph structure that aids time series forecasting, with each signal represented as a node and the interactions between signals as edges, and 2) end-to-end training that decomposes the time series into multiple levels, models the distinct patterns at each level separately, and then recombines them to make accurate predictions. Existing works have not addressed both of these strengths in a unified framework, and this is precisely the research question we address in the current study.
To address this, we propose the use of graph neural networks (GNN) and a self-attention mechanism that efficiently infers latent graph structures with a time complexity and memory usage of
We introduce a novel approach that extends hierarchical signal decomposition and merges it with the concurrent learning of hierarchical latent graphs, which we term hierarchical joint graph learning and multivariate time series forecasting (HGMTS).
Our method incorporates a sparse self-attention mechanism, which we establish as a good inductive bias for learning on graphs and for addressing long sequence time series forecasting (LSTF) challenges.
Through our experimental findings, it is evident that our proposed model outperforms traditional transformer networks in time series forecasting. The design not only sets a superior standard for direct multi-step forecasting but also establishes itself as a promising spatio-temporal GNN benchmark for subsequent studies bridging latent graph learning and time series forecasting.
Related Work
Until recently, deep learning methods for time series forecasting have primarily focused on utilizing recurrent neural networks (RNNs) and their variants to develop sequence-to-sequence prediction approaches [14], [15], [16], [17], which have shown remarkable outcomes. Despite significant progress, however, these methods have yet to achieve accurate predictions for long sequence time series forecasting (LSTF) due to challenges such as error accumulation over many unrolling steps, as well as vanishing gradients and memory limitations [18].
Self-attention based transformer models proposed recently for LSTF tasks have revolutionized time series prediction and attained remarkable success. In contrast to traditional RNN models, transformers have exhibited superior capability in capturing long-range temporal dependencies. Still, recent advancements in this domain, as illustrated by LongFormer [19], Reformer [20], Informer [21], AutoFormer [22], and ETSformer [23], have predominantly focused on improving the efficiency of the self-attention mechanism, particularly for handling long input and output sequences. Concurrently, there has been a rise in attention-free architectures, as seen in Oreshkin et al. [12] and Challu et al. [13], which offer a computationally efficient alternative for modeling long input-output relationships by using deep stacks of fully connected layers. However, such models often overlook the intricate interactions between signals in multivariate time series data, tending to process each time series independently.
Spatio-temporal graph neural networks (ST-GNNs) are a specific type of GNNs that are tailored to handle both time series data and their interactions. They have been used in a wide range of applications such as action recognition [24], [25] and traffic forecasting [26], [27], [28]. These networks integrate sequential models for capturing temporal dependencies with GNNs employed to encapsulate spatial correlations among distinct nodes. However, a caveat with ST-GNNs is that they necessitate prior information regarding structural connectivity to depict the interrelations in time series data. This can be a limitation in cases where the structural information is not available.
Accordingly, GNNs that include structure learning components have been developed to learn effective graph structures suitable for time series forecasting. Two such models, NRI [8] and GTS [6], calculate the probability of an edge between nodes using pairwise scores, resulting in a discrete adjacency matrix. Nonetheless, this approach can be computationally intensive with a growing number of nodes. In contrast, MTGNN [5] and GDN [10] utilize a randomly initialized node embedding matrix to infer the latent graph structure. While this approach is less taxing on computational resources, it might compromise the accuracy of predictions.
Methods
In this section, we detail our proposed method, HGMTS. The overarching framework and core operational principles of this approach can be viewed in Figures 1 and 2.
Figure 1. Overview of the latent graph structure learning (L-GSL). (a) Key nodes chosen at random (depicted as gray circles) are used to measure the significance of a query node (shown as a blue circle). (b) Top-
Figure 2. Overview of the proposed HGMTS model architecture. The hierarchical residual block is marked by signal decomposition and GNN-centric L-GSL modules (left). The combination of multiple blocks forms a stack (middle), culminating in the entire model design (right) to ultimately produce a global forecasting output.
Figure 3. Ablation study overview. Displayed are four distinct model architectures explored to understand the impact of specific components on overall LSTF performance.
A. Preliminaries
Let $\mathbf{x}_{i,t-L:t} \in \mathbb{R}^{L}$ denote the past $L$ observations of the $i$-th of $N$ time series at time $t$, and let $\mathbf{y}_{i,t+1:t+K}$ denote its future values over a forecast horizon of $K$ steps. Given the lookback windows of all $N$ series, the goal of multivariate time series forecasting is to produce the predictions $\hat{\mathbf{y}}_{i,t+1:t+K}$ for every series $i$.
B. Latent Graph Structure Learning (L-GSL)
We embrace the concept of self-attention (introduced by [29]) and employ the attention scores as edge weights. The adjacency matrix of the graph, denoted as $\mathcal{A}$, is obtained from the node representation matrix $\mathbf{H}$ as \begin{equation*} {\mathbf {Q}}= {\mathbf {H}}\mathbf {W}^{Q},\quad {\mathbf {K}} = {\mathbf {H}} {\mathbf {W}} ^{K},\quad \mathcal {A}=\mathrm {softmax}\left ({\frac { {\mathbf {Q}} {\mathbf {K}} ^{T}}{\sqrt {D}}}\right) \tag{1}\end{equation*}
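As a concrete illustration, the following PyTorch sketch computes the dense attention-based adjacency of Eq. (1) from a matrix of node representations; the class and variable names are ours, and keeping the projection dimension equal to the input dimension is an assumption.

```python
import torch
import torch.nn as nn

class DenseGraphAttention(nn.Module):
    """Illustrative sketch of Eq. (1): attention scores used as edge weights."""
    def __init__(self, d_model: int):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model, bias=False)  # W^Q
        self.w_k = nn.Linear(d_model, d_model, bias=False)  # W^K

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (N, D) node representations, one row per time series
        q, k = self.w_q(h), self.w_k(h)
        scores = q @ k.transpose(-2, -1) / (h.size(-1) ** 0.5)
        return torch.softmax(scores, dim=-1)  # dense adjacency A of shape (N, N)
```

Computing this full matrix costs quadratic time and memory in the number of nodes, which motivates the sparse selection procedure described next.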
1) Identifying Pivotal Query Nodes
For the purpose of determining which query nodes will establish connections with other nodes, our initial step involves evaluating the significance of queries. Recent studies [19], [21], [30], [31] have highlighted the existence of sparsity in the distribution of self-attention probabilities. Drawing inspiration from these findings, we establish the importance of queries based on the Kullback-Leibler (KL) divergence between a uniform distribution and the attention probability distribution of query nodes.
Let $p(\mathbf{k}_{j} \mid \mathbf{q}_{i})$ denote the attention probability that query node $i$ assigns to key node $j$, and let $q(\mathbf{k}_{j} \mid \mathbf{q}_{i}) = 1/N$ be the uniform distribution over keys. The importance of the $i$-th query node is measured by the divergence between these two distributions: queries whose attention distribution deviates strongly from uniform are considered dominant and are retained.
The traversal of all query nodes for this measurement, however, still entails a quadratic computational requirement. It is worth noting that a recent study demonstrated that the relative magnitudes of query importance remain unchanged even when the divergence metric is calculated using randomly sampled keys [21]. Building on this idea, we determine the importance of query nodes through the computation of
2) Identifying Associated Key Nodes
Using the selected set of pivotal query nodes, we then identify the key nodes to which they should be connected and compute the sparse adjacency matrix restricted to these nodes:\begin{equation*} \bar {\mathcal {A}}=\mathrm {softmax}\left ({\frac {\bar {\mathbf {Q}} \bar {\mathbf {K}} ^{T}}{\sqrt {D}}}\right) \tag{2}\end{equation*}
In this equation, $\bar{\mathbf{Q}}$ and $\bar{\mathbf{K}}$ denote the reduced query and key matrices formed from the rows of $\mathbf{Q}$ and $\mathbf{K}$ that correspond to the selected query and key nodes, respectively.
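The sketch below puts the two selection steps together under stated assumptions: query importance is scored with the sampled-key divergence surrogate of Informer [21], and keys are ranked by the attention mass they receive from the chosen queries; the exact selection rule and the number of retained nodes are not fixed by the text, so those details are ours, as are all names.

```python
import torch
import torch.nn as nn

class SparseLGSL(nn.Module):
    """Hedged sketch of the L-GSL procedure (Eqs. 1-2)."""
    def __init__(self, d_model: int, n_selected: int, n_sampled_keys: int):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.n_selected = n_selected          # number of pivotal queries / keys kept
        self.n_sampled_keys = n_sampled_keys  # random keys used to score the queries

    def forward(self, h: torch.Tensor):
        n, d = h.shape
        q, k = self.w_q(h), self.w_k(h)

        # 1) Score every query against a random subset of keys (avoids O(N^2));
        #    the log-sum-exp minus mean surrogate follows Informer [21].
        key_idx = torch.randperm(n, device=h.device)[: self.n_sampled_keys]
        s_sample = q @ k[key_idx].T / d ** 0.5                       # (N, n_sampled)
        importance = torch.logsumexp(s_sample, dim=-1) - s_sample.mean(dim=-1)
        q_idx = importance.topk(self.n_selected).indices             # pivotal queries

        # 2) Keep the keys that receive the most attention from those queries
        #    (an assumed selection rule; the paper's exact criterion is elided).
        s_full = q[q_idx] @ k.T / d ** 0.5                           # (u, N)
        k_idx = s_full.softmax(dim=-1).sum(dim=0).topk(self.n_selected).indices

        # 3) Sparse adjacency over the selected query/key nodes (Eq. 2).
        a_bar = torch.softmax(q[q_idx] @ k[k_idx].T / d ** 0.5, dim=-1)
        return a_bar, q_idx, k_idx
```

Since attention is only materialized between the selected nodes, the cost scales with the number of retained queries and keys rather than with all pairwise combinations.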
C. Hierarchical Signal Decomposition
This section provides an overview of the proposed approach shown in Figure 2 and discusses the overall design principles. Our approach builds upon N-BEATS [12], significantly enhancing its key elements. Our methodology comprises three primary elements: signal decomposition, latent graph structure learning, and constructing forecasts and backcasts in a hierarchical manner. Much like the N-BEATS approach, every block is trained to generate both backcast and forecast signals. The backcast output is subtracted from the input of the subsequent block, whereas the forecasts are summed to produce the final prediction (Figure 2a). These blocks are arranged in stacks, each focusing on a distinct spatial dependency through a unique set of graph structures.
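This doubly residual composition can be sketched as follows; `block` is a placeholder for the hierarchical residual block detailed in the next subsections, and the tensor shapes are assumptions.

```python
import torch
import torch.nn as nn

class ResidualStack(nn.Module):
    """Sketch of the backcast/forecast composition described above. Each block
    is assumed to map an input window to a (backcast, forecast) pair; the
    concrete block is defined in the following subsections."""
    def __init__(self, blocks: nn.ModuleList, horizon: int):
        super().__init__()
        self.blocks = blocks
        self.horizon = horizon

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, L) lookback windows for N series
        residual = x
        forecast = x.new_zeros(x.size(0), self.horizon)
        for block in self.blocks:
            backcast, block_forecast = block(residual)
            residual = residual - backcast      # pass the unexplained part onward
            forecast = forecast + block_forecast  # accumulate partial forecasts
        return forecast
```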
1) Signal Decomposition Module
Recent research has witnessed a surging interest in disentangling time series data into its trend and seasonal components. These components respectively represent the overall long-term pattern and the seasonal fluctuations within the time signals. However, when it comes to future time series, directly performing this decomposition becomes impractical due to the inherent uncertainty of the future. To address this challenge, we propose the incorporation of a signal decomposition module within a single block (Figure 2a). This module enables the gradual extraction of the consistent, long-term trend from intermediate forecasting signals. Specifically, we employ the moving average technique to smooth out recurring fluctuations and uncover the underlying long-term trends as outlined below:\begin{align*} {\mathbf {X}}^{\texttt {trend}} &=\mathrm {AvgPool}(\mathrm {Padding}({\mathbf {X}})) \\ {\mathbf {X}}^{{\texttt {seas}}} &= {\mathbf {X}}- {\mathbf {X}}^{\texttt {trend}} \tag{3}\end{align*}
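A minimal PyTorch sketch of the decomposition in Eq. (3) follows, assuming the padding replicates the boundary values so that the smoothed trend keeps the input length; the kernel size is a free hyperparameter not specified here.

```python
import torch
import torch.nn as nn

class SeriesDecomposition(nn.Module):
    """Moving-average decomposition of Eq. (3)."""
    def __init__(self, kernel_size: int = 25):
        super().__init__()
        self.kernel_size = kernel_size
        self.avg = nn.AvgPool1d(kernel_size, stride=1)

    def forward(self, x: torch.Tensor):
        # x: (N, L); replicate the boundary values so the output length stays L.
        left = x[:, :1].repeat(1, (self.kernel_size - 1) // 2)
        right = x[:, -1:].repeat(1, self.kernel_size // 2)
        padded = torch.cat([left, x, right], dim=-1).unsqueeze(1)  # (N, 1, L + k - 1)
        trend = self.avg(padded).squeeze(1)                        # X^trend, (N, L)
        seasonal = x - trend                                       # X^seas = X - X^trend
        return seasonal, trend
```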
2) Message-Passing Module
The message-passing module receives as input the past $L$ observations of each node and iteratively refines the node representations as follows:\begin{align*} {\mathbf {h}}_{i}^{(0)} &=\; f({\mathbf {x}}_{i,t-L:t}) \tag{4}\\ {\mathbf {m}}_{ij}^{(r)} &=\; g({\mathbf {h}}_{i}^{(r)}- {\mathbf {h}}_{j}^{(r)}) \tag{5}\\ \mathcal {\bar A} &=\; \text {L-GSL}({\mathbf {H}}) \tag{6}\\ {\mathbf {h}}_{i}^{(r+1)} &=\; \text {GRU}\left({{\mathbf {h}}_{i}^{(r)}, \textstyle \sum \nolimits _{j \in \mathcal {N}(i)} \bar a_{ij} \cdot {\mathbf {m}} _{i j}^{(r)}}\right) \tag{7}\end{align*}
To enhance both the model’s expressivity and its capacity for generalization, we employ a multi-module GNN framework [32]. More specifically, the next hidden state $\mathbf{h}_{i}^{(r+1)}$ is computed as a gated combination of the updates $\mathbf{h}_{i,1}^{(r)}$ and $\mathbf{h}_{i,2}^{(r)}$ produced by two parallel GRU modules, weighted by a coefficient $\beta_{i}^{(r)}$:\begin{equation*} {\mathbf {h}}_{i}^{(r+1)} =\; \beta _{i}^{(r)} {\mathbf {h}}_{i,1}^{(r)} + (1-\beta _{i}^{(r)}) {\mathbf {h}}_{i,2}^{(r)} \tag{8}\end{equation*}
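The following sketch traces the dataflow of Eqs. (4)-(8) under stated assumptions: the encoders $f$ and $g$ are taken to be small MLPs, the two modules in Eq. (8) are GRU cells (consistent with the ablation variant HGMTS6), the gate $\beta$ is produced by a sigmoid layer, the selected query nodes are treated as message receivers, and the number of rounds is illustrative; none of these details are fixed by the text. The L-GSL module is the earlier `SparseLGSL` sketch.

```python
import torch
import torch.nn as nn

class MessagePassingModule(nn.Module):
    """Hedged sketch of Eqs. (4)-(8); only the overall dataflow is from the text."""
    def __init__(self, lookback: int, d_model: int, lgsl: nn.Module):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(lookback, d_model), nn.ReLU(),
                               nn.Linear(d_model, d_model))             # Eq. (4)
        self.g = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                               nn.Linear(d_model, d_model))             # Eq. (5)
        self.lgsl = lgsl                                                 # Eq. (6)
        self.gru1 = nn.GRUCell(d_model, d_model)                        # Eq. (7), module 1
        self.gru2 = nn.GRUCell(d_model, d_model)                        # Eq. (7), module 2
        self.gate = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())  # beta in Eq. (8)

    def forward(self, x: torch.Tensor, rounds: int = 2) -> torch.Tensor:
        h = self.f(x)                                   # (N, D) initial node states
        for _ in range(rounds):
            a_bar, q_idx, k_idx = self.lgsl(h)          # sparse adjacency over (u, u) nodes
            # Messages m_ij = g(h_i - h_j) between the selected nodes, edge-weighted.
            diff = h[q_idx].unsqueeze(1) - h[k_idx].unsqueeze(0)        # (u, u, D)
            msg = (a_bar.unsqueeze(-1) * self.g(diff)).sum(dim=1)       # (u, D)
            agg = torch.zeros_like(h).index_copy(0, q_idx, msg)         # others get zero input
            h1, h2 = self.gru1(agg, h), self.gru2(agg, h)
            beta = self.gate(h)                          # (N, 1) gating coefficient
            h = beta * h1 + (1 - beta) * h2              # Eq. (8)
        return h
```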
3) Forecast and Backcast Module
Following the completion of the last message-passing round $R$, the final node representations of the seasonal and trend channels are mapped to backcast and forecast signals:\begin{align*} \hat {\mathbf {x}} _{i,t-L:t}^{\texttt {seas}} &=\; \phi _{\texttt {seas}}({\mathbf {h}}_{i,\texttt {seas}}^{(R)}) \tag{9}\\ \hat {\mathbf {x}} _{i,t-L:t}^{\texttt {trend}} &=\; \phi _{\texttt {trend}}({\mathbf {h}}_{i,\texttt {trend}}^{(R)}) \tag{10}\\ \hat {\mathbf {x}} _{i,t-L:t} &=\; \hat {\mathbf {x}} _{i,t-L:t}^{\texttt {seas}} + \hat {\mathbf {x}} _{i,t-L:t}^{\texttt {trend}} \tag{11}\\ \hat {\mathbf {y}} _{i,t+1:t+K}^{\texttt {seas}} &=\; \psi _{\texttt {seas}}({\mathbf {h}}_{i,\texttt {seas}}^{(R)}) \tag{12}\\ \hat {\mathbf {y}} _{i,t+1:t+K}^{\texttt {trend}} &=\; \psi _{\texttt {trend}}({\mathbf {h}}_{i,\texttt {trend}}^{(R)}) \tag{13}\\ \hat {\mathbf {y}} _{i,t+1:t+K} &=\; \hat {\mathbf {y}} _{i,t+1:t+K}^{\texttt {seas}} + \hat {\mathbf {y}} _{i,t+1:t+K}^{\texttt {trend}} \tag{14}\end{align*}
Here, $\phi_{\texttt{seas}}$ and $\phi_{\texttt{trend}}$ are projection functions that produce the backcast outputs over the lookback window, while $\psi_{\texttt{seas}}$ and $\psi_{\texttt{trend}}$ produce the forecast outputs over the horizon of $K$ future steps; the seasonal and trend components are summed to form the block-level backcast $\hat{\mathbf{x}}_{i,t-L:t}$ and forecast $\hat{\mathbf{y}}_{i,t+1:t+K}$.
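A minimal sketch of these heads follows, assuming the mappings $\phi$ and $\psi$ are linear layers over the final node representations; the text does not fix their exact form, so this is illustrative only.

```python
import torch.nn as nn

class ForecastBackcastHead(nn.Module):
    """Sketch of Eqs. (9)-(14): heads mapping the final seasonal and trend node
    states to a backcast (length L) and a forecast (length K)."""
    def __init__(self, d_model: int, lookback: int, horizon: int):
        super().__init__()
        self.phi_seas = nn.Linear(d_model, lookback)    # Eq. (9)
        self.phi_trend = nn.Linear(d_model, lookback)   # Eq. (10)
        self.psi_seas = nn.Linear(d_model, horizon)     # Eq. (12)
        self.psi_trend = nn.Linear(d_model, horizon)    # Eq. (13)

    def forward(self, h_seas, h_trend):
        backcast = self.phi_seas(h_seas) + self.phi_trend(h_trend)   # Eq. (11)
        forecast = self.psi_seas(h_seas) + self.psi_trend(h_trend)   # Eq. (14)
        return backcast, forecast
```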
Experimental Setup
We first provide an overview of the datasets (Table 1), evaluation metrics, and baselines employed to quantitatively assess our model’s performance. The main results are summarized in Table 2, demonstrating the competitive predictive performance of our approach in comparison to existing works. We then elaborate on the specifics of our training and evaluation setups followed by detailing the ablation studies.
A. Datasets
Our experimentation extensively covers six real-world benchmark datasets. Conforming to the standard protocol [21], [33], the split of all datasets into training, validation, and test sets has been conducted chronologically, following a split ratio of 60:20:20 for the ETTm2 dataset and a split ratio of 70:10:20 for the remaining datasets.
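For concreteness, a small helper illustrating the chronological split described above; the function name and signature are ours, and the default ratios correspond to the non-ETTm2 datasets.

```python
def chronological_split(data, train_ratio=0.7, val_ratio=0.1):
    """Split a time-ordered array of shape (T, N) into chronological
    train/val/test segments. ETTm2 would use train_ratio=0.6, val_ratio=0.2."""
    t = len(data)
    train_end = int(t * train_ratio)
    val_end = int(t * (train_ratio + val_ratio))
    return data[:train_end], data[train_end:val_end], data[val_end:]
```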
ETTm2 (Electricity Transformer Temperature): This dataset encompasses data obtained from electricity transformers, featuring load and oil temperatures recorded every 15 minutes during the period from July 2016 to July 2018.
ECL (Electricity Consuming Load): The ECL dataset compiles hourly electricity consumption (in kWh) data from 321 customers, spanning the years 2012 to 2014.
Exchange: This dataset aggregates daily exchange rates of eight different countries relative to the US dollar. The data spans from 1990 to 2016.
Traffic: The Traffic dataset is a collection of road occupancy rates from 862 sensors situated along San Francisco Bay area freeways. These rates are recorded every hour, spanning from January 2015 to December 2016.
Weather: This dataset comprises 21 meteorological measurements, including air temperature and humidity. These measurements are recorded every 10 minutes throughout the entirety of the year 2020 in Germany.
ILI (Influenza-Like Illness): This dataset provides a record of weekly influenza-like illness (ILI) patients and the total patient count, sourced from the Centers for Disease Control and Prevention of the United States, covering the period from 2002 to 2021. Each weekly value represents the ratio of ILI patients to the total patient count.
B. Evaluation Metrics
We evaluate the effectiveness of our approach by measuring its accuracy using the mean squared error (MSE) and mean absolute error (MAE) metrics. These evaluations are conducted over various prediction horizon lengths $K$:\begin{align*} \mathrm {MSE} &=\frac {1}{NK} \sum _{i=1}^{N}\sum _{\tau =t+1}^{t+K}\left ({\mathbf {y}_{i,\tau }-\hat {\mathbf {y}}_{i,\tau }}\right)^{2} \tag{15}\\ \mathrm {MAE} &=\frac {1}{NK} \sum _{i=1}^{N}\sum _{\tau =t+1}^{t+K}\left |{\mathbf {y}_{i,\tau }-\hat {\mathbf {y}}_{i,\tau }}\right | \tag{16}\end{align*}
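A short helper mirroring Eqs. (15)-(16); the tensor shapes and the function name are assumptions.

```python
import torch

def mse_mae(y_true: torch.Tensor, y_pred: torch.Tensor):
    """Compute MSE and MAE over a batch of N series and K forecast steps.
    y_true, y_pred: tensors of shape (N, K)."""
    err = y_true - y_pred
    return (err ** 2).mean().item(), err.abs().mean().item()
```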
C. Baselines
We evaluate our proposed model by comparing it with seven baseline models: (1) N-BEATS [12], which aligns with the external structure of our model; (2) Autoformer [33], (3) Informer [21], (4) Reformer [20], and (5) LogTrans [31], which are recent transformer-based models; and two conventional RNN-based models, (6) LSTNet [34] and (7) LSTM [35].
D. Hyperparameters
Our model is trained using the Adam optimizer, starting with a learning rate of $10^{-4}$ that is halved every two epochs. We employ early stopping, terminating training if there is no improvement for 10 epochs. Training is carried out with a batch size of 32. Our model is configured with 3 stacks, each containing 1 block. All experiments are run three times using the PyTorch framework on a single NVIDIA RTX 3090 GPU with 24 GB of memory.
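The optimization setup can be sketched as follows; `train_one_epoch` and `evaluate` are hypothetical placeholder callables for the data and evaluation routines, and the maximum epoch count is an assumption.

```python
import torch
import torch.nn as nn

def train(model: nn.Module, train_one_epoch, evaluate, max_epochs: int = 100):
    """Hedged sketch of the training setup: Adam at 1e-4, LR halved every two
    epochs, early stopping after 10 epochs without improvement."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.5)
    best_val, patience, bad_epochs = float("inf"), 10, 0
    for epoch in range(max_epochs):
        train_one_epoch(model, optimizer)   # one pass over batches of size 32
        val_loss = evaluate(model)
        scheduler.step()                    # halve the learning rate every 2 epochs
        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:      # early stopping
                break
    return best_val
```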
Experimental Results
A. Multivariate Time Series Forecasting
In the multivariate setting, our proposed model, HGMTS, consistently achieves state-of-the-art performance across all benchmark datasets and prediction length configurations (Table 2). Notably, under the input-96-predict-192 setting, HGMTS demonstrates significant improvements over previous state-of-the-art results, with a 34% (
B. Effect of Sparsity in Graphs on Forecasting
Within the HGMTS model framework, a key hyperparameter is the sampling factor in L-GSL. This factor determines how many query nodes are selected and subsequently linked to key nodes. For the sake of simplicity, we ensure that the number of chosen query and key nodes remains the same. We then measure the sparsity of the latent graphs by computing the proportion of selected pivotal query or key nodes relative to the total time series count. This proportion is denoted as
To understand the impact of sparsity in the learned graphs, we modify
C. Ablation Studies
We posit that the strengths of the HGMTS architecture stem from its ability to hierarchically model the interplay between time series, particularly in the realms of trend and seasonality components. To delve deeper into this proposition, we present a series of control models for a comparative analysis:
HGMTS1: The model as showcased in Figure 2.
HGMTS2: A model that has shared latent graphs between trend and seasonality channels, but not across different blocks and stacks.
HGMTS3: A model where latent graphs are shared throughout all blocks and stacks but remain distinct between trend and seasonality channels.
HGMTS4: This model omits the L-GSL and MPNN modules.
HGMTS5: A model focusing solely on either the trend or seasonality channel, essentially lacking the signal decomposition module.
HGMTS6: A model that has used a single GRU module in Eq (8).
Under the same multivariate setting, the evaluation metrics for each control model, averaged over all benchmark datasets excluding ILI, are detailed in Table 4. The HGMTS4, which forgoes the L-GSL and MPNN modules, experiences a noticeable average MSE surge of 30% (
The information presented in Table 4 robustly supports the idea that best performance is achieved by integrating both suggested components: the latent graph structure and hierarchical signal decomposition. This emphasizes their synergistic role in enhancing the accuracy of long sequence time series predictions. Furthermore, it is confirmed that crafting distinct latent associations between time series hierarchically, spanning both trend and seasonal channels, is instrumental in attaining improved prediction outcomes.
Conclusion
In this paper, we addressed the challenge of long-term multivariate time series forecasting, an area that has seen notable progress recently but where intricate temporal patterns often prevent models from learning reliable dependencies. In response, we introduced HGMTS, a spatio-temporal multivariate time series forecasting model that incorporates a signal decomposition module and latent graph structure learning as intrinsic operators. This approach allows for the hierarchical aggregation of long-term trend and seasonal information from intermediate predictions. Furthermore, we adopted a multi-module message-passing framework to enhance the model’s capacity to capture diverse time series data from a range of heterogeneous sensors, which distinctly sets our work apart from previous neural forecasting models. Notably, HGMTS naturally achieves a computational complexity of
Learning a latent graph typically poses considerable challenges. Although our model leverages a top-$k$ pooling strategy to infer the latent graph, many other deep learning techniques could be investigated in future studies to uncover hidden structural patterns. Improvements in both representational capacity and computational efficiency could further broaden its adoption.