
An Enhanced Vehicle Trajectory Prediction Model Leveraging LSTM and Social-Attention Mechanisms




Abstract:

Accurate trajectory prediction for multiple vehicles in complex social interaction environments is essential for ensuring the safety of autonomous vehicles and improving the quality of their planning and control. The social interactions between vehicles significantly influence their future trajectories. However, traditional trajectory prediction models based on Recurrent Neural Networks (RNN) or Convolutional Neural Networks (CNN) often overlook or simplify these interactions. Although these models may exhibit high performance in short-term predictions, they fail to achieve high prediction accuracy in scenarios with long-term dynamic interactions. To address this limitation, we propose a Social-Attention Long Short-Term Memory (LSTM) model which predicts the future trajectories of neighboring vehicles and achieves increased accuracy. Our proposed model employs a Social-Pooling layer to effectively capture cooperative behaviors and mutual influences between vehicles. Additionally, we incorporate a self-attention mechanism to weight the inputs and outputs of the Social-Pooling layer, which is significant for assessing the influence between vehicles in different positions. This combination allows our model to take into consideration both the dependencies within the sequence and the social relationships between vehicles, providing a more comprehensive scene understanding. The efficacy of our model is tested on two real-world freeway trajectory datasets, namely NGSIM and HighD. Our model surpasses various baseline methods, exhibiting exceptional accuracy in both prediction and tracking.
Published in: IEEE Access (Volume: 12)
Page(s): 1718 - 1726
Date of Publication: 21 December 2023
Electronic ISSN: 2169-3536

SECTION I.

Introduction

Autonomous driving technology has always been a focal point of research and public interest. Autonomous vehicles employ onboard sensors and advanced algorithms to perceive their surrounding environment, infer the intentions of circumambient traffic participants, and conduct high-precision estimation of their future trajectories. This process can help autonomous vehicles better understand future driving environments and make better decisions [1]. In real-world complex traffic scenarios, the future trajectories of vehicles are contingent not only on their historical paths but also on the uncertain, multimodal, and intricate agent-to-agent and agent-to-space interactions [2]. This fact poses significant challenges to trajectory prediction tasks. Current trajectory prediction methods include model-based trajectory prediction and data-driven trajectory prediction.

Model-based methods primarily rely on the vehicle’s kinematics model, consider the physical characteristics of the vehicle and environmental factors, and use traditional filtering and optimization techniques such as Bayes’ theorem [3], Monte Carlo simulation [4], Hidden Markov Models (HMM) [5], Kalman filters [6], and Model Predictive Control (MPC) [7]. Shin et al. [8] proposed an urban intersection vehicle trajectory prediction method based on internal modeling and an Extended Kalman Filter (EKF), which incorporates the target vehicle’s kinematic and dynamic models. Xie et al. [9] designed an IMM trajectory prediction (IMMTP) algorithm combining an Interactive Multiple Model (IMM), an Unscented Kalman Filter (UKF), and a Deep Belief Network (DBN) to investigate vehicle lane-change intentions on highways: the UKF predicts the vehicle’s trajectory, the DBN predicts intention with uncertainty, and the IMM averages the outputs of the UKF and DBN to improve the prediction results. Additionally, MPC has been widely used in trajectory prediction. Li et al. [10] used an MPC controller for lane-change prediction and short-term vehicle trajectory prediction. Nilsson et al. [11] designed an algorithm that transforms the intention prediction problem into a longitudinal planning problem and solved it with MPC. These methods use mathematical models or predefined rules to predict the behavior of traffic participants, but because their short-term assumptions no longer hold over longer horizons, they perform poorly on long-term prediction tasks. Furthermore, these methods usually require manual construction of cost functions for different situations [12]. When facing complex, real-world traffic scenarios, their generalizability is limited and the accuracy of long-term trajectory prediction is low.

Data-driven trajectory prediction methods use historical vehicle trajectory information and machine learning techniques to predict the possible future trajectory of the target vehicle. These methods analyze the target’s historical trajectory, extract features, and build a model to predict the target’s future motion state. In [13], [14], and [15], conventional (shallow) machine learning methods such as Multi-Layer Perceptrons (MLP) and extended Random Decision Forests were used to predict vehicle intentions. Benterki et al. [16] trained an MLP and used it to predict the trajectories of up to five vehicles. To improve prediction accuracy, the temporal and spatial relationships between road users must be considered, and capturing the interactive relationships between multiple vehicles requires a model with multimodal feature processing capability. Traditional machine learning techniques struggle to handle high-dimensional data efficiently and incur high computational cost when vehicle interactions are considered, limiting their ability to accurately represent long-term spatiotemporal interactions between agents in complex environments [17]. Therefore, traditional machine learning methods are typically applied to prediction tasks over short time ranges and under simple conditions [18]; relying only on the short-term historical trajectory of an individual vehicle inevitably leads to significant errors.

Since vehicle trajectory prediction is a sequence-to-sequence task, methods based on deep neural networks, such as Recurrent Neural Networks (RNN) and their variants, including Long Short-Term Memory (LSTM) [19] and Gated Recurrent Units (GRU) [20], have exhibited exceptional performance in Multivariate Time Series (MTS) tasks, particularly in capturing long-term features in trajectories, owing to their ability to extract hidden dependencies from contextual time steps. For instance, Dang et al. [21] employed an LSTM with two dense layers to predict the Time to Lane Change (TTLC), achieving a remarkably low average prediction error of just 0.3 meters. Messaoud et al. [22] proposed an LSTM encoder-decoder algorithm for inferring the intentions of vehicles on highways: the encoder learns the spatial probability distribution of observed historical vehicle states in a road scene, while the decoder predicts the parameters of a bivariate Gaussian distribution. In recent research, deep neural networks that account for social interactions have been widely used. Deo and Trivedi [23] embedded convolutional social pooling into an LSTM encoder/decoder framework and employed a social tensor to encode the past motion states of surrounding vehicles. Dai et al. [24] designed an ST-LSTM algorithm comprising two LSTM layers, one for predicting the motion trajectories of surrounding vehicles (SVs) and another for modeling the interactions among them. Methods based on Graph Neural Networks have also been widely applied to trajectory prediction in multi-agent environments, owing to the capacity of graph-structured data to depict social interaction relations. Jeon et al. [25] employed a graph to encode the behavior of surrounding vehicles. Shi et al. [26] considered the interactions among pedestrians, proposing a Sparse Graph Convolutional Network (SGCN) for pedestrian trajectory prediction. Sheng et al. [20] proposed a graph-based spatiotemporal convolutional network (GSTCN), in which a Graph Convolutional Network (GCN) handles spatial interactions and a Convolutional Neural Network (CNN) captures temporal features. Although these methods improved the accuracy of trajectory prediction by introducing social interaction, they struggle to capture the dynamic influence across the complete historical trajectory and fail to detect the key positions that significantly affect the prediction results. Therefore, for trajectories containing long-term interactions, the accuracy of these interaction-aware models can still be improved.

The complexity and characteristics of social interactions suggest that simplified pooling schemes may not accurately reflect actual interaction patterns, potentially reducing the effectiveness of such methods. The attention mechanism allows models to selectively focus on specific components of the input data, which enables a model to consider dependencies within the sequence and attend more closely to positions with a higher impact on the prediction results. Recent studies employing the attention mechanism to handle trajectory interactions have shown promising results in prediction tasks [27], [28]. Messaoud et al. [29] proposed an LSTM encoder/decoder framework based on multi-head attention to quantify the importance of surrounding vehicles during driving. These developments highlight the potential of attention mechanisms in handling long-term temporal data. However, these uses of attention focus mainly on dependencies within the sequence and fail to incorporate relative vehicle positions and interaction behaviors, offering a less comprehensive perception of the scene.

We propose a vehicle trajectory prediction method based on a Social-Attention Long Short-Term Memory model (AS-LSTM), aiming to predict driving trajectories more accurately. In this model, we introduce a Social-Pooling (S-Pooling) layer to capture and process social interaction information between vehicles. Additionally, we utilize a self-attention mechanism to learn the correlations among different positions in the input sequence. Consequently, the model comprehensively considers both the internal dependencies of vehicle trajectories and the social interactions among vehicles, providing all-round scene perception and effectively capturing long-range dependencies.

The main contributions of this study can be summarized as follows:

  1. We employ an LSTM layer to encode and model the historical trajectories of vehicles, capturing the temporal patterns and long-term dependencies of trajectories. On this basis, we introduce an S-Pooling layer that aggregates and combines the outputs of multiple LSTM layers to fully consider the interactions among traffic participants. By encoding the information of surrounding vehicles into the model, we integrate inter-vehicle relations and social information, enabling spatially adjacent LSTMs to share information. The AS-LSTM model can thus capture the cooperative behavior and mutual influence between vehicles and predict the trajectory of the target vehicle more accurately.

  2. Modeling social interactions often involves long time series, in which dependencies may extend to distant time steps. To address this, we use a self-attention mechanism to model dependencies across the entire sequence and to weight the output of the S-Pooling layer. This quantifies the degree of mutual influence among vehicles, enabling the model to exploit social interaction information to predict trajectories more effectively.

  3. A comprehensive evaluation of the proposed model was undertaken using the HighD and NGSIM datasets, with an ensuing analysis of the outcomes. Experimental results show that, compared with existing trajectory prediction models, our proposed network exhibits higher prediction accuracy over prediction horizons of 1 to 5 seconds.

The remainder of this paper is organized as follows. Section II defines the problem to be solved. Section III elaborates on the methodology of this study, encompassing the overarching framework and its specifics. Section IV describes the experimental settings, presents the experimental results, and compares them with existing methods. Finally, Section V offers concluding remarks and highlights the scope for future work.

SECTION II.

Problem Formulation

This paper postulates that vehicle trajectory prediction is a process of estimating future trajectory distributions based on historical trajectories. It implies that when making trajectory predictions, all adjacent vehicles’ future positions need to be taken into consideration. This approach facilitates a more comprehensive understanding of future scenarios, as the behavior of neighboring vehicles can potentially exert a significant impact on the future trajectory of the target vehicle. Neglecting the information of these adjacent vehicles may lead to inaccurate or incomplete prediction results.

To appropriately frame this problem, we start by introducing specific notations and parameters. We consider an environment populated with multiple agents whose historical states, from time step 1 to $T_{\text {obs}}$ , can be acquired from existing datasets. For each agent $i$ , these historical states, denoted as $\mathbf {S}_{t}^{i}$ , consist of various parameters including global coordinates $(x,y)$ , velocity $v$ , and acceleration $a$ at time $t$ . The global coordinates $(x,y)$ represent the agent’s specific position in the environment, with $x$ as the longitudinal coordinate and $y$ as the lateral coordinate. The velocity $v$ and acceleration $a$ are crucial parameters reflecting the agent’s driving intentions.

Mathematically, the historical state of an agent $i$ at time step $t$ is represented as $\mathbf{S}_{t}^{i} = \left[(x_{1}, y_{1}, v_{1}, a_{1}), (x_{2}, y_{2}, v_{2}, a_{2}), \ldots, (x_{t}, y_{t}, v_{t}, a_{t})\right]$, $t \in \{1, 2, 3, \ldots, T_{\text{obs}}\}$.

From these historical states, we can extract the agent’s historical positions as $\mathbf {X}_{\text {in}}^{i} = \left [{\left ({x_{1}^{i}, y_{1}^{i}}\right), \left ({x_{2}^{i}, y_{2}^{i}}\right), \ldots, \left ({x_{t}^{i}, y_{t}^{i}}\right)}\right]$ , with the same time steps $t \in \{1, 2, 3, \ldots, T_{\text {obs}}\}$ .

The primary aim of our model is to predict the trajectory of each agent from time $T_{\text {obs}+1}$ to $T_{\text {pred}}$ . This predicted trajectory is defined as $\mathbf {Y}_{\text {out}}^{i} = \left [{\left ({x_{1}^{i}, y_{1}^{i}}\right), \left ({x_{2}^{i}, y_{2}^{i}}\right), \ldots, \left ({x_{t}^{i}, y_{t}^{i}}\right)}\right], t \in \{T_{\text {obs}+1}, T_{\text {obs}+2}, \ldots, T_{\text {pred}}\}$ .

In summary, the trajectory prediction problem can be described as: Given the historical positions $\mathbf {X}_{\text {in}}^{i}$ of all adjacent vehicles within the time range of 1 to $T_{\text {obs}}$ , the goal is to predict their future trajectory distributions $\mathbf {Y}_{\text {out}}^{i}$ from $T_{\text {obs}+1}$ to $T_{\text {pred}}$ .
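To make this formulation concrete, below is a minimal sketch of how one scene could be laid out as tensors. The shapes reflect the 3 s history / 5 s horizon at 5 fps used in Section IV; the agent count and all names are illustrative assumptions, not from the paper:

```python
import torch

N = 8                     # agents in the scene (illustrative)
T_obs, T_pred = 15, 40    # 3 s observed, 5 s to predict, at 5 fps

# Historical states S_t^i: features (x, y, v, a) per agent and time step.
S = torch.randn(N, T_obs, 4)

# Historical positions X_in^i are the (x, y) slice of the states.
X_in = S[..., :2]                      # (N, T_obs, 2)

# The model must produce future positions Y_out^i for t = T_obs+1 .. T_pred.
Y_out = torch.zeros(N, T_pred - T_obs, 2)
```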

SECTION III.

Structure of AS-LSTM Model

This section presents the network architecture of our algorithm, as depicted in Figure 1. The model consists of four components: Encoder, Self-Attention, S-Pooling Layer, and Decoder. The Encoder receives trajectory data as input and transforms it into internal representations that capture important characteristics of the input data, providing valuable information for subsequent processing steps. Self-Attention is used to analyze the internal dependencies of vehicle trajectories and to measure the degree of mutual influence among vehicles at different positions. The S-Pooling Layer captures the collaborative behavior and interaction among vehicles by gridding the hidden states of adjacent LSTM layers. The Decoder predicts the future trajectories of the vehicle based on the information received from the Encoder, Self-Attention, and S-Pooling Layer. Each part is described below:

FIGURE 1. Schematic diagram of the proposed AS-LSTM structure: Our model takes the historical trajectories of all traffic participants within a certain spatial range around the ego vehicle (highlighted in a bright green box) as input. The historical trajectories of all vehicles are fed into the LSTM Encoder layer to obtain hidden states ($\mathbf{h}_{i}^{t-1}$). These vectors are then embedded into a tensor ($\mathbf{H}_{t}^{i} \in \mathbb{R}^{N_{0} \times M_{0} \times D}$) based on their spatial positions and added to themselves through a self-attention layer. The resulting matrix can be considered to contain historical interaction information and spatial position information. Subsequently, this information is passed through the LSTM layer, combined with the information processed by the Embedding and self-attention layers, and decoded to produce the predicted trajectory outputs.

A. LSTM Encoder

The main task of the Encoder is to receive the historical states $\mathbf{S}_{t}^{i}$ of the input vehicle and encode them into a continuous internal representation, providing deeper context information for subsequent processing. For this, we use a Long Short-Term Memory (LSTM) network as the Encoder. Compared with traditional Recurrent Neural Networks (RNN), LSTM is better able to handle long-term dependencies, effectively retaining and utilizing past information when dealing with long sequences.

For the input $\mathbf{S}_{t}^{i} \in \mathbb{R}^{n \times N}$, the LSTM combines it with the hidden state $\mathbf{h}_{t-1}$ and memory cell $\mathbf{c}_{t-1}$ from the previous time step, generating the hidden state $\mathbf{h}_{t}$ and memory cell $\mathbf{c}_{t}$ for the current time step. This process can be formally described by the following formulas:\begin{align*} \mathbf{i}_{t} &= \sigma\left(\mathbf{W}_{ii}\mathbf{S}_{t}^{i} + \mathbf{b}_{ii} + \mathbf{W}_{hi}\mathbf{h}_{t-1} + \mathbf{b}_{hi}\right) \tag{1}\\ \mathbf{f}_{t} &= \sigma\left(\mathbf{W}_{if}\mathbf{S}_{t}^{i} + \mathbf{b}_{if} + \mathbf{W}_{hf}\mathbf{h}_{t-1} + \mathbf{b}_{hf}\right) \tag{2}\\ \mathbf{g}_{t} &= \tanh\left(\mathbf{W}_{ig}\mathbf{S}_{t}^{i} + \mathbf{b}_{ig} + \mathbf{W}_{hg}\mathbf{h}_{t-1} + \mathbf{b}_{hg}\right) \tag{3}\\ \mathbf{o}_{t} &= \sigma\left(\mathbf{W}_{io}\mathbf{S}_{t}^{i} + \mathbf{b}_{io} + \mathbf{W}_{ho}\mathbf{h}_{t-1} + \mathbf{b}_{ho}\right) \tag{4}\\ \mathbf{c}_{t} &= \mathbf{f}_{t} \ast \mathbf{c}_{t-1} + \mathbf{i}_{t} \ast \mathbf{g}_{t} \tag{5}\\ \mathbf{h}_{t} &= \mathbf{o}_{t} \ast \tanh\left(\mathbf{c}_{t}\right) \tag{6}\end{align*} where $\mathbf{i}_{t}$, $\mathbf{f}_{t}$, $\mathbf{g}_{t}$, and $\mathbf{o}_{t}$ denote the input gate, forget gate, candidate cell state, and output gate at time step $t$; $\sigma$ is the sigmoid function; $\mathbf{W}_{xy}$ and $\mathbf{b}_{xy}$ represent the LSTM layer’s weights and biases; $\mathbf{S}_{t}^{i}$ is the input at time step $t$; $\mathbf{h}_{t-1}$ is the previous hidden state; and $\mathbf{c}_{t-1}$ is the previous memory cell. Element-wise multiplication is denoted by “$\ast$”. For convenience, we write the LSTM operation as $\mathbf{h}_{t}^{i} = \text{LSTM}\left(\mathbf{S}_{t}^{i}; \mathbf{W}_{l}\right)$, where $\mathbf{W}_{l}$ represents the weights of the LSTM.
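As a minimal sketch of this encoder, the gate recurrence of Eqs. (1)–(6) is available in PyTorch as `nn.LSTM`; the pre-LSTM linear embedding and the 128-dimensional hidden size are assumptions beyond what the paper states:

```python
import torch
import torch.nn as nn

class TrajectoryEncoder(nn.Module):
    """Encodes an agent's historical states S_t^i into hidden states h_t^i.
    nn.LSTM implements the gate equations (1)-(6) internally."""
    def __init__(self, state_dim: int = 4, hidden_dim: int = 128):
        super().__init__()
        self.embed = nn.Linear(state_dim, hidden_dim)  # assumed input embedding
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, S: torch.Tensor):
        # S: (batch, T_obs, 4) with features (x, y, v, a)
        out, (h_n, _) = self.lstm(torch.relu(self.embed(S)))
        return out, h_n.squeeze(0)  # per-step hidden states, final state

encoder = TrajectoryEncoder()
hidden_seq, h_final = encoder(torch.randn(8, 15, 4))  # 8 agents, 3 s at 5 fps
```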

B. Self-Attention

To capture dependencies among different positions in a sequence and allocate distinct weights to each position based on their importance, we employ a self-attention mechanism to model the relationships between sequence elements within the input sequence. Within this self-attention mechanism, we calculate attention scores between all points in the input sequence and all other points. Then, utilizing these scores, we can generate a new, weighted-average representation for each point, where the weights are the attention scores of the point in relation to all other points.

Self-Attention is applied to the output of the S-Pooling layer to weight it, thereby deriving a set of weights that measure the extent of mutual influence between vehicles at different positions. More specifically, this can be represented as:\begin{align*} \mathbf{Q}_{t}^{i} &= \mathbf{H}_{t}^{i} \mathbf{W}^{Q}, \quad \mathbf{K}_{t}^{i} = \mathbf{H}_{t}^{i} \mathbf{W}^{K}, \quad \mathbf{V}_{t}^{i} = \mathbf{H}_{t}^{i} \mathbf{W}^{V} \tag{7}\\ \text{Attention}\left(\mathbf{Q}_{t}^{i}, \mathbf{K}_{t}^{i}, \mathbf{V}_{t}^{i}\right) &= \text{softmax}\left(\frac{\mathbf{Q}_{t}^{i} {\mathbf{K}_{t}^{i}}^{T}}{\sqrt{d_{k}}}\right)\mathbf{V}_{t}^{i} \tag{8}\end{align*} where $\mathbf{Q}_{t}^{i}$, $\mathbf{K}_{t}^{i}$, and $\mathbf{V}_{t}^{i}$ denote the query, key, and value matrices, $\mathbf{W}^{Q}, \mathbf{W}^{K}, \mathbf{W}^{V} \in \mathbb{R}^{d_{m} \times d_{m}}$ are learnable weight matrices, and $d_{k}$ is the dimension of the attention head. Its application to the output of the S-Pooling layer can be expressed as:\begin{equation*} \mathbf{a}_{i}^{t} = \text{Attention}\left(\mathbf{H}_{i}^{t}; \mathbf{W}_{a}\right) \tag{9}\end{equation*} where $\mathbf{H}_{i}^{t}$ represents the output of the S-Pooling layer, and $\mathbf{W}_{a}$ is the learnable parameter matrix.
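A minimal single-head sketch of Eqs. (7)–(8) follows. The dimension of 64 matches the query/key setting reported in Section IV; everything else (class name, bias-free projections, input layout) is an illustrative assumption:

```python
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Single-head scaled dot-product self-attention (Eqs. (7)-(8))."""
    def __init__(self, d_model: int = 64):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_model, bias=False)  # W^Q
        self.W_k = nn.Linear(d_model, d_model, bias=False)  # W^K
        self.W_v = nn.Linear(d_model, d_model, bias=False)  # W^V

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        # H: (batch, seq_len, d_model), e.g. a flattened S-Pooling tensor
        Q, K, V = self.W_q(H), self.W_k(H), self.W_v(H)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(K.size(-1))  # QK^T/sqrt(d_k)
        return torch.softmax(scores, dim=-1) @ V  # weighted representation
```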

FIGURE 2. The structure of the self-attention adjacency matrix generator.

C. S-Pooling Layer

To enable the LSTM layer to capture social interactions between vehicles during trajectory prediction, we connect adjacent LSTM layers to grasp the collaborative behavior and mutual influence between vehicles. The procedure starts by compiling the historical states of each vehicle in the dataset. To more effectively utilize the relative positional relationships between different agents, a gridding process is adopted. For each target vehicle to be predicted, we perform gridding with the vehicle as the center. Next, we create a tensor of shape $N_{0} \times M_{0} \times D$. This tensor, denoted as $\mathbf{H}_{t}^{i}$, essentially forms a grid map centered on the $i^{\text{th}}$ agent. Each grid cell has a side length of $a$, and the hidden dimension $D$ accommodates the historical information $\mathbf{h}_{t-1}^{j}$ of the $j^{\text{th}}$ agent occupying the cell. This process is captured in the following formula:\begin{equation*} \mathbf{H}_{t}^{i}\left(n,m,:\right) = \sum_{j} 1_{mn}\left[x_{t}^{j} - x_{t}^{i},\; y_{t}^{j} - y_{t}^{i}\right] \mathbf{h}_{t-1}^{j} \tag{10}\end{equation*} where $\mathbf{h}_{t-1}^{j}$ represents the hidden state of the $j^{\text{th}}$ agent at time $t-1$, and $1_{mn}\left[x, y\right]$ is an indicator function checking whether the point $\left(x, y\right)$ falls in the grid cell $\left(n, m\right)$. Next, we embed the coordinates into $\mathbf{r}_{t}^{i}$ and combine the pooled social hidden-state tensor $\mathbf{H}_{t}^{i}$ with the attention-weighted vector $\mathbf{a}_{t}^{i}$ into the embedding $\mathbf{e}_{t}^{i}$. These vectors are concatenated together and serve as the input for the temporal LSTM units, forming the following recursive relationship:\begin{align*} \mathbf{r}_{t}^{i} &= \phi\left(x_{t}^{i}, y_{t}^{i}; \mathbf{W}_{r}\right) \tag{11}\\ \mathbf{e}_{t}^{i} &= \phi\left(\mathbf{a}_{t}^{i}, \mathbf{H}_{t}^{i}; \mathbf{W}_{e}\right) \tag{12}\\ \mathbf{b}_{t}^{i-1} &= \text{Attention}\left(\mathbf{h}_{t}^{i-1}; \mathbf{W}_{b}\right) \tag{13}\\ \mathbf{h}_{t}^{i} &= \text{LSTM}\left(\left[\mathbf{b}_{t}^{i-1}, \mathbf{e}_{t}^{i}\right]; \mathbf{W}_{l}\right) \tag{14}\end{align*}

FIGURE 3. Overview of the AS-LSTM model.

Here, $\phi\left(\cdot\right)$ is an embedding function with ReLU non-linearity; $\mathbf{W}_{e}$, $\mathbf{W}_{r}$, and $\mathbf{W}_{b}$ are the embedding weights; and $\mathbf{W}_{l}$ are the weights of the LSTM. Through the above steps, our method captures not only the dynamic characteristics of each vehicle but also the social interactions between vehicles, which is crucial for improving the accuracy of trajectory prediction.
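A naive double-loop sketch of the grid construction in Eq. (10) is shown below, using the $20 \times 20$ grid and $a = 0.9\,\text{m}$ cell size reported in Section IV. The function and variable names are illustrative, and a vectorized scatter would be used in practice:

```python
import torch

def social_pooling(pos, hidden, grid_n=20, grid_m=20, cell=0.9):
    """Sketch of Eq. (10): scatter neighbours' hidden states h_{t-1}^j into
    a grid H_t^i centred on each agent i. pos: (N, 2), hidden: (N, D)."""
    N, D = hidden.shape
    H = torch.zeros(N, grid_n, grid_m, D)
    half_n, half_m = grid_n * cell / 2, grid_m * cell / 2
    for i in range(N):
        for j in range(N):
            if i == j:
                continue
            dx, dy = (pos[j] - pos[i]).tolist()
            # Indicator 1_{mn}[dx, dy]: is agent j inside grid cell (n, m)?
            if abs(dx) < half_n and abs(dy) < half_m:
                n = int((dx + half_n) // cell)
                m = int((dy + half_m) // cell)
                H[i, n, m] += hidden[j]  # sum over occupants of the cell
    return H
```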

D. LSTM Decoder

The role of the decoder is to transform the internal representation generated by the encoder into the predicted future vehicle trajectory. When predicting positions, we set up a bivariate Gaussian distribution parameterized by the mean $\boldsymbol{\mu}_{t+1}^{i} = \left(\boldsymbol{\mu}_{x}, \boldsymbol{\mu}_{y}\right)_{t+1}^{i}$, standard deviation $\boldsymbol{\sigma}_{t+1}^{i} = \left(\boldsymbol{\sigma}_{x}, \boldsymbol{\sigma}_{y}\right)_{t+1}^{i}$, and correlation coefficient $\boldsymbol{\rho}_{t+1}^{i}$. These parameters are predicted by a linear layer with weight matrix $\mathbf{W}_{p}$:\begin{equation*} \left[\boldsymbol{\sigma}_{t}^{i}, \boldsymbol{\mu}_{t}^{i}, \boldsymbol{\rho}_{t}^{i}\right] = \mathbf{W}_{p} \mathbf{h}_{i}^{t-1} \tag{15}\end{equation*} The predicted coordinates at time $t$ are then sampled as:\begin{equation*} \left(\hat{x}, \hat{y}\right)_{t}^{i} \sim \mathcal{N}\left(\boldsymbol{\mu}_{t}^{i}, \boldsymbol{\sigma}_{t}^{i}, \boldsymbol{\rho}_{t}^{i}\right) \tag{16}\end{equation*}
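A minimal sketch of this output head is given below. The exp/tanh activations that keep $\boldsymbol{\sigma}$ positive and $\boldsymbol{\rho}$ in $(-1, 1)$ are a standard assumption; the paper does not spell them out:

```python
import torch
import torch.nn as nn

class GaussianDecoder(nn.Module):
    """Sketch of Eqs. (15)-(16): map a hidden state to the five parameters
    of a bivariate Gaussian over the next position."""
    def __init__(self, hidden_dim: int = 128):
        super().__init__()
        self.W_p = nn.Linear(hidden_dim, 5)  # mu_x, mu_y, sigma_x, sigma_y, rho

    def forward(self, h: torch.Tensor):
        mu_x, mu_y, sx, sy, rho = self.W_p(h).unbind(-1)
        sx, sy = torch.exp(sx), torch.exp(sy)  # assumed: enforce sigma > 0
        rho = torch.tanh(rho)                  # assumed: enforce |rho| < 1
        return mu_x, mu_y, sx, sy, rho
```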

FIGURE 4. Structure of S-Pooling.

Given that each LSTM layer corresponds one-to-one with the S-Pooling layer (each LSTM layer is paired with a corresponding S-Pooling layer), during backward propagation at a time step in each scene, all LSTM layers and S-Pooling layers corresponding to the agents are updated simultaneously. To enhance prediction accuracy, we guide the model to learn with the objective of minimizing positional error. For the $i^{\text{th}}$ LSTM model, its parameters are learned by minimizing the negative log-likelihood loss:\begin{align*} \mathcal{L}^{i}\left(\mathbf{W}_{e}, \mathbf{W}_{l}, \mathbf{W}_{p}\right) = -\sum_{t=T_{\text{obs}}+1}^{T_{\text{pred}}} \ln\left(P\left(x_{t}^{i}, y_{t}^{i} \mid \boldsymbol{\sigma}_{t}^{i}, \boldsymbol{\mu}_{t}^{i}, \boldsymbol{\rho}_{t}^{i}\right)\right) \tag{17}\end{align*}

This loss function ensures that during model optimization, parameters will be adjusted according to the discrepancy between the predicted results and the actual trajectory, making the predicted positions as close as possible to the real vehicle trajectory. In this way, the decoder effectively transforms the internal state into a prediction of future trajectories, thus accomplishing the task of vehicle trajectory prediction.
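A sketch of Eq. (17) for one agent follows, writing out the bivariate Gaussian log-density explicitly; the `eps` clamp for numerical stability is an assumption:

```python
import math
import torch

def bivariate_nll(x, y, mu_x, mu_y, sx, sy, rho, eps=1e-6):
    """Sketch of Eq. (17): negative log-likelihood of ground-truth points
    (x, y) under the predicted bivariate Gaussian, summed over time steps.
    All arguments are tensors of shape (T_pred - T_obs,)."""
    zx = (x - mu_x) / sx
    zy = (y - mu_y) / sy
    one_m_r2 = (1.0 - rho ** 2).clamp_min(eps)  # 1 - rho^2, kept positive
    z = zx ** 2 + zy ** 2 - 2.0 * rho * zx * zy
    log_p = -z / (2.0 * one_m_r2) \
            - torch.log(2.0 * math.pi * sx * sy * torch.sqrt(one_m_r2))
    return -log_p.sum()  # sum over t = T_obs+1 .. T_pred
```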

SECTION IV.

Experimental Results

A. Experimental Setting

The datasets used in this study were NGSIM I-80 [31], [32] and HighD [30]. The NGSIM dataset is a public dataset recorded at 10 fps in 2005. The HighD dataset includes driving data of cars and trucks on the highways around Cologne, Germany, collected by RWTH Aachen University in 2017 and 2018 using a drone at 25 frames per second (fps). Taking the NGSIM I-80 dataset as an example, 150,000 samples were used: 72,000 for training, 6,000 for testing, and 72,000 for validation. To facilitate processing, the sampling frequency of the datasets was reduced to 5 fps. For the HighD dataset, 50% of the data was used for training, 10% for testing, and 40% for validation. An 8-second window was selected to depict the trajectory of each vehicle: 3 seconds as the historical input and 5 seconds as the trajectory to be predicted.

All experiments were performed on an AMD EPYC 7371 CPU and an NVIDIA RTX A5000 GPU. All programs were written in Python 3.9, with PyTorch as the deep learning framework.

B. Evaluation Metrics

In order to measure the accuracy of trajectory prediction, the Root Mean Square Error (RMSE) between the predicted trajectory and the actual trajectory is adopted. RMSE is computed as follows:\begin{equation*} \text{RMSE}_{t} = \sqrt{\frac{1}{N} \sum_{n=1}^{N} \left[\left(\hat{x}_{t}^{n} - x_{t}^{n}\right)^{2} + \left(\hat{y}_{t}^{n} - y_{t}^{n}\right)^{2}\right]} \tag{18}\end{equation*} where $x_{t}^{n}, y_{t}^{n}$ are the ground truth and $\hat{x}_{t}^{n}, \hat{y}_{t}^{n}$ are the prediction for vehicle $n$ at time $t$.
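Eq. (18) translates directly into a few lines; the tensor layout is an assumption:

```python
import torch

def rmse_at_t(pred, gt):
    """Eq. (18): RMSE over N vehicles at a single time step t.
    pred, gt: (N, 2) tensors of predicted / ground-truth (x, y) positions."""
    return torch.sqrt(((pred - gt) ** 2).sum(dim=-1).mean())
```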

  1. Scene Size: The sight horizon for autonomous vehicles is set to ±100 meters longitudinally and spans two adjacent lanes laterally.

  2. Trajectory Prediction Module: The grid is set to $20 \times 20 \times 128$ with cell side length $a = 0.9\,\text{m}$. The lengths of the encoding and decoding sequences are both set to 128, with the S-Pooling encoding length set to 64. In the self-attention mechanism, the feature dimensions of both the query and the key are set to 64.

  3. Training Process: The batch size is set to 16. We use the Adam optimizer, with a learning rate starting at 0.0001 and linearly decaying to 0. The training algorithm stops updating after 30 iterations. A sketch of this setup follows the list.
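The stated training configuration could be wired up as follows. This is a sketch: the stand-in model is a placeholder for the full AS-LSTM network, the inner data pass is elided, and the exact decay schedule beyond "linear to 0" is assumed:

```python
import torch
import torch.nn as nn

# Stand-in for the AS-LSTM network; any nn.Module works for this sketch.
model = nn.LSTM(input_size=4, hidden_size=128, batch_first=True)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
epochs = 30  # training stops updating after 30 iterations
# Learning rate decays linearly from 1e-4 to 0 over training.
scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1.0, end_factor=0.0, total_iters=epochs)

for epoch in range(epochs):
    # ... one pass over the training set with batch size 16,
    #     minimizing the NLL loss of Eq. (17) ...
    scheduler.step()
```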

C. Results and Comparisons

To verify the proposed model, we selected several models with excellent performance in recent years for comparison. The selected models are as follows:

  • Class variational Gaussian mixture models (C-VGMMs) [33]: This approach uses Variational Gaussian Mixture Models integrated with a Markov random field for classifying driving behaviors and predicting the trajectory.

  • GAIL [34]: The Generative Adversarial Imitation Learning model is extended with gated recurrent units to better maintain policy fidelity;

  • Social-LSTM (S-LSTM) [35]: S-LSTM implements an LSTM encoder/decoder framework with fully connected social pooling for trajectory prediction;

  • Convolutional Social Pooling LSTM (CS-LSTM) [23]: This model applies convolutional social pooling to encode surrounding vehicles into social tensors, utilizes an LSTM encoder/decoder, and addresses multi-modal driving maneuvers;

  • MATF [36]: This approach encodes the historical trajectories of multiple agents and the scene context into a Generative Adversarial Network, leveraging adversarial loss;

  • Sparse Graph Convolutional Network (SGCN) [26]: SGCN explicitly constructs sparse directed interactions based on sparse directed spatial graphs;

  • Scalable Network (SCALE-Net) [25]: SCALE-Net deploys edge-enhanced graph convolutional neural networks, maintaining consistent performance regardless of the number of input vehicles and improving prediction efficiency;

  • Multi-Future Prediction (MFP) [37]: MFP employs parallel Recurrent Neural Networks with shared-weight encoders to capture the interactions among agents and predict multiple encoded future trajectories;

  • Multi-Head Attention Social Pooling (MHA) [29]: This technique utilizes the multi-head attention mechanism of an encoder/decoder to extract profound features of target vehicles and surrounding vehicles, considering a broad spectrum of input features such as speed, acceleration, and vehicle type;

  • Spatio-Temporal Graph Dual-Attention Networks (STG-DAT) [38]: STG-DAT employs dynamic graph representation and relational inductive biases for explicit interaction modeling. It utilizes both trajectory and scene context data, incorporates an efficient kinematic constraint layer for vehicle trajectory prediction, and enhances model performance.

  • Incremental Pearson Correlation Coefficient (IPCC-TP) [39]: IPCC-TP introduces a novel relevance-aware module based on the Incremental Pearson Correlation Coefficient for enhanced multi-agent interaction modeling. It learns pairwise joint Gaussian distributions through the closely linked estimation of means and covariances based on interactive incremental movements.

Table 1 illustrates the RMSE values for each model when predicting trajectories over a horizon of 1 s to 5 s on the NGSIM and HighD datasets. It should be noted that some studies did not evaluate their models on the HighD dataset, so the corresponding results are not available. From Table 1, we can observe that the RMSE on NGSIM was higher than on HighD. This difference is likely due to the higher accuracy and lower noise of the HighD dataset.

TABLE 1. RMSE error of each model in predicting 1 s-5 s.

In the short-term forecast horizon (1s-3s), predictions grounded in the kinematic properties and inertia of the target vehicle tend to exhibit relatively minor errors. However, when it comes to long-term forecasting (4s-5s), predictions regarding driving intentions begin to play a more influential role in shaping the future trajectory of the target vehicle, consequently leading to larger discrepancies. Therefore, the identification and comprehension of intrinsic driving patterns become paramount to effectively steer the trajectory prediction process.

Table 2 presents a comparative analysis of the accuracy of the predicted horizontal and vertical positions using the AS-LSTM model proposed in this paper, as compared to the MHA and SGCN models. The proposed model demonstrates superior accuracy in both horizontal and vertical trajectory predictions when compared to these models.

TABLE 2. Comparison of horizontal and vertical RMSE.

In Table 3, we carry out a series of ablation studies based on the RMSE loss measured over a span of 5 seconds. These include increasing the number of training steps, applying the self-attention mechanism (A1 signifies self-attention within the LSTM layer, and A2 denotes self-attention following the S-Pooling layer), and integrating S-Pooling (an ‘$\times$’ in column S indicates that the S-Pooling layer is replaced by a simple fully connected network).

TABLE 3. Ablation study of AS-LSTM.

Our experimental findings underscore the value of integrating the attention mechanism into our models. When applied individually to the LSTM and S-Pooling layers, the attention mechanism yields error reductions of 53% and 44%, respectively, compared to models without it. Applying the attention mechanism to both layers simultaneously cuts the error down to 39% of its initial value. Moreover, incorporating the S-Pooling layer improves accuracy by a further 26% compared to the model using only the attention mechanism.

The results clearly illuminate the profound impact of implementing the self-attention mechanism, especially when used synergistically with S-pooling. The self-attention mechanism allows the model to focus on the most relevant features by weighting them more heavily, enhancing the learning process by providing a more precise understanding of the data. This, in turn, significantly improves the model’s ability to make accurate predictions.

Furthermore, this technique provides a significant advantage in long-term predictions. Long-term predictions are often challenging due to the increased uncertainty and complexity associated with extended time horizons. However, the self-attention mechanism, by focusing on the most relevant features, enables the model to extract long-term dependencies in the data more effectively. This improves the model’s ability to understand and learn from the historical context, which leads to more accurate long-term predictions.

SECTION V.

Conclusion

This study proposed the AS-LSTM model for vehicle trajectory prediction, focusing on vehicle-to-vehicle interaction and spatio-temporal characteristics, aiming to predict vehicle driving trajectories more accurately. In this model, we introduced a Social-Pooling (S-Pooling) layer to capture and process social interaction information between vehicles; by allowing spatially adjacent LSTMs to share information, the model can capture social interactions between vehicles. A self-attention mechanism is adopted to adaptively learn the correlation between different positions in the input sequence, thereby better capturing long-distance dependencies. Our model demonstrates satisfactory performance on the HighD and NGSIM datasets in terms of RMSE, reducing it by up to 44.7%. Compared to existing models, the prediction accuracy is significantly improved. Moving forward, our research agenda involves integrating environmental factors, particularly the specifics of roads and lane demarcations, into our dataset. This approach is designed to expand the breadth of our study and enhance the effectiveness of our predictive model.
