Introduction
Predicting pedestrian trajectories in crowded scenarios is essential for smart cities. It has numerous applications, such as self-driving cars [1], smart road crossings and intelligent retail [2]. Knowledge-based (KB) models describe pedestrian dynamics using physical, social or psychological rules. Pioneering KB models include the Social Force model [3] and collision avoidance [4]. Deep learning (DL) approaches leverage extensive observations. They can be mainly categorized into Recurrent Neural Network (RNN) [5], [6], Convolutional Neural Network (CNN) [7], Transformer (TF) [8], and Generative Adversarial Network (GAN) [9], [10], [11] approaches. Most recent research focuses on social-awareness incorporated deep neural network architectures [10] and graph convolutional networks (GCN) [12] to further improve performance.
While much attention is directed towards modelling trajectories in outdoor scenarios with applications to autonomous vehicles, this paper focuses on modelling pedestrian trajectories within an urban complex. Recently, the Indoor Pedestrian Trajectory Generator (IPTG) [13] was reported, which uses a GAN based approach to generate trajectories for a fictional conference scenario. Han et al. [14] employed trajectory clustering in modeling pedestrian flow for indoor design space. D’Orazio et al. [15] simulated the pedestrian flow of a building using an agent-based model with proximity and exposure time based rules to estimate the spread of Coronavirus Disease (COVID-19) in buildings. However, there is little existing literature on indoor pedestrian trajectory modelling under the influence of weather and time-of-day, a.k.a. the weather-time (WT) condition. Xue et al. [16] studied the modelling of pedestrian movement in a train station and proposed a Pedestrian Trajectory Prediction method by LSTM with Automatic Route Class Clustering (PoPPL). It employed k-means clustering to label pedestrian trajectories, followed by LSTM based intent classification and trajectory prediction. However, the train station dataset only contained 30 minutes of video recorded under the same weather, and the station mainly serves the purpose of transportation.
Weather-time (WT) conditions refer to weather and time-of-day variations. An objective of this paper is to study the effect of weather and time-of-day on pedestrian movement patterns in urban complexes. Typical indoor environments, such as residential apartments, offices and factories, are single-function premises. Individuals usually share a common location-of-interest (LOI), e.g. going home or going to work. In contrast, pedestrian behavior in urban complexes exhibits much more randomness, as pedestrians can have different destinations among functional objects [17] that serve a wide range of purposes, such as retail, shopping malls, office accommodations, and business functions. Previous studies [18], [19] suggested that weather affects pedestrian behavior. In particular, bad weather may discourage consumers from shopping. Also, adverse weather conditions may lead to delays or cancellations of public transportation services [20], which affects pedestrian traffic. Time-of-day affects commuter traffic and hence pedestrian flow [21], [22]. This study aims to improve the understanding of how weather and time-of-day influence the choice of destination and hence the trajectories of pedestrians, which will help to facilitate flow management [23] and intelligent retail [2]. With the increasing popularity of multimodal transportation in large metropolises to decrease reliance on private cars and reduce greenhouse emissions, many urban complexes are designed with multimodal transportation [24] capabilities. They serve as interconnection points to facilitate seamless transfers between buses and trains. Examples are Osaka station (Osaka, Japan) [24] and Chatswood interchange shopping mall (Sydney, Australia).
Three practical issues arise in modelling pedestrian trajectories under different weather-time (WT) conditions in an urban complex: i) appropriate preprocessing and feature selection, ii) effective fusion, and iii) choice of clusters under the effect of different WT conditions.
First, the raw format of weather and time-of-day information may not be directly usable and requires appropriate preprocessing and feature selection. Directly concatenating this information to the deep neural network may even confuse the classifier and lead to inferior performance. For instance, time-of-day information is commonly available as numeric values and the classifier may perceive it as ordinal, i.e. 9 o’clock is larger than 8 o’clock, although this ordering carries no meaning for the choice of destination.
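To make the ordinal pitfall concrete, the following minimal Python sketch maps a raw hour to a categorical time-of-day label and combines it with a weather label into a single WT index that a one-hot encoder or an embedding table could consume. The peak window (12:00-16:59) follows the definition used in the statistical analysis of this paper; the label sets, function names and index layout are illustrative assumptions rather than the authors' implementation.

```python
# Hypothetical label sets (assumption): two weather and two time-of-day conditions.
WEATHER = ("sunny", "cloudy")
TIME_OF_DAY = ("off_peak", "peak")


def time_of_day_label(hour: int) -> str:
    """Map an hour (0-23) to a categorical label instead of an ordinal number."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    return "peak" if 12 <= hour <= 16 else "off_peak"


def wt_index(weather: str, hour: int) -> int:
    """Combine weather and time-of-day into one categorical WT index."""
    w = WEATHER.index(weather)
    d = TIME_OF_DAY.index(time_of_day_label(hour))
    return w * len(TIME_OF_DAY) + d


def one_hot(index: int, size: int) -> list:
    """One-hot vector of length `size` with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]
```

With this encoding, 8 o'clock and 9 o'clock fall into the same category, so no artificial ordering between hours is presented to the classifier.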
Second, it is not trivial to decide where and how to fuse the WT information. For example, direct concatenation of one-hot encoded WT information to the raw pedestrian trajectories does not yield satisfactory performance.
Third, although trajectory prediction guided by pedestrian intent has been reported before, it is mainly used to predict the pedestrian’s intent for road crossing in outdoor scenarios [25] involving pedestrian-vehicle interaction. Unlike the road crossing scenario, where pedestrians need to cross the road under different weather conditions, pedestrian behavior in an urban complex can be affected by weather, especially for retail and entertainment destinations.
To overcome these challenges and improve the pedestrian trajectory prediction accuracy of baseline deep learning models, we propose a new weather-time-trajectory fusion network (WTTFNet) for pedestrian trajectory prediction. The WTTFNet is made up of the following components:
Weather-time (WT) Embedding: To tackle the issue of preprocessing and feature selection of WT information, a word embedding is used to encode the WT information, which has the advantage that it can be further optimized according to the final loss function.
A new statistical test based on the Pearson’s chi-squared ($\chi^{2}$) statistic is used to test the significance of the WT condition and determine whether to incorporate the WT information.
Novel WTTFNet based intended destination (ID) classifier: The ID classifier is used to predict the destination based on input trajectories. Motivated by the rationale that weather-time conditions can influence the decision of reaching a destination, the proposed WTTF architecture employs the Gated Multimodal Unit (GMU) to fuse the WT embedding with preliminary pedestrian intent probabilities obtained from a baseline deep neural network based classifier. The fused representation is used to train the final classifier, which generates predicted destinations refined by the weather and time.
Deep supervision [27] is used to co-train the preliminary and final classifiers together using auxiliary and final loss functions. While the preliminary pedestrian intent probabilities provide supervisory signals to train the baseline classifier, the final loss function optimizes the whole architecture. The Focal Loss [28] is used to cater for possible class imbalance.
A Destination adapted trajectory predictor (DATP) is used to perform subsequent trajectory prediction. Multiple trajectory models targeted to different destinations are trained and the trajectory model that points to the predicted destination will be chosen.
To illustrate the effectiveness of the proposed approach in improving a baseline pedestrian trajectory model, the public dataset obtained from the Asia and Pacific Trade Center (ATC) [29] in Osaka is considered. It is an urban complex serving as a multimodal transportation hub, which connects the intercity ferry pier and the Osaka metro line, and accommodates a trade center and a multi-entertainment complex. Pedestrian trajectories obtained on a sunny day (22nd May, 2013) and a cloudy day (29th September, 2013) were used. There were roughly 1.5 times more pedestrians during peak hours compared to off-peak hours. A significant log p-value of −104.8395 (
Experimental results show that the proposed WTTFNet surpasses the state-of-the-art algorithm with reductions of 9.16% and 7.07% in average displacement error (ADE) and final displacement error (FDE), respectively. It also improves the classification accuracy (ACC) and Cohen’s Kappa (
To study the role of weather and time-of-day in improving prediction performance, an ablation test is performed to compare the proposed WTTFNet with and without incorporation of weather-time information. A significant McNemar’s test [30] p-value of
Further analysis of the 3008 significant pedestrians identified by McNemar’s test shows overall reductions of 5.47% (7.8m to 7.4m) in ADE and 7.58% (14.11m to 13.04m) in FDE, and significant one-sided Mann-Whitney U test [32] p-values were attained for ADE (
Finally, with the increasing popularity of multimodal transportation in large metropolises to decrease reliance on private cars and reduce greenhouse emissions, understanding pedestrians’ behavior in urban complexes is increasingly important. Walking networks with interconnecting urban complexes will be increasingly prevalent to facilitate smooth transfers between different modes of transportation and contribute to the economic development of nearby areas. There are also numerous applications in public space development [33], evacuation planning [34], and advancements in technology-driven retail [2].
The rest of this paper is organized as follows. Section II presents a review of the background and related works, whereas the proposed WTTFNet is presented in Section III. In Section IV, experimental results and comparisons with state-of-the-art algorithms are presented. The proposed statistical test is also used to test the significance of weather-time effects. Finally, conclusions are drawn in Section V.
Background and Related Work
Pedestrian trajectory prediction (PTP) methods can be categorized according to input modality, network architecture, features, and prediction tasks [35], [36]. Traditionally, PTP is achieved using knowledge based methods such as social force [3], collision avoidance [4], and kinetic models [37]. In the last decade, deep learning approaches have gained much popularity for their power in leveraging extensive observations. They can be mainly categorized into:
Recurrent neural network (RNN): Examples are Long Short Term Memory (LSTM) [5], Social LSTM [38], Gated Recurrent Unit (GRU) and Conv-LSTM [39]. LSTMs are renowned for their capability to handle sequence-to-sequence prediction. Social LSTM further extends LSTM to model social interactions. Conv-LSTM replaces the fully connected layers in the conventional LSTM with convolutional layers, which enables the capturing of both spatial and temporal information for intent and trajectory prediction in [39].
Convolutional neural networks (CNN): CNNs are usually used in PTP approaches that use images/videos to predict the trajectories. A CNN is used to extract spatial-temporal features [7] or skeleton keypoints [40] for classifying pedestrian behaviour.
Transformer: VOSTN [8] used a variational one-shot transformer for trajectory prediction together with a cross-attention module to model the inter-relationship between trajectory and ego-motion. AgentFormer [10] integrated a transformer architecture with an agent-aware attention mechanism and a conditional variational autoencoder (CVAE) based trajectory prediction framework.
Generative adversarial network (GAN): POI-GAN [41] used a generative model that integrates an interest point model, field of view angle, and observed trajectories to produce projected pedestrian trajectories for future time frames. Social GAN [9] employs an LSTM model to capture the temporal structure of individual pedestrians and a social pooling mechanism to aggregate pedestrian interactions. The resultant deep features are used to train the GAN.
Over the past 5 years, most research has focused on incorporating social awareness [9], [10] or contextual information [25], [39] to improve prediction performance. Social-awareness approaches, such as Social LSTM [38], Social GAN [9], Sophie [42] and AgentFormer [10], primarily center around predicting trajectories and modeling interactions among a fixed number of pedestrians based on social pooling mechanisms. GCN based approaches, such as the Social Spatial Temporal Graph CNN (SSTGCNN) [43], model pedestrian interactions as graphs and extract spatial-temporal features from the graphs using convolutional operations.
Context-based approaches incorporate context information to predict pedestrian intent and use it to guide subsequent trajectory prediction [23], [33]. Typical pedestrian intents include crossing the road and other walking gestures [44]. These intents are predicted from video or LIDAR sequences. Examples of contextual information are road topology, maps, pedestrian attributes, road boundaries and ego-vehicle information [23], [33].
While much attention is directed towards modelling the trajectory in outdoor scenarios with applications to autonomous vehicles, this paper focuses on modelling the pedestrian trajectory within an urban complex, which is challenging because pedestrians can have many possible destinations, such as shops, escalators, and attractions. Moreover, weather and time-of-day may affect pedestrian behavior. A new weather-time-trajectory fusion network (WTTFNet) is proposed to incorporate weather and time-of-day (WT) information to refine the predicted destination and trajectories. In the next section, the proposed methodology will be discussed.
Proposed Methodology
Fig. 1 shows an illustration of the pedestrian trajectory prediction problem, where the proposed WTTFNet predicts the final destination and future trajectory from a partially observed trajectory, e.g. half of the trajectory in this paper. The proposed WTTFNet is made up of the following components:
Destination-driven clustering: It is used to label the pedestrian trajectories of the training set with destinations assigned by k-means clustering for subsequent training of the intended-destination (ID) classifier.
The proposed statistical test based on the Pearson’s chi-squared ($\chi^{2}$) statistic is designed to determine the minimum sample size required for each cluster and determine whether to incorporate the WT information.
ID classifier: It predicts the final destination, which lies in the future, from an observed “historical” trajectory of the pedestrian. The training set is provided by the destination-driven clustering. The classifier is made up of a baseline deep neural network based classifier and the proposed WTTFNet, which serve as the preliminary and final classifiers, respectively. The baseline classifier generates a set of preliminary pedestrian intent probabilities indicating the chances of reaching different destinations. Afterwards, the WTTFNet fuses the WT information and the preliminary pedestrian intent probabilities for subsequent training of the final classifier, which generates the final intent probabilities.
Destination adapted trajectory predictor (DATP): After the final pedestrian intent probabilities are generated, the destination with the highest probability is chosen. The target trajectory model, trained using the clustered trajectories surrounding the chosen destination, is used to predict the future trajectory. As an illustration, the PoPPL-def sub-LSTM [16] is adopted as the trajectory model. In general, other trajectory prediction models can be used.
A pedestrian trajectory prediction problem. The observed trajectory is used to predict the future trajectory and final destination in this paper.
A. Destination-Driven Clustering Module
In a pedestrian trajectory prediction problem, an observed trajectory $\boldsymbol{s}_{\boldsymbol{n}}$ of the $n$-th pedestrian over $L$ time instants is used to predict the future trajectory $\hat{z}_{\boldsymbol{n}}$ over the next $L'$ time instants:\begin{align*} \boldsymbol {s}_{\boldsymbol {n}}& =\left \{ \left ( x_{n,1},y_{n,1} \right ),\ldots ,\left ( x_{n,L},y_{n,L} \right ) \right \} \tag {1a}\\ \hat {z}_{\boldsymbol {n}}& =\left \{ \left ( \hat {x}_{n,L+1},\hat {y}_{n,L+1} \right ),\ldots ,\left ( \hat {x}_{n,L+L'},\hat {y}_{n,L+L'} \right ) \right \}. \tag {1b}\end{align*}
The proposed approach also employs a statistical test to test the significance of each cluster and establish the minimum number of samples for each cluster (see Eqn. (12)). If a cluster is found to have an insufficient number of samples, it can be merged into one of the other clusters using an agglomerative clustering similarity measure, such as the centroid linkage criterion\begin{equation*} \min \left \| \left ( \mu _{x,k}, \mu _{y,k} \right )-\left ( \mu _{x,k_{i}},\mu _{y,k_{i}} \right ) \right \|_{2}^{2}, \tag {2}\end{equation*}
\begin{align*} \text {Trajectory:} \boldsymbol {s}_{\boldsymbol {n}}& =\left \{{{\left ({{ x_{n,1},y_{n,1} }}\right ),\ldots ,\left ({{ x_{n,L},y_{n,L} }}\right ) }}\right \}, \tag {3a}\\ \text {Destination:} \omega _{n}& =1,\ldots ,K, \tag {3b}\end{align*}
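The labelling and merging steps can be sketched as follows. This is a simplified illustration under stated assumptions: each trajectory is labelled by a single nearest-centroid assignment of its final point (one assignment step of k-means, not the full iterative procedure), and an undersized cluster is merged into the cluster with the closest centroid, in the spirit of the centroid linkage criterion in Eq. (2). The function names and data layout are hypothetical.

```python
import math


def assign_destination(endpoint, centroids):
    """Return the index of the centroid nearest to a trajectory endpoint."""
    return min(range(len(centroids)),
               key=lambda k: math.dist(endpoint, centroids[k]))


def label_trajectories(trajectories, centroids):
    """Label each trajectory by the centroid closest to its final (x, y) point."""
    return [assign_destination(traj[-1], centroids) for traj in trajectories]


def merge_target(small, centroids, active):
    """Pick the active cluster whose centroid is closest (squared L2) to cluster `small`."""
    candidates = [k for k in active if k != small]
    return min(candidates,
               key=lambda k: math.dist(centroids[small], centroids[k]) ** 2)
```

In practice the centroids would be updated iteratively and the merge would be triggered by the minimum-sample rule of the statistical test; both are omitted here for brevity.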
B. Novel Weather-Time-Trajectory Network for Destination Classification
Fig. 2 and Table 1 show the proposed intended destination (ID) classifier, which comprises the weather-time (WT) embedding, the baseline model (e.g. PoPPL) and the novel WTTFNet. First, a baseline model is used to learn the micro-level representation of the trajectory. Afterwards, a fully connected (FC) layer is used to learn a preliminary classifier of the destinations. The output preliminary ID class probabilities are then passed to the GMU for fusion with the WT embedding. The fused multimodal representation is passed to a final FC layer for training the final classifier. Both the preliminary and final classifiers are co-optimized using the focal loss function. Here, PoPPL is employed as the baseline model. In general, other trajectory models can be used.
The proposed WTTFNet. Key innovations lie in the Intended Destination (ID) classifier, which is made up of i) baseline model, ii) focal loss iii) Deep supervision (preliminary and final classifiers optimized using joint loss function), and iv) incorporation of weather and time information via Gated Multimodal Unit (GMU). The structural details are summarized in Table 1.
More precisely, suppose there are \begin{equation*} \text {WT Embedding:} \boldsymbol {e}_{\boldsymbol {WT,n}}= \boldsymbol {\Theta } \left ({{ \boldsymbol {f}_{\boldsymbol {w,n}},\boldsymbol {f}_{\boldsymbol {d,n}} }}\right ), \tag {4}\end{equation*}
\begin{align*} \hat {\boldsymbol {p}}_{\boldsymbol {pre}}\left ({{ \omega _{n} }}\right )& =\boldsymbol {\sigma }_{\mathrm {Soft}}\left ({{ \boldsymbol {f}_{\boldsymbol {n}}^{\boldsymbol {C}} }}\right ), \tag {5a}\\ \boldsymbol {f}_{\boldsymbol {n}}^{\boldsymbol {C}}& = \boldsymbol {\phi }_{\boldsymbol {BN}}\left ({{ \mathrm {FC}\left ({{ \boldsymbol {f}_{\boldsymbol {n}}^{\boldsymbol {base}} }}\right ) }}\right ), \tag {5b}\end{align*}
\begin{align*}\boldsymbol {\sigma }_{Soft}\left ({{ u }}\right )& =\frac {1}{\sum \nolimits _{k=1}^{K} e^{u_{k}} }\left [{{ e^{u_{1}},e^{u_{2}}\mathrm {,\ldots ,}e^{u_{K}} }}\right ]^{T}, \quad \text {and}~ \tag {5c}\\ \phi \left ({{ u_{k} }}\right )& =\frac {u_{k}-E\left ({{ u_{k} }}\right )}{\sqrt {var\left ({{ u_{k} }}\right )+\epsilon } }\times w_{\gamma ,k}+w_{b,k} \tag {5d}\end{align*}
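As a small numerical illustration of Eqs. (5c) and (5d), the sketch below implements the softmax and a batch-statistics normalisation in plain Python. It assumes the expectation and variance in Eq. (5d) are taken over the samples in a batch and uses the biased variance; the default value of `eps` is a placeholder and not a value taken from the paper.

```python
import math


def softmax(u):
    """Eq. (5c): softmax over K logits, shifted by max(u) for numerical stability."""
    m = max(u)
    exps = [math.exp(v - m) for v in u]
    total = sum(exps)
    return [e / total for e in exps]


def batch_norm(batch, gamma, beta, eps=1e-5):
    """Eq. (5d): normalise each of the K features over the batch, then scale and shift."""
    n, k_dim = len(batch), len(batch[0])
    out = [[0.0] * k_dim for _ in range(n)]
    for k in range(k_dim):
        col = [row[k] for row in batch]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n  # biased variance (assumption)
        for i in range(n):
            out[i][k] = (col[i] - mean) / math.sqrt(var + eps) * gamma[k] + beta[k]
    return out
```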
The preliminary pedestrian intent probabilities
More precisely, the GMU can be described using the following set of equations:\begin{align*} \boldsymbol {h}_{\boldsymbol {n}}^{\boldsymbol {v}}& = \mathbf {tanh}\left ( \boldsymbol {W}_{\boldsymbol {v}}\cdot \hat {\boldsymbol {p}}_{\boldsymbol {pre}}\left ( \omega _{n} \right ) \right ), \tag {6a}\\ \boldsymbol {h}_{\boldsymbol {n}}^{\boldsymbol {e}}& = \mathbf {tanh}\left ( \boldsymbol {W}_{\boldsymbol {e}} \cdot \boldsymbol {\Theta }\left ( \boldsymbol {f}_{\boldsymbol {w,n}},\boldsymbol {f}_{\boldsymbol {d,n}} \right ) \right ), \tag {6b}\\ \boldsymbol {z}_{\boldsymbol {n}}& =\boldsymbol {\sigma }_{sgm}\left ( \boldsymbol {W}_{\boldsymbol {z}}\cdot \left [ \hat {\boldsymbol {p}}_{\boldsymbol {pre}}\left ( \omega _{n} \right )^{T},\boldsymbol {\Theta }\left ( \boldsymbol {f}_{\boldsymbol {w,n}},\boldsymbol {f}_{\boldsymbol {d,n}} \right )^{T} \right ]^{T} \right ), \tag {6c}\\ \boldsymbol {f}_{\boldsymbol {n}}^{\boldsymbol {fuse}}& =\boldsymbol {z}_{\boldsymbol {n}}\odot \boldsymbol {h}_{\boldsymbol {n}}^{\boldsymbol {v}}+\left ( \boldsymbol {1}-\boldsymbol {z}_{\boldsymbol {n}} \right )\odot \boldsymbol {h}_{\boldsymbol {n}}^{\boldsymbol {e}}, \tag {6d}\end{align*}
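The gated fusion in Eqs. (6a)-(6d) can be illustrated with the following forward-pass sketch in plain Python. The weight shapes (`W_v`: M x K, `W_e`: M x D, gate weights `W_gate`: M x (K + D)) and all names are assumptions made for illustration; bias terms are omitted, as in the equations, and no training logic is shown.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gmu_fuse(p_pre, e_wt, W_v, W_e, W_gate):
    """Fuse preliminary intent probabilities p_pre with the WT embedding e_wt."""
    h_v = [math.tanh(a) for a in matvec(W_v, p_pre)]           # cf. Eq. (6a)
    h_e = [math.tanh(a) for a in matvec(W_e, e_wt)]            # cf. Eq. (6b)
    gate = [sigmoid(a) for a in matvec(W_gate, p_pre + e_wt)]  # cf. Eq. (6c), concatenated input
    # cf. Eq. (6d): element-wise convex combination of the two hidden vectors
    return [g * hv + (1.0 - g) * he for g, hv, he in zip(gate, h_v, h_e)]
```

When the gate saturates towards one, the fused vector follows the trajectory branch; when it approaches zero, the WT branch dominates. This is how the unit can learn per-dimension how much weight to give each modality.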
\begin{equation*} \hat {\boldsymbol {p}}_{\boldsymbol {F}}\left ({{ \omega _{n} }}\right )=\boldsymbol {\sigma }_{\mathrm {Soft}}\left ({{ \boldsymbol {\phi }_{\boldsymbol {BN}}\left ({{ FC_{M,K}\left ({{ \boldsymbol {f}_{\boldsymbol {n}}^{\boldsymbol {fuse}} }}\right ) }}\right ) }}\right ), \tag {7}\end{equation*}
\begin{align*} L_{T}& =\left ({{ 1-\lambda _{P} }}\right )\mathrm {L}_{\mathrm {focal}}\left ({{ \omega ,\hat {\boldsymbol {p}}_{F}(\omega ) }}\right ) \\ & \quad + \lambda _{P}\mathrm {L}_{\mathrm {focal}}\left ({{ \omega ,\hat {\boldsymbol {p}}_{pre}\left ({{ \omega }}\right ) }}\right ). \tag {8}\end{align*}
\begin{align*} \mathrm {L}_{\mathrm {focal}}\left ({{ \omega ,\hat {\boldsymbol {p}} }}\right )& =-\frac {1}{NK}\sum \limits _{n=1}^{N} \sum \limits _{k=1}^{K} I_{k,n} {\beta _{k}\left ({{ 1-\hat {\boldsymbol {p}}_{k,n} }}\right )}^{\gamma } \\ & \quad \times \log \left ({{ \hat {\boldsymbol {p}}_{k,n} }}\right ), \tag {9}\end{align*}
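A minimal sketch of the focal loss in Eq. (9) and the deep-supervision loss in Eq. (8) is given below. It assumes the indicator $I_{k,n}$ equals one only for the true class of sample $n$, so only the true-class term contributes. The default values of `gamma` and `lambda_p` are placeholders chosen for illustration and are not the settings used in the paper.

```python
import math


def focal_loss(labels, probs, beta, gamma=2.0):
    """Eq. (9): class-weighted focal loss averaged over N samples and K classes.

    labels[n] is the true class index, probs[n][k] the predicted probability,
    beta[k] the class weight and gamma the focusing parameter.
    """
    n, k_dim = len(labels), len(beta)
    total = 0.0
    for y, p in zip(labels, probs):
        p_true = max(p[y], 1e-12)  # guard against log(0)
        total += beta[y] * (1.0 - p_true) ** gamma * math.log(p_true)
    return -total / (n * k_dim)


def joint_loss(labels, p_final, p_pre, beta, lambda_p=0.5, gamma=2.0):
    """Eq. (8): weighted sum of the final and preliminary (auxiliary) focal losses."""
    return ((1.0 - lambda_p) * focal_loss(labels, p_final, beta, gamma)
            + lambda_p * focal_loss(labels, p_pre, beta, gamma))
```

Confidently correct predictions (probability close to one) are down-weighted by the factor $(1-\hat{p})^{\gamma}$, which is why the loss is suited to imbalanced destination classes.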
\begin{equation*} \hat {\omega }_{n}=\mathop {\arg \max }\limits _{k\in \{1,\ldots ,K\}} \hat {\boldsymbol {p}}_{F}\left ( \omega _{n,k} \right ), \tag {10}\end{equation*}
C. Destination-Adapted Trajectory Predictor Module
After obtaining the final probabilities, the predicted trajectory can be obtained as\begin{equation*} \hat {z}_{n}=\mathrm {DT}\mathrm {P}_{k=\hat {\omega }_{n}}(s_{n}),\end{equation*}
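The selection step can be sketched as a simple dispatch: the destination with the highest final probability is chosen and the trajectory model trained for that destination is applied to the observed trajectory. The constant-velocity model below is only a toy stand-in for a trained per-destination model (such as a sub-LSTM); all names are hypothetical.

```python
def predict_trajectory(observed, final_probs, models):
    """Choose the most probable destination and run its trajectory model.

    final_probs: K final intent probabilities; models: K callables, one per destination.
    """
    k_hat = max(range(len(final_probs)), key=final_probs.__getitem__)  # arg max over destinations
    return k_hat, models[k_hat](observed)


def make_toy_model(step):
    """Toy stand-in for a trained model: extrapolate with a fixed step per time instant."""
    def model(observed, horizon=3):
        x, y = observed[-1]
        return [(x + step[0] * t, y + step[1] * t) for t in range(1, horizon + 1)]
    return model
```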
D. Statistical Test for Weather-Time Conditions
The proposed statistical test can be used to establish the minimum required samples for each cluster and to quantify whether it is necessary to treat the pedestrian movement pattern in different periods and weathers as different groups and use different trajectory models to describe their behavior. More precisely, Table 2 shows a \begin{align*} H_{0}:& \text {The WT condition does not affect} \\ & \quad \text {the choice of destination.} \tag {11}\end{align*}
\begin{equation*} e_{kc}=\frac {l_{K}\times \underline {n}_{C}}{\underline {n}} \ge 5, \tag {12}\end{equation*}
Once the clusters are established, the test statistic for WT condition reads\begin{equation*} \chi _{obs}^{2}=\sum \limits _{c=1}^{C} \sum \limits _{k=1}^{K} {\frac {\left ({{ o_{kc}-e_{kc} }}\right )^{2}}{e_{kc}},} \tag {13}\end{equation*}
\begin{equation*} p=Pr\mathrm {(}\chi ^{2}\ge \chi _{obs,j}^{2}\mathrm {\vert }H_{0}\mathrm {),} \tag {14}\end{equation*}
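The quantities in Eqs. (12) and (13) can be computed from a destination-by-condition contingency table as sketched below. The sketch assumes the standard Pearson test of independence, with expected counts formed from row and column totals and $(K-1)(C-1)$ degrees of freedom, and it assumes no row or column of the table is empty. The function name and table layout are illustrative.

```python
def chi_square_statistic(observed):
    """observed[k][c]: pedestrians reaching destination k under WT condition c.

    Returns (chi2, dof, min_expected). A min_expected of at least 5 corresponds
    to the minimum-sample rule in Eq. (12).
    """
    K, C = len(observed), len(observed[0])
    row = [sum(r) for r in observed]                                 # totals per destination
    col = [sum(observed[k][c] for k in range(K)) for c in range(C)]  # totals per WT condition
    n = sum(row)
    chi2, min_expected = 0.0, float("inf")
    for k in range(K):
        for c in range(C):
            e = row[k] * col[c] / n                                  # expected count e_kc
            min_expected = min(min_expected, e)
            chi2 += (observed[k][c] - e) ** 2 / e                    # summand of Eq. (13)
    return chi2, (K - 1) * (C - 1), min_expected
```

The p-value in Eq. (14) is the upper-tail probability of the chi-squared distribution with the returned degrees of freedom; if SciPy is available, `scipy.stats.chi2.sf(chi2, dof)` evaluates it.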
Results and Analysis
For illustrative purposes, the Osaka Asia and Pacific Trade Center (ATC) dataset [29] is considered. The Osaka ATC is a transportation hub linking the Sunflower inter-city ferry pier to the Osaka City Metro. It contains a multi-entertainment complex and a conference center. The trajectories were collected on 1/F of the ATC using 3D range sensors. The full dimension is over
A. Choice of Cluster
A general rule of thumb for choosing the number of classes is to study the number of possible entrances/exits in the floor plan [16], [17]. Fig. 3 shows the floor plan of the Osaka ATC (1/F). Following this notion, key entrances and exits are chosen as the initialization centroids, as in Fig. 3. Table 3 shows the list of initialization centroids.
Initialization centroids for k-means clustering,
B. Statistical Analysis of Time-of-Day and Weather Conditions
In this sub-section, we shall test the significance of time-of-day and weather conditions using the proposed statistical test.
Table 4 shows the number of observed pedestrian arrivals during peak hours (12:00-16:59), off-peak hours, sunny and rainy conditions for
C. Baseline and Metric
To evaluate the performance of the proposed approach, we compare the proposed WTTFNet with the following algorithms:
Linear Model: A simple linear model with a hidden layer (nn.Linear in PyTorch) [46] is used to predict the trajectories.
Vanilla LSTM: The sub-LSTM in PoPPL-def is used. It employs an encoder-decoder LSTM with 2 hidden layers fitted to all the trajectories. The implementation follows the GitHub code of [16].
PoPPL [16]: The sub-LSTM model is employed together with route class clustering. The implementation follows the GitHub code. Following the previous statistical analysis, $K=9$ destinations were chosen. Route class clustering divides all trajectories according to all combinations of the 9 origins and 9 destinations for training trajectory models.
Proposed WTTFNet: For fair comparison, we adopt the same baseline model as in PoPPL, as shown in Fig. 1. However, the proposed destination-driven clustering and the proposed WTTFNet are used. The same hyperparameters as those of the original authors are adopted for the PoPPL baseline model. For the number of destinations, $K=9$ is chosen as in the previous analysis.
To evaluate the quality of trajectory prediction, the average displacement error (ADE) and final displacement error (FDE) are used. The ADE is the average Euclidean distance between the actual and predicted coordinates over all predicted time instants and all trajectories. The FDE is the average Euclidean distance between the final points of the predicted and actual trajectories. They are given as\begin{align*} \mathrm {ADE}& =\frac {1}{N_{T}L'}\sum \nolimits _{n=1}^{N_{T}} \sum \nolimits _{t=1}^{L'} \left \| \left ( \begin{array}{l} x_{n,L+t} \\ y_{n,L+t} \\ \end{array} \right )-\left ( \begin{array}{l} \hat {x}_{n,L+t} \\ \hat {y}_{n,L+t} \\ \end{array} \right ) \right \|_{2}, \tag {15a}\\ \mathrm {FDE}& =\frac {1}{N_{T}}\sum \nolimits _{n=1}^{N_{T}} \left \| \left ( \begin{array}{l} x_{n,L+L'} \\ y_{n,L+L'} \\ \end{array} \right )-\left ( \begin{array}{l} \hat {x}_{n,L+L'} \\ \hat {y}_{n,L+L'} \\ \end{array} \right ) \right \|_{2}, \tag {15b}\end{align*}
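Eqs. (15a) and (15b) translate directly into a few lines of Python. The sketch below assumes every future trajectory has the same number of points $L'$ and that actual and predicted trajectories are supplied in the same order; the function name is illustrative.

```python
import math


def ade_fde(actual, predicted):
    """actual/predicted: N trajectories, each a list of L' future (x, y) points."""
    n = len(actual)
    ade_sum, fde_sum = 0.0, 0.0
    for truth, pred in zip(actual, predicted):
        dists = [math.dist(a, b) for a, b in zip(truth, pred)]
        ade_sum += sum(dists) / len(dists)  # mean error over the L' future steps
        fde_sum += dists[-1]                # error at the final step
    return ade_sum / n, fde_sum / n
```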
\begin{align*} \mathrm {ACC}& =\frac {1}{N_{Test}}\sum \nolimits _{i=1}^{K} CM[i,i], \tag {16a}\\ \kappa & =\frac {N_{T}\sum \nolimits _{i=1}^{K} CM\left [ i,i \right ] - \sum \nolimits _{i=1}^{K} C_{T}\left [ i \right ]C_{P}[i] }{N_{T}^{2}-\sum \nolimits _{i=1}^{K} C_{T}\left [ i \right ]C_{P}[i] }, \tag {16b}\end{align*}
\begin{equation*} rd=\frac {\left ({{ d-d_{REF} }}\right )\left ({{ -1 }}\right )^{m}}{d_{REF}+\epsilon }\times 100\%, \tag {17}\end{equation*}
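The classification metrics in Eqs. (16a) and (16b) can be computed from a confusion matrix as sketched below. The sketch assumes rows index the true class and columns the predicted class, so that the row and column totals play the roles of $C_{T}[i]$ and $C_{P}[i]$; Cohen's Kappa is unchanged if the two roles are swapped.

```python
def accuracy_and_kappa(cm):
    """cm[i][j]: number of test samples of true class i predicted as class j."""
    K = len(cm)
    n = sum(sum(r) for r in cm)
    diag = sum(cm[i][i] for i in range(K))
    true_tot = [sum(cm[i]) for i in range(K)]                       # C_T[i]
    pred_tot = [sum(cm[i][j] for i in range(K)) for j in range(K)]  # C_P[i]
    chance = sum(t * p for t, p in zip(true_tot, pred_tot))
    acc = diag / n                                                  # Eq. (16a)
    kappa = (n * diag - chance) / (n * n - chance)                  # Eq. (16b)
    return acc, kappa
```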
D. Experimental Setup
A Google Colab notebook with a Tesla T4 Graphics Processing Unit (GPU), 16 GB of GPU memory and 17 GB of system memory is used for evaluation. In the experiment, each observed trajectory has a duration of 20 time-instants and an algorithm predicts the trajectory for the next 20 time-instants. Fig. 4 shows the validation protocol, which follows the validation strategy in [16]. Stratified 5-fold cross validation (CV) is employed. Due to stratification and the possibility that the total number of samples is not divisible by 5, the number of samples across folds may vary slightly. Three folds (~60%), one fold (~20%), and one fold (~20%) are used for training, validation, and testing, respectively.
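The 3/1/1 fold protocol can be sketched in plain Python as follows. The sketch stratifies by destination label and, for each rotation, uses one fold for testing, the next fold for validation and the remaining three for training. Which fold serves as the validation fold in each rotation, the seed and the function names are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Assign each sample index to one of n_folds folds, balancing class proportions."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)  # deal samples of each class round-robin
    return folds


def cv_splits(folds):
    """Yield (train, val, test) index lists: three folds, one fold, one fold."""
    n = len(folds)
    for r in range(n):
        held_out = (r, (r + 1) % n)
        train = [i for j in range(n) if j not in held_out for i in folds[j]]
        yield train, folds[held_out[1]], folds[held_out[0]]
```

When a class size is not divisible by five, the earlier folds receive one extra sample of that class, which reproduces the slight variation in fold sizes noted above.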
1) Batch Size and Stopping Criterion
Fig. 4 shows the training and validation curves for the proposed WTTFNet under batch sizes 256, 512, 1024 and 2048. For batch size 256, the validation curve is quite noisy and fluctuates rapidly, and hence it is not considered. For batch sizes 512, 1024 and 2048, the training accuracy starts to level off around epoch 100 while the validation accuracy stays within a roughly constant range. This suggests that more epochs do not necessarily lead to better validation performance. Hence, 1000 epochs are chosen as the stopping criterion. Overall, batch size 1024 attained the lowest variance in validation accuracy and hence it is chosen. For each CV fold, the model obtained at the epoch attaining the best validation accuracy is chosen and used to evaluate the testing data.
Validation protocol and learning curves for various batch sizes. Stratified 5-fold cross validation (CV) is used.
2) Hyperparameters
The same hyperparameters as in PoPPL are adopted for the baseline classifier and trajectory models. A dropout parameter of 0.5 and a hidden size of 128 are adopted. For the proposed WTTFNet, the weighting factor in the focal loss is chosen as
E. Experimental Results
In this sub-section, the proposed WTTFNet is compared against various algorithms. Since the proposed WTTFNet can be attached to arbitrary deep neural network baseline models, PoPPL is adopted as the baseline for illustration. In general, other deep neural network based intent classifiers, such as transformers, can be adopted as the baseline model. Since PoPPL is a technique that combines clustering and LSTM, we also compare with the Vanilla LSTM.
Table 5 shows the overall performance of all algorithms. The proposed WTTFNet performed better than the original PoPPL, the Vanilla LSTM and the linear model in all cases considered. In particular, the proposed WTTFNet surpasses the original PoPPL by 23.67% in classification accuracy, with a 9.16% reduction in ADE and a 7.07% reduction in FDE. Significant p-values of (
1) Ablation Test
The intended destination classifier of the proposed WTTFNet is made up of i) a baseline model, ii) the focal loss, iii) deep supervision (preliminary and final classifiers co-trained with a joint loss function) and iv) incorporation of WT information using the GMU. To study the incremental contribution of each component of the proposed WTTFNet and show the role of weather and time-of-day in improving the prediction, we consider Table 6, which compares the following four settings:
PoPPL (baseline model): The original PoPPL with entropy loss. In general, other baseline models can be used.
PoPPL (baseline model) + FL: PoPPL modified with Focal Loss.
WTTFNet without WT information (second last column of Tables 5 and 6): GMU is bypassed and WT information is not incorporated. Deep supervision is used to co-train the preliminary and final classifiers.
WTTFNet with WT information (final column of Tables 5 and 6): Between the preliminary and final classifiers, the GMU is inserted and the WT information is fused with the preliminary pedestrian intent probabilities at the GMU.
Comparing PoPPL and PoPPL+FL (Setting 1 vs Setting 2), it can be seen that the use of focal loss improves the ACC, as it helps to tackle the class imbalance among the clusters. After adding the proposed WTTFNet (Setting 2 vs Setting 3), even without the WT information, around 4% relative improvement in ACC is observed. This suggests that, even when the GMU is bypassed and WT information is not supplied, the deep supervision employed in the WTTFNet is useful in refining both the preliminary and final classifiers, which are optimized using auxiliary and final loss functions based on the focal loss. This leads to improved classification accuracy (Table 6) and reductions in ADE and FDE (Table 5).
Finally, to study the role of weather and time-of-day in improving the performance, we compare the WTTFNet without and with WT information (Setting 3 vs Setting 4). The best performance (highest classification accuracy in Table 6, lowest ADE and FDE in Table 5) is attained after incorporating WT information into the proposed WTTFNet, which suggests that adding WT information is useful for prediction.
Overall, the ACC increased from 71.5% to 71.95% after adding WT information in the proposed WTTFNet. To validate its statistical significance, we performed a McNemar’s test and a significant p-value (
2) Quantitative Analysis of the Role of Weather and Time-of-Day
In this section, further analysis on the role of weather and time-of-day in improving the prediction performance is studied. Following the significance p-value obtained for the McNemar’s test in previous sub-section, which suggests that there is significant improvement in classification accuracy after adding WT information to the proposed approach. Moreover, 3008 pedestrians were found to have significant improvement after WT information were added. This motivates us to analyze the average displacement error (ADE) and final displacement error (FDE) of the 3008 pedestrians.
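For reference, McNemar's test compares two classifiers on the same samples using only the discordant pairs, i.e. samples that one classifier gets right and the other gets wrong. The sketch below implements the exact (binomial) two-sided variant in plain Python; whether the paper uses the exact or the chi-squared form of the test is not stated in this section, so the choice of variant and the function name are assumptions.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Exact two-sided McNemar test on paired correctness flags of two classifiers.

    Returns (b, c, p), where b counts samples only A classifies correctly and
    c counts samples only B classifies correctly.
    """
    b = sum(1 for a, bb in zip(correct_a, correct_b) if a and not bb)
    c = sum(1 for a, bb in zip(correct_a, correct_b) if bb and not a)
    n = b + c
    if n == 0:
        return b, c, 1.0
    # Under H0 the discordant pairs split evenly: Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2.0 * tail)
```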
Figs. 5 and 6 compare the ADE and FDE, respectively, of the proposed approach under two settings: with and without the incorporation of WT information. “No effect on Destination” means the predicted destination is the same under both settings, whereas “Influenced the Predicted Destination” means the predicted destination was altered after incorporating the WT information.
Average Displacement Error (ADE) of the proposed approach with/without the incorporation of weather-time information. A significant reduction in ADE (7.8m to 7.4m) can be observed for the 3008 significant pedestrians out of all 28536 pedestrians.
Final Displacement Error (FDE) of the proposed approach with/without the incorporation of weather-time information. A significant reduction in FDE (14.11m to 13.04m) can be observed for the 3008 significant pedestrians out of all 28536 pedestrians.
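The grouping used in Figs. 5 and 6 can be sketched as follows, assuming the usual definitions of ADE (mean Euclidean distance over all predicted time steps) and FDE (Euclidean distance at the final time step). The dictionary keys, the function names and the toy trajectories are hypothetical and serve only as an illustration.

```python
import math
from collections import defaultdict

def ade_fde(pred, truth):
    """Return (ADE, FDE) in metres for one trajectory of (x, y) points."""
    dists = [math.dist(p, t) for p, t in zip(pred, truth)]
    return sum(dists) / len(dists), dists[-1]

def summarize(samples):
    """Mean (ADE, FDE) per (group, setting).

    group   : "influenced" if the predicted destination differs between
              the two settings, otherwise "no_effect"
    setting : "with" or "without" WT information
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for s in samples:
        changed = s["dest_with"] != s["dest_without"]
        group = "influenced" if changed else "no_effect"
        for setting in ("with", "without"):
            ade, fde = ade_fde(s["pred_" + setting], s["truth"])
            acc = sums[(group, setting)]
            acc[0] += ade
            acc[1] += fde
            acc[2] += 1
    return {key: (a / n, f / n) for key, (a, f, n) in sums.items()}

# Toy samples (not experimental data).
samples = [
    {"truth": [(0, 0), (1, 0), (2, 0)],
     "pred_without": [(0, 0), (1, 1), (2, 2)],
     "pred_with": [(0, 0), (1, 0), (2, 1)],
     "dest_without": "A", "dest_with": "B"},
    {"truth": [(0, 0), (0, 1), (0, 2)],
     "pred_without": [(0, 0), (0, 1), (0, 3)],
     "pred_with": [(0, 0), (0, 1), (0, 3)],
     "dest_without": "C", "dest_with": "C"},
]
result = summarize(samples)
```

Pedestrians whose predicted destination is unchanged yield identical errors under both settings by construction, which is why any difference between the two settings must come from the “Influenced the Predicted Destination” group.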
Fig. 5 shows the ADE of the proposed approach with and without WT information. The figure shows that a similar ADE was attained when the WT information had no effect on the predicted destination. On the other hand, when the predicted destination changed because of the varying weather (“Influenced the Predicted Destination”), the proposed WTTFNet with WT information attained a lower ADE (7.4083m) compared with the setting without WT information (7.83m). A one-sided Mann-Whitney U test was used to test the significance of the ADE reduction (7.83m to 7.4083m after adding WT information) and a p-value of
Fig. 6 compares the FDE of the proposed approach under the two settings (with/without WT information). Similar to the observation in the previous comparison, the same FDE was attained when the WT information had no effect on the predicted destination (FDE = 9.99m), and an improved FDE (reduction from 14.11m to 13.04m) was attained by the proposed WTTFNet when incorporating WT information changed the predicted destination. A one-sided Mann-Whitney U test was used to test the significance of the FDE reduction (14.11m to 13.04m after adding WT information) and a p-value of
Overall, reductions of 5.47% in ADE (7.8m to 7.4m) and 7.58% in FDE (14.11m to 13.04m) were obtained for the 3008 pedestrians, and the reductions were found to be significant according to one-sided Mann-Whitney U tests. (
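As an illustration of the one-sided Mann-Whitney U test used above, the following sketch tests whether the errors in one sample tend to be smaller than those in another. It relies on the normal approximation with tie and continuity corrections, which is an assumption of this sketch and is appropriate only for large samples such as the 3008 pedestrians considered here. The function name and the toy values are hypothetical.

```python
import math

def mann_whitney_u_less(x, y):
    """One-sided Mann-Whitney U test of H1: x tends to be smaller than y.

    Returns (U statistic of x, approximate p-value).
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x, tie_term, i = 0.0, 0.0, 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        t = j - i                        # size of this group of ties
        avg_rank = (i + 1 + j) / 2.0     # mean of ranks i+1 .. j
        in_x = sum(1 for k in range(i, j) if pooled[k][1] == 0)
        rank_sum_x += avg_rank * in_x
        tie_term += t ** 3 - t
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0  # all values identical, no evidence either way
    z = (u - mean_u + 0.5) / math.sqrt(var_u)  # continuity correction
    p = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return u, p

# Toy displacement errors in metres (not experimental data).
errors_with_wt = [1.0, 2.0, 3.0, 4.0, 5.0]
errors_without_wt = [6.0, 7.0, 8.0, 9.0, 10.0]
u_stat, p_value = mann_whitney_u_less(errors_with_wt, errors_without_wt)
```

A small p-value supports the alternative that the errors with WT information are stochastically smaller. Swapping the two arguments tests the opposite direction and yields a p-value close to one for the same data.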
3) Qualitative Analysis of the Role of Weather and Time-of-Day
To illustrate the usefulness of adding WT information in the proposed WTTFNet, we consider four different cases, where Figs. 7(a) and (b) are extracted from the significant 3008 pedestrians and Figs. 7(c) and (d) are extracted from the remaining pedestrians, whose destination was not affected by weather-time conditions.
Illustration of predicted trajectories, where the weather-time condition has (a,b) significant influence on destination (chosen from the 3008 significant pedestrians), and (c,d) no influence on destination (chosen from the remaining pedestrians). The first half of the trajectory (denoted in black) is used to predict the latter half of the trajectory. Since the two lines of WTTFNet with/without WT overlap in (c) and (d), both settings are merged into one line.
Comparing the proposed WTTFNet with the other algorithms, the proposed WTTFNet (solid blue line with dots) generally aligns best with the actual trajectory (solid black). In particular, the linear model, vanilla LSTM and PoPPL diverge markedly from the actual trajectory in Figs. 7(a) and 7(b).
To study the role of weather and time-of-day, we compare the two settings of the proposed WTTFNet: with and without WT information. In Figs. 7(a) and 7(b), the WTTFNet with WT information (solid blue line with dots) aligns much better with the actual trajectory than its counterpart without WT information (solid red line with diamonds), which diverges in the middle of the path. For the remaining non-significant pedestrians, both settings attain nearly the same performance in Figs. 7(c) and 7(d), and hence only one of them is plotted on the graphs.
Overall, the quantitative (Figs. 5 and 6) and qualitative (Fig. 7) analyses show that weather-time information helps to improve prediction performance significantly for the 3008 cases considered. The proportion of 3008 out of 28536 was also statistically significant according to McNemar’s test, suggesting that the improved performance of these 3008 pedestrians out of 28536 cases is very unlikely to be a random event. This suggests that the proposed approach may serve as an attractive way to incorporate WT information to improve pedestrian trajectory prediction, and that it also provides a systematic way to test the significance of WT conditions.
Conclusion
A new deep WTTFNet has been presented. Experimental results using the Osaka ATC dataset [3] show that the proposed approach attained better performance than the other state-of-the-art methods considered under varying weather-time conditions. A statistical test was also used to establish the significance of time-of-day and weather conditions. The proposed refinement framework can be adopted on other baseline models to improve their performance under varying weather-time conditions.