WTTFNet: A Weather-Time-Trajectory Fusion Network for Pedestrian Trajectory Prediction in Urban Complex


Abstract:

Pedestrian trajectory modelling in an urban complex is challenging because pedestrians can have many possible destinations, such as shops, escalators, and attractions. Moreover, weather and time-of-day may affect pedestrian behavior. In this paper, a new weather-time-trajectory fusion network (WTTFNet) is proposed to incorporate weather and time-of-day (WT) information to refine the predicted destination and trajectories. First, a word embedding is used to encode the WT information, and its representation can be further optimized according to the loss function. Afterwards, a gated multimodal unit is used to fuse the WT information with the preliminary pedestrian intent probabilities obtained from a preliminary baseline classifier. A joint loss function based on focal loss is used to co-optimize both the preliminary and final classifiers, which helps to improve the accuracy under possible class imbalances. Finally, a destination adapted trajectory model is used to predict the trajectories guided by the predicted destination. Experimental results using the Osaka Asia and Pacific Trade Center (ATC) dataset show improved performance of the proposed approach over state-of-the-art algorithms, with a 23.67% increase in classification accuracy and 9.16% and 7.07% reductions in average and final displacement error, respectively. The proposed approach may serve as an attractive approach for improving existing baseline trajectory prediction models when they are applied to scenarios influenced by weather-time conditions. It can be employed in numerous applications such as pedestrian facility engineering, public space development and technology-driven retail.
Published in: IEEE Access ( Volume: 12)
Page(s): 126611 - 126623
Date of Publication: 28 August 2024
Electronic ISSN: 2169-3536

CCBY - IEEE is not the copyright holder of this material. Please follow the instructions via https://creativecommons.org/licenses/by/4.0/ to obtain full-text articles and stipulations in the API documentation.
SECTION I.

Introduction

Predicting pedestrian trajectories in crowded scenarios is essential for smart cities. It has numerous applications, such as self-driving cars [1], smart road crossings, and intelligent retail [2]. Knowledge-based (KB) models describe pedestrian dynamics using physical, social, or psychological rules. Pioneering KB models include the Social Force model [3] and collision avoidance [4]. Deep learning (DL) approaches leverage extensive observations. They can be mainly categorized into Recurrent Neural Networks (RNN) [5], [6], Convolutional Neural Networks (CNN) [7], Transformers (TF) [8], and Generative Adversarial Networks (GAN) [9], [10], [11]. Most recent research focuses on deep neural network architectures that incorporate social awareness [10] and on graph convolutional networks (GCN) [12] to further improve performance.

While much attention is directed towards modelling trajectories in outdoor scenarios with applications to autonomous vehicles, this paper focuses on modelling pedestrian trajectories within an urban complex. Recently, the Indoor Pedestrian Trajectory Generator (IPTG) [13] was reported, which uses a GAN-based approach to generate trajectories for a fictional conference scenario. Han et al. [14] employed trajectory clustering to model pedestrian flow for indoor space design. D’Orazio et al. [15] simulated the pedestrian flow of a building using an agent-based model with rules based on proximity and exposure time to estimate the spread of Coronavirus Disease (COVID-19) in the building. However, there is little existing literature on indoor pedestrian trajectory modelling under the influence of weather and time-of-day, a.k.a. the weather-time (WT) condition. Xue et al. [16] studied the modelling of pedestrian movement in a train station and proposed a Pedestrian Trajectory Prediction method by LSTM with Automatic Route Class Clustering (PoPPL). It employed k-means clustering to label pedestrian trajectories, followed by LSTM-based intent classification and trajectory prediction. However, the train station dataset only contained 30 minutes of video recorded under the same weather, and the station mainly serves the purpose of transportation.

Weather-time (WT) conditions refer to weather and time-of-day variations. An objective of this paper is to study the effect of weather and time-of-day on pedestrian movement patterns in urban complexes. Typical indoor environments, such as residential apartments, offices, and factories, are single-function premises. Individuals usually share a common location-of-interest (LOI), i.e. going home or going to work. In contrast, pedestrian behavior in urban complexes exhibits much more randomness, as pedestrians can have different destinations among functional objects [17] that serve a wide range of purposes, such as retail, shopping malls, office accommodations, and business functions. Previous studies [18], [19] suggested that weather affects pedestrian behavior. In particular, bad weather may discourage consumers from shopping. Also, adverse weather conditions may lead to delays or cancellations of public transportation services [20], which affects pedestrian traffic. Time-of-day affects commuter traffic and hence pedestrian flow [21], [22]. This study aims to improve the understanding of how weather and time-of-day influence the choice of destination and hence the trajectories of pedestrians, which will help to facilitate flow management [23] and intelligent retail [2]. With the increasing popularity of multimodal transportation in large metropolises to decrease reliance on private cars and greenhouse gas emissions, many urban complexes are designed with multimodal transportation [24] capabilities. They serve as interconnection points to facilitate seamless transfers between buses and trains. Examples are Osaka station (Osaka, Japan) [24] and Chatswood interchange shopping mall (Sydney, Australia).

Three practical issues may arise in modelling pedestrian trajectories under different weather-time (WT) conditions in an urban complex: i) appropriate preprocessing and feature selection, ii) effective fusion, and iii) the choice of clusters under the effect of different WT conditions.

First, the format of the weather information may not be directly usable and may require appropriate preprocessing and feature selection. Directly concatenating this information to the deep neural network may even confuse the classifier and lead to inferior performance. For instance, time-of-day information is commonly available as numeric values, and the classifier may perceive it as ordinal, i.e. 9 o’clock is larger than 8 o’clock, which is not meaningful for destination prediction.
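The ordinality issue can be illustrated with a minimal sketch. The category names, the `one_hot` helper, and the `encode_time` helper below are hypothetical and are not taken from the paper; the sketch only contrasts a raw numeric hour, which carries an ordering and a magnitude, with a one-hot code, which treats distinct categories as equally different.

```python
def one_hot(index, num_classes):
    """Return a one-hot list with a 1.0 at position `index`."""
    if not 0 <= index < num_classes:
        raise ValueError("index out of range")
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


# Hypothetical time-of-day categories, for illustration only.
TIME_CLASSES = ["off_peak", "peak"]


def encode_time(label):
    """Map a categorical time-of-day label to its one-hot code."""
    return one_hot(TIME_CLASSES.index(label), len(TIME_CLASSES))


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Raw hours impose an ordering: 17:00 looks "further" from 8:00 than 9:00 does.
raw_gap_8_9 = (9 - 8) ** 2
raw_gap_8_17 = (17 - 8) ** 2

# One-hot codes make every pair of distinct categories equally far apart.
code_gap = squared_distance(encode_time("off_peak"), encode_time("peak"))
```

A learnable embedding layer, as used later in the paper, would take such categorical codes as input rather than the raw numeric hour.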

Second, it is not trivial to decide where and how to fuse the WT information. For example, directly concatenating one-hot encoded WT information to the raw pedestrian trajectories does not yield satisfactory performance.

Third, although trajectory prediction guided by pedestrian intent has been reported before, it is mainly used to predict the pedestrian’s intent for road crossing in outdoor scenarios [25] involving pedestrian-vehicle interaction. Unlike the road crossing scenario, where pedestrians will need to cross the road under different weather conditions, pedestrian behavior in an urban complex can be affected by weather, especially for retail and entertainment destinations.

To overcome these challenges in improving the pedestrian trajectory prediction accuracy of baseline deep learning models, we propose a new weather-time-trajectory fusion network (WTTFNet) for pedestrian trajectory prediction. The WTTFNet is made up of the following components:

  1. Weather-time (WT) Embedding: To tackle the issue of preprocessing and feature selection of WT information, a word embedding is used to encode the WT information, which has the advantage that it can be further optimized according to the final loss function.

  2. A new statistical test based on Pearson’s chi-squared ($\chi^{2}$) statistic is used to test the significance of the WT condition and determine whether to incorporate the WT information.

  3. Novel WTTFNet based intended destination (ID) classifier: The ID classifier is used to predict the destination based on input trajectories. Motivated by the rationale that weather-time conditions can influence the decision of reaching a destination, the proposed WTTF architecture employs the Gated Multimodal Unit (GMU) to fuse the WT embedding with preliminary pedestrian intent probabilities obtained from a baseline deep neural network based classifier. The fused representation is used to train the final classifier, which generates predicted destinations refined by the weather and time.

  4. Deep supervision [27] is used to co-train the preliminary and final classifiers together using auxiliary and final loss functions. While the preliminary pedestrian intent probabilities provide supervisory signals to train the baseline classifier, the final loss function optimizes the whole architecture. The Focal Loss [28] is used to cater for possible class imbalance. A Destination adapted trajectory predictor (DATP) is used to perform subsequent trajectory prediction. Multiple trajectory models targeted to different destinations are trained and the trajectory model that points to the predicted destination will be chosen.

To illustrate the effectiveness of the proposed approach in improving a baseline pedestrian trajectory model, the public dataset obtained from the Asia and Pacific Trade Center (ATC) [29] in Osaka is considered. It is an urban complex serving as a multimodal transportation hub, which connects the intercity ferry pier and the Osaka metro line, as well as accommodating a trade center and a multi-entertainment complex. Pedestrian trajectories obtained on a sunny day (22nd May, 2013) and a cloudy day (29th September, 2013) were used. There were roughly 1.5 times more pedestrians during peak hours compared to off-peak hours. A significant log p-value of −104.8395 ($\ll \log(0.05)$) is attained using the proposed statistical test, which suggests that there is a significant deviation in pedestrian flow across weather conditions and peak/off-peak hours.

Experimental results show that the proposed WTTFNet surpasses the state-of-the-art algorithm with reductions of 9.16% and 7.07% in average displacement error (ADE) and final displacement error (FDE), respectively. It also improves the classification accuracy (ACC) and Cohen’s Kappa ($\kappa$) of the baseline model (i.e. PoPPL) by 23.67% and 28.13%, respectively.
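For reference, ADE and FDE can be sketched as follows, using their commonly used definitions: ADE is the mean Euclidean distance between predicted and ground-truth positions over all predicted time steps, and FDE is the distance at the final step. The function name `displacement_errors` and the toy coordinates are illustrative assumptions; the exact evaluation protocol of the paper (e.g. how errors are averaged across pedestrians) is not stated in this section.

```python
import math


def displacement_errors(predicted, ground_truth):
    """Return (ADE, FDE) for one trajectory given lists of (x, y) points."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("trajectories must be non-empty and of equal length")
    # Per-step Euclidean distance between prediction and ground truth.
    dists = [math.hypot(px - gx, py - gy)
             for (px, py), (gx, gy) in zip(predicted, ground_truth)]
    ade = sum(dists) / len(dists)  # average over all predicted steps
    fde = dists[-1]                # error at the final predicted step
    return ade, fde


# Toy example (made-up coordinates, in metres).
pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
true = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade, fde = displacement_errors(pred, true)
```

In this toy case the per-step errors are 0, 1 and 2 m, so the ADE is 1 m and the FDE is 2 m.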

To study the role of weather and time-of-day in improving prediction performance, an ablation test is performed to compare the proposed WTTFNet with and without the incorporation of weather-time information. A significant McNemar’s test [30] p-value of $p = 0.0196 < 0.05$ was attained, which suggests that the improvement in classification accuracy from 71.5% to 71.95% after adding weather-time information is statistically significant, owing to the large sample size of 28536 pedestrians.

Further analysis of the 3008 significant pedestrians identified by McNemar’s test shows overall reductions of 5.47% (7.8 m to 7.4 m) in ADE and 7.58% (14.11 m to 13.04 m) in FDE for these pedestrians, and significant one-sided Mann-Whitney U test [32] p-values were attained for ADE ($p = 0.0203 < 0.05$) and FDE ($p = 0.00533 < 0.05$), respectively. This shows that weather-time information helps to improve prediction performance significantly for the 3008 cases considered. Overall, the ratio of 3008 out of 28536 pedestrians was also statistically significant according to McNemar’s test, suggesting that the significantly improved performance for these 3008 pedestrians is very unlikely to be a random event. This suggests that the proposed approach may serve as an attractive approach for incorporating WT information to improve pedestrian trajectory prediction, and that it also serves as a systematic approach to test the significance of WT conditions.

Finally, with the increasing popularity of multimodal transportation in large metropolises to decrease reliance on private cars and reduce greenhouse gas emissions, understanding pedestrians’ behavior in urban complexes is increasingly important. Walking networks with interconnecting urban complexes will be increasingly prevalent to facilitate smooth transfers between different modes of transportation and contribute to the economic development of nearby areas. There are also numerous applications in public space development [33], evacuation planning [34], and advancements in technology-driven retail [2].

The rest of this paper is organized as follows. Section II presents a review of the background and related work, whereas the proposed WTTFNet is presented in Section III. In Section IV, experimental results and comparisons with state-of-the-art algorithms are presented. The proposed statistical test is also used to test the significance of weather-time effects. Finally, conclusions are drawn in Section V.

SECTION II.

Background and Related Work

Pedestrian trajectory prediction (PTP) methods can be categorized according to input modality, network architecture, features, and prediction tasks [35], [36]. Traditionally, PTP is achieved using knowledge-based methods such as social force [3], collision avoidance [4], and kinetic models [37]. In the last decade, deep learning approaches have gained much popularity for their power in leveraging extensive observations. They can be mainly categorized into:

  1. Recurrent neural network (RNN): Examples are Long Short Term Memory (LSTM) [5], Social LSTM [38], Gated Recurrent Unit (GRU) and Conv-LSTM [39]. LSTMs are renowned for their capability to handle sequence-to-sequence prediction. Social LSTM further extends LSTM to model social interactions. Conv-LSTM replaces the fully connected layers in the conventional LSTM with convolutional layers, which enables the capture of both spatial and temporal information for intent and trajectory prediction in [39].

  2. Convolutional neural network (CNN): CNNs are usually used in PTP approaches that use images/videos to predict the trajectories. The CNN is used to extract spatial-temporal features [7] or skeleton keypoints [40] for classifying pedestrian behaviour.

  3. Transformer: VOSTN [8] used a variational one-shot transformer for trajectory prediction together with a cross-attention module to model the inter-relationship between trajectory and ego-motion. AgentFormer [10] integrated a transformer architecture with an agent-aware attention mechanism and a conditional variational autoencoder (CVAE) based trajectory prediction framework.

  4. Generative adversarial network (GAN): POI-GAN [41] used a generative model that integrates an interest point model, field of view angle, and observed trajectories to produce projected pedestrian trajectories for future time frames. Social GAN [9] employs an LSTM model to capture the temporal structure of each individual pedestrian and a social pooling mechanism to aggregate pedestrian interactions. The resultant deep features are used to train the GAN.

Over the past 5 years, most research has focused on the incorporation of social awareness [9], [10] or contextual information [25], [39] to improve prediction performance. Social-awareness approaches such as Social LSTM [38], Social GAN [9], Sophie [42], and AgentFormer [10] primarily center around predicting trajectories and modeling interactions among a fixed number of pedestrians based on social pooling mechanisms. GCN-based approaches, such as the Social Spatial Temporal Graph CNN (SSTGCNN) [43], model pedestrian interactions as graphs and extract spatial-temporal features from the graphs using convolutional operations.

Context-based approaches incorporate context information to predict pedestrian intent and use it to guide subsequent trajectory prediction [23], [33]. Typical pedestrian intents include crossing the road and other walking gestures [44]. These intents are predicted from video or LIDAR sequences. Examples of contextual information are road topology, maps, pedestrian attributes, road boundaries and ego-vehicle information [23], [33].

While much attention is directed towards modelling the trajectory in outdoor scenarios with applications to autonomous vehicles, this paper focuses on modelling the pedestrian trajectory within an urban complex, which is challenging because pedestrians can have many possible destinations, such as shops, escalators, and attractions. Moreover, weather and time-of-day may affect pedestrian behavior. A new weather-time-trajectory fusion network (WTTFNet) is proposed to incorporate weather and time-of-day (WT) information to refine the predicted destination and trajectories. In the next section, the proposed methodology will be discussed.

SECTION III.

Proposed Methodology

Fig. 1 shows an illustration of the pedestrian trajectory prediction problem, where the proposed WTTFNet predicts the final destination and future trajectory from a partially observed trajectory, e.g. half of the trajectory in this paper. The proposed WTTFNet is made up of the following components:

  1. Destination-driven clustering: It is used to label the pedestrian trajectories of the training set with destinations assigned by k-means clustering for subsequent training of the intended-destination (ID) classifier.

  2. The proposed statistical test based on Pearson’s chi-squared ($\chi^{2}$) statistic is designed to determine the minimum sample size required for each cluster and determine whether to incorporate the WT information.

  3. ID classifier: It predicts the final destination, which occurs in the future, from an observed “historical” trajectory of the pedestrian. The training set is provided by the destination-driven clustering. It is made up of a baseline deep neural network based classifier and the proposed WTTFNet, which serve as the preliminary and final classifiers, respectively. The baseline classifier generates a set of preliminary pedestrian intent probabilities indicating the chances of reaching different destinations. Afterwards, the WTTFNet fuses the WT information and the preliminary pedestrian intent probabilities for subsequent training of the final classifier, which generates the final intent probabilities.

  4. Destination adapted trajectory predictor (DATP): After the final pedestrian intent probabilities are generated, the destination with the highest probability is chosen. The target trajectory model, trained using the clustered trajectories surrounding the chosen destination, is used to predict the future trajectory. As an illustration, the PoPPL-def sub-LSTM [16] is adopted as the trajectory model. In general, other trajectory prediction models can be used.
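The selection step of the DATP can be sketched as an argmax over the final intent probabilities followed by a lookup of the destination-specific model. This is only a minimal illustration: the function names and the toy "models" below are hypothetical stand-ins (simple interpolators toward a fixed goal), not the sub-LSTM predictors used in the paper.

```python
def select_destination(probabilities):
    """Return the index of the most probable destination (argmax)."""
    return max(range(len(probabilities)), key=lambda k: probabilities[k])


def predict_with_datp(observed, probabilities, models):
    """Pick the trajectory model of the predicted destination and run it."""
    k = select_destination(probabilities)
    return k, models[k](observed)


def make_toy_model(goal):
    """Hypothetical stand-in for a trained destination-specific predictor."""
    def model(observed):
        x, y = observed[-1]
        gx, gy = goal
        # Two future points: the midpoint to the goal, then the goal itself.
        return [((x + gx) / 2.0, (y + gy) / 2.0), (gx, gy)]
    return model


# One model per destination cluster (K = 2 in this toy setting).
models = {0: make_toy_model((10.0, 0.0)), 1: make_toy_model((0.0, 10.0))}
dest, future = predict_with_datp([(0.0, 0.0), (1.0, 1.0)], [0.3, 0.7], models)
```

Keeping one model per destination means each predictor only has to learn the motion pattern toward its own cluster, at the cost of relying on the classifier to route each pedestrian correctly.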

FIGURE 1. - A pedestrian trajectory prediction problem. The observed trajectory is used to predict the future trajectory and final destination in this paper.

A. Destination-Driven Clustering Module

In a pedestrian trajectory prediction problem, an observed trajectory $\boldsymbol{s}_{n}$ of length $L$ for the $n$-th pedestrian is used to predict the trajectory $\hat{z}_{n}$ of the future $L'$ observations:
\begin{align*}
\boldsymbol{s}_{n} &= \left\{ \left(x_{n,1}, y_{n,1}\right), \ldots, \left(x_{n,L}, y_{n,L}\right) \right\}, \tag{1a}\\
\hat{z}_{n} &= \left\{ \left(\hat{x}_{n,L+1}, \hat{y}_{n,L+1}\right), \ldots, \left(\hat{x}_{n,L+L'}, \hat{y}_{n,L+L'}\right) \right\}. \tag{1b}
\end{align*}

However, a pedestrian has multiple possible destinations, and hence destination-driven clustering is beneficial for training destination-specific trajectory models. In the destination-driven clustering module, the end-points of all trajectories, i.e. $\Omega_{end} = \{(x_{n,L+L'}, y_{n,L+L'}),\ n = 1, 2, \ldots, N\}$ from (1b), are passed to the k-means algorithm to form clusters. The membership of an endpoint $(x, y)$ is sought by minimizing its distance from the centroids, $\sum_{k=1}^{K} \sum_{(x,y) \in \Omega_{k}} \left\| (x, y) - (\mu_{x,k}, \mu_{y,k}) \right\|_{2}^{2}$, where $\Omega_{k}$ is the $k$-th cluster and its centroid is updated as $\boldsymbol{\mu}_{k} = [\mu_{x,k}, \mu_{y,k}]^{T} = \frac{1}{|\Omega_{k}|} \sum_{(x,y) \in \Omega_{k}} (x, y)$. Here, $|\Omega_{k}|$ is the number of elements in $\Omega_{k}$. After assignment, each trajectory is labelled with the corresponding class from $\omega = 1, \ldots, K$ for subsequent training of the pedestrian intent classifier. The raw trajectories are cleaned and resampled so that the total duration of each trajectory is normalized to $T_{o}$.

The proposed approach also employs a statistical test to test the significance of each cluster and establish the minimum number of samples for each cluster (see Eqn. (12)). If a cluster is found to have an insufficient number of samples, it can be merged into one of the other clusters using an agglomerative clustering similarity measure, such as the centroid linkage criterion
\begin{equation*}
\min \left\| \left(\mu_{x,k}, \mu_{y,k}\right) - \left(\mu_{x,k_{i}}, \mu_{y,k_{i}}\right) \right\|_{2}^{2}, \tag{2}
\end{equation*}

where $(\mu_{x,k}, \mu_{y,k})$ is the centroid of the cluster to be merged and $(\mu_{x,k_{i}}, \mu_{y,k_{i}})$ are the centroids of the remaining clusters. It is noted that other similarity measures can be employed. After the clusters have been computed, the training dataset for the ID classifier can be obtained as
\begin{align*}
\text{Trajectory: } \boldsymbol{s}_{n} &= \left\{ \left(x_{n,1}, y_{n,1}\right), \ldots, \left(x_{n,L}, y_{n,L}\right) \right\}, \tag{3a}\\
\text{Destination: } \omega_{n} &= 1, \ldots, K, \tag{3b}
\end{align*}
where $K$ is the total number of destinations.
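The clustering and merging steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the initial centroids are supplied by the caller (the initialisation used in the paper is not specified here), the endpoint coordinates are made up, and the minimum-sample-size test that decides which cluster to merge is not reproduced. The helper names `kmeans_endpoints` and `nearest_cluster` are hypothetical.

```python
def kmeans_endpoints(points, centroids, iterations=20):
    """Lloyd's k-means on 2-D trajectory endpoints, from given start centroids."""
    centroids = [tuple(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid in squared Euclidean distance.
        labels = [
            min(range(len(centroids)),
                key=lambda k: (x - centroids[k][0]) ** 2 + (y - centroids[k][1]) ** 2)
            for x, y in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append((sum(p[0] for p in members) / len(members),
                                      sum(p[1] for p in members) / len(members)))
            else:
                new_centroids.append(centroids[k])  # keep centroid of empty cluster
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids


def nearest_cluster(k, centroids):
    """Centroid-linkage merge target for cluster k, in the spirit of Eq. (2)."""
    others = [i for i in range(len(centroids)) if i != k]
    return min(others,
               key=lambda i: (centroids[k][0] - centroids[i][0]) ** 2
                             + (centroids[k][1] - centroids[i][1]) ** 2)


# Made-up endpoints: two dense groups and one isolated endpoint.
endpoints = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (9.0, 0.0)]
labels, centroids = kmeans_endpoints(
    endpoints, centroids=[(0.0, 0.0), (10.0, 10.0), (9.0, 0.0)])

# Suppose cluster 2 (a single endpoint) is judged too small: find where to merge it.
merge_target = nearest_cluster(2, centroids)
```

The resulting labels play the role of the destination classes $\omega_{n}$ in (3b), with merged clusters sharing one label.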

B. Novel Weather-Time-Trajectory Network for Destination Classification

Fig. 2 and Table 1 show the proposed intended destination (ID) classifier, which comprises the weather-time (WT) embedding, the baseline model (e.g. PoPPL) and the novel WTTFNet. First, a baseline model is used to learn the micro-level representation of the trajectory. Afterwards, a fully connected (FC) layer is used to learn a preliminary classifier of the destinations. The output preliminary ID class probabilities are then passed to the GMU for fusion with the WT embedding. The fused multimodal representation is passed to a final FC layer for training the final classifier. Both the preliminary and final classifiers are co-optimized using the focal loss function. Here, PoPPL is employed as the baseline model. In general, other trajectory models can be used.

TABLE 1 Structural Details of Proposed WTTFNet
FIGURE 2. - The proposed WTTFNet. Key innovations lie in the Intended Destination (ID) classifier, which is made up of i) baseline model, ii) focal loss iii) Deep supervision (preliminary and final classifiers optimized using joint loss function), and iv) incorporation of weather and time information via Gated Multimodal Unit (GMU). The structural details are summarized in Table 1.

More precisely, suppose there are $C_{w}$ weather conditions and $C_{d}$ different times-of-day, so that the total number of weather-time conditions is $C = C_{w} + C_{d}$. For example, in this paper, $C_{w} = 2$ (sunny/rainy) and $C_{d} = 2$ (off-peak/peak hours) are chosen. The proposed Weather-Time (WT) Embedding for the $n$-th pedestrian is given as
\begin{equation*}
\text{WT Embedding: } \boldsymbol{e}_{WT,n} = \boldsymbol{\Theta}\left(\boldsymbol{f}_{w,n}, \boldsymbol{f}_{d,n}\right), \tag{4}
\end{equation*}

where $\boldsymbol{\Theta}(\cdot)$ is the embedding layer, and $\boldsymbol{f}_{w,n}$ and $\boldsymbol{f}_{d,n}$ are the one-hot encodings describing the weather-time condition for the $n$-th pedestrian. The preliminary ID class probabilities $\hat{\boldsymbol{p}}_{pre}(\omega_{n})$ can be obtained as the softmax probabilities from the preliminary classifier in Fig. 2. Batch normalization and softmax are performed after the FC layer. The preliminary intent probabilities $\hat{\boldsymbol{p}}_{pre}(\omega_{n})$ and the preliminary classifier output $\boldsymbol{f}_{n}^{C}$ are given as
\begin{align*}
\hat{\boldsymbol{p}}_{pre}\left(\omega_{n}\right) &= \boldsymbol{\sigma}_{\mathrm{Soft}}\left(\boldsymbol{f}_{n}^{C}\right), \tag{5a}\\
\boldsymbol{f}_{n}^{C} &= \boldsymbol{\phi}_{BN}\left(\mathrm{FC}\left(\boldsymbol{f}_{n}^{base}\right)\right), \tag{5b}
\end{align*}
respectively, where
\begin{align*}
\boldsymbol{\sigma}_{\mathrm{Soft}}\left(\boldsymbol{u}\right) &= \frac{1}{\sum_{k=1}^{K} e^{u_{k}}} \left[e^{u_{1}}, e^{u_{2}}, \ldots, e^{u_{K}}\right]^{T}, \quad \text{and} \tag{5c}\\
\phi\left(u_{k}\right) &= \frac{u_{k} - E\left(u_{k}\right)}{\sqrt{var\left(u_{k}\right) + \epsilon}} \times w_{\gamma,k} + w_{b,k} \tag{5d}
\end{align*}
represent the softmax operation and batch normalization (BN), respectively. $\boldsymbol{f}_{n}^{C}$ and $\boldsymbol{f}_{n}^{base}$ are the outputs of the preliminary classifier and the base model, respectively, for the $n$-th pedestrian. $\boldsymbol{\phi}_{BN}(\boldsymbol{u}) = \left[\phi(u_{1}), \phi(u_{2}), \ldots, \phi(u_{K})\right]^{T}$ is the batch normalization function. $\mathrm{FC}(\boldsymbol{u}) = \boldsymbol{W} \cdot \boldsymbol{u}$ is a fully connected layer with weights $\boldsymbol{W}$, and $\boldsymbol{\sigma}_{\mathrm{Soft}}(\boldsymbol{u})$ is the softmax function. $w_{\gamma,k}$ and $w_{b,k}$ are learnable parameters for BN.
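A minimal sketch of (4) and (5a)-(5d) in plain Python is given below. Several points are assumptions made purely for illustration: the embedding is implemented as a lookup table with one row per condition ($C = C_{w} + C_{d}$ rows) whose weather row and time-of-day row are concatenated, batch normalization uses the statistics of the current batch (training-mode behaviour), and all weights and features are made-up toy values. The paper does not specify these details in this section.

```python
import math


def softmax(u):
    """Eq. (5c): softmax over a list of logits (shifted by max for stability)."""
    m = max(u)
    exps = [math.exp(v - m) for v in u]
    total = sum(exps)
    return [e / total for e in exps]


def batch_norm(batch, gamma, beta, eps=1e-5):
    """Eq. (5d): normalise each feature over the batch, then scale and shift."""
    n, k = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(k)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(k)]
    return [[(row[j] - means[j]) / math.sqrt(variances[j] + eps) * gamma[j] + beta[j]
             for j in range(k)] for row in batch]


def linear(W, u):
    """FC(u) = W . u, with W given as a list of rows (no bias, as in the text)."""
    return [sum(w * x for w, x in zip(row, u)) for row in W]


def wt_embedding(table, weather_idx, time_idx, num_weather):
    """Eq. (4) as a table lookup.

    ASSUMPTION: one learnable row per condition; the weather row and the
    time-of-day row are concatenated to form the WT embedding.
    """
    return table[weather_idx] + table[num_weather + time_idx]


# Toy batch: baseline features of 2 pedestrians, K = 2 destinations.
W = [[1.0, 0.0], [0.0, 1.0]]
base = [[2.0, 0.0], [0.0, 2.0]]
logits = batch_norm([linear(W, f) for f in base], gamma=[1.0, 1.0], beta=[0.0, 0.0])
p_pre = [softmax(row) for row in logits]          # Eqs. (5a)-(5b)

# Toy embedding table with C_w + C_d = 4 rows of length 2.
table = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
e_wt = wt_embedding(table, weather_idx=1, time_idx=0, num_weather=2)
```

In a trained network the rows of `table` and the weights `W` would be learned jointly through the loss function rather than fixed by hand.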

The preliminary pedestrian intent probabilities $\hat{\boldsymbol{p}}_{pre}(\omega_{n})$ and the WT embedding $\boldsymbol{\Theta}(\boldsymbol{f}_{w,n}, \boldsymbol{f}_{d,n})$ are then fused at the GMU. The GMU is used to find an intermediate representation that fuses the two modalities, i.e. the preliminary ID probabilities and the WT embedding. First, the pedestrian intent probabilities and the WT embedding are passed to individual tanh layers, each of which contains a neuron with hyperbolic tangent activation to encode the individual modality. At the same time, a tied gate neuron learns the contribution of the two modalities, as shown in Fig. 2. The contributions $\boldsymbol{z}_{n}$ and $(\boldsymbol{1} - \boldsymbol{z}_{n})$ obtained from the gate neuron are multiplied in an elementwise manner with the outputs of the tanh layers of $\hat{\boldsymbol{p}}_{pre}(\omega_{n})$ and $\boldsymbol{\Theta}(\boldsymbol{f}_{w,n}, \boldsymbol{f}_{d,n})$, respectively. A special feature of this gate unit is that $\boldsymbol{z}_{n}$ supports multivariate weighting. To use this feature, the output dimensions of the two tanh layers can be set to a common dimension. Finally, the fused multimodal representation is passed to the final classifier for predicting the final class probability.

More precisely, the GMU can be described using the following set of equations:
\begin{align*}
\boldsymbol{h}_{n}^{v} &= \mathbf{tanh}\left(\boldsymbol{W}_{v} \cdot \hat{\boldsymbol{p}}_{pre}\left(\omega_{n}\right)\right), \tag{6a}\\
\boldsymbol{h}_{n}^{e} &= \mathbf{tanh}\left(\boldsymbol{W}_{e} \cdot \boldsymbol{\Theta}\left(\boldsymbol{f}_{w,n}, \boldsymbol{f}_{d,n}\right)\right), \tag{6b}\\
\boldsymbol{z}_{n} &= \boldsymbol{\sigma}_{sgm}\left(\boldsymbol{W}_{z} \cdot \left[\hat{\boldsymbol{p}}_{pre}\left(\omega_{n}\right)^{T}, \boldsymbol{\Theta}\left(\boldsymbol{f}_{w,n}, \boldsymbol{f}_{d,n}\right)^{T}\right]^{T}\right), \tag{6c}\\
\boldsymbol{f}_{n}^{fuse} &= \boldsymbol{z}_{n} \odot \boldsymbol{h}_{n}^{v} + \left(\boldsymbol{1} - \boldsymbol{z}_{n}\right) \odot \boldsymbol{h}_{n}^{e}, \tag{6d}
\end{align*}
where $\boldsymbol{h}_{n}^{v}$ is the output of the tanh layer for $\hat{\boldsymbol{p}}_{pre}(\omega_{n})$ for the $n$-th pedestrian, $\boldsymbol{h}_{n}^{e}$ is the output of the tanh layer for the WT embedding, $\boldsymbol{z}_{n}$ is the output of the gate neuron, and $\boldsymbol{f}_{n}^{fuse}$ denotes the fused representation. $\mathbf{tanh}(\boldsymbol{u}) = \left[\tanh(u_{1}), \tanh(u_{2}), \ldots\right]^{T}$ with $\tanh(u) = \frac{e^{u} - e^{-u}}{e^{u} + e^{-u}}$, and $\boldsymbol{\sigma}_{sgm}(\boldsymbol{u}) = \left[\sigma_{sgm}(u_{1}), \sigma_{sgm}(u_{2}), \ldots\right]^{T}$ with $\sigma_{sgm}(u) = \frac{1}{1 + e^{-u}}$. The Hadamard product operator is denoted as $\odot$. The set of unknown neural network weights to be learnt in the GMU is $\{\boldsymbol{W}_{v}, \boldsymbol{W}_{e}, \boldsymbol{W}_{z}\}$, which correspond to the weights of the tanh layer for the preliminary pedestrian intent probabilities, the tanh layer for the WT embedding, and the gate neuron, respectively. A common dimension $M$ is chosen for the two tanh layers in (6a) and (6b) so that they match the dimension of $\boldsymbol{z}_{n}$. Finally, an FC layer $\mathrm{FC}_{M,K}(\cdot)$ with input dimension $M$ and output dimension $K$ is used to learn the final ID class probabilities.
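The forward pass of the gated fusion in (6a)-(6d) can be sketched as follows. This is an illustrative sketch only: the weights are made-up toy values rather than trained parameters, the gate weights are named `W_z` in the code, and no bias terms are used, in line with the equations as written.

```python
import math


def matvec(W, u):
    """Matrix-vector product with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, u)) for row in W]


def sigmoid(u):
    """Logistic sigmoid, 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + math.exp(-u))


def gmu_fuse(p_pre, e_wt, W_v, W_e, W_z):
    """Gated fusion of intent probabilities and WT embedding, Eqs. (6a)-(6d)."""
    h_v = [math.tanh(a) for a in matvec(W_v, p_pre)]       # (6a)
    h_e = [math.tanh(a) for a in matvec(W_e, e_wt)]        # (6b)
    # (6c): the gate sees the concatenation of both modalities.
    z = [sigmoid(a) for a in matvec(W_z, p_pre + e_wt)]
    # (6d): elementwise convex combination of the two encoded modalities.
    return [zi * hv + (1.0 - zi) * he for zi, hv, he in zip(z, h_v, h_e)]


# Toy sizes: K = 2 destinations, embedding length 2, common dimension M = 3.
p_pre = [0.8, 0.2]
e_wt = [0.5, -0.5]
W_v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # M x K
W_e = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]   # M x embedding length
W_z = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]  # zero weights give z = 0.5
f_fuse = gmu_fuse(p_pre, e_wt, W_v, W_e, W_z)
```

Because each gate value lies in (0, 1), every element of the fused vector lies between the corresponding elements of the two tanh outputs; with the zero gate weights used here, it is exactly their average.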
\hat {\boldsymbol {p}}_{\boldsymbol {F}}\left ({{ \omega _{n} }}\right) is the predicted pedestrian intent probabilities obtained from the final classifier and it is given as\begin{equation*} \hat {\boldsymbol {p}}_{\boldsymbol {F}}\left ({{ \omega _{n} }}\right )=\boldsymbol {\sigma }_{\mathrm {Soft}}\left ({{ \boldsymbol {\phi }_{\boldsymbol {BN}}\left ({{ FC_{M,K}\left ({{ \boldsymbol {f}_{\boldsymbol {n}}^{\boldsymbol {fuse}} }}\right ) }}\right ) }}\right ), \tag {7}\end{equation*}
where \boldsymbol {\phi }_{\boldsymbol {BN}}\mathbf {} and \mathrm {\boldsymbol {\sigma }}_{\mathrm {Soft}} are the batch normalization and softmax operations defined in (5c) and (5d), respectively. The preliminary and final classifiers will be jointly optimized as\begin{align*} L_{T}& =\left ({{ 1-\lambda _{P} }}\right )\mathrm {L}_{\mathrm {focal}}\left ({{ \omega ,\hat {\boldsymbol {p}}_{F}(\omega ) }}\right ) \\ & \quad + \lambda _{P}\mathrm {L}_{\mathrm {focal}}\left ({{ \omega ,\hat {\boldsymbol {p}}_{pre}\left ({{ \omega }}\right ) }}\right ). \tag {8}\end{align*}
where, for simplicity, the subscript n is dropped in (8). \mathrm {L}_{\mathrm {focal}}\left ( \omega ,\hat {\boldsymbol {p}}_{F}(\omega ) \right ) and \mathrm {L}_{\mathrm {focal}}\left ( \omega ,\hat {\boldsymbol {p}}_{pre}\left ( \omega \right ) \right ) are the losses for the final and preliminary classifiers, respectively. \lambda _{P} is a parameter controlling the ratio of the two losses; it is chosen as \lambda _{P}=0.5 in this paper. To cater for possible class imbalance, the focal loss [28] is used\begin{align*} \mathrm {L}_{\mathrm {focal}}\left ( \omega ,\hat {\boldsymbol {p}} \right )& =-\frac {1}{NK}\sum \limits _{n=1}^{N} \sum \limits _{k=1}^{K} I_{k,n}\, \beta _{k}\left ( 1-\hat {\boldsymbol {p}}_{k,n} \right )^{\gamma } \\ & \quad \times \log \left ( \hat {\boldsymbol {p}}_{k,n} \right ), \tag {9}\end{align*}
where \hat {\boldsymbol {p}}_{k,n} is the predicted probability of the k-th class for the n-th pedestrian. I_{k,n} is an indicator variable with I_{k,n}=1 when the actual class is k and I_{k,n}=0 otherwise. \gamma is a focusing factor and \beta _{k}\in [{0,1}] is a weighting factor. The final predicted class (i.e. intended destination) is the class with the largest final probability:\begin{equation*} \hat {\omega }_{n}=\underset {k\in \{1,\ldots ,K\}}{\arg \max }\ \hat {\boldsymbol {p}}_{F}\left ( \omega _{n,k} \right ), \tag {10}\end{equation*}
where \hat {\boldsymbol {p}}_{F}\left ({{ \omega _{n,k} }}\right) is the softmax probability of the k-th destination. \hat {\boldsymbol {p}}_{F}\left ({{ \omega }}\right)={[\hat {\boldsymbol {p}}_{F}\left ({{ \omega _{n,1} }}\right),\hat {\boldsymbol {p}}_{F}\left ({{ \omega _{n,2} }}\right),\ldots ,\hat {\boldsymbol {p}}_{F}(\omega _{n,K})] }^{T} .
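The gated fusion in (6a)-(6d) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the argument layout and the weight names (`W_v`, `W_e`, `W_gate`) are assumptions made for illustration, and the weights would be learnt rather than set by hand.

```python
import math


def matvec(W, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def sigmoid(u):
    """Logistic sigmoid used by the gate neuron."""
    return 1.0 / (1.0 + math.exp(-u))


def gmu_fuse(p_pre, wt_emb, W_v, W_e, W_gate):
    """Fuse preliminary intent probabilities with a WT embedding.

    p_pre  : list of K preliminary class probabilities
    wt_emb : list of D embedding values (D is a hypothetical size)
    W_v    : M x K weights, W_e : M x D weights, W_gate : M x (K + D) weights
    """
    h_v = [math.tanh(u) for u in matvec(W_v, p_pre)]            # (6a)
    h_e = [math.tanh(u) for u in matvec(W_e, wt_emb)]           # (6b)
    gate = [sigmoid(u) for u in matvec(W_gate, p_pre + wt_emb)]  # (6c)
    # (6d): element-wise convex combination of the two hidden vectors
    return [g * a + (1.0 - g) * b for g, a, b in zip(gate, h_v, h_e)]


# Toy example with K = D = M = 2 and hand-picked weights.
identity = [[1.0, 0.0], [0.0, 1.0]]
zero_gate = [[0.0] * 4, [0.0] * 4]   # gate output is 0.5 everywhere
fused = gmu_fuse([0.7, 0.3], [1.0, -1.0], identity, identity, zero_gate)
```

With an all-zero gate weight matrix the gate outputs 0.5, so the fused vector is the plain average of the two tanh outputs; as the gate saturates towards 1, the fused vector approaches the branch driven by the preliminary probabilities.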

C. Destination-Adapted Trajectory Predictor Module

After obtaining the final probabilities, the predicted trajectory can be obtained as\begin{equation*} \hat {z}_{n}=\mathrm {DT}\mathrm {P}_{k=\hat {\omega }_{n}}(s_{n}),\end{equation*}

where \hat {z}_{n} is the predicted trajectory for the chosen class. \mathrm {DT}\mathrm {P}_{k=\hat {\omega }_{n}}(s_{n}) is the destination-adapted trajectory baseline model selected according to the predicted class \hat {\omega }_{n} . The baseline model is chosen as the sub-LSTM (PoPPL-def) in PoPPL [16] for the sake of comparison. The sub-LSTM (PoPPL-def) is an encoder-decoder LSTM with two hidden layers. In the next sub-section, the proposed statistical test will be presented.

D. Statistical Test for Weather-Time Conditions

The proposed statistical test can be used to establish the minimum number of samples required for each cluster and to quantify whether it is necessary to treat the pedestrian movement patterns in different periods and weather conditions as different groups and to use different trajectory models to describe their behavior. More precisely, Table 2 shows a K\times C contingency table summarizing the number of pedestrians arriving at K destinations under C different weather-time (WT) conditions. The following null hypothesis is proposed:\begin{align*} H_{0}:& \text {The WT condition does not affect} \\ & \quad \text {the choice of destination.} \tag {11}\end{align*}

If the null hypothesis is true, then the observed number of pedestrians should not deviate significantly from the expected counts across different WT conditions. According to the \chi ^{2} test, the minimum number of expected samples/trajectories required for each cluster k under condition c is\begin{equation*} e_{kc}=\frac {l_{k}\times \underline {n}_{c}}{\underline {n}} \ge 5, \tag {12}\end{equation*}
where l_{k}=\sum \nolimits _{c=1}^{C} l_{kc},\underline {n}_{c}=\sum \nolimits _{k=1}^{K} l_{kc} and \underline {n}=\mathrm {\Sigma }_{k=1}^{K} \mathrm {\Sigma }_{c=1}^{C}~l_{kc} . l_{kc} is the number of observed trajectories/samples in the k-th destination and c-th condition.

TABLE 2 Proposed Statistical Test of Significance of Weather-Time Conditions

Once the clusters are established, the test statistic for WT condition reads\begin{equation*} \chi _{obs}^{2}=\sum \limits _{c=1}^{C} \sum \limits _{k=1}^{K} {\frac {\left ({{ o_{kc}-e_{kc} }}\right )^{2}}{e_{kc}},} \tag {13}\end{equation*}

where o_{kc} is the actual observed number of pedestrians in condition c and destination k. The p-value of the test is given as\begin{equation*} p=\Pr \left ( \chi ^{2}\ge \chi _{obs}^{2}\mid H_{0} \right ), \tag {14}\end{equation*}
where the test statistic follows a \chi ^{2} distribution with (C-1)(K-1) degrees of freedom. At a significance level of 0.05 [31], the null hypothesis is rejected when the p-value is smaller than 0.05, which suggests that the differences in the proportions of pedestrian arrivals across destinations under different WT conditions are statistically significant.
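As a minimal sketch of (12) and (13), the following pure-Python function computes the expected counts, the minimum-sample check, the test statistic and the degrees of freedom for a K\times C table. The function name and the toy 2\times 2 counts are illustrative assumptions and are not taken from Table 2 or Table 4.

```python
def chi_square_wt_test(table):
    """table[k][c] = observed arrivals at destination k under WT condition c."""
    K, C = len(table), len(table[0])
    row_tot = [sum(row) for row in table]                               # l_k
    col_tot = [sum(table[k][c] for k in range(K)) for c in range(C)]    # n_c
    n = sum(row_tot)                                                    # total count
    # Expected counts under the null hypothesis, as in (12)
    expected = [[row_tot[k] * col_tot[c] / n for c in range(C)] for k in range(K)]
    enough_samples = all(e >= 5 for row in expected for e in row)
    # Test statistic, as in (13)
    stat = sum(
        (table[k][c] - expected[k][c]) ** 2 / expected[k][c]
        for k in range(K)
        for c in range(C)
    )
    dof = (K - 1) * (C - 1)
    return stat, dof, expected, enough_samples


# Toy counts (hypothetical): 2 destinations x 2 WT conditions.
stat, dof, expected, ok = chi_square_wt_test([[10, 20], [30, 40]])
```

The p-value in (14) is the upper tail of the \chi ^{2} distribution at `stat` with `dof` degrees of freedom; if SciPy is available, `scipy.stats.chi2.sf(stat, dof)` is one way to evaluate it.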

SECTION IV.

Results and Analysis

For illustrative purposes, the Osaka Asia and Pacific Trade Center (ATC) dataset (Dražen et al. 2013) is considered. The Osaka ATC is a transportation hub linking the Sunflower inter-city Ferry pier to the Osaka City Metro. It contains a multi-entertainment complex and a conference center. The trajectories were collected at 1/F of ATC using 3D range sensors. The full dimension is over 140~m \times 60~m . Trajectories from 0900 to 2000 on 22^{\mathrm {nd}} May, 2013 (sunny) and 29^{\mathrm {th}} September, 2013 (cloudy) are chosen. Trajectories that are too short (i.e. with the same cluster for origin and destination) are removed, as they may be a result of occlusion or tracking loss of the 3D range sensors. After resampling, the trajectory length is L+L'=40 . The numbers of pedestrian trajectories after pre-processing are 7329 on 22^{\mathrm {nd}} May, 2013 (sunny) and 21207 on 29^{\mathrm {th}} September, 2013 (cloudy). Hence, the total number of pedestrians/trajectories is 28536, as each pedestrian contributes only one trajectory.

A. Choice of Cluster

A general rule of thumb to choose the number of classes is to study the number of possible entrances/exits of the floor plan [16], [17]. Fig. 3 shows the floor plan of the Osaka ATC Center (1/F). Following this notion, key entrances and exits are chosen as the initialization centroids as in Fig. 3. Table 3 shows the list of initialization centroids.

TABLE 3 List of Initialization Centroids for Osaka ATC Centre 1/F
FIGURE 3. Initialization centroids for k-means clustering, with K=10 classes added to the 1/F floor plan of Osaka ATC Centre [29]. Important functional objects (i.e. ticket office, escalator, kiosk and stairs) are redrawn. The historical weather on 22nd May 2013 (sunny) and 29th September 2013 (cloudy) was obtained from [45].

B. Statistical Analysis of Time-of-Day and Weather Conditions

In this sub-section, we shall test the significance of time-of-day and weather conditions using the proposed statistical test.

Table 4 shows the number of observed pedestrian arrivals during peak hours (12:00-16:59) and off-peak hours under sunny and cloudy conditions for K=10 . Using (12), it was found that class \omega _{6} does not meet the minimum sample requirement. Hence, using the centroid linkage criterion, \omega _{6} is merged with \omega _{10} , leaving K=9 classes. The observed \chi _{obs}^{2} computed using (13) is 588.64 with 24 degrees of freedom, i.e. (C-1)(K-1) with C=4 and K=9 , and the log(p-value) is -104.8395, which is statistically significant at the typical significance level of 0.05 [31]. This suggests that there is a significant deviation in the pedestrian counts across the different clusters under the different conditions. Hence, the proposed approach should be used to model the pedestrian trajectory patterns under the different conditions. Next, we shall evaluate the performance of the various algorithms.

TABLE 4 Number of Pedestrian Arrivals During Peak Hour (12:00-16:59), Off-Peak (09:00-11:59, 17:00-20:00), Sunny and Cloudy for Osaka ATC Dataset (K=10)

C. Baseline and Metric

To evaluate the performance of the proposed approach, we compare the proposed WTTFNet with the following algorithms:

  1. Linear Model: A simple linear model with a hidden layer (nn.Linear in PyTorch) [46] is used to predict the trajectories.

  2. Vanilla LSTM: The sub-LSTM in PoPPL-def is used. It employs an encoder-decoder LSTM with 2 hidden layers fitted to all the trajectories. The implementation follows the GitHub code of [16].

  3. PoPPL [16]: The sub-LSTM model is employed together with route class clustering. The implementation follows the GitHub code. Following the previous statistical analysis, K=9 destinations were chosen. Route class clustering divides the trajectories according to all combinations of the 9 origins and 9 destinations, and a trajectory model is trained for each route class.

  4. Proposed WTTFNet: For fair comparison, we adopt the same baseline model as in PoPPL, as shown in Fig. 1. However, the proposed destination-driven clustering and the proposed WTTFNet are used. The same hyperparameters as those used by the PoPPL authors are adopted for the PoPPL baseline model. For the number of destinations, K=9 is chosen as in the previous analysis.

The quality of trajectory prediction is evaluated using the average displacement error (ADE) and the final displacement error (FDE). The ADE is the average Euclidean distance between the actual and predicted coordinates over all predicted time instants and all trajectories. The FDE is the average Euclidean distance between the final positions of the predicted and actual trajectories. They are given as\begin{align*} \mathrm {ADE}& =\frac {1}{N_{T}L^{\prime }}\sum \nolimits _{n=1}^{N_{T}} \sum \nolimits _{t=1}^{L^{\prime }} \left \| \left ( \begin{array}{l} x_{n,L+t} \\ y_{n,L+t} \\ \end{array} \right )-\left ( \begin{array}{l} \hat {x}_{n,L+t} \\ \hat {y}_{n,L+t} \\ \end{array} \right ) \right \|_{2}, \tag {15a}\\ \mathrm {FDE}& =\frac {1}{N_{T}}\sum \nolimits _{n=1}^{N_{T}} \left \| \left ( \begin{array}{l} x_{n,L+L'} \\ y_{n,L+L'} \\ \end{array} \right )-\left ( \begin{array}{l} \hat {x}_{n,L+L'} \\ \hat {y}_{n,L+L'} \\ \end{array} \right ) \right \|_{2}, \tag {15b}\end{align*}

where \Vert \cdot \Vert _{2} denotes the Euclidean distance. N_{T} is the total number of testing samples. (x_{n,t},y_{n,t}) is the actual coordinate and (\hat {x}_{n,t},\hat {y}_{n,t}) is the predicted coordinate of the n-th pedestrian’s trajectory at time instant t. The accuracy of the destination classification is evaluated using classification accuracy (ACC) and Cohen’s Kappa (\kappa ). They are given as\begin{align*} \mathrm {ACC}& =\frac {1}{N_{T}}\sum \nolimits _{i=1}^{K} CM[i,i], \tag {16a}\\ \kappa & =\frac {N_{T}\sum \nolimits _{i=1}^{K} CM\left [ i,i \right ]-\sum \nolimits _{i=1}^{K} C_{T}\left [ i \right ]C_{P}[i] }{N_{T}^{2}-\sum \nolimits _{i=1}^{K} C_{T}\left [ i \right ]C_{P}[i] }, \tag {16b}\end{align*}
where CM[i,j]=\sum \nolimits _{n=1}^{N_{T}} I(\omega _{n}=i~\&~\hat {\omega }_{n}=j) is the number of testing samples with actual class i and predicted class j. I is the indicator function. C_{T}\left [ i \right ]=\sum \nolimits _{j=1}^{K} CM[i,j] and C_{P}\left [ j \right ]=\sum \nolimits _{i=1}^{K} CM[i,j] . While classification accuracy is commonly used to describe the generic performance, Cohen’s Kappa is used more frequently for scenarios with possible class imbalance. The following relative metric, rd, is used to compare different algorithms,\begin{equation*} rd=\frac {\left ( d-d_{REF} \right )\left ( -1 \right )^{m}}{d_{REF}+\epsilon }\times 100\%, \tag {17}\end{equation*}
where d can be any metric, such as the ADE, FDE, ACC or \kappa . d_{REF} is the performance of the reference method. \epsilon ={10}^{-8} is a small constant added to the denominator to avoid division by zero. m is a parameter defining the metric type: m=0 is used for metrics to be maximized, where a larger value indicates better performance, whereas m=1 is used for loss metrics, where a smaller value indicates better performance.
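The metrics in (15a)-(17) can be sketched in a few lines of pure Python. The function names and the toy inputs below are illustrative assumptions; trajectories are assumed to be lists of (x, y) tuples covering only the predicted time instants, and the confusion matrix is assumed to be indexed as cm[actual][predicted].

```python
import math


def ade_fde(actual, predicted):
    """ADE and FDE over trajectories given as lists of (x, y) points."""
    n, steps = len(actual), len(actual[0])
    dists = [
        [math.dist(a, p) for a, p in zip(traj_a, traj_p)]
        for traj_a, traj_p in zip(actual, predicted)
    ]
    ade = sum(sum(d) for d in dists) / (n * steps)   # (15a)
    fde = sum(d[-1] for d in dists) / n              # (15b)
    return ade, fde


def acc_kappa(cm):
    """Accuracy and Cohen's Kappa from a K x K confusion matrix."""
    K = len(cm)
    n = sum(sum(row) for row in cm)
    diag = sum(cm[i][i] for i in range(K))
    c_true = [sum(cm[i]) for i in range(K)]                          # row sums
    c_pred = [sum(cm[i][j] for i in range(K)) for j in range(K)]     # column sums
    chance = sum(c_true[i] * c_pred[i] for i in range(K))
    acc = diag / n                                                   # (16a)
    kappa = (n * diag - chance) / (n * n - chance)                   # (16b)
    return acc, kappa


def relative_diff(d, d_ref, m, eps=1e-8):
    """Relative metric (17); m=0 for scores, m=1 for losses."""
    return (d - d_ref) * (-1) ** m / (d_ref + eps) * 100.0


ade, fde = ade_fde([[(0.0, 0.0), (1.0, 0.0)]], [[(0.0, 3.0), (1.0, 4.0)]])
acc, kappa = acc_kappa([[40, 10], [20, 30]])
```

For a loss metric such as the FDE, `relative_diff(13.04, 14.11, m=1)` evaluates to roughly 7.58, i.e. a positive value indicates an improvement over the reference.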

D. Experimental Setup

A Google Colab notebook with a Tesla T4 Graphics Processing Unit (GPU), 16 GB of GPU memory and 17 GB of system memory is used for evaluation. In the experiment, each observed trajectory has a duration of 20 time-instants and an algorithm predicts the trajectory for the next 20 time-instants. Fig. 4 shows the validation protocol, which follows the validation strategy in [16]. Stratified 5-fold cross validation (CV) is employed. Due to stratification and the possibility that the total number of samples is not divisible by 5, the number of samples may vary slightly across folds. Three folds (~60%), one fold (~20%), and one fold (~20%) are used for training, validation, and testing, respectively.
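The protocol above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the round-robin dealing of shuffled samples into folds is one simple way to stratify, and the rotation rule (fold i for testing, fold i+1 for validation, the other three for training) is assumed because the exact assignment is not spelled out in the text.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Split sample indices into folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(n_folds)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        # Deal the samples of each class to the folds in turn.
        for pos, idx in enumerate(idxs):
            folds[pos % n_folds].append(idx)
    return folds


def train_val_test_splits(folds):
    """Yield (train, val, test) index lists: 3 folds / 1 fold / 1 fold."""
    n = len(folds)
    for i in range(n):
        v = (i + 1) % n   # assumed rotation rule for the validation fold
        train = [idx for j in range(n) if j not in (i, v) for idx in folds[j]]
        yield train, folds[v], folds[i]


# Toy labels (hypothetical class counts).
labels = [0] * 50 + [1] * 25 + [2] * 10
folds = stratified_folds(labels)
splits = list(train_val_test_splits(folds))
```

Because samples are dealt per class, fold sizes can differ by a few samples when class counts are not multiples of five, which mirrors the slight variation across folds mentioned above.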

1) Batch Size and Stopping Criterion

Fig. 4 shows the training and validation curves for the proposed WTTFNet under batch sizes 256, 512, 1024 and 2048. For batch size 256, the validation curve is quite noisy and fluctuates rapidly, and hence it is not considered. For batch sizes 512, 1024 and 2048, the training accuracy starts to level off around epoch 100 while the validation accuracy stays within a roughly constant range. This suggests that more epochs do not necessarily lead to better validation performance. Hence, 1000 epochs are chosen as the stopping criterion. Overall, batch size 1024 attained the lowest variance in validation accuracy and hence it is chosen. For each CV fold, the model obtained at the epoch attaining the best validation accuracy is chosen and used to evaluate the testing data.

FIGURE 4. Validation protocol and learning curves for various batch sizes. Stratified 5-fold cross validation (CV) is used.

2) Hyperparameters

The same hyperparameters as in PoPPL are adopted for the baseline classifier and trajectory models. A dropout parameter of 0.5 and a hidden size of 128 are adopted. For the proposed WTTFNet, the weighting factor in the focal loss is chosen as \boldsymbol {\beta }=\left [ \beta _{1},\beta _{2},\ldots ,\beta _{K} \right ]^{T} , \beta _{k}=\left ( \frac {N/N_{k}}{\sum \nolimits _{j=1}^{K} N/N_{j} } \right ) , where N is the total number of training samples, N_{k} is the number of training samples of class k, and K is the total number of classes. The focusing parameter is chosen as \gamma =2 . The ratio between the preliminary and final losses in (8) is chosen as \lambda _{P}=0.5 .
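To make the role of these hyperparameters concrete, the sketch below computes the inverse-frequency class weights and evaluates the focal loss in (9) and the joint loss in (8) in pure Python. It is an illustration rather than the authors' training code: the function names and toy numbers are assumptions, and the small clamp on the probability is added here only to keep the logarithm finite.

```python
import math


def class_weights(counts):
    """beta_k = (N / N_k) / sum_j (N / N_j) for per-class sample counts."""
    n = sum(counts)
    inv = [n / c for c in counts]
    total = sum(inv)
    return [v / total for v in inv]


def focal_loss(labels, probs, beta, gamma=2.0):
    """Focal loss (9); labels[n] is the true class index of sample n."""
    N, K = len(labels), len(beta)
    total = 0.0
    for y, p in zip(labels, probs):
        p_true = max(p[y], 1e-12)   # clamp (assumption) to avoid log(0)
        # Only the true class contributes because of the indicator I_{k,n}.
        total += beta[y] * (1.0 - p_true) ** gamma * math.log(p_true)
    return -total / (N * K)


def joint_loss(labels, p_final, p_pre, beta, gamma=2.0, lam=0.5):
    """Joint loss (8): weighted sum of final and preliminary focal losses."""
    return (1.0 - lam) * focal_loss(labels, p_final, beta, gamma) + lam * focal_loss(
        labels, p_pre, beta, gamma
    )


beta = class_weights([10, 30])   # toy counts: the rarer class gets the larger weight
loss = focal_loss([0], [[0.5, 0.5]], beta)
```

With \gamma =2 , a confidently correct sample (true-class probability near 1) contributes almost nothing, so training focuses on hard or minority-class samples.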

E. Experimental Results

In this sub-section, the proposed WTTFNet is compared against various algorithms. Since the proposed WTTFNet can be attached to arbitrary deep neural network baseline models, PoPPL is adopted as the baseline for illustration. In general, other deep neural network based intent classifiers, such as transformers, can be adopted as the baseline model. Since PoPPL is a technique that combines clustering and LSTM, we also compare with the Vanilla LSTM.

Table 5 shows the overall performance of all algorithms. The proposed WTTFNet performed better than the original PoPPL, Vanilla LSTM and the linear model for all cases considered. In particular, the proposed WTTFNet surpasses the original PoPPL by 23.67% in classification accuracy, with a 9.16% reduction in ADE and a 7.07% reduction in FDE. Significant p-values (p\lt {10}^{-16} ) are attained for the improvements in classification accuracy (McNemar’s test [30]) and in ADE and FDE (one-sided Mann-Whitney U tests [32]).

TABLE 5 Trajectory Prediction Performance of Various Algorithms

1) Ablation Test

The intended destination classifier of the proposed WTTFNet is made up of i) the baseline model, ii) the focal loss, iii) deep supervision (preliminary and final classifiers co-trained with a joint loss function) and iv) the incorporation of WT information using the GMU. To study the incremental contribution of each component of the proposed WTTFNet and to show the role of weather and time-of-day in improving the prediction, we consider Table 6, which compares the following four settings:

  1. PoPPL (baseline model): The original PoPPL with entropy loss. In general, other baseline models can be used.

  2. PoPPL (baseline model) + FL: PoPPL modified with Focal Loss.

  3. WTTFNet without WT information (second last column of Tables 5 and 6): GMU is bypassed and WT information is not incorporated. Deep supervision is used to co-train the preliminary and final classifiers.

  4. WTTFNet with WT information (final column of Tables 5 and 6): Between the preliminary and final classifiers, the GMU is inserted and the WT information is fused with the preliminary pedestrian intent probabilities at the GMU.

TABLE 6 Trajectory Prediction Performance of Various Algorithms

Comparing PoPPL and PoPPL+FL (Setting 1 vs Setting 2), it can be seen that the use of focal loss improves the ACC, as it helps to tackle the class imbalance existing among the clusters. After adding the proposed WTTFNet (Setting 2 vs Setting 3), even without the WT information, around 4% relative improvement in ACC is observed. This suggests that even when the GMU is bypassed and WT information is not supplied, the deep supervision employed in the WTTFNet is useful in refining both the preliminary and final classifiers, which are optimized using auxiliary and final loss functions based on focal loss. This leads to improved classification accuracy (Table 6) and reductions in ADE and FDE (Table 5).

Finally, to study the role of weather and time-of-day in improving the performance, we compare WTTFNet without and with WT information (Setting 3 vs Setting 4). The best performance (highest classification accuracy in Table 6, lowest ADE and FDE in Table 5) is attained after incorporating WT information into the proposed WTTFNet, which suggests that adding WT information is useful for prediction.

Overall, the ACC increased from 71.5% to 71.95% after adding WT information to the proposed WTTFNet. To validate its statistical significance, we performed McNemar’s test and a significant p-value (p=0.0196\lt 0.05) was attained. This suggests that the improvement from 71.5% to 71.95% in ACC is very unlikely to be due solely to random chance, given the large sample size of 28536 pedestrians. Furthermore, McNemar’s test also identifies 3008 pedestrians out of 28536 whose identified destination classes changed after WT information was added. This prompted us to further analyze these two groups of pedestrians in the next subsection.
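For readers unfamiliar with McNemar’s test, a minimal pure-Python sketch is given below. It uses the standard \chi ^{2} form with one degree of freedom and no continuity correction; whether the authors used this variant, a corrected one or an exact binomial version is not stated, so the choice here is an assumption, as are the function name and the toy counts.

```python
import math


def mcnemar_test(correct_a, correct_b):
    """McNemar's test on paired per-sample correctness flags.

    correct_a[n] / correct_b[n]: whether classifier A / B classified
    sample n correctly. Only the discordant pairs enter the statistic.
    """
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    if only_a + only_b == 0:
        return 0.0, 1.0   # no discordant pairs: no evidence of a difference
    stat = (only_a - only_b) ** 2 / (only_a + only_b)
    # Upper tail of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Toy example (hypothetical counts): 10 samples only A gets right,
# 30 only B gets right, 50 both right, 10 both wrong.
flags_a = [True] * 10 + [False] * 30 + [True] * 50 + [False] * 10
flags_b = [False] * 10 + [True] * 30 + [True] * 50 + [False] * 10
stat, p_value = mcnemar_test(flags_a, flags_b)
```

The test depends only on the samples on which the two classifiers disagree in correctness, which is why a small change in overall accuracy can still be significant when the sample size is large.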

2) Quantitative Analysis of the Role of Weather and Time-of-Day

In this sub-section, the role of weather and time-of-day in improving the prediction performance is analyzed further. The significant p-value obtained for McNemar’s test in the previous sub-section suggests that there is a significant improvement in classification accuracy after adding WT information to the proposed approach. Moreover, the predicted destination changed for 3008 pedestrians after WT information was added. This motivates us to analyze the average displacement error (ADE) and final displacement error (FDE) of these 3008 pedestrians.

Figs. 5 and 6 compare the ADE and FDE, respectively, of the proposed approach under two settings: with and without the incorporation of WT information. “No effect on Destination” means the predicted destination is the same under both settings, whereas “Influenced the Predicted Destination” means the predicted destination was altered after incorporating the WT information.

FIGURE 5. Average Displacement Error (ADE) of the proposed approach with/without the incorporation of weather-time information. Significant reduction in ADE can be observed (7.8m to 7.4m) for the significant 3008 pedestrians out of all 28536 pedestrians.

FIGURE 6. Final Displacement Error (FDE) of the proposed approach with/without the incorporation of weather-time information. Significant reduction in FDE can be observed (14.11m to 13.04m) for the significant 3008 pedestrians out of all 28536 pedestrians.

Fig. 5 shows the ADE of the proposed approach with/without WT information incorporated. It can be seen that a similar ADE was attained when the WT information has no effect on the predicted destination. On the other hand, if the predicted destination changed because of the varying weather-time condition (Influenced the Predicted Destination), the proposed WTTFNet with WT information incorporated attains a lower ADE (7.4083m) compared with the setting without WT information (7.83m). A one-sided Mann-Whitney U test was used to test the significance of the ADE reduction (7.83m to 7.4083m after adding WT information) and a p-value of p=0.0203\lt 0.05 was attained, suggesting that the performance improvement is significant for the pedestrians considered.

Fig. 6 compares the FDE of the proposed approach under the two settings (with/without WT information). Similar to the previous comparison, the same FDE was attained when the WT information has no effect on the predicted destination (FDE =9.99m), and an improved FDE (reduction from 14.11m to 13.04m) was attained by the proposed WTTFNet when incorporating WT information changes the predicted destination. A one-sided Mann-Whitney U test was used to test the significance of the FDE reduction (14.11m to 13.04m after adding WT information) and a p-value of p=0.00533\lt 0.05 was attained, suggesting that the performance improvement is significant for the pedestrians considered.

Overall, reductions of 5.47% in ADE (7.8m to 7.4m) and 7.58% in FDE (14.11m to 13.04m) were obtained for the 3008 pedestrians, and the reductions were found to be significant according to one-sided Mann-Whitney U tests (p=0.0203 (<0.05) and p=0.00533 (<0.05) for ADE and FDE, respectively). For the remaining pedestrians, similar ADE and FDE were observed, because they have the same predicted destination under the two settings (with/without WT information).
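The one-sided Mann-Whitney U test used above can be sketched in pure Python as follows. This is a simplified illustration, not the procedure actually run by the authors: it uses the normal approximation without tie or continuity corrections, and the function name and toy error values are assumptions.

```python
import math


def mann_whitney_u_less(x, y):
    """One-sided Mann-Whitney U test of H1: x tends to be smaller than y.

    Returns (U statistic of x, approximate p-value). Tied values receive
    average ranks; the variance is not corrected for ties.
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0   # average of 1-based ranks i+1 .. j+1
        for t in range(i, j + 1):
            ranks[t] = avg_rank
        i = j + 1
    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    std_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u_x - mean_u) / std_u
    p_value = 0.5 * math.erfc(-z / math.sqrt(2.0))   # P(Z <= z)
    return u_x, p_value


# Toy displacement errors (hypothetical): with WT vs without WT.
u_stat, p_value = mann_whitney_u_less([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

A small p-value supports the alternative that the errors in the first group (here, with WT information) tend to be smaller than those in the second group. For real sample sizes with many ties, a library routine with tie correction would be preferable.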

3) Qualitative Analysis of the Role of Weather and Time-of-Day

To illustrate the usefulness of adding WT information in the proposed WTTFNet, we consider four different cases, where Figs. 7(a) and (b) are extracted from the significant 3008 pedestrians and Figs. 7(c) and (d) are extracted from the remaining pedestrians, whose destination was not affected by weather-time conditions.

FIGURE 7. Illustration of predicted trajectories, where the weather-time condition has (a,b) significant influence on destination (chosen from the 3008 significant pedestrians), and (c,d) no influence on destination (chosen from remaining pedestrians). The first half of the trajectory (denoted in black) is used to predict the latter half of the trajectory. Since the two lines of WTTFNet with/without WT overlapped in (c) and (d), both settings are merged into one line.

Comparing the proposed WTTFNet with the other algorithms, the proposed WTTFNet (solid blue line with dots) generally aligns best with the actual trajectory (solid black). In particular, the linear model, vanilla LSTM and PoPPL diverge markedly from the actual trajectory in Figs. 7(a) and 7(b).

To study the role of weather and time-of-day, we compare the two settings of the proposed WTTFNet: with and without WT information. In Figs. 7(a) and 7(b), the WTTFNet with WT information (solid blue line with dots) aligns much better with the actual trajectory than its counterpart without WT information (solid red line with diamonds), which diverges in the middle of the path. For the remaining non-significant pedestrians, both settings give nearly the same performance in Figs. 7(c) and 7(d), and hence only one of them is plotted on the graphs.

Overall, the quantitative (Figs. 5 and 6) and qualitative (Fig. 7) analyses show that weather-time information helps to improve prediction performance significantly for the 3008 cases considered. The proportion of 3008 out of 28536 was also statistically significant according to McNemar’s test, suggesting that the improved performance for these 3008 pedestrians out of 28536 cases is very unlikely to be a random event. This suggests that the proposed approach may serve as an attractive way of incorporating WT information to improve pedestrian trajectory prediction, and that it also provides a systematic way to test the significance of WT conditions.

SECTION V.

Conclusion

A new deep WTTFNet has been presented. Experimental results using the Osaka ATC dataset [3] show that the proposed approach attained better performance than the other state-of-the-art methods considered under varying weather-time conditions. A statistical test is also used to establish the significance of time-of-day and weather conditions. The proposed refinement framework can be applied to other baseline models to improve their performance under varying weather-time conditions.
