Introduction
“Vehicle-Road-Cloud” collaboration is a current research hotspot and development trend in the field of intelligent transportation. It is a crucial competitive area in the new round of scientific and technological development and industry growth, holding significant importance for improving traffic safety, alleviating congestion, promoting energy conservation and emission reduction, and driving the development of upstream and downstream industries. Li et al. (2020) reported that roadside perception, as a vital component of collaborative perception in the “Vehicle-Road-Cloud” system, plays an indispensable role in enhancing perception accuracy and expanding the perception range. However, roadside perception is a challenging task that requires overcoming various difficulties and obstacles. Among the challenges faced, tracking moving targets using cameras in complex environments is a crucial aspect of roadside perception.
Currently, target tracking solutions using a single camera are relatively mature. However, cameras on real roads often operate independently, and there is no information exchange between cameras, leading to different identifications of the same target in different cameras. Cross-camera target tracking technology assigns consistent identities to the same target in different camera views, enabling continuous tracking across cameras.
Related Studies
2.1 Vehicle Detection
Liu et al. (2016) introduced the Single Shot MultiBox Detector (SSD), Girshick (2015) proposed Fast R-CNN, and Ren et al. (2017) presented Faster R-CNN. These models address the challenges associated with identifying and tracking vehicles. Fast R-CNN and Faster R-CNN are two-stage detectors that typically offer higher accuracy and flexibility but at the cost of being more time consuming. In contrast, You Only Look Once (YOLO), proposed by Redmon et al. (2016), and SSD are single-stage detectors. SSD uses a set of bounding boxes with different aspect ratios and sizes to predict object classes and their locations. YOLO frames object detection as a single regression problem, achieving a balance between speed and accuracy that enables real-time operation.
2.2 Single-Camera Tracking
In most single-camera tracking (SCT) methods, the task is typically split into two main steps: a detection step followed by an association step. During the association step, Bewley et al. (2016), Bochinski et al. (2017, 2018), and Ren et al. (2021) linked detections of the same target based on a similarity measure.
The tracking process can be conducted either offline, for tasks such as traffic analysis, or in real time alongside camera or video input frames.
In offline methods, the model utilizes detection across the entire sequence of frames, followed by global optimizations, including graph-based and hierarchical methods. Standard offline methods often utilize a graph-based model, which can be further improved through techniques such as the minimum cost flow proposed by Wang et al. (2015) and subgraph decomposition proposed by Tang et al. (2015).
On the other hand, online methods follow the tracking-by-detection paradigm, utilizing only the current and previous frames to link detection results for each frame. The primary challenge in online methods is to associate features between tracked objects and detection results. To address this issue, approaches such as the Kalman filter-based method proposed by Bewley et al. (2016) can be employed. Cao et al. (2023) introduced Observation-Centric Simple Online and Realtime Tracking (OC-SORT), while Wojke et al. (2017) proposed DeepSORT for object association, leveraging pure motion tracking and deep visual features. Maggiolino et al. (2023) proposed Deep OC-SORT, which introduces visual appearance into OC-SORT and adaptively integrates appearance matching into existing high-performance motion-based methods.
2.3 Cross-Camera Object Tracking
Recently, the field of multiple camera view tracking (MCVT) proposed by Javed et al. (2003) has gained significant attention, driven by the increasing demands of city-scale traffic management. The core of MCVT lies in cross-camera tracking technology. Presently, methods employed in cross-camera tracking research can be divided into two main categories: image feature-based tracking and motion feature-based tracking. In terms of image features, shortly after Porikli proposed brightness transfer functions, Madden et al. (2007) suggested using the main color spectrum features of targets as the basis for cross-camera target matching. Additionally, Collins et al. (2000), Porikli and Divakaran (2003) and Velipasalar et al. (2008) focused on studying the robustness of UV chromaticity features as features for cross-camera target appearance. Regarding motion features, Javed et al. (2003) used the Parzen window method to estimate target motion parameters within the camera field of view, such as speed, entry and exit positions, and transition time intervals. Dick and Brooks (2004) described human target motion within and between different fields of view using random transformation matrices. Caspi and Irani (2000) and Stein (1999) proposed methods based on temporal consistency between cameras to establish spatiotemporal constraints.
Recent research trends involve the integration of image feature information and motion feature information for collaborative tracking. Tran et al. (2022) proposed a method that combines spatial region partitioning and image features for vehicle association. Building on this approach, Yao et al. (2022) introduced a time decay mechanism, leading to improved tracking performance. However, most of the research on cross-camera tracking has been conducted offline, with relatively less emphasis on online tracking. For instance, Chen et al. (2021) utilized dual-appearance matrices to achieve cross-domain tracking of vehicles in overlapping regions.
Research Gaps and Objectives
Depending on the distribution of camera fields of view, cross-camera tracking research can be classified into overlapping and nonoverlapping scenarios. In nonoverlapping scenarios, current research methods are mainly divided into visual feature matching and spatiotemporal relationship-based association methods. Visual feature matching methods use neural networks to extract image features for target association. However, due to variations in matched angles and the impact of illumination on camera imaging, visual-based methods suffer from poor tracking performance. On the other hand, methods based on spatiotemporal information establish the topological structure between cameras according to the entrances and exits of the cameras, and trajectory similarity is calculated based on the transfer time of targets between cameras. However, these methods place high demands on the target detector, and issues such as detector false positives and false negatives can render the approach ineffective.
Therefore, this study proposes a cross-camera tracking method to address the shortcomings of both spatiotemporal-based and visual-based methods. The primary objectives of this research are as follows:
A cross-camera online tracking method based on real road scenes is proposed.
A spatial point matching-based filtering mechanism is proposed to enhance the utilization of spatial information.
A Gaussian model with microlevel traffic patterns is established to integrate visual features and spatiotemporal information to compensate for their respective shortcomings.
Method
The cross-camera vehicle tracking (CCVT) system framework is detailed in this section. As depicted in Fig. 1, the system is divided into two key components: single-camera vehicle tracking and intercamera association. Single-camera vehicle tracking integrates three modules: vehicle detection, vehicle visual feature extraction, and vehicle tracking. The cross-camera part consists of the intercamera association module, which is the main contribution of this article.
4.1 Vehicle Detection and Feature Extraction
In single-camera vehicle tracking, the fundamental tasks involve detection and feature extraction, and there are well-established solutions for these tasks. We will follow established methods for generating detection boxes and reidentification (ReID) features in this study.
For the detection task, we employ YOLOv8x as our single-stage detector because of its well-tuned balance between speed and accuracy. For the ReID feature extraction task, we use CSP-Darknet-53 as the backbone to obtain robust and discriminative visual feature representations for vehicles. This backbone is pretrained on the COCO dataset.
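As a minimal sketch of this stage, the following code illustrates how vehicle detections could be obtained with the publicly available ultralytics YOLOv8x weights and cropped for a downstream ReID backbone; the confidence threshold and the vehicle-class filter are illustrative assumptions rather than the exact configuration used in this study.

```python
# Minimal detection sketch. The `ultralytics` package, its pretrained
# "yolov8x.pt" weights, the confidence threshold, and the vehicle-class
# filter are assumptions for illustration, not the study's exact setup.
from ultralytics import YOLO

detector = YOLO("yolov8x.pt")  # single-stage detector, COCO-pretrained

def detect_vehicles(frame, conf_thres=0.4):
    """Return [x1, y1, x2, y2, conf] boxes for vehicle classes only."""
    vehicle_classes = {2, 5, 7}  # COCO ids: car, bus, truck
    result = detector(frame, verbose=False)[0]
    boxes = []
    for box in result.boxes:
        if int(box.cls) in vehicle_classes and float(box.conf) >= conf_thres:
            x1, y1, x2, y2 = map(float, box.xyxy[0])
            boxes.append([x1, y1, x2, y2, float(box.conf)])
    return boxes

def crop_vehicles(frame, boxes):
    """Crop detections so a ReID backbone can embed each vehicle image."""
    return [frame[int(y1):int(y2), int(x1):int(x2)]
            for x1, y1, x2, y2, _ in boxes]
```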
4.2 Single-Camera Vehicle Tracking
To ensure better real-time performance and accuracy in single-camera tracking, this research adopts Deep OC-SORT as the tracker. Deep OC-SORT builds upon the OC-SORT algorithm, improving multiobject tracking performance by integrating a novel visual appearance method. Moreover, Deep OC-SORT secured first and second place on MOT20 and MOT17, respectively, demonstrating its outstanding tracking performance.
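The sketch below shows a generic per-frame single-camera tracking loop under the tracking-by-detection paradigm; the tracker object and its update(detections, frame) interface are assumptions standing in for the actual Deep OC-SORT implementation, and detect_fn is a detection function such as the one sketched above.

```python
# Hypothetical per-frame tracking loop. The tracker object and its
# update(detections, frame) -> [[x1, y1, x2, y2, track_id], ...] interface
# are assumptions standing in for the actual Deep OC-SORT implementation.
import cv2
import numpy as np

def run_single_camera_tracking(video_path, tracker, detect_fn):
    """Yield (frame_index, tracks) for every frame of one camera."""
    cap = cv2.VideoCapture(video_path)
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        dets = np.array(detect_fn(frame))     # [x1, y1, x2, y2, conf] rows
        tracks = tracker.update(dets, frame)  # motion + appearance association
        yield frame_idx, tracks
        frame_idx += 1
    cap.release()
```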
4.3 Intercamera Association
Yang et al. (2022) reviewed the intercamera association (ICA), which serves as the final yet crucial module in the CCVT. By leveraging the trajectories produced by the preceding modules, ICA links all tracklets with identical identities based on visual features and spatiotemporal information. It employs two consecutive cameras to match tracklets in accordance with road entry and exit points.
4.3.1 Spatial Point Matching-Based Filtering Mechanism
To better utilize spatial information, this study proposes a spatial point matching-based filtering mechanism. The filtering mechanism targets road surveillance cameras arranged as a pair facing opposite directions. This arrangement gives adjacent cameras small blind zones and symmetrical fields of view. Because the blind zone is small, vehicles cannot move far while inside it. The spatial topological relationship between adjacent cameras is established based on the symmetrical fields of view: when a vehicle departs from a certain pixel point in the field of view of the front camera, it appears at the corresponding pixel point in the field of view of the rear camera. However, because vehicles may change lanes in the blind zone, the pixel coordinates at which the target vehicle leaves one camera's field of view and enters the other's may not strictly satisfy a point-to-point matching relationship. Therefore, the spatial search range is enlarged from a single pixel point to a region whose size is determined by the blind zone between the two cameras, as illustrated in Fig. 2. Under this filtering mechanism, whenever a vehicle appears in the rear-view camera, the area where it disappeared in the front-view camera can be identified from the spatial topological relationship, which narrows the range for vehicle association.
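A hedged sketch of how this filtering mechanism might be implemented is given below; the homography H that maps exit pixels of the front camera into the rear camera's image plane and the region margins are hypothetical placeholders for the calibrated spatial topology described above.

```python
# Illustrative sketch of the spatial point matching-based filter.
# `H` (a 3x3 homography mapping exit pixels of the front camera into the
# rear camera's image plane) and the region margins are hypothetical;
# in practice both come from calibrating the symmetric camera pair.
import numpy as np

def project_exit_point(exit_xy, H):
    """Map an exit pixel in the front camera into the rear camera's view."""
    p = H @ np.array([exit_xy[0], exit_xy[1], 1.0])
    return p[:2] / p[2]

def spatial_candidates(query_entry_xy, gallery, H, margin_x=120, margin_y=60):
    """Keep only gallery tracklets whose projected exit point falls inside
    a search region around the query's entry point; the margins grow with
    the blind zone because lane changes are possible there."""
    qx, qy = query_entry_xy
    keep = []
    for tracklet in gallery:
        px, py = project_exit_point(tracklet["exit_xy"], H)
        if abs(px - qx) <= margin_x and abs(py - qy) <= margin_y:
            keep.append(tracklet)
    return keep
```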
4.3.2 Temporal-Based Vehicle Association
In terms of temporal relationships, this study observes real road scenes and summarizes the time transfer patterns of vehicles between adjacent roadside cameras; target vehicles are then associated based on this pattern. In real road scenarios, the small blind zone between adjacent cameras and the speed limits imposed by urban traffic regulations ensure that the vehicle entering a camera's blind zone first is also the first to exit. Therefore, when a camera detects a new target vehicle, the study associates it with the vehicles leaving adjacent cameras based on the time transfer pattern between cameras.
While the time transfer pattern of vehicles between cameras aligns with most scenarios on real traffic roads, there is still a possibility of a target vehicle being overtaken by other vehicles in the blind zone, leading to association errors. Therefore, the method of target association based on time transfer patterns still has certain limitations.
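The first-in-first-out assumption behind the time transfer pattern can be written in a few lines, as sketched below; the exit_time field is an assumed bookkeeping convention for gallery tracklets.

```python
# Sketch of the first-in-first-out association implied by the small blind
# zone: the earliest vehicle to leave the front camera is matched to the
# newly appearing vehicle in the rear camera. The "exit_time" field is an
# assumed bookkeeping convention for gallery tracklets.
def temporal_associate(query_entry_time, candidates):
    """Return the candidate that exited first (and before the query entered)."""
    feasible = [c for c in candidates if c["exit_time"] < query_entry_time]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c["exit_time"])
```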
4.3.3 Visual Feature-Based Vehicle Association
In addition to the target association method based on time transfer patterns, this study also incorporates the conventional approach of visual feature matching for target association. The visual feature matching method comprises two key components: feature extraction and nearest-neighbor matching. For efficient feature extraction while maintaining real-time performance, this study utilizes the same feature extraction network, CSP-DarkNet, as YOLOv8.
The nearest-neighbor matching component assesses the similarity between the image feature vectors extracted by the CSP-DarkNet network via cosine distance calculations. Cosine similarity measures how alike two feature vectors are by evaluating the cosine of the angle between them in a vector space, as represented by Eq. (1):\begin{equation*}
\text{Similarity}=\cos\theta=\frac{\pmb{A}\cdot \pmb{B}}{\Vert \pmb{A}\Vert\ \Vert \pmb{B}\Vert }=\frac{\sum_{i=1}^{n}\pmb{A}_{i}\pmb{B}_{i}}{\sqrt{\sum_{i=1}^{n}\pmb{A}_{i}^{2}}\sqrt{\sum_{i=1}^{n}\pmb{B}_{i}^{2}}}
\tag{1}
\end{equation*}
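Applying Eq. (1) to the extracted ReID features, a minimal nearest-neighbor matcher might look as follows; the similarity threshold is an illustrative assumption.

```python
# Nearest-neighbor matching with the cosine similarity of Eq. (1).
# Each gallery candidate is assumed to store one ReID feature vector;
# the similarity threshold is illustrative.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def visual_associate(query_feature, candidates, sim_thres=0.6):
    """Return the candidate whose feature is most similar to the query,
    or None if no candidate clears the similarity threshold."""
    best, best_sim = None, sim_thres
    for c in candidates:
        sim = cosine_similarity(query_feature, c["feature"])
        if sim > best_sim:
            best, best_sim = c, sim
    return best
```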
4.3.4 Online Learning-Based Gaussian Model
In response to the limitations of both the target association models based on time transfer patterns and visual feature matching, this study proposes a Gaussian model-based approach utilizing online learning to complement the strengths and address the weaknesses of each individual model. Because factors such as traffic signals, speed limits, and regional development affect real roads, each road exhibits specific traffic patterns. This study statistically analyzes a dataset of German motorways and shows that the speed of vehicles traveling on motorways follows a Gaussian distribution, as shown in Fig. 3. The dataset used for the statistical analysis is the recently released HighD dataset from the Institute for Automotive Engineering at RWTH Aachen University in Germany, with a sample size of 1,044,634. This suggests that the transit time of vehicles through the blind zone between adjacent cameras should also follow a Gaussian distribution.
Hence, real-time transfer data of vehicles between adjacent roadside cameras enables the establishment of a Gaussian model reflecting microlevel traffic patterns. However, the time transfer model will fail in the following two situations. The first is when the minimum time difference between gallery samples leaving the field of view of one camera and newly detected query samples from the adjacent camera exceeds the 95% confidence interval. The second is when the transfer time of associated vehicles in the blind zone deviates significantly from the average passage time.
In contrast, the target association model based on visual feature matching directly searches the gallery for the vehicle whose image features are most similar to the query sample, effectively avoiding issues associated with the target association model based on time transfer patterns. Therefore, in certain situations, a model based on visual feature matching can be employed to compensate for the limitations of models based on time transfer patterns. When the time difference falls within the 95% confidence interval of the Gaussian model, the target is associated using the model based on time transfer patterns. After completing the association, the transfer time of target vehicles between adjacent cameras is continually updated to maintain the Gaussian model reflecting the microlevel traffic patterns on the road, as illustrated in Fig. 4.
The overall association accuracy of the target association model based on time transfer patterns surpasses that of the model based on visual feature matching. Therefore, when the sample size of the Gaussian model is inadequate, it is necessary to use the target association model based on time transfer patterns to accomplish target association and sample collection. The acquired transfer times of vehicles between adjacent cameras are then utilized to continually update the Gaussian model, facilitating the summarization of microlevel traffic patterns on the road for online learning purposes.
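A sketch of the online-learned Gaussian model and the combined association rule is given below; it reuses the temporal_associate and visual_associate helpers sketched earlier, the bootstrap sample size is an assumption, and the 1.96-sigma band approximates the 95% confidence interval described above.

```python
# Sketch of the online-learned Gaussian model of blind-zone transfer time
# and the combined association rule. It reuses the temporal_associate and
# visual_associate helpers sketched earlier; min_samples (bootstrap size)
# is an assumption, and the 1.96-sigma band approximates the 95% interval.
import math

class TransferTimeModel:
    def __init__(self, min_samples=30):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
        self.min_samples = min_samples

    def update(self, dt):
        """Welford's online update with a newly observed transfer time dt."""
        self.n += 1
        delta = dt - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (dt - self.mean)

    def std(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else float("inf")

    def in_confidence_interval(self, dt):
        """True if dt lies inside the ~95% interval (mean plus/minus 1.96*std)."""
        return abs(dt - self.mean) <= 1.96 * self.std()

def associate(query, candidates, model):
    """Prefer the temporal rule while the Gaussian model trusts it,
    otherwise fall back to visual feature matching; every accepted
    temporal association also updates the model online."""
    match = temporal_associate(query["entry_time"], candidates)
    if match is not None:
        dt = query["entry_time"] - match["exit_time"]
        if model.n < model.min_samples or model.in_confidence_interval(dt):
            model.update(dt)
            return match
    return visual_associate(query["feature"], candidates)
```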
Experiments
5.1 Datasets
In this study, real road datasets were utilized to validate the accuracy of the cross-camera tracking algorithm. The dataset used for the experiment is based on real urban road scenarios, with the selected road being Fuxing Road in Haidian District, Beijing, as illustrated in Fig. 5. The chosen route is a bidirectional eight-lane road. Considering the limited lateral field of view of the cameras, simultaneously monitoring both directions on a bidirectional road would decrease the quality of the captured vehicle images. This, in turn, could impact the performance of the tracking algorithm. Therefore, in this experiment, video collection focused on the traffic flow of a single-direction four-lane segment of the road.
The parameters for recording the road traffic videos were set at 30 frames/s with a resolution of 1,920 × 1,080 pixels. The total duration of the road traffic videos was 1 h, comprising 40 min of daytime recording between 10:00 and 11:00 and 20 min of nighttime recording between 19:00 and 21:00.
5.2 Programming Environment
The programming environment for the experiments in this research is shown in Table 1. The average computation time for tracking per frame is 32 ms.
5.3 Evaluation Metric
IDF1 is a commonly used performance evaluation metric in multiobject tracking. It quantifies the ratio of correctly identified detections to the average number of ground-truth and tracker-computed detections. IDP and IDR denote identification precision and identification recall, respectively. IDs refers to the identities assigned in cross-camera tracking.\begin{equation*}
\begin{cases}
\text{IDF}1 =\frac{2\text{IDTP}}{2\text{IDTP}+\text{IDFP}+\text{IDFN}}\\
\text{IDP} =\frac{\text{IDTP}}{\text{IDTP}+\text{IDFP}}\\
\text{IDR} =\frac{\text{IDTP}}{\text{IDTP}+\text{IDFN}}
\end{cases}
\tag{2}
\end{equation*}
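Given aggregate ID-level counts, Eq. (2) reduces to a few lines of code; the function and argument names below are illustrative.

```python
# Identification metrics of Eq. (2), computed from aggregate ID-level counts.
def id_metrics(idtp, idfp, idfn):
    idp = idtp / (idtp + idfp) if (idtp + idfp) else 0.0
    idr = idtp / (idtp + idfn) if (idtp + idfn) else 0.0
    idf1 = 2 * idtp / (2 * idtp + idfp + idfn) if idtp else 0.0
    return {"IDF1": idf1, "IDP": idp, "IDR": idr}
```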
5.4 Experimental Results
5.4.1 Comprehensive Experiment
Based on the method described in Section 4, we conducted a quantitative analysis of the cross-domain tracking algorithm on the above datasets. The overall tracking results are shown in Table 2.
Fig. 6 illustrates the tracking performance of the cross-camera vehicle tracking algorithm. Cars with ID 8 and ID 6, along with a sport utility vehicle (SUV) with ID 5, exit the field of view of camera 1 and reappear in camera 2 after some time. The cross-camera tracking algorithm correctly associates the targets in these instances.
As indicated in Table 2, in the daytime scenario, both the single-camera detection and tracking module and the cross-camera tracking module demonstrate stability, showing accurate target detection with fewer ID jumps. The overall performance of cross-camera tracking is commendable. However, during the nighttime, the stability of the single-camera detection and tracking module significantly decreases. The frequency of ID jumps is greater, resulting in a substantial increase in IDFN. In camera 1, the occurrence of multiple ID jumps interferes with the normal association of targets by the cross-camera tracking algorithm, leading to a reduction in the accuracy of cross-camera tracking.
5.4.2 Comparative Experiment
To validate the effectiveness of our proposed algorithm, we conducted comparative experiments on five distinct target association methods: visual feature-based target association, time-based target association, visual feature-based target association under spatial constraints, time-based target association under spatial constraints, and our proposed target association method, which integrates both temporal and spatial information with visual features. These experiments were conducted using the YOLOv8 detector for object detection and the Deep OC-SORT tracker for tracking as the baseline.
The experimental results are presented in Table 3 and Figs. 7–9. With the Zone-based Target Candidates Filter, a method proposed in this study, both temporal information-based target association and visual feature-based target association methods achieved significant improvements in tracking performance. Even compared with the reidentification algorithm proposed by He et al. (2019), the method introduced in this study demonstrates commendable performance.
In terms of tracking accuracy, the temporal information-based target association method outperformed the visual feature-based method, demonstrating superior performance in both daytime and nighttime tracking. Regarding tracking stability, the experiments revealed that in scenarios with dense road traffic or when false detections occurred in a single-camera detector, as shown in Fig. 10, the temporal information-based target association method exhibited continuous association errors for trailing vehicles. In contrast, the visual feature-based target association method, due to the characteristics of its working principles, was less affected by the introduction of falsely detected targets, resulting in higher tracking stability.
This study combines visual feature-based target association methods with temporal information-based target association methods and proposes a novel cross-camera target association method that seamlessly integrates visual features and spatiotemporal information. Even in scenarios with dense road traffic or false detections in a single-camera detector, the proposed method maintains stable tracking performance, thereby significantly enhancing the overall accuracy of cross-domain tracking algorithms.
Conclusions
To address these challenges and enhance tracking performance in cross-camera vehicle tracking, this study proposes advanced solutions and strategies. First, it addresses issues in both visual-based and spatiotemporal-based vehicle association in cross-camera tracking, compensating for their shortcomings by combining visual features and spatiotemporal information. Next, by utilizing the transfer times of vehicles between cameras, a Gaussian model is established and updated to facilitate the summarization of microlevel traffic patterns on roads for online learning. Finally, the proposed method achieves an IDF1 score of 0.932 in real-road scenarios.
Replication and Data Sharing
The Python code used in this research can be made available upon request via email to the corresponding author.
Acknowledgements
This study was funded by the National Natural Science Foundation of China (52172389), Natural Science Foundation of Guangdong Province (2022A1515012080), and Tsinghua-Toyota Joint Research Institute Interdisciplinary Program.