Introduction
Multi-objects tracking (MOT) is an important problem in computer vision, which has wide applications in visual surveillance, traffic safety and robotics and so on. It aims to locate multiple objects, maintain their identities and yield their individual trajectories throughout an image sequence [1]. While the tracked objects can be pedestrian, vehicle or animal, we mainly focus on pedestrians in our work since it is a more challenging problem in crowed scenes. The crowded scenes always have similar appearance with occlusion, intersecting trajectories, missing data and camera motion.
Due to the development of pedestrian detectors [2]–[4], the tracking-by-detection (TBD) MOT methods are the state-of-the-art tracking in recent years. Data association is a key issue in TBD methods. With the detection responses, the MOT task is formulated as an association optimization problem, in where short tracklets or detection responses are linked to yield long trajectories by learning the appearance, motion, location and size of the targets through the video sequences. There are two main components in data association based MOT system. The first one is association affinity model, which is used to measure the similarity or likelihood between tracklets in consecutive frames. The second one is association optimization model, which determines the tracklets linking between short tracklets or detection responses according to their association affinity model.
Several association optimization methods have been proposed to improve association problem in MOT [1]. Most existing association optimization methods either devise for batch tracking or online tracking. The global association method, considering past and future information, is proposed to particularly solve association optimization in batch tracking. While the local association method, only considering up to time information, mainly focuses on solving association optimization in online tracking.
To improve MOT performance, in this paper, an online multi-pedestrian tracking method is proposed, which formulates the association optimization problem into two step association by relying on an improved association affinity model. The framework of the proposed method is shown in Figure 1. The improved association affinity model is calculated by sparse-based appearance affinity model and rank-based motion affinity model. In our work, the motion affinity model is computed by a rank based motion model relying on the history position of a trajectory. Different from the general rank calculation method, an incremental SVD and downdate SVD decomposition method is devised to calculate the rank of the Hankel matrix in rank based motion model. This is specially devised for online tracking since the data explosion with time from stream data and it is too large to be stored or even buffered. In addition, we pay attention on sparse-based appearance model, use the sparse reconstruction error to analysis the missing frames in tracklet confidence calculation. To sum up, the main contributions of our work are in three folds:
Association optimization is implemented by two step data association considering the tracklet with high confidence and low confidence, which is guided by sparse-based appearance model and rank-based motion model of the target.
We construct the appearance affinity model based on sparse representation and rely on the reconstruction error to analysis the occlusion or detection failure. The analyzed result is fed back to compute the tracklet confidence during association optimization.
The rank based motion affinity model is learned online and updated by adaptively deciding the order of history trajectory of a tracked target related with the new frame motion model. We devise an incremental SVD and downdate SVD decomposition method to online estimate the rank of the Hankel matrix, the estimated result is fed back to compute the tracklets confidence during association optimization.
Related Works
In recent years, many existing MOT works focus on devising a robust appearance model to improve the tracking performance since the appearance features are very important factors influenced the MOT performance. thus, several features, such as histogram-based features, sparse features and CNN-based features, are used to build appearance model of the tracked target by integrating different affinity measures [5]. Kuo et al. [6] devised an online learned discriminative appearance model called OLDAMs to improve MOT performance. Bae et al. [7] design a deep appearance model based on the Siamese network and introduce online transfer learning into deep appearance model for discriminating different object. In [8], an instance-aware tracker is proposed by integrating single object tracking algorithm to MOT with CNN-based dynamically refreshing model to eliminate noise. In [9], Sadeghian et al. propose a MOT method by integrating deep CNN based appearance model and LSTM based motion model. In addition, significant improvement have achieved in sparse-based tracking. In [10], Zhang et al. proposed a robust structural sparse representation (RSST) method to handle occlusion and illumination changes. Fagot-Bouquet et al. [11] introduce sparse representations into multi-frame data association to gain the MOT performance.
To further improve MOT performance, there are also many works pay their attention on fusing motion affinity and appearance affinity of the tracked target. For example, Yang et al. [12] integrate the non-linear motion model with appearance model to construct a robust association affinity model under the framework of multiple instance learning algorithm. Bae and Yoon [13] exploit trajectory confidence and incremental linear discriminate appearance to assist their two-step data association. Collins et al. [14] showed that higher-order motion constraint has a major effect on improving the quality of data association in MOT, especially when multi-object with similar appearance. In [15], both individual and mutual relation models are introduced in MOT to build graph model. In [16], the pairwise relative motion model is introduced as an additional term to construct CRF energy function. Most recently, the notion of relative motion network is proposed in [17], which is designed to improve the data association performance by utilizing the relative spatial constraint between objects.
However, these methods either devise a discriminative appearance affinity model by exactly extracting the targets features or build pairwise motion constraint to distinguish the target in neighborhood of an image. It will cost huge time to achieve the discriminative appearance feature of the targets in MOT. While the motion constraint often depend on the position of the target estimated by Kalman or particle filter. This kind of motion affinity model has a good performance in where continuous detections without occlusion and non-stationary camera in short period time. However, the MOT tracking system is with different kinds of noise, e.g., missing detections, occlusions and unstable camera motion.
The improve MOT performance, in our work, we formulate the association affinity model by sparse-based appearance affinity model and rank-based motion affinity model as follow:\begin{equation*} S_{link} (T\vert z)=S_{app} (T\vert z)S_{m} (T\vert z)\tag{1}\end{equation*}
Online Tracking With Improved Two-Step Data Association
A. Problem Formulation
The task of online MOT is to implement tracklets association between the existing trajectories and a set of detections. By estimating the object state from the current detection responses, it generates long trajectories for the tracked targets. Let \begin{equation*} {\mathbb {T}}_{1:t}^{\ast } =\mathop {\arg \max }\limits _{{\mathbb {T}}_{1:t}} p({\mathbb {T}}_{1:t},\vert {\mathbb {Z}}_{1:t})\tag{2}\end{equation*}
In Eq. (2), the optimal tracklet association is solved relied on the association affinity model and association optimization strategy. In our work, association affinity model is constructed by sparse appearance model and rank based adaptively dynamic motion affinity model. The association optimization strategy is performed by two step data association, high confidence tracklets association in two consecutive frames and low confidence tracklets association by considering the history tracklet information. Furthermore, we introduce the sparse reconstruction error for occlusion analysis and an incremental SVD and downdate SVD method is devised to compute the rank for Hankel matrix in the rank based motion model. The rank also determines the order of history object states used to perform the association between tracklets and detections. This is quite different with the most similar work in [13] (more details works are described in section III.C and section IV).
B. Improved Two-Step Data Association
To perform the two-step data association, we have bipartition the tracklets into high confidence tracklets and low confidence tracklets based on the tracklet confidence. As defined in [13], the tracklet confidence is mainly determined by three factors: the length of a tracklet, occlusion and affinity score between the tracklet and the associated detection. In our work, we define the tracklet confidence relied above factors as follows:\begin{equation*} conf(T^{i})=\left({\frac {1}{L_{i}}\sum \limits _{t\in [t_{s}^{i},t_{e}^{i}]} {S_{link} (T^{i},z^{j})} }\right)\times \exp \left({-\beta \cdot \frac {W}{L_{i} }}\right)\tag{3}\end{equation*}
The major difference between our work and the most similar work [13] in tracklet confidence calculation is the way to calculate the association affinity model
1) The First Step Association for High Confidence Tracklets in Two Consecutive Frames
By the tracklet confidence measurement in Eq. (3), \begin{equation*} X_{k\times n}^{l} =[x_{ij}]_{k\times n},x_{ij} =-\log (S_{link} (T_{L_{i} }^{i(h_{i})},z^{j}))\tag{4}\end{equation*}
2) The Second Step Association for Low Confidence Tracklets by Considering the History Tracklet Information
The low confidence tracklets
The cost matrix in the second step association is defined as follows:\begin{equation*} X_{(q+{q}')\times (h+q)}^{g} =\begin{bmatrix} {A_{q\times h}} & {D_{q\times q}} \\ {-\log (\tau)_{q'\times h}} & {B_{q'\times q}} \\ \end{bmatrix}\tag{5}\end{equation*}
C. Association Affinity Model With Sparse Representation and Rank Based Motion Estimation
Follow Eq. (1), the association affinity model in our work is consisted by two affinity terms. The first term in Eq. (1) is the appearance affinity score between the tracklet and the associated detection, which is calculated based on sparse representation. The second term is the motion affinity computed by rank based dynamics motion model in section IV. For a detection \begin{equation*} \alpha =\mathop {\arg \min }\limits _{\boldsymbol {\alpha }} \frac {1}{2}\left \|{ {z-F\alpha } }\right \|_{2}^{2} +\lambda \left \|{ \alpha }\right \|_{2,1}\tag{6}\end{equation*}
\begin{equation*} R_{e} (z)=\left \|{ {z-F\alpha } }\right \|_{2}^{2}\tag{7}\end{equation*}
1) The Appearance Affinity Score for a Pairwise Tracklet-Detection Association in High Confidence Tracklets Association
As described in section III.B, the association optimization is performed by two-step. Since the high confidence tracklets association is only implemented between the detection \begin{equation*} S_{app} (T^{(hi)},z^{j})=\frac {1}{\xi }\exp (-R_{e} (z^{j}))\tag{8}\end{equation*}
2) The Appearance Affinity Score for the Low Confidence Tracklets Association
Since the appearance affinity score in low confidence tracklets association involved two kind associations, the pairwise tracklet-tracklet association between the low confidence tracklets \begin{equation*} S_{app} (T_{L_{i}}^{(l_{o})},T_{L_{j}}^{(h_{i})})=\frac {1}{\xi }\exp \left ({{-\frac {\sum \limits _{j=1}^{N_{s}} {R_{e} (z^{j})}}{N_{s} }} }\right)\tag{9}\end{equation*}
In terms of tracklet-detection pairs association between
3) Missing Frames Analysis Based on Sparse Reconstruction Error
Due to the main concept behind our sparse-based appearance affinity model is that a detection response should be best represented by the detections of its own trajectory rather than detections from other trajectories [19]. Therefore, the sparse reconstruction error between a detection and a tracklet can reflect the similarity between them, then, it is naturally to use the sparse reconstruction error analyzed the missing frames in Eq. (2), the missing frames often caused by occlusion or detection failure, in our work, if the smallest reconstruction error
The motion affinity
Improved Rank Based Motion Affinity Model
The rank based motion affinity model proposed in [20] is defined as follows:\begin{equation*} y_{t} =\sum \limits _{i=1}^{m} {a_{i} y_{t-i}},\quad m\le \mathbf {L},~t\ge s+m\tag{10}\end{equation*}
As defined in [20], for an ordered tracklet sequence \begin{equation*} H_{T^{i}} =\begin{bmatrix} {y_{s}} & {y_{s+1}} & \cdots & {y_{s+\mathbf {L}_{i} /2}} \\ {y_{s+1}} & {y_{s+2}} & \cdots & {y_{s+\mathbf {L}_{i} /2+1}} \\ \vdots & \vdots & \vdots & \vdots \\ {y_{s+\mathbf {L}_{i} /2}} & {y_{s+\mathbf {L}_{i} /2+1}} & \cdots & {y_{s+\mathbf {L}_{i}}} \\ \end{bmatrix}\tag{11}\end{equation*}
One major challenging for motion affinity in [20] is how to effectively estimate the rank of a Hankel matrix for a tracklet dynamics model in Eq. (10). The other challenging is that the Hankel matrix constructed by the positions of a trajectory is corrupted by noise, which is caused by false associations between tracklets or missing detections. In addition, most existing methods use the fully singular value decomposition (SVD) to compute the rank of Hankel, which is exponentially growth as the length
Aim to solve above issues, in our work, we follow the idea of [20] and define the improved rank based motion affinity model as Eq. (12), which introduce the incremental SVD and downdate SVD decomposition for real-time streaming data to estimate the rank for Hankel matrix.\begin{equation*} S_{m} (T^{i},T^{j})=\frac {rank(\mathrm {H}(T^{i})+\mathrm {rank}(\mathrm {H}(T^{i})))}{rank(H(T^{ij}))}-1\tag{12}\end{equation*}
In our work, the motion affinity
1) Online Rank Estimation Uses Incremental SVD Decomposition With New Data Arrival
In online tracking, we devise a scheme for rank estimation of the Hankel matrix by an incremental SVD decomposition of real-time streaming data [23], it incorporate the new arrive data into the SVD as it arrives. The incremental SVD decomposition of a
In order to consideration the noise in trajectory, let \begin{equation*} \mathbf {y}_{t} ={\hat {\boldsymbol {y}}}_{t} +\boldsymbol {\eta }_{t}\tag{13}\end{equation*}
\begin{equation*} H=Y+E\tag{14}\end{equation*}
Eq. (14) is rewritten via the incremental PCP algorithm as follows:\begin{equation*} \mathop {argmin}\limits _{A,E} \left \|{ Y }\right \|_{\ast } +\lambda \left \|{ E }\right \|_{1},\quad s.t.~ H=Y+E\tag{15}\end{equation*}
By solving Eq. (15), the noiseless trajectory of a target is achieved. The convex relation form of Eq. (15) is as follows:\begin{equation*} \mathop {argmin}\limits _{\mathbf {a},\eta }~ rank(Y)+\lambda \left \|{ E }\right \|_{0},\quad s.t.~ H=Y+E\tag{16}\end{equation*}
For frame \begin{equation*} \mathbf {y}_{t} =\hat {\boldsymbol {y}}_{t}^{n} +\boldsymbol {\eta }_{t}^{n}\tag{17}\end{equation*}
Given \begin{equation*} H_{t} =[H_{t-1},\mathbf {y}]=[H_{t-1} \;\mathbf {0}]+\mathbf {ye}^{T}\tag{18}\end{equation*}
We compute the SVD of \begin{equation*} H_{t} =\begin{bmatrix} U & {\mathbf {y}} \\ \end{bmatrix}\begin{bmatrix} \sum & {\mathbf {0}} \\ {\mathbf {0}^{T}} & 1 \\ \end{bmatrix}\begin{bmatrix} {V^{T}} & {\mathbf {0}} \\ {\mathbf {0}^{T}} & 1 \\ \end{bmatrix}\tag{19}\end{equation*}
By Gram-Schmidt orthonormalization of \begin{equation*} H_{t} =\begin{bmatrix} U & {\mathbf {p}} \\ \end{bmatrix}\begin{bmatrix} \sum & {U^{T}\mathbf {y}} \\ {\mathbf {0}^{T}} & \rho \\ \end{bmatrix}\begin{bmatrix} {V^{T}} & {\mathbf {0}} \\ {\mathbf {0}^{T}} & 1 \\ \end{bmatrix}\tag{20}\end{equation*}
\begin{equation*} \rho =\left \|{ {\mathbf {y}-UU^{T}\mathbf {y}} }\right \|_{2},\mathbf {p}=\frac {\mathbf {y}-UU^{T}\mathbf {y}}{\rho }.\end{equation*}
Let the middle matrix of the right-hand-side of Eq. (20) is \begin{equation*} {M}'=\begin{bmatrix} \sum & {U^{T}\mathbf {y}} \\ {\mathbf {0}^{T}} & \rho \\ \end{bmatrix}\in \Re ^{(n+1)\times (n+1)}.\end{equation*}
\begin{equation*} SVD({M}')=G\hat {\sum }Q^{T}\tag{21}\end{equation*}
Then the SVD of \begin{equation*} H_{t} =\begin{bmatrix} U & {\mathbf {p}} \\ \end{bmatrix}\left({G\hat {\sum }Q^{T}}\right)\begin{bmatrix} {V^{T}} & {\mathbf {0}} \\ {\mathbf {0}^{T}} & 1 \\ \end{bmatrix}\tag{22}\end{equation*}
From Eq. (21) and (22), we can see that \begin{equation*} {U}'=\begin{bmatrix} U & {\mathbf {p}} \\ \end{bmatrix}G,\quad {\sum }'=\hat {\sum },~{V}'=\begin{bmatrix} {V^{T}} & {\mathbf {0}} \\ {\mathbf {0}^{T}} & 1 \\ \end{bmatrix}Q.\end{equation*}
2) Online Rank Estimation Uses Downdate SVD Decomposition With Deleting ‘Too Old’ Data
Due to the data are too large to be stored or even buffered with repeatedly update \begin{equation*} [\mathbf {0},H_{t-r\;:t}]=[\mathbf {y}_{t-r+1},H_{t-r\;:t}]+(-\mathbf {y}_{t-r+1})\mathbf {e}^{T}\tag{23}\end{equation*}
\begin{align*} [\mathbf {0},H_{t-r\;:t}]=&[\mathbf {y}_{t-r+1},H_{t-r\;:t}]+(-\mathbf {y}_{t-r+1})\mathbf {e}^{T} \\=&\begin{bmatrix} U & {\mathbf {-y}_{t-r+1}} \\ \end{bmatrix}\begin{bmatrix} \sum & {\mathbf {0}} \\ {\mathbf {0}^{T}} & 1 \\ \end{bmatrix}\begin{bmatrix} {V^{T}} \\ {\mathbf {e}^{T}} \\ \end{bmatrix}\tag{24}\end{align*}
By Gram-Schmidt orthonormalization of \begin{align*} [\mathbf {0},H_{t-r\;:t}]=&[\mathbf {y}_{t-r+1},H_{t-r\;:t}]+(-\mathbf {y}_{t-r+1})\mathbf {e}^{T} \\=&\begin{bmatrix} U & {\mathbf {0}} \\ \end{bmatrix}\begin{bmatrix} {\sum -\sum \mathbf {ww}^{T}} & {-\rho _{\mathbf {w}} \cdot \sum \cdot \mathbf {w}} \\ {\mathbf {0}^{T}} & {\mathbf {0}} \\ \end{bmatrix} \\&\times \,\begin{bmatrix} {V^{T}} \\ {\mathbf {q}^{T}} \\ \end{bmatrix}\tag{25}\end{align*}
\begin{align*} \mathbf {w}=&V^{T}\mathbf {e}=V(t-r+1,:)^{T},\rho _{\mathbf {w}} =\sqrt {1-\mathbf {w}^{T}\mathbf {w}},\\ \mathbf {q}=&\frac {\mathbf {e}-V\mathbf {w}}{\rho _{\mathbf {w}} },\mathbf {y}_{t-r+1} =U\sum \mathbf {w}.\end{align*}
Let the middle matrix of the right-hand-side of Eq. (25) is\begin{equation*} M=\begin{bmatrix} {\sum -\sum \mathbf {ww}^{T}} & {-\rho _{\mathbf {w}} \cdot \sum \cdot \mathbf {w}} \\ {\mathbf {0}^{T}} & {\mathbf {0}} \\ \end{bmatrix}.\end{equation*}
\begin{equation*} SVD(M)=G\hat {\sum }Q^{T}\tag{26}\end{equation*}
Then the downdate SVD of \begin{equation*} [\mathbf {0}, H_{t-r\;:t}]=\begin{bmatrix} U & {\mathbf {0}} \\ \end{bmatrix}\cdot \left({G\hat {\sum }Q^{T}}\right)\cdot \begin{bmatrix} {V^{T}} \\ {\mathbf {q}^{T}} \\ \end{bmatrix}\tag{27}\end{equation*}
From Eq. (26) and (27), we see that \begin{equation*} {U}'=\begin{bmatrix} U & {\mathbf {0}} \\ \end{bmatrix}G,\quad {\sum }'=\hat {\sum },~{V}'=\begin{bmatrix} {V^{T}} \\ {\mathbf {q}^{T}} \\ \end{bmatrix}\cdot Q.\end{equation*}
From the above analysis, we can find the optimal rank of the Hankel matrix in Eq. (11) and achieve the cleaned trajectory at any frame
Finally, after the association optimization between the tracklets and the detections, the non-associated detections are remained to initialize new potential trajectories. A new trajectory is found when it grows over five consecutive frames. The non-associated trajectories are terminated if they are unassociated in five consecutive frames. The main steps of the proposed online multi-object tracking method are summarized in Algorithm 1.
The Overall Algorithm for the Proposed Online Multi-Object Tracking
object state set
updated trajectory set Tt
construct association affinity model
sparse-based appearance affinity model:
calculate the appearance affinity score for pairwise tracklet-detection association and tracklet-tracklet association as in Eq. (8) and Eq. (9), respectively;
compute the rank based motion affinity as Eq. (12);
data association optimization
calculate the tracklet confidence according to Eq. (3), bipartition tracklets into high and low confidence tracklets;
missing frames analysis based on sparse reconstruction error as in section III.C;
perform the first step data association for high confidence tracklets in two consecutive frames;
perform the second step data association for low confidence tracklets by considering the history tracklet information as in section III.B;
construct association matrix
update the trajectory confidence and the tracklets after assigning the associated results.
Experiments
In this section, we validate the effectiveness of the proposed method and make a comparison with several state-of-the-art multi-object tracking methods.
A. Datasets
To evaluate the performance of the proposed method, experiments are conducted on three typical public dataset: the PETS 2009 dataset, the TUD dataset,and MOT 2015 dataset. The PETS 2009 dataset contains pedestrian sequences with different crowded density, and we use the S2 dataset, which is devised for pedestrian tracking (S1 focuses on person counting and density estimation while S3 is devised for flow analysis and event recognition). The TUD dataset contain sequences captured on streets at a very low camera angle in close range. Due to the low viewpoint of camera, there are often full occlusions among pedestrians. The crossing, campus and Stadtmitte sequences from TUD dataset are used for evaluation. The MOT2015 dataset include 11 challenging sequences cover various scenes, such as occlusions, clutter background, scale and shape changes, moving camera and stationary camera.
B. Evaluation Metrics
To comprehensively evaluate the performance for multi-object tracking algorithm, the CLEAR MOT metrics [24] as well as the metrics defined in [25] are used for quantitative evaluation, including MOTP(
C. Parameter Setting and Baseline Methods
The proposed online MOT algorithm is performed on MATLAB with an Intel Core i7 8GHz PC. In the experimental, we set the pre-defined threshold
We compare our work with 21 state-of-the-art MOT methods, including online trackers and batch trackers. Ten of them are online methods (RMOT [17], SCEA [26], CMOT [13], TSML [20], HDAOH [27], MHMHT [28], PRIMPT [29], PHD_GSDL [30], TENSOR [31], SAS_MOT15 [32], RNN_LSTM [33]), the rest 11 are batch methods (DCT [34], SMOT [35], CEM [36], KSP [37], [20], DTLE [38], H2T [39], MTTCEM [40], DP_NMS [41], JPDA_100 [42], SiameseCNN [43], CNNTCM [44]). Because some of the state-of-the-art methods do not provide publicly available source codes, we just use the results reported in their published papers.
D. Results and Analyses
1) Evaluation on the Pets Dataset
Some exemplars of the tracking results from the PETS 2009 S2.L1, S2.L2 and S2.L3 sequences are shown in Fig. 2. The quantitative evaluation results are shown in Table 1. This dataset contains different levels of crowded density (low to high crowded), frequent interactions between pedestrians and illumination changes.
The tracking results of the proposed method on PETS 2009 Dataset sequences. At each frame, objects are attached with bounding boxes and ID labels with different colors.
a: Low-Density Sequence
The S2.L1 sequence is one of the widely used low-density but long-frame multi-pedestrian tracking sequences with frequently occlusion, closely moved pedestrians with similar appearance and complicated motion. Table 1 shows that the proposed method outperforms the state-of-the-art trackers with higher scores in MOTA and MT, lower scores in IDs and FM. Compared with the CMOT tracker [13], the most related work, our method has improved performance, the MOTA and MT scores are improved 9.06% and 26.3%, respectively. The number of ID switches and FM are only 3 and 6, respectively. The significant low IDS and FM indicates that our association affinity model is superior to the method by [13]. This is owing to the rank based motion affinity model, which is used to build the affinity association model, can effectively handle missing detections and complicated motion.
b: Medium-Density Sequence
The S2.L2 sequence contains 436 frames with large number of pedestrians, who moved in different directions with significantly appearance changes due to occlusions and illumination changes. The frequently occlusions include pedestrian occluding each other and static occlusion with traffic sign. As expected, the two-step association optimization with sparse-based appearance model and rank-based motion model can increase the tracking performance, this is because the sparse based appearance reconstruction is used to count the missing frames, and the rank based motion model use useful tracklet information to calculate the tracklet confidence during association optimization. The MT and MOTA scores of the proposed method are 79% and 71.9%, respectively. Some of the online trackers, such as RMOT, SCEA, CMOT and TSML, have poor tracking performance with MT 9.5%, 0, 71.62% and 31%, respectively. The MOTA scores are 37.2%, 21.1%, 70.12% and 59.7%, respectively. Meanwhile, the offline trackers, such as CEM, KSP, DTLE, DP_NMS, JPDA_100, H2T and MTTCEM also not have satisfied tracking performance. The MT scores for those offline trackers are all less than 40%, the MOTA scores are smaller than 65%. The good tracking performance of the proposed method shows that the rank based motion affinity model in our work, which takes the high order object states to guide association optimization during tracking, is different from the traditional motion models like Kalman or Markov model. It is beneficial to deal with the ambiguities caused by occlusions or detection failures.
c: High-Density Sequence
The S2.L3 sequence is a high crowded pedestrian sequence contains occlusions, shadows and illumination changes. The proposed method achieves superior performance comparing with other competing methods with higher scores in MT, MOTA and lower scores in ML, IDs and FM. Our low IDs indicates that our method can effectively overcome the tracking fragments, proves the effectiveness and robustness of the improved association affinity model. The improved association affinity, consisted by the sparse appearance representation and the rank based motion affinity, aims to depress the ambiguous detection responses in online tracking. The sparse reconstruction error plays an occlusion indicator in our work. In addition, we devise the incremental SVD and downdate SVD decomposition method to calculate the rank of the Hankel matrix in rank based motion model. It considers the noise in the trajectory, and achieves the noiseless trajectory information of a target in the rank estimation process. The cleaned trajectory information is used to predict object state for next frame, this is useful to eliminate ambiguous caused by similar appearance of the targets and occlusions. These strategies are helpful for reducing the link switches of the association between tracklets, hence, our method has relatively low IDs.
2) Evaluation on the TUD Dataset
Some exemplars of the tracking results for the TUD Stadtmitte, crossing and campus sequences in the TUD datasets are shown in Fig. 3. The quantitative evaluation results are shown in Table 2. From the results, we can see that our method has a superior performance. Compared with the state-of-the-art trackers, the proposed method has higher scores in MOTA and MT, lower scores in FN, FP, IDs and FM. The good performance for our work is owned to the proposed affinity association model, which is calculated by fusing the sparse-based appearance affinity model and a discriminative dynamic motion affinity model. Moreover, the proposed rank-based dynamic motion affinity model, which is different from the general rank calculation method, is based on the incremental SVD and downdate SVD decomposition method. This is specially devised for online tracking because of the data explosion with time from stream data or it is too large to be stored or even buffered. Then the rank-based motion affinity model is used to distinguish targets with similar appearance and deal with noise. By using the noiseless high order trajectory information to build dynamic motion model of the target, it is beneficial to recover fragmented tracklets and maintain target identities.
The tracking results of the proposed method on TUD Dataset sequences. At each frame, objects are attached with bounding boxes and ID labels with different colors.
3) Evaluation on the PET2005 Dataset
We compare our algorithm with 15 state-of-the-art MOT trackers in 2D MOT 2015 dataset, including 8 online trackers and 7 offline tracking. In addition, 4 deep learning methods are included to compare with our MOT tracker. Some exemplars of the tracking results for MOT2015 dataset are shown in Fig. 4. The quantitative evaluation results are shown in Table 3. As shown in Table 3, our tracker achieves the best performance in MT and ML and the second best MOTA and IDs with 32.14% and 438, respectively. The good tracking performance of the proposed method indicates that the association optimization model in our method is more effective than other trackers. Our two-step association optimization method pays more attention on how to accurately compute the tracklet confidence in order to bipartite tracklets into high confidence and low confidence tracklets. The tracklet confidence computation is guided by rank based motion estimation to analysis the useful history trajectory information and relied on the sparse reconstruction to analysis the missing frames caused by occlusion or detection failure. This is helpful for eliminating ambiguous association, recovering fragment tracklets being tracked and preventing objects identities from switches. Meanwhile, the order of history trajectory is online determined based on the rank estimation method, which is different from the state-of-the-art methods. In our work, the incremental SVD and downdate SVD decomposition method is specially devised for online tracking by considering new data and forgetting ‘too old’ data. Both those strategies are helpful for eliminating ambiguous association and preventing objects identities from switches.
The tracking results of the proposed method on MOT2015 Dataset sequences. At each frame, objects are attached with bounding boxes and ID labels with different colors.
4) Ablation Analysis
For better exploration the contribution of each part in our proposed method, we have conducted ablation studies to validate the effectiveness of the sparse based appearance affinity model and the rank-based motion affinity model in 2D MOT 2015 training set. The 2D MOT 2015 training set is split into a training set with 5 sequences and a validation set with 6 sequences. In terms of the rank-based motion affinity model, we divide it into the higher order motion model based fully SVD decomposition and the improved rank based motion model based incremental SVD and downdate SVD decomposition.
First, we use color histogram to calculate the appearance affinity in Eq. (1) and denotes it as P1 tracker. Then, to validate how the motion affinity in our work influenced the tracking performance, the motion affinity model in Eq. (10) is firstly replaced by the Markov model, donates it as P2 tracker. Then, the fully SVD decomposition is used to estimate the rank of Hankel matrix in Eq. (12) instead of the incremental SVD and downdate SVD decomposition in our work, we denote it as P3 tracker.
Table 4 shows the evaluation results of P1-P3 trackers and the proposed method. As expected, the proposed method has an improved performance than P1 and P2 trackers and has a comparable tracking result with P3. The poor performance for P1 tracker is caused by the appearance representation. Since the proposed method not only use the sparse representation to construct appearance affinity in Eq.(1), but also rely on the sparse reconstruction to analysis the occlusion, the appearance affinity model build by sparse representation is more robust than the color histogram as in P1 tracker. Hence, an apparent decline of MOTA and MT are witnessed in P1 tracker. Similar effect is happened to P2 tracker, in addition, the number of IDs is increased in P2 tracker since the Markov motion model is used to replace the high order motion in our work. The reason for this phenomenon is that the limit history states information is used to estimate the target states. When the occlusion or miss detection occurs, seldom reliable information is available to predict object state, especially frequently occlusions or ambiguities occur in long period of time. As a result, the association affinity model calculated based on the confused object state, has no ability to distinguish object from similar appearance or occlusion. Hence, it easily cause identity switches. Contrast, the higher order motion model in our work, different from the Markov motion model, it can use the history reliable information to guide the tracker prediction object states, especially in situation where no detection is available due to missing detections or occlusions. The proposed motion affinity model allows the tracking system correctly maintaining identities of the tracked targets and recovers trackets from occlusion, which guarantees our algorithm with low IDs. Table 4 supports this analysis. Furthermore, by comparing our method with P3 tracker, as shown in Table 4, the performance of our approximation algorithm, using incremental SVD and downdate SVD decomposition to compute the motion affinity, is not fade the tracking performance. The MOTA, MT and IDs for the proposed method are slightly poorer than P3 tracker, which imply that the tracking performance in our work is seldom influenced by the approximation algorithm.
5) Runtime Performance
We implement the proposed online MOT algorithm on MATLAB with an Intel Core i7 8GHz PC without any optimization and parallel programming. In the experiment, given the detection responses, we present the average runtime performance for trackers by averaging 5 times for all the evaluated sequences. The results are shown in Table 5 and Fig.5. The velocity of the trackers is measured by frame-per-second (FPS). From Table 5 and Fig. 5, we can see that the proposed method performs well in running time, which has a comparative result with the state-of-the-art methods in MOT. Our average velocity reaches 17.02 FPS, which is able to satisfy many tracking applications.
The performances compare between the proposed method and several competing trackers. Each marker denotes a tracker accuracy and speed measured in FPS, the higher and more right is better.
Conclusion
In this paper, we propose an online multi-object tracking algorithm that aims to gain the tracking performance by building an optimized data association framework and association affinity model. The data association task is performed by two step association based on the tracklet confidence, which relies on the improved association affinity model. The association affinity model, consisted by the sparse appearance representation and the rank based motion affinity, aims to depress the ambiguous detection responses in online tracking. The sparse reconstruction error plays an occlusion indicator in our work. We have devised the incremental SVD and downdate SVD decomposition method to calculate the rank of the Hankel matrix in rank based motion model. It considers the noise in the trajectory, and achieves the noiseless trajectory information of a target in the rank estimation process. Experimental results on three challenging datasets show that the proposed method has a good tracking performance.