Introduction
As a new Earth observation technology, satellite video provides continuous observation of an area over a period of time and therefore captures rich dynamic information about an object, such as its trajectory, speed, and direction. Satellite video is important for numerous applications, such as space-based surveillance [1], traffic monitoring, and disaster rescue.
As an important task based on satellite videos, small and dim moving object detection (MOD) has attracted increasing attention in recent years. However, this task is highly challenging for several reasons.
Low Spatial Resolution: Due to the long distance between a target and the imaging platform, the object is extremely small. Besides, the appearance of objects changes significantly between consecutive frames.
Large Field of View: Each frame in a satellite video is typically on the order of several to hundreds of megapixels, resulting in a large search space and illumination variation.
Heterogeneous Backgrounds and Complex Noise: Objects are usually immersed and densely packed in highly heterogeneous and complex backgrounds.
Statistical methods (e.g., the median (mean) model and the statistical model VIBE [2]) usually compare each video frame with an adaptive background model that is free of moving objects. Ahmadi et al. [3] employed a median background model to detect objects and used the nearest neighbor algorithm to produce trajectories. However, these statistical methods do not consider the structural knowledge of a video (e.g., temporal similarity of the background and spatial contiguity of the foreground). Consequently, their detection performance is limited, especially in complex and dynamic backgrounds.
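The background-comparison idea behind these statistical methods can be illustrated with a minimal sketch. It assumes a per-pixel temporal median as the background model and a fixed absolute-difference threshold; the function name `median_background_detect`, the threshold value, and the toy video are hypothetical and are not taken from [2] or [3].

```python
import numpy as np

def median_background_detect(frames, threshold):
    """Flag pixels that deviate from a per-pixel temporal median background.

    frames: array of shape (num_frames, height, width), grayscale intensities.
    threshold: hypothetical absolute-difference threshold (same units as frames).
    """
    frames = np.asarray(frames, dtype=np.float64)
    # The temporal median is the background estimate; it suppresses objects
    # that occupy a given pixel in fewer than half of the frames.
    background = np.median(frames, axis=0)
    # Foreground mask: pixels whose intensity differs strongly from the background.
    masks = np.abs(frames - background) > threshold
    return background, masks

# Toy video: flat background of 10 with one bright pixel moving along a row.
video = np.full((5, 4, 6), 10.0)
for t in range(5):
    video[t, 2, t] = 50.0

background, masks = median_background_detect(video, threshold=20.0)
```

Because each pixel is compared only with its own temporal median, a model of this kind uses neither the low-rank structure of the background nor the spatial contiguity of the foreground, which is the limitation noted above.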
To address this limitation, RPCA [4], [5], [7] was introduced to encode the temporal similarity of video backgrounds together with useful prior structures of the foreground (e.g., sparsity and spatial continuity). Pflugfelder et al. [6] and Zhang et al. [8] proposed several methods based on the low-rank and structured sparse decomposition (LSD) framework [5] to achieve MOD in satellite videos. However, these matrix RPCA-based methods must convert a video with a natural 3-D structure into 2-D data, which can destroy the structural information and reduce the detection performance. In addition, these methods cannot achieve robust performance and fast processing speed in complex and highly heterogeneous backgrounds.
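Motivated by the work on exploiting spatial–temporal and structural information in [9], [10], we incorporate a spatial–temporal tensor with RPCA (tensor RPCA) and employ the weighted Schatten $p$-norm minimization (WSNM) [11] to estimate the low-rank background. The main contributions of this letter are summarized as follows.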
We introduce a tensor representation to preserve the spatial–temporal information of pixels within a satellite video.
We propose a tensor RPCA analysis framework with bounded noise and a generalized WSNM to separate objects from the background by estimating the low-rank components. In addition, we adopt tensor singular value decomposition (t-SVD) for efficient inference.
We employ the alternating direction method of multipliers (ADMM) to solve the low-rank component recovery problem in our tensor RPCA analysis framework. Extensive experiments demonstrate the superiority of our WSNM-STTN over state-of-the-art methods.
Proposed Model
A. Matrix Decomposition Model for MOD
The extended matrix decomposition model (E-LSD) [8] treats foreground detection as a decomposition and optimization problem, which can be defined as \begin{equation*} \mathbf{D} = \mathbf{B} + \mathbf{S} + \mathbf{E}.\tag{1}\end{equation*}
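Here, $\mathbf{D}$ is the observed video matrix, $\mathbf{B}$ is the low-rank background, $\mathbf{S}$ is the structured sparse foreground, and $\mathbf{E}$ is the noise.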
In E-LSD, an optimization problem is defined as \begin{align*} (\mathbf{B}^{*},\mathbf{S}^{*},\mathbf{E}^{*})=&\mathop{\arg\min}_{\mathbf{B},\mathbf{S},\mathbf{E}} \|\mathbf{B}\|_{*} + \lambda_{1}\|\mathbf{S}\|_{\ell_{1}/\ell_{\infty}} + \lambda_{2}\|\mathbf{E}\|_{F}^{2} \\&\text{s.t.}~\mathbf{D} = \mathbf{B} + \mathbf{S} + \mathbf{E}.\tag{2}\end{align*}
However, the matrix decomposition model cannot preserve the structural information of the input video. It also cannot make good use of the spatiotemporal correlation prior of the background or the spatiotemporal continuity of the foreground. In addition, E-LSD adopts convex nuclear norm minimization (NNM) to characterize the low-rank background, and NNM treats all singular values equally. As a result, the accuracy of the estimated low-rank component is reduced in highly noisy scenarios [11], [12], and the low-rank component shrinks too much, which is called the over-contraction problem [11].
B. Spatial–Temporal Tensor Model for MOD
Since a satellite video has a 3-D structure, extending matrix RPCA to tensor RPCA can address the aforementioned problems. We therefore propose a tensor RPCA analysis framework with bounded noise to preserve the structural information in a satellite video and exploit its interframe correlations. The problem of MOD in satellite videos can be formulated as \begin{equation*} \mathcal{D} = \mathcal{B} + \mathcal{T} + \mathcal{N}\tag{3}\end{equation*} where $\mathcal{D}$, $\mathcal{B}$, $\mathcal{T}$, and $\mathcal{N}$ denote the input video, background, target, and noise tensors, respectively.
To recover the low-rank component more accurately and better separate objects from the background, we incorporate WSNM [11] into the low-rank tensor approximation model. The principle of WSNM is to assign different weights to different singular values. For a matrix $\mathcal{X} \in \mathbb{R}^{m \times n}$ with singular values $\sigma_{i}$ and nonnegative weights $w_{i}$, the weighted Schatten $p$-norm is defined as \begin{equation*} {\left \|{ \mathcal {X} }\right \|_{w,{S_{p}}}} = {\left ({{\sum _{i = 1}^{\min \left \{{ {n,m} }\right \}} {w_{i}\sigma _{i}^{p}} } }\right)^{\frac {1}{p}}}\tag{4}\end{equation*}
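As a small numerical illustration of (4), the sketch below evaluates the weighted Schatten $p$-norm of a 2-D array from its singular values. The function name `weighted_schatten_p_norm` and the example matrix are hypothetical. With unit weights the expression reduces to the nuclear norm for $p = 1$ and to the Frobenius norm for $p = 2$, which gives a convenient sanity check.

```python
import numpy as np

def weighted_schatten_p_norm(X, weights, p):
    """Evaluate (sum_i w_i * sigma_i**p) ** (1/p) for a 2-D array X, as in (4)."""
    sigma = np.linalg.svd(X, compute_uv=False)  # singular values, nonascending
    weights = np.asarray(weights, dtype=np.float64)
    if weights.shape != sigma.shape:
        raise ValueError("need one weight per singular value")
    return float(np.sum(weights * sigma ** p) ** (1.0 / p))

X = np.diag([3.0, 4.0])                                       # singular values 4 and 3
nuclear = weighted_schatten_p_norm(X, np.ones(2), p=1.0)      # unit weights, p = 1
frobenius = weighted_schatten_p_norm(X, np.ones(2), p=2.0)    # unit weights, p = 2
weighted = weighted_schatten_p_norm(X, [1.0, 2.0], p=1.0)     # 1*4 + 2*3
```

In the last call the larger singular value receives the smaller weight, so the dominant (background) component is penalized less than it would be under NNM.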
In our model, we generalize the definition of WSNM to tensors as \begin{align*} \left \|{ \mathcal {B} }\right \|_{\mathcal {W},{S_{p}}}^{p}=&\frac {1}{L}{\sum _{i = 1}^{r} {\sum _{j = 1}^{n_{3}} {\left ({{\mathcal {W}\left ({{i,i,j} }\right){{\left ({{\bar {\mathcal {S}}\left ({{i,i,j} }\right)} }\right)}^{p}}} }\right)} } ^{\frac {1}{p}}} \tag{5}\\ \mathcal {W}\left ({{i,i,j} }\right)=&\frac {{C\sqrt {mn} }}{{\bar {\mathcal {S}}\left ({{i,i,j} }\right) + \varepsilon }}\tag{6}\end{align*}
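The reweighting rule in (6) can be sketched as follows. The sketch assumes that $\bar{\mathcal{S}}(i,i,j)$ are the singular values of the frontal slices obtained after a Fourier transform along the frame axis, as in the standard t-SVD, and that $m$ and $n$ are the frame height and width. The function names, the value of `C`, and the value of `eps` are placeholders rather than settings from the letter.

```python
import numpy as np

def fourier_singular_values(tensor):
    """Singular values of every frontal slice after an FFT along the third axis.

    Returns an array of shape (r, n3) with r = min(m, n); column j holds the
    nonascending singular values of the j-th Fourier-domain slice.
    """
    spectrum = np.fft.fft(tensor, axis=2)
    # np.linalg.svd acts on the last two axes, so move the frame axis first.
    slices = np.moveaxis(spectrum, 2, 0)               # shape (n3, m, n)
    sigma = np.linalg.svd(slices, compute_uv=False)    # shape (n3, r)
    return sigma.T

def reweight(sigma_bar, m, n, C=1.0, eps=1e-8):
    """Weights from (6): C * sqrt(m * n) / (sigma_bar + eps)."""
    return C * np.sqrt(m * n) / (sigma_bar + eps)

rng = np.random.default_rng(0)
video = rng.standard_normal((6, 5, 4))   # toy tensor: 6 x 5 pixels, 4 frames
sigma_bar = fourier_singular_values(video)
weights = reweight(sigma_bar, m=6, n=5)
```

Since the weights are inversely proportional to the singular values, large singular values are shrunk less than small ones, and the weights in each slice come out in nondescending order.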
Then, the overall framework can be formulated as \begin{align*} \min _{\mathcal {B},\mathcal {T},\mathcal {N}} \left \|{ \mathcal {B} }\right \|_{\mathcal {W},{\mathcal {S}_{p}}}^{p} + \lambda {\left \|{ \mathcal {T} }\right \|_{1}} + \beta \left \|{ \mathcal {N} }\right \|_{F}^{2} \quad \text {s.t.}~\mathcal {D} = \mathcal {B} + \mathcal {T} + \mathcal {N} \\\tag{7}\end{align*}
C. Solution of the Proposed Model
To solve the proposed model, we adopt ADMM [13] and the inexact augmented Lagrangian multiplier (IALM) [14]. The problem in (7) can be rewritten by IALM as \begin{align*} L\left ({{\mathcal {B},\mathcal {T},\mathcal {N},y,\mu } }\right)=&\left \|{ \mathcal {B} }\right \|_{ \mathcal {W},{ \mathcal {S}_{p}}}^{p} + \lambda {\left \|{ \mathcal {T} }\right \|_{1}} + \beta \left \|{ \mathcal {N} }\right \|_{F}^{2} \\&+ \left \langle{ {y,\mathcal {D}-\mathcal {B}-\mathcal {T}-\mathcal {N}} }\right \rangle \\&+ \frac {\mu }{2}\left \|{ {\mathcal {D}-\mathcal {B}-\mathcal {T}-\mathcal {N}} }\right \|_{F}^{2}\tag{8}\end{align*}
Updating $\mathcal{B}$ with other variables fixed, the formulation (8) can be defined as \begin{equation*} \mathcal{B}^{k+1} = \mathop{\arg\min}_{\mathcal{B}} \left\|\mathcal{B}\right\|_{\mathcal{W},\mathcal{S}_{p}}^{p} + \frac{\mu^{k}}{2}\left\|\mathcal{D} - \mathcal{B} - \mathcal{T}^{k} - \mathcal{N}^{k} + \frac{y^{k}}{\mu^{k}}\right\|_{F}^{2}.\tag{9}\end{equation*}
To solve the problem in (9), we incorporate the generalized soft-thresholding (GST) method [11] into tensor singular value thresholding (t-SVT) [15], [16]. Consequently, (9) can be rewritten as \begin{equation*} \mathcal{B}^{k+1} = \mathcal{D}_{\mathcal{W},\mathcal{S}_{p}(\mu^{k})^{-1}}\left(\mathcal{D} - \mathcal{T}^{k} - \mathcal{N}^{k} + \frac{y^{k}}{\mu^{k}}\right)\tag{10}\end{equation*}
where $\mathcal{D}_{\mathcal{W},\mathcal{S}_{p}(\mu^{k})^{-1}}(\cdot)$ denotes the GST-based t-SVT operator. It should be noted that the weights $w = [w_{1}, \ldots, w_{r}]$ are in nondescending order, and the singular values are in nonascending order: $\sigma_{1} \ge \sigma_{2} \ge \cdots \ge \sigma_{r}$.
Updating $\mathcal{T}$ with other variables fixed, the formulation can be defined as \begin{equation*} \mathcal{T}^{k+1} = \mathop{\arg\min}_{\mathcal{T}} \lambda\left\|\mathcal{T}\right\|_{1} + \frac{\mu^{k}}{2}\left\|\mathcal{D} - \mathcal{B}^{k+1} - \mathcal{T} - \mathcal{N}^{k} + \frac{y^{k}}{\mu^{k}}\right\|_{F}^{2}.\tag{11}\end{equation*}
The problem in (11) is a typical $l_{1}$ regularized minimization problem. Therefore, we can obtain the overall optimal solution through an elementwise shrinkage operation [17] \begin{equation*} \mathcal{T}^{k+1} = \mathcal{F}_{\lambda/\mu^{k}}\left(\mathcal{D} - \mathcal{B}^{k+1} - \mathcal{N}^{k} + \frac{y^{k}}{\mu^{k}}\right)\tag{12}\end{equation*}
where $\mathcal{F}_{\lambda/\mu^{k}}(\cdot)$ represents the elementwise shrinkage operator.
Updating $\mathcal{N}$ with other variables fixed, the formulation can be defined as \begin{equation*} \mathcal{N}^{k+1} = \mathop{\arg\min}_{\mathcal{N}} \beta\left\|\mathcal{N}\right\|_{F}^{2} + \frac{\mu^{k}}{2}\left\|\mathcal{D} - \mathcal{B}^{k+1} - \mathcal{T}^{k+1} - \mathcal{N} + \frac{y^{k}}{\mu^{k}}\right\|_{F}^{2}.\tag{13}\end{equation*}
The solution of the above problem can be obtained by \begin{equation*} \mathcal{N}^{k+1} = \frac{\mu^{k}\left(\mathcal{D} - \mathcal{B}^{k+1} - \mathcal{T}^{k+1}\right) + y^{k}}{2\beta + \mu^{k}}.\tag{14}\end{equation*}
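The two closed-form updates (12) and (14) are simple enough to check numerically. The sketch below assumes the usual soft-thresholding form of the elementwise shrinkage operator, $\mathrm{sign}(x)\max(|x| - \tau, 0)$, and verifies that the noise update makes the gradient of the objective in (13) vanish. The function names and all numeric values are illustrative only.

```python
import numpy as np

def soft_threshold(x, tau):
    """Elementwise shrinkage: sign(x) * max(|x| - tau, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def update_noise(residual, y, mu, beta):
    """Closed-form minimizer (14); residual stands for D - B - T."""
    return (mu * residual + y) / (2.0 * beta + mu)

rng = np.random.default_rng(1)
residual = rng.standard_normal((3, 3, 2))
y = rng.standard_normal((3, 3, 2))
mu, beta, lam = 0.5, 2.0, 0.3   # hypothetical parameter values

noise = update_noise(residual, y, mu, beta)
# Gradient of beta*||N||_F^2 + (mu/2)*||residual - N + y/mu||_F^2 at the update.
gradient = 2.0 * beta * noise - mu * (residual - noise + y / mu)

# Shrinkage with threshold lam / mu = 0.6, as used in (12).
shrunk = soft_threshold(np.array([-1.0, -0.2, 0.0, 0.5, 2.0]), lam / mu)
```

Setting the gradient $2\beta\mathcal{N} - \mu^{k}(\mathcal{D} - \mathcal{B}^{k+1} - \mathcal{T}^{k+1} - \mathcal{N} + y^{k}/\mu^{k})$ to zero and solving for $\mathcal{N}$ gives exactly (14), which is what the `gradient` array confirms.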
Updating the multipliers $y$ with other variables fixed: \begin{equation*} y^{k+1} = y^{k} + \mu^{k}\left(\mathcal{D} - \mathcal{B}^{k+1} - \mathcal{T}^{k+1} - \mathcal{N}^{k+1}\right).\tag{15}\end{equation*}
Updating $\mu^{k+1}$ by the following equation: \begin{equation*} \mu^{k+1} = \min\left(\rho\mu^{k}, \mu_{\max}\right).\tag{16}\end{equation*}
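Algorithm 1 Process of WSNM-STTN
Input: satellite video frames, number of frames $L$, tuning parameters
Initialization: transform the frames into the tensor $\mathcal{D}$
Update $\mathcal{B}$ by (10)
Update $\mathcal{T}$ by (12)
Update $\mathcal{N}$ by (14)
Update multipliers $y$ by (15)
Update $\mu$ by (16)
Check the convergence conditions
Update $k = k + 1$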
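The loop in Algorithm 1 can be sketched end to end as below. This is an illustrative reading of (9)-(16) under several assumptions, not the authors' implementation: the tensor norm is taken as the weighted sum of $\bar{\mathcal{S}}^{p}$ over Fourier-domain slices, so that each singular value is thresholded with weight $\mathcal{W}/\mu^{k}$; the weights in (6) are recomputed from the tensor being thresholded; GST follows the fixed-point iteration commonly used with WSNM; and the stopping rule is a relative residual test. All function names, default parameters, and the toy video are hypothetical.

```python
import numpy as np

def gst(sigma, weight, p, iters=10):
    """Generalized soft-thresholding for 0 < p <= 1: approximately minimize
    0.5 * (sigma - x)**2 + weight * x**p elementwise over x >= 0."""
    base = 2.0 * weight * (1.0 - p)
    tau = base ** (1.0 / (2.0 - p)) + weight * p * base ** ((p - 1.0) / (2.0 - p))
    active = sigma > tau                      # entries at or below tau become zero
    x = np.where(active, sigma, 1.0)          # 1.0 is a harmless placeholder
    for _ in range(iters):                    # fixed-point iteration on active entries
        x = np.where(active, sigma - weight * p * x ** (p - 1.0), 1.0)
    return np.where(active, x, 0.0)

def wsnm_tsvt(A, mu, p, C, eps=1e-8):
    """B-update (10): threshold the singular values of each Fourier-domain slice."""
    m, n, n3 = A.shape
    A_hat = np.fft.fft(A, axis=2)
    B_hat = np.empty_like(A_hat)
    for j in range(n3):
        U, s, Vh = np.linalg.svd(A_hat[:, :, j], full_matrices=False)
        w = C * np.sqrt(m * n) / (s + eps)    # weights from (6), assumed per slice
        B_hat[:, :, j] = (U * gst(s, w / mu, p)) @ Vh
    return np.real(np.fft.ifft(B_hat, axis=2))

def wsnm_sttn(D, lam, beta, p=0.8, C=1.0, mu=1e-2, rho=1.5,
              mu_max=1e7, tol=1e-7, max_iter=200):
    """Alternate the updates (10), (12), (14), (15), and (16)."""
    B, T, N, y = (np.zeros_like(D) for _ in range(4))
    for _ in range(max_iter):
        B = wsnm_tsvt(D - T - N + y / mu, mu, p, C)               # (10)
        R = D - B - N + y / mu
        T = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0.0)    # (12)
        N = (mu * (D - B - T) + y) / (2.0 * beta + mu)            # (14)
        residual = D - B - T - N
        y = y + mu * residual                                     # (15)
        mu = min(rho * mu, mu_max)                                # (16)
        if np.linalg.norm(residual) <= tol * np.linalg.norm(D):   # assumed stopping rule
            break
    return B, T, N

# Toy video: static random background, one moving bright pixel, weak noise.
rng = np.random.default_rng(0)
frames = 6
D = np.tile(rng.random((8, 8, 1)), (1, 1, frames))
for t in range(frames):
    D[2, t + 1, t] += 5.0
D += 0.01 * rng.standard_normal(D.shape)
B, T, N = wsnm_sttn(D, lam=0.5, beta=1.0)
```

In this sketch the penalty $\mu$ grows geometrically, so the constraint $\mathcal{D} = \mathcal{B} + \mathcal{T} + \mathcal{N}$ is enforced progressively more tightly. A detection map would then be obtained by thresholding $|\mathcal{T}|$, a step that this section does not specify.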
Experimental Results and Analysis
A. Dataset and Metrics
We evaluated the proposed WSNM-STTN on nine satellite video datasets (as listed in Table I). The first two videos (i.e., Video 001 and Video 002) were captured by SkySat. Their spatial resolution is 1.0 m, and their frame rate is 30 frames per second (FPS). Videos 003–009 are provided by Chang Guang Satellite Technology Company Ltd. Their spatial resolution is 1.0 m, and their frame rate is 10 FPS. All these datasets mainly cover traffic scenarios in urban areas. Note that MOD in Videos 003–009 is challenging due to the complex backgrounds. In contrast, the backgrounds of Videos 001 and 002 captured by SkySat are mainly composed of roads, which makes good detection performance relatively easy to achieve. In our experiments, moving cars are selected as the targets of interest.
We use three evaluation metrics: precision, recall, and $F_{1}$ score.
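These metrics are conventionally computed from counts of true positives, false positives, and false negatives. The sketch below assumes that detections have already been matched to ground-truth objects and takes the third metric to be the $F_{1}$ score, i.e., the harmonic mean of precision and recall, which is an assumption here. The function name and the counts are hypothetical and are not results from the experiments.

```python
def detection_scores(true_positives, false_positives, false_negatives):
    """Precision, recall, and F1 from matched-detection counts."""
    detected = true_positives + false_positives   # everything the detector reported
    actual = true_positives + false_negatives     # everything present in the ground truth
    precision = true_positives / detected if detected else 0.0
    recall = true_positives / actual if actual else 0.0
    total = precision + recall
    f1 = 2.0 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical counts for illustration only.
precision, recall, f1 = detection_scores(
    true_positives=90, false_positives=10, false_negatives=30
)
```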
B. Parameter Setting
In the proposed WSNM-STTN algorithm, parameters are properly set to achieve good object detection performance. The regularized parameter
Recall, precision, and
C. Comparison With the State-of-the-Art Methods
We conduct extensive experiments to demonstrate the robustness of our method to various scenarios in real applications:
1) SkySat Dataset:
To test the effectiveness of our WSNM-STTN on SkySat satellite videos, following [6] and [8], we compare our method with five batch-based state-of-the-art approaches (i.e., RPCA [19], GoDec [4], DECOLOR [7], LSD [5], and E-LSD [8]) and one state-of-the-art online approach (i.e., O-LSD [6]). As shown in Tables II and III, the WSNM-STTN method achieves the highest overall performance among these batch and online methods, with an average
2) Jilin-1 Dataset:
To test the effectiveness of our WSNM-STTN method on Jilin-1 satellite videos, we compare our method with three batch RPCA-based state-of-the-art approaches (i.e., GoDec [4], DECOLOR [7], and E-LSD [8]) and three statistical modeling-based methods (i.e., MDTT [3], VIBE [2], and D&T [18]). As shown in Table IV, WSNM-STTN achieves the highest overall performance among these methods, with an average precision of 0.90 and an average
In summary, the proposed WSNM-STTN model can achieve robust performance and fast processing in complex and highly heterogeneous backgrounds.
D. Ablation Study
We have demonstrated the effectiveness of introducing bounded noises
Conclusion
In this letter, we propose a WSNM-STTN model to detect dim and small moving objects in satellite video. With the STTN model, the proposed method can exploit the temporal information within a sequence. Besides, we propose an extended tensor RPCA with bounded noise and incorporate WSNM to address the over-shrinkage problem in low-rank estimation, which makes the model superior to noiseless modeling methods. Then, we optimize our model by ADMM to detect objects. Extensive experiments show that WSNM-STTN can achieve a high detection rate and a low false alarm rate under complex backgrounds with heavy noise. In addition, WSNM-STTN converges faster than the matrix decomposition approach by a large margin.