Introduction
The analysis of temporal image series is necessary and important in many remote sensing applications, such as vegetation/crop monitoring and estimation [1], evapotranspiration estimation [2], atmosphere monitoring [3], land-cover/land-use change detection [4], surface dynamic mapping [5], ecosystem monitoring [6], soil water content analysis [7], and detailed analysis of human–nature interactions [8]. These applications require time series of high-spatial-resolution (HR) images to properly model the ground surface. In addition, time series of high-temporal-resolution images are also needed to capture the changes in the ground surface that occur over short periods of time.
However, there is a tradeoff between the temporal and spatial resolution of satellite sensors, and no single sensor can satisfy both requirements simultaneously. For example, the Landsat sensors can acquire images with an HR of 30 m, but they have a revisit period of up to 16 days. On the other hand, the Moderate Resolution Imaging Spectroradiometer (MODIS) sensors can acquire images of the same scene at least once per day, but the images are at a low spatial resolution (LR) of 500 m [9]. Therefore, the simultaneous acquisition of image series of high spatial and high temporal resolution is a major challenge in the remote sensing community [10]. A simple solution to this challenge is to perform super-resolution on the corresponding single LR image to estimate the unobserved HR image [11], [12]. However, this is very difficult because the spatial resolution gap between the two satellite images is often quite large.
Spatiotemporal (ST) fusion addresses this challenge by utilizing pairs of HR and LR images taken on reference dates that are temporally close to the target date. Specifically, the unobserved HR image on the target date is estimated by combining detailed spatial structure extracted from the HR images on the reference dates and spectral changes captured from the differences between the LR images on the reference and target dates. In ideal situations, where a large number of reliable reference images are available, achieving accurate ST fusion would be straightforward because the correct spatial structure and spectral changes are readily available. In real-world applications, however, such situations are very rare. Therefore, to achieve the desired ST fusion in real-world applications, the following two requirements are important.
Minimum Reference Dates: ST fusion methods that require minimum reference dates are preferred. In many remote sensing applications, only one pair of images on a reference date may be available due to cloud contamination, inconsistencies in image acquisition timing, or other factors. In addition, preparing another pair of images can be time-consuming. Therefore, ST fusion methods using a single pair of HR and LR images on a reference date apply to a wider range of cases than those using multiple pairs, although such a situation is obviously challenging [13].
Robustness to Noise: Due to the measurement equipment and/or environment, satellite images are often contaminated with various types of noise, such as random noise, outliers, and missing values [14], [15]. An estimated HR image that remains noisy would significantly affect subsequent processing. Therefore, it is imperative to develop noise-robust ST fusion methods.
A. Prior Research
Many ST fusion methods have been proposed over the past decades. They are generally categorized into five groups [16]: unmixing-based, weight-function-based, Bayesian-based, learning-based, and hybrid methods. Unmixing-based methods estimate the pixel values of an HR image by unmixing the pixels of the input LR images based on the linear spectral mixing theory [17], [18]. Weight-function-based methods generate an HR image by combining information of all the input images based on weight functions [9], [19]. Bayesian-based methods use Bayesian estimation theory to fuse the input images in a probabilistic manner [20], [21]. In the Bayesian framework, ST fusion can be considered a maximum a posteriori (MAP) problem, that is, the goal is to obtain an HR image on the target date by maximizing its conditional probability relative to the input HR and LR images. Learning-based methods model the relationship between observed HR and LR image pairs and then predict an unobserved HR image using machine-learning algorithms such as sparse representation learning [22], [23], random forest [24], and neural networks [25], [26], [27], [28], [29], [30], [31]. Hybrid methods integrate two or more techniques from the above four categories [32], [33].
Some of the unmixing-based, weight-function-based, and Bayesian-based methods allow an arbitrary number of reference dates. In other words, they can handle the cases with a single reference date. However, they are sensitive to noise because they estimate an HR image at the pixel level based entirely on reference images, which may be noisy. Thus, if the input images are noisy, the output image will be severely degraded.
On the other hand, in the context of learning-based methods, a robust ST fusion network (RSFN) [28] has been established to account for Gaussian noise. RSFN automatically filters noise and prevents predictions from being corrupted by using convolutional neural networks (CNNs), generative adversarial networks (GANs), and an attention mechanism. Specifically, the RSFN uses the attention mechanism to ignore noisy pixels in two reference HR images and focus on clean pixels. This method only works in situations where noisy pixels in the two reference HR images do not appear at the same location. In real-world measurements, however, such situations are rare because noise generally contaminates the entire image, not just parts of it. Furthermore, as mentioned above, RSFN requires two reference dates.
B. Contributions and Paper Organization
Now, a natural question arises: how can we achieve robust ST fusion with only a single reference date? In this article, we propose a novel ST fusion framework for satellite images, named robust optimization-based ST fusion (ROSTF), which estimates an HR image on the target date while simultaneously denoising an HR image on a single reference date.
Before formulating our optimization problem, we define two new observation models (they will be detailed in Section III-A).
The first model describes the relationship between an observed noisy image and the oracle noiseless image. The model is designed under the assumption that the observed image is not only contaminated with random noise but also with outliers and missing values. Specifically, random noise is modeled by Gaussian noise while outliers and missing values are modeled by sparse noise.
The second model represents the relationship between an HR image and the corresponding LR image, based on a super-resolution model [34].
We also introduce the following two assumptions about satellite images (they will be detailed in Section III-B).
The reflectance may change significantly between the reference and target dates, but the land structure (the locations of the edges) does not. This is a very natural assumption in the context of ST fusion.
An HR image and the corresponding LR image have similar brightness. This assumption is necessarily true if the HR and LR sensors have similar spectral resolutions, as is the case with sensors like Landsat and MODIS [9].
Based on the observation models and assumptions, we formulate the fusion problem as a constrained convex optimization problem. Subsequently, we develop an efficient algorithm based on the preconditioned primal–dual splitting method (P-PDS) [35] with an operator-norm-based design method of variable-wise diagonal preconditioning, named OVDP [36], which can automatically determine the appropriate stepsizes for solving the optimization problem.
The main contributions of the article are given as follows.
Robustness to Random Noise, Outliers, and Missing Values: As described above, no existing ST fusion methods can handle random noise, outliers, and missing values, although this type of noise contaminates satellite images due to the measurement equipment and environment. Thanks to the formulation that incorporates the first observation model developed in Section III-A, ROSTF is robust to such mixed noise.
Single Reference Date: Assumption 1) is very simple but effective for ST fusion. By incorporating this assumption as a constraint in the optimization problem, we provide a mechanism that encourages the estimated HR image on the target date and the denoised HR image on the reference date to have edges at similar locations. Thanks to this mechanism, ROSTF performs well even with a single reference date.
Facilitation of Parameter Adjustment: The objective function of the optimization problem of ROSTF consists only of image regularization terms to promote spatial piecewise smoothness. The other components, corresponding to data fidelity based on the two observation models and our two assumptions, are imposed as hard constraints. Such a formulation using constraints instead of adding terms to the objective function has the advantage of simplifying parameter setting [37], [38], [39], [40], [41], [42]: the appropriate parameters in the constraints do not depend on each other and can be determined independently for each constraint.
Automatic Stepsize Adjustment: To solve our optimization problem for ST fusion, we develop an efficient algorithm based on P-PDS with OVDP. The appropriate stepsizes of the standard PDS [43] (and most other optimization methods) would be different depending on the problem structure, which means that we have to adjust them manually. On the other hand, P-PDS with OVDP can automatically determine the appropriate stepsizes based on the problem structure, and thus our algorithm is free from such a troublesome task.
This paper is organized as follows. We first cover the mathematical preliminaries of our method in Section II and then proceed to the establishment of our method in Section III. In Section IV, we demonstrate the performance of ROSTF and the effectiveness of each key component of ROSTF through comparative experiments and ablation studies, respectively. Experimental results show that ROSTF performs comparably to several state-of-the-art ST fusion methods for noiseless images and outperforms them for noisy images, thanks to the effective work of each key component.
The preliminary version of this work, without considering sparse noise, mathematical details, comprehensive experimental comparison, deeper discussion, or implementation using P-PDS with OVDP, has appeared in conference proceedings [44].
Preliminaries
A. Notations
Let $S_{\alpha }^{\omega }$ denote the hyperslab defined as \begin{equation*} S_{\alpha }^{\omega }:= \left \{{ \mathbf {z} \,|\, | \omega - \mathbf {1}^{\top }\mathbf {z} | \leq \alpha }\right \}. \tag{1}\end{equation*}
The $\ell _{p}$-ball with center $\mathbf {c}$ and radius $\varepsilon $ is defined as \begin{equation*} B_{p, \varepsilon }^{\mathbf {c}}:= \left \{{ \mathbf {z} | \| \mathbf {z} - \mathbf {c} \|_{p} \leq \varepsilon }\right \}. \tag{2}\end{equation*}
The indicator function of a nonempty closed convex set $C$ is defined as \begin{align*} \iota _{C}\left ({\mathbf {x}}\right):= \begin{cases} \displaystyle 0, & \mathrm {if} \, \mathbf {x} \in C, \\ \displaystyle \infty, & \mathrm {otherwise}. \end{cases} \tag{3}\end{align*}
B. Proximal Tools
The optimization problem of ROSTF, to be formulated in Section III-B, involves nonsmooth convex functions. To solve such a problem, we introduce the proximity operator. For a proper lower-semicontinuous convex function $f$ and an index $\gamma > 0$, the proximity operator of $\gamma f$ is defined as \begin{align*} \mathrm {prox}_{\gamma f}: \mathbb {R}^{NB} \rightarrow \mathbb {R}^{NB}: \mathbf {x} \mapsto \mathop {\mathrm{ argmin}}\limits _{\mathbf {y} \in \mathbb {R}^{NB}} f\left ({\mathbf {y}}\right) + \frac {1}{2\gamma }\|\mathbf {x} - \mathbf {y}\|_{2}^{2}. \tag{4}\end{align*}
The Fenchel–Rockafellar conjugate function $f^{\ast}$ of $f$ is defined as \begin{equation*} f^{\ast}\left ({\mathbf {y}}\right):= \sup _{\mathbf {x}\in \mathbb {R}^{NB}} \left \{{ \langle \mathbf {x},\mathbf {y}\rangle - f\left ({\mathbf {x}}\right) }\right \}. \tag{5}\end{equation*}
The proximity operator of $\gamma f^{\ast}$ can be computed from that of $f$ via \begin{equation*} \mathrm {prox}_{\gamma f^{\ast}}\left ({{\mathbf x}}\right) = {\mathbf x}- \gamma \mathrm {prox} _{\frac {1}{\gamma }f}\left ({\tfrac {1}{\gamma } {\mathbf x}}\right). \tag{6}\end{equation*}
Below, we present the specific proximity operators of the functions used in this article. The proximity operator of the mixed $\ell _{1,2}$ norm is given by \begin{equation*} \left [{\mathrm {prox}_{\gamma \|\cdot \|_{1,2}}\left ({\mathbf {x}}\right)}\right]_{b,n} = \max \left \{{ 1 - \frac {\gamma }{\sqrt {\sum _{b'=1}^{B} |x_{b',n}|^{2}}},0 }\right \} x_{b,n}. \tag{7}\end{equation*}
The proximity operator of the indicator function of the hyperslab, that is, the metric projection onto $S_{\alpha }^{\omega }$, is given by \begin{align*} \mathrm {prox}_{\gamma \iota _{S_{\alpha }^{\omega }}}\left ({\mathbf {x}}\right) = \begin{cases} \displaystyle \mathbf {x} + \frac {\eta _{1} - \mathbf {1}^{\top }\mathbf {x}}{NB}\mathbf {1}, \; \; & \mathrm {if} \, \mathbf {1}^{\top }\mathbf {x} < \eta _{1}, \\ \displaystyle \mathbf {x} + \frac {\eta _{2} - \mathbf {1}^{\top }\mathbf {x}}{NB}\mathbf {1}, \; \; & \mathrm {if} \, \mathbf {1}^{\top }\mathbf {x} > \eta _{2}, \\ \displaystyle \mathbf {x}, & \mathrm {otherwise} \end{cases} \tag{8}\end{align*} where $\eta _{1}:= \omega - \alpha $ and $\eta _{2}:= \omega + \alpha $.
The proximity operator of the indicator function of the $\ell _{2}$-ball, that is, the metric projection onto $B_{2, \varepsilon }^{\mathbf {c}}$, is given by \begin{align*} \mathrm {prox}_{\gamma \iota _{B_{2, \varepsilon }^{\mathbf {c}}}}\left ({\mathbf {x}}\right) = \begin{cases} \displaystyle \mathbf {x}, & \mathrm {if} \, \mathbf {x} \in B_{2, \varepsilon }^{\mathbf {c}}, \\ \displaystyle \mathbf {c} + \frac {\varepsilon \left ({\mathbf {x} - \mathbf {c}}\right)}{\| \mathbf {x} - \mathbf {c} \|_{2}}, & \mathrm {otherwise.} \end{cases} \tag{9}\end{align*}
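The operators in (6)-(9) are each a few lines of code. The following NumPy sketch illustrates them; the function names and the (bands, pixels) array layout are our own choices for illustration, not taken from the ROSTF implementation.

```python
import numpy as np

def prox_l12(x, gamma):
    """Group soft-thresholding: prox of the mixed l1,2 norm, cf. (7).
    x has shape (B, N); each pixel's B-vector forms one group."""
    norms = np.sqrt((x ** 2).sum(axis=0, keepdims=True))
    scale = np.maximum(1.0 - gamma / np.maximum(norms, 1e-12), 0.0)
    return scale * x

def proj_hyperslab(x, omega, alpha):
    """Projection onto S_alpha^omega = {z : |omega - 1^T z| <= alpha}, cf. (8)."""
    s = x.sum()
    lo, hi = omega - alpha, omega + alpha
    if s < lo:
        return x + (lo - s) / x.size
    if s > hi:
        return x + (hi - s) / x.size
    return x.copy()

def proj_l2ball(x, c, eps):
    """Projection onto the l2-ball B_{2,eps}^c, cf. (9)."""
    d = x - c
    nrm = np.linalg.norm(d)
    return x.copy() if nrm <= eps else c + eps * d / nrm

def prox_conjugate(x, gamma, prox_f):
    """Moreau identity (6): prox of gamma*f^* from prox of f.
    prox_f(v, tau) must return prox_{tau*f}(v)."""
    return x - gamma * prox_f(x / gamma, 1.0 / gamma)
```

For example, `prox_conjugate(x, 1.0, prox_l12)` projects each pixel's spectral vector onto the unit $\ell _{2}$ ball, which is the kind of dual-variable update that appears in the algorithm of Section III-C.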
C. P-PDS With OVDP
The standard PDS [43] is a versatile and efficient proximal algorithm that can solve a wide class of nonsmooth convex optimization problems without using matrix inversion. However, it is troublesome to manually set the appropriate stepsizes of the standard PDS. Therefore, we adopt P-PDS [35] with OVDP [36], a method that automatically determines the appropriate stepsizes according to the problem structure.
Let us consider convex optimization problems of the form \begin{align*} \min _{\substack {\mathbf y_{1}, {\dots }, {\mathbf y}_{N} \\ {\mathbf z}_{1}, {\dots }, {\mathbf z}_{M}}} & \sum _{i=1}^{N}g_{i}\left ({{\mathbf y}_{i}}\right)+\sum _{j=1}^{M}h_{j}\left ({{\mathbf z}_{j}}\right), \\ {\mathrm{ s.t.}} & \begin{cases} \displaystyle {\mathbf z} _{1} = \sum \nolimits _{i=1}^{N} \mathbf {G}_{1,i} {\mathbf y}_{i}, \\ \displaystyle \vdots \\ \displaystyle {\mathbf z} _{M} = \sum \nolimits _{i=1}^{N} \mathbf {G}_{M,i} {\mathbf y}_{i} \end{cases} \tag{10}\end{align*} where $g_{i}$ and $h_{j}$ are proper lower-semicontinuous convex functions whose proximity operators are efficiently computable, and $\mathbf {G}_{j,i}$ are linear operators.
P-PDS solves (10) by the following iterations: \begin{align*} \begin{cases} \displaystyle \bar {\mathbf y}_{1}^{\left ({n}\right)} &\leftarrow {\mathbf y} _{1}^{\left ({n}\right)}-\gamma _{1,1}\sum \nolimits _{j=1}^{M}\mathbf {G}_{j,1}^{\top } {\mathbf z}_{j}^{\left ({n}\right)}, \\ \displaystyle {\mathbf y} _{1}^{\left ({n+1}\right)} &\leftarrow \mathrm {prox} _{\gamma _{1,1}g_{1}}\left ({\bar {\mathbf y}_{1}^{\left ({n}\right)}}\right), \\ \displaystyle &\vdots \\ \displaystyle \bar {\mathbf y}_{N}^{\left ({n}\right)} &\leftarrow {\mathbf y} _{N}^{\left ({n}\right)}-\gamma _{1,N}\sum \nolimits _{j=1}^{M}\mathbf {G}_{j,N}^{\top } {\mathbf z}_{j}^{\left ({n}\right)}, \\ \displaystyle {\mathbf y} _{N}^{\left ({n+1}\right)} &\leftarrow \mathrm {prox} _{\gamma _{1,N}g_{N}}\left ({\bar {\mathbf y}_{N}^{\left ({n}\right)}}\right), \\ \displaystyle \bar {\mathbf z}_{1}^{\left ({n}\right)} &\leftarrow {\mathbf z} _{1}^{\left ({n}\right)} + \gamma _{2,1}\sum \nolimits _{i=1}^{N}\mathbf {G}_{1,i}\left ({2 {\mathbf y}_{i}^{\left ({n+1}\right)}- {\mathbf y}_{i}^{\left ({n}\right)}}\right), \\ \displaystyle {\mathbf z} _{1}^{\left ({n+1}\right)} &\leftarrow \mathrm {prox} _{\gamma _{2,1}h_{1}^{\ast}}\left ({\bar {\mathbf z}_{1}^{\left ({n}\right)}}\right), \\ \displaystyle &\vdots \\ \displaystyle \bar {\mathbf z}_{M}^{\left ({n}\right)} &\leftarrow {\mathbf z} _{M}^{\left ({n}\right)} + \gamma _{2,M}\sum \nolimits _{i=1}^{N}\mathbf {G}_{M,i}\left ({2 {\mathbf y}_{i}^{\left ({n+1}\right)}- {\mathbf y}_{i}^{\left ({n}\right)}}\right), \\ \displaystyle {\mathbf z} _{M}^{\left ({n+1}\right)} &\leftarrow \mathrm {prox} _{\gamma _{2,M}h_{M}^{\ast}}\left ({\bar {\mathbf z}_{M}^{\left ({n}\right)}}\right) \end{cases}\end{align*}
The stepsizes $\gamma _{1,i}$ and $\gamma _{2,j}$ are determined by OVDP [36] as \begin{equation*} \gamma _{1,i} = \frac {1}{\sum _{j=1}^{M} \left \|{ \mathbf {G}_{j,i}}\right \|_{\mathrm {op}}^{2}},\quad \gamma _{2,j} = \frac {1}{N} \tag{11}\end{equation*}
where $\left \|{\mathbf {G}}\right \|_{\mathrm {op}}$ denotes the operator norm of a matrix $\mathbf {G}$, defined as \begin{equation*} \left \|{\mathbf {G}}\right \|_{\mathrm {op}}:= \sup _{\mathbf x\neq \mathbf {0}}\frac {\left \|{\mathbf {G} {\mathbf x}}\right \|_{2}} {\left \|{ {\mathbf x}}\right \|_{2}}. \tag{12}\end{equation*}
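To make the iteration pattern of (10)-(12) concrete, the following NumPy sketch applies it to a toy instance of our choosing: 1-D total-variation denoising, i.e., (10) with a single primal variable, $g(\mathbf {y}) = \frac {1}{2}\|\mathbf {y} - \mathbf {b}\|_{2}^{2}$, and $h(\mathbf {z}) = \lambda \|\mathbf {z}\|_{1}$ with $\mathbf {z} = \mathbf {D}\mathbf {y}$. The toy problem and all names are ours; only the update pattern and the OVDP stepsizes follow (10)-(12).

```python
import numpy as np

def pds_tv_denoise(b, lam, n_iter=500):
    """Toy P-PDS: min_y 0.5*||y - b||^2 + lam*||D y||_1 (1-D TV denoising)."""
    D = lambda y: np.diff(y)                                        # forward differences
    Dt = lambda z: np.concatenate(([-z[0]], -np.diff(z), [z[-1]]))  # its transpose
    gamma1 = 1.0 / 4.0  # OVDP (11): 1 / ||D||_op^2, using ||D||_op^2 <= 4
    gamma2 = 1.0        # OVDP (11): 1 / N, with N = 1 primal variable
    y = b.astype(float).copy()
    z = np.zeros(b.size - 1)
    for _ in range(n_iter):
        # primal step: prox of gamma1*g with g(y) = 0.5*||y - b||^2
        y_new = (y - gamma1 * Dt(z) + gamma1 * b) / (1.0 + gamma1)
        # dual step: prox of gamma2*h^* = projection onto [-lam, lam]
        z = np.clip(z + gamma2 * D(2.0 * y_new - y), -lam, lam)
        y = y_new
    return y
```

With `b` a piecewise-constant signal, the returned `y` is its TV-denoised version. The ROSTF solver in Section III-C follows exactly this pattern, with five primal and six dual blocks.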
Proposed Method
From now on, we will focus on cases with a single reference date. Specifically, we consider a situation where both HR and LR sensors observe the same scene on the single reference date, but on the target date, only the LR sensor observes that scene and not the HR sensor. Let the HR image on the reference date, the LR image on the reference date, and the LR image on the target date be denoted by $\mathbf {h}_{r}$, $\mathbf {l}_{r}$, and $\mathbf {l}_{t}$, respectively.
A. Observation Models
Let $\mathbf {h}$ and $\mathbf {l}$ be observed HR and LR images, and let $\widehat {\mathbf {h}}$ and $\widehat {\mathbf {l}}$ be the corresponding noiseless images. The first observation model relates them as \begin{align*} \mathbf {h} &= \widehat {\mathbf {h}} + \mathbf {n}_{h}+ \mathbf {s}_{h}, \\ \mathbf {l} &= \widehat {\mathbf {l}} + \mathbf {n}_{l}+ \mathbf {s}_{l} \tag{13}\end{align*} where $\mathbf {n}_{h}$ and $\mathbf {n}_{l}$ are Gaussian noise components, and $\mathbf {s}_{h}$ and $\mathbf {s}_{l}$ are sparse noise components modeling outliers and missing values.
On the other hand, the second observation model relates the noiseless HR and LR images through a super-resolution model [34] as \begin{equation*} \widehat {\mathbf {l}} = \mathbf {S}\mathbf {B}\widehat {\mathbf {h}} + \mathbf {m} \tag{14}\end{equation*} where $\mathbf {B}$ and $\mathbf {S}$ represent spatial blurring and downsampling, respectively, and $\mathbf {m}$ accounts for modeling error.
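Both models are easy to simulate. In the NumPy sketch below, we stand in for $\mathbf {S}\mathbf {B}$ with block averaging followed by downsampling and take $\mathbf {m} = \mathbf {0}$; these concrete choices, and the salt-and-pepper model for the sparse noise, are our assumptions for illustration.

```python
import numpy as np

def sb_operator(h_img, scale):
    """Stand-in for S*B in (14): average over scale-by-scale blocks,
    then keep one value per block (blur + downsampling)."""
    H, W = h_img.shape
    return h_img.reshape(H // scale, scale, W // scale, scale).mean(axis=(1, 3))

def observe(clean, sigma, r, rng):
    """First model (13): Gaussian noise (std sigma) plus sparse
    salt-and-pepper noise (rate r) modeling outliers/missing values."""
    noisy = clean + rng.normal(0.0, sigma, clean.shape)
    mask = rng.random(clean.shape) < r
    noisy[mask] = rng.integers(0, 2, size=int(mask.sum())).astype(float)
    return noisy
```

With these, a noiseless LR image is `sb_operator(h_hat, scale)`, and the observed images in (13) are obtained by applying `observe` to the noiseless HR and LR images.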
B. Problem Formulation
We introduce the following two assumptions about the noiseless HR and LR images on the reference and target dates, that is, $\widehat {\mathbf {h}}_{r}$, $\widehat {\mathbf {l}}_{r}$, $\widehat {\mathbf {h}}_{t}$, and $\widehat {\mathbf {l}}_{t}$.
1) The reflectance may change significantly between the reference and target dates, but the land structure does not. This is a very natural assumption in ST fusion and is implicitly accepted in previous studies. If the land structure has not changed significantly, the edges of $\widehat {\mathbf {h}}_{r}$ and $\widehat {\mathbf {h}}_{t}$ appear at almost the same locations, implying that the difference between $\mathbf {D}\widehat {\mathbf {h}}_{r}$ and $\mathbf {D}\widehat {\mathbf {h}}_{t}$ tends to be small. We measure the similarity of these edges using the $\ell _{p}$ ($p=1$ or 2) norm as $\| \mathbf {D}\widehat {\mathbf {h}}_{r} - \mathbf {D}\widehat {\mathbf {h}}_{t} \|_{p}$.
2) The HR and LR images taken on the same date have similar average brightness per band. For example, the difference in average brightness of the $b$th band of $\widehat {\mathbf {h}}_{t}$ and $\widehat {\mathbf {l}}_{t}$, expressed as \begin{equation*} \left |{\frac {1}{N_{l}}\mathbf {1}^{\top }\left [{\widehat {\mathbf {l}}_{t}}\right]_{b}-\frac {1}{N_{h}}\mathbf {1}^{\top }\left [{\widehat {\mathbf {h}}_{t}}\right]_{b}}\right | \tag{15}\end{equation*} is expected to be small. This is necessarily true if the HR and LR sensors have similar spectral resolutions, as is the case for Landsat and MODIS [9].
Based on these assumptions and the observation models in (13) and (14), we formulate the fusion problem as the following constrained convex optimization problem:\begin{align*} \min _{\substack { \widetilde {\mathbf {h}}_{r}, \widetilde {\mathbf {h}}_{t}, \\[4pt] \widetilde {\mathbf {s}}_{hr}, \widetilde {\mathbf {s}}_{lr}, \widetilde {\mathbf {s}}_{lt}}} \: & \left \|{\mathbf {D} \widetilde {\mathbf {h}}_{r}}\right \|_{1,2} + \lambda \left \|{\mathbf {D} \widetilde {\mathbf {h}}_{t}}\right \|_{1,2} \\[4pt] \mathrm {s.t.} \: & \begin{cases} \displaystyle \| \mathbf {D} \widetilde {\mathbf {h}}_{r} - \mathbf {D} \widetilde {\mathbf {h}}_{t} \|_{p} \leq \alpha, \\[4pt] \displaystyle \left |{ c_{b}- \mathbf {1}^{\top }\left [{\widetilde {\mathbf {h}}_{t}}\right]_{b}/{N_{h}}}\right | \leq \beta _{b} \, \left ({b = 1, \ldots, B}\right), \\[4pt] \displaystyle \| \mathbf {h}_{r}- \left ({\widetilde {\mathbf {h}}_{r}+ \widetilde {\mathbf {s}}_{hr}}\right) \|_{2}\leq \varepsilon _{h}, \\[2pt] \displaystyle \| \mathbf {l}_{r}- \left ({{ \mathbf {S}} \mathbf {B} \widetilde {\mathbf {h}}_{r} + \widetilde {\mathbf {s}}_{lr}}\right) \|_{2}\leq \varepsilon _{l},\\[1pt] \displaystyle \| \mathbf {l}_{t}- \left ({{ \mathbf {S}} \mathbf {B} \widetilde {\mathbf {h}}_{t} + \widetilde {\mathbf {s}}_{lt}}\right) \|_{2}\leq \varepsilon _{l},\\[2pt] \displaystyle \| \widetilde {\mathbf {s}}_{hr} \|_{1} \leq \eta _{h}, \\[2pt] \displaystyle \| \widetilde {\mathbf {s}}_{lr} \|_{1} \leq \eta _{l}, \\[2pt] \displaystyle \| \widetilde {\mathbf {s}}_{lt} \|_{1} \leq \eta _{l} \end{cases} \tag{16}\end{align*} where $\lambda > 0$ balances the two regularization terms, and $\mathbf {D}$ denotes a spatial difference operator.
The two terms in the objective function promote spatial piecewise smoothness of $\widetilde {\mathbf {h}}_{r}$ and $\widetilde {\mathbf {h}}_{t}$, with the hyperspectral total variation (HTV) [50] as regularization.
The first constraint encourages $\mathbf {D}\widetilde {\mathbf {h}}_{r}$ and $\mathbf {D}\widetilde {\mathbf {h}}_{t}$ to be similar based on Assumption 1). The parameter $\alpha $ controls the degree of similarity. Hereafter, the constraint is referred to as the edge constraint.
The second constraint is designed based on Assumption 2). Since $\mathbf {l}_{t}$ is contaminated by noise, we do not use the average brightness of $[\mathbf {l}_{t}]_{b}$ itself, that is, $\mathbf {1}^{\top }[\mathbf {l}_{t}]_{b}/N_{l}$, but the parameter $c_{b}$, which is determined based on $\mathbf {1}^{\top }[\mathbf {l}_{t}]_{b}/N_{l}$ and the noise intensity. The parameter $\beta _{b}$ controls the strength of this constraint. Hereafter, the constraint is referred to as the brightness constraint.
The third constraint serves as data fidelity based on the observation model in (13). The parameter $\varepsilon _{h}$ depends on the Gaussian noise intensity on the HR image, that is, $\sigma _{h}$.
The fourth and fifth constraints ensure that the solutions follow the observation model in (14). The parameter $\varepsilon _{l}$ depends on the Gaussian noise intensity on the LR images, that is, $\sigma _{l}$.
The last three constraints characterize the sparse noise using the $\ell _{1}$ norms. The parameters $\eta _{h}$ and $\eta _{l}$ depend on the sparse noise intensity on the HR and the LR images, respectively, that is, $r_{h}$ and $r_{l}$.
Using constraints instead of adding terms to the objective function in this way simplifies the parameter setting [37], [38], [39], [40], [41], [42]: we can determine the appropriate parameters for each constraint independently because they are decoupled. The detailed setting of these parameters is discussed in Section IV-C.
C. Optimization
For solving (16) by an algorithm based on P-PDS with OVDP, we need to transform (16) into the form of (10). First, using the hyperslab, the $\ell _{p}$-balls, and their indicator functions, we rewrite (16) as the equivalent unconstrained problem \begin{align*} \min _{\substack { \widetilde {\mathbf {h}}_{r}, \widetilde {\mathbf {h}}_{t}, \\ \widetilde {\mathbf {s}}_{hr}, \widetilde {\mathbf {s}}_{lr}, \widetilde {\mathbf {s}}_{lt}}} & \left \|{\mathbf {D} \widetilde {\mathbf {h}}_{r}}\right \|_{1,2} + \lambda \left \|{\mathbf {D} \widetilde {\mathbf {h}}_{t}}\right \|_{1,2} + \iota _{B_{p, \alpha }^{\mathbf {0}}}\left ({\mathbf {D} \widetilde {\mathbf {h}}_{r} - \mathbf {D} \widetilde {\mathbf {h}}_{t} }\right) \\ &\quad + \sum _{b=1}^{B}\iota _{S_{\beta _{b}'}^{c_{b}'}}\left ({\left [{ \widetilde {\mathbf {h}}_{t}}\right]_{b}}\right) + \iota _{B_{2,\varepsilon _{h}}^{ \mathbf {h}_{r}}}\left ({\widetilde {\mathbf {h}}_{r}+ \widetilde {\mathbf {s}}_{hr}}\right) \\ &\quad +\iota _{B_{2,\varepsilon _{l}}^{ \mathbf {l}_{r}}}\left ({{ \mathbf {S}} \mathbf {B} \widetilde {\mathbf {h}}_{r} + \widetilde {\mathbf {s}}_{lr}}\right) + \iota _{B_{2,\varepsilon _{l}}^{ \mathbf {l}_{t}}}\left ({{ \mathbf {S}} \mathbf {B} \widetilde {\mathbf {h}}_{t} + \widetilde {\mathbf {s}}_{lt}}\right) \\ &\quad + \iota _{B_{1, \eta _{h}}^{\mathbf {0}}}\left ({\widetilde {\mathbf {s}}_{hr}}\right) + \iota _{B_{1, \eta _{l}}^{\mathbf {0}}}\left ({\widetilde {\mathbf {s}}_{lr}}\right) + \iota _{B_{1, \eta _{l}}^{\mathbf {0}}}\left ({\widetilde {\mathbf {s}}_{lt}}\right) \tag{17}\end{align*} where $c_{b}':= N_{h}c_{b}$ and $\beta _{b}':= N_{h}\beta _{b}$.
Then, by introducing auxiliary variables ${\mathbf z}_{1}, {\dots }, {\mathbf z}_{6}$, we rewrite (17) in the form of (10) as \begin{align*} \min _{\substack { \widetilde {\mathbf {h}}_{r}, \widetilde {\mathbf {h}}_{t}, \\ \widetilde {\mathbf {s}}_{hr}, \widetilde {\mathbf {s}}_{lr}, \widetilde {\mathbf {s}}_{lt}}} & \left \|{ {\mathbf z}_{1}}\right \|_{1,2} + \lambda \left \|{ {\mathbf z}_{2}}\right \|_{1,2} + \iota _{B_{p, \alpha }^{\mathbf {0}}}\left ({{\mathbf z}_{3}}\right) + \sum _{b=1}^{B}\iota _{S_{\beta _{b}'}^{c_{b}'}}\left ({\left [{ \widetilde {\mathbf {h}}_{t}}\right]_{b}}\right) \\ &\quad + \iota _{B_{2,\varepsilon _{h}}^{ \mathbf {h}_{r}}}\left ({{\mathbf z}_{4}}\right) +\iota _{B_{2,\varepsilon _{l}}^{ \mathbf {l}_{r}}}\left ({{\mathbf z}_{5}}\right) + \iota _{B_{2,\varepsilon _{l}}^{ \mathbf {l}_{t}}}\left ({{\mathbf z}_{6}}\right) \\ &\quad + \iota _{B_{1, \eta _{h}}^{\mathbf {0}}}\left ({\widetilde {\mathbf {s}}_{hr}}\right) + \iota _{B_{1, \eta _{l}}^{\mathbf {0}}}\left ({\widetilde {\mathbf {s}}_{lr}}\right) + \iota _{B_{1, \eta _{l}}^{\mathbf {0}}}\left ({\widetilde {\mathbf {s}}_{lt}}\right) \\ \mathrm {s.t.} &\begin{cases} \displaystyle {\mathbf z} _{1} = \mathbf {D} \widetilde {\mathbf {h}}_{r}, \\ \displaystyle {\mathbf z} _{2} = \mathbf {D} \widetilde {\mathbf {h}}_{t}, \\ \displaystyle {\mathbf z} _{3} = \mathbf {D} \widetilde {\mathbf {h}}_{r} - \mathbf {D} \widetilde {\mathbf {h}}_{t}, \\ \displaystyle {\mathbf z} _{4} = \widetilde {\mathbf {h}}_{r}+ \widetilde {\mathbf {s}}_{hr}, \\ \displaystyle {\mathbf z} _{5} = { \mathbf {S}} \mathbf {B} \widetilde {\mathbf {h}}_{r} + \widetilde {\mathbf {s}}_{lr}, \\ \displaystyle {\mathbf z} _{6} = { \mathbf {S}} \mathbf {B} \widetilde {\mathbf {h}}_{t} + \widetilde {\mathbf {s}}_{lt}. \end{cases} \tag{18}\end{align*}
Problem (18) is an instance of (10) with \begin{align*} &g_{1}\left ({\widetilde {\mathbf {h}}_{r}}\right)=0, g_{2}\left ({\widetilde {\mathbf {h}}_{t}}\right)=\sum _{b=1}^{B}\iota _{S_{\beta _{b}'}^{c_{b}'}}\left ({\left [{\widetilde {\mathbf {h}}_{t}}\right]_{b}}\right), \\ &g_{3}\left ({\widetilde {\mathbf {s}}_{hr}}\right)=\iota _{B_{1, \eta _{h}}^{\mathbf {0}}}\left ({\widetilde {\mathbf {s}}_{hr}}\right), g_{4}\left ({\widetilde {\mathbf {s}}_{lr}}\right)=\iota _{B_{1, \eta _{l}}^{\mathbf {0}}}\left ({\widetilde {\mathbf {s}}_{lr}}\right), \\ &g_{5}\left ({\widetilde {\mathbf {s}}_{lt}}\right)=\iota _{B_{1, \eta _{l}}^{\mathbf {0}}}\left ({\widetilde {\mathbf {s}}_{lt}}\right), h_{1}\left ({{\mathbf z}_{1}}\right) = \left \|{ {\mathbf z}_{1}}\right \|_{1,2}, \\ &h_{2}\left ({{\mathbf z}_{2}}\right) = \lambda \| {\mathbf z}_{2}\|_{1,2}, h_{3}\left ({{\mathbf z}_{3}}\right) = \iota _{B_{p, \alpha }^{\mathbf {0}}}\left ({{\mathbf z}_{3}}\right), \\ &h_{4}\left ({{\mathbf z}_{4}}\right) = \iota _{B_{2,\varepsilon _{h}}^{ \mathbf {h}_{r}}}\left ({{\mathbf z}_{4}}\right), h_{5}\left ({{\mathbf z}_{5}}\right) = \iota _{B_{2,\varepsilon _{l}}^{ \mathbf {l}_{r}}}\left ({{\mathbf z}_{5}}\right), \\ &h_{6}\left ({{\mathbf z}_{6}}\right) = \iota _{B_{2,\varepsilon _{l}}^{ \mathbf {l}_{t}}}\left ({{\mathbf z}_{6}}\right). \tag{19}\end{align*}
The algorithm for solving (16) is summarized in Algorithm 1. The stepsizes are determined based on OVDP [36] as follows:\begin{align*} \gamma _{1,1} &= \frac {1}{2\left \|{ \mathbf {D}}\right \|_{\mathrm {op}}^{2} + \left \|{ \mathbf {I}}\right \|_{\mathrm {op}}^{2} + \left \|{ { \mathbf {S}} \mathbf {B}}\right \|_{\mathrm {op}}^{2}} = \frac {1}{18}, \\ \gamma _{1,2} &= \frac {1}{2\left \|{ \mathbf {D}}\right \|_{\mathrm {op}}^{2} + \left \|{ { \mathbf {S}} \mathbf {B}}\right \|_{\mathrm {op}}^{2}} = \frac {1}{17}, \\ \gamma _{1,3} &=\gamma _{1,4} = \gamma _{1,5} = \frac {1}{\left \|{ \mathbf {I}}\right \|_{\mathrm {op}}^{2}} = 1, \\ \gamma _{2,i} &= \frac {1}{5}, \quad \text {for}~i = 1, {\dots }, 6. \tag{20}\end{align*}
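The concrete values in (20) can be checked numerically. The sketch below builds $\mathbf {D}$ as vertical and horizontal first differences with periodic boundary on a small grid (the grid size and boundary handling are our assumptions) and takes $\|\mathbf {S}\mathbf {B}\|_{\mathrm {op}}^{2} = 1$, as in (20).

```python
import numpy as np

n = 16                                 # small test grid (our choice)
I = np.eye(n)
D1 = np.eye(n, k=1) - I                # forward difference
D1[-1, 0] = 1.0                        # periodic boundary
Dv = np.kron(D1, I)                    # vertical differences on an n x n grid
Dh = np.kron(I, D1)                    # horizontal differences
D = np.vstack([Dv, Dh])                # spatial difference operator D

norm_D_sq = np.linalg.norm(D, 2) ** 2  # ||D||_op^2; equals 8 for even n
norm_I_sq = 1.0                        # ||I||_op^2
norm_SB_sq = 1.0                       # ||SB||_op^2, taken as 1 as in (20)

gamma_1_1 = 1.0 / (2 * norm_D_sq + norm_I_sq + norm_SB_sq)  # -> 1/18
gamma_1_2 = 1.0 / (2 * norm_D_sq + norm_SB_sq)              # -> 1/17
```

Here `np.linalg.norm(D, 2)` returns the largest singular value of the explicit difference matrix, whose square is exactly 8 for an even-sized periodic grid, reproducing the denominators 18 and 17 in (20).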
Algorithm 1 P-PDS-Based Solver for (16)
Require:
Ensure:
Initialize
Set
while a stopping criterion is not satisfied do
for
end for
end while
D. Detailed Computations and Their Complexity
Table I shows the computational complexity (in big-${\mathcal {O}}$ notation) of each step of Algorithm 1, which is summarized as follows.
Steps 4, 5, 23, and 24: ${\mathcal {O}}(kN_{h}B)$.
Steps 6, 7, 14, 15, 16, 19, 20, 21, 22, 25, 26, 27, and 28: ${\mathcal {O}}(N_{h} B)$ when $p=2$.
Step 9: ${\mathcal {O}}(N_{h})$.
Steps 17, 18, 29, and 30: ${\mathcal {O}}(N_{l} B)$.
Steps 11 and 27: ${\mathcal {O}}(N_{h} B \log (N_{h} B))$ when $p=1$.
Steps 12 and 13: ${\mathcal {O}}(N_{l} B \log (N_{l} B))$.
Experiments
We demonstrate the effectiveness of our ST fusion method, ROSTF, through comprehensive experiments using simulated and real data for two sites. Our experiments aim to verify the following three items.
ROSTF is as effective as state-of-the-art ST fusion methods in noiseless cases and outperforms them in noisy cases. We conducted comparative experiments on four cases of noise contamination. The experimental results for simulated data and real data are presented in Sections IV-D and IV-E, respectively.
Key components of ROSTF, such as the assumption-based constraints and the denoising mechanism, operate as expected. To measure their influence, ROSTF without each key component is compared with the original ROSTF in terms of fusion accuracy and convergence speed in Section IV-F.
ROSTF is practical in terms of computational time. For a fair comparison, only nondeep-learning-based methods are compared to ROSTF in Section IV-G.
Note that there are two options for ROSTF: ROSTF-1 with $p = 1$ and ROSTF-2 with $p = 2$ in the edge constraint.
A. Data Description
We tested our methods both on real data and simulated data. In the case of satellite observations, radiometric and geometric inconsistencies exist between two different image sensors. This means that the fusion capability of each method cannot be accurately evaluated in experiments using real data because these inconsistencies affect performance, as also addressed in [32]. Therefore, we generated simulated data based on the observation models and verified the performance of each method using this data in addition to the real data. Specifically, in experiments using simulated data, the simulated LR images were generated from the corresponding real HR images according to (14) with
We used MODIS and Landsat time-series images for the following two sites in our experiments.
Site1:
The first site is situated in the Daxing district in the south of Beijing city (39.0009° N, 115.0986° E) [16]. For Site1, we employed MODIS and Landsat time-series images acquired on May 29, 2019 (a reference date) and June 14, 2019 (a target date).
Site2:
The second site is located in southern New South Wales, Australia (34.0034° S, 145.0675° E) [51]. For Site2, MODIS and Landsat time-series images acquired on January 4, 2002 (a reference date) and February 12, 2002 (a target date) were used.
B. Compared Methods
Our method was compared with STARFM [9], VIPSTF [33], RSFN [28], RobOt [23], and SwinSTFM [31]. Table II shows the characteristics of these methods and ROSTF. Note that, unlike the other methods, RSFN requires input images obtained on two reference dates. Since our experiments assume a scenario with only one reference date, the same HR-LR image pair was input as two different reference image pairs for RSFN.
As the parameters of these existing methods, we used the values recommended in each reference. It should be noted that RSFN and SwinSTFM require significantly more data than our and other existing methods due to training and validation processes. For our experiments, we trained and validated RSFN and SwinSTFM using a different set of images from those used for the tests described in Section IV-A. Specifically, 24 groups from Site1 and two groups from Site2 were used for training, and one group from Site1 and two groups from Site2 were used for validation.
C. Experimental Setup
Our method, ROSTF, is implemented in MATLAB. The source code is available on GitHub. For these experiments, the spatial spread transform matrix
To verify both the pure fusion capability and the noise robustness of the existing and proposed methods, we considered the following four combinations of Gaussian noise with different standard deviations and sparse noise (salt-and-pepper noise) with different superimposition rates.
Case1:
The observed HR and LR images are noiseless, that is,
in (13).$\sigma _{h}=\sigma _{l}=r_{h}=r_{l}=0$ Case2:
The observed HR images are contaminated with Gaussian noise with a standard deviation
while the observed LR images are noiseless, that is,$\sigma _{h}=0.05$ in (13).$\sigma _{h}=0.05, \sigma _{l}=r_{h}=r_{l}=0$ Case3:
The observed HR images are contaminated with sparse noise with a superimposition rate
, while the observed LR images are noiseless, that is,$r_{h}=0.05$ in (13).$r_{h}=0.05, \sigma _{h}=\sigma _{l}=r_{l}=0$ Case4:
The observed HR images are contaminated with Gaussian noise with a standard deviation
and sparse noise with a superimposition rate$\sigma _{h}=0.05$ , while the observed LR images are noiseless, that is,$r_{h}=0.05$ in (13).$\sigma _{h}=r_{h}=0.05, \sigma _{l}=r_{l}=0$
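As a concrete illustration of the four cases, the sketch below superimposes Gaussian and salt-and-pepper noise on a toy image in $[0, 1]$. The function name, the clipping step, and the toy image are our assumptions for illustration; this is not the experimental code:

```python
import numpy as np

def add_mixed_noise(img, sigma=0.0, rate=0.0, seed=0):
    """Contaminate an image in [0, 1] with Gaussian noise of standard
    deviation `sigma` and salt-and-pepper noise at superimposition rate
    `rate` (a sketch of the degradations in Case2-Case4)."""
    rng = np.random.default_rng(seed)
    noisy = img + rng.normal(0.0, sigma, img.shape)
    # Salt-and-pepper noise: flip a fraction `rate` of pixels to 0 or 1.
    mask = rng.random(img.shape) < rate
    salt = rng.random(img.shape) < 0.5
    noisy[mask & salt] = 1.0
    noisy[mask & ~salt] = 0.0
    return np.clip(noisy, 0.0, 1.0)

h_r = np.full((64, 64, 4), 0.5)                       # toy reference HR image
case2 = add_mixed_noise(h_r, sigma=0.05)              # Gaussian noise only
case3 = add_mixed_noise(h_r, rate=0.05)               # sparse noise only
case4 = add_mixed_noise(h_r, sigma=0.05, rate=0.05)   # mixed noise
```
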
For the quantitative evaluation, we used the following four metrics: the root mean square error (RMSE) \begin{equation*} \mathrm {RMSE} = \sqrt {\frac {1}{N_{h}B} \left \|{ \widetilde {\mathbf {h}}_{t}- \widehat {\mathbf {h}}_{t}}\right \|_{2}^{2}} \tag{21}\end{equation*}
the spectral angle mapper (SAM) \begin{equation*} \mathrm {SAM} = \frac {1}{N_{h}}\sum _{n=1}^{N_{h}}\mathrm {arccos}\left ({\frac {\langle \widetilde {\mathbf {e}}_{n},\widehat {\mathbf {e}}_{n}\rangle }{\|\widetilde {\mathbf {e}}_{n}\|_{2} \cdot \left \|{\widehat {\mathbf {e}}_{n}}\right \|_{2}}}\right) \tag{22}\end{equation*}
the mean structural similarity index (MSSIM) \begin{equation*} \mathrm {MSSIM} = \frac {1}{B}\sum _{b = 1}^{B}\mathrm {SSIM}\left ({\left [{ \widetilde {\mathbf {h}}_{t}}\right]_{b}, \left [{ \widehat {\mathbf {h}}_{t}}\right]_{b} }\right) \tag{23}\end{equation*}
and the correlation coefficient (CC) \begin{equation*} \mathrm {CC} = \frac {s_{ \widetilde {\mathbf {h}}_{t} \widehat {\mathbf {h}}_{t} }}{s_{ \widetilde {\mathbf {h}}_{t}}s_{ \widehat {\mathbf {h}}_{t}}} \tag{24}\end{equation*}
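For reference, three of these metrics can be computed directly with NumPy. The sketch below is ours, not the paper's evaluation code; it assumes the images are flattened to shape $(N_h, B)$, and MSSIM would additionally require a windowed SSIM implementation such as the one in scikit-image:

```python
import numpy as np

def rmse(ref, est):
    # (21): root mean square error over all N_h * B entries.
    return float(np.sqrt(np.mean((ref - est) ** 2)))

def sam(ref, est, eps=1e-12):
    # (22): average spectral angle (in radians) over the N_h pixels,
    # where each row of ref/est is one pixel's spectral vector.
    num = np.sum(ref * est, axis=1)
    den = np.linalg.norm(ref, axis=1) * np.linalg.norm(est, axis=1) + eps
    return float(np.mean(np.arccos(np.clip(num / den, -1.0, 1.0))))

def cc(ref, est):
    # (24): correlation coefficient between the two images.
    return float(np.corrcoef(ref.ravel(), est.ravel())[0, 1])
```
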
D. Experimental Results With Simulated Data
Table IV shows the RMSE, SAM, MSSIM, and CC results for the experiments with the simulated data. In Case1, STARFM, VIPSTF, RobOt, SwinSTFM, ROSTF-1, and ROSTF-2 perform equally well. In contrast, RSFN performs poorly on both the Site1 and Site2 data. This may be because the amount of training data described above was insufficient to effectively train RSFN; with more training data, RSFN might produce better results. However, collecting noise-free training data is challenging in real-world applications, which is precisely the scenario considered in these experiments. Next, we focus on the results in Case2, Case3, and Case4, where the observed reference images are contaminated with noise. While STARFM, VIPSTF, RobOt, SwinSTFM, and RSFN perform significantly worse due to the influence of noise, ROSTF-1 and ROSTF-2 show no significant performance degradation in these cases. This is because ROSTF estimates the target HR image while simultaneously denoising the reference HR image.
Figs. 2 and 3 show the estimated results in Case1 for the Site1 and Site2 simulated data, respectively. In the zoomed-in areas, there are significant temporal changes in brightness between the reference HR image and the ground-truth target HR image.
Next, we focus on the results in the noisy cases. Fig. 4 shows the estimated results for the Site2 data in the noisy cases, that is, Case2, Case3, and Case4. The results of STARFM, VIPSTF, and RobOt are contaminated with noise similar to that of the noisy reference HR image.
ST fusion results for the noisy Site2 simulated data. (Top, Middle, and Bottom) Results in Case2, Case3, and Case4, respectively.
The impact of noise on each method is also visually evident in Figs. 5 and 6. The scatter plots in Fig. 5 reveal the difference between the ground-truth and the estimated values of each method for the simulated Site2 data. In Case2, STARFM, VIPSTF, RSFN, RobOt, and SwinSTFM exhibit greater variance compared to Case1 due to the influence of Gaussian noise. In Case3, STARFM, VIPSTF, and RobOt estimate the wrong values close to 0 or 1 affected by sparse noise while the RSFN and SwinSTFM results have no such outliers but show greater variance. Furthermore, in Case4, the distributions indicate that STARFM, VIPSTF, RSFN, RobOt, and SwinSTFM are affected by both Gaussian and sparse noise. In contrast, the results of ROSTF-1 and ROSTF-2 have minimal variance and no outliers, indicating their robustness to Gaussian, sparse, and mixed noise. Spectral profiles of a specific pixel in the results of each method for the Site2 data in Case1 and Case4 are depicted in Fig. 6. STARFM, VIPSTF, RobOt, SwinSTFM, and ROSTF estimate similarly accurate values in Case1, that is, they are comparable in the noiseless case. In Case4, STARFM, VIPSTF, RobOt, and SwinSTFM estimate completely wrong values for the third band affected by sparse noise and perform worse for the other bands due to the influence of Gaussian noise. In contrast, ROSTF-1 and ROSTF-2 have accurate estimates for all bands, even in the noisy case.
Scatter plots of the ground truth and the estimated values for the Site2 simulated data.
Spectral profiles of a specific pixel in the results of each method for the Site2 data in (a) Case1 and (b) Case4.
E. Experimental Results With Real Data
Table V shows the RMSE, SAM, MSSIM, and CC results for the experiments with the real data. Compared to the results for the simulated data in Table IV, the performance of ROSTF-1 and ROSTF-2 degrades due to radiometric and geometric inconsistencies between the Landsat and MODIS sensors. Despite this degradation, ROSTF-1 and ROSTF-2 perform as well as the existing methods in the noiseless case, that is, Case1, and outperform them in the noisy cases, that is, Case2, Case3, and Case4, as in the experiments with the simulated data. Thus, it can be concluded that ROSTF is robust to noise even for the real data.
Figs. 7 and 8 show the estimated results in Case1 for the Site1 and Site2 real data, respectively. Compared to the results for the simulated data in Figs. 2 and 3, ROSTF-1 and ROSTF-2 perform worse due to modeling errors between the real HR and LR images.
Fig. 9 shows the estimated results for the Site2 data in the noisy cases, that is, Case2, Case3, and Case4. The results of the existing methods are not good, especially STARFM, VIPSTF, and RobOt generate noisy outputs. The estimated images of ROSTF seem to be blurred in the zoomed area in Fig. 9. This is due to oversmoothing by the HTV regularization terms in (16), which might have undesirable effects on some applications. Nevertheless, according to the difference map in Fig. 10, the results of ROSTF exhibit the least error, and the accuracy evaluation in Table V also shows that ROSTF achieves the best performance in all metrics.
ST fusion results for the noisy Site1 real data. (Top, Middle, and Bottom) Results in Case2, Case3, and Case4, respectively.
Difference map (absolute errors) of the fusion results in the zoomed-in area in the noisy Site1 real data. (Top, Middle, and Bottom) Results in Case2, Case3, and Case4, respectively.
F. Ablation Study
We conducted ablation experiments focusing on the following three items.
1) The edge constraint, $\| \mathbf {D} \widetilde {\mathbf {h}}_{r} - \mathbf {D} \widetilde {\mathbf {h}}_{t} \|_{p} \leq \alpha $, which encourages similarity in the land structure, specifically the edges, between the reference and target HR images based on Assumption 1).
2) The brightness constraint, $| c_{b}- \mathbf {1}^{\top }[\widetilde {\mathbf {h}}_{t}]_{b}/{N_{h}}| \leq \beta _{b} \, (b = 1, \ldots, B)$, which is designed based on Assumption 2) to ensure that the estimated target HR image has an average brightness similar to that of the target LR image.
3) The denoising mechanism for the reference HR image $\mathbf {h}_{r}$, that is, the first regularization term $\| \mathbf {D} \widetilde {\mathbf {h}}_{r} \|_{1,2}$ and the third, fourth, sixth, and seventh constraints, $\| \mathbf {h}_{r}- (\widetilde {\mathbf {h}}_{r}+ \widetilde {\mathbf {s}}_{hr}) \|_{2}\leq \varepsilon _{h}$, $\| \mathbf {l}_{r}- ({ \mathbf {S}} \mathbf {B} \widetilde {\mathbf {h}}_{r} + \widetilde {\mathbf {s}}_{lr}) \|_{2}\leq \varepsilon _{l}$, $\| \widetilde {\mathbf {s}}_{hr} \|_{1} \leq \eta _{h}$, and $\| \widetilde {\mathbf {s}}_{lr} \|_{1} \leq \eta _{l}$.
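As a minimal illustration of items 1) and 2), the sketch below checks whether a candidate target HR image satisfies the two constraints. A simple first-order difference operator stands in for $\mathbf{D}$, and all function names and array shapes are our assumptions rather than the ROSTF implementation:

```python
import numpy as np

def diff_op(img):
    """First-order vertical/horizontal differences, a simple stand-in
    for the difference operator D (the exact operator may differ)."""
    dv = np.diff(img, axis=0, append=img[-1:, :])
    dh = np.diff(img, axis=1, append=img[:, -1:])
    return np.stack([dv, dh])

def edge_constraint_ok(h_r, h_t, alpha, p=1):
    # || D h_r - D h_t ||_p <= alpha  (Assumption 1: shared edges).
    r = (diff_op(h_r) - diff_op(h_t)).ravel()
    return bool(np.linalg.norm(r, ord=p) <= alpha)

def brightness_constraint_ok(c, h_t, beta):
    # | c_b - mean of band b of h_t | <= beta_b for every band b
    # (Assumption 2: average brightness matches the target LR image).
    return bool(np.all(np.abs(c - h_t.mean(axis=(0, 1))) <= beta))
```
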
We tested ROSTF with each of the three components above removed. In the following, we first present the ablation studies on the two constraints (the edge constraint and the brightness constraint) and then the ablation study on the denoising mechanism. The hyperparameters in each optimization problem and the stopping criterion of each P-PDS-based algorithm were set as in the original ROSTF, while the stepsizes in each algorithm were computed according to the operator-norm-based design method of variable-wise diagonal preconditioning in (11).
1) Edge and Brightness Constraints:
First, we measure the effectiveness of the two constraints based on Assumptions 1) and 2), that is, the edge constraint and the brightness constraint, in terms of convergence speed and fusion performance.
Table VI shows the average number of iterations before each algorithm stopped and the average performance results over all sites (Site1 and Site2), data types (real and simulated), and noise cases (Case1 to Case4). Note that each algorithm was terminated at a maximum of 50000 iterations even if the stopping criterion had not been met. The original ROSTF performs best with the fewest iterations on average, indicating that both constraints contribute to achieving higher fusion performance with fewer iterations.
Fig. 11 illustrates the transition of the RMSE values for the original ROSTF-2, ROSTF-2 without the edge constraint, and ROSTF-2 without the brightness constraint in the experiment on the Site1 simulated data in Case1. ROSTF-2 without the edge constraint does not meet the stopping criterion within 50000 iterations, possibly because the solution space of the optimization problem without the edge constraint is too large to reach an optimal solution efficiently. This suggests that the edge constraint moderately shrinks the solution space and speeds up convergence. On the other hand, ROSTF-2 without the brightness constraint converges faster than ROSTF-2 without the edge constraint but exhibits unstable behavior, especially in the early iterations. This may be because, without the brightness constraint, the average brightness of the estimated image is not anchored in the early iterations.
Behavior of the original ROSTF-2, ROSTF-2 without the edge constraint, and ROSTF-2 without the brightness constraint in the experiment on the Site1 simulated data in Case1. The transition of the RMSE values (a) until the algorithms stopped and (b) in early iterations.
Fig. 12 provides a visual comparison of the original ROSTF-2 and ROSTF-2 without these constraints. The image estimated by ROSTF-2 without the edge constraint [Fig. 12(a)] loses spatial structure, and ROSTF-2 without the brightness constraint [Fig. 12(b)] produces an image with incorrect brightness. In contrast, the original ROSTF-2 [Fig. 12(c)] estimates brightness close to that of the ground truth while preserving spatial structure, indicating that both constraints work effectively.
ST fusion results of (a) ROSTF-2 without edge constraint, (b) ROSTF-2 without brightness constraint, and (c) original ROSTF-2 for the Site2 simulated data in Case3. (d) Ground truth.
2) Denoising Mechanism:
Next, we move on to the ablation study of the denoising mechanism of ROSTF. The optimization problem of ROSTF without the denoising mechanism is formulated as \begin{align*} \min _{\substack { \widetilde {\mathbf {h}}_{t}, \widetilde {\mathbf {s}}_{lt}}} \: & \|\mathbf {D} \widetilde {\mathbf {h}}_{t}\|_{1,2} \\ \mathrm {s.t.} \: & \begin{cases} \displaystyle \| \mathbf {D} \mathbf {h}_{r} - \mathbf {D} \widetilde {\mathbf {h}}_{t} \|_{p} \leq \alpha, \\ \displaystyle | c_{b}- \mathbf {1}^{\top }\left [{\widetilde {\mathbf {h}}_{t}}\right]_{b}/{N_{h}}| \leq \beta _{b} \, \left ({b = 1, \ldots, B}\right), \\ \displaystyle \| \mathbf {l}_{t}- \left ({{ \mathbf {S}} \mathbf {B} \widetilde {\mathbf {h}}_{t} + \widetilde {\mathbf {s}}_{lt}}\right) \|_{2}\leq \varepsilon _{l},\\ \displaystyle \| \widetilde {\mathbf {s}}_{lt} \|_{1} \leq \eta _{l}. \end{cases} \tag{25}\end{align*}
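Enforcing constraints such as $\| \widetilde {\mathbf {s}}_{lt} \|_{1} \leq \eta _{l}$ in (25) within a primal-dual algorithm typically relies on the Euclidean projection onto an $\ell_1$-ball. Below is a standard sort-based projection as an illustrative building block; it is not taken from the ROSTF code:

```python
import numpy as np

def project_l1_ball(v, eta):
    """Euclidean projection of the vector v onto the l1-ball of radius
    eta, via the classical sort-and-threshold algorithm."""
    if np.abs(v).sum() <= eta:
        return v.copy()  # already feasible
    u = np.sort(np.abs(v))[::-1]                 # sorted magnitudes
    css = np.cumsum(u)
    k = np.arange(1, len(u) + 1)
    rho = np.nonzero(u * k > css - eta)[0][-1]   # last active index
    theta = (css[rho] - eta) / (rho + 1.0)       # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)
```
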
Table VII displays the average RMSE, SAM, MSSIM, and CC results for each noise case. As expected, in the noisy cases, that is, Case2, Case3, and Case4, ROSTF without the denoising mechanism performs significantly worse than the original ROSTF due to the direct impact of noise. This shows that the denoising mechanism effectively makes ROSTF robust to noise. On the other hand, we newly found that in the noiseless case, that is, Case1, ROSTF without the denoising mechanism achieves slightly better fusion results than the original one. This result indicates that in noiseless cases, the observed HR image can serve as a reliable reference as is.
Fig. 13 illustrates the fusion results of ROSTF-1 without the denoising mechanism for the Site2 simulated data. It is also visually apparent that in Case1, ROSTF-1 without the denoising mechanism produces a satisfactory result using only two input images. On the other hand, in Case2, Case3, and Case4, the results of ROSTF without the denoising mechanism are contaminated with noise. This is because the edge constraint copies not only the true edge or spatial structure, but also the noise in the reference HR image to the estimated target HR image. The results of this ablation study confirm that the denoising mechanism plays an effective role in avoiding such noise effects.
ST fusion results of ROSTF-1 without the denoising mechanism for the Site2 simulated data.
G. Computational Cost
We measured the actual running times using MATLAB (R2022b) on a Windows 11 computer equipped with an Intel Core i9-13900 1.0 GHz processor, 32 GB of RAM, and NVIDIA GeForce RTX 4090. We used the Site1 and Site2 data with
ROSTF-1 and ROSTF-2 each took about 4–10 min. STARFM and VIPSTF took longer than ROSTF because they estimated the target HR image pixel by pixel. On the other hand, RobOt ran much faster than ROSTF, possibly because the Least Absolute Shrinkage and Selection Operator (LASSO) problem in RobOt has a closed-form solution in our experiment with only one reference date. This result indicates that ROSTF is slower than RobOt, but we believe that ROSTF remains practical in terms of computational cost.
H. Summary
We summarize the insights from the experiments as follows.
From the results of the experiments in Case1, we see that ROSTF is comparable to state-of-the-art ST fusion methods in noiseless cases. Therefore, the observation model in (14) and the assumptions introduced in Section III-B are valid for ST fusion.
The results of the experiments in Case2, Case3, and Case4 confirm that ROSTF has good performance even when observed HR images are degraded by random noise, missing values, and outliers.
The ablation studies demonstrate that the key components, such as the edge constraint, the brightness constraint, and the denoising mechanism, work effectively as expected.
Conclusion
We have proposed an optimization-based ST fusion method, named ROSTF, which is robust to mixed Gaussian and sparse noise in observed satellite images. We have formulated the fusion problem as a constrained optimization problem and have developed an optimization algorithm based on P-PDS with OVDP. ROSTF was tested through experiments using both simulated and real data. The experimental results demonstrate that ROSTF achieves performance comparable to state-of-the-art ST fusion methods in noiseless cases and significantly better performance in noisy cases. We expect ROSTF to benefit many areas of remote sensing, including the estimation of satellite image series with high spatial and temporal resolution from image series observed under severe degradation.