Introduction
Hyperspectral (HS) imaging has become an important research topic in recent years. HS images contain dozens or hundreds of narrow bands within a certain wavelength range, and their spectral resolution can reach the nanometer level, which contributes to various applications, such as earth remote sensing [1]–[6] and computer vision applications, including object segmentation, tracking, and recognition [7]–[9]. However, while HS images benefit from excellent spectroscopic properties, their spatial resolution is relatively insufficient compared with multispectral and panchromatic (PAN) images due to the inevitable tradeoff between spectral and spatial sensitivities. As a consequence, the image fusion scheme, which combines a low-resolution (LR) HS image with a high-resolution (HR) PAN image, has become an effective and popular approach for improving the spatial resolution of HS images. To acquire HR-HS images, HR-PAN or multispectral images are generally used as reference images. Compared to multispectral images, PAN images usually have higher spatial resolution but less prior knowledge in the spectral domain, which leads to considerable spectral distortion and makes it more challenging to reconstruct high-quality HR-HS images.
In this article, we propose an HS image superresolution method that recovers an HR-HS image from an LR-HS source image and an HR-PAN reference image. The method primarily exploits prior information: specifically, the fact that the foreground and background intensities of each band in an HS image tend to be spatially smooth, provided that the alpha channel for image matting contains most of the image's edge information in a local window. Therefore, we reconstruct HR-HS images by designing a regularization term based on the image matting model, which extracts the spectral information from the foreground and background and the spatial information from the alpha channel. Specifically, two alpha channels are generated for the image matting procedure. The first alpha channel is calculated using a weighted strategy based on the structure tensors of the iteratively generated HR-HS estimate and the PAN image, which ensures the smoothness of the foreground and background in the spatial domain. The second alpha channel is obtained from the PAN image after contrast compression, which introduces edge information into the reconstructed HS image. In this manner, the spatial details of the HS image are enhanced by the PAN image while the spectral accuracy is preserved. Experimental results demonstrate that the proposed method is superior to state-of-the-art superresolution methods in terms of the quality of the fused HS images.
The rest of this article is organized as follows. In Section II, the representative literature on HS image fusion is briefly reviewed. The proposed method is presented in Section III. Section IV lists the experimental results and the comparative analysis of the different fusion methods. Finally, Section V concludes this article.
Related Works
A. Fusion-Based Image Superresolution Approaches
Various HS image superresolution methods have been developed in recent years, and the existing approaches can generally be categorized into four classes: component substitution [10]–[17], matrix factorization [18]–[26], tensor factorization [27]–[30], and other approaches [31]–[34].
Component-substitution-based approaches decompose the HS image's spatial and spectral components by transforming the image into another domain. Then, the spatial component is substituted with the multispectral or PAN image, and finally the HR-HS image is reconstructed with inverse transformation. Most existing component substitution methods are based on the Gram–Schmidt method [10], [11], intensity hue saturation [12], [13], or principal component analysis transformation [14]–[17]. These methods can be efficiently implemented and usually achieve good spatial performance. However, they also lead to serious spectral distortion when the HS image's spatial component is directly substituted with the PAN image.
Matrix-factorization-based superresolution approaches assume that each pixel in an HS or PAN image can be represented by a linear combination of several spectral atoms of the HR-HS image to be obtained. Kawakami et al. [18] fused HS images with RGB images obtained from cameras, with a prior that assumes the coefficients are sparse. Yokoya et al. [19] proposed coupled nonnegative matrix factorization, which estimates HR-HS images from a pair of multispectral and HS images. Grohnfeldt et al. [20] fused HS and multispectral images by constructing LR and HR dictionary pairs based on joint sparse representations. Zhou et al. [21] and Veganzones et al. [22] learned the spectral basis for local patches and solved the problem in a patch-by-patch manner, assuming that the HS image is locally low-rank. Simoes et al. [23] used a subspace representation and enforced spatial smoothness through total variation regularization. Wei et al. [24] formulated the HS fusion procedure as an ill-posed inverse problem, exploiting the sparsity of HS images via subspace learning in the spectral dimension and sparse coding in the spatial dimensions. Dong et al. [25] proposed a nonnegative structured sparse representation (NSSR) approach to promote the nonlocal self-similarities in HR-HS images. Dian and Li [26] utilized subspace-based low tensor multirank regularization (LTMR) for fusion, which exploits the spectral correlations and nonlocal similarities in HR-HS images. These methods achieve good spectral and spatial accuracy but at the cost of high computational complexity, which makes them unsuitable for real-time applications.
Tensor-factorization-based approaches are also utilized in HS superresolution methods. Dian et al. [27] proposed a nonlocal sparse tensor factorization method that generates HR-HS images using dictionaries containing several modes and core tensors. Li et al. [28] conducted sparse tensor factorization for HS and multispectral images simultaneously to solve the fusion problem. Chang et al. [29] designed different sparsity regularization parameters for core tensor values in a low-rank tensor recovery procedure. Zhang et al. [30] proposed a graph-regularized low-rank Tucker decomposition approach, which combines spectral smoothness from HS images and spatial consistency from multispectral images. These methods convert conventional images to 4-D or higher order tensors without loss of information and reconstruct HS images based on prior knowledge, such as sparsity and nonlocal similarity. However, tensor-factorization-based methods have limited representation ability, which can lead to a sharp deterioration at higher downsampling rates.
There are other fusion methods that estimate HS images with appropriate priors. Wei et al. [31] proposed an efficient Bayesian fusion framework that solves an underlying Sylvester equation associated with a Gaussian prior, which has the advantage of decreasing the computational complexity of HS image fusion. Qu et al. [32] designed an unsupervised deep convolutional neural network (CNN) that obtains representations in a sparse Dirichlet distribution. Dian et al. [33] used deep priors learned by residual-learning-based CNNs and reconstructed HR-HS images by solving optimization problems. Xie et al. [35] reconstructed HR-HS images by obtaining linearly transformed HR multispectral images and residual images within a deep learning framework. Wang et al. [36] proposed a blind fusion model that can improve the reconstruction quality without knowing the priors of spectral and spatial degradation. Zhu et al. [34] proposed a progressive CNN that learns high-frequency spatial details from an HR zero-centric residual image. These deep-learning-based approaches are data adaptive and can boost reconstruction performance, but their computational burdens are high and additional hardware support is needed for implementation.
B. Image Matting Model
The image matting model [37] was originally proposed to extract the foreground and background from an input image, which can be expressed as follows:
\begin{equation*}
I_m=\alpha _m F_m +(1-\alpha _m) B_m \tag{1}
\end{equation*}
where $I_m$ denotes the observed intensity at pixel $m$, $F_m$ and $B_m$ denote the corresponding foreground and background intensities, and $\alpha _m \in [0,1]$ is the alpha channel value that blends them.
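As a toy numerical illustration of (1) (our own example in NumPy, not from [37]): when the foreground and background are smooth, all edge information in the composite comes from the alpha channel.
\begin{verbatim}
import numpy as np

# Toy illustration of the matting model (1): smooth F and B, with the
# edge carried entirely by the alpha channel (values are arbitrary).
h, w = 64, 64
F = np.full((h, w), 0.8)           # smooth foreground intensities
B = np.full((h, w), 0.2)           # smooth background intensities
alpha = np.zeros((h, w))
alpha[:, w // 2:] = 1.0            # a single vertical step edge in alpha

I = alpha * F + (1 - alpha) * B    # composite image, Eq. (1)
assert abs(I[0, 0] - 0.2) < 1e-12 and abs(I[0, -1] - 0.8) < 1e-12
\end{verbatim}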
In [38], Kang et al. proposed a multispectral pansharpening framework based on the local linear assumption of the matting model. The alpha channel is generated from the HR-PAN image using contrast compression. Then, the LR foreground and background in each band of the LR multispectral image are estimated with a downsampled alpha channel. The smoothed HR foreground and background are acquired by interpolating the LR foreground and background, respectively. Finally, the HR multispectral image is obtained by combining the HR foreground and background with the alpha channel. This method is simple and effective, but spectral distortion is introduced during the interpolation procedure. Dong et al. [39] proposed a matting-model-based fusion scheme, in which the alpha channel is generated from the HR-PAN image and the interpolated HS image. However, spectral distortion still occurs during interpolation, which significantly affects the accuracy of the obtained HR-HS image. Consequently, matting-model-based component substitution is an effective approach to HS image fusion, but it is necessary to reduce the spectral distortion from the interpolation during fusion procedures.
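The pipeline of [38], as summarized above, can be sketched as follows. This is a schematic reimplementation based on our reading: the local averaging used to estimate the LR foreground and background, the value of rho, and the bicubic upsampling via scipy.ndimage.zoom are illustrative stand-ins, not the authors' exact procedure.
\begin{verbatim}
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def compress_contrast(img, rho=0.05):
    # Contrast compression: map intensities linearly into [rho, 1 - rho].
    return (1 - 2 * rho) * img / img.max() + rho

def matting_pansharpen(ms_lr, pan_hr, scale, rho=0.05, sigma=2.0):
    """ms_lr: (h, w, C) LR multispectral; pan_hr: (h*scale, w*scale) PAN."""
    alpha_hr = compress_contrast(pan_hr, rho)   # HR alpha from the PAN image
    alpha_lr = alpha_hr[::scale, ::scale]       # downsampled alpha channel
    Hh, Wh = pan_hr.shape
    out = np.empty((Hh, Wh, ms_lr.shape[2]))
    for c in range(ms_lr.shape[2]):
        z = ms_lr[:, :, c]
        # Crude local estimates of the LR foreground/background; the
        # framework solves a regularized least-squares problem instead.
        f_lr = gaussian_filter(z * alpha_lr, sigma) \
               / gaussian_filter(alpha_lr, sigma)
        b_lr = gaussian_filter(z * (1 - alpha_lr), sigma) \
               / gaussian_filter(1 - alpha_lr, sigma)
        # F and B are smooth, so interpolating them to HR is benign.
        f_hr = zoom(f_lr, scale, order=3)
        b_hr = zoom(b_lr, scale, order=3)
        out[:, :, c] = alpha_hr * f_hr + (1 - alpha_hr) * b_hr
    return out
\end{verbatim}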
Proposed Method
In this section, the proposed superresolution method is presented. It consists of three components: the observation model, the regularization term based on image matting, and the optimization problem. Each component is described in detail below.
A. Observation Model
An HR-HS image can be represented as a matrix ${\bf Z}\in \mathbb {R}^{L\times N}$, where $L$ is the number of spectral bands and $N$ is the number of pixels in the spatial domain. The observed LR-HS image ${\bf X}$ is then modeled as a spatially degraded version of ${\bf Z}$
\begin{equation*}
{\bf X}={\bf Z} {\bf H} \tag{2}
\end{equation*}
where ${\bf H}$ is the spatial degradation matrix representing blurring and downsampling. Similarly, the observed HR-PAN image ${\bf Y}$ is modeled as a spectrally degraded version of ${\bf Z}$
\begin{equation*}
{\bf Y}={\bf P} {\bf Z} \tag{3}
\end{equation*}
where ${\bf P}$ is the spectral response matrix of the PAN sensor. The sparsity prior assumes that each pixel ${\boldsymbol z}_i$ of the target HR-HS image ${\bf Z}$ can be sparsely represented over a spectral dictionary ${\bf D}=[{\boldsymbol d}_1,\ldots,{\boldsymbol d}_K]$, that is
\begin{equation*}
\boldsymbol z_i={\bf D} {\boldsymbol a}_i \qquad i=1,2,\ldots,N \tag{4}
\end{equation*}
where ${\boldsymbol a}_i$ is the sparse coefficient vector of the $i$th pixel. Collecting the coefficients into ${\bf A}=[{\boldsymbol a}_1,\ldots,{\boldsymbol a}_N]$ and substituting (4) into (2) yields
\begin{equation*}
{\bf X}={\bf Z} {\bf H}={\bf D} {\bf A} {\bf H}={\bf D} {\bf B} \tag{5}
\end{equation*}
where ${\bf B}={\bf A}{\bf H}$. Similarly, substituting (4) into (3) yields
\begin{equation*}
{\bf Y}={\bf P} {\bf Z}={\bf P} {\bf D} {\bf A}={\bf \Psi } {\bf A} \tag{6}
\end{equation*}
where ${\bf \Psi }={\bf P}{\bf D}$. The dictionary ${\bf D}$ and the coefficient matrix ${\bf B}$ can be learned from the observed LR-HS image ${\bf X}$ by solving the nonnegative sparse coding problem
\begin{align*}
({\bf D},{\bf B})=&\underset{{\bf D},{\bf B}}{\arg \min } \frac{1}{2} \Vert {\bf X - \bf D \bf B}\Vert _{F} ^{2} + \lambda \Vert {\bf B}\Vert _{1} \\
&{\text {s.t.}}\; \qquad {\boldsymbol b_i} \geq {0}, \boldsymbol d_k \geq {0} \tag{7}
\end{align*}
where ${\boldsymbol b}_i$ and ${\boldsymbol d}_k$ denote the columns of ${\bf B}$ and ${\bf D}$, respectively, and $\lambda$ balances data fidelity against the sparsity of the coefficients.
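For concreteness, a minimal sketch of the sparse coding half of (7) with the dictionary held fixed, solved by projected ISTA, is given below; the solver, step size, and problem sizes are our illustrative choices, and the paper does not prescribe this particular algorithm.
\begin{verbatim}
import numpy as np

def nonneg_sparse_code(X, D, lam=0.01, n_iter=200):
    """Sparse coding half of Eq. (7) with D fixed: solve
    min_B 0.5*||X - D B||_F^2 + lam*||B||_1  s.t.  B >= 0
    by projected ISTA."""
    B = np.zeros((D.shape[1], X.shape[1]))
    step = 1.0 / np.linalg.norm(D, 2) ** 2      # 1 / Lipschitz constant
    for _ in range(n_iter):
        B = B - step * (D.T @ (D @ B - X))      # gradient step
        B = np.maximum(B - step * lam, 0.0)     # soft-threshold, project B >= 0
    return B

# Toy usage with synthetic nonnegative data (illustrative sizes only).
rng = np.random.default_rng(0)
D = np.abs(rng.standard_normal((31, 60)))       # L = 31 bands, K = 60 atoms
X = D @ np.abs(rng.standard_normal((60, 100)))  # N = 100 pixels
B = nonneg_sparse_code(X, D)
\end{verbatim}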
B. Regularization Based on Image Matting
According to the image matting model presented in (1), an HR-HS image can be separated into three parts: the HS foreground, the HS background, and the alpha channel. If the alpha channel contains most of the edge information in a local window, the foreground and background will be spatially smooth. However, the spatial distribution of the edges in the HR-HS image estimated from the sparse representation in (5) and (6) is not always consistent with that of the PAN reference image, which degrades the spatial quality of the reconstruction.
In this article, we design a regularization term based on image matting to overcome this problem. Two alpha channels are generated to reduce the inconsistency between the estimated HR-HS image and the PAN reference image. The first alpha channel is used to extract the smooth HS foreground and background from the estimated HR-HS image, whereas the second alpha channel is combined with the extracted foreground and background to introduce edge information from the PAN image into the HR-HS image.
The first alpha channel is computed based on the structure tensors of the estimated HR-HS image and the observed PAN image so that a spatially smooth HS foreground and background can be obtained. The underlying weighted image can be described as
\begin{equation*}
{\bf I}_{\text {w}}=\boldsymbol w_1\cdot {\bf I}_{\text {syn}}+\boldsymbol w_2\cdot {\bf Y} \tag{8}
\end{equation*}
where $\boldsymbol w_1$ and $\boldsymbol w_2$ are pixelwise weights, $\cdot$ denotes elementwise multiplication, and ${\bf I}_{\text {syn}}$ is the synthetic PAN image generated from the current estimate of the HR-HS image
\begin{equation*}
{\bf I}_{\text {syn}}={\bf P}{\bf Z}. \tag{9}
\end{equation*}
Clearly, ${\bf I}_{\text {syn}}$ carries the spatial structures of the current HR-HS estimate, whereas ${\bf Y}$ carries those of the observed PAN image. For each pixel $i$, the initial structure tensor of ${\bf I}_{\text {syn}}$ is constructed from its horizontal and vertical derivatives ${\bf I}_{Dx}$ and ${\bf I}_{Dy}$ as
\begin{equation*}
\hat{\bf T}_{\text {syn},i}= \begin{bmatrix}I_{Dx,i}^2 & I_{Dx,i} I_{Dy,i} \\
I_{Dx,i} I_{Dy,i} & I_{Dy,i}^2 \end{bmatrix} \tag{10}
\end{equation*}
\begin{align*}
{\bf G}_{xx}=&{\boldsymbol g}_\sigma \times ({\bf I}_{Dx}\cdot {\bf I}_{Dx}) \\
{\bf G}_{xy}=&{\boldsymbol g}_\sigma \times ({\bf I}_{Dx}\cdot {\bf I}_{Dy}) \\
{\bf G}_{yy}=&{\boldsymbol g}_\sigma \times ({\bf I}_{Dy}\cdot {\bf I}_{Dy}) \tag{11}
\end{align*}
where ${\boldsymbol g}_\sigma$ is a Gaussian kernel with standard deviation $\sigma$, $\times$ denotes convolution, and $\cdot$ denotes elementwise multiplication. The smoothed structure tensor at pixel $i$ is then
\begin{equation*}
{\bf T}_{\text {syn},i}= \begin{bmatrix}G_{xx,i} & G_{xy,i} \\
G_{xy,i} & G_{yy,i} \end{bmatrix}. \tag{12}
\end{equation*}
The matrix ${\bf T}_{\text {syn},i}$ is symmetric and positive semidefinite; therefore, it can be factorized with the eigenvalue decomposition
\begin{equation*}
{\bf T}_{\text {syn},i}= \begin{bmatrix}\boldsymbol v_{1,i} & \boldsymbol v_{2,i} \end{bmatrix} \begin{bmatrix}\lambda _{\text {syn}1,i} & 0 \\
0 & \lambda _{\text {syn}2,i} \end{bmatrix} \begin{bmatrix}\boldsymbol v_{1,i}^{\mathrm{T}}\\
\boldsymbol v_{2,i}^{\mathrm{T}} \end{bmatrix} \tag{13}
\end{equation*}
where $\boldsymbol v_{1,i}$ and $\boldsymbol v_{2,i}$ are the eigenvectors, and the corresponding eigenvalues are
\begin{align*}
\lambda _{\text {syn}1,i}=\frac{1}{2}\left[G_{xx,i}+G_{yy,i}+\sqrt{(G_{xx,i}-G_{yy,i})^2+4G_{xy,i}^2}\right] \\
\lambda _{\text {syn}2,i}=\frac{1}{2}\left[G_{xx,i}+G_{yy,i}-\sqrt{(G_{xx,i}-G_{yy,i})^2+4G_{xy,i}^2}\right]. \tag{14}
\end{align*}
Similarly, for the observed PAN image ${\bf Y}$, the smoothed structure tensor ${\bf T}_{{\text {Y}},i}$ and its larger eigenvalue $\lambda _{\text {Y}1,i}$ are computed at each pixel. Since the larger eigenvalue reflects the local gradient strength, the weights in (8) are set to
\begin{align*}
w_{1,i}=\frac{\lambda _{\text {syn}1,i}}{\lambda _{\text {syn}1,i}+\lambda _{\text {Y}1,i}} \\
w_{2,i}=\frac{\lambda _{\text {Y}1,i}}{\lambda _{\text {syn}1,i}+\lambda _{\text {Y}1,i}}. \tag{15}
\end{align*}
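The computation of (10)–(15) can be sketched in a few lines; the derivative operator (np.gradient), the Gaussian width sigma, and the eps safeguard against flat regions are our implementation choices.
\begin{verbatim}
import numpy as np
from scipy.ndimage import gaussian_filter

def tensor_eigval(img, sigma=1.0):
    """Larger eigenvalue of the smoothed structure tensor, Eqs. (10)-(14)."""
    Iy, Ix = np.gradient(img)                    # I_Dy, I_Dx
    Gxx = gaussian_filter(Ix * Ix, sigma)        # Eq. (11)
    Gxy = gaussian_filter(Ix * Iy, sigma)
    Gyy = gaussian_filter(Iy * Iy, sigma)
    disc = np.sqrt((Gxx - Gyy) ** 2 + 4.0 * Gxy ** 2)
    return 0.5 * (Gxx + Gyy + disc)              # lambda_1 of Eq. (14)

def fusion_weights(I_syn, Y, sigma=1.0, eps=1e-12):
    """Pixelwise weights of Eq. (15); eps guards flat regions (our choice)."""
    l_syn = tensor_eigval(I_syn, sigma)
    l_y = tensor_eigval(Y, sigma)
    w1 = l_syn / (l_syn + l_y + eps)
    return w1, 1.0 - w1

# The weighted image of Eq. (8) is then I_w = w1 * I_syn + w2 * Y.
\end{verbatim}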
In this manner, the first alpha channel $\boldsymbol \alpha _1$ adaptively combines the structural information of the current estimate and the PAN image. It is generated from the weighted image ${\bf I}_{\text {w}}$ by contrast compression
\begin{equation*}
\alpha _{1,i}=\frac{(1-2\rho)I_{{\text {w}},i}}{I_{\text {w,max}}}+\rho \tag{16}
\end{equation*}
where $\rho \in (0,0.5)$ controls the degree of contrast compression and $I_{\text {w,max}}$ is the maximum value of ${\bf I}_{\text {w}}$. Similarly, the second alpha channel $\boldsymbol \alpha _2$ is generated directly from the observed PAN image ${\bf Y}$ after contrast compression
\begin{equation*}
\alpha _{2,i}=\frac{(1-2\rho)Y_i}{Y_{\text {max}}}+\rho \tag{17}
\end{equation*}
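Both alpha channels reduce to the same contrast-compression mapping applied to different inputs, which can be written compactly as follows (rho = 0.05 is an illustrative value, not the paper's setting).
\begin{verbatim}
import numpy as np

def contrast_compress(img, rho=0.05):
    """Alpha channel by contrast compression, Eqs. (16)-(17): map the
    intensities linearly into [rho, 1 - rho]."""
    return (1.0 - 2.0 * rho) * img / img.max() + rho

# alpha1 = contrast_compress(I_w, rho); alpha2 = contrast_compress(Y, rho)
\end{verbatim}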
For the $c$th band of the HR-HS image ${\bf Z}$, the matting model in (1) can be rewritten as
\begin{equation*}
Z^{(i,c)}=\alpha _{1,i}F^{(i,c)}+(1-\alpha _{1,i})B^{(i,c)} \tag{18}
\end{equation*}
where $F^{(i,c)}$ and $B^{(i,c)}$ denote the HS foreground and background intensities of the $i$th pixel in the $c$th band. The spatially smooth foreground and background are extracted with the first alpha channel by solving
\begin{align*}
(F^{(i,c)},B^{(i,c)})=&\underset{F^{(i,c)},B^{(i,c)}}{\arg \min } \sum _{i=1}^N \sum _{c=1}^L [\alpha _{1,i}F^{(i,c)} \\
&{+}\;(1-\alpha _{1,i})B^{(i,c)}-Z^{(i,c)}]^2 \\
&{+}\;|\alpha _{1,ix}|[(F_x^{(i,c)})^2+(B_x^{(i,c)})^2] \\
&{+}\;|\alpha _{1,iy}|[(F_y^{(i,c)})^2+(B_y^{(i,c)})^2] \tag{19}
\end{align*}
where the subscripts $x$ and $y$ denote the horizontal and vertical derivatives, respectively. The reference image ${\bf U}$, which carries the edge information of the PAN image, is then synthesized with the second alpha channel as
\begin{equation*}
U^{(i,c)}=\alpha _{2,i}F^{(i,c)}+(1-\alpha _{2,i})B^{(i,c)} \tag{20}
\end{equation*}
Finally, the image-matting-based regularization term is defined as
\begin{equation*}
\phi ({\bf A}) = \Vert {\bf D A} - {\bf U}\Vert _{F}^{2}. \tag{21}
\end{equation*}
The algorithm for generating the regularization term using structure-tensor-based image matting is summarized in Algorithm 1.
Algorithm 1: Regularization Term Generation.
Input: current HR-HS estimate ${\bf Z}$, PAN image ${\bf Y}$, spectral response matrix ${\bf P}$, and compression parameter $\rho$.
Compute ${\bf I}_{\text {syn}}$ by (9).
for each pixel $i$ do
Compute ${\bf T}_{\text {syn},i}$ and ${\bf T}_{{\text {Y}},i}$ by (10)–(12).
Compute $w_{1,i}$ and $w_{2,i}$ by (14) and (15), and $I_{{\text {w}},i}$ by (8).
Compute $\alpha _{1,i}$ by (16) and $\alpha _{2,i}$ by (17).
end for
for each band $c$ do
Compute $F^{(i,c)}$ and $B^{(i,c)}$ by (19).
Compute $U^{(i,c)}$ by (20).
end for
Compute $\phi ({\bf A})$ by (21).
Output: reference image ${\bf U}$ and regularization term $\phi ({\bf A})$.
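The extraction step (19) used in Algorithm 1 is, for each band, a linear least-squares problem in $F$ and $B$. A sparse-matrix sketch for one band follows; the forward-difference operators and the small Tikhonov term eps, which keeps the normal equations nonsingular in flat regions, are our implementation choices.
\begin{verbatim}
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def extract_fb(z, alpha, eps=1e-6):
    """Smooth foreground/background of one band by solving Eq. (19).
    z, alpha: (h, w) arrays; eps is a numerical stabilizer (our choice)."""
    h, w = z.shape
    n = h * w
    d1 = lambda m: sp.diags([-1, 1], [0, 1], shape=(m - 1, m))
    Dx = sp.kron(sp.eye(h), d1(w))                 # horizontal differences
    Dy = sp.kron(d1(h), sp.eye(w))                 # vertical differences
    a = alpha.ravel()
    Wx = sp.diags(np.abs(Dx @ a))                  # |alpha_x| weights
    Wy = sp.diags(np.abs(Dy @ a))                  # |alpha_y| weights
    L = Dx.T @ Wx @ Dx + Dy.T @ Wy @ Dy            # smoothness on F and B
    M = sp.hstack([sp.diags(a), sp.diags(1 - a)])  # data term of Eq. (18)
    Q = (M.T @ M + sp.block_diag([L, L]) + eps * sp.eye(2 * n)).tocsc()
    u = spsolve(Q, M.T @ z.ravel())
    return u[:n].reshape(h, w), u[n:].reshape(h, w)  # F, B
\end{verbatim}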
C. Optimization Problem
According to (2) and (3), the problem of reconstructing the HR-HS image ${\bf Z}$ from the observed LR-HS image ${\bf X}$ and the PAN image ${\bf Y}$ can be formulated as
\begin{equation*}
{\bf Z}=\underset{{\bf Z}}{\arg \min } \Vert {\bf Y} - {\bf P} {\bf Z}\Vert _{F} ^{2} + \eta \Vert {\bf X}- {\bf Z}{\bf H}\Vert _{F} ^{2}. \tag{22}
\end{equation*}
Since (22) is an ill-posed problem, other constraints should be introduced to arrive at a stable solution. Equation (4) demonstrates that each pixel of ${\bf Z}$ can be sparsely represented over the dictionary ${\bf D}$; therefore, the sparsity constraint on the coefficient matrix ${\bf A}$ is added, and the problem becomes
\begin{align*}
{\bf Z}=&\underset{{\bf Z}}{\arg \min } \Vert {\bf Y} - {\bf P D A}\Vert _{F} ^{2} + \eta _1 \Vert {\bf X}- {\bf D A H}\Vert _{F} ^{2} \\
&{+}\;\eta _2 \Vert {\bf A}\Vert _1 \qquad \text {s.t.} \qquad {\boldsymbol a_i} \geq {0}. \tag{23}
\end{align*}
Furthermore, image-matting-based regularization is also utilized to reconstruct the edge information of the HR-HS image in the spatial domain, described as
\begin{align*}
{\bf Z}=&\underset{{\bf Z}}{\arg \min } \Vert {\bf Y} - {\bf P D A}\Vert _{F} ^{2} + \eta _1 \Vert {\bf X}- {\bf D A H}\Vert _{F} ^{2} \\
&{+}\;\eta _2 \Vert {\bf A}\Vert _1 + \eta _3 \phi ({\bf A}) \\
=&\underset{{\bf Z}}{\arg \min } \Vert {\bf Y} - {\bf P D A}\Vert _{F} ^{2} + \eta _1 \Vert {\bf X}- {\bf D A H}\Vert _{F} ^{2} \\
&{+}\;\eta _2 \Vert {\bf A}\Vert _1 + \eta _3 \Vert {\bf D A} - {\bf U}\Vert _{F}^{2} \qquad \text {s.t.} \qquad {\boldsymbol a_i} \geq {0}. \tag{24}
\end{align*}
Equation (24) is convex and can be solved using the alternating direction method of multipliers (ADMM). By introducing the auxiliary variable ${\bf S}$, subject to ${\bf S}={\bf A}$ and ${\bf D}{\bf S}={\bf Z}$, together with the Lagrangian multipliers ${\bf V}_1$ and ${\bf V}_2$, the augmented Lagrangian function of (24) can be written as
\begin{align*}
&L_{\mu }({\bf A},{\bf Z},{\bf S},{\bf V}_1,{\bf V}_2) \\
&=\Vert {\bf Y} - {\bf \Psi S}\Vert _{F} ^{2} + \eta _1 \Vert {\bf X}- {\bf Z H}\Vert _{F} ^{2} \\
&{+}\;\eta _2 \Vert {\bf A}\Vert _1 + \eta _3 \Vert {\bf D S -U}\Vert _{F}^{2} \\
&{+}\;\mu \left\Vert {\bf S-A}+\frac{{\bf V}_1}{2\mu }\right\Vert _{F}^{2} + \mu \left\Vert {\bf D S-Z}+\frac{{\bf V}_2}{2\mu }\right\Vert _{F}^{2} \\
&{\text {s.t.}}\; \qquad {\boldsymbol a_i} \geq {0} \tag{25}
\end{align*}
where $\mu$ is the penalty parameter. Minimizing (25) with respect to ${\bf A}$, ${\bf Z}$, and ${\bf S}$ in turn yields the updates
\begin{align*}
{\bf A}^{(t+1)}=&\left[{\text {Soft}}\left({\bf S}^{(t)}+\frac{{\bf V}_1^{(t)}}{2\mu },\frac{\eta _2}{2\mu }\right)\right]_{+} \\
{\bf Z}^{(t+1)}\!=\!&\left[{\bf X H}^{\mathrm{T}} \!+\!\frac{\mu }{\eta _1}\left({\bf D S}^{(t)} \!+\! \frac{{\bf V}_2^{(t)}}{2\mu }\right)\right] \left({\bf H H}^{\mathrm{T}} \!+\! \frac{\mu }{\eta _1} {\bf I}\right)^{-1} \\
{\bf S}^{(t+1)}=&\left[{\bf \Psi }^{\mathrm{T}}{\bf \Psi }+\mu {\bf I}+(\eta _3+\mu){\bf D}^{\mathrm{T}}{\bf D}\right]^{-1}\Bigg[{\bf \Psi }^{\mathrm{T}}{\bf Y} \\
&{+}\;\eta _3{\bf D}^{\mathrm{T}}{\bf U} + \mu \left({\bf A}^{(t)}-\frac{{\bf V}_1^{(t)}}{2\mu }\right) \\
&{+}\;\mu {\bf D}^{\mathrm{T}}\left({\bf Z}^{(t)}-\frac{{\bf V}_2^{(t)}}{2\mu }\right)\Bigg]. \tag{26}
\end{align*}
where ${\text {Soft}}(\cdot,\tau)$ denotes the elementwise soft-thresholding operator with threshold $\tau$ and $[\cdot]_{+}$ denotes projection onto the nonnegative orthant. The Lagrangian multipliers are updated by
\begin{align*}
{\bf V}_1^{(t+1)}=&{\bf V}_1^{(t)} + \mu ({\bf S}^{(t+1)} - {\bf A}^{(t+1)}) \\
{\bf V}_2^{(t+1)}=&{\bf V}_2^{(t)} + \mu ({\bf D S}^{(t+1)} - {\bf Z}^{(t+1)}). \tag{27}
\end{align*}
In practice, the HR-HS image ${\bf Z}$ is reconstructed by alternately performing the updates in (26) and (27) until convergence or until a maximum number of iterations $T$ is reached. The complete procedure is summarized in Algorithm 2.
Algorithm 2: HR-HS Image Reconstruction.
Input: LR-HS image ${\bf X}$, PAN image ${\bf Y}$, degradation matrices ${\bf P}$ and ${\bf H}$, and parameters $\eta _1$, $\eta _2$, $\eta _3$, $\mu$, and $\rho$.
Initialize ${\bf D}$ by solving (7), and initialize ${\bf A}$, ${\bf S}$, ${\bf Z}$, ${\bf V}_1$, and ${\bf V}_2$.
for $t=0,1,\ldots,T-1$ do
Compute ${\bf U}$ using Algorithm 1.
Update ${\bf A}^{(t+1)}$ by (26).
Update ${\bf Z}^{(t+1)}$ and ${\bf S}^{(t+1)}$ by (26).
Update ${\bf V}_1^{(t+1)}$ and ${\bf V}_2^{(t+1)}$ by (27).
end for
Output: reconstructed HR-HS image ${\bf Z}$.
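A compact NumPy sketch of the ADMM loop in (26) and (27) follows. The dense inverses are precomputed for clarity, which is only feasible at toy sizes; the zero initializations and array shapes are illustrative assumptions.
\begin{verbatim}
import numpy as np

def soft(x, tau):
    # Elementwise soft-thresholding used in the A-update of (26).
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def admm_fusion(X, Y, D, P, Hm, U, eta1, eta2, eta3, mu, n_iter=50):
    """ADMM loop of Eqs. (25)-(27). Shapes (toy sizes): X (L, n) LR-HS,
    Y (p, N) PAN, D (L, K) dictionary, P (p, L) spectral response,
    Hm (N, n) spatial degradation, U (L, N) matting reference image."""
    L, K = D.shape
    N = Hm.shape[0]
    Psi = P @ D
    A = np.zeros((K, N)); S = np.zeros((K, N)); Z = np.zeros((L, N))
    V1 = np.zeros((K, N)); V2 = np.zeros((L, N))
    # Dense system inverses for the Z- and S-updates of (26).
    Mz = np.linalg.inv(Hm @ Hm.T + (mu / eta1) * np.eye(N))
    Ms = np.linalg.inv(Psi.T @ Psi + mu * np.eye(K)
                       + (eta3 + mu) * D.T @ D)
    for _ in range(n_iter):
        A = np.maximum(soft(S + V1 / (2 * mu), eta2 / (2 * mu)), 0.0)
        Z = (X @ Hm.T + (mu / eta1) * (D @ S + V2 / (2 * mu))) @ Mz
        # (26) as printed uses A^(t) and Z^(t) here; we use the freshest
        # iterates, a common Gauss-Seidel-style variant.
        S = Ms @ (Psi.T @ Y + eta3 * (D.T @ U)
                  + mu * (A - V1 / (2 * mu))
                  + mu * (D.T @ (Z - V2 / (2 * mu))))
        V1 = V1 + mu * (S - A)                      # Eq. (27)
        V2 = V2 + mu * (D @ S - Z)
    return Z
\end{verbatim}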
Experimental Results
In this section, the proposed method's reconstruction performance is illustrated using simulated datasets. To verify the superiority of the proposed method, five recent state-of-the-art methods for HS image superresolution are used for comparison, including the subspace-based method (HySure) [23], NSSR [25], fast fusion based on solving the Sylvester equation (R-FUSE) [44], LTMR [26], and the deep-learning-based HS image sharpening method (DHSIS) [33].
To objectively compare the performance of these superresolution methods, the following six quantitative metrics are utilized to measure the quality of the fusion results.
Correlation Coefficient (CC): The CC [45] indicates the correlation degree between two images, which is defined as follows:
\begin{align*}
{\text {CC}}=\frac{1}{L} \sum _{c=1}^{L} \frac{\sum _{i=1}^{N} [Z_{\text {ref}}^{(i,c)} - \overline{Z}_{\text {ref}}^{c}] [Z^{(i,c)} - \overline{Z}^{c}]}{\sqrt{\sum _{i=1}^{N} [Z_{\text {ref}}^{(i,c)} - \overline{Z}_{\text {ref}}^{c}]^2 \sum _{i=1}^{N} [Z^{(i,c)} - \overline{Z}^{c}]^2 }} \tag{28}
\end{align*}
where $L$ is the number of bands and $N$ is the number of pixels in the HR-HS image's spatial domain. $Z^{(i,c)}$ and $Z_{\text {ref}}^{(i,c)}$ refer to the $i$th pixel value in the $c$th band of the fused HR-HS image and the reference ground-truth HR-HS image, respectively. $\overline{Z}^{c}$ and $\overline{Z}_{\text {ref}}^{c}$ refer to the mean value of the $c$th band in the fused HS image and the reference HS image, respectively.
Spectral Angle Mapper (SAM): The SAM [46] denotes the absolute value of the spectral angle between two spectral vectors, which is defined as follows:
\begin{equation*}
{\text {SAM}}=\frac{1}{N} \sum _{i=1}^{N} \arccos \left(\frac{\langle {\boldsymbol z}_{\text {ref}}^{i},{\boldsymbol z}^{i} \rangle }{\Vert {\boldsymbol z}_{\text {ref}}^{i}\Vert _{2} \cdot \Vert {\boldsymbol z}^{i}\Vert _{2}}\right) \tag{29}
\end{equation*}
where ${\boldsymbol z}^{i}$ and ${\boldsymbol z}_{\text {ref}}^{i}$ denote the spectral vectors of the $i$th pixel in the fused HS image and the reference HS image, respectively. This index reflects the spectral distortion between the two images in terms of absolute angles.
Root-Mean-Squared Error (RMSE): The RMSE index measures the standard difference between two images as follows:
\begin{equation*}
{\text {RMSE}}=\frac{1}{NL} \sum _{c=1}^{L} \sqrt{\sum _{i=1}^{N} (Z_{\text {ref}}^{(i,c)} - Z^{(i,c)})^{2}}. \tag{30}
\end{equation*}
Clearly, a fused image that is closer to the reference HS image leads to a smaller RMSE value.
Erreur Relative Globale Adimensionnelle de Synthèse (ERGAS): ERGAS [47] measures the global quality of the fused image, which is defined as follows:
\begin{equation*}
{\text {ERGAS}}=100 \sqrt{\frac{n}{N}} \sqrt{\frac{1}{L} \sum _{c=1}^{L}\left(\frac{{\text {RMSE}}^{c}}{\overline{Z}_{\text {ref}}^{c}}\right)^{2}} \tag{31}
\end{equation*}
where $n$ is the number of pixels in the LR-HS image's spatial domain, and ${\text {RMSE}}^{c}$ refers to the RMSE value of the $c$th band between the fused HR-HS image and the reference HR-HS image.
Peak Signal-to-Noise Ratio (PSNR): The PSNR for an HR-HS image is defined as follows:
\begin{equation*}
{\text {PSNR}}=-\frac{10}{L} \sum _{c=1}^{L} \log ({\text {RMSE}}^{c}). \tag{32}
\end{equation*}
Clearly, the PSNR definition for HR-HS images is an average of the PSNR values for the 2-D images for all bands.
Universal Image Quality Index (UIQI): The UIQI [48] is calculated on sliding windows of size $32 \times 32$ and averaged across all windows. The UIQI for two windows $\boldsymbol a$ and $\boldsymbol b$ is given by
\begin{equation*}
Q({\boldsymbol a},{\boldsymbol b})=\frac{4\sigma _{\boldsymbol{ab}}}{\sigma _{\boldsymbol a}^{2}+\sigma _{\boldsymbol b}^{2}}\, \frac{\mu _{\boldsymbol a} \mu _{\boldsymbol b}}{\mu _{\boldsymbol a}^{2}+\mu _{\boldsymbol b}^{2}} \tag{33}
\end{equation*}
where $\sigma _{\boldsymbol{ab}}$ is the sample covariance between $\boldsymbol a$ and $\boldsymbol b$, and $\sigma _{\boldsymbol a}$ and $\mu _{\boldsymbol a}$ denote the standard deviation and the mean value of $\boldsymbol a$, respectively.
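The five scalar metrics (28)–(32) can be computed as follows. This is a sketch under common conventions, which the paper does not spell out: SAM is reported in degrees, pixel values are normalized to $[0,1]$ so that PSNR uses MAX = 1, and the per-band RMSE values are averaged.
\begin{verbatim}
import numpy as np

def fusion_metrics(Z, Z_ref, d):
    """CC, SAM, RMSE, ERGAS, and PSNR of Eqs. (28)-(32).
    Z, Z_ref: (N, L) arrays (pixels x bands) scaled to [0, 1];
    d: downsampling rate, so that n/N = 1/d**2 in Eq. (31)."""
    zc = Z - Z.mean(axis=0)
    rc = Z_ref - Z_ref.mean(axis=0)
    cc = np.mean((rc * zc).sum(0)
                 / np.sqrt((rc ** 2).sum(0) * (zc ** 2).sum(0)))
    # SAM in degrees (a common convention; Eq. (29) leaves units open).
    cosang = (Z * Z_ref).sum(1) / (np.linalg.norm(Z, axis=1)
                                   * np.linalg.norm(Z_ref, axis=1))
    sam = np.degrees(np.mean(np.arccos(np.clip(cosang, -1.0, 1.0))))
    rmse_c = np.sqrt(((Z_ref - Z) ** 2).mean(0))   # per-band RMSE
    rmse = rmse_c.mean()                           # averaged over bands
    ergas = (100.0 / d) * np.sqrt(np.mean((rmse_c / Z_ref.mean(0)) ** 2))
    # Per-band PSNR with MAX = 1, averaged over bands (cf. Eq. (32)).
    psnr = np.mean(-20.0 * np.log10(rmse_c))
    return cc, sam, rmse, ergas, psnr
\end{verbatim}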
A. Datasets
To test the effectiveness of the proposed method, three sets of data, i.e., CAVE, Pavia University, and Eagle, are used for the experiments. The spectral response of a Canon 60D camera [49] is used to generate the PAN images from the original HS datasets, whereas the LR-HS images are simulated by applying a Gaussian blur and then downsampling in two spatial dimensions on the original HR-HS images.
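This simulation protocol can be sketched as follows; the Gaussian width and decimation by slicing are illustrative choices, and srf stands for the camera spectral response curve.
\begin{verbatim}
import numpy as np
from scipy.ndimage import gaussian_filter

def simulate_observations(Z_hr, srf, d, sigma=2.0):
    """Simulate the LR-HS and PAN inputs from a ground-truth HR-HS cube.
    Z_hr: (h, w, L) HR-HS image; srf: (L,) spectral response curve
    (e.g., derived from the Canon 60D response [49]); d: downsampling rate."""
    # LR-HS image: band-wise Gaussian blur, then downsampling (cf. (2)).
    blurred = gaussian_filter(Z_hr, sigma=(sigma, sigma, 0))
    X_lr = blurred[::d, ::d, :]
    # PAN image: spectral-response-weighted sum of the bands (cf. (3)).
    Y_pan = Z_hr @ (srf / srf.sum())
    return X_lr, Y_pan
\end{verbatim}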
The CAVE dataset [50] has 32 indoor HS images; each image has $512 \times 512$ pixels and 31 spectral bands covering 400–700 nm at 10-nm intervals.
Illustration of the original HR-HS test images from CAVE, Pavia University, and Eagle. The first row: images from the CAVE dataset. The second row: images from the Pavia University and Eagle dataset. (a) Beads. (b) Toy. (c) Cloth. (d) Food. (e) Peppers. (f) Hairs. (g) Jelly beans. (h) Oilpainting. (i) Pavia University. (j) Eagle.
B. Parameter Discussion
In our method, the key parameter that most impacts the accuracy is
Curves of CC, SAM, and UIQI values at different downsampling rates for Pavia University. (a)
Curves of CC, SAM, and UIQI values with different
There are additional parameters that need to be specified. We have found experimentally that these parameters do not influence the results as much as
The input PAN images are generated from the original HS images with camera spectral sensitivity (CamSpec) [49]. The size of the Gaussian blur kernel used to synthesize the LR-HS images is set to
C. Experimental Results
The CC, SAM, RMSE, ERGAS, and UIQI results for different downsampling rates on the CAVE dataset are reported in Table II. The proposed method outperforms other competing methods and achieves the highest spatial and spectral precision. Fig. 4 shows the reconstructed CAVE-Beads images and the error images produced by different methods. All the test methods can effectively reconstruct the spatial structures of the HS image, but the proposed method performs best in recovering the details of the original image. Specifically, for the downsampling rate
Reconstructed images of CAVE-Beads. The false color images are synthesized by band 5, 10, and 29. The first three rows show the reconstructed images with the downsampling rate
The results of the methods for the Pavia University dataset are presented in Table III and Fig. 5. From Table III, we can see that the proposed method performs significantly better than the other test methods on most of the quality metrics. HySure and R-FUSE achieve good performance with a small downsampling rate.
Reconstructed images of Pavia University. The false color images are synthesized by band 13, 25, and 61. The first three rows show the reconstructed images with the downsampling rate
The quality metrics for the Eagle dataset are shown in Table IV. R-FUSE achieves the highest SAM and UIQI values, whereas the proposed method performs best on the CC, RMSE, and ERGAS metrics. Because the observed area in the Eagle dataset contains less detailed information than that in Pavia University, the performance of the test methods decreases less as the downsampling rate increases. However, the reconstruction quality of LTMR is seriously affected when the downsampling rate becomes large.
Reconstructed images of Eagle. The false color images are synthesized by band 13, 25, and 61. The first three rows show the reconstructed images with the downsampling rate
Fig. 7 shows the CC, RMSE, and UIQI curves as functions of the spectral band over CAVE-Beads for the test methods, which indicates the similarity of spectral reflectance between the fusion results and the reference images. All the test methods can effectively reconstruct the HS image at smaller downsampling rates. The proposed method achieves better results on most of the bands and has a significant advantage in reconstruction performance for the bands corresponding to wavelengths above 610 nm, whose pixel values contain more spatial details.
Curves of (a) CC, (b) RMSE, and (c) UIQI as functions of different spectral bands over CAVE-Beads. From top to bottom: curves with downsampling rates
D. Computational Time
The execution times of the test methods are presented in Table V. All experiments are implemented in MATLAB R2020b on an Intel Xeon W-10885M@2.4-GHz CPU. The MATLAB parallel toolbox is utilized to accelerate the extraction procedure in (19) described in Section III-B. As seen from the table, R-FUSE is the fastest method on the test datasets, benefiting from the high efficiency of its hierarchical Bayesian framework based on the Gaussian prior. DHSIS also demonstrates high efficiency during the test procedure, whereas its training takes considerable computational time. HySure and LTMR are spectral-subspace based; therefore, their computational time primarily depends on the spatial resolution of the reconstructed HS images. The computational times of NSSR and the proposed method are related to both the spatial resolution and the number of spectral bands of the reconstructed HS images. The proposed method requires relatively more computational time; the primary computational burden occurs when solving (19) and (24), which correspond to the image matting procedure and the HR-HS image reconstruction, respectively.
Conclusion
In this article, we propose an effective HS image superresolution method that reconstructs HR-HS images from LR-HS images and PAN images depicting the same scene. Based on the image matting model, a regularization term is designed to preserve the spectral signatures. To introduce spatial details from the PAN image into the image fusion procedure, two alpha channels are constructed for image matting. The first alpha channel is iteratively computed based on the structure tensors from the current HR-HS term and the original PAN image, whereas the second alpha channel is generated from the PAN image using contrast compression. Experimental results on public HS datasets demonstrate that the proposed method achieves better spatial and spectral accuracy on most test images than existing HR-HS recovery methods in the literature.
In future work, the proposed superresolution method can be extended in three directions. First, the alpha channel generation algorithm can be further extended and optimized, especially for the fusion of HS and multispectral images. Second, camera spectral response estimation methods can be incorporated into the proposed method, which is necessary for solving blind fusion problems. Third, more efficient image matting techniques can be introduced, which may significantly improve the computational speed of our method.
ACKNOWLEDGMENT
The authors would like to thank TERN AusCover and the Remote Sensing Centre, Department of Science, Information Technology, Innovation and the Arts, QLD, for providing the Eagle hyperspectral data. The airborne hyperspectral data are available at http://www.auscover.org.au/xwiki/bin/view/Product+pages/Airborne+Hyperspectral.