
MCWESRGAN: Improving Enhanced Super-Resolution Generative Adversarial Network for Satellite Images



Abstract:

With the dynamic technological development, we are witnessing major progress in solutions that allow for the observation of Earth's surface. Small satellites have a significant drawback: due to their limitations, the installed optical systems are not perfect. As a result, the quality of the obtained images is lower, including their spatial resolution, even though the satellites operate in low Earth orbit. In the case of images lacking a high-resolution counterpart, the spatial resolution of the imagery can be improved using single-image super-resolution algorithms. In this article, we present an SISR solution based on a new network called MCWESRGAN, which is a modification of the popular ESRGAN network. We propose a novel strategy that introduces a multi-column discriminator model. The generator model is trained using Wasserstein loss. The introduced modifications enable a tenfold reduction in the training time of the network. The proposed algorithm is verified using images obtained from space, aerial imagery, and the Dataset for Object deTection in Aerial images (DOTA) database. A set of evaluation methods for super-resolution (SR) images is proposed to verify the results; these methods indicate areas that are poorly estimated by the algorithm. Furthermore, as part of the conducted experiments, an absolute assessment method for interpretational potential based on the power spectral density (PSD) of the image is proposed, allowing the magnitude of interpretational improvement after applying resolution enhancement algorithms to be determined. The conducted research demonstrates that we achieve better qualitative and quantitative results than the classical ESRGAN method and other state-of-the-art (SOTA) approaches.
Page(s): 9459 - 9479
Date of Publication: 06 October 2023



SECTION I.

Introduction

In recent years, we have been witnessing dynamic development in space technology, and the possibilities of observing the Earth have been improving every year. The obtained data allow us to conduct analyses that may be applied in numerous fields of science, including change monitoring [1], [2], [3], [4], [5], geospatial analysis [6], [7], automated object detection [8], [9], [10], [11], [12], [13], and harbor monitoring [14], [15], [16].

The possibility of using given imagery is determined by its resolution. High temporal resolution enables, for example, change detection analyses. Spectral resolution determines the scope of potential remote sensing analyses, including land cover analyses [17], [18], [19], [20].

Another important parameter is spatial resolution, expressed as the ground sampling distance (GSD), i.e., the distance between the centers of two adjacent pixels measured on the surface of the Earth. There are numerous methods used to improve the spatial resolution of an image [21], [22], [23], [24]. One of them is pansharpening, which combines the high spatial resolution of a panchromatic image with the spectral information of a multispectral image [25], [26], [27], [28], [29]. This operation results in generating a high-resolution multispectral image. However, the pansharpening method may be used only provided that a high-resolution (HR) image exists. Unfortunately, not all satellites acquire imagery in both the panchromatic and multispectral ranges. In addition, the problem also tends to appear when attempting to improve the resolution of imagery acquired by nanosatellites. Nano- and minisatellites are much smaller, and the detector arrays and telescopes that they carry are imperfect. As a result, the acquired images have much lower spatial resolution. Apart from that, the difference between the spatial resolution of the panchromatic and multispectral images is small, e.g., for the satellites from the SkySat-3 - SkySat-15 constellation it is only 0.16 m. This problem intensifies when it is necessary to increase the spatial resolution of an image that does not have a higher resolution counterpart (e.g., a panchromatic image or a sequence of images). One of the solutions is the application of digital image processing methods. Examples include the Super-Resolution Variable-Pixel Linear Reconstruction algorithm [30] and the quincunx sampling mode named Très Haute Résolution applied to imagery acquired by SPOT5 [31], [32].

Solutions that apply digital image processing are rather seldom used to improve the spatial resolution of satellite imagery, due to the demanding requirements of acquiring satellite scenes or the need to acquire multiple scenes of the same area from more than one satellite at a similar time. Another solution to this problem is the application of algorithms that employ deep neural networks. This group of solutions uses only a trained network model and a low-resolution (LR) image to improve the spatial resolution.

In order to solve the problem of improving satellite imagery based on an LR image, we propose a modification of the enhanced super-resolution generative adversarial network (ESRGAN) architecture [33]: the multicolumn Wasserstein enhanced super-resolution generative adversarial network (MCWESRGAN). It enables training the neural network over 10 times faster and, at the same time, improving the quality of the generated images.

To address the above issues, we made the following improvements.

  1. We employed a multicolumn discriminator, which allowed for a better evaluation of SR images generated by the generator, resulting in up to a tenfold acceleration of the model training process.

  2. We used Wasserstein loss, which allowed for better discrimination between HR and SR images by the discriminator, consequently enabling improved control over the generator. This modification significantly reduces the issue of vanishing gradients and accelerates GAN training.

  3. We paid attention to the quality of training databases. In the case of training single-image super-resolution (SISR) models dedicated to satellite images, it is recommended to create diverse databases, e.g., by using images obtained at different times of the day or by different satellites. By taking care of this aspect, we can eliminate the occurrence of vanishing or exploding gradients.

  4. We introduced a novel method for evaluating generated SR images based on the power spectral density (PSD) of the image. This analysis allows us to determine the ground resolved distance (GRD) parameter and, consequently, the interpretation possibilities of SR images.

  5. We conducted research and assessed the feasibility of the proposed method in this article. The experiments were conducted on three sets of databases. To objectively evaluate the generated SR images, we employed quality metrics commonly used in the remote sensing and computer vision domains. In addition, we proposed two additional evaluation methods. The first method allows for a local assessment of the images, while the second method enables an evaluation of the interpretational potential based on the PSD of the image.

This article consists of the following parts: Section II presents an overview of methods to improve the spatial resolution of satellite imagery using deep neural networks. A description of the proposed method is provided in Section III. Section IV describes the research methodology and presents the results of the work. Section V contains a discussion of the results. Finally, Section VI concludes this article.

SECTION II.

Related Works

Research on methods to improve the spatial resolution of satellite imagery with the use of a single image (single-image super-resolution, SISR) has been conducted for years [34], [35], [36], [37]. However, the methods proposed so far do not solve the problem of improving the spatial resolution of those satellite images that do not have a corresponding HR image. The main reasons for this are the insufficient computational capacity of workstations and the restrictions on free access to satellite data. Nevertheless, the technological progress of recent years makes it possible to overcome these obstacles. Modern workstations enable the processing of very large imagery databases in a short time. In addition, the use of graphics processors enables the training and implementation of algorithms that employ deep neural networks.

Solutions based on convolutional networks may be applied in the tasks of detection [38], [39], [40], [41], classification [42], [43], segmentation [44], [45], [46], [47], [48], or improving the spatial resolution of the acquired images [49], [50], [51], [52].

This article focuses on the last of these applications, i.e., improving the spatial resolution of digital images. In this group of methods, we may distinguish solutions based on classical convolutional networks and solutions that employ generative adversarial networks (GANs). Although both groups are based on a similar idea (they recreate an HR image based on a neural network model), they differ significantly in terms of the architecture of the artificial neural network and the method of training. The first group of solutions includes models based on convolutional layers, including models that employ residual blocks or multicolumn networks, where each branch is responsible for improving the spatial resolution and, at the final stage of the model's operation, the results obtained by every branch are combined.

An example of a SISR algorithm is the single image super-resolution diffusion probabilistic model (SRDiff), which is designed to address the issues of over-smoothing and mode collapse [53]. SRDiff is a diffusion-based model for SISR, optimized using a variance-corrected likelihood. Another solution is the efficient super-resolution transformer model, which achieves significant computational cost reduction by combining the lightweight convolutional neural network (CNN) backbone and a lightweight transformer backbone [54].

Another group of solutions employs the GAN described by Goodfellow [55]. The models of these networks find applications in many areas: image synthesis, texture synthesis, object detection, vision, natural language processing (NLP), music, and SR [56]. In GAN models responsible for SR, apart from the model responsible for improving the resolution of the LR input image, there is a neural network that deals with the assessment of the estimated images. The application of two networks (the generator and the discriminator) significantly accelerates the process of training the generator (which is responsible for improving the resolution of LR images). Moreover, the final super-resolution (SR) images are characterized by much higher quality than those acquired with the use of other SISR methods. However, these solutions also have some disadvantages. Because two neural network models are employed, GAN networks require quite large computational power. In addition, during the training of the network, one may encounter the problem of unstable gradients (in particular if low-diversity databases are used, e.g., fragments of satellite images). In spite of these difficulties, an increasing number of algorithms to improve the spatial resolution with the use of GANs are being proposed.

An example is the cycle-in-cycle GANs, where the image resolution enhancement occurs in three stages. In the first stage, noise is removed from the LR image. Next, using a generator model, the image's resolution is increased. Finally, in the third stage, the two modules are fine-tuned in an end-to-end manner to obtain the HR output [57]. Another solution, the super-resolution generative adversarial networks with Ranker, is aimed at optimizing the generator in the direction of perceptual metrics by training a Ranker and introducing a novel rank-content loss to optimize the perceptual quality [58].

However, one of the most popular solutions is the SRGAN algorithm [59]. In this solution, the LR training images are created through a series of operations: the HR image is first blurred with a Gaussian filter and then decimated using the downsampling factor r. The neural network model is trained using a minimax game strategy, in which images estimated by the generator compete with the original HR images in the discriminator model.

The SRGAN model has given rise to many modifications that enable the improvement of multispectral and hyperspectral images. Still, the most popular modification of the SRGAN model is the ESRGAN model [33]. The authors introduced several changes. The most important ones concern the generator: the batch normalization (BN) layers were removed from the model, and the basic block was replaced with the residual-in-residual dense block (RRDB), which combines a multilevel residual network with dense connections. The introduced modifications improved the stability of training, while the removal of the BN layers reduced the number of trained parameters, which, in turn, accelerated network training. Another important change was the modification of the discriminator.

The authors replaced the standard discriminator (used in the SRGAN) with a relativistic discriminator.

The task of this discriminator is to estimate the probability that image ${I}^{HR}$ is relatively more realistic than the false image ${I}^{SR}$. The introduced changes have significantly improved the quality of SR images, e.g., for the Set5-4x database [60] the value of the structural similarity index (SSIM) increased from 0.847 to 0.901 [33].

The authors of numerous solutions that employ GAN networks used residual links. This approach allowed for a significant reduction of the vanishing gradient phenomenon and accelerated the network training process. An example of the application of residual links is the algorithm proposed by Courtrai et al. [61], which improves the spatial resolution of small objects present in aerial and satellite photos.

The algorithms to improve resolution are developed mainly on databases of digital images that are not acquired at aerial or satellite altitudes, e.g., Set5-4x [60], Set14-4x [62], BSD100-4x [63], Urban100-4x [64], FFHQ 256×256-4x [65], FFHQ 512×512-4x [65], and FFHQ 1024×1024-4x [65].

Nevertheless, the developed methods are implemented to improve the spatial resolution of satellite imagery.

SECTION III.

Proposed Method

In order to improve the quality of SR images, the proposed solution modifies the enhanced super-resolution generative adversarial network (ESRGAN) architecture published in 2018 by Xintao Wang et al. [33]. The introduced modifications mainly concern the part of the model that is responsible for the assessment of SR images.

The SR images are generated by the original generator of the ESRGAN [33]. The ESRGAN generator is a modification of the SRGAN generator in which the BN layers have been removed. In addition, the basic blocks have been replaced with RRDBs, which combine a multilevel residual network with dense connections (see Fig. 1). By removing the BN layers, the authors of the solution observed stable training and improved network efficiency (the training process is significantly faster), which results from reduced computational complexity. The second modification of the generator network is the implementation of RRDB blocks. These blocks have a residual structure in which residual learning occurs on the main branch of the network (although it can be utilized at different levels). This solution increases network capacity, leading to improved performance [33].

Fig. 1. Model of the ESRGAN generator (based on [33]).
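For readers who want to trace this structure in code, the following is a minimal PyTorch sketch of an RRDB-style block. The layer counts, growth channels, and the residual scaling factor of 0.2 follow the published ESRGAN design, but the exact widths used here are illustrative assumptions rather than the configuration trained in this work.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Five 3x3 convolutions with dense connections (no batch normalization)."""
    def __init__(self, channels=64, growth=32):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv2d(channels + i * growth,
                      growth if i < 4 else channels, 3, padding=1)
            for i in range(5)
        ])
        self.act = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x):
        features = [x]
        out = x
        for i, conv in enumerate(self.convs):
            out = conv(torch.cat(features, dim=1))
            if i < 4:
                features.append(self.act(out))
        return x + 0.2 * out  # residual scaling

class RRDB(nn.Module):
    """Residual-in-residual dense block: three dense blocks plus an outer skip."""
    def __init__(self, channels=64, growth=32):
        super().__init__()
        self.blocks = nn.Sequential(*[DenseBlock(channels, growth) for _ in range(3)])

    def forward(self, x):
        return x + 0.2 * self.blocks(x)
```

In the full ESRGAN generator, a chain of such blocks (23 in the reference implementation) sits between the shallow feature-extraction convolution and the upsampling layers.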

The first of the modifications consisted of introducing a multicolumn discriminator. The proposed solution contains two branches, each of which forms a classifier built from convolutional layers. The first of these classifiers has the same layer structure as the discriminator of the SRGAN network. For the second critic, one convolutional layer was added, ReLU activation was used (in the part of the network responsible for feature extraction), and the kernel size was changed (in even-numbered layers this parameter is 5 pixels) (see Fig. 2).

Fig. 2. Discriminator model.
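The description above can be sketched as a two-branch PyTorch module. The channel plan, the stride pattern, and the pooling head below are assumptions made for illustration (they follow the SRGAN critic only loosely); only the two-branch layout, the extra convolution, the ReLU activation, and the 5-pixel kernels on even-numbered layers are taken from the text.

```python
import torch
import torch.nn as nn

def make_branch(kernels, act):
    """Stack of conv blocks; the channel plan loosely follows the SRGAN critic."""
    chans = [3, 64, 64, 128, 128, 256, 256, 512, 512, 512]
    layers = []
    for i, k in enumerate(kernels):
        stride = 2 if i % 2 == 1 else 1           # downsample on every second layer
        layers += [nn.Conv2d(chans[i], chans[i + 1], k, stride, k // 2), act]
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(),
               nn.Linear(chans[len(kernels)], 1)]  # scalar score per image
    return nn.Sequential(*layers)

class MultiColumnDiscriminator(nn.Module):
    """Two parallel critics scoring the same image; their outputs are combined
    (max or mean) outside the module, as in models 2 and 3 described later."""
    def __init__(self):
        super().__init__()
        # Branch 1: SRGAN-like layout, 3x3 kernels, LeakyReLU.
        self.branch1 = make_branch([3] * 8, nn.LeakyReLU(0.2))
        # Branch 2: one extra conv layer, ReLU, kernel size 5 on even-numbered layers
        # (an interpretation of the paper's description).
        self.branch2 = make_branch([5 if (i + 1) % 2 == 0 else 3 for i in range(9)],
                                   nn.ReLU())

    def forward(self, x):
        return self.branch1(x), self.branch2(x)
```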

The application of such a multicolumn discriminator allows for a better assessment of the operation of the generator and, thus, significantly accelerates the training of the generator network responsible for improving the spatial resolution of satellite imagery. Another modification was changing the method of assessing the operation of the generator and discriminator. In the original ESRGAN solution, the discriminator loss (1) and the generator loss (2) were calculated from the following equations:
\begin{align*}
L_D^{Ra} =& -{E}_{x_r}\left[\log\left(D_{Ra}\left(I^{HR}, I^{SR}\right)\right)\right] \\
&- {E}_{x_f}\left[\log\left(1 - D_{Ra}\left(I^{SR}, I^{HR}\right)\right)\right] \tag{1}\\
L_G^{Ra} =& -{E}_{x_r}\left[\log\left(1 - D_{Ra}\left(I^{HR}, I^{SR}\right)\right)\right] \\
&- {E}_{x_f}\left[\log\left(D_{Ra}\left(I^{SR}, I^{HR}\right)\right)\right] \tag{2}
\end{align*}
where ${l}^{SR}$ is the perceptual loss function, ${I}^{HR}$ is the HR image, ${I}^{SR}$ is the SR image, and ${D}_{Ra}$ is the relativistic discriminator.
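Written with raw (pre-sigmoid) critic outputs, losses (1) and (2) reduce to two binary cross-entropy terms. The sketch below is one plausible implementation, with hypothetical tensors d_hr and d_sr holding the discriminator scores of the HR and SR batches.

```python
import torch
import torch.nn.functional as F

def relativistic_losses(d_hr: torch.Tensor, d_sr: torch.Tensor):
    """Relativistic average GAN losses in the spirit of (1) and (2).

    d_hr, d_sr: raw discriminator scores for the HR and SR batches.
    Returns (discriminator_loss, generator_loss).
    """
    # D_Ra(HR, SR) = sigmoid(C(HR) - E[C(SR)]); kept as logits for stability.
    real_vs_fake = d_hr - d_sr.mean()
    fake_vs_real = d_sr - d_hr.mean()

    # (1): the discriminator wants HR judged "more realistic" than SR and vice versa.
    loss_d = (F.binary_cross_entropy_with_logits(real_vs_fake,
                                                 torch.ones_like(real_vs_fake))
              + F.binary_cross_entropy_with_logits(fake_vs_real,
                                                   torch.zeros_like(fake_vs_real)))
    # (2): the generator reverses the roles of the two terms.
    loss_g = (F.binary_cross_entropy_with_logits(real_vs_fake,
                                                 torch.zeros_like(real_vs_fake))
              + F.binary_cross_entropy_with_logits(fake_vs_real,
                                                   torch.ones_like(fake_vs_real)))
    return loss_d, loss_g
```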

The proposed solution also applies the Wasserstein loss. This solution was introduced by Martin Arjovsky et al. in 2017, in the article entitled Wasserstein GAN [66]. The authors changed the approach to training the generator model so that it approximates the distribution of the observation data used for network training. They replace the discriminator, which defines the probability of an image belonging to the SR/HR classes, with a critic that scores whether a given image is "true" or "false". The aim of the training process is to minimize the distance between the data distribution observed in the training dataset and the distribution observed in the generated examples. The distance is determined with the Wasserstein metric, which defines a distance between probability distributions in a metric space. The Wasserstein distance between two probability measures $\mu$ and $\vartheta$ is calculated using the following formula (3) [67]
\begin{align*}
W\left(\mu, \vartheta\right) =& \max_{\alpha, \beta}\bigg[{\mathbb{E}}_x\left[\alpha\left(X\right)\right] + {\mathbb{E}}_{x^{\prime}}\left[\beta\left(X^{\prime}\right)\right] \\
&- \gamma \sum_{x, x^{\prime} \in {\left\{0,1\right\}}^d} \exp\bigg(\frac{1}{\gamma}\bigg(\alpha\left(x\right) + \beta\left(x^{\prime}\right) - D\left(x, x^{\prime}\right)\bigg) - 1\bigg)\bigg] \tag{3}
\end{align*}
where $D: X \times X \to {\mathbb{R}}_+$, $X = {\{0,1\}}^d$, $W$ is the Wasserstein distance, $\alpha, \beta$ are functions on the set $X$, and $\gamma$ is a regularization (smoothing) parameter.

The distance carries information about the minimum amount of work needed to transform the base distribution into the target distribution, i.e., to improve the spatial resolution of LR data. Additional advantages of the Wasserstein distance are its properties: it is continuous and differentiable. Moreover, this solution is more stable under changes in the architecture of the model or modifications of the hyperparameters. An additional advantage of the introduced modification is that it accelerates the model training process. The aim of training the GAN network is to reach an equilibrium between the generator and the critic by reducing the generator's loss. This is enabled by the application of the Kantorovich-Rubinstein duality (4) [68]
\begin{equation*}
W\left({{\bm{P}}}_r, {{\bm{P}}}_\theta\right) = \sup_{{\left| {\left| f \right|} \right|}_L \leq 1} {\mathbb{E}}_{x \sim {P}_r}\left[f\left(x\right)\right] - {\mathbb{E}}_{x \sim {P}_\theta}\left[f\left(x\right)\right] \tag{4}
\end{equation*}
where ${{\bm{P}}}_r$ is the real data distribution, ${{\bm{P}}}_\theta$ is the distribution of the parametrized density ${P}_\theta$, and $f: X \to \mathbb{R}$.

If, instead of 1-Lipschitz functions $f: X \to \mathbb{R}$, we consider $K$-Lipschitz functions for some constant $K$, then the constraint ${\left| {\left| f \right|} \right|}_L \leq 1$ takes the form ${\left| {\left| f \right|} \right|}_L \leq K$, and the supremum yields $K \cdot W\left({P}_r, {P}_\theta\right)$. Therefore, for a parametrized family ${\left\{ {f}_w \right\}}_{w \in W}$ whose members are all $K$-Lipschitz for some $K$, we can write this operation as (5) [66]
\begin{equation*}
\max_{w \in W} {\mathbb{E}}_{x \sim {P}_r}\left[{f}_w\left(x\right)\right] - {\mathbb{E}}_{z \sim p\left(z\right)}\left[{f}_w\left({g}_\theta\left(z\right)\right)\right]. \tag{5}
\end{equation*}

If the supremum in (4) is attained for some $w \in W$, this process yields a calculation of $W\left({{\bm{P}}}_r, {{\bm{P}}}_\theta\right)$ up to a multiplicative constant [66].

Considering the specifics of GAN training, the objective takes the form presented in the following equation:
\begin{equation*}
\min_G \max_{D \in \mathcal{D}} {\mathbb{E}}_{x \sim {P}_r}\left[D\left(x\right)\right] - {\mathbb{E}}_{\tilde{x} \sim {P}_g}\left[D\left(\tilde{x}\right)\right] \tag{6}
\end{equation*}
where $\mathcal{D}$ is the set of 1-Lipschitz functions, ${P}_r$ is the data distribution, and ${P}_g$ is the model distribution implicitly defined by $\tilde{x} = G(z)$, $z \sim p(z)$.
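In code, the objective (6) splits into a critic term and a generator term. The sketch below assumes a hypothetical critic module and precomputed HR/SR batches, and it ignores the Lipschitz constraint, which is discussed next.

```python
import torch

def wasserstein_losses(critic, hr_batch: torch.Tensor, sr_batch: torch.Tensor):
    """Critic and generator losses corresponding to (6).

    The critic maximizes E[D(HR)] - E[D(SR)], i.e., it minimizes the negative;
    the generator minimizes -E[D(SR)].
    """
    d_real = critic(hr_batch).mean()
    d_fake = critic(sr_batch).mean()
    critic_loss = d_fake - d_real
    generator_loss = -d_fake
    return critic_loss, generator_loss
```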

In order to implement the Wasserstein Loss, the authors recommend using a special configuration of hyperparameters that is presented in Table I.

TABLE I Proposed Configuration of Network Training Hyperparameters [63]

SECTION IV.

Experiments and Results

A. Database

The proposed method was evaluated with the use of three databases containing corresponding pairs of LR (96 × 96 pixels) and HR (384 × 384 pixels) images. The first database comprised aerial photos. The HR set was created by dividing an orthophotomap with a pixel size of 0.25 m. LR images were prepared by downsampling the HR images to 96 × 96 pixels. The second database was prepared with the use of the Dataset for Object deTection in Aerial images (DOTA) [69]. The DOTA database is a collection of data created for object detection in aerial and satellite images. It contains images with varying spatial resolutions, encompassing various scenarios and types of objects, such as vehicles, buildings, sports fields, and roads. In the conducted research, 18 000 random image fragments of 384 × 384 pixels were selected. LR images, as in the case of the first database, were prepared by downsampling the HR images. The third database was prepared with the use of imagery acquired by WorldView-2 (WV2). In this case, the LR set was created by dividing the multispectral imagery with a spatial resolution of 2 m (three channels were used: 2, 3, and 4) into smaller images of 96 × 96 pixels. To prepare the HR set, it was necessary to apply pansharpening (the Gram-Schmidt method), which increased the spatial resolution of the multispectral image four times. The resulting image was divided into smaller images (384 × 384 pixels). The prepared pairs of LR and HR images covered the same imaged area. Sample pairs of LR and HR images from the individual databases are presented in Fig. 3.

Fig. 3. Sample LR and HR images from the prepared databases.

For training purposes, the prepared LR and HR image databases were divided into three datasets: training data (used to train the network and accounting for 70% of the whole database), validation data (used to evaluate the model during network training and accounting for 20% of the whole database) and test data (used for the final assessment of the model).
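A minimal sketch of how such LR/HR tile pairs and the train/validation/test split can be prepared is given below. The tile-extraction loop, the bicubic downsampling, and the file layout are illustrative assumptions; in the satellite database the LR tiles came directly from the 2 m multispectral imagery rather than from downsampling.

```python
import random
from pathlib import Path

from PIL import Image

HR_SIZE, SCALE = 384, 4   # 384x384 HR tiles, x4 super-resolution -> 96x96 LR tiles

def make_pairs(scene_path: str, out_dir: str) -> None:
    """Cut an HR scene into 384x384 tiles and derive 96x96 LR tiles by downsampling."""
    scene = Image.open(scene_path)
    out = Path(out_dir)
    (out / "hr").mkdir(parents=True, exist_ok=True)
    (out / "lr").mkdir(parents=True, exist_ok=True)
    w, h = scene.size
    idx = 0
    for y in range(0, h - HR_SIZE + 1, HR_SIZE):
        for x in range(0, w - HR_SIZE + 1, HR_SIZE):
            hr = scene.crop((x, y, x + HR_SIZE, y + HR_SIZE))
            lr = hr.resize((HR_SIZE // SCALE, HR_SIZE // SCALE), Image.BICUBIC)
            hr.save(out / "hr" / f"{idx:06d}.png")
            lr.save(out / "lr" / f"{idx:06d}.png")
            idx += 1

def split_indices(n: int, seed: int = 0):
    """70% training / 20% validation split; the remaining images form the test set."""
    ids = list(range(n))
    random.Random(seed).shuffle(ids)
    n_train, n_val = int(0.7 * n), int(0.2 * n)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]
```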

B. Analyzed Models

In order to verify the proposed solution, five models were trained for each of the prepared datasets. The aim of this approach was to select the solution that would best solve the problem of improving the resolution of satellite imagery. The first of these models was the original ESRGAN. In the second model, the multicolumn discriminator described above was applied. The model assumed that the maximum value of the assessment of the discriminators was the final loss value. In the third solution, the manner of calculating the final value of model losses was modified: the adopted final value was the average value of the losses of all discriminators.

In order to accelerate the network training process even further, in the fourth model, apart from introducing the multicolumn discriminator, the method of network training was modified: the Wasserstein loss (7) was applied. Moreover, considering the results obtained during the training of models 2 and 3, the weights of the generator were updated based on the average loss of all branches of the discriminator
\begin{equation*}
L = {\mathbb{E}}_{\tilde{x} \sim {P}_g}\left[D\left(\tilde{x}\right)\right] - {\mathbb{E}}_{x \sim {P}_r}\left[D\left(x\right)\right]. \tag{7}
\end{equation*}

However, it was observed during training that this approach was prone to the occurrence of the vanishing gradients phenomenon. As a result, in the final (fifth) model, the method of network training was modified according to the recommendations by Ishaan Gulrajani et al. [70].

This operation resulted in creating the MCWESRGAN model.

The authors of this solution propose a different method of enforcing the Lipschitz constraint. Based on the observation that a differentiable function is 1-Lipschitz if and only if its gradients have norm at most 1 everywhere, they constrain the gradient norm directly by adding a penalty on the gradient norm for random samples $\hat{x} \sim {P}_{\hat{x}}$ (8)
\begin{align*}
L =& {\mathbb{E}}_{\tilde{x} \sim {P}_g}\left[D\left(\tilde{x}\right)\right] - {\mathbb{E}}_{x \sim {P}_r}\left[D\left(x\right)\right] \\
&+ \lambda {\mathbb{E}}_{\hat{x} \sim {P}_{\hat{x}}}\left[{\left({\left\|{\nabla}_{\hat{x}} D\left(\hat{x}\right)\right\|}_2 - 1\right)}^2\right]. \tag{8}
\end{align*}
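The penalty term in (8) can be implemented in a few lines. The sketch below uses λ = 10 and the random interpolation between real and generated samples proposed in [70], with a hypothetical critic module.

```python
import torch

def gradient_penalty(critic, hr_batch, sr_batch, lam: float = 10.0):
    """WGAN-GP penalty from (8): push the critic's gradient norm toward 1
    on random interpolates between HR and SR samples."""
    eps = torch.rand(hr_batch.size(0), 1, 1, 1, device=hr_batch.device)
    x_hat = (eps * hr_batch + (1.0 - eps) * sr_batch).requires_grad_(True)
    scores = critic(x_hat)
    grads = torch.autograd.grad(outputs=scores.sum(), inputs=x_hat,
                                create_graph=True)[0]
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return lam * ((grad_norm - 1.0) ** 2).mean()
```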

Considering the above modifications, the last model was trained in compliance with the algorithm published by Ishaan Gulrajani et al. [70].

The conducted research involved the preparation of five models of the GAN network (mainly by introducing modifications to the structure of the discriminator and modifying the network training method). In order to select the best solution, all the models discussed above were trained based on each of the prepared databases.

C. Methods of Assessing the Results

The quality of the operation of the proposed network model was determined with the use of the most popular metrics used in the fields of remote sensing and computer vision. A list of the metrics used is provided in Table II.

TABLE II List of the Most Important Assessment Methods

Moreover, to improve the assessment of the generated images, an application was created that enables a local assessment of the obtained results. In this solution, the image is divided into smaller areas (determined by the p parameter), and the aforementioned qualitative assessment metrics are then computed for each area. This method allows for the determination of areas that were assessed as "worse" by the spatial resolution enhancement algorithm. Such information may become the basis for conclusions concerning the modifications that should be introduced, for example, during network training, or concerning the manner of modifying the training database so as to improve the quality of the estimated SR images (see Fig. 4).

Fig. 4. Block diagram of the functioning of the image assessment algorithm.
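A simple rendering of this local assessment is sketched below: both images are tiled into p × p blocks and PSNR/SSIM are computed per block, yielding quality maps. The use of scikit-image, the default block size, and the 8-bit data range are assumptions.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def local_quality_maps(hr: np.ndarray, sr: np.ndarray, p: int = 48):
    """Per-block PSNR and SSIM maps for two uint8 images of identical shape (H, W, C)."""
    h, w = hr.shape[:2]
    rows, cols = h // p, w // p
    psnr_map = np.zeros((rows, cols))
    ssim_map = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            a = hr[i * p:(i + 1) * p, j * p:(j + 1) * p]
            b = sr[i * p:(i + 1) * p, j * p:(j + 1) * p]
            psnr_map[i, j] = peak_signal_noise_ratio(a, b, data_range=255)
            ssim_map[i, j] = structural_similarity(a, b, channel_axis=-1,
                                                   data_range=255)
    return psnr_map, ssim_map
```

Low-valued cells in the resulting maps point to the poorly estimated areas discussed above.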

Another method proposed by the authors to assess this type of image is the analysis of the PSD of the image, which describes the distribution of signal power over frequency. This method allows us to assess the change not only in the GSD parameter but also in the GRD parameter (see Fig. 5). The GRD parameter defines the smallest object size that may be distinguished in the image. Its value depends on the contrast between the object and the background. For objects characterized by high contrast with the background, the value of the GRD is two times the value of the GSD parameter, while if the contrast is low, then $GRD = 2\sqrt{2}\,GSD$ [71].

Fig. 5. Sample diagram of PSD for the spatial frequency of the HR image with GSD = 0.5 m, along the x-axis.
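One straightforward way to obtain such a PSD curve along the x-axis is to average the squared magnitude of the row-wise FFT; the sketch below is an assumption about the exact estimator used and returns the spatial frequency axis in cycles per metre for a given GSD.

```python
import numpy as np

def psd_along_x(image: np.ndarray, gsd: float):
    """One-dimensional power spectral density along the x-axis.

    image: 2-D array (single band); gsd: pixel size in metres.
    Returns (spatial_frequency, mean_power). Multiply the frequency axis by 100
    to express it in [1/100 m], as in Fig. 5.
    """
    rows = image.astype(np.float64)
    rows = rows - rows.mean(axis=1, keepdims=True)   # remove the per-row mean
    spectrum = np.fft.rfft(rows, axis=1)
    psd = (np.abs(spectrum) ** 2).mean(axis=0)       # average over all rows
    freq = np.fft.rfftfreq(image.shape[1], d=gsd)    # cycles per metre
    return freq, psd
```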

The combination of the image assessment methods discussed above enables an unambiguous determination of the quality of the estimated images. Qualitative metrics of image assessment allow for a global assessment of SR images. Applying them to local areas also enables the identification of areas that are poorly represented, e.g., containing errors or artifacts. On the other hand, conducting a PSD analysis for various spatial frequencies enables us to determine the level of improvement of the interpretation capacity of SR images.

D. Results

All the analyzed artificial neural networks were trained using an Nvidia TITAN RTX 24 GB graphics card, an Intel Xeon Silver 4216 processor, and the Ubuntu 18.04 operating system. The input parameters adopted for the ESRGAN training followed the authors' recommendations (the learning rate is initialized as $2 \cdot {10}^{-4}$ and decayed by a factor of 2 every $2 \cdot {10}^5$ mini-batch updates; Adam optimization with $\beta_1 = 0.9$, $\beta_2 = 0.999$). However, due to the vanishing gradient phenomenon that occurred for the ESRGAN after 30 000 steps, the rate of network training was slowed down by reducing the value of the learning rate parameter [33].
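These reported settings translate directly into a PyTorch optimizer and learning-rate schedule; the sketch below is a plausible rendering of that configuration (the generator stand-in and the decay milestones beyond the stated rule are assumptions).

```python
import torch

# Hypothetical generator stand-in; only the optimizer/scheduler settings matter here.
generator = torch.nn.Conv2d(3, 3, 3, padding=1)

# Adam with lr = 2e-4, beta1 = 0.9, beta2 = 0.999, as stated in the text.
optimizer = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.9, 0.999))

# Decay the learning rate by a factor of 2 every 2*10^5 mini-batch updates.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[200_000, 400_000, 600_000, 800_000], gamma=0.5)
```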

Based on the above assumptions, three models (that differed in terms of the training dataset) were trained for each of the analyzed architectures of the conditional GAN network.

Table III presents the qualitative results of the evaluation metrics conducted on the database of images acquired by the WorldView-2 satellite. Analyzing the results, we observe a significant decrease in the training time required for each SISR model. The implementation of a multicolumn discriminator allows for approximately a 7-fold reduction in the training time of the model. At the same time, the use of Wasserstein loss shortens this time by over ten times. However, a drawback of the modified loss method is a relatively high likelihood of encountering the issue of vanishing gradients, especially in the case of satellite imagery. In the case of the MCWESRGAN, the “enforce the Lipschitz constraint” method was applied to mitigate the vanishing gradient issue, enabling longer network training and consequently improving the quality of the generated images, as confirmed in the qualitative analysis. The results of the quality assessment of SR images for the remaining databases are presented in Appendix 1.

TABLE III Presentation of Results for Satellite Images

To improve the evaluation of the proposed strategies for training GANs, qualitative image assessment metrics were calculated. These metrics are presented in Tables V and VI (Appendix 1). Apart from that, the tables contain information about the number of hours spent on training the analyzed models (generators).

In order to improve the evaluation of SR images, the values of local SSIM and peak signal-to-noise ratio (PSNR) metrics were calculated. In this approach, the reference image (HR) and the estimated image (SR) are divided into smaller areas of the dimensions determined by the p parameter. Then, the values of the selected metrics of qualitative assessment were calculated for the defined areas (for the purposes of the presentation of results, only two metrics of qualitative assessment were selected, as, in the opinion of the authors, they best reflect the differences between the quality of estimation of the areas in SR images). Appendix 2 presents the examples of the conducted local analysis of SR images.

In order to determine the quality of improvement of the spatial resolution of SR images, the PSD was analyzed as a function of spatial frequency. This method enables us to determine the level of increase in the interpretation capacity of SR images. The analyses conducted for sample images along the x- and y-axes are presented in Appendix 3.

The presented results demonstrate that the SR image (the yellow curve in the diagram) has the same GSD value as the HR image (the spatial resolution increased four times in comparison with the LR image). In addition, the analysis of the PSD of the HR and SR images as a function of spatial frequency revealed that image resolution was improved only in the area marked by the blue rectangle, i.e., up to a spatial frequency of approx. 105 [1/100 m] (see Fig. 5). The PSD is best reproduced in this area. As a result, the value of the GRD parameter is approx. 1.8–2.1 GSD.

To further verify the proposed model, the SR images generated by MCWESRGAN were compared with the SR images estimated by other SISR models based on deep neural networks. Based on visual analysis (see Appendix 4) and the results of the qualitative image assessment (see Table IV), it can be observed that the proposed solution allows for the most accurate image reconstruction, especially for linear elements and surfaces. Examining the obtained results, our model exhibits the best evaluation metric values (see Table IV), indicated in green. Metrics with the worst values are highlighted in red (mainly attributed to the EDSR model). From the conducted analysis, it is clear that our proposed MCWESRGAN achieves the best results in each of the evaluated metrics (SSIM, PSNR, spectral angle mapper [SAM], SCC, and universal quality measure [UQI]).

TABLE IV Quantitative Evaluation of MCWESRGAN

In addition, it can be noticed that SR images generated from satellite imagery fragments exhibit better results than SR images from the DOTA database. This may be due to the significantly lower spatial resolution of satellite imagery compared with the images in the DOTA database, resulting in fewer details present in the images. Furthermore, attention should be drawn to the SRDenseNet model: SR images estimated by this model also achieve equally good results in the qualitative assessment. However, visual analysis shows that these SR images appear to be painted with a brush, distinguishing them from the HR images. The worst results were obtained by the EDSR network (SSIM value below 0.80 for the satellite imagery database, whereas for our model SSIM = 0.92). In the visual analysis (examples of images can be found in Appendix 3), even a degradation of interpretational potential can be observed.

The obtained results demonstrated that the application of a multicolumn network in the discriminator model significantly accelerates the process of network training. For the analyzed databases, the time required to train the generator was shortened more than 10 times. The implementation of the Wasserstein loss additionally accelerated the training of the model. In addition, the quality of the estimated SR images was better than that of the images estimated by the ESRGAN.

The analysis of the obtained results (mainly based on local assessment of images) allows us to claim that anthropogenic objects are better represented in SR images than objects of natural origin (e.g., tree crowns). The exceptions are clearings and meadows, whose assessment quality increases with the decrease in spatial resolution of LR images.

SECTION V.

Discussion

This article presents a new architecture of a conditional GAN that improves the resolution of images acquired at aerial and space altitudes. The core of the proposed solution is the ESRGAN, in which the structure of the discriminator model was modified (a multicolumn network was used and the arrangement of layers was changed). The training strategy of the GAN network was also improved. The introduction of the Wasserstein loss enabled us to significantly increase the rate of network training: more than 10 times faster, with a simultaneous improvement in the quality of the estimated SR images (Appendix 1).

The conducted research has demonstrated that the use of a multicolumn discriminator and Wasserstein loss during the generator training process:

  1. significantly accelerates the network training process,

  2. mitigates the issue of vanishing gradients, leading to a more stable model training,

  3. results in estimated SR images of higher quality, as confirmed through both local and global image evaluations.

However, it is worth noting that this proposed solution does not address the problem of the characteristic texture artifacts that appear in SR images (this phenomenon is also visible in images estimated by the ESRGAN model).

During the research, an additional experiment was conducted in which the model was trained on satellite data (the LR image database) and aerial imagery (the HR image database). During the creation of this database, the authors attempted to select the data so that the times of data acquisition were similar. Examples of LR and HR images are presented in Fig. 6. To reduce the differences in the quality of the obtained data, resulting from different methods of image acquisition and different structures of the optical systems, the histogram of the HR image was matched to the LR image. This operation allowed us to significantly reduce the differences between the LR and HR areas.

Fig. 6. Examples of image pairs. (a) LR. (b) HR. (c) HR after histogram adjustment.

Analyzing the obtained results, it is worth noting the quality of the work of the generators prepared on the basis of multispectral images and aerial imagery. The values of the qualitative assessment metrics are much poorer in comparison with the models trained on the other databases. Fig. 7 presents an example of the operation of such a generator.

Fig. 7. Examples of the improvement of resolution of satellite images based on the example of aerial imagery. (a) LR. (b) SR. (c) HR.

When the database was created, the authors attempted to select the data so that the time and conditions of image acquisition were similar. However, using open data sources, it was only possible to acquire aerial images taken in the same year as the WV2 satellite imagery. Unfortunately, the satellite imagery and aerial photos were acquired:

  1. in different seasons of the year, so the plant cover visible in the images is at various stages of vegetation,

  2. at different times of day, so the difference in the elevation angle and azimuth of the Sun influenced the position of the shadows visible in the photos,

  3. at different values of sensor tilt angle, so various walls of the facades of tall buildings are visible.

The aim of the research was to improve the spatial resolution of satellite imagery with a GSD of approximately 2 m. Because of this, the differences within pairs of images (LR and HR) had a significant influence on the quality of network training. During the research, none of the tested networks managed to cope with this problem. At the same time, in a situation where it is necessary to improve the resolution of satellite images of lower resolution (when the GSD value exceeds 10 m), the differences resulting from different ways of acquiring the data will have a smaller influence [72], [73], [74].

SECTION VI.

Conclusion

The conducted research demonstrated that the application of a multicolumn discriminator allows for a significant acceleration of the network training process. The rate of this process may be further increased by implementing the Wasserstein loss, which uses the Wasserstein distance, although the Wasserstein loss increases the model's proneness to overtraining (especially for images acquired at aerial and space altitudes). Moreover, we would like to draw the readers' attention to the manner of assessing the images estimated by a GAN. During the global qualitative assessment of SR images, one may encounter cases where estimation errors are visible despite high values of the global assessment metrics. In these cases, the high metric values may result from a very good representation of, e.g., homogeneous surfaces (such as surfaces covered with shadows, or roads), which "overstate" the final score. The application of local assessment enables the identification of areas of lower or higher SR quality. Such information may be valuable for preparing an additional database of images, which may provide a basis for supplementary training of an existing model. Furthermore, we recommend assessing interpretational potential based on the PSD of the image. This method allows for a numerical determination of the improvement in recognition capabilities in the estimated SR images. The proposed approach enables a precise evaluation of SR images, facilitating a better assessment of the generator and detecting outliers that may lead to overestimation or underestimation of global metrics in the qualitative image assessment.

The obtained results revealed that the quality of SR images deteriorated with the increase in spatial resolution of LR images (see Tables III–V). For images acquired at aerial altitudes (where the value of the GSD parameter is lower than 30 cm), deformations that result from incorrect estimation of the details of objects in the photograph have a significant influence on the result of the assessment of SR images.

In addition, it was noted that if homogeneous databases are used (i.e., if the images originate from the same set of satellite imagery or the same photogrammetric flight), where the acquisition conditions are similar, the probability of the occurrence of the vanishing gradient phenomenon increases. If heterogeneous databases are used (e.g., DOTA, which contains aerial and satellite images as well as images acquired by UAVs), the phenomenon does not occur.

Considering the obtained results, future research will verify the proposed method for improving the spatial resolution of image sequences acquired by nano- and minisatellites.

Appendix 1

Results of the Assessment of the Tested Network Models With the Use of Qualitative Metrics

See Tables V and VI.

TABLE V Presentation of Results for Aerial Photos
TABLE VI Presentation of Results for the DOTA Database (No Vanishing Gradient Phenomenon)

Appendix 2

Results of Local Assessment of SR Images

Appendix 3

Results of the Ratio of PSD to the Spatial Frequency of SR Images

a) ESRGAN

b) Multicolumn ESRGAN (max)

c) Multicolumn ESRGAN (mean)

d) Multicolumn Wasserstein

e) Multicolumn Wasserstein VGESRGAN (mean), vanishing gradient
