A Novel Deep Learning Architecture for Agriculture Land Cover and Land Use Classification from Remote Sensing Images Based on Network-Level Fusion of Self-Attention Architecture | IEEE Journals & Magazine | IEEE Xplore


Abstract:

AI-driven precision agriculture applications can benefit from the large data source that remote sensing (RS) provides, as it can gather agricultural monitoring data at various scales throughout the year. RS combined with artificial intelligence offers numerous advantages for sustainable agricultural applications, including yield prediction, crop monitoring, and climate change adaptation. In this work, we propose a fully automated, optimized self-attention fused convolutional neural network (CNN) architecture for land use and land cover classification using RS data. A new contrast enhancement equation is proposed and used in the architecture for data augmentation. The proposed architecture initially consists of two custom models named IBNR-65 and Densenet-64, designed around the inverted bottleneck residual mechanism and dense blocks. Both models are fused using depth-wise concatenation, and a self-attention layer is appended for deep feature extraction. We then train the model and perform classification using neural network (NN) classifiers. Because the results obtained from the NN classifiers were insufficient, we applied Bayesian optimization to fine-tune the hyperparameters of the NN classifiers. In addition, we propose a quantum hippopotamus optimization algorithm for selecting the best features. The selected features are finally classified using the fine-tuned NN classifiers, which obtained improved accuracies of 98.20%, 89.50%, and 91.70% on the SIRI-WHU, EuroSAT, and NWPU datasets, respectively; the highest precision rate is 98.23, recall is 98.20, and F1-score is 98.21. Moreover, a detailed ablation study was conducted, and the performance was compared with state-of-the-art (SOTA) methods. The proposed architecture shows improved accuracy, sensitivity, precision, and computational time.
Topic: Remote Sensing and Artificial Intelligence for Sustainable Agricultural Applications
Page(s): 6338 - 6353
Date of Publication: 28 February 2024

SECTION I.

Introduction

Advances in remote sensing (RS) technology have made it possible to obtain a significant amount of satellite data quickly. High-resolution satellite images continually open new challenges for the RS research community [1]. Computer vision researchers have extensively utilized RS images for many semantic tasks, including but not limited to road segmentation, building extraction, land cover classification, IoT, and agricultural land classification [2], [3], [4]. Land cover classification has attracted remarkable attention in computer vision due to its essential applications, such as urban planning, crop fields, and landslide hazards [3]. RS data are not easy to use because several subcategories fall under the same category and may appear in the same image; for example, the vegetation category includes forest regions, herbaceous plants, and permanent crops [5]. A few sample images are illustrated in Fig. 1.

Fig. 1. - Few sample images of land cover RS.

Several techniques have been introduced in the literature to classify land cover from RS images [6], [7]. These techniques are based on supervised and unsupervised learning. In unsupervised learning, clustering techniques such as K-means and fuzzy C-means have usually been employed for classification [8], [9]; however, these techniques are neither suitable nor efficient given the large number of labeled RS images [10]. Traditional supervised learning methods extract handcrafted features and classify them using machine learning classifiers. The handcrafted features are extracted based on prior information such as texture, shape, and point features [11]. However, for complex scenes, it is not easy to extract the most discriminative features [12], [13]. Feature selection is an important area in pattern recognition research, and many techniques have been introduced in the literature. Feature selection techniques mainly aim to remove irrelevant information from the original feature space and minimize the computation time [14].

Deep learning has achieved great success in several applications, especially RS and object classification [15]. Deep learning is popular due to its large learning capacity and strong performance on huge datasets. Convolutional neural networks (CNNs), a widely used class of deep learning models, learn abstract features from images layer by layer and are utilized in diverse fields such as healthcare [16], action recognition [17], satellite imaging [18], fraud detection [19], and many more. A simple CNN architecture consists of several intermediate layers, such as a convolutional layer, a pooling layer, a ReLU activation layer, a batch-normalization layer, an addition layer, a fully connected (FC) layer, and a softmax layer [20]. Several recent studies used pretrained deep learning models for land cover classification using RS images. Papoutsis et al. [21] introduced a CNN architecture for land cover classification from RS images. The presented architecture is based on multilayer perceptrons and vision transformers.

Moreover, they added an EfficientNet mechanism to improve the training time and accuracy. They also compared the presented architecture with a baseline ResNet50, showing improved accuracy and time. Ma et al. [22] presented a feature enhancement neural network (NN) for land cover classification. They also added a self-attention module to extract the local information of the images and reduce the classification loss. Compared with the earlier PSPNet model, this network improved accuracy by 2%. For crop classification from RS images, Patel et al. [23] presented a comparative study of pretrained models such as VGG16, VGG19, ResNet, and DenseNet. They trained these models on RS image data and analyzed their performance. In addition, they analyzed the performance of custom 2D CNN and 3D CNN architectures and concluded that the custom models perform better. Helber et al. [24] introduced a novel RS dataset named EuroSAT for the land cover classification task. The dataset was tested on a deep learning architecture, and improved classification accuracy was obtained. Kussul et al. [25] introduced a deep learning technique for land use and crop type classification using RS images. Vali et al. [26] presented a review of land cover classification techniques that discusses preprocessing, feature engineering, and classification. Zhang et al. [27] presented a joint deep CNN architecture using RS data for land cover classification. They combined a patch-based CNN and a pixel-based MLP with joint reinforcement to improve classification accuracy. Otal et al. [28] presented a deep learning framework for postbuyout land cover mapping. They used FEMA's Hazard Mitigation Grant Program (HMGP) buyout dataset for the experiments, gathered 40 053 satellite images of buyout lands, and evaluated deep learning models on them, achieving a 98.86% AUC score. Temenos et al. [5] presented a deep learning-based interpretable framework for classifying land use and land cover using satellite images. The authors used the EuroSAT dataset with customized CNNs, achieved 94.72% accuracy, and applied SHAP to verify the performance of the models. Vinaykumar et al. [29] presented a hybrid of deep learning and optimal guidance whale optimization for land cover classification using satellite images. The authors applied AlexNet and ResNet50 for feature extraction, employed whale optimization for best feature selection, and used a Bi-LSTM for classification. They achieved 89.57% on the AID dataset and 93.21% on the NWHP dataset. A few more studies have also been introduced, such as the high-resolution adaptive model [30], 3D-HRNET [31], crop classification [32], and the Northern Border Region technique [33].

Although the techniques mentioned earlier show impressive performance, there is room for improvement in accuracy and computational time. Feature fusion has improved accuracy over the last few years; however, it has a drawback, which is an increase in computational time. In the fusion process, researchers usually fuse the features of two models [34]; however, this process is inefficient and time-consuming. To overcome this problem, a few researchers employed feature selection techniques that remove irrelevant features before the final classification. In addition, performance has improved significantly after employing an optimization process [35], [36]. A few recent studies also focused on deep learning for agri-yield classification using RS images [20], [37], [38], [39], [40], [41]. Overall, first fusing the features of both models and then selecting the best of them is a complex process; therefore, it is important to consider a more optimized approach, such as network-level fusion.

In network-level fusion, two models are created and combined within the network for improved classification accuracy and lower computational cost. However, controlling the total number of learnable parameters when performing fusion within a network is challenging. To tackle this challenge, we propose a novel CNN architecture based on network-level fusion and optimization. In the fusion process, we design two custom networks based on the bottleneck and dense mechanisms and then combine them using a depth concatenation layer. After the fusion layer, a self-attention layer is added to extract the local information of the image. Further, the model performance is optimized using Bayesian optimization (BO) and a quantum hippopotamus optimization algorithm (QHPO). The major contributions of this work are as follows.

  • To address the problem of an imbalanced dataset, we performed data augmentation at the initial step using the contrast stretching technique and employed the updated dataset for the training process.

  • The samples of satellite images have complex patterns and are difficult to recognize; therefore, we proposed a novel network-level fusion Self-Attention CNN architecture. The upper part of the proposed architecture is based on two mechanisms: bottleneck and dense blocks. A depth concatenation layer combines both architectures, and a Self-Attention layer is added for local information extraction, followed by the FC and softmax layers.

  • A BO technique has been implemented using an expected improvement (EI) acquisition function that optimizes the hyperparameters of the proposed model instead of selecting them from past knowledge.

  • Irrelevant and redundant features cause RS classes to be misclassified; therefore, we proposed a new QHPO that selects the best features and reduces the computation time.

The rest of this article is organized as follows. Section II presents the proposed methodology in detail, including the dataset description, data augmentation, proposed CNN-based self-attention model, BO learning, and QHPO-based feature selection. Section III discusses the detailed results of the proposed framework. Section IV concludes the overall proposed research.

SECTION II.

Proposed Methodology

The proposed land cover classification framework is presented in this section. The framework comprises a novel CNN self-attention fused architecture for feature extraction. Fig. 2 illustrates the proposed framework of land cover classification from RS images. In the proposed framework, data augmentation is performed at the initial step using a hybrid contrast enhancement technique. After that, a fused self-attention deep learning architecture is proposed, and its training hyperparameters are selected using BO. Deep features are extracted from the self-attention layer and optimized using a novel QHPO. The proposed QHPO selects the best features for final classification using NN classifiers.

Fig. 2. - Proposed framework of land cover classification from RS images.

A. Dataset Description

In this work, we employed three datasets for classification, namely EuroSAT [24], NWPU-RESISC45 [42], and SIRI-WHU [43]. These datasets are publicly available for research in the RS domain. All selected datasets consist of RGB images. A brief description of each dataset is given in the following.

The EuroSAT dataset consists of 10 classes, as summarized in Table I. The aim of this dataset was land cover and land use classification. Each image of this dataset has 64 by 64 pixels and a 10-m ground sampling distance. Every one of them was gathered by the Sentinel-2 satellite. The total number of images in this dataset is 27 500 [24].

TABLE I Description of Selected Datasets for the Classification of Land Cover and Land Use From RS Images

Northwestern Polytechnical University created the NWPU dataset in 2017 [42]. This dataset is publicly available to RS researchers. There are 12 classes in this dataset; each class contains between 700 and 1400 images of size 256 × 256 pixels. Within the scene classes, the spatial resolution ranges from 30.0 to 0.2 m per pixel. The dataset is quite difficult because of the rich image variations, some discrepancies within the scene classes, and some commonalities between the scene classes. Table I shows the number of images in each class.

The SIRI-WHU dataset is a publicly available RS database used for land use classification [43]. It consists of 12 classes and 2400 images (200 images in each class). Each image is 200 × 200 and has a spatial resolution of 2 m. Table I demonstrates the description of each class. Moreover, a few sample RS images are shown in Fig. 3.

Fig. 3. - Few sample RS images of SIRI-WHU dataset for land cover classification.

Fig. 4. - Visual illustration of the proposed contrast enhancement approach for RS images.

B. Data Augmentation of Training Set

In this work, we utilized a contrast enhancement technique to generate new images instead of the traditional rotate and flip operations; a few sample images are shown in Fig. 4. A new mathematical fitness function is proposed for contrast enhancement. Consider an input image $\mathbb{I}$ of dimension 256 × 256 × 3, and denote the resultant image by $\mathbb{F}$. A contrast function is applied at the initial step to improve the image pixel intensity values as follows:
\begin{equation*} \tilde{\mathbb{F}}(u) = \frac{1}{\int_0^1 \mathbb{I}^{a-1}(1-\mathbb{I})^{b-1}\,d\mathbb{I}} \times \int_0^u \mathbb{I}^{a-1}(1-\mathbb{I})^{b-1}\,d\mathbb{I} \tag{1} \end{equation*}
where $\mathbb{I}$ is the image intensity over which the integration is performed, and $a$ and $b$ are two adjustable parameters that control the brightness of the image during processing. Finally, the resultant $\tilde{\mathbb{F}}$ is passed to the fitness function to obtain the final enhanced image as follows:
\begin{align*} \mathbb{G}(\tilde{\mathbb{F}}) &= u_1 \log\left(\log\left(S(\tilde{\mathbb{F}})\right)\right) + u_2 H(\tilde{\mathbb{F}}) \\ &\quad + u_3 \log\left(\mathbb{I} + \tilde{\mathbb{F}}\right) \tag{2} \end{align*}
where $u_1$, $u_2$, and $u_3$ are weight coefficients, $S(\tilde{\mathbb{F}})$ denotes the sum of the fringe intensities of the image, and $H(\tilde{\mathbb{F}})$ is the entropy value. This technique is applied only to classes with fewer images; for example, in EuroSAT, the augmentation is performed for the industrial, pasture, permanent crop, and river classes. For these classes, we generated more images until each class reached 3000 images (the maximum class size in this dataset). Similarly, this process is performed for the rest of the datasets.
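As a minimal sketch of the contrast function in (1), the Python code below evaluates the normalized integral numerically with a midpoint rule and applies it to 8-bit intensities through a lookup table. The function names, the scaling of intensities to [0, 1], the step count, and the example values of a and b are illustrative assumptions and not settings taken from this work; the fitness function in (2) is not implemented here.

```python
def _beta_integral(upper, a, b, steps):
    """Midpoint-rule estimate of the integral of x^(a-1) * (1-x)^(b-1) on [0, upper]."""
    if upper <= 0.0:
        return 0.0
    h = upper / steps
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * h  # midpoints never touch 0 or 1 exactly
        total += x ** (a - 1) * (1.0 - x) ** (b - 1)
    return total * h


def beta_contrast(u, a, b, steps=1000):
    """Contrast function of (1) for a normalized intensity u in [0, 1].

    Assumption: a, b >= 1 so that the integrand stays bounded.
    """
    return _beta_integral(u, a, b, steps) / _beta_integral(1.0, a, b, steps)


def enhance(pixels, a, b):
    """Map 8-bit intensities through the contrast function via a 256-entry table."""
    table = [round(255 * beta_contrast(v / 255.0, a, b)) for v in range(256)]
    return [table[p] for p in pixels]


if __name__ == "__main__":
    # Hypothetical parameter choice, for illustration only.
    print(enhance([0, 64, 128, 192, 255], a=2.0, b=2.0))
```

With a = b = 1 the mapping reduces to the identity, which is a convenient sanity check before trying other parameter values.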

C. Proposed Methodology

In this work, we proposed a fused self-attention deep learning architecture for land cover and land use classification from RS images. The proposed architecture consists of two CNN architectures (bottleneck and dense mechanisms) fused in the last stage and then followed by a self-attention layer for feature extraction. Each architecture consists of several intermediate layers: a convolutional layer, pooling layer, ReLU activation layer, batch normalization layer, self-attention layer, FC layer, and a softmax layer. The proposed fused self-attention CNN architecture is illustrated in Fig. 5. As the figure shows, the enhanced images are fed to the network through the input layer. After that, two networks were designed: the inverted bottleneck residual 65-layered network (IBNR-65) and Densenet-64. The IBNR-65 comprises 65 hidden layers arranged in the inverted bottleneck fashion, whereas the Densenet-64 includes 64. In the IBNR-65 network, the initial convolutional layer convolves the features with a depth of 32 and a kernel size of 3 × 3. Mathematically, the activation of the convolutional layer is defined as follows:
\begin{equation*} \tilde{\phi}^{(j)}(l) = \max\left(0, \tilde{\beta}^{j(l)} + \sum_i J^{i,j(l)} * \mathbb{F}^{i(l)}\right) \tag{3} \end{equation*}
where $\mathbb{F}^{i(l)}$ denotes the $i$th input activation map, $\tilde{\phi}^{(j)}(l)$ is the $j$th output activation map, $\tilde{\beta}^{j(l)}$ is the bias of the $j$th output activation map, and $J^{i,j(l)}$ is the convolutional kernel between the $i$th and $j$th maps.
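The convolutional activation in (3) sums kernel responses over the input maps, adds a bias, and clips at zero. The sketch below illustrates this with plain Python lists; the function names and the toy shapes are assumptions, and, as in most CNN libraries, the kernel is applied without flipping (cross-correlation) and without padding.

```python
def correlate2d_valid(fmap, kernel):
    """Slide the kernel over the map with no padding and no kernel flip."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(fmap) - kh + 1, len(fmap[0]) - kw + 1
    return [[sum(fmap[r + m][c + n] * kernel[m][n]
                 for m in range(kh) for n in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def conv_relu(inputs, kernels, biases):
    """inputs[i] is the i-th input map, kernels[i][j] links input i to output j,
    and biases[j] is the bias of output j. Returns the list of output maps."""
    outputs = []
    for j, bias in enumerate(biases):
        responses = [correlate2d_valid(fmap, kernels[i][j])
                     for i, fmap in enumerate(inputs)]
        # Element-wise sum of the responses from all input maps.
        summed = [[sum(vals) for vals in zip(*rows)] for rows in zip(*responses)]
        # Add the bias and apply max(0, .).
        outputs.append([[max(0.0, v + bias) for v in row] for row in summed])
    return outputs


if __name__ == "__main__":
    ones3 = [[1.0] * 3 for _ in range(3)]
    k = [[1.0, 1.0], [1.0, 1.0]]
    print(conv_relu([ones3], [[k, k]], [1.0, -5.0]))
```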

Fig. 5. - Proposed fused self-attention deep learning architecture for land cover and land use classification.

After the convolutional layer, a ReLU activation layer is added to introduce nonlinearity, and its output is fed to the next layer as input. Another important layer in this architecture is the max pooling layer. The purpose of this layer is to scale down the feature maps while keeping the important features for the recognition task. Mathematically, this layer is defined as follows:
\begin{equation*} \tilde{\phi}^{(i)}_{jk} = \max_{m,n}\left(\mathbb{F}^{i}_{js+m,\,ks+n}\right) \tag{4} \end{equation*}
where $s$ is the stride and $(m, n)$ ranges over the pooling window.
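A short sketch of the max pooling operation in (4) follows. The window size, the stride, and the function name are assumptions chosen for illustration.

```python
def max_pool2d(fmap, size=2, stride=2):
    """Output (j, k) is the maximum of the window starting at (j*stride, k*stride)."""
    out_h = (len(fmap) - size) // stride + 1
    out_w = (len(fmap[0]) - size) // stride + 1
    return [[max(fmap[j * stride + m][k * stride + n]
                 for m in range(size) for n in range(size))
             for k in range(out_w)]
            for j in range(out_h)]


if __name__ == "__main__":
    grid = [[1, 2, 3, 4],
            [5, 6, 7, 8],
            [9, 10, 11, 12],
            [13, 14, 15, 16]]
    print(max_pool2d(grid))
```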

After this layer, a batch normalization layer is added to speed up the training process. In addition, the skip connections are combined using an addition layer. The skip connections reduce the complexity of the proposed model and improve the efficiency based on the combined information as follows:
\begin{equation*} \tilde{\phi}^{(i)}_{skip} = L_i + CL \tag{5} \end{equation*}
where $L_i$ denotes the output layer of the respective path and $CL$ denotes the skip layer that is connected with $L_i$ using an addition layer.

We added several residual and dense blocks using these layers, as shown in Fig. 5, for both IBNR-65 and Densenet-64. Each block consists of a 2D convolutional layer, batch normalization layer, grouped convolutional layer, max-pooling layers, and ReLU activation. The depth starts at 32 and ends at 1024. Finally, a global average pooling layer is added to both networks, and their outputs are depth-wise concatenated in a single layer named the depth concatenation layer. After the depth-wise concatenation, we added a self-attention layer to extract the most important information of the input image.
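The fusion step described above can be sketched as global average pooling on each branch followed by concatenation along the depth axis. The code below is only an illustration: the function names are hypothetical, and the assumption that each branch ends with 1024 channels follows from the stated final depth of 1024.

```python
def global_average_pool(feature_maps):
    """Reduce each 2D feature map to its mean, giving one value per channel."""
    return [sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
            for fmap in feature_maps]


def depth_concat(branch_a, branch_b):
    """Concatenate two channel vectors along the depth (channel) axis."""
    return list(branch_a) + list(branch_b)


if __name__ == "__main__":
    # Toy stand-ins for the two branch outputs: 1024 maps of size 2 x 2 each.
    bottleneck_maps = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(1024)]
    dense_maps = [[[0.0, 2.0], [4.0, 6.0]] for _ in range(1024)]
    fused = depth_concat(global_average_pool(bottleneck_maps),
                         global_average_pool(dense_maps))
    print(len(fused), fused[0], fused[-1])
```

Under this assumption the fused vector has 1024 + 1024 = 2048 entries per image.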

1) Self-Attention

Self-attention networks (SANs) are becoming increasingly popular because of their high computational parallelization and adaptability when modeling interdependence. In a CNN architecture, the self-attention module improves network performance by focusing attention on the most important regions of the image. Through the SAN, local features of the image are extracted. Consider the depth concatenation layer features denoted by $\tilde{\phi} \in \mathbb{R}^{C \times N}$, where $C$ denotes the number of channels and $N$ the number of feature locations. A 1 × 1 convolutional operation is performed on $\tilde{\phi}$ to obtain three feature matrices $f$, $g$, and $h$, as shown in Fig. 6. Mathematically, $f$, $g$, and $h$ are defined as follows:
\begin{align*} &f(\tilde{\phi}) = \psi_f \tilde{\phi}, \quad g(\tilde{\phi}) = \psi_g \tilde{\phi} \tag{6}\\ &h(\tilde{\phi}) = \psi_h \tilde{\phi} \tag{7}\\ &\psi_f, \psi_g, \psi_h \in \mathbb{R}^{C^* \times C}. \tag{8} \end{align*}

Fig. 6. - Self-Attention module for local features extraction of the image.

The feature maps $f(\tilde{\phi})$ and $g(\tilde{\phi})$ are combined through a softmax operation:
\begin{align*} &\mathbb{S}_{ji} = \frac{\exp(\psi_{ij})}{\sum_{i=1}^{N} \exp(\psi_{ij})} \tag{9}\\ &\psi_{ij} = f(\tilde{\phi}_i)^T g(\tilde{\phi}_j). \tag{10} \end{align*}

The final output of the self-attention layer is defined in (11), which is further utilized for the feature extraction map.
\begin{equation*} O_j = \Phi\left(\sum_{i=1}^{N} \mathbb{S}_{ji}\, h(\tilde{\phi}_i)\right). \tag{11} \end{equation*}
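The computation in (6) to (11) can be sketched with small matrices as follows. This is an illustrative sketch only: the function names are hypothetical, the projection matrices are passed in rather than learned, and the outer map Φ in (11) is taken to be the identity, which is an assumption.

```python
import math


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[r][k] * b[k][c] for k in range(len(b)))
             for c in range(len(b[0]))]
            for r in range(len(a))]


def column(matrix, j):
    return [row[j] for row in matrix]


def self_attention(phi, w_f, w_g, w_h):
    """phi is C x N; w_f, w_g, w_h are C* x C projections as in (6) to (8).

    Returns (output, weights), where weights[j][i] is S_ji from (9) and
    output is C* x N with column j equal to sum_i S_ji * h(phi_i).
    """
    f, g, h = matmul(w_f, phi), matmul(w_g, phi), matmul(w_h, phi)
    n = len(phi[0])
    weights, out_cols = [], []
    for j in range(n):
        # psi_ij = f(phi_i)^T g(phi_j) for every location i, as in (10).
        scores = [sum(x * y for x, y in zip(column(f, i), column(g, j)))
                  for i in range(n)]
        top = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        s_j = [e / total for e in exps]
        weights.append(s_j)
        out_cols.append([sum(s_j[i] * h[c][i] for i in range(n))
                         for c in range(len(h))])
    output = [[out_cols[j][c] for j in range(n)] for c in range(len(h))]
    return output, weights


if __name__ == "__main__":
    eye = [[1.0, 0.0], [0.0, 1.0]]
    out, w = self_attention([[1.0, 0.0, 2.0], [0.5, 1.0, 0.0]], eye, eye, eye)
    print(out)
    print(w)
```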

The resultant attention map feature matrix is fed to an FC layer, in which every input feature is connected to every neuron. Mathematically, it is defined as follows:
\begin{equation*} \tilde{\phi}^{l}(j) = A\left(\sum_{i=1}^{n(l-1)} \tilde{\phi}^{l-1}(i)\, \mathbb{X}^{l}(i,j) + \beta_j^{(l)}\right) \tag{12} \end{equation*}
where $n(l-1)$ denotes the number of neurons in the previous layer $(l-1)$, $\mathbb{X}^{l}(i,j)$ denotes the weight of the connection from neuron $i$ in layer $(l-1)$ to neuron $j$ in layer $l$, $\beta_j^{(l)}$ is the bias of neuron $j$, and $A$ denotes the activation function of this layer. The output of this layer is fed to the softmax layer for multiclass land cover and land use classification.
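For illustration, an FC layer followed by a softmax can be sketched as below. The function names, the identity default activation, and the toy weights are assumptions made only for this example.

```python
import math


def fully_connected(x, weights, biases, activation=lambda v: v):
    """weights[i][j] connects input neuron i to output neuron j."""
    return [activation(sum(x[i] * weights[i][j] for i in range(len(x))) + biases[j])
            for j in range(len(biases))]


def softmax(scores):
    """Convert class scores to probabilities that sum to one."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    logits = fully_connected([1.0, 2.0], [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]],
                             [0.0, 0.0, 0.5])
    probs = softmax(logits)
    print(logits, probs, probs.index(max(probs)))
```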

D. Training and Feature Extraction

The training was performed after the design of the proposed self-attention fused CNN architecture. Several hyperparameters are involved in the training process, such as the learning rate, mini-batch size, number of epochs, and optimizer. The values of these hyperparameters were computed using BO. The dataset was split before the augmentation process, and the model was trained on 50% of the images; augmentation was then performed on the split images. After training, deep features of size N × 2048 are extracted from the self-attention layer and further used for classification. Several NN classifiers, namely narrow, medium, wide, bilayered, and trilayered, were employed for classification. The accuracy and a few other measures were computed for each classifier. The accuracy noted at this step was insufficient compared with recently published techniques; therefore, we optimized the NNs using BO. We also increased the number of hidden layers for each classifier.

E. BO-Based Learning

Hyperparameter optimization in deep learning is a black-box optimization problem, in which the objective function $f$ is treated as a black box. In such scenarios, BO is extremely beneficial, especially when human expertise cannot significantly enhance accuracy. By combining prior knowledge of the function $f$ with continuously updated posterior information, this method minimizes loss and maximizes model accuracy. In contrast to the difficult and nonreproducible process of manual tuning, BO searches efficiently for the global optimum of the black-box function of the NN. The first step of BO uses a Gaussian process to update the prior over the function $f$ and obtain the posterior distribution.

Gaussian processes are machine learning techniques developed using Bayesian learning theory and Gaussian stochastic processes. A Gaussian process is a stochastic process for which any finite subcollection of random variables has a multivariate Gaussian distribution. The Gaussian process assumes a statistical model of the function based on the idea that similar inputs produce similar outputs. Just as a Gaussian distribution is defined by its mean and covariance, a Gaussian process is defined by its mean function $n(\mu)$ and its covariance function $p(\mu, \mu')$, both of which are real valued. The Gaussian process is written as follows:
\begin{equation*} f(\mu) \sim GP\left(n(\mu), p(\mu, \mu')\right). \tag{13} \end{equation*}

For an arbitrary $\mu$, the value $f(\mu)$ is no longer a scalar but a random variable that follows a normal distribution over all possible values of $f(\mu)$. This is how the Gaussian process differs from the Gaussian distribution. Assume for convenience that the mean function $n(\mu)$ of the Gaussian process is 0. The squared exponential function is a common option for the covariance function $p$:
\begin{equation*} p(\mu_i, \mu_j) = \exp\left(-\frac{1}{2}\left\| \mu_i - \mu_j \right\|^2\right). \tag{14} \end{equation*}

The following procedure can be used to determine the posterior distribution of $f(\mu)$. The first $t$ sampled observations form the training set $Z_{1:t} = \{\mu_n, f_n\}_{n=1}^{t}$, where $f_n = f(\mu_n)$. Assume that the function values $f$ are drawn from a multivariate normal distribution $f \sim M(0, R)$, where
\begin{equation*} R = \left[ \begin{array}{cccc} p(\mu_1, \mu_1) & p(\mu_1, \mu_2) & \cdots & p(\mu_1, \mu_t)\\ p(\mu_2, \mu_1) & p(\mu_2, \mu_2) & \cdots & p(\mu_2, \mu_t)\\ \vdots & \vdots & \ddots & \vdots\\ p(\mu_t, \mu_1) & p(\mu_t, \mu_2) & \cdots & p(\mu_t, \mu_t) \end{array} \right]. \tag{15} \end{equation*}
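The squared exponential covariance and the resulting matrix R can be sketched as follows. The function names and the unit length scale are assumptions made for illustration.

```python
import math


def sq_exp_kernel(x, y):
    """Squared exponential covariance exp(-0.5 * ||x - y||^2) with unit length scale."""
    return math.exp(-0.5 * sum((a - b) ** 2 for a, b in zip(x, y)))


def covariance_matrix(points):
    """t x t matrix whose (i, j) entry is the kernel between points i and j."""
    return [[sq_exp_kernel(xi, xj) for xj in points] for xi in points]


if __name__ == "__main__":
    for row in covariance_matrix([[0.0], [1.0], [2.0]]):
        print([round(v, 4) for v in row])
```

The matrix is symmetric with ones on the diagonal, and entries decay as the sampled points move apart.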

At the new sample point $\mu_{t+1}$, compute the function value $f_{t+1} = f(\mu_{t+1})$ based on the function $f$. Under the Gaussian process assumption, the function value $f_{t+1}$ and the values $f_{1:t}$ in the training set follow the $(t+1)$-dimensional normal distribution
\begin{equation*} \left[ \begin{array}{c} f_{1:t}\\ f_{t+1} \end{array} \right] \sim M\left(0, \left[ \begin{array}{cc} R & p\\ p^T & p(\mu_{t+1}, \mu_{t+1}) \end{array} \right]\right) \tag{16} \end{equation*}
where $f_{1:t} = [f_1, f_2, \ldots, f_t]^T$ and
\begin{equation*} p = \left[ p(\mu_{t+1}, \mu_1),\; p(\mu_{t+1}, \mu_2),\; \ldots,\; p(\mu_{t+1}, \mu_t) \right]. \tag{17} \end{equation*}

Moreover, $f_{t+1}$ follows a one-dimensional normal distribution, i.e., $f_{t+1} \sim M(\eta_{t+1}, \beta^2_{t+1})$. By the properties of the joint Gaussian (normal) distribution,
\begin{align*} &\eta_{t+1}(\mu_{t+1}) = p^T R^{-1} f_{1:t} \tag{18}\\ &\beta^2_{t+1}(\mu_{t+1}) = -p^T R^{-1} p + p(\mu_{t+1}, \mu_{t+1}). \tag{19} \end{align*}
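The sketch below computes the standard zero-mean Gaussian process posterior mean $p^T R^{-1} f_{1:t}$ and variance $p(\mu_{t+1}, \mu_{t+1}) - p^T R^{-1} p$ at a new point. The function names, the unit length scale, and the small diagonal jitter added for numerical stability are assumptions for illustration.

```python
import math


def kernel(x, y):
    """Squared exponential covariance with unit length scale."""
    return math.exp(-0.5 * sum((a - b) ** 2 for a, b in zip(x, y)))


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def gp_posterior(train_x, train_f, new_x, jitter=1e-10):
    """Posterior mean and variance of a zero-mean GP at new_x."""
    t = len(train_x)
    R = [[kernel(train_x[i], train_x[j]) + (jitter if i == j else 0.0)
          for j in range(t)] for i in range(t)]
    p = [kernel(new_x, xi) for xi in train_x]
    mean = sum(pi * wi for pi, wi in zip(p, solve(R, train_f)))
    var = kernel(new_x, new_x) - sum(pi * vi for pi, vi in zip(p, solve(R, p)))
    return mean, max(var, 0.0)  # clip tiny negative values caused by rounding


if __name__ == "__main__":
    xs, fs = [[0.0], [1.0], [2.0]], [0.0, 1.0, 0.0]
    print(gp_posterior(xs, fs, [1.0]))
    print(gp_posterior(xs, fs, [1.5]))
```

At an already observed point the posterior mean reproduces the observation and the variance collapses to nearly zero, while far from the data the prediction reverts to the prior.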

In the second step, the best points are chosen for the function $f$ by using an acquisition function. We opted for the EI acquisition function in this work. When a point is explored in the region of the present optimum value, the EI function computes the improvement that the point is expected to achieve. The degree of improvement is the difference between the function value at the sample point and the present optimum value. The improvement is 0 if the function value at the sample point is smaller than the existing optimum value. Mathematically, it is defined as follows:
\begin{equation*} I(\mu) = \max\left\{0, f_{t+1}(\mu) - f(\mu^+)\right\}. \tag{20} \end{equation*}

Our aim is to maximize the EI with respect to the existing optimum value $f(\mu^+)$:
\begin{equation*} \mu = \operatorname{argmax}\; \mathrm{E}\left(\max\left\{0, f_{t+1}(\mu) - f(\mu^+)\right\}\right) \tag{21} \end{equation*}
Because $f_{t+1}(\mu)$ follows a normal distribution with mean $\eta(\mu)$ and variance $\sigma^2(\mu)$ (the posterior variance in (19)), the difference $f_{t+1}(\mu) - f(\mu^+)$ is normally distributed with mean $\eta(\mu) - f(\mu^+)$ and variance $\sigma^2(\mu)$. The EI is therefore defined as follows:
\begin{align*} \mathrm{E}(I) &= \int_{-\infty}^{\infty} I\, f(I)\, dI \\ &= \int_{I=0}^{\infty} I \frac{1}{\sqrt{2\pi}\,\sigma(\mu)} \exp\left(-\frac{\left(\eta(\mu) - f(\mu^+) - I\right)^2}{2\sigma^2(\mu)}\right) dI \tag{22}\\ &= \sigma(\mu)\left[Y \Phi(Y) + \varphi(Y)\right] \tag{23}\\ Y &= \frac{\eta(\mu) - f(\mu^+)}{\sigma(\mu)} \tag{24} \end{align*}
where $\Phi$ and $\varphi$ denote the cumulative distribution function and the probability density function of the standard normal distribution, respectively.
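The closed-form EI, $\sigma[Y\Phi(Y) + \varphi(Y)]$ with $Y = (\eta - f(\mu^+))/\sigma$, can be evaluated with the standard normal CDF and PDF as sketched below. The function names and the handling of a zero standard deviation are assumptions for illustration, and the maximization convention of (20) is used.

```python
import math


def norm_pdf(y):
    """Standard normal probability density."""
    return math.exp(-0.5 * y * y) / math.sqrt(2.0 * math.pi)


def norm_cdf(y):
    """Standard normal cumulative distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))


def expected_improvement(mean, std, best):
    """Expected improvement over the current best value (maximization)."""
    if std <= 0.0:
        # No uncertainty left: the improvement is deterministic.
        return max(0.0, mean - best)
    y = (mean - best) / std
    return std * (y * norm_cdf(y) + norm_pdf(y))


if __name__ == "__main__":
    # Hypothetical posterior (mean, std) pairs at three candidate points.
    candidates = [(0.90, 0.05), (0.85, 0.20), (0.95, 0.01)]
    best_so_far = 0.92
    scores = [expected_improvement(m, s, best_so_far) for m, s in candidates]
    print(scores, scores.index(max(scores)))
```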

In the third step, the sampling point suggested by the acquisition function is determined. In the fourth step, the objective function is evaluated at that point to validate the result. Finally, the newly evaluated point is added to the previously chosen sample points, and the statistical Gaussian distribution model is updated. In this work, we used BO for the dynamic selection of the hyperparameters used to train the proposed model. The selected hyperparameters are listed in Table II.
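The four steps can be sketched as a small loop over a one-dimensional search space. This is a minimal illustration under stated assumptions, not the procedure used in the experiments: the RBF kernel, its length scale, the fixed initial points, the candidate grid, and the toy objective are all hypothetical choices, and NumPy is assumed to be available.

```python
import math

import numpy as np


def rbf(a, b, length=0.2):
    """Squared-exponential kernel between two 1-D point sets (assumed kernel)."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)


def gp_posterior(x_obs, y_obs, x_new, jitter=1e-6):
    """Posterior mean and standard deviation of a zero-mean Gaussian process."""
    k = rbf(x_obs, x_obs) + jitter * np.eye(len(x_obs))
    ks = rbf(x_obs, x_new)
    mean = ks.T @ np.linalg.solve(k, y_obs)
    var = 1.0 - np.sum(ks * np.linalg.solve(k, ks), axis=0)
    return mean, np.sqrt(np.maximum(var, 1e-12))


def expected_improvement(mean, std, best):
    """Vectorized closed-form EI (maximization convention)."""
    y = (mean - best) / std
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(y / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * y ** 2) / math.sqrt(2.0 * math.pi)
    return std * (y * cdf + pdf)


def bayes_opt(objective, n_iter=10):
    grid = np.linspace(0.0, 1.0, 201)          # candidate points (assumed grid)
    x_obs = np.array([0.1, 0.9])               # fixed initial samples (assumed)
    y_obs = np.array([objective(x) for x in x_obs])
    for _ in range(n_iter):
        mean, std = gp_posterior(x_obs, y_obs, grid)        # surrogate model
        ei = expected_improvement(mean, std, y_obs.max())   # acquisition
        x_next = grid[int(np.argmax(ei))]                   # suggested point
        x_obs = np.append(x_obs, x_next)                    # evaluate and
        y_obs = np.append(y_obs, objective(x_next))         # update the data
    best = int(np.argmax(y_obs))
    return x_obs[best], y_obs[best], x_obs


# Toy objective standing in for validation accuracy as a function of one hyperparameter.
best_x, best_y, history = bayes_opt(lambda x: -(x - 0.3) ** 2, n_iter=8)
print(best_x, best_y)
```

Each pass refits the surrogate to all points evaluated so far, so the cost of one iteration grows with the number of evaluations, which is one reason the tuning step adds to the overall computation time.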

TABLE II Selected Hyperparameters and Their Ranges

After fine-tuning the selected NN classifiers using BO, classification was performed again and improved accuracy was obtained; however, this step has a drawback: increased computational time. Therefore, we proposed a new optimization technique, named QHPO, that selects the best testing features of the self-attention layer.

F. QHPO-Based Feature Selection

The original HPO is a population-based optimization algorithm in which the search agents are hippopotamuses (HP). The position of each HP represents the values of the decision variables. Hence, each HP denotes a vector, and a matrix defines the population of HP. Similar to other optimization algorithms, random initial solutions are generated; the vector of decision variables is generated using the following formulation: \begin{equation*} {{\xi }_i}:{{\tilde{\phi }}_{ij}} = l{{L}_j} + r \cdot \left({u{{L}_j} - l{{L}_j}} \right) \tag{25} \end{equation*} where $i = 1,2,3, \ldots,N$ and $j = 1,2,3, \ldots,m$. The notation ${{\xi }_i}$ denotes the position of the $i$th candidate solution, $r$ denotes a random number, $u{{L}_j}$ denotes the upper bound of the $j$th decision variable, $l{{L}_j}$ denotes the lower bound of the $j$th decision variable, $N$ denotes the total population size, and $m$ denotes the total number of decision variables in the problem. The population matrix is given by \begin{equation*} \xi = {{\left[ {\begin{array}{c} {{{\xi }_1}}\\ {{{\xi }_2}}\\ \vdots \\ {{{\xi }_i}}\\ \vdots \\ {{{\xi }_N}} \end{array}} \right]}_{N \times m}}. \tag{26} \end{equation*}
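A minimal sketch of this initialization is given below, assuming that the first term of (25) is the lower bound (as in (31)) and that $r$ is drawn uniformly from [0, 1). The function name and the example bounds are hypothetical.

```python
import random


def init_population(n_agents, lower, upper, rng):
    """Eq. (25): draw each decision variable uniformly between its bounds."""
    return [
        [lo + rng.random() * (hi - lo) for lo, hi in zip(lower, upper)]
        for _ in range(n_agents)
    ]


rng = random.Random(0)
# N = 5 agents with m = 3 decision variables, i.e., the N x m matrix of Eq. (26).
population = init_population(5, lower=[0.0, 0.0, 0.0], upper=[1.0, 1.0, 1.0], rng=rng)
print(len(population), len(population[0]))
```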

1) Phase 1. Update Position (Exploration)

In this phase, the male HP position is updated using the following formulation: \begin{equation*} \xi _i^{male}:\tilde{\phi }_{ij}^{male} = {{\tilde{\phi }}_{ij}} + {{\tilde{\phi }}_1} \cdot \left({DHP - {{I}_1}{{{\tilde{\phi }}}_{ij}}} \right) \tag{27} \end{equation*} where $\xi _i^{male}$ denotes the male HP position and $DHP$ denotes the dominant HP position. After that, the female HP position is updated as follows: \begin{equation*} \xi _i^{Fmale}:\tilde{\phi }_{ij}^{Fmale} = \left\{ {\begin{array}{ll} {{{\tilde{\phi }}}_{ij}} + {{H}_1} \cdot \left({DHP - {{I}_2}\mu } \right), & T > 0.625\\ \text{Ignore}, & \text{elsewhere} \end{array}} \right.. \tag{28} \end{equation*}

Here, $\xi _i^{Fmale}$ denotes the position of the female HP, $\mu $ is the mean value of a few randomly selected HP, and $T$ is a threshold value selected on the basis of different trials. In the later step, the immature HP position is updated based on the following objective function: \begin{align*} &{{\xi }_i} = \left\{ {\begin{array}{ll} \xi _i^{male}, & F_i^{male} < {{F}_i}\\ {{\xi }_i}, & \text{elsewhere} \end{array}} \right. \tag{29}\\ &{{\xi }_i} = \left\{ {\begin{array}{ll} \xi _i^{Fmale}, & F_i^{Fmale} < {{F}_i}\\ {{\xi }_i}, & \text{elsewhere} \end{array}} \right.. \tag{30} \end{align*}
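The male update of (27) followed by the greedy replacement of (29) can be sketched as follows. This is an illustration under assumptions: the scale factor in (27) is taken as a uniform random number in [0, 1), $I_1$ as a random integer in {1, 2}, the objective is minimized, and `sphere` is a hypothetical stand-in for the real fitness. The female update of (28) is omitted.

```python
import random


def sphere(x):
    """Hypothetical objective to minimize (stands in for the real fitness)."""
    return sum(v * v for v in x)


def exploration_step(population, fitness, objective, rng):
    """One pass of Eq. (27) with the greedy replacement of Eq. (29)."""
    best_index = min(range(len(population)), key=fitness.__getitem__)
    dominant = list(population[best_index])       # dominant HP position (DHP)
    for i, agent in enumerate(population):
        y1 = rng.random()                         # assumed: uniform in [0, 1)
        i1 = rng.choice([1, 2])                   # assumed: integer factor I_1
        candidate = [x + y1 * (d - i1 * x) for x, d in zip(agent, dominant)]
        cand_fit = objective(candidate)
        if cand_fit < fitness[i]:                 # keep the move only if it improves
            population[i], fitness[i] = candidate, cand_fit
    return population, fitness


rng = random.Random(0)
population = [[rng.uniform(-5.0, 5.0) for _ in range(3)] for _ in range(6)]
fitness = [sphere(p) for p in population]
population, fitness = exploration_step(population, fitness, sphere, rng)
print(min(fitness))
```

Because a candidate replaces the current position only when its objective value is lower, the fitness of every agent is non-increasing from one iteration to the next.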

2) Phase 2. HP Defense Against Predators

The security and safety of the HP is another factor; therefore, it is important to protect them from predators. Hence, the predator position is an important factor, and it is defined as follows: \begin{equation*} \widetilde {PR}:{{\widetilde {PR}}_j} = l{{L}_j} + {{\vec{r}}_8} \cdot \left({u{{L}_j} - l{{L}_j}} \right) \tag{31} \end{equation*} where ${{\vec{r}}_8}$ denotes a random vector in the range 0 to 1 and $j = 1,2,3, \ldots,m$. The distance is computed using the following equation: \begin{equation*} \overrightarrow {Dis} = \left| {{{{\widetilde {PR}}}_j} - {{{\tilde{\phi }}}_{ij}}} \right|. \tag{32} \end{equation*}

The distance is computed from the $i$th HP to the predator. The HP position is then updated using the following equation: \begin{align*} & \xi _i^{HPR}:\tilde{\phi }_{i,j}^{HPR} = \\ & \left\{ {\begin{array}{ll} \overrightarrow {RV} \oplus {{{\widetilde {PR}}}_j} + \left({\frac{e}{{\left({c - d \times \cos \left({2\pi g} \right)} \right)}}} \right)\left({\frac{1}{{\overrightarrow {Dis} }}} \right), & {{F}_{{{{\widetilde {PR}}}_j}}} < {{F}_i}\\ \overrightarrow {RV} \oplus {{{\widetilde {PR}}}_j} + \left({\frac{e}{{\left({c - d \times \cos \left({2\pi g} \right)} \right)}}} \right)\left({\frac{1}{{2 \times \overrightarrow {Dis} + {{r}_9}}}} \right), & {{F}_{{{{\widetilde {PR}}}_j}}} \geq {{F}_i} \end{array}} \right. \tag{33} \end{align*} where $\xi _i^{HPR}$ is the HP position facing the predator and $\overrightarrow {RV} $ denotes the reduction vector based on the Lévy distribution (LD). In the next step, a mitigation process is conducted as follows: \begin{equation*} {{\xi }_i} = \left\{ {\begin{array}{ll} \xi _i^{HPR}, & F_i^{HPR} < {{F}_i}\\ {{\xi }_i}, & F_i^{HPR} \geq {{F}_i} \end{array}} \right.. \tag{34} \end{equation*}
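A rough sketch of (31) to (34) for a single HP is given below. Several details are assumptions made only for illustration, since they are not specified here: the sampling ranges of $e$, $c$, $d$, and $g$; $r_9$ uniform in [0, 1); the circled-plus operator read as an element-wise product; Lévy steps drawn with Mantegna's algorithm; clipping to the search bounds; and a minimized objective.

```python
import math
import random


def levy_step(rng, beta=1.5):
    """Draw one Levy-distributed step with Mantegna's algorithm (assumed choice)."""
    num = math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
    den = math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2)
    sigma_u = (num / den) ** (1 / beta)
    u = rng.gauss(0.0, sigma_u)
    v = rng.gauss(0.0, 1.0) or 1e-12          # avoid a zero denominator
    return u / abs(v) ** (1 / beta)


def defense_step(agent, agent_fit, lower, upper, objective, rng):
    """Eqs. (31)-(34) for one HP; returns the kept position and its fitness."""
    # Eq. (31): random predator position inside the search bounds.
    predator = [lo + rng.random() * (hi - lo) for lo, hi in zip(lower, upper)]
    pred_fit = objective(predator)
    # Assumed sampling ranges for the scalars e, c, d, and g of Eq. (33).
    e, c = rng.uniform(2.0, 4.0), rng.uniform(1.0, 1.5)
    d, g = rng.uniform(2.0, 3.0), rng.uniform(-1.0, 1.0)
    scale = e / (c - d * math.cos(2 * math.pi * g))
    candidate = []
    for x, p, lo, hi in zip(agent, predator, lower, upper):
        dist = max(abs(p - x), 1e-12)             # Eq. (32), guarded against zero
        if pred_fit < agent_fit:
            denom = dist
        else:
            denom = 2 * dist + rng.random()       # r_9 assumed uniform in [0, 1)
        value = levy_step(rng) * p + scale / denom   # Eq. (33), product reading
        candidate.append(min(max(value, lo), hi))    # clip to bounds (assumption)
    cand_fit = objective(candidate)
    if cand_fit < agent_fit:                      # Eq. (34): greedy acceptance
        return candidate, cand_fit
    return agent, agent_fit


rng = random.Random(1)
objective = lambda x: sum(v * v for v in x)       # hypothetical fitness
position, fit = [0.5, -0.5], 0.5
position, fit = defense_step(position, fit, [-1.0, -1.0], [1.0, 1.0], objective, rng)
print(position, fit)
```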

3) Phase 3. Exploitation With Quantum Gate

In this phase, the new position update formulation is discussed. When an immature or adult HP leaves the group, it faces an attack by a predator. Therefore, the position is updated as follows: \begin{align*} &l{{L}_j}^{local} = \frac{{l{{L}_j}}}{t},\quad u{{L}_j}^{local} = \frac{{u{{L}_j}}}{t},\quad {\rm{where\ }}t = 1,2,3, \ldots,T \tag{35}\\ &\xi _i^{HP \in }:\tilde{\phi }_{i,j}^{HP \in } = {{\tilde{\phi }}_{i,j}} + {{r}_{10}} \cdot \left({l{{L}_j}^{local} + {\mathbb{S}}_{1}\left({u{{L}_j}^{local} - l{{L}_j}^{local}} \right)} \right) \tag{36} \end{align*} where $\xi _i^{HP \in }$ denotes the position of the HP used to search for the nearest safe place. Hence, the final position update is defined as follows: \begin{equation*} {{\xi }_i} = \left\{ {\begin{array}{ll} \xi _i^{HP \in }, & F_i^{HP \in } < {{F}_i}\\ {{\xi }_i}, & F_i^{HP \in} \geq {{F}_i} \end{array}} \right.. \tag{37} \end{equation*}
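The local search of (35) to (37) can be sketched for a single HP as follows. The sketch assumes that $r_{10}$ and ${\mathbb{S}}_1$ are uniform random numbers in [0, 1), that candidates are clipped to the global bounds, and that the objective is minimized. All names are hypothetical.

```python
import random


def escape_step(agent, agent_fit, lower, upper, t, objective, rng):
    """Eqs. (35)-(37) for one HP at iteration t (t >= 1)."""
    r10 = rng.random()                           # assumed uniform in [0, 1)
    candidate = []
    for x, lo, hi in zip(agent, lower, upper):
        local_lo, local_hi = lo / t, hi / t      # Eq. (35): bounds shrink with t
        s1 = rng.random()                        # assumed uniform in [0, 1)
        step = r10 * (local_lo + s1 * (local_hi - local_lo))   # Eq. (36)
        candidate.append(min(max(x + step, lo), hi))           # clip (assumption)
    cand_fit = objective(candidate)
    if cand_fit < agent_fit:                     # Eq. (37): greedy acceptance
        return candidate, cand_fit
    return agent, agent_fit


rng = random.Random(2)
objective = lambda x: sum(v * v for v in x)      # hypothetical fitness
position, fit = [0.8, -0.8], 1.28
for t in range(1, 21):
    position, fit = escape_step(position, fit, [-1.0, -1.0], [1.0, 1.0], t, objective, rng)
print(position, fit)
```

Dividing the bounds by the iteration counter makes the step size shrink over time, so late iterations only probe the immediate neighborhood of the current position.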

The selected feature vector $({{{\xi }_i}\ F_i^{HP \in }})$ is further refined using the rotation quantum gate (RQG) approach. In the quantum modification, we used the RQG to speed up the HP's search. Each Q-bit individual balances exploitation and exploration to speed up the search process and maintain a distance from the predator. The RQG is mathematically defined as follows: \begin{equation*} {\mathbb {U}}\left({\mathds{g}} \right) = \left[ {\begin{array}{lc} \cos\left({\mathds{g}} \right) & - \sin\left({\mathds{g}} \right)\\ \sin\left({\mathds{g}} \right) & \cos\left({\mathds{g}} \right) \end{array}} \right]. \tag{38} \end{equation*}
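The effect of the gate in (38) on a single Q-bit can be sketched as follows. Representing a Q-bit by an amplitude pair (alpha, beta) with alpha^2 + beta^2 = 1, and reading beta^2 as the probability that a feature bit equals 1, are conventions assumed here for illustration; the rotation angle in the example is arbitrary rather than taken from a lookup table.

```python
import math


def rotate_qbit(alpha, beta, theta):
    """Apply the rotation gate of Eq. (38) to the amplitude pair (alpha, beta)."""
    new_alpha = math.cos(theta) * alpha - math.sin(theta) * beta
    new_beta = math.sin(theta) * alpha + math.cos(theta) * beta
    return new_alpha, new_beta


# Start from an unbiased Q-bit and rotate it by a small hypothetical angle.
alpha, beta = 1 / math.sqrt(2), 1 / math.sqrt(2)
alpha, beta = rotate_qbit(alpha, beta, theta=0.05 * math.pi)
prob_selected = beta ** 2   # assumed reading: probability that the bit is 1
print(prob_selected)
```

Because the gate is a rotation, it preserves alpha^2 + beta^2, so the pair remains a valid Q-bit after every update.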

The rotation angle $({\mathds{g}})$ and its direction are determined from the lookup table [44]. This equation is embedded in (36) and updates the position of the selected features (HP). The following fitness function is employed for each binary feature vector: \begin{equation*} Fitness = \alpha \left({Err} \right) + \beta \left(k \right). \tag{39} \end{equation*}

The cost function is defined as follows: \begin{align*} &{{\oint }_{cost}} = {{\phi }_\alpha } \times {{\eta }_{error}} + {{\phi }_\beta } \times \left({\frac{{num\_feat}}{{\text{max}\_feat}}} \right) \tag{40}\\ &{{\eta }_{error}} = 1 - {{\mathcal{A}}_{accuracy}} \tag{40a} \end{align*} where ${{\phi }_\alpha }$ and ${{\phi }_\beta }$ are constants with values of 0.99 and 0.01, respectively, ${{\oint }_{cost}}$ denotes the cost function, and ${{\mathcal{A}}_{accuracy}}$ denotes the accuracy obtained from the fitness function. The features are selected by applying QHPO, and the size of the selected feature set is $N \times M$. The selected features are finally classified using the NN classifiers whose hyperparameters were fine-tuned using BO. The detailed results are presented in Section III.
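As a worked illustration of (40) and (40a), the cost of a candidate feature subset can be computed as follows. The function and argument names are hypothetical, and the accuracy and feature counts in the example are made-up values, not results from this work.

```python
def selection_cost(accuracy, num_feat, max_feat, w_error=0.99, w_size=0.01):
    """Eq. (40): weighted sum of the error rate and the selected-feature ratio."""
    error = 1.0 - accuracy                                    # Eq. (40a)
    return w_error * error + w_size * (num_feat / max_feat)   # Eq. (40)


# Two hypothetical subsets with equal accuracy: the smaller one costs less.
print(selection_cost(accuracy=0.895, num_feat=500, max_feat=1000))
print(selection_cost(accuracy=0.895, num_feat=250, max_feat=1000))
```

With weights of 0.99 and 0.01, the error term dominates, so the size term mainly acts as a tie-breaker between subsets of similar accuracy.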

SECTION III.

Results and Discussion

The results of the proposed framework are described in this section. The experiments are carried out on three selected datasets, as discussed in the dataset description in Section II-A. Each dataset is split 50:50; that is, 50% of the samples are used for training and the remaining 50% for testing. The proposed model is 134 layers deep with 18.6M trainable parameters. Tenfold cross-validation is employed to mitigate overfitting. For the training of the proposed model, the hyperparameters are learning rate, mini-batch size, epochs, and optimizer, with values of 0.00012, 128, 40, and SGDM, respectively. NN classifiers are selected for the classification outcomes. BO is employed for the hyperparameter tuning of the NN classifiers. The hyperparameters selected for tuning are described in Table III. The classification outcomes are evaluated using accuracy, precision rate, recall rate, F1-score, and computation time in seconds. All the experiments are performed in MATLAB R2023b on a Gigabyte desktop computer equipped with a 13th Gen Core-i5 3.50 GHz processor, 128 GB RAM, 500 GB SSD, 1 TB HDD, and a 12 GB NVIDIA RTX 3060 graphics card.
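The evaluation measures can be computed from a confusion matrix as sketched below. Macro averaging over classes, and taking the F1-score as the harmonic mean of the macro precision and macro recall, are conventions assumed here; the averaging convention behind the reported tables is not stated. The example matrix is made up.

```python
def macro_metrics(confusion):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[i][i] for i in range(n)) / total
    precisions, recalls = [], []
    for k in range(n):
        tp = confusion[k][k]
        predicted_k = sum(confusion[i][k] for i in range(n))   # column sum
        actual_k = sum(confusion[k])                           # row sum
        precisions.append(tp / predicted_k if predicted_k else 0.0)
        recalls.append(tp / actual_k if actual_k else 0.0)
    precision = sum(precisions) / n
    recall = sum(recalls) / n
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1


# Hypothetical two-class confusion matrix.
print(macro_metrics([[8, 2], [1, 9]]))
```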

TABLE III Selected Hyperparameters for Tuning Using BO

The following experiments have been performed to evaluate the proposed framework.

  1. Experiment 1: Proposed classification results using proposed fusion-based self-attention CNN architecture without using optimized hyperparameters of NN classifiers.

  2. Experiment 2: Proposed classification results using fusion-based self-attention CNN architecture using BO-based optimized hyperparameters of NN classifiers.

  3. Experiment 3: Proposed classification results using QHPO-based feature selection, where the BO-based optimized hyperparameters of NN classifiers are used.

A. Classification Results on the EuroSAT Dataset

The classification results of the proposed architecture for the EuroSAT dataset are presented in this section. The results of each of the listed experiments are explained. In experiment 1, the proposed network-level fusion self-attention CNN was trained on the EuroSAT dataset, and the self-attention features were extracted. Several NN classifiers, namely narrow NN (NNN), medium NN (MNN), wide NN (WNN), bilayered NN (BNN), and trilayered NN (TNN), were employed to compute the classification results. The results of this experiment are given in Table IV. This table shows that the WNN classifier achieved the highest accuracy of 90.3%. The precision rate is 89.05%, the recall rate is 88.59%, and the F1-score is 88.82%. These measures are also noted for the rest of the listed classifiers. The computation time is also recorded for all the NN classifiers. The shortest time is recorded for the MNN classifier at 493.57 (sec), while the longest time is noted for the NNN classifier at 1185.9 (sec).

TABLE IV Classification Results of the Proposed Fused Self-Attention CNN Model on the EuroSAT Dataset

To improve the accuracy and other performance measures, we fine-tuned the hyperparameters of the selected NN classifiers in experiment 2. The hyperparameters were fine-tuned using the BO algorithm. The classification results of the tuned NN classifiers are shown in Table V. The table statistics show that the TNN classifier obtained higher accuracy than the other classifiers. The accuracy of TNN is 90.0%, precision is 89.01%, recall is 89.14%, and F1-score is 89.07%. The execution time is recorded for all the classifiers, and it is observed that this process increases the time while improving the other performance measures. The longest time of experiment 1 is 1154.4 (sec); however, experiment 2’s longest time is 2278.3 (sec).

TABLE V Classification Results of BO Tuning of NN Classifiers on the EuroSAT Dataset

To reduce the computational time of the proposed architecture, we implemented a new QHPO algorithm that selects the best features for the classification. Table VI presents the classification results of the proposed QHPO. From this table, the WNN classifier gained 89.5% accuracy. The precision, recall, and F1-score values are 88.63%, 88.65%, and 88.63%, respectively. Moreover, the confusion matrix of this experiment is illustrated in Fig. 7. The confusion matrix gives the number of observations and the TPR values. The confusion matrix can confirm the TNN classifier's accuracy and other performance measures. The computation time is measured for all the classifiers, and it is observed that the QHPO-based feature selection process significantly reduced the computation time while almost maintaining the accuracy. The longest time for this experiment is 284.01 (sec) for the NNN classifier, and the WNN classifier has the shortest execution time of 193.06 (sec).

TABLE VI Classification Results of the Proposed QHPO on the EuroSAT Dataset

Fig. 7. Confusion matrix of the EuroSAT dataset after employing the QHPO-optimized architecture.

B. Classification Results on NWPU-RESISC45 Dataset

The classification results of the proposed architecture for the NWPU-RESISC45 dataset are presented in this subsection. The results of each of the listed experiments are explained. In the first experiment, the proposed fused self-attention CNN architecture is trained on the NWPU-RESISC45 dataset, and the prominent features are extracted from the self-attention activation. The NN classifiers are utilized to obtain the classification results, as presented in Table VII. From this table, it is observed that the WNN classifier obtained a maximum accuracy of 87.7%. WNN's precision, recall, and F1-score are 88.3%, 88.1%, and 88.2%, respectively. These values are also measured for all the listed classifiers. The execution time is recorded, and it is observed that the BNN classifier requires 371.63 (sec) for execution, whereas the MNN classifier needs 102.14 (sec).

TABLE VII Classification Results of the Proposed Fused CNN on NWPU-RESISC45 Dataset

In the second experiment, BO is utilized to optimize the hyperparameters of the NN classifiers. Table VIII shows the optimized results of the NN classifiers on the NWPU-RESISC45 dataset. The WNN classifier achieved the highest accuracy of 91.8%. The other measures, namely precision, recall, and F1-score, are 91.81%, 91.51%, and 91.65%, respectively. The accuracy of the WNN classifier improves from 87.7% to 91.8% after fine-tuning the NN classifiers using BO. The execution time is noted for all the classifiers, and it was observed that, along with the improvement in accuracy, the execution time increased by ∼37 (sec) for the WNN classifier.

TABLE VIII Classification Results of BO Tuning of NN Classifiers on NWPU-RESISC45 Dataset

To reduce the computation time of the fine-tuned NN classifiers, we implemented the QHPO feature selection algorithm. The classification results of the proposed QHPO algorithm are presented in Table IX. From this table, the WNN classifier again achieved the maximum accuracy among all the classifiers, and it was also better in the other measures, including precision, recall, and F1-score. The values of these measures are 91.91%, 91.44%, and 91.67%, respectively. The confusion matrix of this experiment is illustrated in Fig. 8. This figure gives the correct prediction rate for each class, including the number of observations. In addition, the computation time is noted, and it is observed that the computational time is significantly reduced after applying the selection algorithm. Moreover, the accuracy is maintained after feature selection.

TABLE IX Classification Results of the Proposed QHPO on NWPU-RESISC45 Dataset

Fig. 8. Confusion matrix of the NWPU-RESISC45 dataset after employing the proposed optimized architecture.

C. Classification Results on SIRI-WHU Dataset

The classification results of the proposed architecture for the SIRI-WHU dataset are presented in this subsection. The results of each of the listed experiments are explained. In the first experiment, the self-attention layer is utilized for feature extraction, and the features are fed to the NN classifiers for classification. Table X presents the results of this experiment. The WNN classifier gained a maximum accuracy of 98.2%. The other performance measures, namely precision, recall, and F1-score, are 98.1%, 98.2%, and 98.1%, respectively. The testing time is computed for all the classifiers; the TNN classifier takes the longest time of 222.77 (sec), whereas the MNN classifier has the shortest time of 43.94 (sec).

TABLE X Classification Results of the Proposed Fused CNN on the SIRI-WHU Dataset

To further improve the performance of classifiers, we fine-tuned the hyperparameters of the classifiers using BO.

In the second experiment, the hyperparameters of the NN classifiers were fine-tuned, and classification was performed. The results of this experiment are presented in Table XI. This table shows that the WNN classifier obtained a maximum accuracy of 98.2%. The precision, recall, and F1-score values are 98.19%, 98.16%, and 98.17%, respectively. The computational time of this experiment is also noted, and it is observed that the time increased compared with that of experiment 1. To resolve this challenge, we employed the proposed QHPO feature selection algorithm.

TABLE XI Classification Results of BO Tuning of NN Classifiers on SIRI-WHU Dataset

The results of experiment 3 are listed in Table XII. This table shows that the BNN classifier achieved the highest accuracy of 98.2% with a 98.23% precision rate, 98.20% recall rate, and 98.21% F1-score. The accuracy obtained in this experiment is almost unchanged, and the confusion matrix is illustrated in Fig. 9. Using this figure, we can confirm the accuracy and other performance measures of this experiment. The time of this experiment is substantially better than that of the previous experiments. The maximum noted time of this experiment is 89.96 (sec) for the TNN classifier, whereas the minimum noted time is 28.99 (sec) for the BNN classifier. Hence, the proposed architecture performed well after employing the feature selection algorithm.

TABLE XII Classification Results of the Proposed QHPO on SIRI-WHU Dataset

Fig. 9. Confusion matrix of the SIRI-WHU dataset after employing the proposed optimized architecture.

D. Discussion

In this study, we proposed a novel architecture based on network-level fusion and the quantum HPO algorithm for agriculture land cover and land use classification. We designed two parallel CNN models based on inverted bottleneck and dense blocks and fused them using a depth concatenation layer. After that, we embedded a self-attention layer and extracted deep features. Several NN classifiers were employed for classification. A comprehensive comparison with state-of-the-art deep learning models is conducted based on accuracy and computation time, as shown in Figs. 10 and 11. We implemented a few pretrained models, namely AlexNet, VGG19, ResNet50, InceptionV3, and NasNetMobile, and compared their accuracy and time with those of the proposed architecture on the selected datasets. Fig. 10 shows that our proposed model outperforms the other networks by achieving 98.2% accuracy on SIRI-WHU, 89.5% on EuroSAT, and 91.7% on the NWPU dataset. NasNetMobile, which is a large and complex deep learning network, achieved 94.76% accuracy on SIRI-WHU, 86.03% on EuroSAT, and 91.7% on the NWPU dataset.

Fig. 10. Comparison of the proposed model accuracy with several other NNs using the selected RS datasets.

Fig. 11. Comparison of the proposed model testing computational time with several other NNs using the selected RS datasets.

The testing computation time is also noted for all the deep learning models. Based on the testing computation time, our proposed network takes the shortest time: 190.37 (sec) for EuroSAT, 28.99 (sec) for SIRI-WHU, and 83.81 (sec) for the NWPU dataset. Among the other deep learning models, VGG19 takes the longest testing time, which is 956.22 (sec) for EuroSAT, 114.53 (sec) for SIRI-WHU, and 190.51 (sec) for the NWPU dataset. Our proposed model is approximately 2× faster than the state-of-the-art deep learning models. Fig. 12 shows that the proposed model is 134 layers deep with 18.6M parameters.
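As a worked check of the reported testing times, the speed-up of the proposed network relative to VGG19 (the slowest baseline above) can be computed directly from the figures quoted in this paragraph. The variable names are illustrative only.

```python
# (VGG19 time, proposed model time) in seconds, as quoted in the text.
times = {
    "EuroSAT": (956.22, 190.37),
    "SIRI-WHU": (114.53, 28.99),
    "NWPU": (190.51, 83.81),
}
speedup = {name: vgg / ours for name, (vgg, ours) in times.items()}
for name, ratio in speedup.items():
    print(f"{name}: {ratio:.2f}x")
# prints:
# EuroSAT: 5.02x
# SIRI-WHU: 3.95x
# NWPU: 2.27x
```

Relative to VGG19, the ratio ranges from roughly 2.3× to 5×; the gap to the faster baselines is smaller.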

Fig. 12. Comparison of the proposed model with pretrained models on the basis of the number of parameters and layers.

We conducted a dataset-wise comparison with the existing techniques, as presented in Table XIII. This table lists several techniques evaluated on the selected datasets. For the SIRI-WHU dataset, the recent best-attained accuracy was 94.16% by Khan and Basalamah [1]; however, the proposed architecture obtained the best accuracy of 98.20% for this dataset. For the EuroSAT dataset, the recent best accuracy is 88.68% by Zhang et al. [45]; the proposed architecture obtained a maximum accuracy of 89.50%. For the NWPU dataset, Liu et al. [46] obtained a maximum accuracy of 90.30%; however, the proposed architecture obtained an accuracy of 91.70%. Based on these facts, the proposed method obtained a better accuracy and precision rate on the selected datasets.

TABLE XIII Comparison of the Proposed Architecture Accuracy With State-of-the-Art (SOTA) Techniques

SECTION IV.

Conclusion

The quality of the data obtained by RS can be affected by factors such as the resolution of the sensor and the weather. Inaccurate managerial decisions can result from low-quality data, which can also impact the accuracy of AI models. In this work, we proposed a novel architecture based on a network-level fused self-attention CNN for agriculture land cover and land use classification. A contrast enhancement technique was proposed at the initial stage with two aims: to improve the quality of the images and to increase the training data. A fused self-attention CNN architecture is proposed and trained on the augmented training set. Deep features are extracted from the self-attention layer and classified using NN classifiers. At this stage, we obtained maximum accuracies of 90.3, 87.7, and 98.2% for the EuroSAT, NWPU, and SIRI-WHU datasets, respectively. We used BO to further improve the performance of the NN classifiers. After BO, the obtained accuracies of the NNs are 90.0, 91.8, and 98.2%, and the highest values of precision, recall, and F1-score are 98.1, 98.2, and 98.1%, respectively. However, this step increased the computational time; therefore, we proposed a QHPO feature selection algorithm and obtained the best accuracies of 89.5, 91.7, and 98.2% with approximately 2× less computational time; the highest precision and recall rates are 98.23 and 98.20%, respectively. Overall, the proposed architecture obtained improved accuracy and consumed less time for execution. The limitation of the proposed work is that the CNN architecture requires a large amount of labeled data to achieve reasonable outcomes. In the future, a vision transformer-based architecture will be designed for land cover and land use classification.
