Introduction
With the global population rising and urbanization advancing, environmental pressures, the scarcity of land resources, and the limited availability of arable land are becoming increasingly prominent, placing agricultural development under constant threat. Achieving high-quality and sustainable agricultural development is therefore a pressing research issue. Remote sensing technology continues to advance, driven by the construction of various earth observation platforms and the growing availability of automated equipment. Remote sensing image classification can quickly and accurately identify environmental changes from spectral information for agricultural land planning and assessment, providing a reliable method for intelligent agricultural monitoring.
The rapid development of artificial intelligence (AI) [1] over the past decade has brought dramatic changes to many fields. Following the success of AlexNet [2], experts across disciplines began to recognize the significance of deep learning. The subsequent emergence of models such as VGG [3] and ResNet [4] pushed deep learning beyond human-level performance in image classification. Deep learning has also achieved strong results in tasks such as agricultural detection [5], [6], marine survey [7], face swapping [8], and image segmentation [9], [10].
However, remote sensing images often suffer from low spatial resolution due to the limitations of imaging sensors and transmission capabilities, making high-quality agricultural data difficult to obtain and costly to annotate. To overcome the challenge of acquiring high-resolution images with sufficient coverage and frequency, remote sensing image super-resolution (SR) has emerged as a promising solution to the growing demand for detailed and accurate information. Deep learning-based image SR methods have made significant progress and play an important role in video processing [11], urban planning [12], land cover classification [13], and other fields. In this article, an SR network is introduced to recover detailed features and reconstruct a high-quality dataset, thereby restoring high-frequency details lost in the low-resolution data while maintaining content consistency.
On the other hand, training SR models still requires a large amount of labeled data, which incurs high training costs. To make more effective use of high-resolution data, this article introduces the idea of semisupervised active learning (AL) to select the most valuable samples for annotation. AL trains a neural network model with a small number of labeled samples and uses the model to estimate the value of unlabeled samples, so that informative samples can be selected and labeled by an observer. This field has, to some extent, alleviated the difficulty of labeling agricultural datasets [14], [15], [16] and has been applied in other fields as well [17], [18], [19]. In addition, semisupervised learning learns from the labeled samples so that unlabeled samples can be labeled autonomously and participate in training. This has the advantage of exploiting the predictive power of the neural network model itself, rather than relying solely on expert labeling, which further reduces the cost of annotation. Semisupervised learning has also achieved remarkable success in many fields [20], [21], [22].
Currently, numerous AL methods are available for evaluating the information content of samples, each with a different focus on sample selection. An excessive focus on a single selection criterion can bias the chosen samples and reduce the efficiency of the resulting dataset, so it is important to integrate the evaluation scores of different AL algorithms. Therefore, a dual-branch fusion selection method is proposed to adapt to different needs.
The main contributions of this article are as follows.
A joint network of SR and AL is designed. The pretrained SR model effectively restores the informative details in low-resolution (LR) remote sensing images; through AL, reconstructed images of high value are selected for human annotation.
A dual-branch selection strategy (DBSS) is proposed for sample information assessment. It balances the contributions between the interclass selection and boundary selection, which helps the annotation of high-value samples.
A self-supervisory assistance strategy (SSAS) optimized for AL is designed to effectively utilize the predictive power of the neural network model itself for samples, thereby reducing the annotation cost.
The rest of this article is organized as follows. Section II describes related work on remote sensing image classification, image SR, and AL. Section III describes the proposed AL method, DBSS, and SSAS. Section IV presents the results of the proposed methods. Finally, Section V concludes this article.
Related Work
A. Remote Sensing Image Classification
As neural networks continue to evolve [23], [24], remote sensing image classification using neural networks is of great value. In 2010, Lienou et al. [25] introduced a topic model from machine learning to remote sensing scene classification to realize the labeling and classification of high-resolution remote sensing scenes. At the same time, Yang collected and organized a dataset containing 21 remote sensing scene classes [26]. This was the first publicly available high-resolution remote sensing scene dataset; it has been widely used by researchers worldwide and remains the most common dataset in this field. Li et al. [27] introduced a pretrained convolutional neural network (CNN) model used as a feature extractor, which constructs feature representations of remote sensing images for classification. However, when classifying remote sensing images with CNNs, model accuracy is constrained by insufficient training samples. To address this problem, Marco et al. [28] compared the performance of traditional training with that of pretrained models, providing a constructive solution to the sample-scarcity problem. Wang et al. [29] proposed an end-to-end attention recurrent convolutional network (ARCNet) for scene classification and fine-tuned it on the target dataset. The features learned by these models are not exactly suited to the target dataset, and these methods suffer from high labeling costs and unbalanced sample processing.
B. AL
After the rise of deep learning, research on deep learning-based AL has emerged. Gal et al. [30] proposed deep Bayesian active learning (DBAL), which can be used with high-dimensional data, an extremely challenging setting for conventional AL methods. Zhu et al. [31] proposed an AL method for query synthesis using generative adversarial networks (GANs), which adaptively generates instances for querying and thus speeds up training. Sener et al. [32] defined the AL problem as core-set selection and proposed a core-set approach in which a subset is selected such that a model trained on it performs comparably to one trained on the full set. Yoo et al. [18] proposed a learning-loss method that scores samples by predicting their loss with a loss prediction module.
Since remote sensing datasets inherently have the problem of high annotation costs, many AL algorithms are based on remote sensing applications. Wang et al. [33] addressed semisupervised object detection (SSOD) in remote sensing images characterized by a long-tailed distribution. The experimental results show that the introduction of AL performs well on the remote sensing dataset DOTA-v1.0. Lars et al. [34] validated the effectiveness of jointly using self-supervised pretraining with AL.
Semisupervised learning refers to pattern recognition using a small amount of labeled data and a large amount of unlabeled data. Commonly used methods include self-training, graph-based semisupervised learning, co-training [35], and label propagation [36]. With the development of deep learning, deep semisupervised learning has also appeared. Li et al. [37] proposed a semisupervised method based on pseudolabels that simply trains on unlabeled data using the class with the maximum predicted probability as the label; to prevent network degradation, a growth function controls the amount of unlabeled data used by the network while keeping the iteration speed as fast as possible and the obtained pseudolabels as accurate as possible. The method is very simple to implement yet yields good results. Laine et al. [38] introduced a self-ensembling method that synthesizes consensus predictions of unlabeled data from different epochs to generate pseudolabels. Self-ensemble prediction uses the network to predict unknown labels under different regularization and input augmentation conditions, so the resulting labels are more consistent. Tarvainen et al. [39] proposed mean teacher, a semisupervised learning method that uses unlabeled data by averaging model weights instead of label predictions. This method effectively improves test accuracy and can be trained with fewer labels than temporal ensembling.
C. Image SR
High-resolution images contain more detailed visual information. Image SR helps enhance image analysis by addressing the blurriness and missing details of low-resolution images. The technique has wide-ranging applications in target monitoring, video reconstruction, and earth observation. Image SR methods can be divided into interpolation-based, reconstruction-based, and deep learning-based methods. Traditional interpolation-based methods, including nearest-neighbor, bicubic [40], and bilinear interpolation, are computationally simple and easy to implement but cannot accurately restore the information lost from an image in complex application scenarios. Reconstruction-based methods construct prior constraints for high-resolution images using regularization [41], including prior knowledge of edge texture, local smoothness, and the nonnegativity of pixel values, and complete image SR by solving an optimization problem. Currently, deep learning-based methods are the mainstream. The super-resolution convolutional neural network (SRCNN) [42] was the first neural network to achieve single-image SR through end-to-end mapping, but for early methods like this the reconstruction quality is limited and the computational complexity is high because the convolutional layers operate in high-resolution (HR) space. MEN [43] used multiscale features of remote sensing images to enhance the network's reconstruction capability. In addition, VDSR [44], DRRN [45], ESRGAN [46], RFB-ESRGAN [47], and other methods have emerged one after another, providing effective solutions for reconstructing low-resolution remote sensing images.
Methodology
In this section, we introduce three main components: 1) a joint network of SR and AL that performs well on agricultural remote sensing datasets; 2) a DBSS for sample information assessment that balances the contributions of interclass selection and boundary selection; and 3) an SSAS that optimizes AL by effectively exploiting the predictive power of the neural network model itself, thus further reducing the labeling cost.
A. Remote Sensing Images Reconstruction
Since texture details vary greatly from one remote sensing image to another, this article uses a pretrained SR model, RFB-ESRGAN [47], to reconstruct the low-resolution dataset. To make this pretrained model better suited to our agricultural remote sensing dataset, we introduce an auxiliary training module based on the visual geometry group (VGG) network [3] for feature extraction and further fine-tuning. The training of the SR model is divided into two stages.
In the first stage, given the raw set $S_{\text{Raw}}$ and its low-resolution counterpart $S_{LR}$, the SR model $\mathcal{F}_{\theta}$ produces a first-stage reconstruction
\begin{equation*}
S_{SR^{1}} = \mathcal {F_{\theta }}\left(S_{LR}\right) . \tag{1}
\end{equation*}
The auxiliary VGG module then measures the feature-space distance between the reconstruction and the raw images, which is used to fine-tune the pretrained SR model:
\begin{equation*}
L_{\text{aux}} = \left\Vert VGG\left(S_{SR^{1}} \right) - VGG\left(S_{\text{Raw}} \right) \right\Vert _{2} . \tag{2}
\end{equation*}
In the second stage, the fine-tuned model is applied to the raw set itself to obtain the reconstructed high-resolution dataset
\begin{equation*}
S_{SR^{2}} = \mathcal {F_{\theta }}\left(S_{\text{Raw}}\right) . \tag{3}
\end{equation*}
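For illustration, the auxiliary loss in (2) could be realized with a frozen torchvision VGG-19 feature extractor; this is a minimal sketch, and the truncation depth (layer_idx), the choice of VGG-19 over other VGG variants, and the input normalization are assumptions made here rather than details given in this article.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

class VGGFeatureLoss(nn.Module):
    """Auxiliary loss of Eq. (2): L2 distance between VGG features of the SR output
    and the corresponding raw images. The VGG extractor stays frozen."""
    def __init__(self, layer_idx=35):  # truncation point is an illustrative choice
        super().__init__()
        extractor = vgg19(pretrained=True).features[:layer_idx]
        for p in extractor.parameters():
            p.requires_grad = False
        self.extractor = extractor.eval()

    def forward(self, sr_images, raw_images):
        # Both inputs: (N, 3, H, W) tensors preprocessed as the VGG backbone expects.
        return torch.norm(self.extractor(sr_images) - self.extractor(raw_images), p=2)
```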
Fig. 1. Architecture of the proposed joint network of SR and AL. The whole architecture consists of two parts. (a) Pretrained SR framework, which is the main architecture of RFB-ESRGAN. (b) AL framework, which introduces a dual-branch AL selection strategy and an SSAS.
After one RFB layer used for feature fusion, upsampling operations of nearest-neighbor interpolation and sub-pixel convolution are applied alternately, with one RFB layer and a PReLU activation layer following each upsampling. Nearest-neighbor interpolation enlarges the feature map spatially, while sub-pixel convolution further enhances detail by rearranging the feature channels. Finally, one convolution layer is attached. Through SR reconstruction, the original low-quality remote sensing images are transformed into a high-resolution dataset with richer detail and higher sampling value.
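As an illustration of this reconstruction tail, the PyTorch sketch below alternates nearest-neighbor upsampling with sub-pixel (pixel-shuffle) convolution. The RFB blocks are approximated here by plain 3 × 3 convolutions and the channel width is assumed, so this is a simplified stand-in for the actual RFB-ESRGAN layers; stacking further interpolation/sub-pixel pairs yields higher magnification ratios.

```python
import torch.nn as nn

class UpsamplingTail(nn.Module):
    """Sketch: alternate nearest-neighbor upsampling and sub-pixel convolution,
    each followed by a conv + PReLU stage; RFB blocks approximated by 3x3 convs."""
    def __init__(self, channels=64):
        super().__init__()
        self.stages = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),           # 2x by interpolation
            nn.Conv2d(channels, channels, 3, padding=1), nn.PReLU(),
            nn.Conv2d(channels, channels * 4, 3, padding=1),        # expand channels for shuffle
            nn.PixelShuffle(2),                                     # 2x by sub-pixel convolution
            nn.Conv2d(channels, channels, 3, padding=1), nn.PReLU(),
        )
        self.out_conv = nn.Conv2d(channels, 3, 3, padding=1)        # final convolution

    def forward(self, x):
        return self.out_conv(self.stages(x))
```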
B. DBSS
The uncertainty-based AL algorithm uses the feature distribution extracted by the network to form a probability distribution, computes the uncertainty (i.e., entropy) of this distribution, and uses it as the sample's score; a higher score indicates a more valuable sample. In this study, the layer immediately preceding the network's output layer is taken as the feature distribution for the uncertainty calculation. This layer is chosen over the output layer or a shallower layer because it retains more information about the image while being less abstract than the output layer, whereas a shallower layer can yield a chaotic distribution from which it is difficult to design an effective probability distribution. For these reasons, the layer preceding the output layer is chosen as the embedding layer.
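As a brief illustration of such uncertainty scoring, the sketch below derives class probabilities from the embedding through the classifier head and scores each sample by the entropy of that distribution; how the probability distribution is formed from the embedding is an assumption made for this example.

```python
import torch
import torch.nn.functional as F

def entropy_score(embeddings, classifier_head):
    """Entropy of the class distribution obtained from penultimate-layer embeddings.
    Higher entropy is treated as higher sample value."""
    probs = F.softmax(classifier_head(embeddings), dim=1)    # (N, C)
    return -(probs * torch.log(probs + 1e-12)).sum(dim=1)    # (N,) per-sample score
```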
As illustrated in Fig. 1, the primary structure of the AL network is a DBSS. This strategy is chosen because we develop AL algorithms with distinct selection criteria for each branch, as detailed in Sections III-B1 and III-B2. We applied t-SNE dimensionality reduction to the embedded features of the training set samples extracted by the network; the results are shown in Fig. 2. The extracted features are predominantly distributed in clusters. Consequently, the underlying idea of the AL algorithm proposed in this article is to select samples that lie as far as possible from the distribution of these training-set embeddings.
Fig. 2. Distribution obtained by dimensionality reduction of all embedding features using t-SNE. Notice that the features are distributed in clusters; in-class features are close, and different classes' features are distant.
1) Inner Branch
Given the labeled base set $B$, where $e(x_{i})$ denotes the embedding of sample $x_{i}$ and $n(i)$ is the number of labeled samples in class $i$, the center of class $i$ is computed as the mean embedding of its samples
\begin{equation*}
c(i)=\frac{1}{n(i) } \sum _{x_{i} \in B} e\left(x_{i}\right) \tag{4}
\end{equation*}
Once the centers are found, the interclass branch can score the samples in the unlabeled set. Since AL reselects samples from the unlabeled set in every cycle, each unlabeled sample $x_{i}$ is assigned a contribution based on its distances to all $C$ class centers
\begin{equation*}
D\left(x_{i}\right)=\sum _{l=1}^{C} \ln \left(\left\Vert e\left(x_{i}\right) -\operatorname{c}(l)\right\Vert _{2}\right) \tag{5}
\end{equation*}
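A minimal sketch of the inner branch follows, computing the class centers of (4) from labeled-set embeddings and the log-distance score of (5) for unlabeled samples; the tensor layout and the small numerical constant are assumptions for illustration.

```python
import torch

def class_centers(labeled_emb, labels, num_classes):
    """Eq. (4): mean embedding of the labeled samples of each class."""
    return torch.stack(
        [labeled_emb[labels == c].mean(dim=0) for c in range(num_classes)]
    )

def inner_branch_score(unlabeled_emb, centers):
    """Eq. (5): sum over all classes of the log Euclidean distance to each center.
    Samples far from every labeled cluster receive large scores."""
    dists = torch.cdist(unlabeled_emb, centers)     # (N, C) distances to all centers
    return torch.log(dists + 1e-12).sum(dim=1)      # (N,) score per unlabeled sample
```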
2) Border Branch
The interclass selection method described above effectively selects samples in the unlabeled set that are far from every class. However, it does not select samples lying between two classes, because such samples are close to both classes and their contribution computed by the method in Section III-B1 is small. It is therefore necessary to design a new method that selects samples at the decision boundary between any two classes.
After calculating the center of each class, for any two classes $m$ and $n$, the border contribution of an unlabeled sample $x_{i}$ is defined as
\begin{equation*}
\begin{aligned} D\left(x_{i}\right)=\min _{ m \ne n}\left\lbrace \left\Vert e_{x_{i}}-\operatorname{c}(m)\right\Vert _{2}-\left\Vert e_{x_{i}}-\operatorname{c}(n)\right\Vert _{2}\right\rbrace \end{aligned} .\tag{6}
\end{equation*}
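The border score of (6) can be written directly against the same distance matrix; the sketch below enumerates all ordered class pairs $m \ne n$ and is again an illustrative implementation.

```python
import torch

def border_branch_score(unlabeled_emb, centers):
    """Eq. (6): minimum over ordered pairs m != n of the difference between the
    distances from e(x_i) to the centers of classes m and n."""
    dists = torch.cdist(unlabeled_emb, centers)              # (N, C)
    diff = dists.unsqueeze(2) - dists.unsqueeze(1)           # (N, C, C): d_m - d_n
    mask = ~torch.eye(centers.size(0), dtype=torch.bool)     # drop the m == n entries
    return diff[:, mask].min(dim=1).values                   # (N,) score per sample
```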
3) Inner-Border Balance Selecting Strategy
To better integrate the two methods described in Sections III-B1 and III-B2, we design a fused contribution selection strategy that balances the contributions obtained from each. The proposed inner-border balanced selecting strategy can fuse the two in an arbitrary proportion.
Assuming that the fusion proportion is controlled by a balance hyperparameter, the final contribution of each unlabeled sample combines the inner-branch and border-branch scores in that proportion.
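As one possible realization of this balance, written here as an assumption for illustration rather than the exact rule used by DBSS, the two branch scores can be min-max normalized and combined convexly with a weight $\lambda$.

```python
def dbss_score(inner_score, border_score, lam=0.5):
    """Illustrative inner-border fusion: convex combination of normalized branch scores.
    lam = 1 recovers pure inner-branch selection, lam = 0 pure border-branch selection."""
    def minmax(s):
        return (s - s.min()) / (s.max() - s.min() + 1e-12)
    return lam * minmax(inner_score) + (1.0 - lam) * minmax(border_score)
```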
C. SSAS
The designed AL method helps choose informative samples, which reduces the need for labels and lowers the labeling cost. However, this approach may discard less informative samples and may not fully utilize the information in the unlabeled set. In addition, the model itself acquires some recognition ability during the learning process. Therefore, we design an SSAS to take advantage of the model's own prediction potential.
We believe that manually annotating less informative samples and adding them to the base set provides little benefit. However, precisely because such samples are less informative, the model can differentiate them more reliably, so its predictions on them are closer to the ground truth. We therefore process these predictions and use them as labels for the unlabeled samples in subsequent training.
Given an unlabeled sample $x_{i}$ and the model's predicted class probabilities $\hat{y}(x_{i})$, a one-hot pseudolabel is assigned according to
\begin{equation*}
y_{\text{{pseudo}}}(i)=\left\lbrace \begin{array}{ll}1, &\hat{y}_{\text{unlabeled}}(i)=\max \hat{y}\left(x_{i}\right) \\
0, &\text{else} \end{array}\right. .\tag{7}
\end{equation*}
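In code, (7) amounts to placing a one-hot pseudolabel on the arg-max class; a minimal sketch:

```python
import torch.nn.functional as F

def make_pseudolabels(logits):
    """Eq. (7): 1 for the class with the maximum predicted probability, 0 elsewhere."""
    probs = F.softmax(logits, dim=1)                                        # (N, C)
    return F.one_hot(probs.argmax(dim=1), num_classes=probs.size(1)).float()
```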
The architecture is shown in Fig. 3. The proposed SSAS appears able to annotate accurately. Still, the strategy is not fully robust, because it is difficult to guarantee that the labels provided by semisupervised learning are 100% accurate, and incorrect labels inevitably reduce the accuracy of a supervised deep learning model. Therefore, we propose a more robust solution called consistency-based semisupervised learning. First, after training on the labeled set, the model is saved and a certain number of high-information samples are selected for annotation; training then continues into the next cycle. After the next cycle is trained, we use the models obtained from the two cycles to predict the unlabeled samples separately and calculate the similarity of the two predictions using the following equation:
\begin{equation*}
\operatorname{Sim}(D(x^{t}), D(x^{t+1}))=\Vert D(x^{t})-D(x^{t+1})\Vert _{2} \tag{8}
\end{equation*}
\begin{equation*}
\mathcal {I}(D(x^{t}), D(x^{t+1}))\!=\!D(x^{t}) \!\cdot \operatorname{Sim}(D(x^{t}), D(x^{t\!+\!1})) \cdot D(x^{t\!+\!1}).\tag{9}
\end{equation*}
This concept is based on the following assumption: if the models before and after a cycle make different predictions on a sample, then that sample is still considered highly informative, even if one of the models is highly confident in its prediction. However, if the two predictions are consistent, the sample is considered correctly predicted and can be annotated and moved to the labeled set. In subsequent experiments, we observe that the consistency-based semisupervised learning strategy has a greater impact than simple semisupervised learning.
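A minimal sketch of this consistency check follows: unlabeled samples whose predictions from two consecutive cycles agree, in the sense of the distance in (8), are kept as pseudolabeled, while the rest remain unlabeled. The agreement threshold below is an assumption for illustration, not a value taken from this article.

```python
import torch

def consistent_mask(pred_t, pred_t1, threshold=0.1):
    """Per-sample L2 distance between the two cycles' predictions, cf. Eq. (8);
    samples below the (assumed) threshold are treated as consistently predicted
    and may be moved to the pseudolabeled set."""
    dist = torch.norm(pred_t - pred_t1, p=2, dim=1)   # (N,) disagreement per sample
    return dist < threshold                           # boolean mask over unlabeled set
```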
Results
A. Experimental Setup
1) Experimental Environment
To evaluate the effectiveness and robustness of the joint network, we conducted several experiments, including multiple rounds of parameter tuning and dual-branch ablation experiments. These experiments were carried out on our experimental platform, which is equipped with an Intel i7-12700K CPU and two NVIDIA RTX 3090 Ti graphics cards. The platform runs the Ubuntu 20.04 operating system, and we use PyTorch 1.9 as the deep learning framework.
2) Dataset
In order to construct a small dataset for agricultural land classification, this article extracts a subset from the shared database in [48]. The original dataset contains 31 500 images belonging to 45 classes. However, it suffers from intraclass variability and interclass data imbalance, which wastes computational resources. To simplify the dataset and meet the demands of agricultural tasks, we select 9 common agricultural land-related classes (1800 samples in total) to form a new balanced dataset, called NR09. These classes are farmland, meadow, residential, forest, desert, hills, roads, lake, and river. Each class consists of 200 color images with RGB channels and a size of 256 × 256 pixels.
Fig. 4. NR09 dataset. The dataset contains 9 common agricultural land classes, including farmland, meadow, residential, forest, desert, hills, roads, lake, and river. Each class consists of 200 images of 256 × 256 pixels.
B. Performance of the Joint Network
We co-train the joint network of SR and AL, using the sample selection results of the AL network to fine-tune the SR model. After SR reconstruction, the original low-resolution, blurred remote sensing images are super-resolved by a factor of 8. The result is shown in Fig. 5. After reconstruction, the texture features lost in the images are restored, which facilitates feature extraction by the subsequent embedding layer for the agricultural images.
The results of sample-addition experiments using core-set [32], learning loss [18], distance entropy [49], and our proposed method are shown in Table I. The comparative experiment starts with the same 10% of samples and selects a further 10% in every cycle until the labeled set reaches 90% of the samples. The best results are highlighted in bold and the second best are underlined. The experimental results demonstrate that introducing the SR network improves model performance and that our proposed joint network has a strong sample selection capability. It should be noted that we also compared the results of each single branch for validation and found that selecting with both branches is more efficient than either branch alone, which shows the effectiveness of the proposed dual-branch method. We added a set of hyperparameter experiments for this purpose. We took different hyperparameters
C. Experiments With Different SR Strategies
In order to further verify the role of SR models in improving the accuracy of remote sensing agricultural classification, we compare the performance of the joint network embedded with different SR models and at different SR ratios. The size of the original remote sensing images before reconstruction is 256 × 256. We set magnification factors of 2×, 4×, 6×, 8×, and 10×, and select bicubic interpolation [40], SRCNN [42], DRRN [45], ESRGAN [46], and RFB-ESRGAN [47] as the SR methods for comparison. The results are shown in Fig. 6.
Fig. 6. Performance of different SR models at different SR ratios. The prediction accuracy of the joint networks embedded with different SR models shows an increasing trend as the SR ratio increases.
It is obvious that as the SR ratio increases, the prediction accuracy of the joint network shows an overall upward trend, and the improvement gradually slows once the SR ratio exceeds 6. On the other hand, as the SR ratio increases, the SR model's parameters and computations grow, and the model becomes harder to converge, requiring a larger amount of sample data for optimization. Hence, we set the SR ratio to 8 to balance performance and cost.
D. Comparison Experiments of SSAS
In this study, we carry out a comparison experiment on the semisupervised learning strategy proposed in this article. The control group uses an AL algorithm with 10% of the data as the base set; in each cycle, 10% of the samples from the unlabeled set are selected and added to the base set, and this process is repeated for a total of 10 cycles. The AL algorithm using the SSAS, in contrast, selects 10% of the samples from the unlabeled set and moves them into the base set in each cycle, for a total of 4 cycles. The results of the experiment are presented in Fig. 7.
Fig. 7. Comparative experiment verifying the effect of SSAS. The group using this strategy achieves close to the test accuracy of training with full annotation but with less (40%) annotation cost.
We can see that until the sample size approaches 100%, the performance obtained with the SSAS parallels that of training without the AL strategy and is only slightly lower at some stages. By comparison, the accuracy obtained with the consistency-based semisupervised learning strategy is higher than that of ordinary semisupervised learning (pseudolabeling) and closer to that of the ordinary AL algorithm, while the time consumed is only 40% of that of the ordinary AL, which is enough to demonstrate the practicality of our method.
The results above show that with the assistance of the proposed SSAS, our proposed method can improve the performance of the labeled dataset as much as possible while further reducing human annotation. The semisupervised ratio in Fig. 7 is 1 (5% for human annotation and 5% for SSAS annotation). To explore the impact of using different semisupervised ratios in the proposed methods, we add experiments of 0.5 (2.5%), 1.5 (7.5%), 2 (10%), and 2.5 (12.5%).
The results are shown in Fig. 8. By analyzing them, we can draw two conclusions. 1) The model's prediction capability is sufficient to accurately label low-informative samples, and when combined with AL, it can efficiently annotate unlabeled samples. 2) The SSAS needs an appropriate ratio to maximize its effectiveness: if the ratio is too low, the model's predictions cannot be effectively leveraged; conversely, if the ratio is too high, the model's limited accuracy on certain samples leads to performance degradation.
Fig. 8. Comparative experiment verifying the effect of different semisupervised ratios on the same AL backbone method.
Conclusion
Remote sensing datasets face the problems of low resolution and high labeling cost, which make it difficult for deep learning to generalize in agricultural image processing under these limitations. This article presents a joint network of an SR framework and a semisupervised, explanation-friendly AL framework. The introduction of a pretrained SR model facilitates the reconstruction of important information, such as texture details, in low-resolution datasets. At the same time, the DBSS guarantees the selection and annotation of high-value samples. Together, these enable highly accurate classification of agricultural remote sensing datasets.
Moreover, the traditional AL framework relies only on the model's representation of the unlabeled samples, selecting for labeling those samples that are difficult to identify or have a great impact on the model. This ignores the gain from samples on which the model is highly confident, whose use requires no extra manpower. SSAS can select such high-confidence samples and let them participate in training with pseudolabels, which reduces the cost of manual labeling. The results also demonstrate the effectiveness of the semisupervised strategy. In the future, we will optimize the mechanism of the joint network and explore the use of large models in remote sensing. At the same time, we will extend our work to image classification in multiple remote sensing scenarios, such as urban land planning and maritime targets.
ACKNOWLEDGMENT
The authors especially acknowledge the Artificial Intelligence and Marine Information Processing Lab (AIMIP) at Tianjin University, Tianjin, China, and the Automated Systems and Soft Computing Lab (ASSCL) at Prince Sultan University, Riyadh, Saudi Arabia. The authors wish to thank the editor and reviewers for their insightful comments, which have improved the quality of this publication.