Introduction
In recent years, remote sensing technology has made remarkable advancements, providing a plethora of high-resolution remote sensing imagery. These images contain a wealth of diverse geospatial information. Among the many targets of interest, aircraft hold significant value for both military and civilian applications, and their accurate recognition is crucial in various fields, including military reconnaissance. With the continuous improvement in the resolution of remote sensing imagery, fine-grained recognition of aircraft has become feasible, necessitating not only the detection of aircraft in imagery but also the capability to differentiate between various models.
However, fine-grained recognition of aircraft in current remote sensing imagery still faces a series of formidable challenges [1]. First, the backgrounds in high-resolution remote sensing imagery are often extremely complex. The presence of similar geospatial features greatly increases the difficulty of accurately recognizing aircraft, leading to a high risk of misrecognition. Second, different models of aircraft may appear strikingly similar in appearance, making fine-grained differentiation based on traditional feature extraction methods challenging. These conventional methods may fail to capture subtle yet critical differences in features, complicating the task of distinguishing between various aircraft models. Furthermore, factors such as lighting and viewing angles can cause significant variations in the appearance of the same aircraft across different images. Changes in lighting can alter the color and shadows of the aircraft, while different viewing angles can affect its shape and silhouette, further increasing the complexity of aircraft recognition.
In recent years, a multitude of scholars both domestically and internationally have conducted extensive and in-depth research in the field of remote sensing image target recognition [2]. Traditional methods, such as template matching, feature extraction, and classifier-based approaches, have achieved a certain level of success in recognizing aircraft. However, these conventional techniques exhibit significant limitations when it comes to fine-grained recognition. With the burgeoning rise of deep learning technology, convolutional neural networks and similar methodologies have been widely applied to the recognition of targets in remote sensing imagery, yielding notably effective results [3].
Currently, target detection algorithms based on convolutional neural networks are categorized into two types. The first type includes one-stage detection algorithms represented by SSD [4], RetinaNet [5], and the YOLO series [6], [7], [8], [9], [10], [11], [12], [13], which can directly extract feature information from the network to predict the category and location of targets, offering advantages such as rapid detection speed and minimal resource consumption. The second type comprises two-stage detection algorithms like R-CNN [14], Fast R-CNN [15], Faster R-CNN [16], and Mask R-CNN [17], which generate candidate bounding boxes before classifying targets, boasting higher detection accuracy.
Considering that future on-orbit aircraft target recognition will require real-time performance, scholars are primarily focusing on research based on one-stage detection algorithms. Huang et al. [18] proposed a cross-scale feature fusion pyramid network to address challenges such as significant size variations and complex backgrounds in remote sensing images. However, for scenarios with high intraclass similarity and low image quality, detection accuracy still needs improvement. Lu et al. [19] achieved better high-level semantic features and multiscale representation capabilities by constructing a feature pyramid structure and employing the CSWin Transformer as the backbone. Yi et al. [20] enhanced the detection capability for small targets by designing a bidirectional feature pyramid network with a dual-branch attention mechanism and transformer modules to expand the local module. Zou et al. [21] introduced a new paradigm for target detection in high-resolution aerial remote sensing images, random access memories (RAMs). This method, by learning the feature representation of targets, can rapidly and accurately detect targets in images. Qian et al. [22] designed an aircraft inference network, which includes a knowledge inference module, spatial context module, airport facility relationship module, and aircraft component recognition module. By leveraging prior knowledge for inference and optimizing target locations, it can accurately detect and recognize fine-grained aircraft. However, for fine-grained recognition of aircraft targets in remote sensing imagery, there is still room for improvement, particularly in dealing with complex backgrounds and overcoming variations in lighting and viewing angles. Cheng et al. [23] proposed the SFRNet, a novel dual-branch transformer architecture that enhances fine-grained classification and oriented localization by capturing long-range spatial interactions and key channel correlations with a spatial and channel transformer, and further improving class separability through a multi-RoI loss. Ouyang et al. [24] developed a detection algorithm combining prototypical contrastive learning with class-balanced sampling for fine-grained object detection in remote sensing imagery, leading to significant performance enhancements through optimized feature representation and sampling strategies. Hu et al. [25] introduced an architecture featuring global-local self-attention for detailed semantic segmentation, which leverages global atrous self-attention and local window self-attention to consider global and local contexts simultaneously, thereby enhancing feature representation and segmentation precision. Zhang et al. [26] presented an efficient inductive vision transformer framework that minimizes computational expenses via an adaptive multigranular routing mechanism, bolsters inductive bias with a dual-path encoding architecture, and encodes directional knowledge using an angle tokenization technique, consequently elevating the efficiency and accuracy of oriented object detection.
The field of remote sensing has witnessed swift advancements in target detection. Nonetheless, the task is impeded by the substantial intraclass variations and minimal interclass differences in fine-grained recognition, coupled with the overhead perspective of remote sensing imagery, which diminishes the discernible regions for identification, making accurate recognition a formidable challenge. While some researchers have focused on optimizing context modeling, this approach exhibits constraints when resolving detailed aspects of targets, potentially overlooking crucial local details within the context and thus compromising the inference for challenging targets. In addition, other investigators have incorporated the Transformer architecture to capitalize on its proficiency in capturing long-range dependencies, thereby enhancing fine-grained recognition. Yet, this approach incurs significant computational expenses and struggles with modeling the local characteristics and scale invariance of objects, necessitating high-caliber data. Moreover, in contexts demanding swift processing, the computational demands render this approach impractical.
This article has developed more effective feature extraction methods, enabling the model to accurately capture the unique features of aircraft and distinguish different models among numerous similar features. In addition, this article explores how to overcome the impact of complex backgrounds and factors such as lighting and viewing angles, enhancing the accuracy and stability of aircraft recognition and strengthening the model's robustness against various environmental factors.
In response to the current issues, this article proposes an improved algorithm based on YOLOv8, called FD-YOLOv8, aimed at significantly enhancing the recognition accuracy of aircraft targets in remote sensing imagery through innovative network structures and feature extraction mechanisms. The contributions of this article are primarily in the following areas.
To address the issue of information loss in shallow networks, this article has designed a novel local detail feature module (LDFM), which enhances the capture of fine-grained information while extracting shallow features, thereby providing the network with a richer feature representation.
Considering the importance of high-level semantic features in fine-grained recognition, this article has introduced a focal modulation mechanism (FMM) to improve the network's attention to local features and its understanding of global features, combining the LDFM and multitype feature fusion (MTFF) to enhance the model's recognition capabilities.
To optimize the recognition accuracy of small and challenging targets, this article has designed an MTFF, which integrates local features, high-level semantic information, and low-level texture information to generate more optimized feature maps, thereby enhancing the accuracy of fine-grained target recognition.
The rest of this article is organized as follows. Section II introduces YOLOv8 and FD-YOLOv8, detailing the design principles of the optimized modules. Section III describes the data analysis of the dataset, experimental setup, and the evaluation criteria for the model. Section IV conducts ablation experiments and provides a visual analysis of the results. Finally, Section V concludes this article.
Methodology
A. Fundamentals of the YOLOv8 Model
YOLOv8 represents a contemporary one-stage algorithm that integrates state-of-the-art techniques [27], employing advanced backbone networks and feature fusion strategies (as shown in Fig. 1), thereby demonstrating exceptional detection capabilities. The YOLOv8 algorithm is primarily composed of four components: input, backbone, neck, and head.
The loss function in YOLOv8 encompasses both classification loss and bounding box regression loss [28], [29]. The classification loss is based on the binary cross-entropy (BCE) loss, as shown in (1)
\begin{equation*} L_{\text{BCE}} = \frac{1}{N} \sum _{i} L_{i} = -\frac{1}{N} \sum _{i} \sum _{c=0}^{M-1} y_{ic} \log (p_{ic}) \tag{1} \end{equation*}
The bounding box regression loss comprises the complete intersection over union (CIoU) loss and the distribution focal loss (DFL), as shown in (2) and (3), respectively
\begin{align*} L_{\text{CIoU}} &= 1 - \text{IoU} + \left(\frac{\rho ^{2}(b, b^{gt})}{\epsilon ^{2}} \right) + \alpha v \\ \alpha &= \frac{v}{1 - \text{IoU} + v} \\ v &= \frac{4}{\pi ^{2}} \left(\arctan \frac{w^{gt}}{h^{gt}} - \arctan \frac{w^{\text{pred}}}{h^{\text{pred}}} \right)^{2} \tag{2} \end{align*}
\begin{equation*} \text{DFL}(S_{i},S_{i+1})=-\big((y_{i+1}-y)\log (S_{i}) +(y-y_{i})\log (S_{i+1})\big) \tag{3} \end{equation*}
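For concreteness, the following PyTorch sketch implements the two regression terms of (2) and (3). It assumes axis-aligned boxes in (x1, y1, x2, y2) form, treats the denominator written as ε² in (2) as the squared diagonal of the smallest enclosing box (as in the standard CIoU formulation), and uses illustrative function names rather than the actual YOLOv8 code.

```python
import math
import torch
import torch.nn.functional as F


def ciou_loss(pred, target, eps=1e-7):
    """CIoU loss of (2) for boxes given as (x1, y1, x2, y2) tensors of shape (N, 4)."""
    # Intersection and union
    ix1, iy1 = torch.max(pred[:, 0], target[:, 0]), torch.max(pred[:, 1], target[:, 1])
    ix2, iy2 = torch.min(pred[:, 2], target[:, 2]), torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Squared center distance and squared diagonal of the smallest enclosing box
    cxp, cyp = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cxt, cyt = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    rho2 = (cxp - cxt) ** 2 + (cyp - cyt) ** 2
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    diag2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term v and its trade-off weight alpha
    wp, hp = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wt, ht = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(wt / (ht + eps)) - torch.atan(wp / (hp + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return (1 - iou + rho2 / diag2 + alpha * v).mean()


def dfl_loss(pred_dist, target):
    """DFL of (3): pred_dist holds (N, reg_max+1) logits; target is a continuous
    coordinate in [0, reg_max] that is split between its two neighboring bins."""
    left = target.long()                       # bin index i
    right = (left + 1).clamp(max=pred_dist.shape[1] - 1)
    w_left = (left + 1).float() - target       # weight y_{i+1} - y
    w_right = 1 - w_left                       # weight y - y_i
    loss = (F.cross_entropy(pred_dist, left, reduction="none") * w_left
            + F.cross_entropy(pred_dist, right, reduction="none") * w_right)
    return loss.mean()


# Illustrative call with a single box pair
pred = torch.tensor([[10.0, 10.0, 50.0, 60.0]])
gt = torch.tensor([[12.0, 8.0, 48.0, 62.0]])
print(ciou_loss(pred, gt))
```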
B. Improved Algorithm
In the context of fine-grained recognition in remote sensing imagery, there is often minimal interclass variation among different models of targets, with distinctions primarily lying in the differences of local components. In addition, due to factors such as climate, season, lighting, occlusions, and atmospheric scattering encountered during the acquisition of remote sensing imagery, the same model of target can exhibit significant visual differences, leading to high intraclass variability. Consequently, the amount of detail information preserved for a target is positively correlated with the model's ability to discriminate it. However, the accurate identification of small and challenging targets is often constrained by the incompleteness of information. Traditional strategies involve increasing the size of the sample dataset; yet these approaches face numerous challenges in practical application, such as difficulties in obtaining samples due to varying climatic conditions and differences in aircraft models. Moreover, even in existing object detection models like YOLOv8, which adapt to targets of different sizes by adding detection heads, the improvement in recognition performance for small and challenging samples is limited. This limitation arises because key information from the original image is lost during multiscale transformations of the feature maps, thereby reducing the detection head's ability to accurately judge the category and location of targets.
To address these issues, this article introduces an improved algorithm based on YOLOv8, called FD-YOLOv8 (as shown in Fig. 2). Compared to the YOLOv8, this article proposes an LDFM that enhances the localization accuracy of targets by preserving more fine-grained information, thereby strengthening the model's ability to discriminate challenging samples. The introduction of an FMM enhances the model's capacity to capture long-range dependencies and contextual information within images, effectively replacing the spatial pyramid pooling-fast (SPPF) module. The FMM, designed with an attention mechanism, enables the network to focus on key areas of the image containing targets, thereby improving the model's recognition capabilities. Concurrently, this article has developed an innovative MTFF module, which optimizes the generation of feature maps by integrating local features, high-level semantic information, and low-level texture information, thereby enhancing the accuracy of fine-grained target recognition.
1) Local Detail Feature Module
This module employs a dynamic multiscale residual module (DMRM) to replace the initial down-sampling convolutional operation in the shallow layers of the network, and concurrently utilizes an adaptive weight allocation mechanism (AWAM) in place of the second down-sampling convolutional operation (as shown in Fig. 3).
Within the DMRM, a residual mechanism is adopted to preserve the original information of the input image. By applying 1×1 convolutional techniques, data compactness is enhanced, and information concentration is improved. In addition, recursive convolutions are constructed using convolutional kernels of varying sizes (7×7, 5×5, 3×3) [30], which not only expand the receptive field of the model but also ensure sensitive capture of image details through the reduction in kernel size, as shown in (4). This design enriches the semantic content for target recognition at the early stages of the network, significantly enhancing the model's accuracy in target localization and effectively preserving subtle features of the targets. The merging of feature maps is accomplished through a convolutional process with a specific stride, which not only reduces the spatial dimensions of the feature maps but also minimizes information loss, crucial for the precise identification of complex targets.
\begin{equation*} F_{\text{DMRM}} = \text{CBS}_{3\times 3}\big(1 + \text{CBS}_{1\times 1} + \text{CBS}_{7\times 7}(1 +\text{CBS}_{5\times 5}(1+\text{CBS}_{3\times 3}))\big)F_{\text{in}} \tag{4} \end{equation*}
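A minimal PyTorch sketch of the DMRM described by (4) is given below. The CBS block (Conv-BN-SiLU), the channel widths, and the use of a strided 3×3 CBS as the final merging operation are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class CBS(nn.Module):
    """Conv + BatchNorm + SiLU block, as used throughout the YOLOv8 backbone."""

    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))


class DMRM(nn.Module):
    """Dynamic multiscale residual module of (4): nested residual branches with
    3x3, 5x5, and 7x7 kernels, a 1x1 compression branch, and a strided CBS merge
    that performs the downsampling. Channel choices here are illustrative."""

    def __init__(self, c_in, c_out):
        super().__init__()
        self.compress = CBS(c_in, c_in, k=1)
        self.conv3_inner = CBS(c_in, c_in, k=3)
        self.conv5 = CBS(c_in, c_in, k=5)
        self.conv7 = CBS(c_in, c_in, k=7)
        self.merge = CBS(c_in, c_out, k=3, s=2)        # strided merge halves the resolution

    def forward(self, x):
        inner = x + self.conv3_inner(x)                # (1 + CBS_3x3) x
        mid = x + self.conv5(inner)                    # (1 + CBS_5x5(...)) x
        out = x + self.compress(x) + self.conv7(mid)   # (1 + CBS_1x1 + CBS_7x7(...)) x
        return self.merge(out)


x = torch.randn(1, 3, 640, 640)
print(DMRM(3, 16)(x).shape)  # torch.Size([1, 16, 320, 320])
```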
In the investigation of the AWAM, this study initially employs a dimensionality reduction technique, transforming the spatial dimensions of the image into channel dimensions. This process ensures the preservation of the integrity of the original information. Following the dimensionality transformation, a 3×3 convolutional kernel can cover the input of the original 6×6 feature point area, thereby expanding the effective receptive field of the feature extraction operation without compromising spatial resolution [31].
Subsequently, the max pooling layer effectively highlights regions with higher feature salience by selecting the maximum value within the pooling window, aiding the model in capturing key information within the image. Concurrently, the average pooling layer provides a more balanced feature extraction mechanism by calculating the average of all pixel values within the pooling window, facilitating the network's ability to capture additional detail and enhancing sensitivity to subtle feature variations.
Ultimately, the integration of feature maps is achieved through a 1×1 convolutional layer, realizing stride-free downsampling, as shown in (5). This technique not only facilitates the interaction of information across channels but also enables the network to attend to local feature changes while preserving the integrity of global features, thereby improving the efficiency and accuracy of feature extraction
\begin{equation*} F_{\text{AWAM}} = \text{CBS}_{1 \times 1}\big(\text{CBS}_{3 \times 3} + \text{MaxPool} + \text{AvgPool}\big)\text{SPDConv}(F_{\text{DMRM}}) \tag{5} \end{equation*}
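The AWAM of (5) can likewise be sketched as follows. The space-to-depth step is realized here with PixelUnshuffle, and the 3×3 pooling windows and channel widths are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class AWAM(nn.Module):
    """Adaptive weight allocation mechanism of (5): space-to-depth rearrangement
    followed by parallel 3x3 conv, max pooling, and average pooling branches that
    are summed and fused by a 1x1 conv. Pool sizes and channel widths are assumptions."""

    def __init__(self, c_in, c_out):
        super().__init__()
        # Space-to-depth: each 2x2 spatial block becomes 4x the channels (no information loss)
        self.spd = nn.PixelUnshuffle(downscale_factor=2)
        c_spd = c_in * 4
        self.conv3 = nn.Sequential(
            nn.Conv2d(c_spd, c_spd, 3, 1, 1, bias=False),
            nn.BatchNorm2d(c_spd), nn.SiLU())
        self.maxpool = nn.MaxPool2d(3, stride=1, padding=1)   # highlights salient responses
        self.avgpool = nn.AvgPool2d(3, stride=1, padding=1)   # keeps smoother detail
        self.fuse = nn.Sequential(
            nn.Conv2d(c_spd, c_out, 1, bias=False),
            nn.BatchNorm2d(c_out), nn.SiLU())

    def forward(self, x):
        y = self.spd(x)                                        # H, W halved; channels x4
        return self.fuse(self.conv3(y) + self.maxpool(y) + self.avgpool(y))


x = torch.randn(1, 16, 320, 320)
print(AWAM(16, 32)(x).shape)  # torch.Size([1, 32, 160, 160])
```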
2) Focal Modulation Mechanism
In contrast to self-attention mechanisms [32] that require complex query-key interactions and query-value aggregations for each query token with other tokens to compute attention scores and capture context, FMM [33] first aggregates spatial context at various granularities into modulators and then injects these modulators into the query tokens in a query-dependent manner (as shown in Fig. 4). FMM simplifies the interactions and aggregations, making the process more lightweight. It reduces the network's demand for computational resources and enhances the network's focus on local features and its judgment of global features. The FMM is divided into three key components.
Focal contextualization begins by projecting the input feature into a new linear layer feature space, as shown in (6). Utilizing a series of deep convolutional layers, it encodes contextual information within various receptive fields. These layers adeptly capture visual features and their interdependencies across different scales within the image, from local to global. This enables the network to analyze image features across different receptive fields effectively. By incorporating corner-based contextualization, the network maintains sensitivity to local features while simultaneously enhancing its assessment of global features during the aggregation of contextual information
\begin{align*} [Q, \text{CTX}, \text{Gate}] &= \text{split}\big(\text{Conv}_{1\times 1}(\text{Input}), [C_{\text{in}}, C_{\text{in}}, \text{focal}_{\text{level}}+1], \dim =1\big) \tag{6}\\ \text{CTX}_{l} &= \text{GELU}\big(\text{DWConv}_{k_{l} \times k_{l}}(\text{CTX}_{l-1})\big) \tag{7}\\ \text{CTX}_{\text{global}} &= \text{GELU}\big(\text{Mean}(\text{CTX}, \text{dim}=(2,3))\big) \tag{8} \end{align*}
Gated aggregation enhances the network's capacity to focus on the most salient information for the task at hand. By integrating context information from various scales, modulated through gating mechanisms, and overlaying it at the feature map level, the network adeptly filters out irrelevant data [34]. This strategy not only augments the network's processing efficiency and overall performance but also sharpens its focus on critical features. In addition, this approach ensures that the network can effectively encode local fine-grained features for objects and global coarse-grained features for the background, thereby improving its discriminative power in target judgment
\begin{align*} \text{CTX}_{\text{all}} &= \sum _{l=0}^{\text{focal}_{\text{level}}-1}\text{CTX}_{l} \cdot \text{Gate}[:,l:l+1] \\
&\quad + \text{CTX}_{\text{global}} \cdot \text{Gate}[:,\text{focal}_{\text{level}}:\text{focal}_{\text{level}}+1] \tag{9} \end{align*}
Considering that the current feature maps already emphasize spatial hierarchies, the element-wise affine transformation initially deploys an affine layer to facilitate interchannel communication, resulting in modulators. These modulators, derived from the gated aggregation mechanism, undergo pixel-wise operations that perform weighted and nonlinear transformations before being integrated into each query token. Subsequently, LayerNorm [35] is applied to shift the network's focus toward the relative relationships among features, rather than their absolute values. This approach enriches the feature representation of individual query tokens, thereby improving the model's ability to detect subtle nuances. Consequently, the augmented feature information within the network's intermediate layers bolsters the model's ability to differentiate between challenging samples
\begin{align*} \text{Output} = \text{LayerNorm}(Q \odot \text{Conv}_{1 \times 1}(\text{CTX}_{\text{all}})). \tag{10} \end{align*}
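The following sketch summarizes (6)-(10) for convolutional feature maps. The kernel sizes k_l, the number of focal levels, and the use of GroupNorm as a stand-in for the LayerNorm in (10) are assumptions for illustration rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn


class FMM(nn.Module):
    """Focal modulation sketch following (6)-(10), operating on (B, C, H, W) maps."""

    def __init__(self, c, focal_level=3, kernels=(3, 5, 7)):
        super().__init__()
        self.focal_level = focal_level
        # (6): one 1x1 conv produces query, context, and focal_level+1 gate maps
        self.proj_in = nn.Conv2d(c, 2 * c + focal_level + 1, 1)
        # (7): hierarchy of depthwise convolutions with growing receptive fields
        self.focal_layers = nn.ModuleList([
            nn.Sequential(nn.Conv2d(c, c, k, padding=k // 2, groups=c, bias=False), nn.GELU())
            for k in kernels])
        self.h = nn.Conv2d(c, c, 1)        # modulator projection used in (10)
        self.norm = nn.GroupNorm(1, c)     # normalization over (C, H, W), standing in for LayerNorm
        self.act = nn.GELU()

    def forward(self, x):
        q, ctx, gates = torch.split(
            self.proj_in(x), [x.shape[1], x.shape[1], self.focal_level + 1], dim=1)
        ctx_all = 0
        for l, layer in enumerate(self.focal_layers):            # (7) and (9), level by level
            ctx = layer(ctx)
            ctx_all = ctx_all + ctx * gates[:, l:l + 1]
        ctx_global = self.act(ctx.mean(dim=(2, 3), keepdim=True))  # (8)
        ctx_all = ctx_all + ctx_global * gates[:, self.focal_level:]  # global term of (9)
        return self.norm(q * self.h(ctx_all))                     # (10)


x = torch.randn(1, 256, 20, 20)
print(FMM(256)(x).shape)  # torch.Size([1, 256, 20, 20])
```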
3) Multitype Feature Fusion
This module is dedicated to the integration of local features, global high-level semantic information, and low-level texture information [36]. The MTFF models local features through convolutional neural networks, leveraging their advantages in locality and receptive field, while also incorporating self-attention mechanisms to capture high-dimensional semantic information and thus facilitate accurate target recognition [37]. Furthermore, MTFF employs feature fusion techniques to learn the weight distribution between low-level texture information and high-level semantic information, optimizing the model's attention allocation to key features within channels, thereby enhancing the quality of the final feature map generation (as shown in Fig. 5).
In the local branch, to enhance interchannel interaction and promote the integration of local information, a 1×1 convolution is first used to adjust the channel dimensions, followed by channel shuffling to further enhance information interaction. Channel shuffling groups the input feature maps along the channel dimension and employs depthwise separable convolution within each group for shuffling, which, while reducing computational load, allows for sufficient integration of long-range dependencies [38]. The outputs from each group are then concatenated along the channel dimension, and a 3×3 convolution is applied to complete the integration of local information, as shown in (11)
\begin{equation*} F_{\text{conv}} = W_{3 \times 3 \times 3}(\text{CS}(W_{1 \times 1}(\text{LF} \oplus \text{HF}))) \tag{11} \end{equation*}
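A sketch of the local branch in (11) is shown below. The group count, the depthwise 3×3 applied after channel shuffling, and the channel widths are illustrative assumptions consistent with the description above.

```python
import torch
import torch.nn as nn


def channel_shuffle(x, groups):
    """Rearrange channels so that information flows across the groups."""
    b, c, h, w = x.shape
    return x.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)


class LocalBranch(nn.Module):
    """Local branch of the MTFF, a sketch of (11): concatenate low- and high-level
    features, mix channels with a 1x1 conv, shuffle, apply a depthwise 3x3, then
    integrate with a final 3x3 conv."""

    def __init__(self, c_low, c_high, c_out, groups=4):
        super().__init__()
        self.groups = groups
        self.pw = nn.Conv2d(c_low + c_high, c_out, 1, bias=False)        # W_1x1
        self.dw = nn.Conv2d(c_out, c_out, 3, padding=1, groups=c_out, bias=False)
        self.conv3 = nn.Conv2d(c_out, c_out, 3, padding=1, bias=False)   # final 3x3 integration
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, lf, hf):
        x = self.pw(torch.cat([lf, hf], dim=1))       # LF concatenated with HF, then 1x1
        x = self.dw(channel_shuffle(x, self.groups))  # shuffle + depthwise mixing
        return self.act(self.bn(self.conv3(x)))


lf, hf = torch.randn(1, 64, 40, 40), torch.randn(1, 128, 40, 40)
print(LocalBranch(64, 128, 128)(lf, hf).shape)  # torch.Size([1, 128, 40, 40])
```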
In the global branch, we initially generate the Query (Q), Key (K), and Value (V) tensors using 1×1 and 3×3 convolutions. Subsequently, we adjust the shape of the query and key tensors to alleviate the computational load of the attention map. Attention scores are derived from the dot product, followed by normalization using the softmax function to obtain the attention weights for global features. These attention weights, in conjunction with the value tensor (V), are then used to compute the global features [39]. Finally, the dimensions of the feature maps are adjusted using 1×1 convolutions. The operations of the global branch are as follows:
\begin{equation*} F_{\text{att}} = W_{1 \times 1}(\text{softmax}(\frac{\text{QK}^{T}}{\alpha })V) \tag{12} \end{equation*}
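The global branch of (12) can be sketched as channel-wise (transposed) attention, which keeps the attention map at size C×C in line with the shape adjustment described above. The learnable scale α and the depthwise 3×3 in the QKV projection are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalBranch(nn.Module):
    """Global branch of the MTFF, a sketch of (12): softmax(Q K^T / alpha) V with
    Q, K, V reshaped to (B, C, HW), so attention is computed across channels."""

    def __init__(self, c):
        super().__init__()
        self.qkv = nn.Sequential(
            nn.Conv2d(c, 3 * c, 1, bias=False),                               # 1x1 projection
            nn.Conv2d(3 * c, 3 * c, 3, padding=1, groups=3 * c, bias=False))  # 3x3 depthwise
        self.alpha = nn.Parameter(torch.ones(1))   # learnable temperature in (12)
        self.proj = nn.Conv2d(c, c, 1, bias=False) # W_1x1 in (12)

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=1)
        q = F.normalize(q.flatten(2), dim=-1)      # (B, C, HW)
        k = F.normalize(k.flatten(2), dim=-1)
        v = v.flatten(2)
        attn = torch.softmax(q @ k.transpose(1, 2) / self.alpha, dim=-1)  # (B, C, C)
        out = (attn @ v).view(b, c, h, w)
        return self.proj(out)


x = torch.randn(1, 128, 40, 40)
print(GlobalBranch(128)(x).shape)  # torch.Size([1, 128, 40, 40])
```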
In the feature fusion branch, the attention weight maps from the aforementioned local and global branches are utilized to guide the critical information of the original features [40]. By employing 7×7 grouped convolutions, the weight allocation to key features in important regions is adjusted, further highlighting the essential channel and spatial features, thereby enhancing the model's focus on these significant areas. The operations of the feature fusion branch are as follows:
\begin{align*} F_{\text{out}} &= W_{1 \times 1}(\omega \text{LF} \oplus (1- \omega)\text{HF} \\
&\quad \oplus F_{\text{conv}} \oplus F_{\text{att}}) \\ \omega &= W_{7 \times 7}(F_{\text{conv}} + F_{\text{att}}). \tag{13} \end{align*}
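The fusion branch of (13) is sketched below. The sigmoid that bounds the weight map ω and the assumption that all four inputs share one channel width are illustrative choices rather than details taken from the paper.

```python
import torch
import torch.nn as nn


class FusionBranch(nn.Module):
    """Fusion branch of the MTFF, a sketch of (13): a 7x7 grouped convolution turns
    the summed local and global branch outputs into a weight map w, which re-weights
    the low- and high-level inputs before a final 1x1 fusion."""

    def __init__(self, c, groups=4):
        super().__init__()
        self.weight = nn.Sequential(
            nn.Conv2d(c, c, 7, padding=3, groups=groups, bias=False),  # W_7x7 in (13)
            nn.Sigmoid())                                              # keeps w in (0, 1), an assumption
        self.fuse = nn.Conv2d(4 * c, c, 1, bias=False)                 # W_1x1 in (13)

    def forward(self, lf, hf, f_conv, f_att):
        w = self.weight(f_conv + f_att)
        return self.fuse(torch.cat([w * lf, (1 - w) * hf, f_conv, f_att], dim=1))


c = 128
t = lambda: torch.randn(1, c, 40, 40)
print(FusionBranch(c)(t(), t(), t(), t()).shape)  # torch.Size([1, 128, 40, 40])
```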
Experiments and Evaluation Metrics
A. Dataset and Experimental Setup
The dataset employed in this experiment is the FAIR1M [41] dataset, which comprises over 15 000 high-resolution images, each with a resolution better than one meter and varying in size from thousands to millions of pixels. The dataset features more than one million finely annotated, multiangle distributed targets, with scenes spanning hundreds of typical cities and towns globally, as well as commonly used airports and ports. Among them, there are 5581 images containing aircraft (as shown in Fig. 6), which are categorized by model into Boeing types (such as Boeing 737, 747, 777, and 787), Airbus types (such as Airbus 220, 321, 330, and 350), and domestically produced aircraft (such as C919 and ARJ21). The dataset is randomly divided into training and validation sets in a 7:3 ratio.
The bar chart above clearly depicts the number distribution of aircraft types, and the heat map below visually represents the scale of the targets in the detected images.
The experimental platform is based on the Windows operating system, equipped with an Intel(R) Xeon(R) Gold 6226R CPU @ 3.90 GHz, 384 GB of RAM, and an NVIDIA Quadro RTX 6000 graphics card. The deep learning framework utilized is PyTorch 1.13.1 with CUDA 11.6. The experimental parameter settings are presented in Table I.
B. Evaluation Metrics
The experimental evaluation in this article utilizes Precision (P), Recall (R), and mean average precision (mAP) as metrics to assess the performance of the algorithm for recognition tasks. Precision is defined as the proportion of true positive samples correctly recognized out of all samples predicted as positive by the model. Recall refers to the ratio of the number of true positive samples correctly recognized to the total number of actual positive samples. Average precision (AP) represents the area under the precision–recall curve, which is used to holistically evaluate the performance of the model. The mAP is calculated as the average of the AP values across all categories. The formulas for each of these assessment metrics are as follows:
\begin{align*} P = \frac{N_{\text{TP}}}{N_{\text{TP}}+N_{\text{FP}}} \tag{14}\\ R = \frac{N_{\text{TP}}}{N_{\text{TP}}+N_{\text{FN}}} \tag{15}\\ \text{AP} = \int _{0}^{1} P(R) \, dR \tag{16}\\ \text{mAP} = \frac{\sum _{i=1}^{n} \text{AP}_{i}}{n} \tag{17} \end{align*}
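For reference, a short NumPy sketch of (14)-(17) using the standard all-points interpolation of the precision-recall curve; the sample values at the end are purely illustrative and are not experimental results.

```python
import numpy as np


def detection_metrics(n_tp, n_fp, n_fn):
    """Precision and recall of (14)-(15) from true/false positive and false negative counts."""
    precision = n_tp / (n_tp + n_fp + 1e-16)
    recall = n_tp / (n_tp + n_fn + 1e-16)
    return precision, recall


def average_precision(recall, precision):
    """Area under the precision-recall curve of (16), via all-points interpolation."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]     # monotonically decreasing precision envelope
    idx = np.where(r[1:] != r[:-1])[0]           # points where recall changes
    return np.sum((r[idx + 1] - r[idx]) * p[idx + 1])


# Illustrative precision-recall points for one class
rec = np.array([0.2, 0.4, 0.4, 0.8, 1.0])
prec = np.array([1.0, 1.0, 0.67, 0.75, 0.71])
print(average_precision(rec, prec))

# mAP of (17): mean of per-class AP values (illustrative numbers only)
ap_per_class = [0.91, 0.87, 0.78]
print(np.mean(ap_per_class))
```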
Experimental Results and Discussion
A. Ablation Experiment
This study evaluates the impact of various components on the model's performance through ablation experiments using YOLOv8n as the baseline network (as shown in Table II). The mAP at 50% intersection over union (mAP@0.5) increased by 1.2% when the LDFM replaced the first two convolutional downsampling operations in the baseline network. The mAP@0.5 was further enhanced by 0.7% when the FMM was utilized in place of the SPPF module in the baseline network. In addition, the introduction of the MTFF as a preprocessing step before the network's decoupled head led to a 0.8% improvement in mAP@0.5. When these modules were applied collectively, the FD-YOLOv8 achieved a 3.2% and 2.8% improvement in mAP@0.5 and mAP@0.5:0.9, respectively, compared to the baseline network YOLOv8n.
This article presents an in-depth optimization of the baseline model, significantly enhancing the network's recognition accuracy for aircraft of various models (as shown in Fig. 7). The refined model is capable of effectively distinguishing aircraft based on subtle detail features, demonstrating exceptional identification capabilities. Notably, the model maintains accurate discrimination even when faced with image variations of the same aircraft model caused by climatic conditions or data source differences, thereby significantly enhancing the model's generalization ability. To validate the model's performance, a series of experiments were designed, and the results are summarized in Table III, comparing the recognition performance of YOLOv8n and FD-YOLOv8 on different aircraft categories in the validation set. The experimental results indicate that our model outperforms YOLOv8n across multiple evaluation metrics, thereby confirming the effectiveness of the optimization strategy.
B. Comparative Analysis of Experimental Results
To further validate the enhancements in feature extraction and target localization capabilities of the FD-YOLOv8 proposed in this article, a feature map visualization experiment was conducted. The results demonstrate that FD-YOLOv8 exhibits a profound excavation and precise representation of target information in the lower layer feature maps. This not only confirms the effectiveness of the innovative modules but also provides new perspectives and methodologies for future research.
Fig. 8 illustrates the significant difference in the lower layer network feature extraction capabilities between the FD-YOLOv8 and YOLOv8n models. Our model effectively improves the precision of target information extraction during the feature extraction process and markedly enhances the network's rapid localization ability. In the first layer of feature maps, FD-YOLOv8 effectively reduces the shadow effects caused by weather conditions, minimizing the interference of external environmental factors on target recognition. This improvement significantly enhances the model's robustness in recognizing targets in complex environments. Further observation of the second layer of feature maps reveals that FD-YOLOv8 focuses more on the key features of targets, reflecting the model's exceptional ability in semantic feature extraction. Through the AWAM, the network can automatically adjust its focus on different features, thereby achieving more accurate target recognition. In the third layer of feature maps, FD-YOLOv8 shows a faster high-level semantic feature extraction speed compared to the baseline network YOLOv8n. This result validates the optimization effects of FD-YOLOv8 in high-dimensional feature abstraction and semantic understanding, providing strong support for rapid and accurate target recognition.
Visualization of feature maps. (a) Feature map after the first downsampling. (b) Feature map after the second downsampling. (c) Feature map after the first C2f module.
The YOLOv8n model faces challenges in remote sensing image recognition when encountering targets with similar characteristics. As shown in Fig. 9, YOLOv8n is prone to false positives when images contain structural targets similar to aircraft frames or when shadows of aircraft caused by lighting conditions are present. Specifically, during the initial image processing phase, the model rapidly downsamples the image size in the lower network layers, a process that inevitably leads to the loss of critical fine-grained feature information. Furthermore, the limitations in semantic feature extraction of the higher network layers make the model susceptible to misidentification when faced with targets that resemble the appearance of aircraft. To address this issue, the FD-YOLOv8 proposed in this article significantly reduces the false positive rate. This improvement is primarily attributed to the enhanced preservation of texture features in the lower layers of the model, which provide rich visual cues, thereby improving the accuracy of recognition at the level of similar features.
Detection results of YOLOv8n and FD-YOLOv8 on images with partially similar objects.
In the first row of images in Fig. 10, although the YOLOv8n model successfully distinguished between two similar aircraft models, it failed to adequately consider the significant differences in the dimensions of the aircraft fuselage across different models during the identification process. In the second row of images, YOLOv8n incorrectly identified a Boeing 787 as a Boeing 747, an error attributed to the model's inability to capture the key detail feature of the number of aircraft engines. In the third row of images, the YOLOv8n model failed to accurately differentiate between two distinctly different aircraft models, erroneously categorizing both as A350, indicating that the model has certain limitations in processing low-level texture information for fine-grained recognition tasks. The analysis results in Fig. 10 demonstrate that FD-YOLOv8 significantly outperforms the YOLOv8n model in local information processing. In target recognition tasks, this model not only independently assesses the target but also takes into account relevant information from other targets to assist in making precise judgments about the current target. Particularly when dealing with interdependent size information of aircraft, FD-YOLOv8 exhibits exceptional performance.
Comparative detection results between YOLOv8n and FD-YOLOv8 approach in processing partial information.
Fig. 11 illustrates the challenges faced by the YOLOv8n model when processing the same model of aircraft within the same image. Influenced by factors such as image quality, aircraft shadows, and the orientation of the target, the YOLOv8n model incorrectly identified the same model of aircraft as different types. This misjudgment highlights the model's limitations when dealing with certain scenarios involving the same type of target. In contrast, FD-YOLOv8 is capable of accurately discerning the varied features caused by differences in image quality, shadows, and direction, correctly identifying the same model of aircraft. This result demonstrates the significant advantage of FD-YOLOv8 in integrating contextual information.
Recognition results of the same aircraft type in remote sensing imagery of the same scene by YOLOv8n and FD-YOLOv8.
In the first row of images in Fig. 12, the YOLOv8n model failed to adequately consider the detailed information on the aircraft fuselage and the differences in fuselage size, leading to incorrect detection outcomes. This oversight indicates the model's limitations in handling targets with subtle feature differences. In the second row of images, due to the camera angle, adverse weather conditions, and the high similarity between the aircraft body color and the background color, the YOLOv8n model was unable to successfully identify the target. The combined effect of these factors increased the difficulty of model recognition. In the third row of images, where the target color is highly similar to the background and the target is situated in a complex scene, these challenges further highlight the YOLOv8n model's insufficient detection capabilities in complex environments. In contrast, FD-YOLOv8 is capable of integrating low-level textural details with high-level semantic features to recognize targets that are visually challenging. By learning the scale properties of images and gaining a deep understanding of the semantic features of targets, it achieves high-accuracy target recognition.
Fig. 13 illustrates the challenges faced by the YOLOv8n model when detecting small targets in high-resolution, large-scale remote sensing imagery. In these instances, the YOLOv8n model exhibits a significant issue with missing small targets. This phenomenon can be primarily attributed to two key factors: first, the excessive loss of target information during the preprocessing of large-sized images, which prevents the model from effectively capturing the features of small targets; second, the relatively weak feature extraction capability of the shallow layers in the YOLOv8n model, which is particularly evident in small target detection scenarios, limiting the model's performance. FD-YOLOv8 significantly enhances the feature extraction capability of the shallow layers and ingeniously integrates low-level texture information with high-level semantic information in the model's neck section, effectively reducing information loss. This strategy not only strengthens the model's ability to detect small targets but also markedly improves the recognition accuracy of targets that are difficult to identify in complex scenes. Through this multilevel information fusion, the model can more comprehensively understand image content, thereby performing exceptionally well in detection tasks involving high-resolution, large-scale remote sensing imagery.
Comparative detection results in high-resolution, large-scale remote sensing imagery.
C. Comparison With Other Detection Models
To assess the performance of the FD-YOLOv8 algorithm, this study conducted comparative experiments with YOLOv5-n [27], YOLOv6-n [10], YOLOv8-n [27], YOLOv9-t [12], YOLOv10-n [13], YOLOv3-t [8], YOLOv5-s [27], YOLOv6-s [10], YOLOv8-s [27], YOLOv9-s [12], YOLOv10-s [13], Faster-RCNN [16], FCOS [42], Gliding Vertex [43], Oriented R-CNN [44], RoI Trans [45], S2A-Net [46], and PCLDet [24] on both the FAIR1M dataset and the MAR20 dataset. The performance metrics of AP, mAP50, Parameters, and FLOPs for each algorithm are presented in Tables IV and V.
On the FAIR1M dataset, when compared with the n-tier models of the YOLO series, FD-YOLOv8 achieved the best performance with only a slight increase in parameters and computational load. Compared to the s-tier models, its mAP50 was only 1% lower than that of the strongest model, YOLOv5-s, while its computational load was only about 50% of the latter's. The performance of FD-YOLOv8 surpasses that of Faster-RCNN, FCOS, and PCLDet. Compared to S
On the MAR20 dataset (image size of 256 × 256 pixels), FD-YOLOv8 improved the mAP50 metric by 1.6% to reach 91.2% compared to YOLOv8n. The performance of FD-YOLOv8 surpasses that of the s-tier models of the YOLO series while maintaining high accuracy and robustness, alongside a lower parameter count and computational load. In contrast, models based on the Transformer architecture struggle to effectively learn target information due to the small size of target pixels. FD-YOLOv8 outperforms Faster-RCNN, FCOS, PCLDet, S
Conclusion
This study addresses the issue of fine-grained recognition of aircraft in aerospace remote sensing imagery by proposing an improved algorithm based on YOLOv8, called FD-YOLOv8. Through the design of an LDFM, an FMM, and MTFF, this algorithm significantly enhances the identification accuracy of aircraft targets, particularly in complex backgrounds and small target recognition. Experimental results on the FAIR1M dataset demonstrate the effectiveness of this algorithm in aircraft category recognition, achieving accurate differentiation among various aircraft models.
Despite the positive outcomes of this study, there are some limitations. First, the algorithm's capability to handle noise and occlusions in remote sensing imagery needs further enhancement. Second, the real-time performance and robustness of the algorithm in practical applications should be validated on a broader range of datasets and real-world scenarios. In addition, the model's generalization ability could be improved when facing different climatic conditions and environmental variations.
To address the limitations of this work, future research can be conducted in the following directions: one is to further optimize the network structure to enhance the algorithm's robustness to noise and occlusions; another is to train and test on larger and more diverse datasets to strengthen the model's generalization capabilities. Through these research avenues, it is hoped that the technology for fine-grained recognition of aircraft in remote sensing imagery can be advanced to a higher level.
ACKNOWLEDGMENT
The authors would like to thank the authors of [27], [41], [47], [48] and [49] for their generous sharing of their codes or data. In addition, our sincere gratitude goes to the teachers and classmates who offered valuable suggestions and assistance during the writing process of this article.