I. Introduction
Currently, remote sensing object detection comprises three tasks: horizontal bounding box (HBB) detection, oriented bounding box (OBB) detection, and instance segmentation (InSeg), all of which aim to accurately localize objects of different categories in complex remote sensing scenes. Although massive efforts have been devoted to HBB [1], [2], OBB [3], [4], and InSeg [5], [6] in the remote sensing domain, they still cannot satisfy the multigrained detection requirements of different categories of remote sensing objects. As shown in Fig. 1(a), in a very complicated airport (AT) scene, the OBB can be adopted to detect rectangular cars. However, since storage tanks (STs) have no apparent orientation, the OBB is not suitable for their detection. Moreover, because of the irregular shape of airplanes (ALs), neither the OBB nor the HBB is a viable option for AL detection. Therefore, several studies [7], [8], [9], [10], [11], [12], [13] have attempted to establish a unified remote sensing object detector that integrates multigrained detection abilities involving HBB, OBB, and InSeg. Xu et al. [7], Shi and Zhang [8], and Liu et al. [9] all utilize a multitask learning strategy to integrate the InSeg ability with HBB or OBB, improving object detection ability through parallel, cascaded, and shared feature extraction architectures. To establish multigrained object detection ability, Yang et al. [10] and Qian et al. [11] designed unified detectors that integrate HBB and OBB detection abilities through conversion between HBB and OBB, relying on a specific weakly self-supervised learning scheme and a horizontal smallest enclosing rectangle constraint, respectively. Besides, based on the segment anything model (SAM), Chen et al. [12] employed various prompting mechanisms (i.e., bounding boxes and queries) to integrate HBB and InSeg detection abilities into a generic framework called RSPrompter. Zhang et al. [13] formulated the HBB and OBB detection tasks as universal language modeling and designed visual-text alignment representation learning to integrate HBB and OBB detection abilities into a multimodal large language model called EarthGPT.
Nevertheless, whether they rely on multitask learning, conversion between detection granularities, prompt engineering, or universal language modeling, these methods all depend on constructing multigrained labeled datasets, designing valid strategies for fundamental model integration, and undergoing expensive training procedures. Moreover, even after tedious dataset preparation, specific model design, and lengthy model training, these studies [7], [8], [9], [10], [11], [12], [13] still cannot fully integrate the multigrained object detection requirements of HBB, OBB, and InSeg. They can only unify HBB and OBB, or HBB and InSeg, for detection, which hinders precise and adaptive detection of multicategory objects in complicated remote sensing scenes. Consequently, how to integrate multigrained object detection abilities (i.e., HBB, OBB, and InSeg) into a unified detection framework remains a challenge that must be further studied in order to cater to the different detection granularities of multicategory objects and to advance the development of remote sensing object detection techniques.
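To make the difference between the HBB and OBB granularities concrete, the following minimal sketch computes the horizontal smallest enclosing rectangle of an oriented box and reports how much of that rectangle the object actually fills. It is an illustration only and not the method of any cited work; the function names (`obb_corners`, `enclosing_hbb`, `hbb_fill_ratio`) and the (center, size, angle) parameterization of the OBB are assumptions introduced here.

```python
import math


def obb_corners(cx, cy, w, h, angle_deg):
    """Return the four corners of an oriented box given its center, size, and rotation.

    Assumed parameterization: (cx, cy) is the center, w and h are the side
    lengths, and angle_deg is the counterclockwise rotation in degrees.
    """
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    half = ((-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2))
    # Rotate each corner offset about the origin, then translate to the center.
    return [(cx + dx * cos_a - dy * sin_a, cy + dx * sin_a + dy * cos_a)
            for dx, dy in half]


def enclosing_hbb(points):
    """Return the horizontal smallest enclosing rectangle as (xmin, ymin, xmax, ymax)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)


def hbb_fill_ratio(w, h, angle_deg):
    """Fraction of the enclosing HBB area that is covered by the oriented box."""
    xmin, ymin, xmax, ymax = enclosing_hbb(obb_corners(0.0, 0.0, w, h, angle_deg))
    return (w * h) / ((xmax - xmin) * (ymax - ymin))


if __name__ == "__main__":
    # An elongated 4 x 1 object at several orientations.
    for angle in (0, 15, 45, 90):
        print(angle, round(hbb_fill_ratio(4.0, 1.0, angle), 3))
```

Under these assumptions, an axis-aligned 4 x 1 object fills its HBB completely, whereas the same object rotated by 45 degrees fills only about 32% of it. This is why an HBB becomes loose for elongated, rotated objects, while for a roughly circular object such as a storage tank the orientation carries no information.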
Fig. 1. Illustration of the multigrained object detection requirement in a complex remote sensing scene. (a) Complex AT scene containing multiple object categories, e.g., AL, V, and ST. (b) Individual detectors designed for different categories. (c) Proposed UniconDet applied to detect all kinds of objects.