
Context-Aware Region-Dependent Scale Proposals for Scale-Optimized Object Detection Using Super-Resolution




Abstract:

Image scaling techniques such as Super-Resolution (SR) are useful for object detection, especially for detecting small objects. However, we found that scaling by an inappropriate factor tends to induce false-positive detections. This paper presents a Region-Dependent Scale-Proposal (RDSP) network that estimates the appropriate scale factors for each image region depending on its contextual information. In our detection framework, images are appropriately scaled by SR according to the estimations of the RDSP network, and fed into the scale-specific object detectors. While previous works have proposed models for scale proposal, our RDSP extracts regions where objects could potentially exist based on scene structure, regardless of whether actual objects are present, because small objects are often too small to determine their presence accurately. Additionally, while existing approaches have fused object detection and SR in an end-to-end manner, scale proposals for SR are not provided or are performed independently. Qualitative and quantitative experiments show that our RDSP network provides appropriate SR scales and improves detection accuracy on a highly challenging dataset captured by real car-mounted cameras, with size-varied objects including extremely small objects. Our code is available at https://github.com/kakitamedia/RDSP.
Published in: IEEE Access ( Volume: 11)
Page(s): 122141 - 122153
Date of Publication: 01 November 2023
Electronic ISSN: 2169-3536

SECTION I.

Introduction

Object detection is one of the most important computer vision tasks. One of the challenges in object detection lies in the variation in object sizes, and many previous works addressed this size-varied object detection task.

A standard anchor-based method detects size-varied objects from a wide variety of scaled anchor boxes that are densely distributed in an image (e.g., SSD [2], RetinaNet [3], and Faster R-CNN [4]). However, such a huge number of anchor boxes covering a variety of locations and scales not only degrades computational efficiency but also makes learning difficult.

In contrast to such a scale-invariant object detector, it is known that size-varied object detection can be improved by scale-specific object detectors [5], [6] with image scaling. In SNIP [5], an image is scaled to various sizes by pre-determined upscale factors with image interpolation, and each scaled image is fed into its corresponding scale-specific detector. However, it has been pointed out that image interpolation degrades the performance of small object detection [1].

To detect such small objects, detection methods using Super-Resolution (SR) have been proposed, for example, for face detection [7] and generic object detection [1], [8]. In [7] and [8], SR is applied to regions extracted by a Region Proposal Network (RPN). However, when objects are extremely small, RPN may not perform effectively, so the entire detection process, including RPN, should be performed on SR images. In [1], SR is applied before the entire detection process, but the experiments are conducted only on manually downsized images, meaning the effectiveness of SR for size-varied object detection is not evaluated.

Therefore, we apply SR directly to real data, which contains objects of various sizes, including extremely small objects. However, we found that applying SR before detection tends to produce false positives (as shown in Figure 1 (a) and Figure 2) when object regions are rescaled by inappropriate scaling factors, even though these regions are not falsely detected in the original-scale image.

FIGURE 1.

Examples of our object-scale proposals for scale-specific object detection with super-resolution. (a) Detection result by TDSR [1] with factor 2. False-positive detection can be seen on the left of the image. (b) SR-scale proposals by our Region-Dependent Scale-Proposal (RDSP) network. The regions appropriate for upscaling by super-resolution with factor 2 are predicted. (c) Detection result of our proposed detection pipeline. False-positive detection is suppressed based on (b).

FIGURE 2.

Recall and precision of standard one-stage detectors with different settings. Naive application of SR improves the recall by achieving the detection of small objects, but harms the precision because of false-positives. Our method improved recall without harming precision.

This paper presents how to estimate the appropriate scale of each image region. Our contributions in this paper are as follows:

  • We apply SR to size-varied object detection tasks, whereas previous SR-based approaches apply SR to manually downscaled images.

  • Our proposed method addresses false-positive detections in SR-based object detection. In our method, an appropriate SR scale is predicted at each image region depending on the scene structure. Based on this prediction, the false-positive detections caused by inappropriate scaling are suppressed.

  • The aforementioned appropriate scale prediction is achieved by our proposed network called a Region-Dependent Scale-Proposal (RDSP) network. While some previous works have proposed models for scale proposal, our RDSP extracts regions where objects could potentially exist based on scene structure, regardless of whether actual objects are present, because small objects are often too small to determine their presence accurately.

  • In RDSP, the global scene structure and local appearance features are utilized in implicit and explicit manners. While RDSP is presented in our early work [9], it is extended with the positional and global structure embeddings in this paper.

  • This paper explores how to effectively train the network consisting of RDSP, SR, and detection sub-networks in an end-to-end manner. Since the full network is huge, an ill-considered combination of complex loss functions makes training difficult due to conflicts between the losses. We found the best combination of losses for end-to-end learning of the full network.

  • Our method can be applied to any differentiable object detector. This applicability is demonstrated in our experiments shown in this paper. Various object detectors are integrated with RDSP in order to improve the performance of object detection.

  • To validate the effectiveness of the proposed method, we utilize datasets captured by car-mounted cameras, which have severe object size variations, including extremely small objects.

While the first and second contributions above were presented in our early version [9], the remaining contributions are novel to this paper.

SECTION II.

Related Work

In this section, we introduce SR using deep convolutional networks (in Sec. II-A) and object detection using SR and/or scene-object relationships (in Secs. II-B and II-C).

A. Deep Super-Resolution

As with many computer vision technologies, SR has been improved with convolutional networks (e.g., DBPN [10], WDST [11], SRFlow [12], PAMS [13], CARN [14], LatticeNet [15], SRNTT [16], SPSR [17]). Recent approaches with downscaling kernel representations [18], [19], [20], [21], [22], [23] improve SR performance and its applicability to real-world images degraded by a variety of blur kernels. SR remains an active topic in computer vision, as demonstrated in public challenges [24], [25], [26]. Furthermore, single-image SR has been extended to video SR [27], [28], [29], [30], [31] and joint space-time video SR [32], [33], [34], [35] for a wider variety of applications.

However, all of these SR methods are designed to improve image quality for human perception, which is evaluated by PSNR and other image-quality metrics. While our proposed method can utilize any of these SR methods, the goal of our work is to explore SR applicable to machine perception, such as object detection. To this end, previous works combined SR with object detectors for tiny face detection [7] and tiny generic object detection [1], [8]. This paper proposes automatic region-dependent scale proposals for SR, in addition to image upscaling using SR.

B. Object Scale Proposal

To address size-varied object detection challenges, numerous strategies have been proposed. For example, image pyramid-based [5], [6], [36] and feature pyramid-based [2], [3], [37] strategies are common practice. However, these methods are unsuitable for scenarios with extremely large variations in object sizes, as they require constructing very large pyramids. Therefore, previous works [38], [39] proposed methods that explicitly predict object sizes in advance and perform detection based on these predictions. However, since these approaches rely on the appearance of objects in the image to estimate their scale, estimation can fail when dealing with extremely small objects. Therefore, scale estimation methods that do not rely on object appearance are needed.

Scene structure also plays a role in determining the scale of objects in an image; for instance, a sidewalk near the vanishing point may contain a small person. Hence, some studies explicitly utilize this kind of scene structure. For example, scene-specific [40], [41], perspective-aware [42], depth-aware [43], 3D geometry-aware [44] object detectors are proposed.

In this study, we construct the RDSP network, which incorporates scene structure, and propose a training method that enables scale estimation independent of object appearance.

C. Scale-Dependent Object Detection

Beyond object scale recognition [42], [43], [44], [45] mentioned in Sec. II-B, local image regions can be explicitly rescaled for easier recognition (e.g., region-dependent object scaling for object detection [46], in particular for tiny object detection [47]). Tiny object detection can be further improved by incorporating SR and detection networks with end-to-end learning [1].

Our proposed method integrates scale-dependent detection with SR so that the appropriate scale in each local image region is estimated. Previous methods (i) estimate a region-independent scale histogram [46] or (ii) just employ off-the-shelf detectors for estimating region-dependent scale proposals [47]. In contrast, our method has an additional network (RDSP) that proposes, for multiple SR scales, probability maps in which each pixel value is higher when a target object is more likely to be observed at that pixel. Each scale proposal is expressed as a heatmap image, as shown in Figure 1 (b).

SECTION III.

Region-Dependent SR-Scale Proposals for Scale-Specific Object Detection

Figure 3 shows the overall pipeline of our proposed method. Our detection pipeline consists of three independent branches and one independent detector, and each branch is composed of the following processes (a minimal code sketch of the full pipeline is given after the list). Note that each “term” enclosed in double quotation marks appears in Figure 3 and Figure 4.

  1. SR networks: An SR network is prepared for each scaling factor.

  2. RDSP network: The RDSP network estimates the “Scale-proposal heatmaps,” in which the appropriateness of each SR scaling factor (e.g., factors of 1, 2, and 4) is given at each pixel. Examples of the scale-proposal heatmaps are shown in Figure 1 (b).

  3. Detection network: Each of the generated “Masked SR images” is fed into its corresponding scale-specific object detection network (i.e., “Detector” in Figure 3). Detected bounding boxes are superimposed onto a single image, where overlapping bounding boxes are merged by a standard non-maximum suppression scheme.
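For illustration, the following is a minimal PyTorch-style sketch of this inference pipeline. The per-scale module containers `sr_nets`, `rdsp_nets`, and `detectors`, the detector output format, and the use of torchvision's standard NMS for merging are illustrative assumptions, not the released implementation.

```python
# Minimal inference sketch of the pipeline in Figure 3. Assumptions: PyTorch;
# `sr_nets`, `rdsp_nets`, and `detectors` are hypothetical per-scale modules;
# each detector returns (boxes, scores) in the coordinates of its input image.
import torch
import torch.nn.functional as F
from torchvision.ops import nms

SCALES = [1, 2, 4]

@torch.no_grad()
def detect(image, sr_nets, rdsp_nets, detectors, iou_thr=0.5):
    """image: (1, 3, H, W) tensor. Returns merged boxes and scores."""
    all_boxes, all_scores = [], []
    for s in SCALES:
        sr_img = image if s == 1 else sr_nets[s](image)  # upscale by factor s
        heat = rdsp_nets[s](image)                       # scale proposal in [0, 1]
        # Resize the proposal heatmap to the SR resolution and mask the image.
        mask = F.interpolate(heat, size=sr_img.shape[-2:], mode="bilinear")
        boxes, scores = detectors[s](sr_img * mask)      # scale-specific detector
        all_boxes.append(boxes / s)                      # back to original coords
        all_scores.append(scores)
    boxes, scores = torch.cat(all_boxes), torch.cat(all_scores)
    keep = nms(boxes, scores, iou_thr)                   # merge overlapping boxes
    return boxes[keep], scores[keep]
```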

FIGURE 3.

Overall pipeline of the proposed method. Any differentiable network can be used for the SR network and detector. The detailed structure of RDSP networks is shown in Figure 4.

FIGURE 4.

RDSP architecture. $\oplus$ denotes a pixel-wise add operation.

Our RDSP network suppresses false positives due to inappropriate scaling by proposing an appropriate scaling factor for each image region. In what follows, Section III-A explains our proposed “RDSP network” in detail, Section III-B introduces how to train the full network consisting of the RDSP, SR, and detection sub-networks, and Section III-C describes how to create the ground truth for training the RDSP network.

A. Region-Dependent SR-Scale Proposals

The RDSP network is required to roughly but robustly detect regions in which objects might exist. Such a region is called a Possible Object Region (POR). In particular, even PORs of tiny objects must be detected by the RDSP network, so that these PORs can be upscaled using SR by large scaling factors (e.g., a factor of 4). However, this scheme seems to be a chicken-and-egg problem because the RDSP network must detect PORs of tiny objects to support the following object detection network. Therefore, the goal of the RDSP network is not to precisely detect objects without excess or deficiency but to roughly detect PORs with no false negatives. While a POR is similar to a general region proposal for object detection, RDSP also estimates the appropriate scaling factor of each POR to improve the performance of scale-specific object detection. One more difference is the representation: a region proposal is a bounding box corresponding to each object region, whereas our POR is represented as a heatmap image of pixelwise probability values.

Since POR estimation for tiny objects is difficult, we need additional cues beyond the appearance information of each region of interest. To encode such cues, the proposed RDSP architecture has the following three modules (also shown in Figure 4):

  1. “Positional embedding”: If similar scene structures are observed in different images, objects of each class tend to have similar sizes at specific pixel coordinates in these images. For example, in the case of an in-vehicle camera, small objects are located near the vanishing point, i.e., around the center of the image. To utilize such a constraint implicitly, we employ a positional embedding scheme. For this embedding, we use a pair of “position images” consisting of x and y channels, in which each pixel value is its x and y image coordinate, respectively. These position images are fed into a $1\times1$ conv layer, and the output features of this conv layer are pixelwise added to the image features extracted from the input image (as indicated by “Add-1” in the figure).

  2. “Scene structure embedding”: If the structure of a scene is neglected, implausible detections may occur; the misclassification of a boat in the sea as a car is a typical example [48]. To suppress such misclassification, methods using the structure of the entire scene have been proposed (e.g., based on object-scene relationships [48] and attention [49]). In our work, the scene structure influences the image features through pixel-wise addition of the pooled global features to the image features (as indicated by “Add-2” in the figure).

  3. “UNet-like backbone”: The effectiveness of context around a region of interest has been validated for tiny-object tasks such as face detection [7], human-body key-point detection [50], and semantic segmentation [51]. The proposed method employs a semantic segmentation network, U-Net [52], for estimating the scale proposals by jointly evaluating local and global appearance features. The U-Net architecture [52] has expanding paths for precise localization and contracting paths for capturing context. These two types of paths are designed to be symmetric and connected in order to propagate a large number of feature channels from the contracting paths to the expanding paths, utilizing both local and global contexts.

The “Add” operation described above refers to the element-wise addition defined by the following equation:
\begin{equation*} F^{\prime}_{x,y,c} = F_{x,y,c} + E_{x,y,c}, \tag{1}\end{equation*}
where $F^{\prime}$ denotes the post-embedding feature, $F$ denotes the image feature, and $E$ denotes the embedding feature. $x$, $y$, and $c$ index positions on each feature map.

The “stretch” operation in Figure 4 replicates a 1-dimensional vector $V$ along both the height and width dimensions to create a 3-dimensional feature map. The “stretched” feature map $M$ is expressed as follows:
\begin{equation*} M_{x,y,c} = V_{c}. \tag{2}\end{equation*}
Through this “stretch” operation, the pooled 1-dimensional feature vector is expanded to the same size as the image feature map.
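As a concrete illustration of Eqs. (1) and (2), the following is a minimal PyTorch sketch of the two embedding modules; the module names and layer widths are illustrative assumptions rather than the released implementation. Both modules reduce to a pixel-wise addition onto the image features, as in Eq. (1).

```python
# Minimal sketch of the two embedding modules (Eqs. 1 and 2), assuming PyTorch.
import torch
import torch.nn as nn

class PositionalEmbedding(nn.Module):
    """Adds features derived from x/y "position images" ("Add-1" in Figure 4)."""
    def __init__(self, channels):
        super().__init__()
        self.proj = nn.Conv2d(2, channels, kernel_size=1)  # 1x1 conv on (x, y)

    def forward(self, feat):
        b, _, h, w = feat.shape
        ys = torch.arange(h, device=feat.device).view(1, 1, h, 1).expand(b, 1, h, w)
        xs = torch.arange(w, device=feat.device).view(1, 1, 1, w).expand(b, 1, h, w)
        pos = torch.cat([xs, ys], dim=1).float()           # the "position images"
        return feat + self.proj(pos)                       # Eq. (1): pixel-wise add

class SceneStructureEmbedding(nn.Module):
    """Pools a global vector and "stretches" it back ("Add-2" in Figure 4)."""
    def __init__(self, channels):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                # global 1-d vector
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, feat):
        v = self.proj(self.pool(feat))                     # (B, C, 1, 1)
        return feat + v.expand_as(feat)                    # Eq. (2): stretch, then add
```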

The RDSP network represents the estimated PORs as a set of heatmap images, each of which corresponds to one scaling factor, as indicated by the “Scale-proposal heatmap” in Figure 4. Since the last layer of the UNet-like backbone is a Sigmoid activation layer, each pixel value in the scale-proposal heatmap is normalized between 0 and 1, meaning the least and the most probable to be a POR, respectively. As shown in Figure 3, our proposed network has multiple RDSP networks, one for each scaling factor.

While our proposed RDSP is similar to a general Region Proposal Network (RPN), it differs in that RDSP estimates the probability of an object’s presence based solely on context, regardless of whether an object actually exists there. In contrast to RPN, which is trained on the ground-truth bounding boxes, our RDSP is trained to inherently extract regions where objects could potentially exist by our proposed training algorithm, as detailed in the next section.

B. End-to-End Training for SR and RDSP

In our proposed detection framework, we train each branch (as illustrated on the right in Figure 3) independently. An overview of the training for each branch is shown in Figure 5.

FIGURE 5.

The overview of the training of our proposed method.

As with the detection network, any differentiable SR network can be utilized in our joint SR and detection network. Prior to end-to-end training of the full network consisting of these networks, we assume that each of them is pre-trained following its own standard training process, for better training of the full network. In what follows, we describe the end-to-end training scheme that follows these pre-training processes.

As with the basic training process of an SR network, in our end-to-end training scheme, the SR network is trained with the following reconstruction loss, expressed by the mean absolute error (MAE):
\begin{equation*} L_{rec}(x, \hat{x}) = \frac{1}{N}\sum^{N}_{i=1}|x_{i} - \hat{x}_{i}|, \tag{3}\end{equation*}
where $x$ and $\hat{x}$ denote the ground-truth HR image and its SR image, respectively, and $N$ is the number of pixels in each image.

In contrast to general SR training using $L_{rec}$, we also train the SR networks with a detection loss (denoted by $L_{det}$) to optimize SR for object detection. Let $\downarrow_{s}(\cdot)$ denote an image downscaling function by factor $s$, and let $S$ denote the SR network. The detection network $D$ takes the SR image masked by the heatmap estimated by the RDSP network (denoted by $R$). The form of the output of $D$ (i.e., detection results) and its ground truth (denoted by $y$) differs depending on the object detector. For example, in CenterNet [53], $y$ consists of three kinds of multi-channel images. The pixel values of the first, second, and third images are (i) the confidence value of object center detection at each pixel, (ii) the width and height of the object bounding box at each pixel, and (iii) bounding-box displacements caused by the output stride; see the original paper [53] for details.

With $L_{rec}$ and $L_{det}$, the SR network in our full network is trained with the following compound loss function $L_{SR}$:
\begin{align*} L_{SR}(x, y) &= \lambda_{rec} L_{rec}(x, S(\downarrow_{s}(x))) \\ &\quad + \lambda_{det} L_{det}(y, D(S(\downarrow_{s}(x)) \odot R(x))), \tag{4}\end{align*}
where $\lambda_{rec}$ and $\lambda_{det}$ are constants determining the relative weights of the reconstruction loss and the detection loss, respectively.

The RDSP network is trained with the following loss, expressed by the binary cross entropy (BCE) using the ground-truth heatmap created from the detection ground truth:
\begin{equation*} L_{scale}(p, \hat{p}) = - \frac{1}{N} \sum^{N}_{i=1} p_{i}\log \hat{p}_{i}, \tag{5}\end{equation*}
where $p$ and $\hat{p}$ denote the ground-truth heatmap and the predicted heatmap, respectively. As described above, the RDSP network is required to roughly but robustly detect possible object regions, and the ground-truth heatmap $p$ is designed to fulfill that requirement. The details of the ground-truth heatmap $p$ are described in Sec. III-C.

We also train the RDSP networks with the detection loss. With $L_{scale}$ and $L_{det}$, the RDSP network is trained with the following loss function $L_{RDSP}$:
\begin{align*} L_{RDSP}(x, y, p) &= \lambda_{scale} L_{scale}(p, R(x)) \\ &\quad + \lambda_{det} L_{det}(y, D(S(\downarrow_{s}(x)) \odot R(x))), \tag{6}\end{align*}
where $\lambda_{scale}$ is a weighting parameter for $L_{scale}$.

In total, the loss function for the entire proposed framework is as follows:
\begin{equation*} L_{total} = L_{RDSP} + L_{SR}. \tag{7}\end{equation*}
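Putting Eqs. (3)–(7) together, the following is a minimal PyTorch sketch of one training step's loss. Here `S`, `R`, and `D` stand for the SR, RDSP, and detection networks, `det_loss` for the detector-specific loss, and `downscale` for $\downarrow_{s}$; all are placeholders. Note that the sketch uses the standard two-term BCE, while Eq. (5) writes only its positive term.

```python
# Minimal sketch of the combined losses (Eqs. 3-7), assuming PyTorch.
import torch.nn.functional as F

def total_loss(x_hr, y_gt, p_gt, S, R, D, det_loss, downscale,
               lam_rec=1.0, lam_det=1.0, lam_scale=1.0):
    sr = S(downscale(x_hr))                # SR of the downscaled HR image
    heat = R(x_hr)                         # scale-proposal heatmap in [0, 1]
    det_out = D(sr * heat)                 # detect on the masked SR image

    l_rec = F.l1_loss(sr, x_hr)            # Eq. (3): mean absolute error
    l_det = det_loss(det_out, y_gt)        # detector-specific detection loss
    l_scale = F.binary_cross_entropy(heat, p_gt)  # Eq. (5) (full BCE here)

    l_sr = lam_rec * l_rec + lam_det * l_det        # Eq. (4)
    l_rdsp = lam_scale * l_scale + lam_det * l_det  # Eq. (6)
    return l_rdsp + l_sr                            # Eq. (7)
```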

C. Scale-Proposal Ground-Truth for RDSP Training

As described in Sec. III-B, the RDSP network is trained with the ground-truth heatmaps. Although the RDSP networks are required to estimate the regions suitable for their corresponding scale factors, such requirements are difficult to learn from end-to-end training with object detection alone. Therefore, to support the training, we create ground-truth heatmaps that satisfy the requirements. For producing the ground-truth data, a standard training dataset for object detection is reprocessed as follows (a minimal code sketch of this procedure is given after the list):

  1. The bounding boxes of objects for detection are divided into height-dependent groups. In our experiments, the bounding boxes are divided into three groups, namely those whose appropriate scaling factors are 1, 2, and 4. More specifically, the group is determined by the height of the bounding box (denoted by $h_{b}$): (1) if $h_{b} \geq 64$, factor of 1; (2) if $32 \leq h_{b} < 64$, factor of 2; and (3) if $h_{b} < 32$, factor of 4. In the training images, the groups of factors 1, 2, and 4 have 7,156, 6,348, and 6,297 bounding boxes, respectively.

  2. Each RDSP network produces a heatmap-like image in which higher values are given to pixels where any target object is likely to be observed. The ground-truth heatmap for the factor $S \in \{1, 2, 4\}$ contains only the bounding boxes included in factor $S$'s group, as illustrated in Figure 6. In each ground-truth image, all bounding boxes are filled with 1, while all other pixels are 0.

  3. For robust detection, all bounding boxes filled with 1 are blurred with a Gaussian filter. This blurred image is used as the ground truth of the RDSP network's output for training.
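The following is a minimal NumPy/SciPy sketch of this ground-truth construction under the height thresholds above; the blur sigma and the renormalization of the blurred map to a peak of 1 are illustrative assumptions.

```python
# Minimal sketch of ground-truth heatmap generation (Sec. III-C).
import numpy as np
from scipy.ndimage import gaussian_filter

FACTORS = (1, 2, 4)

def factor_for_height(h_b):
    """Step 1: group a bounding box by its height h_b."""
    if h_b >= 64:
        return 1
    if h_b >= 32:
        return 2
    return 4

def gt_heatmaps(boxes, img_h, img_w, sigma=4.0):
    """boxes: iterable of (x1, y1, x2, y2). Returns {factor: heatmap}."""
    maps = {s: np.zeros((img_h, img_w), dtype=np.float32) for s in FACTORS}
    for x1, y1, x2, y2 in boxes:
        s = factor_for_height(y2 - y1)
        maps[s][int(y1):int(y2), int(x1):int(x2)] = 1.0  # step 2: fill with 1
    for s in FACTORS:
        maps[s] = gaussian_filter(maps[s], sigma=sigma)  # step 3: Gaussian blur
        peak = maps[s].max()
        if peak > 0:  # renormalize to [0, 1] (assumed convention)
            maps[s] /= peak
    return maps
```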

FIGURE 6.

(a) Training image and ground-truth bounding boxes. Red, blue, and yellow bounding boxes are grouped into those of factors 1, 2, and 4, respectively. From this training image, ground-truth heatmap images for each factor are generated as shown in (b), (c), and (d), respectively. These heatmaps are blurred by Gaussian for robust detection.

SECTION IV.

Experiments

A. Dataset

We conducted experiments with the CityScapes dataset [54], a car-mounted camera dataset. Since this dataset was developed for evaluating instance segmentation methods, its annotations are pixelwise instance labels. From these labels, bounding boxes for object detection were generated by taking the rectangle circumscribing each instance as its bounding box (a minimal sketch of this conversion is given below). While 30 object classes are defined in the dataset, most of them are background classes such as “sky,” “road,” and “vegetation.” In our experiments, only the classes in the human group (i.e., “person” and “rider”) were used for the object detection task. The CityScapes dataset is officially split into training and test images. The numbers of training and test images containing human bounding boxes are 2,965 and 492, respectively, with 19,801 and 3,975 human bounding boxes in total.
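For reference, the conversion from pixelwise instance labels to circumscribing bounding boxes can be sketched as follows, assuming NumPy, an integer instance-id map, and 0 as the background id (these conventions are assumptions).

```python
# Minimal sketch: circumscribing boxes from a pixelwise instance-label map.
import numpy as np

def masks_to_boxes(inst_map):
    """inst_map: (H, W) integer array. Returns {instance_id: (x1, y1, x2, y2)}."""
    boxes = {}
    for inst_id in np.unique(inst_map):
        if inst_id == 0:                    # assumed background id
            continue
        ys, xs = np.nonzero(inst_map == inst_id)
        # Half-open pixel convention for the right/bottom edges.
        boxes[int(inst_id)] = (int(xs.min()), int(ys.min()),
                               int(xs.max()) + 1, int(ys.max()) + 1)
    return boxes
```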

Additionally, we conducted experiments on the BDD100k dataset, which is a dataset collected from car-mounted cameras, like the CityScapes dataset. In contrast to the CityScapes dataset, the BDD100k dataset is primarily designed for object detection tasks, so we use the provided bounding boxes as-is. The BDD100k dataset is officially split into training and test images. The training set contains 92,369 human bounding boxes, while the test set has 13,426 human bounding boxes. The dataset consists of a total of 70,000 training images and 10,000 test images.

B. Training Details

The proposed method has three components: the SR network, the RDSP network, and the object detection network. As the SR network, we use the Deep Back-Projection Network (DBPN) [10], which achieves competitive results in SR challenges [24]. The architecture of DBPN is shown in Figure 7. As the detection network, we use the following five detectors of three types: (1) one-stage detectors: SSD [2] and RetinaNet [3]; (2) a two-stage detector: Faster R-CNN [4]; and (3) anchor-free detectors: FCOS [55] and CenterNet [53].

FIGURE 7.

The architecture of the DBPN [10] network.

First, these three components are pretrained independently with $L_{rec}$, $L_{scale}$, and $L_{det}$, respectively. For this pretraining, we use the Adam [56] optimizer with $\beta = (0.9, 0.999)$ and a mini-batch size of 8. The learning rate is initialized to 1e-4 and multiplied by 1/10 at 300,000 and 450,000 iterations, with 500,000 iterations in total (a minimal sketch of this schedule is given below). As augmentations, we apply random flipping and random cropping to $512\times512$. The weights of the SR networks are initialized from the authors' model, while the weights of the detection networks are initialized from COCO-pretrained weights published by the mmdetection [57] project. The RDSP models are trained from randomly initialized weights.
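This schedule corresponds to a standard Adam plus step-decay setup; the following minimal PyTorch sketch uses a stand-in module in place of the actual sub-networks.

```python
# Minimal sketch of the pretraining schedule: Adam, lr 1e-4, decayed by 1/10
# at 300k and 450k of 500k iterations. The model is a stand-in placeholder.
import torch
import torch.nn as nn

model = nn.Conv2d(3, 64, kernel_size=3, padding=1)  # placeholder sub-network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[300_000, 450_000], gamma=0.1)

def train_step(images, targets, loss_fn):
    optimizer.zero_grad()
    loss = loss_fn(model(images), targets)
    loss.backward()
    optimizer.step()
    scheduler.step()  # milestones are in iterations, so step every iteration
    return loss.item()
```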

After this pretraining, these networks are fine-tuned in an end-to-end manner with the loss functions described in Sec. III-B. For this training, we set the mini-batch size to 6. The learning rate is initialized to 1e-4 and multiplied by 1/10 at 200,000 and 280,000 iterations, with 300,000 iterations in total. Other settings follow the pretraining.

C. Results and Analysis

1) Effects of SR and RDSP

We compare the results with and without SR and RDSP to validate their effectiveness. The qualitative results on the CityScapes dataset with the SSD detector are shown in Figure 8. The first and second rows of the figure show that SR enables the detector to detect small objects that are not detected without SR. Rows 3–6 show that, while detection with SR produces false positives in regions where people should not be present given the context, our proposed method suppresses these false positives by successfully estimating and masking such regions. The quantitative results of various detectors on CityScapes are shown in Table 1. The table shows that SR improves the detection performance for small objects but sometimes has a negative impact on medium or large object detection. With our proposed RDSP networks, the detection performance is further improved in most cases. However, the results with CenterNet indicate a performance decrease when utilizing RDSP; these results are discussed later.

TABLE 1 Experimental Results on the CityScapes Dataset. We Conducted Comparisons Across Various Object Detectors (i.e., SSD, RetinaNet, Faster R-CNN, FCOS, and CenterNet) With and Without SR and our RDSP. ✓ Denotes That the Corresponding Sub-Network is Integrated Into the Full Network. Red and Blue Values Indicate the Best and the Second-Best Scores in Each Detector, Respectively
FIGURE 8.

The successful cases on the CityScapes dataset with Super-Resolution and the proposed RDSP. The predicted bboxes with confidence greater than 50% are visualized. For better visualization, a part of the image is cropped.

Furthermore, we evaluated the effectiveness of SR and RDSP on the BDD100k dataset. The quantitative results are shown in Table 2. Unlike the results on the CityScapes dataset, the table shows a significant performance drop when using SR without RDSP. This is because the BDD100k dataset contains more realistic and challenging images, including, for example, stronger blur and reflections on the front windshield. In such images, SR tends to generate more severe artifacts, exacerbating the false-positive issue. Nevertheless, even in these challenging cases, our RDSP mitigates the false-positive issue and dramatically improves overall performance.

TABLE 2 Experimental Results on the BDD100k Dataset. We Conducted Comparisons Across Various Object Detectors (i.e., SSD, RetinaNet, Faster R-CNN, FCOS, and CenterNet) With and Without SR and Our RDSP. ✓ Denotes That the Corresponding Sub-Network is Integrated Into the Full Network. Red and Blue Values Indicate the Best and the Second-Best Scores in Each Detector, Respectively

2) Model Complexity Analysis

We evaluate the model complexity of our proposed method with the SSD detector. The results are shown in Table 3. The table shows that our proposed RDSP achieves a performance improvement with reasonable additional computational cost and parameter count compared to not using RDSP. Therefore, RDSP can be considered an efficient component within the SR-based detection framework. However, compared to the original SSD, our model has a significantly higher runtime and parameter count because our detection framework has an independent detection branch at each scale. While parallelizing these branches on high-performance computational resources may mitigate the runtime, this complexity of SR-based object detection remains a limitation.

TABLE 3 Comparison of the Model Complexity of the Proposed Method

3) Comparisons With State-of-the-Art Detector

We compare our proposed framework with several state-of-the-art general object detection methods. These methods are finetuned on the CityScapes dataset from the COCO-pretrained models and training configurations provided by the respective authors. The results are shown in Table 4. The table reveals that even these SoTA methods exhibit low performance on some metrics. This highlights the difficulty posed by significant size variations and the practical challenges of the CityScapes dataset, captured by real car-mounted cameras. Our proposed method outperforms these SoTA methods, especially on the $AP_{s}$ metric, indicating the effectiveness of utilizing SR on this kind of challenging dataset.

TABLE 4 Comparisons With Other State-of-the-Art Object Detector. Red and Blue Values Indicate the Best and the Second-Best Scores, Respectively

4) Comparisons With Interpolation-Based Upscaling

The previous scale-specific detectors [5], [6] use bicubic interpolation to upscale images. Furthermore, other methods [3], [53] demonstrated that detection performance can be improved by merging the detection results of images upscaled at several factors by bicubic interpolation. Therefore, we also experiment with bicubic interpolation instead of SR. We additionally present results using a simpler method, bilinear interpolation.

The results are shown in Table 5. With RDSP, SR outperforms interpolation-based upscaling. This implies that SR has the potential to outperform interpolation when the false positives are suppressed by our proposals. On the other hand, without RDSP, interpolation-based scaling performs slightly better than SR.

In addition, we show examples of upscaled image regions with false positives by SR in Figure 9. The image upscaled by SR is sharp but has unnatural edges, which are not present in the interpolation results. These edges are a potential cause of false positives. Our proposed RDSP suppresses these false positives and thus takes advantage of SR for small object detection. With interpolation-based methods, on the other hand, the performance slightly degrades when RDSP is used. This is because RDSP masks certain regions of the image, potentially discarding information useful for detection, even though these regions do not have unnatural artifacts. Furthermore, there is no significant difference in performance between bicubic and bilinear interpolation. Since the proposed method trains the detector on upscaled images, this suggests that the subtle differences arising from the choice of interpolation method do not affect detection accuracy.

TABLE 5 Comparisons With Interpolation-Based Upscaling With SSD. ✓ Denotes That the Corresponding Architecture is Used
FIGURE 9.

Examples of upscaled image regions with false positives by SR. These images are the results of ×4 upscaling using bicubic interpolation and SR. The top images show a region of original size $48\times48$, while the bottom images show a region of original size $64\times64$. The SR images are sharp but have unnatural edges compared to the bicubic ones.

5) Ablations of RDSP Components

As described in Section III-A, the RDSP consists of the following three components: “Positional embedding,” “Scene structure embedding,” and “UNet-like backbone” (hereinafter called PE, SE, and UB, respectively). We conduct ablation studies to measure the effects of these structural components. For the ablation of UB, we remove the maxpool and up-conv layers from UB, since the motivation of UB is to capture contextual information. This allows comparison with networks that have the same number of layers but cannot capture context. The results of the ablations are shown in Table 6.

TABLE 6 Ablation Studies for RDSP Network With SSD. PE, SE, and UB Denote “Positional Embedding,” “Scene Structure Embedding,” and “UNet-Like Backbone,” Respectively. ✓ Means the Corresponding Architecture is Used. Bold Values Indicate the Performance Difference Between Using All Three Components and the Rest

The table shows that when only one of the three components is ablated, performance does not differ significantly from using all three components in any case. This suggests that the three components play similar roles in incorporating global context, although they perform different operations. On the other hand, when all three components are ablated, a significant performance drop is observed. This implies that extracting global context is essential for RDSP's estimation.

6) Learning Strategies

In our proposed method, the RDSP is pretrained with our proposed loss function $L_{scale}$ and then fine-tuned with the combination of the detection loss $L_{det}$ and $L_{scale}$, as described in Sec. III-B. In this section, we validate the effectiveness of pretraining through experiments with and without it. In addition, we experiment with and without $L_{scale}$ during fine-tuning by changing its weighting parameter $\lambda_{scale}$. The comparisons of learning strategies are shown in Table 7. The table shows that using $L_{scale}$ during fine-tuning has no positive impact, whereas pretraining with $L_{scale}$ improves the detection performance.

TABLE 7 Comparison of Learning Strategies With SSD Detector. ✓ Denotes RDSP Network is Pretrained

7) Independent Detectors for Each Scale Factor

In our proposed detection pipeline, SR images are fed into independent detectors for each scale factor, while the previous methods [5], [6] utilize shared detectors. This is because SR images upscaled by CNNs have unique features (e.g., checkerboard artifacts) that depend on their upscaling factors, and it is very difficult for a single detector to generalize across such factor-specific artifacts. To prevent the detectors from being affected by these features, we utilize an independent detector for each scaling factor. To demonstrate the effectiveness of the independent detectors, we experimented with a shared detector on SSD. This detector sharing is done by fixing the detector's weights to the pretrained model described in Section IV-B.

The quantitative results are shown in Table 8. The table shows that detector sharing has a negative effect on the detection performance on all metrics.

TABLE 8 Comparison of Detector Sharing With SSD

8) Embedding Operation

In RDSP, positional and scene structure features are embedded using an “Add” operation, as described in Sec. III-A. In this section, we explore an alternative embedding method, the “Multiply” operation, and compare the two approaches. The results are shown in Table 9.

TABLE 9 Comparison of Embedding Operation

From the table, we can see that the “Add” and “Multiply” operations yield very similar performance. This suggests that, fundamentally, the choice between these operations does not significantly impact the embedding’s effectiveness.

9) Analysis of Performance Drop in CenterNet

As shown in Table 1 and Table 2, our proposed framework performs worse when using CenterNet. We attribute this performance drop to the Deformable Convolution in CenterNet. Deformable Convolution allows flexible sampling positions and thereby extracts contextual information; since our proposed RDSP imposes a mask on the image, the sampling positions of Deformable Convolution may extend into the masked regions. Table 10 shows comparisons of CenterNet with and without Deformable Convolution. The performance degradation does not occur on CenterNet without Deformable Convolution.

TABLE 10 Comparison of CenterNet With and Without Deformable Convolution. DC Denotes Deformable Convolution, and Red Values Indicate the Best Scores in Each Detector

10) Analysis of Failure Case of Proposed Method

Figure 10 shows failure cases of the proposed method. The failure cases can be categorized into two types:

  • Limited contextual information. Since our RDSP relies on contextual information to recognize object scales, heatmap estimation fails when contextual information is limited. In the examples shown in the 1st and 2nd rows of Figure 10, the heatmaps do not activate in regions that should be estimated as x2 or x4 scales, because the entire scene is dark and contextual information is difficult to extract. Such RDSP estimation failures result in undetected objects that would be detected without RDSP.

  • Crowded scenes. Since our proposed detection framework merges the detection results from each branch using Non-Maximum Suppression (NMS), it struggles to detect overlapping objects. Therefore, while the use of SR and RDSP improves detection accuracy, undetected objects still exist, as shown in the 3rd and 4th rows of Figure 10.

FIGURE 10.

Failure cases with the proposed RDSP. The predicted bboxes with confidence greater than 50% are visualized. For better visualization, a part of the image is cropped.

Improving the first category could potentially be achieved by providing additional cues beyond the image, such as LiDAR or RADAR data. For the second category, some performance improvement might be achieved through the use of more advanced NMS, such as Soft-NMS [62] or learnable NMS [63].

SECTION V.

Concluding Remarks

This paper proposed a method for estimating object-scale proposals for scale-optimized object detection using SR. With images rescaled by the appropriate SR scaling factor, an object detector can work better than on the original-size image. A variety of experimental results validated that our proposed RDSP network can capture the rough locations of objects depending on contextual information. We qualitatively and quantitatively verified that object detectors using our scale proposals outperform those without them.

Since the proposed method can also be applied to many other computer vision tasks (e.g., human pose estimation, face detection, and human tracking) that capture tiny objects, we would like to extend our proposals to these tasks in future work.
