
YOLOH: You Only Look One Hourglass for Real-Time Object Detection


Abstract:

Multi-scale detection based on Feature Pyramid Networks (FPN) has been a popular approach in object detection to improve accuracy. However, using multi-layer features in the decoder of FPN methods entails performing many convolution operations on high-resolution feature maps, which consumes significant computational resources. In this paper, we propose a novel perspective for FPN in which we directly use fused single-layer features for regression and classification. Our proposed model, You Only Look One Hourglass (YOLOH), fuses multiple feature maps into one feature map in the encoder. We then use dense connections and dilated residual blocks to expand the receptive field of the fused feature map. This output not only contains information from all the feature maps, but also has a multi-scale receptive field for detection. Experimental results on the COCO dataset demonstrate that YOLOH achieves higher accuracy and better run-time performance than established detector baselines: for instance, it reaches an average precision (AP) of 50.2 with a standard 3× training schedule, and 40.3 AP at 32 FPS with a ResNet-50 backbone. We anticipate that YOLOH can serve as a reference for researchers designing real-time detectors in future studies. Our code is available at https://github.com/wsb853529465/YOLOH-main.
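The two ideas the abstract describes, collapsing a feature pyramid into a single fused map and then enlarging that map's receptive field with stacked dilated convolutions, can be sketched in a few lines. The shapes, channel count, and dilation rates below are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of a (C, H, W) feature map."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse_to_single_level(feats, target_hw):
    """Resize every pyramid level to one target resolution and sum them,
    yielding a single map that carries information from all levels
    (a stand-in for the encoder fusion step described in the abstract)."""
    fused = np.zeros((feats[0].shape[0], *target_hw))
    for f in feats:
        factor = target_hw[0] // f.shape[1]
        fused += upsample_nearest(f, factor)
    return fused

def receptive_field(kernel=3, dilations=(1, 2, 4, 8)):
    """Receptive field of a stack of stride-1 dilated convolutions:
    RF = 1 + sum((kernel - 1) * d) over the stack. The dilation rates
    here are assumed for illustration only."""
    return 1 + sum((kernel - 1) * d for d in dilations)

# Toy pyramid: C3 (stride 8), C4 (stride 16), C5 (stride 32) of a 64x64 input.
c3 = np.ones((256, 8, 8))
c4 = np.ones((256, 4, 4))
c5 = np.ones((256, 2, 2))

fused = fuse_to_single_level([c3, c4, c5], target_hw=(8, 8))
print(fused.shape)        # (256, 8, 8): one map, all levels fused
print(receptive_field())  # 31: four 3x3 convs with dilations 1, 2, 4, 8
```

The receptive-field arithmetic shows why dilated residual blocks are attractive here: four stride-1 layers already cover a 31-pixel window on the fused map, so a single-level head can still "see" objects at multiple scales.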
Published in: IEEE Transactions on Image Processing ( Volume: 33)
Page(s): 2104 - 2115
Date of Publication: 12 March 2024

PubMed ID: 38470577

I. Introduction

Object detection, which aims to locate and classify various objects, is a core task in computer vision. Following the success of Convolutional Neural Networks (CNNs) [1], numerous object detection models [2], [3] have achieved great progress on benchmark datasets [4], [5]. However, these models rely only on the last layer of the backbone, which provides limited features for detection because of its low-resolution feature map.
