
DTTR: Detecting Text with Transformers



Abstract:

Recently, transformer-based approaches have achieved considerable success on vision tasks, often surpassing convolutional neural networks (CNNs). In this paper, we present a novel transformer-based model, named Detecting Text with TRansformers (DTTR), for scene text detection. In DTTR, a CNN backbone extracts local connectivity features, while a transformer decoder effectively captures global context information from the scene text image. In addition, we propose a dynamic scale fusion (DSF) module that fuses multiscale feature maps dynamically, significantly improving scale robustness and yielding powerful representations for subsequent decoding. Experimental results show that DTTR achieves a 0.5% H-mean improvement and 20.0% faster inference than the SOTA model with a ResNet-50 backbone on MMOCR. Code will be released at: https://github.com/ahsdx/DTTR.
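The abstract does not spell out the DSF design, but one way to read "fuse multiscale feature maps dynamically" is a module that aligns the per-scale maps to a common resolution and combines them with input-dependent weights. The following is a minimal sketch under that assumption only; the class name, the 1x1 projection convolutions, and the pooled-descriptor weight head are illustrative choices, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicScaleFusion(nn.Module):
    """Hypothetical sketch of a dynamic scale fusion module: project each
    scale to a common channel width, resize to the finest resolution, and
    combine with per-scale weights predicted from the inputs themselves
    (hence "dynamic" rather than a fixed sum or concatenation)."""

    def __init__(self, in_channels, out_channels, num_scales):
        super().__init__()
        # 1x1 convolutions align the channel dimension across scales
        self.proj = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels]
        )
        # small head predicting one fusion weight per scale
        self.weight_head = nn.Linear(out_channels * num_scales, num_scales)

    def forward(self, feats):
        # feats: list of [B, C_i, H_i, W_i] maps, finest resolution first
        target_size = feats[0].shape[-2:]
        aligned = [
            F.interpolate(p(f), size=target_size, mode="bilinear",
                          align_corners=False)
            for p, f in zip(self.proj, feats)
        ]
        # global descriptors drive the input-dependent scale weights
        desc = torch.cat(
            [F.adaptive_avg_pool2d(a, 1).flatten(1) for a in aligned], dim=1
        )                                                  # [B, C * S]
        w = torch.softmax(self.weight_head(desc), dim=1)   # [B, S]
        return sum(w[:, i, None, None, None] * a for i, a in enumerate(aligned))

For a ResNet-50 backbone, feats would typically be the C2-C5 stage outputs, e.g. DynamicScaleFusion(in_channels=[256, 512, 1024, 2048], out_channels=256, num_scales=4).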
Date of Conference: 04-10 June 2023
Date Added to IEEE Xplore: 05 May 2023
Conference Location: Rhodes Island, Greece


1. INTRODUCTION

Scene text detection is a challenging computer vision task with a wide range of practical applications, including document analysis and autonomous driving. Some recent methods [1]–[7] first detect fundamental elements, such as individual text parts or characters, and then aggregate these elements into complete text instances. SegLink [1] and its variant SegLink++ [2] detect local segments of a text and link adjacent segments to form the final text. DRRG [3] further improves SegLink by using a graph convolutional network (GCN [4]) to infer the linkage relationships between text segments. CRAFT [5] takes characters as the fundamental elements and explores their affinities to aggregate detected characters. DB [6] and DBNet++ [7] follow a segmentation pipeline, predicting text pixels with an adaptive binarization method. These methods localize local units accurately and offer a flexible representation of text boundaries.
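For concreteness, the adaptive binarization of DB [6] is a differentiable approximation of the step function: given a predicted probability map P and threshold map T, the binary map is B = 1 / (1 + exp(-k(P - T))) with an amplification factor k (50 in the DB paper). A minimal PyTorch sketch:

import torch

def differentiable_binarization(prob_map, thresh_map, k=50.0):
    # Approximate binarization from DB [6]: B = 1 / (1 + exp(-k (P - T))).
    # The steep sigmoid (k = 50 in the paper) behaves like hard
    # thresholding while keeping the binarization step differentiable,
    # so the threshold map T can be learned jointly with P.
    return torch.sigmoid(k * (prob_map - thresh_map))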

