DTTR: Detecting Text with Transformers

Abstract:

Recently, transformer-based approaches have achieved considerable success on vision tasks, even surpassing approaches based on convolutional neural networks (CNNs). In this paper, we present a novel transformer-based model for scene text detection, named Detecting Text with Transformers (DTTR). In DTTR, a CNN backbone extracts local connectivity features, and a transformer decoder effectively captures global context information from the scene text. In addition, we propose a dynamic scale fusion (DSF) module that fuses multiscale feature maps dynamically, significantly improving scale robustness and yielding powerful representations for subsequent decoding. Experimental results show that DTTR improves H-mean by 0.5% and is 20.0% faster at inference than the SOTA model with a ResNet-50 backbone on MMOCR. Code will be released at: https://github.com/ahsdx/DTTR.
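
For readers unfamiliar with the metric, H-mean in text detection benchmarks is conventionally the harmonic mean of detection precision and recall. The following is a minimal sketch of that formula only; the function name `h_mean` and the input values are illustrative assumptions, not figures from the paper.

```python
def h_mean(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (also called F-measure)."""
    # Guard against division by zero when nothing is detected or matched.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical precision/recall values, chosen only to show the call.
example = h_mean(0.9, 0.8)
```

Because the harmonic mean is dominated by the smaller of its two inputs, a detector cannot reach a high H-mean by trading away recall for precision or the reverse.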

Published in: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Date of Conference: 04-10 June 2023

Date Added to IEEE Xplore: 05 May 2023

DOI: 10.1109/ICASSP49357.2023.10096961

Conference Location: Rhodes Island, Greece

1. INTRODUCTION

Scene text detection is a challenging computer vision task with a wide range of practical applications, including document analysis and autonomous driving. Some recent methods [1]–[7] first detect fundamental elements, such as individual text parts or characters, and then aggregate these elements to form a complete text instance. SegLink [1] and its variant SegLink++ [2] detect local segments of a text and link adjacent segments to form the final text. DRRG [3] further improves on SegLink by using a graph convolutional network (GCN [4]) to infer the linkage relationships between text segments. CRAFT [5] takes characters as the fundamental elements and explores their affinities to aggregate detected characters. DB [6] and DBNet++ [7] follow a segmentation pipeline, predicting text pixels with an adaptive binarization method. These methods localize local units accurately and offer a more flexible representation of text boundaries.
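
As an illustration of the adaptive binarization used by DB [6], the sketch below applies the differentiable binarization step as it is commonly described for that method: a steep sigmoid of the difference between a predicted probability map and a predicted per-pixel threshold map. The function name, the nested-list map layout, and the sample values are assumptions made for this example, and the amplification factor `k = 50` is the value usually quoted for DB rather than something stated in this paper.

```python
import math


def differentiable_binarization(prob_map, thresh_map, k=50.0):
    """Approximate binary map: 1 / (1 + exp(-k * (P - T))) per pixel.

    prob_map and thresh_map are same-shaped nested lists with values in [0, 1].
    """
    return [
        [1.0 / (1.0 + math.exp(-k * (p - t))) for p, t in zip(p_row, t_row)]
        for p_row, t_row in zip(prob_map, thresh_map)
    ]


# Hypothetical 2x2 maps: top-left pixel is confidently text, top-right is not.
probs = [[0.9, 0.1], [0.3, 0.3]]
thresholds = [[0.3, 0.3], [0.3, 0.3]]
binary = differentiable_binarization(probs, thresholds)
```

Pixels whose probability exceeds the local threshold are pushed toward 1 and the rest toward 0, while the function stays differentiable so that the threshold map can be learned together with the probability map.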
