1. INTRODUCTION
Scene text detection is a challenging computer vision task with a wide range of practical applications, including document analysis and autonomous driving. Some recent methods [1]–[7] first detect fundamental elements, such as individual text parts or characters, and then aggregate these elements into complete text instances. SegLink [1] and its variant SegLink++ [2] detect local segments of a text and link adjacent segments to form the final text. DRRG [3] further improves SegLink by using a graph convolutional network (GCN [4]) to infer the linkage relationships between text segments. CRAFT [5] takes characters as the fundamental elements and uses the affinities between them to aggregate detected characters. DB [6] and DBNet++ [7] follow a segmentation pipeline, predicting text pixels with an adaptive binarization method. These methods localize local units accurately and allow a flexible representation of text boundaries.