I. Introduction
Scene text detection in the wild is a fundamental task in computer vision and underpins numerous applications [1], [2], [3], [4], [5], [6], such as autonomous driving, document analysis, and image understanding. The goal of text detection is to localize each text instance in an input image with a bounding box. Current leading approaches are mainly extended from object detection [7], [8] or segmentation frameworks [9], and can be grouped into regression-based methods [6], [10], [11], [12], segmentation-based methods [4], [5], [11], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], and hybrid methods [23], [24], [25], [26], [27]. However, these methods may struggle in more complicated cases, and the performance of state-of-the-art methods still falls short of the demands of real-world applications.