I. Introduction
Text mining in scene images extracts much meaningful information, such as milestones, landmarks, product names, and shop titles. The computer vision community has therefore paid increasing attention to perceiving text instances for better scene understanding. Scene text detection is the task of locating all text occurrences in a scene image and confining each of them in a bounding box. It has gained immense importance owing to its wide utility in real-world applications such as real-time traffic sign recognition, blind navigation assistance, social networks, autonomous navigation, and multilingual translation [1]–[3]. Nevertheless, it remains a challenging problem because text regions span a wide range of scales, orientations, colors, aspect ratios, and scripts. Text instances may also be horizontal, oriented, curved, or arbitrarily shaped, as shown in Fig. 1, which increases the regression complexity of precisely locating text regions. Moreover, scene images often contain noise, which degrades image contrast and faints text edges, making detection even more complex. Deep neural networks have addressed many of the former challenges, but the latter ones are still in a preliminary phase of research. It is therefore important to design text detection models that account for poor contrast and faint edges in scene images; these conditions increase the difficulty of detection, as shown in Fig. 1 (first four rows).
Schechner et al. [4], Nayar and Narasimhan [5], and He et al. [6] studied hazy scene images to reduce the impact of bad weather conditions on scene images. However, they did not aim to perform text, face, or object detection from such hazy scene images. The only available work [7] performs text detection in the presence of haze as a postprocessing step: probable text regions are dehazed using handcrafted features, and text instances are then detected.
Thus, detection performed directly on noisy scene images is still in its elementary phase.
Columns (b) and (c) show the texts detected in the original images of column (a) using CRAFT [8] and our SESANet, respectively. The first four rows show text regions with faint edges due to low illumination, poor contrast, and image filtering. The last two rows illustrate text instances with a wide range of scale, orientation, and aspect ratio. Text instances can be horizontal, oriented, or curved (fifth row). (a) Original. (b) CRAFT. (c) SESANet.