Journals & Magazines >IEEE Access >Volume: 12

Better Skeleton Better Readability: Scene Text Image Super-Resolution via Skeleton-Aware Diffusion Model

Inference Diagram of the Skeleton-Aware Diffusion Model (SADM) for STISR: The process begins with a low-resolution scene text image, which undergoes a skeleton generation...

Abstract:

Scene text image super-resolution (STISR) aims to enhance the resolution of text images while simultaneously improving their readability by reducing noise, blur, and othe...Show More

Metadata

Abstract:

Scene text image super-resolution (STISR) aims to enhance the resolution of text images while simultaneously improving their readability by reducing noise, blur, and other degradations. Existing diffusion-based approaches for STISR primarily rely on text-prior information but often overlook the importance of explicitly modeling the visual structure of the text. In this paper, we propose a novel Skeleton-Aware Diffusion Method (SADM) for STISR, which introduces text skeletons as structural guidance to the diffusion process. The text skeleton serves as a critical visual cue, helping the model to better restore the fine details of text, even in severely degraded low-resolution images. Generating high-quality skeletons from low-resolution scene text is a challenging task due to the inherent blurring and noise present in such images. To tackle this, we introduce a diffusion-based Skeleton Correction Network (SCN), which refines the initial skeletons produced by a convolutional neural network-based skeletonization model. The SCN effectively improves the accuracy of the skeletons, allowing for more precise structural guidance during the diffusion process. Our extensive experiments demonstrate the significant benefits of incorporating skeleton information into the STISR pipeline. The proposed SADM achieves state-of-the-art performance on the TextZoom dataset, with accuracies of 81.4%, 64.9%, and 49.6% on the easy, medium, and hard subsets, respectively, compared to the previous best results by ASTER text recognizer. Through detailed analysis, we also show that improving the quality of skeletons from low-resolution images leads to better super-resolution outcomes and enhances the performance of text recognizers.

Inference Diagram of the Skeleton-Aware Diffusion Model (SADM) for STISR: The process begins with a low-resolution scene text image, which undergoes a skeleton generation...

Published in: IEEE Access ( Volume: 12)

Page(s): 187640 - 187651

Date of Publication: 02 December 2024

Electronic ISSN: 2169-3536

DOI: 10.1109/ACCESS.2024.3510136

Contents

CCBY - IEEE is not the copyright holder of this material. Please follow the instructions via https://creativecommons.org/licenses/by/4.0/ to obtain full-text articles and stipulations in the API documentation.

SECTION I.

Introduction

Automatic text extraction is an essential component for understanding and decipherability of scenes for computers [1], [2]. Automatic text extraction makes various applications of computer vision possible, it includes assisting the visual-impaired person with non-Braille writing present in the surroundings [3], question-answering over text present in the natural images [4], captioning of images based on textual information present in an image [5]. Text-based scene understanding is dependent on text recognition [6]. Text recognition in natural images is done under the assumption of high resolution (HR) text. The SOTA scene text recognition methods struggled when handling low resolution (LR) text images. This depreciation in text recognition occurs mainly due to blurring the boundary of the text strokes or mixing of nearby character strokes and in some cases the complete characters is almost merged with the background.Some of these challenging situations are illustrated in Figure 1. Since, LR text regions are very common in real world scenarios, improving the text recognition for LR text regions are of huge importance.

FIGURE 1.

Examples of challenges in low-resolution (LR) text images: (a) Blurring that distorts text stroke boundaries; (b) Merging of nearby character strokes; and (c) Instances where entire characters blend into the background.

Show All

The most intuitive solution to improve the text recognition performance on LR text images is single image super resolution as pre-processing step. In past, various works has used single image super resolution as a pre-processing steps to improve the text recognition accuracy of pre-trained text recognizers and named this single image super resolution of scene text images as Scene Text Image Super-Resolution (STISR) [7].

In comparison to the natural image super-resolution, STISR poses unique challenges due to the structural complexity of text characters. Unlike natural images, where the primary goal is to enhance overall image quality, STISR requires the precise restoration of character shapes to ensure readability. Even minor blurring or distortions in character strokes can lead to significant drops in both visual quality and recognition accuracy. Conventional super-resolution methods, which focus primarily on pixel-wise fidelity, often struggle to preserve the intricate structural details that define characters, resulting in distorted or blurred text.

The motivation for incorporating skeletons into the STISR pipeline arises from the need to address this shortcoming. A skeleton captures the fundamental shape of a character, isolating its essential structure without being affected by noise, thickness variations, or other degradations. By focusing on this core structure, the super-resolution process can be guided to accurately restore the strokes and contours of each character. This is especially important in real-world applications, where text images are often degraded by noise, blurring, and low resolution, making traditional methods insufficient for restoring clarity.

Nowadays, diffusion models [9] have emerged as a state-of-the-art method for various computer vision tasks including text-to-image generation [10], image super-resolution [11], image inpainting [12] etc. The diffusion model also showed its impact on STISR [13], [14]. These methods conditioned the diffusion model with text-prior. However, a major bottleneck for adding text-prior is the miss-alignment of the text-prior predicted character with the spatial text stroke present in LR images. However, none of the existing methods has explored the scene text super-resolution using diffusion which does not use text labels in training. The text label-free training methods would help avoid this miss-alignment problem.

In place of the text-region condition, we consider the conditioning of text-skeleton to the diffusion method which gives more precise pixel-level conditioning to recover character strokes. The precise text-skeleton gives a more reliable guidance to the diffusion model which helps to improve the image resolution while recovering the blurred strokes of character. However, obtaining the skeleton of an LR image is itself a challenging problem. In the past, [8] has given an approach for generating a text skeleton, these skeletons are used for including a skeleton-aware loss for STISR. In this study, we observed that the diffusion model conditioned with skeleton generated by [8] (illustrated in Figure 2(b)) improves the results concerning the vanilla diffusion method (illustrated in Figure 2 (a)) for STISR. This motivates to explore the skeleton guidance on the diffusion model.

FIGURE 2.

Scene text image super-resolution based on (a) Vanilla diffusion method, (b) Diffusion model conditioned by skeleton produced by [8], (c) Skeleton-aware diffusion model (proposed) consist of skeleton correction network.

Show All

In this work, we investigated the importance of the skeleton guidance on the diffusion model for effective STISR. The diffusion conditioned by skeleton generated by [8] is taken as a baseline approach. From the detailed observation of the diffusion conditioned by skeleton produced by [8], we identified that the recovery of the strokes is not good in the case where the strokes are not generated well. The low-resolution scene text images often suffer from artifacts such as blur and noise, especially in challenging lighting conditions or due to oblique viewing angles, which obscure fine text details. Additionally, character strokes may merge or smear, especially in tightly packed text, making it difficult to differentiate between adjacent strokes or characters. Background elements can also interfere, as complex or textured backgrounds may be mistakenly, which can lead to inaccurate skeletonization and hinder the super-resolution process. This leads to the requirement of a network that aims to refine these skeletons, reduce noise, and improve structural precision. This correction is vital because accurate skeletons provide better structural guidance, which is critical for restoring character details and enhancing readability in the final output. To overcome this limitation, we introduced a stroke correction network (SCN). The role of SCN is to reduce the miss-alignment of a skeleton with the corresponding high-resolution skeleton (illustrated in Figure 2(c)). It helps to improve the skeleton of the LR image strokes and hence improves the image resolution with better readability of the textual part of the image. Since, in the past, the diffusion method showed its importance in image restoration, we posed the stroke correction as an image restoration task and used the diffusion process in SCN.

The Skeleton Correction Network (SCN) introduced in our proposed Skeleton-Aware Diffusion Method (SADM) is a key component designed to address the challenges of extracting high-quality skeletons from degraded, low-resolution text images. In real-world conditions, low-resolution scene text images often suffer from noise, blurring, and other degradations, which make it difficult to generate accurate skeletons using traditional convolutional networks. To overcome this limitation, we introduce SCN, which refines the initial skeletons generated by a standard CNN-based skeletonization network. The SCN operates in a diffusion-based framework, progressively improving the accuracy and structure of the skeletons, ensuring they more closely resemble the true character structures present in the high-resolution images. By integrating these refined skeletons into the diffusion process, our SADM uses them as structural guidance to enhance the reconstruction of text images. The refined skeletons ensure that the super-resolution process focuses on preserving and restoring the key visual cues—strokes, contours, and shapes—that define individual characters. This leads to more readable and visually accurate text, which not only improves the visual quality of the restored images but also significantly boosts the performance of downstream tasks like text recognition. This refinement of skeletons allows our method to achieve superior performance in STISR compared to approaches that rely solely on pre-defined skeletonization methods. By directly addressing the issues of noise and blur in low-resolution images, our SCN ensures that the skeletons used for guidance are of the highest quality, enabling the diffusion model to generate text images with greater clarity and accuracy.

The major contribution of the work are three fold:

To the best of our knowledge, this is the first work that introduced the guidance of text stroke information in diffusion model for STISR.
For better skeleton, we have introduced a novel diffusion-based skeleton correction network (SCN) for correcting the stroke generated from LR images.
An extensive analysis is conducted on effectiveness of the stroke conditioning on diffusion method for STISR.

The rest of the paper is organized into four sections. Section II discussed the prior work related to the proposed work. In Section III, the details of the proposed method are described. In Section IV, the experimental setup and results are discussed. Finally, the conclusion is drawn in Section V.

SECTION II.

Related Work

Scene text image super-resolution aims to increase the resolution of the scene text [15] images along with improving the readability of the text. STISR has its separate challenges compared to the natural image super-resolution. But, initially, the generic image super-resolution techniques were directly used for the recovery of character strokes. In [16], the natural image super-resolution approach SRCNN [17] was directly applied for STISR. Additionally, the usage of generative adversarial network [18], and application of attention mechanism for powerful feature expression [19], [20], [21] are common methods for STISR by generic image super-resolution approaches. For a better understanding of the text super-resolution [22] has introduced a benchmark dataset TextZoom for STISR. In [22], a network that utilizes the text property for STISR is used by proposing a TSRN method. It consists of horizontal and vertical BLSTM for extracting sequential features of text. Similar to TSRN [22], PCAN [23] also uses BLSTM for collecting text features. Another approach is to include the text structure information in the network. It includes a stroke-aware framework [24] by concentrating on stroke-level internal structures and adding text-specific skeleton losses [8]. The other approach is applying of graph attention to reduce pixel distortion caused by upsampling [25]. In addition to these, there are a set of methods that have used the text recognizer as a text prior in the model itself [26], [27], [28]. TPGSR [26] and TATT [27] have used pre-trained OCR as a text-prior in super-resolution pipeline for LR images. C3-STISR [28] has jointly used text recognizer, text images, and linguistical information via language model as three clues to guide super-resolution.

Recently, the extension of the iterative diffusion process is also explored in STISR [13], [14]. As the initial effort of applying diffusion for STISR, [13] has employed text recognizer as a semantic of LR image for the guidance of the diffusion model. Following a similar approach [14] also conditioned the diffusion model with text-prior. However, in both these methods the attempt was made to align the textual prior with the spatial strokes of text for text super-resolution. In contrast to prior diffusion-based methods, the proposed method has used the structural property of the text and used the skeleton of the text as semantic information for the guidance of the iterative diffusion process.

In previous skeleton-based approaches like [8], skeletons were used as an auxiliary tool through skeleton loss functions, helping networks align with character structures during training. However, these methods relied on pre-defined skeletonization processes and did not address the challenges of generating accurate skeletons from low-resolution text images in real-world conditions. In contrast, our proposed Skeleton-Aware Diffusion Model (SADM) introduces a Skeleton Correction Network (SCN) to dynamically refine skeletons from low-resolution images. By integrating these refined skeletons into the diffusion process, our method provides more precise and visually accurate text restoration, overcoming the limitations of static skeleton-based losses.

SECTION III.

Proposed Method

The proposed work introduces a novel approach termed as Skeleton-Aware Diffusion Model (SADM) for Scene Text Image Super-Resolution (STISR). This model leverages diffusion mechanics [10], [29] to generate high-resolution images, conditioning not only on the LR images but also on the text skeleton of the scene text image. Comprising two key components, the STISR model demonstrates a comprehensive strategy for enhancing scene text image resolution. The Skeleton Generation process is discussed in detail in Section III-B, focuses on converting the skeleton images of LR images into skeleton high-resolution (HR). These predicted skeleton HR images are crucial input for Skeleton-Aware Diffusion Model for STISR described in detail in Section III-A. It utilizes both the predicted skeleton high-resolution images and the LR images for conditioning during training and sampling phases, effectively synthesizing high-resolution outputs for scene text. This multi-faceted approach integrates both textual context and visual cues to achieve superior text super-resolution results. Figure 3 illustrates the schematic diagram of proposed method.

FIGURE 3.

A detailed diagram of the proposed Skeleton-aware diffusion model for STISR is included. It consists of two forward passes: the first pass involves the skeleton correction network (SCN) within the skeleton generation module, and the second pass involves the skeleton-aware diffusion model for STISR. The skeletonization network is utilized for generating skeletonization for both LR and HR images.

Show All

A. Skeleton-Aware Diffusion Model for STISR

The Skeleton-Aware Diffusion Model for STISR are constructed upon the foundation of denoising diffusion probabilistic models (DDPMs) [29]. It incorporates both forward and reverse processes inherent in the diffusion framework. In the forward process, Gaussian noise is systematically added to the input image, gradually obscuring its finer details until it converges into pure Gaussian noise. Our model’s forward process involves channel-wise concatenation of the input image $HR_{t}$ with conditioning input $C$ at each diffusion step $t$ . Here, $C$ encompasses the bicubic interpolated LR image and the super-resolved skeleton images $SK$ derived from the skeleton super-resolution network. Conversely, the reverse process simulates a sequence of denoising auto-encoders $\epsilon _{\theta } (I_{t}, t)$ , spanning from 1 to $T$ , aiming to predict the Gaussian noise contained at each step of the forward process, thereby progressively reconstructing the original image. To quantify the performance of the reverse process, we employ the conditional DDPM loss function

$\begin{equation*} \mathcal {L} = \mathbb {E}_{t,I_{0},\epsilon }[\|\epsilon - \epsilon _{\theta }(I_{t},t)\|^{2}] \tag {1}\end{equation*}$ View Source

Here,

$t$

is uniformly sampled from

$\{1, \ldots , T\}$

, and

$\epsilon$

is drawn from a Gaussian distribution

$\mathcal {N}(0, I)$

, with

$I_{T}$

representing a noise image with condition C at the final diffusion step.

In this super-resolution network module, we adopt the UNet architecture [30], a well-established framework for image restoration tasks [29], [31]. Following previous studies, we utilize the ResNet block [32] as the fundamental building unit within our UNet architecture [30]. However, unlike diffusion-based models where text encoding embeddings play a crucial role as conditions for the model [29], our approach operates solely on low and high-resolution image pairs, devoid of text encoding information. To adapt the UNet architecture to our task, we have removed the cross-attention layers typically employed for integrating text encoding embeddings. Since our model does not rely on text encoding information, these cross-attention layers are redundant in our context. Nevertheless, self-attention layers remain integral components within our architecture, facilitating the model’s capability to capture intricate spatial dependencies across feature maps. These self-attention mechanisms are applied at multiple stages throughout the network, enabling the extraction of relevant image features and enhancing the overall quality of the super-resolved output.

B. Skeleton Generation

Our method simplifies text characters into skeleton images and highlighting essential structural elements through skeletonization. By reducing strokes to a single-pixel width while preserving shape, this technique emphasizes the central axis of each character. To achieve this conversion of text images into skeleton images, where strokes are normalized and background texture is eliminated, we leverage a skeletonization network. The detail of skeletonization network is described in Section III-B1. Additionally, we introduce a skeleton correction network aimed at enhancing the quality of the skeleton image. This correction network is specifically designed to refine the output of the skeletonization network by learning the mapping between LR and HR skeleton images. Further details on the design and operation of the correction network are provided in Section III-B2.

1) Skeletonization Network

In the absence of image-to-skeleton pairs for supervised training, we resort to synthetic supervised data generation. Synthetic images are created by combining randomly selected background colors and fonts from a pool of 35 font types, with random text patterns also selected. To generate skeleton images (ground truth) for these synthetic images, similar [8], we employ a line width normalization network [33]. This network effectively removes backgrounds and thins lines without distorting strokes in clean text images. Figure 4 depicts samples of generated synthetic data, showcasing synthetic images alongside their corresponding synthetic skeleton images.

FIGURE 4.

Some samples of synthetic generated images with their skeleton images.

Show All

FIGURE 5.

Comparison of traditional skeleton generation and diffusion-based skeleton generation images (Proposed).

Show All

Our skeletonization network architecture mirrors that of [8], comprising two convolutional layers at the beginning and end, with intermediate layers consisting of sequential residual blocks (SRBs) [7]. Each SRB layer employs ReLU non-linearity, except for the final layer, which utilizes a Sigmoid function to constrain the output within the range [ $0, 1$ ]. To train the skeletonization network on LR images, Gaussian blur is applied to synthetic texts.

2) Skeleton Correction Network

The SCN is essential for refining initial skeletons extracted from low-resolution images, which often suffer from blur, noise, merged strokes, background interference, and complex text shapes. These issues can distort skeleton accuracy, leading to potential misguidance in the super-resolution process. The SCN addresses this by employing a diffusion-based correction to enhance skeleton fidelity, effectively capturing the true structural essence of text characters. This refinement enables the model to leverage accurate skeleton guidance, resulting in improved text clarity and overall super-resolution quality. The skeletonization network, while effective, often produces suboptimal skeleton images from LR inputs. To address this limitation, we introduce the Skeleton Correction Network (SCN), aimed at enhancing the quality of skeleton images. The SCN is a diffusion-based model trained to generate improved skeleton images by conditioning on low-quality skeleton images. The architecture is designed with a Unet structure [30], similar to that of the skeleton-aware diffusion module III-A. In the forward process of the SCN, high-resolution skeleton images $skel_{HR}$ are diffused using an incremental Gaussian noise schedule and concatenated with conditioning input $C_{k}$ i.e. $C_{k} = [I_{LR}; skel_{LR}]$ , where $I_{LR}$ represents the bicubic interpolated low-resolution image and $skel_{LR}$ denotes the skeletonized low-resolution image. Both $skel_{LR}$ and $skel_{HR}$ are generated by the skeletonization network, processing corresponding low and high-resolution images. In the reverse process, the Unet model is trained to predict the added Gaussian noise to $skel_{HR}$ . The Gaussian noise scheduling follows the approach outlined in III-A, where $t=0$ corresponds to the original $skel_{HR}$ image, and $t=T$ represents a Gaussian distribution $\mathcal {N}(0, I)$ .

During sampling from the skeleton correction model, low-resolution images and their corresponding skeleton images are used to condition the model. The noise sample ( $skel_{HR}^{T}$ ) from Gaussian distribution $\mathcal {N}(0, I)$ with conditioning is passed to the trained Unet to predict the noise and generate the corrected skeleton sample.

SECTION IV.

Experiments

A. Implementation Details

During the execution of the skeleton-aware diffusion model for STSR, distinct training settings are applied to each component. The key hyperparameters used in training the Skeleton-Aware Diffusion Model (SADM) are summarized in Table 2. The skeletonization network trains on synthetic data with a stable learning rate of 0.001 and Adam optimizer with a momentum term set to 0.9, over $70,000$ iterations. It utilizes a loss function combining Mean Squared Error and skeleton loss [8], with a weightage of 0.008 [8]. The skeleton correction and skeleton-aware diffusion models operate with a batch size of 32, with 1000 diffusion steps, and a linear noise schedule. Only standard normalization has been done on RGB images, we scaled the images within the range of [-1,1]. They employ 128 channels, channel multipliers of ( $1, 2, 4, 8$ ), 1 attention head, 2 residual blocks, and 6 self-attention layers. The learning rate is set at $3\times 10^{-4}$ , and sampling is performed using the DDIM sampler [39], running for 250 timesteps.

TABLE 1 The Proposed Method is Compared With Different Categories of Methods on Three Subsets of the TextZoom Dataset. The Results are Evaluated on the Recognition Accuracy of Three Text Recognizers: CRNN [34], MORAN [35], and ASTER [36]. BICUBIC (LR)

$\uparrow$ Indicates the Upscaled Bicubic Interpolation of LR Images, Which Serves as the Baseline Recognition Results. The Best Results are Highlighted in Bold

$Table 1- The Proposed Method is Compared With Different Categories of Methods on Three Subsets of the TextZoom Dataset. The Results are Evaluated on the Recognition Accuracy of Three Text Recognizers: CRNN [34], MORAN [35], and ASTER [36]. BICUBIC (LR) $\uparrow $ Indicates the Upscaled Bicubic Interpolation of LR Images, Which Serves as the Baseline Recognition Results. The Best Results are Highlighted in Bold$

TABLE 2 Hyperparameters Used in Training the Skeleton-Aware Diffusion Model (SADM)

B. Experimental Results

1) Dataset Description

The TextZoom dataset, introduced by Wang et al. [7], is specifically designed for text image super-resolution tasks. It contains a large collection of $21,740$ paired low-resolution (LR) and high-resolution (HR) scene text images, along with corresponding text labels. The dataset is built from two major super-resolution datasets, RealSR [40] and SRRAW [22], which were originally created for general image super-resolution tasks. These datasets were adapted to focus on scene text images, providing a tailored resource for improving and evaluating text super-resolution methods.

The LR-HR image pairs in TextZoom were captured under diverse real-world conditions, simulating the challenges faced in everyday scenarios. Different cameras, each with varying focal lengths, were used to collect these images. This variety in camera settings, lighting, and environmental conditions adds a layer of complexity to the dataset, making it a robust benchmark for evaluating super-resolution models, particularly in the context of text recognition.

The dataset is divided into two main parts: a training set and a testing set. The training set consists of $17,367$ LR-HR image pairs, providing a substantial amount of data for model training. The testing set is further categorized into three distinct subsets based on the difficulty level, which is determined by the focal length of the camera used during data collection. These subsets are labeled easy, medium, and hard, with $1,619$ , $1,411$ , and $1,343$ samples, respectively. The easy subset represents images captured under favorable conditions, such as close camera distance and good lighting, while the hard subset includes more challenging cases, such as blurred or distorted text due to longer camera focal lengths or adverse environmental conditions.

This comprehensive dataset not only serves as a valuable resource for training and evaluating text super-resolution models but also reflects the challenges of real-world applications, where factors like image quality, text orientation, and environmental noise can significantly affect the performance of text recognition systems. By utilizing such a diverse and challenging dataset, models can be trained to better generalize to various conditions, making them more robust and practical for real-world use.

2) Evaluation Metric

The experiments conducted in this study focus on evaluating the performance of the proposed model on the TextZoom dataset [7], a real-world dataset specifically designed for Scene Text Image Super-Resolution (STISR). The primary aim of the evaluation is to assess the readability of the super-resolved images and the quality of the super-resolution output. To achieve this, two types of metrics are used: (1) text recognition accuracy for readability, and (2) image quality metrics for evaluating the super-resolution performance.

For assessing readability, three widely recognized text recognizers are employed: ASTER [36], CRNN [34], and MORAN [35]. These models are capable of recognizing text in images, and their accuracy is used as the key metric to evaluate how well the super-resolved images maintain the readability of the text. Since the main goal of STISR is to generate high-quality images that allow for accurate text recognition, the accuracy of these recognizers serves as a direct measure of the effectiveness of the proposed model in real-world applications. The higher the recognition accuracy on super-resolved images, the better the model performs in preserving the textual content during the super-resolution process.

In addition to readability, the quality of the super-resolution output is evaluated using two standard image quality metrics: Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) [41]. PSNR measures the pixel-level differences between the super-resolved and ground-truth high-resolution images, providing insight into the overall reconstruction quality. Higher PSNR values indicate better fidelity of the super-resolved image to the original high-resolution image. SSIM, on the other hand, evaluates the structural similarity between the images, focusing on how well the super-resolved image preserves the structural details and overall visual quality. Both PSNR and SSIM are essential for assessing the visual clarity and structural integrity of the super-resolved images.

3) Quantitative Results

The proposed method, Skeleton-Aware Diffusion Model (SADM) for Scene Text Image Super-Resolution (STISR), is compared with different categories of STISR methods. The evaluation is conducted on all three subsets of a test set from the TextZoom dataset [7], as depicted in Table 1 and Table 3. Table 1 illustrates the accuracies of three text recognizers: ASTER [36], CRNN [34], and MORAN [35]. The generic image super-resolution methods, such as SRCNN [17], SRResNet [18], RCAN [21], SAN [19], and HAN [20], which are not specifically tailored for super-resolving scene text images, are also included for comparison. Table 1 results indicate that the proposed method consistently outperforms these methods.

TABLE 3 The Proposed Method is Compared With Different Categories of Methods on Three Subsets of the TextZoom Dataset. The rEsults are Evaluated Based on PSNR and SSIM Metrics. The Best Results are Highlighted in Bold

Further analysis of Table 1 reveals that our proposed method also surpasses text-based backbone networks for STISR, such as TSRN [7], TBSRN [24], and PCAN [23]. These networks incorporate structural awareness by utilizing stroke-level loss, such as TG [37] and skeleton loss in SA [8], for STISR tasks. Additionally, our model outperforms TSAN [25], a gradient-based graph attention method for STISR, except for the MORAN [35] recognizer on the easy test set subset. Moreover, our method achieves superior results compared to Dual Prior-based methods, such as DPMN (TG) [38], which utilizes a pretrained STISR network TG [37] with structural-aware features. DPMN (TBSRN) [38] work on a text based backbone network TBSRN [24] trained with text recognizer loss. In the comparison with diffusion-based methods, The DDPM [29] model is trained and test on textzoom dataset. our proposed method, SADM, exhibits comparable or superior accuracy to the DDPM [29] method across all text recognizers and test subsets. It is noteworthy that the compared methods utilize image pairs (LR and HR) for training without employing ground truth text labels.

The evaluation of the proposed model on super-resolution image quality is presented in Table 3. The results demonstrate significant performance improvements, with our model nearly outperforming other methods in terms of both PSNR and SSIM metrics. These detailed of Table 3 results underscore the effectiveness of our proposed approach in enhancing the quality of super-resolution images compared to existing methods.

4) Qualitative Results

In Figure 6, we present a qualitative comparison of our proposed model with several representative models from different categories: SRCNN [17], a generic image super-resolution method; TBSRN [7], a text super-resolution backbone method; and TG [37], a structure-aware model. To provide a comprehensive evaluation, we conducted comparisons across a variety of sample images with diverse characteristics, including differences in contrast, color, blurriness, and orientation, among other factors. The results in Figure 6 illustrate instances where our proposed model outperforms other methods. In the first column of Figure 6, the text recognizer (ASTER [36]) successfully identifies the high-resolution (HR) image text as ‘verilog’. However, for Non-STISR methods like Bicubic and SRCNN [17], the super-resolution of the low-resolution (LR) image is inadequate, leading to failure in recognizing even a single character correctly. Similarly, even STISR methods are unable to fully reconstruct the word correctly in this instance. In the second column, all methods struggle to produce accurate predictions. The super-resolved image quality produced by TG [37] is particularly poor, leading the text recognizer to produce a severely incorrect prediction, failing even to count the number of characters correctly. While the HR image contains the three-character word ‘arm’, the recognizer incorrectly predicts the four-character word ‘area’. The third column of Figure 6 shows another challenging case where comparative models perform poorly, with the text recognizer unable to correctly identify a single character. In the fourth and fifth columns, while the comparative models correctly predict most of the characters, they fail to recognize the complete word accurately. Finally, in the last column, which contains the number ’1927’, the Non-STISR methods perform so poorly that the recognizer mistakes the numbers for English letters. Even the STISR methods fail to accurately recognize the last number, ’7’. Figure 6 results underscore the robustness of our approach, particularly in scenarios involving significant degradation in image quality. This qualitative assessment, alongside quantitative evaluations, highlights the potential of our model as a reliable solution for text super-resolution tasks.

FIGURE 6.

High resolution image generation by various state-of-the-art methods and the proposed method. BICUBIC (LR) and HR denote the input to the methods and ground truth, respectively. Text recognition results by ASTER [36] are displayed in the images. green and red characters represent correct and incorrect recognition results, respectively.

Show All

C. Ablation Study

The ablation study is conducted to verify the effectiveness of the components of the proposed method, particularly focusing on the induced skeleton awareness. In our method, skeleton awareness is introduced through skeleton image-based conditions in the diffusion model. The skeleton generation module is responsible for generating these skeleton images for conditioning. To assess the effectiveness of our proposed diffusion-based skeleton generation module, we replaced it with a traditional skeleton generation module [8]. This traditional module, similar to the skeletonization loss network [8], is trained on pairs of LR images and high-resolution skeleton images from the TextZoom dataset. The high-resolution skeleton images are obtained from HR images using a synthetic data-trained skeletonization network [8]. Subsequently, the traditional skeleton generation module is used as a replacement for our proposed module, and the results are compared in Table 4. The comparison reveals a degradation in performance upon replacement, highlighting the effectiveness of our proposed skeleton generation module.

TABLE 4 The Ablation Study Assesses the Impact of the Skeleton Generation Module in the Proposed Skeleton-Aware Diffusion Model. Results are Evaluated Based on the Recognition Accuracy of Three Text Recognizers: CRNN [34], MORAN [35], and ASTER [36]

Additionally, our proposed method is tested by conditioning high-resolution skeleton images generated from HR images using the synthetic data-trained skeletonization network. The last row in Table 4 demonstrates a significant improvement in model performance. This study underscores the positive impact of skeleton awareness in our proposed method, emphasizing the importance of quality skeleton generation for enhancing model performance.

Figure 5 showcases a comparative analysis of skeleton images generated by the traditional method, our proposed diffusion-based method, and high-resolution skeleton images for HR inputs. A detailed observation highlights that the skeleton images produced by our proposed method offer superior clarity and structure compared to the traditional approach. In the first row of Figure 5, the skeleton image of the word “Tourist” generated using the traditional method is noticeably blurred. Specifically, the characters “ris” are distorted and resemble “na,” making it difficult to accurately recognize the word. In contrast, the image produced by our diffusion-based method retains the structural integrity of the word, with clear and well-defined characters. In the second row, the traditional skeleton generation method causes significant blurriness, resulting in the word “their” being misinterpreted as “then.” This introduces ambiguity in word recognition, which can be critical in applications requiring precise text interpretation. On the other hand, our proposed method avoids this issue by generating a skeleton image with well-preserved letter forms, allowing for accurate word recognition. Similarly, in the last row of Figure 5, the traditional method fails to clearly represent the digit “3,” causing it to appear vague and indistinct. This could lead to misinterpretation or loss of information in scenarios requiring numerical accuracy. However, the skeleton image generated by our method distinctly captures the digit “3,” maintaining both its shape and readability. Overall, the proposed diffusion-based method consistently outperforms the traditional skeleton generation approach by preserving finer details, reducing blurriness, and enhancing the clarity of both letters and numbers across various samples.

D. Discussion

1) Impact on Low Resolution License Plate Recognition

One practical application of Scene Text Super-Resolution (STISR) is license plate recognition. Factors such as motion blur, poor focus, and varying illumination (e.g., low light or glare) significantly impact Optical Character Recognition (OCR) accuracy, especially under adverse weather conditions like rain or fog. Additional complications arise from variations in plate format, dirt or damage, and partial occlusions. Furthermore, challenges such as oblique camera angles and distance variations affect image resolution, making accurate license plate OCR challenging [42]. To assess the efficiency of our proposed Skeleton-Aware Diffusion Method (SADM) for Scene Text Super-Resolution, we conducted tests on license plate images using a model trained on the TextZoom dataset, without additional training on license plate datasets. We degraded these images by applying Gaussian blur and motion blur kernels to simulate low-resolution scenarios. Figure 7 presents the quantitative results.

In Figure 7(a), low-resolution (LR) images were generated by applying a gaussian blur kernel with a radius of 3, simulating conditions where the vehicle is stationary, but the camera quality is inadequate or taken from a considerable distance, causing Gaussian blur. The first column shows a standard license plate image without occlusions or defects, captured with the correct camera angle and ideal illumination. In the corresponding LR image, the Gaussian blur causes character “B” to resemble “8.” Our model significantly reduces this blur, enhancing character clarity and improving legibility. In the second column, the license plate is captured at an oblique angle with illumination variations, including glare across the plate. This causes character strokes to appear indistinct in the LR image. Our model successfully addresses these issues, producing clearer, more readable text. The final column of Figure 7(a) presents a license plate with defects, posing additional challenges for OCR in recognizing the characters “8” and “4” in the LR image. Our method mitigates these issues, enhancing readability despite the plate’s imperfections.

FIGURE 7.

License plate recognition results using scene text super-resolution. Each row compares Low-Resolution (LR) license plate images with results from our proposed method (Our) and High-Resolution (HR) ground truth images. (a) Gaussian-blurred LR images (radius =3) simulate stationary vehicles with poor camera quality or long-distance capture. Our method effectively reduces blur, enhancing readability across standard plates, plates with glare, and those with defects. (b) Motion-blurred LR images (kernel size =10) replicate challenges of moving vehicles. Our model restores clarity and continuity in distorted strokes and overlapping characters, with minor limitations on untrained Chinese characters.

Show All

Figure 7(b) illustrates low-resolution images created with a motion blur kernel of size 10, replicating conditions encountered when capturing license plates of moving vehicles. In the first column, character strokes in the LR image are highly distorted due to motion blur and illumination variations. Although our model successfully restores English and numeric characters, it struggles with Chinese characters, likely due to the absence of such characters in the TextZoom training dataset. The second column highlights a scenario where motion blur causes discontinuities in character strokes, as observed in the character “6.” Our method successfully restores the continuity of these strokes. In the last column, the motion blur causes characters “00” to overlap in the LR image. While our model handles English characters well, treating Chinese characters as part of the background, it successfully resolves character overlap, producing a clearer and more readable result. These results demonstrate the robustness of our SADM model in enhancing license plate readability under various challenges, including blur, illumination variation, and occlusions, making it a considerable tool for real-world license plate recognition applications.

2) Discussion on Computational Complexity

The computational complexity per step of the proposed Skeleton-Aware Diffusion Model (SADM) can be analyzed based on its high demands in terms of both operations and parameters. Specifically, each inference step requires 68.22 GMac (giga multiply-accumulate operations), underscoring the considerable computational resources necessary for high-resolution text image super-resolution. Additionally, the model contains 194.78 million parameters, reflecting the architectural complexity required to capture the fine details of text within low-quality images. The total time for a single inference step is 483.602 ms, which demonstrates an effective balance between computational cost and performance in enhancing text clarity, even under challenging input conditions.

In designing SADM, we acknowledge the computational demands associated with diffusion models. To address this, we implement the DDIM [39] sampler with 250 time steps, a configuration that provides a well-optimized trade-off between cost and performance. Importantly, our model architecture is designed to accommodate future advancements in sampling efficiency, as the sampler operates as an independent module. This design choice allows us to integrate any improvements in sampling methods, such as reducing time steps or utilizing more efficient sampling techniques, directly into our framework without modifying the core model architecture. Such updates would result in immediate reductions in computational cost while preserving model performance.

The field of diffusion models is rapidly advancing, with significant research dedicated to enhancing computational efficiency. We recognize that these developments could enable SADM to scale effectively for larger datasets and potentially support real-time applications. Exploring these optimizations is a priority for our future work, as they hold promise for reducing inference time and computational burden, further expanding the practical applications of SADM in real-world scenarios.

SECTION V.

Conclusion and Future Work

In this paper, we present a novel approach skeleton-Aware Diffusion Model for STISR. Our model integrates the visual structure of text via conditioning the super-resolution diffusion process via skeleton images. These skeleton images are generated by a diffusion-based skeleton generation module. Extensive experiments and ablation studies demonstrate the effectiveness of image skeletons in improving the performance of the diffusion model by improving both image quantity and the performance of the downstream recognition task. This work highlights the critical role that visual structure plays in STISR and introduces a new avenue for future research by demonstrating the potential of skeleton-aware approaches. The integration of skeleton correction and diffusion in our SADM provides a robust framework that not only improves image quality but also boosts text readability, making it a powerful solution for real-world applications where clarity and accuracy of text are paramount. An in-depth study of the SADM helps to conclude that better skeletonization may lead to better STISR which leads to better text readability by text recognizer.

However, the proposed approach has some limitations. One of the main challenges is the lack of access to real ground-truth skeletons images for training. The skeleton images used in this work are generated by a skeletonization network trained on synthetic data, which may not fully capture the complexity of real-world text structures. This reliance on synthetic data could affect the accuracy of skeletons in more complex or noisy text scenarios. In future work, addressing this limitation by developing methods that can either obtain real-world skeleton annotations or improve the skeletonization process with minimal reliance on synthetic data could further enhance the performance of our approach. Exploring such solutions could lead to more accurate skeletons images, resulting in even better super-resolution results and higher text readability.

References is not available for this document.

Better Skeleton Better Readability: Scene Text Image Super-Resolution via Skeleton-Aware Diffusion Model

Abstract:

Metadata

Abstract:

Introduction

Related Work

Proposed Method