Introduction
White-light interference microscopy, also known as white-light interferometry (WLI), is a nondestructive, ultraprecise, high-speed measurement technique that has revolutionized the field of precision measurement and remains indispensable [1]. WLI offers extremely high vertical (axial) resolution [2], comparable to that of scanning probe microscopy, and has evolved into a standard technique for measuring surface roughness. However, the lateral resolution (image resolution) of conventional WLI is relatively poor [3]. Surface topography reconstruction is achieved in WLI by means of fast scanning. To suppress external disturbances as much as possible while ensuring high resolution (HR) in the vertical direction, the scanning speed used in practical applications is very high; consequently, low-resolution (LR) images are generally collected. These LR images limit the lateral resolution of WLI. Therefore, improving image resolution while maintaining high-speed scanning is a key concern in the development of interferometry.
One way to improve resolution is to use transparent microspheres. Although previous experiments [4], [5] have confirmed the validity of this approach, it still has many limitations, such as an unclear imaging mechanism, a reduced field of view, a risk of damage to the sample, and complex sample preparation. As such, the microsphere method is time-consuming and costly and requires additional hardware. As an alternative, we aim to develop a lower-cost, more time-efficient, image-based approach to improving image resolution.
In recent years, techniques based on deep learning have made great progress in various fields [6]–[16]. Among them, single image super-resolution (SISR) has attracted considerable attention for both academic and commercial applications. SISR has been widely used in many fields, such as medical diagnosis [17], remote sensing [18], synthetic aperture radar (SAR) [19], [20], and face completion [21], and has achieved remarkable results. Consequently, SISR is a potentially suitable means of improving interference image resolution.
As one of the key techniques, generative adversarial networks (GANs) have greatly promoted SISR development. A GAN is composed of a generator and a discriminator and can produce realistic data through a zero-sum game [22]. Researchers have conducted substantial research on GANs to improve super-resolution performance. To pursue deeper networks and more complex features, the residual block (RB) [23] structure was developed; unlike a traditional convolutional neural network (CNN), it uses skip connections between layers, thereby guaranteeing that the shallow-layer parameters can still be updated effectively. A residual network (ResNet) is composed of multiple RBs and can achieve higher accuracy by virtue of its far greater depth. The DenseNet architecture preserves the best features of ResNet while offering additional innovations that further improve network performance [24]. DenseNet uses thin layers to enable feature reuse, the effectiveness of which has been well proven in image classification and detection [25], [26]. DenseNet has been used as the main architecture for both generators and discriminators in musculoskeletal quality evaluation [27] and medical image super-resolution [26], demonstrating that DenseNet offers stronger feature extraction capabilities in GANs.
To extract even more abundant features, the residual dense block (RDB) structure has been proposed [28]. By combining the characteristics of the RB and dense block (DB) structures, the RDB structure ensures the extraction of more complex features and the integration of features from all layers. Accordingly, various studies on improving super-resolution (SR) performance are emerging. The GAN, DenseNet [25] and RDB [29] architectures have all been applied to achieve SR. Moreover, the increase in the receptive field enabled by dilated convolution can improve SR performance [29]. For a GAN, the least squares function can be used as the discriminator loss to stabilize the training process [30].
One of the milestones achieved in the pursuit of realistic image details is the super-resolution generative adversarial network (SRGAN) architecture [31], which is based on a GAN. In SRGAN, ResNet is used as the basic model of the generator network, and in the loss function, the mean square error (MSE) content loss is replaced with a novel perceptual loss computed on feature maps of the VGG network. It has been verified that the SRGAN approach greatly improves the quality of SR image reconstruction [32]. SRGANs are widely used in the super-resolution field [26], [33], [34].
An improved version of SRGAN, called the enhanced super-resolution generative adversarial network (ESRGAN) model [35], has also been developed. The improvements are as follows: the basic generator model is replaced with one based on residual-in-residual dense blocks (RRDBs), the relativistic average generative adversarial network (RaGAN) architecture is adopted to enhance the discriminator, and features extracted before activation are employed. ESRGAN won first place in the PIRM2018-SR Challenge, thus proving that this network exhibits state-of-the-art performance.
Our purpose in this study is to develop a deep-learning-based super-resolution method for interference microscopy images. Based on ESRGAN, we propose an interference image super-resolution (IISR) model to further improve the visual quality of interference microscopy images. The main contributions of this paper are as follows:
To filter out more realistic images, the discriminator is implemented on the basis of a modified DenseNet architecture. To obtain more features to be used in the discriminator, the pooling layers in DenseNet are replaced with dilated convolutional layers.
Perceptual loss is calculated by DenseNet, instead of the Visual Geometry Group (VGG) network. The parameters for calculating generator loss are reassigned. Content loss uses a new evaluation function to increase robustness and adaptability.
We detail the network architecture in Section II, including the generator, discriminator and perceptual loss. A quantitative evaluation is presented in Section III, including an ablation study and comparative experiments. Section IV concludes the paper.
Proposed Approach
A. Network Architecture
The proposed IISR model is based on a GAN structure consisting of a generator and a discriminator. The image produced by the trained generator should fool the discriminator as much as possible, i.e., convince it that the generated image is real. At the same time, the discriminator improves its ability to distinguish real images from generated ones through training. In this way, a zero-sum game is established that continuously drives the joint improvement of the generator and discriminator, allowing the generator to produce more realistic images with more details.
As Fig. 1 shows, the structure of the generator is mainly based on RRDBs, thus allowing it to generate sufficiently realistic images. Each RRDB is composed of 3 RDBs. The first four convolutional layers in each RDB have 3*3 kernels with 23 filters, and the last one has a 3*3 kernel with 64 filters. The input LR image is first passed through a convolutional layer with a 3*3 kernel with 23 filters, followed by a total of 23 RRDBs. Then, subpixel convolutional layers are applied to upsample the image by a factor of 4.
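The following is a minimal PyTorch sketch of the RDB and RRDB building blocks described above. The residual scaling factor of 0.2 and the LeakyReLU activations follow the public ESRGAN implementation and are assumptions here; the channel counts (a 64-channel trunk with a growth of 23 channels inside each RDB) follow the values stated above and should be treated as illustrative.

```python
import torch
import torch.nn as nn

class ResidualDenseBlock(nn.Module):
    """Five 3*3 convs with dense connections; the last conv restores nf channels."""
    def __init__(self, nf=64, gc=23):
        super().__init__()
        self.conv1 = nn.Conv2d(nf,          gc, 3, 1, 1)
        self.conv2 = nn.Conv2d(nf + gc,     gc, 3, 1, 1)
        self.conv3 = nn.Conv2d(nf + 2 * gc, gc, 3, 1, 1)
        self.conv4 = nn.Conv2d(nf + 3 * gc, gc, 3, 1, 1)
        self.conv5 = nn.Conv2d(nf + 4 * gc, nf, 3, 1, 1)
        self.lrelu = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x):
        c1 = self.lrelu(self.conv1(x))
        c2 = self.lrelu(self.conv2(torch.cat((x, c1), 1)))
        c3 = self.lrelu(self.conv3(torch.cat((x, c1, c2), 1)))
        c4 = self.lrelu(self.conv4(torch.cat((x, c1, c2, c3), 1)))
        c5 = self.conv5(torch.cat((x, c1, c2, c3, c4), 1))
        return x + 0.2 * c5  # residual scaling as in ESRGAN (assumed)

class RRDB(nn.Module):
    """Residual-in-residual dense block: three RDBs plus an outer residual connection."""
    def __init__(self, nf=64, gc=23):
        super().__init__()
        self.rdbs = nn.Sequential(*[ResidualDenseBlock(nf, gc) for _ in range(3)])

    def forward(self, x):
        return x + 0.2 * self.rdbs(x)
```

Stacking 23 such RRDBs and following them with subpixel (PixelShuffle) upsampling layers yields the 4x generator outlined above.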
B. Discriminator
The function of the discriminator is to distinguish the differences between generated and real images, thus allowing it to judge image authenticity. The discriminator is based on the relativistic GAN model, which is used to predict the probability that fake data will be less realistic than real data. There are stringent requirements on the performance of the discriminator. Accordingly, the ability of the network to extract more effective information needs to be further improved. As shown in Fig. 2, a modified DenseNet architecture with dense connections, which has a stronger ability to extract features than the VGG network, is used in the discriminator network. The input image is first passed through a convolutional layer with a 7*7 kernel with 64 filters and a dilated convolutional layer (rate = 2) with a 3*3 kernel. The resulting feature maps then enter the modified DenseNet, in which the pooling layers are replaced with dilated convolutional layers, followed by a set of dense layers and a sigmoid activation function to evaluate the authenticity of the image.
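A schematic PyTorch sketch of this discriminator pipeline and of a relativistic (RaGAN-style) discriminator loss is given below; the DenseNet backbone of Fig. 2 is abbreviated to a stand-in module, and all layer widths and activations beyond those stated above are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DenseNetDiscriminator(nn.Module):
    """Head convs -> modified DenseNet backbone (abbreviated here) -> dense layers."""
    def __init__(self, in_ch=3, base_ch=64):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_ch, base_ch, kernel_size=7, padding=3),               # 7*7 conv, 64 filters
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base_ch, base_ch, kernel_size=3, dilation=2, padding=2),  # 3*3 dilated conv, rate = 2
            nn.LeakyReLU(0.2, inplace=True),
        )
        # Stand-in for the modified DenseNet of Fig. 2 (dense blocks + dilated-conv transitions).
        self.backbone = nn.Sequential(
            nn.Conv2d(base_ch, base_ch * 2, kernel_size=3, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(base_ch * 2, 100),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(100, 1),  # raw score; the sigmoid is applied inside the relativistic loss
        )

    def forward(self, x):
        return self.classifier(self.backbone(self.head(x)))

def relativistic_d_loss(d_real, d_fake):
    """Relativistic average discriminator loss: real samples should score higher
    than the average fake sample, and fake samples lower than the average real one."""
    bce = nn.BCEWithLogitsLoss()
    loss_real = bce(d_real - d_fake.mean(), torch.ones_like(d_real))
    loss_fake = bce(d_fake - d_real.mean(), torch.zeros_like(d_fake))
    return loss_real + loss_fake
```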
The modified DenseNet is composed of 4 dense blocks and intermediate transition layers. Each dense block consists of multiple dense layers, and all dense layers within the same block have the same structure. As shown in Fig. 2, the first dense block contains six dense layers; the superscript of each dense block indicates its number of dense layers. Dense connections are applied between the layers: within a block, all layers are directly connected to one another, and, following the feedforward approach, each layer takes the outputs of all preceding layers as its input. The output of the $l$-th layer is therefore \begin{equation*} f_{l}=C_{l}([f_{0},f_{1},\ldots,f_{l-1}])\tag{1}\end{equation*} where $C_{l}$ denotes the nonlinear transformation of the $l$-th layer and $[f_{0},f_{1},\ldots,f_{l-1}]$ denotes the concatenation of the feature maps produced by layers $0$ to $l-1$.
Based on this design, DenseNet concatenates features rather than simply summing them. As Fig. 3 shows, the process of feature extraction in a CNN is linear: the features in each layer are extracted solely from the features of the previous layer, so each deeper feature is built only on the immediately preceding shallower features. By contrast, DenseNet is a highly branched network in which the features of all preceding layers are available to each subsequent layer. The ability to consider different combinations of layers makes the feature representation richer, leading to more diverse features and higher feature utilization.
The continuous concatenation process results in the combination of an increasing number of features. To keep the number of features manageable and effective, a growth rate parameter and bottleneck layers are used. The growth rate $k$ controls the number of feature maps output by each layer: every dense layer produces only $k$ new feature maps, so the number of feature maps input to the $l$-th layer is \begin{equation*} k_{l}=k_{0}+k\ast (l-1)\tag{2}\end{equation*} where $k_{0}$ is the number of channels of the block input. In addition, bottleneck layers (1*1 convolutions placed before the 3*3 convolutions) limit the number of feature maps entering each convolution.
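As an illustration of (1) and (2), a minimal dense layer and dense block can be written in PyTorch as follows; the BN-ReLU-Conv ordering and the 4k bottleneck width follow the standard DenseNet-BC design and are assumptions rather than the exact configuration of our network.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """BN-ReLU-Conv(1*1 bottleneck)-BN-ReLU-Conv(3*3); emits k (= growth_rate) feature maps."""
    def __init__(self, in_ch, growth_rate):
        super().__init__()
        self.layers = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, 4 * growth_rate, kernel_size=1, bias=False),  # bottleneck
            nn.BatchNorm2d(4 * growth_rate), nn.ReLU(inplace=True),
            nn.Conv2d(4 * growth_rate, growth_rate, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x):
        # Eq. (1): concatenate the new k feature maps with all preceding ones
        return torch.cat([x, self.layers(x)], dim=1)

class DenseBlock(nn.Module):
    def __init__(self, num_layers, in_ch, growth_rate):
        super().__init__()
        # Eq. (2): the l-th layer receives k0 + k*(l-1) input feature maps
        self.block = nn.Sequential(*[
            DenseLayer(in_ch + growth_rate * l, growth_rate) for l in range(num_layers)
        ])

    def forward(self, x):
        return self.block(x)
```

For a block whose input has $k_0$ channels, the $l$-th DenseLayer thus receives $k_0 + k(l-1)$ channels, matching (2), and the block outputs $k_0 + k \cdot \text{num\_layers}$ channels, which the following transition layer then compresses.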
Since all feature maps within a dense block have the same spatial size (a requirement for concatenation), downsampling cannot be applied inside a block as in an ordinary CNN. Therefore, a transition layer, which reduces the feature map size through convolution and pooling (Pool), is placed between each pair of adjacent dense blocks. This allows the network to proceed in the “depth” direction.
The original transition layer has a Conv-Pool structure. In the IISR model, however, the pooling layers in DenseNet are replaced with dilated convolutional layers. Applying a dilated convolution with a dilation rate r is equivalent to inserting r-1 zeros between adjacent elements of the convolution kernel, so a k*k kernel acquires an effective size of k+(k-1)(r-1) without introducing additional parameters.
The main purpose of a pooling operation in a CNN is to quickly expand the receptive field. However, pooling results in a loss of feature information, whereas in the discriminator, as much detailed feature information should be preserved as possible. Therefore, dilated convolution is more suitable for this purpose: it expands the receptive field while still reducing the size of the feature map. For the same 3*3 convolution kernel, a dilated convolution has a larger receptive field than a standard convolution. As shown in Fig. 4, for a 7*7 feature map with padding = 0, stride = 1, and rate = 2, the feature map after a standard convolution is 5*5, whereas the feature map after a dilated convolution is 3*3.
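The output sizes in Fig. 4 can be verified with a short PyTorch check (padding = 0, stride = 1, dilation rate = 2):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 7, 7)                                                   # 7*7 feature map
std_conv = nn.Conv2d(1, 1, kernel_size=3, padding=0, stride=1)                # standard 3*3 conv
dil_conv = nn.Conv2d(1, 1, kernel_size=3, padding=0, stride=1, dilation=2)    # dilated 3*3 conv, rate = 2

print(std_conv(x).shape)  # torch.Size([1, 1, 5, 5]) -> effective kernel 3*3
print(dil_conv(x).shape)  # torch.Size([1, 1, 3, 3]) -> effective kernel 5*5
```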
C. Perceptual Loss
The loss for the generator consists of three components: a perceptual loss $L_{p}$, an adversarial loss $L_{a}$ and a content loss $L_{c}$: \begin{equation*} L_{Gen}=L_{p}+\lambda L_{a}+\eta L_{c}\tag{3}\end{equation*} where $\lambda$ and $\eta$ are the coefficients that balance the three terms.
Compared with SRGAN, the proportion of the perceptual loss is increased in ESRGAN, and it is increased further in IISR. The perceptual loss is defined as \begin{equation*} L_{p}=\frac {1}{W_{i,j}H_{i,j}}\sum _{x=1}^{W_{i,j}}\sum _{y=1}^{H_{i,j}}\left ({\phi _{i,j}\left ({I^{HR}}\right)_{x,y}-\phi _{i,j}\left ({G_{\theta _{G}}\left ({I^{LR}}\right)}\right)_{x,y}}\right)^{2}\tag{4}\end{equation*} where $\phi _{i,j}$ denotes the feature maps extracted before activation from the feature extraction network, $W_{i,j}$ and $H_{i,j}$ are the width and height of these feature maps, $I^{HR}$ is the reference HR image, $I^{LR}$ is the LR input, and $G_{\theta _{G}}$ denotes the generator.
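A simplified sketch of the perceptual loss in (4) is shown below, assuming PyTorch and using torchvision's densenet121 purely as a stand-in feature extractor for the modified DenseNet (its .features stack ends with a batch-normalization layer, so its output corresponds to pre-activation features); the layer indexing (i, j) is omitted for brevity.

```python
import torch
import torch.nn as nn
from torchvision.models import densenet121

class PerceptualLoss(nn.Module):
    """MSE between pre-activation feature maps of the HR and generated images."""
    def __init__(self):
        super().__init__()
        # densenet121(...).features ends with a BatchNorm layer; its output is taken
        # *before* the final ReLU, matching the "features before activation" choice above.
        # (weights="DEFAULT" assumes torchvision >= 0.13.)
        self.features = densenet121(weights="DEFAULT").features.eval()
        for p in self.features.parameters():
            p.requires_grad = False
        self.mse = nn.MSELoss()

    def forward(self, sr, hr):
        return self.mse(self.features(sr), self.features(hr))
```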
We evaluate this loss through the DenseNet architecture, which has a relatively strong feature extraction ability: by means of feature reuse, DenseNet can acquire a richer combination of features and evaluate them more comprehensively. Moreover, the features extracted before activation are used. The content loss is defined as \begin{equation*} L_{c}=\begin{cases} \dfrac {1}{2}\left ({G_{\theta _{G}}\left ({I^{LR}}\right)-I^{HR}}\right)^{2}, & \text {for } \left |{G_{\theta _{G}}\left ({I^{LR}}\right)-I^{HR}}\right |\le \delta \\ \delta \left |{G_{\theta _{G}}\left ({I^{LR}}\right)-I^{HR}}\right |-\dfrac {1}{2}\delta ^{2}, & \text {otherwise} \end{cases}\tag{5}\end{equation*} where $\delta$ is the threshold at which the loss switches from the quadratic form to the linear form.
The content loss defined above is mainly based on the mean absolute error (MAE), but the MAE calculation is replaced with the MSE calculation when the error is sufficiently small. The result is a continuously differentiable piecewise function that inherits the advantages of both the MAE and the MSE. The MAE has a vertex at zero where its derivative is undefined, so its gradient does not diminish as the error shrinks and training tends to fluctuate near convergence. By contrast, the improved function is differentiable at zero, which is conducive to convergence. Thus, the improved function maintains a continuous derivative and exploits the diminishing gradient of the MSE near the minimum to obtain a more accurate minimum value, while retaining the MAE's robustness to outliers.
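A minimal sketch of the content loss in (5) and the combined generator loss in (3) is given below; the values of δ, λ and η are placeholders here, not the values used in our training. For δ = 1 the piecewise function coincides with PyTorch's built-in nn.SmoothL1Loss.

```python
import torch

def content_loss(sr, hr, delta=1.0):
    """Piecewise MSE/MAE (Huber-style) content loss of Eq. (5)."""
    err = torch.abs(sr - hr)
    quadratic = 0.5 * err ** 2                # used where |error| <= delta
    linear = delta * err - 0.5 * delta ** 2   # used elsewhere
    return torch.where(err <= delta, quadratic, linear).mean()

def generator_loss(l_perceptual, l_adversarial, l_content, lam=5e-3, eta=1e-2):
    """Eq. (3): L_Gen = L_p + lambda * L_a + eta * L_c (weights are placeholders)."""
    return l_perceptual + lam * l_adversarial + eta * l_content
```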
Experiments
A. Experimental Settings
We built a Mirau-type interferometric system, and the experiments reported in the following are all based on this system. Because our imaging scenario differs from those covered by existing datasets, it was necessary to build our own dataset: the images comprising both the training set and the test set were captured with this system, so they reflect the actual application and allow a better evaluation of the network's practical effect. There were 520 HR images in the training set. The high-quality original images were cropped into smaller sub-images for training, and LR images were obtained by means of bicubic blurring. To achieve a faster input/output speed, the training images were organized in Lightning Memory-Mapped Database (LMDB) format. The learning rate was initialized as
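For reference, a minimal sketch of how an LR training input can be generated from an HR crop by bicubic degradation is shown below; the file paths and the 4x scale factor are assumptions for illustration.

```python
from PIL import Image

def make_lr(hr_path, lr_path, scale=4):
    """Generate a bicubically downsampled LR image from an HR crop."""
    hr = Image.open(hr_path)
    lr = hr.resize((hr.width // scale, hr.height // scale), Image.BICUBIC)
    lr.save(lr_path)
```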
The reconstructed images were evaluated with objective indicators (peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM)) but were mainly assessed based on visual quality. The network was trained on an NVIDIA GeForce GTX 1080Ti GPU in the PyTorch environment.
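The objective indicators can be computed with scikit-image, for example (the file names below are hypothetical):

```python
import cv2
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# hypothetical file names for a reconstructed SR image and its HR reference
sr = cv2.imread("sr_image.png", cv2.IMREAD_GRAYSCALE)
hr = cv2.imread("hr_image.png", cv2.IMREAD_GRAYSCALE)

psnr = peak_signal_noise_ratio(hr, sr, data_range=255)
ssim = structural_similarity(hr, sr, data_range=255)
print(f"PSNR = {psnr:.2f} dB, SSIM = {ssim:.4f}")
```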
B. Experimental Results
1) Ablation Study
To verify the effect of dilated convolution, we conducted a comparative experiment to evaluate the results of replacing a pooling layer with a dilated convolutional layer. After this replacement, the extracted feature map of the input interference microscopy image contains more details, as shown in Fig. 5. Thus, the effectiveness of dilated convolution is proven.
Fig. 5. Feature maps of a) an input image after b) a dilated convolutional layer and c) a max pooling layer.
The details of an output image obtained via the proposed IISR model with the modified DenseNet discriminator are more abundant than those of the equivalent image obtained with a VGG-based discriminator, as shown in Fig. 6. The output image obtained with the VGG network contains a certain amount of noise and is grainy. By contrast, the modified DenseNet retains more details without distortion, suppresses noise, and yields features with clearer edge shapes. Thus, with the help of the DenseNet discriminator, the IISR network can output super-resolution images with more detail and less noise.
Fig. 6. Super-resolution images obtained with a) the VGG network and b) the modified DenseNet.
The content loss in the original network is calculated based on the MAE. By contrast, the content loss in the IISR network is defined as a continuously differentiable piecewise function, as given in (5). As Fig. 7 shows, before this replacement of the content loss function (blue line), some fluctuations are observed in the second half of the training process. After this replacement (pink line), the fluctuations decrease, and the training process shows more stable convergence. This is because at the beginning of training, the difference is large, and the training process is unstable. However, when the difference is small, the original function will fluctuate around a stable value, making it difficult to reach convergence. With the improved function, even when the difference is small, the function value will gradually converge to a stable value. Thus, the improved function provides stronger suppression of outliers and makes the network more robust.
The training times for the ESRGAN and IISR models were 65 h and 96 h, respectively. These findings demonstrate that with the proposed modifications to the network, the training time is increased. The main reason is that DenseNet is more complex than the VGG network, requiring a larger amount of feature calculation.
2) Comparative Experiments
The USAF resolution test chart was etched on a resolution test target. The pattern includes several groups of features made up of three-line sets. The size of the line sets gradually decreases in each group. As shown in Fig. 8, the chart used in these experiments includes a total of nine groups, and each group contains six elements. The labels of the groups and elements are identified next to the corresponding features.
Several state-of-the-art models, namely, a super-resolution CNN (SRCNN), an SRGAN and an ESRGAN, were selected for comparison. The original HR images in the test set were blurred to generate LR images, and SR images were then obtained from the LR images via the different models. The performance of each model is measured by the difference between its SR images and the HR images: the closer the restored SR image is to the HR image, the better the model performs, which ensures the reliability of the evaluation. Two sets of images from the test set were selected to demonstrate the performance of the models. The two sets were collected in accordance with the same standards; the only difference was whether the images contained interference fringes. For fairness, the super-resolution results were obtained by processing the same blurred images with the different models.
Fig. 9 to Fig. 14 compare the proposed model with the other models in terms of visual quality. Fig. 9 presents the super-resolution performance for a microscopy image of the resolution test target. In the image reconstructed by the SRCNN, Element 6 in Group 8 (indicated by an arrow) on the resolution test target cannot be resolved. The SRGAN results also show blurring of this feature, while the ESRGAN results show improvement. However, for Element 2 of Group 9 (indicated by an arrow), the ESRGAN restores only the horizontal features, while the IISR network restores all of the features. These findings demonstrate that the IISR method has a stronger super-resolution capability, fully recovering the details of the HR image.
Fig. 10. Qualitative results for a microscopy image of a standard sample with stripe features.
Fig. 11. Qualitative results for a microscopy image of a standard sample with point features.
Fig. 12. Qualitative results for an interference image (with interference fringes) of the USAF resolution test chart.
Fig. 13. Qualitative results for an interference image (with interference fringes) of a standard sample with stripe features.
Fig. 14. Qualitative results for an interference image (with interference fringes) of a standard sample with point features.
Fig. 10 and Fig. 11 present the super-resolution performance for microscopy images of standard samples. For both stripe and point features, the SRCNN is not able to achieve satisfactory restoration. Some of the stripe features generated by the SRGAN are interrupted (indicated by an arrow) and thus are obviously incorrect. Similarly, the point features in the SRGAN results contain grainy noise and lack intact shapes. The stripe and point features generated by the ESRGAN are accompanied by artifacts. After passing through the ESRGAN, the edges of two identical point features are different (indicated by arrows). By contrast, the features restored by the IISR network have sharper edges and more realistic textures, comparable to those of the HR images.
In addition to the super-resolution performance for normal microscopy images, the performance for interference images also needs to be confirmed. In contrast to normal microscopy images, the images acquired via interference microscopy (i.e., interference images) include interference fringes. Consequently, achieving super-resolution is more challenging for interference images than for normal microscopy images, because the influence of the interference fringes must be excluded when restoring features. The features that deviate from the interference fringes are the most strongly affected and are therefore selected for evaluation. Below, the reconstruction results of the different models for interference images are presented.
Fig. 12 presents the super-resolution performance for an interference image of the resolution test target. Under the influence of the interference fringes, the details of the image are attenuated. Compared with the results in Fig. 9, the image generated by the SRGAN contains more noise and exhibits certain artifacts; in particular, the features in Group 8 show obvious artifacts (indicated by a dotted box). The details of Element 1 in Group 9 (indicated by an arrow), which are visible in the previous normal micrograph, are also lost. The ESRGAN correctly restores the features of the eighth group, thus performing better than the SRGAN. However, the features of the ninth group (indicated by an arrow) are blurred. In contrast, after passing through the IISR network, the features of the ninth group are partially restored. In particular, the horizontal features of Element 2 (indicated by an arrow) are restored, although the restoration is not as effective as it is for the normal micrograph.
Fig. 13 and Fig. 14 present the super-resolution performance for interference images of standard samples. For the SRGAN, the reconstructed stripe features are curved, and the point features are also deformed and drowned in noise. Similarly, the stripe features reconstructed by the ESRGAN are also curved, and identical point features (indicated by an arrow) appear different after restoration. By contrast, the proposed IISR network restores the details well, achieving results that are basically consistent with the HR images.
The above comparative experiments demonstrate that for both normal microscopy images and interference images, the proposed model achieves visible superiority in terms of visual quality.
Conclusion
We have presented an IISR model that can achieve super-resolution for interference microscopy images. The IISR model is based on a GAN and generates realistic interference microscopy images through adversarial learning. To make the generated images more similar to real images, a novel discriminator architecture is proposed based on a modified DenseNet. To better preserve image details, dilated convolution is adopted. The perceptual loss is determined based on the modified DenseNet. The content loss is evaluated using a piecewise function that combines the characteristics of the MAE and MSE. The parameters of the generator loss are rebalanced to focus on improving the visual quality of the reconstructed images. Experiments were performed using both normal microscopy images without interference fringes and interference images. Various state-of-the-art SISR models were selected for comparison. Based on the experimental results, the proposed IISR model is proven to be effective for application to images acquired via interference microscopy. We will continue to investigate this topic in the future. The next steps will be to further optimize the network structure, improve the resolution ability of the model, and shorten the training time. For example, we will apply a gradient penalty to our network and test whether it can speed up convergence.