Introduction
With the development of computer vision technology, a wide variety of image sensors have come into use. Many advanced artificial intelligence technologies are applied in the processing of multimodal remote sensing data [1], [2], [3], [4], [5], [6]. Because visible and infrared images carry highly complementary information, they have become a focus of attention in remote sensing imaging. Image-to-image (I2I) translation aims to convert images from one domain to another with content preserved as much as possible. Although visible-to-infrared image translation can provide infrared data support for downstream tasks, such as template matching, the reliability of the generated data has always been a contentious issue.
In classical computer vision, region-based template matching, represented by the normalized cross-correlation (NCC) algorithm [7], solves the matching problem between homologous images well. Several hand-designed feature-based matching methods have been proposed to address the difficulty in image matching caused by the large differences between multimodal images. However, these algorithms rely heavily on handcrafted features, so each has a specific usage domain, and prior information is needed to decide which method to use for image matching.
Deep learning provides a new way to match multimodal images. One approach uses a deep network to extract common deep features from the visible and infrared images and performs a correlation calculation to obtain the matching position. Examples include the normalized cross-correlation network (NCCNet) [8], pseudo-Siamese CNN (PSiam) [9], CorrASL [10], Siamese U-Net [11], and multiscale fusion SiamUNet-7 (MSF SiamUNet-7) [12].
Another approach converts the visible and infrared images into the same modality for the matching task. Once the data share a modality, classical template matching methods can be used to solve the visible-infrared matching problem. Most image translation algorithms are built on conditional generative adversarial networks (CGANs) [13]. ThermalGAN [14], Pix2Pix-MRFFF [15], InfraGAN [16], IR-GAN [17], and edge-guided multidomain RGB-to-TIR (EMRT) [18] build on Pix2Pix [19] and CycleGAN [20] and are tailored to infrared image generation. However, these algorithms evaluate the quality of generated infrared images only from subjective and objective perspectives, without considering their reliability in downstream tasks such as image matching.
In our earlier work, Siamese U-Net [11] addressed the matching problem between visible and synthetic aperture radar (SAR) images, and IR-GAN [17] generated infrared images corresponding to visible images. This article combines Siamese U-Net [11] and IR-GAN [17] to convert visible images into infrared images that can be used with high confidence in image matching tasks. Fig. 1 compares our proposed I2I task with other I2I tasks.
Comparison of I2I translation based on GANs alone and I2I translation trained jointly with a specific task. The former generates images that are often used for direct human interaction or evaluated using human-set criteria but are rarely considered for direct feeding into downstream tasks based on neural networks or models. We propose a task-oriented approach for joint training with image translation and illustrate it using multimodal image matching.
I2I translation, as an important aspect of image generation, is growing rapidly in the field of remote sensing because many vision or image problems can be viewed as I2I translation problems [21], such as shadow removal [22], [23], image dehazing [24], [25], image segmentation [26], [27], and SAR-to-optical image translation [28], [29], [30]. These tasks have great potential in remote sensing, mainly to help people better recognize remote sensing imagery.
I2I translation has been gradually developed in the medical field [31], [32], as well as firefighting [33], [34], and autonomous driving [35], [36], which have high reliability and accuracy requirements for I2I translation. Template matching is an essential foundation for visual localization and navigation. Heterogeneous template matching can likewise be viewed as an I2I translation problem with similar requirements. Visible-to-infrared image translation is an important branch of image translation. The task of visible-to-infrared template matching can be facilitated by image translation, converting images into the same modality, thereby addressing the challenge posed by significant modality differences. Visible-to-infrared image translation can be utilized to augment infrared data for the purposes of heterogeneous image matching and multimodal data fusion research.
Based on the requirements of heterogeneous template matching applications, common I2I translation methods find it difficult to ensure the realism of the generated images and the reliability of downstream task applications. An important reason is that the training process is not integrated with the downstream tasks. Therefore, unlike current image translation methods that primarily focus on the visual effects of the generated images, we pay more attention to the reliability of the generated images in downstream tasks. We propose to incorporate the template matching task into the I2I translation training process. Our method generates high-quality images and performs well on the template matching task. In experiments, the quality of image generation and the template matching task are complementary in generative adversarial network (GAN)-generated images, and their effectiveness can be improved together.
We propose an infrared image generation method based on a CGAN [13] for matching tasks, named TMGAN. The main contributions of this article are as follows.
Integrating I2I translation with multimodal image matching tasks to ensure the reliability of the generated images in downstream matching tasks.
Two types of matching losses are proposed, one based on the NCC matching algorithm and the other on the idea of contrastive learning, modeling the matching problem as a contrastive learning process.
Experiments are conducted on three large benchmark datasets. Our method outperforms state-of-the-art methods in terms of both image generation quality and template matching accuracy.
The rest of this article is organized as follows. Section II introduces the related work. In Section III, TMGAN is described in detail. The experiments and results are presented in Section IV followed by the discussions and analyses in Section V. Finally, Section VI concludes this article.
Related Work
A. Multimodal Image Matching Based on Network Features
An NCCNet [8] maximizes the contrast between true and false matching NCC values, transforming image features using a trained Siamese convolutional network, thereby improving the robustness of the algorithm. PSiam [9] is a pseudo-Siamese convolutional neural network architecture using eight convolutional layers and a fully connected layer to fuse features and identify corresponding patches in multimodal images. CorrASL [10] is a deep learning end-to-end framework that uses a feature-space cross-correlation operator based on learning features to generate a correspondence heat map. Pixelwise deep dense features (PDDFs) [37], Siamese U-Net [11], and MSF SiamUNet-7 [12] use a Siamese network architecture to extract common features between images. PDDFs define a loss function based on the sum of the squares of differences between learned local pixelwise features. The Siamese U-Net generates a heat map of image pairs using a cross-correlation layer and uses balance loss between true and predicted values to obtain trained model weights. MSF SiamUNet-7 extracts phase structure convolutional features to fully fuse local texture information at a large scale and global structure information at a small scale.
B. Visible-to-Infrared Image Translation
With the excellent performance of I2I translation based on GANs in colorization [38], [39], deblurring [40], [41], and inpainting [42], [43], it has a wide range of applications in advertising, games, entertainment, media, and medicine. Photo animation [44], [45], makeup transfer [46], [47], [48], and photograph to sketch image translation [49], [50], [51] based on GANs can provide good emotional value. However, in critical industries such as medicine and firefighting, we must consider its reliability. I2I translation of medical images offers a range of valuable applications for medical physicists. However, the scarcity of external validation studies of I2I models limits the immediate applicability of the proposed methods [31]. Template matching faces the same dilemma.
ThermalGAN [14], Pix2Pix-MRFFF [15], InfraGAN [16], IR-GAN [17], EMRT [18], and IRformer [52] convert the visible image into a corresponding infrared image. With the rapid development of diffusion models in image generation, UNIT-DDPM [53], DDIB [54], Cycle-Diffusion [55], VI-Diff [56], and CycleGAN-Turbo [57] have achieved significant performance in various image translation tasks based on diffusion models. However, these algorithms have not been widely applied to the task of translating visible images to infrared images, and they suffer from slow inference speed and high equipment requirements.
From the perspective of the realism of image translation, algorithms based on CGAN enhance the generative network or adversarial network, and they improve the texture and structure information of generated infrared images by adding new losses. Algorithms based on diffusion models utilize denoising diffusion probabilistic models to provide stable model training. They usually use common metrics to evaluate the quality of the generated infrared image and use the generated images for interaction with a human being. Since the generated images rarely interact directly with downstream tasks based on neural networks or models, there is a controversy over the usability of the generation. Unlike GANs that typically focus solely on the quality of the generated images, our approach integrates the image translation model training process with downstream tasks, incorporating the matching process into the training to enhance the reliability of the generated images.
C. Contrastive Learning
Contrastive learning is often used in self-supervised learning [58]. It aims to narrow the distance between positive examples and expand the distance between negative examples [59]. SimCLR v1 [60], SimCLR v2 [61], MoCo v1 [62], MoCo v2 [63], BYOL [64], [65], SwAV [66], and SimSiam [67] enable the network to show stronger performance without expanding the data. They generally use data augmentation to generate positive samples and images of different categories as negative samples. They enhance the network's ability to learn common features from positive samples and to distinguish differences from negative samples. Based on contrastive learning, CUT [68], DCLGAN [69], and EnCo [70] achieve image translation by maximizing the mutual information between the input and output image patches. We apply contrastive learning to the matching task, taking correctly matched image blocks as positive samples and incorrectly matched image blocks as negative samples, which recasts template matching as a contrastive learning problem.
Method
The proposed method converts visible images into infrared images to solve the problem of difficult template matching between visible and infrared images due to modal differences. Fig. 2 shows a flowchart of this GAN-based method. A generative network converts visible images into infrared images. L1 loss and GAN loss are common losses used for image generation. Combined with a multimodal template matching task, NCC loss and match patch (MP) loss are proposed. In TMGAN-NCC (see Sections III-A–III-C), an NCC coefficient is used as a kind of matching loss supervision. In TMGAN-MP (see Sections III-A, III-B, and III-D), we model template matching as a contrastive learning problem.
Algorithm structure. During the training process, L1 loss is derived by calculating L1 distance between real and fake infrared images. GAN loss is based on judgment of discriminator on truth or falseness of a fake image. We incorporate a multimodal image template match task into I2I translation and two match losses, based on NCC and MP loss. During inference, we only use visible images as input to the generative network.
A. Infrared Image Generation Structure
TMGAN is based on an improved CGAN. The generative network uses only visible images as input. The generative network has a U-Net architecture, the encoder has a ConvNeXt [71] architecture, and the decoder has a ConvNeXt-like architecture. The specific structure is shown in Fig. 3. The generative network is used to convert the visible image into a corresponding infrared image. This is referred to as the infrared image generation structure. We use PatchGAN [72] as the discriminator, which outputs a discriminant matrix to distinguish between real and generated images. PatchGAN has a lower computational burden than multiscale PatchGAN [73] and UNetGAN [74].
Infrared image generation structure. The generative network is mainly composed of three modules: ConvNeXt block, double residual block, and upsampling. The encoder of the generative network uses the ConvNeXt
B. Loss
The loss in this article includes the LSGAN loss, the L1 loss, and the proposed matching loss. The LSGAN and L1 losses promote the authenticity of the generated image, and the matching loss ensures a high degree of reliability in the matching task:
\begin{align*}
l_{\text{CGAN}}(G,D) = & E_{x,y \sim P_{\text{data}}}[(D(x,y))^{2}]\\
& + E_{x \sim P_{\text{data}}}[(1-D(x,G(x,z)))^{2}] \tag{1}\\
l_{L1} =& ||y-G(x,z)||_{1} \tag{2}\\
\mathcal {L} =& \mu l_{\text{CGAN}}(G,D) + \lambda l_{L1} + \varphi l_{M} \tag{3}
\end{align*}
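The following is a minimal sketch of how (1)-(3) could be evaluated on plain Python lists. The function names and the default weights `mu`, `lam`, and `phi` are hypothetical values chosen only for illustration; the weights actually used are not stated in this section.

```python
def lsgan_term(d_real, d_fake):
    """Evaluate (1) as written: mean of D(x, y)^2 over real pairs
    plus mean of (1 - D(x, G(x, z)))^2 over generated pairs."""
    real_part = sum(d * d for d in d_real) / len(d_real)
    fake_part = sum((1.0 - d) ** 2 for d in d_fake) / len(d_fake)
    return real_part + fake_part


def l1_term(real, fake):
    """Evaluate (2): L1 norm of the difference between flattened images."""
    return sum(abs(r - f) for r, f in zip(real, fake))


def total_loss(l_cgan, l_l1, l_match, mu=1.0, lam=100.0, phi=1.0):
    """Evaluate (3): weighted sum of the three terms.
    The default weights are placeholders, not values from the article."""
    return mu * l_cgan + lam * l_l1 + phi * l_match
```

Here `l_match` stands for whichever matching loss is plugged into (3) as $l_{M}$.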
C. NCC Loss
NCC can effectively overcome the effects of lighting and other conditions, is insensitive to sudden changes in local pixel values, has strong robustness, and has been widely used in template matching. NCC can be expressed as
\begin{equation*}
\text{NCC}(T,I) = \frac{\sum _{p} \sum _{q} (T(p,q)-\mu _{T})(I(p,q)-\mu _{I})}{\sqrt{\sum _{p} \sum _{q} (T(p,q)-\mu _{T})^{2} \sum _{p} \sum _{q} (I(p,q)-\mu _{I})^{2}}} \tag{4}
\end{equation*}
In visible-to-infrared image translation, if the NCC coefficient between the IR-generated and real IR image is closer to 1, then the IR-generated image performs better in the template matching task. Combined with (4), the match loss can be defined as
\begin{equation*}
l_{\text{NCC}} = 1 - \frac{\sum _{p} \sum _{q} (f(p,q)-\mu _{f})(r(p,q)-\mu _{r})}{\sqrt{\sum _{p} \sum _{q} (f(p,q)-\mu _{f})^{2} \sum _{p} \sum _{q} (r(p,q)-\mu _{r})^{2} + \text{eps}}} \tag{5}
\end{equation*}
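As an illustration, the NCC loss can be sketched in plain Python on small images stored as nested lists. This sketch uses the standard NCC normalization, in which the denominator is the square root of the product of the two sums of squared deviations. The function names and the default `eps` value are assumptions made for this example.

```python
import math


def ncc(a, b, eps=1e-8):
    """Normalized cross-correlation between two equally sized 2-D images."""
    fa = [v for row in a for v in row]
    fb = [v for row in b for v in row]
    mu_a = sum(fa) / len(fa)
    mu_b = sum(fb) / len(fb)
    num = sum((x - mu_a) * (y - mu_b) for x, y in zip(fa, fb))
    ss_a = sum((x - mu_a) ** 2 for x in fa)
    ss_b = sum((y - mu_b) ** 2 for y in fb)
    # eps keeps the denominator nonzero for constant images
    return num / math.sqrt(ss_a * ss_b + eps)


def ncc_loss(fake, real, eps=1e-8):
    """Matching loss: 0 when the images are perfectly correlated, 2 when inverted."""
    return 1.0 - ncc(fake, real, eps)
```

Because NCC subtracts the means and normalizes by the spread, a generated image that differs from the real one only by a gain and offset still gives a loss close to zero.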
D. MP Loss
NCC loss is a simple way to incorporate matching loss into the network training. We also use another type of matching loss based on Siamese U-Net and contrastive learning. When constructing a matching loss, a simple idea is to apply the loss of Siamese U-Net directly to image generation, but using the given matching labels so crudely is detrimental to image generation. The ideal matching label is shown in Fig. 4(a), where the template image has different correlations at different positions on the search image. Building matching labels from only 0 and 1 introduces beneficial information at the correct matching position but harmful information at wrong matching positions, which is not conducive to image generation and can lead to the collapse of the GAN.
Similarity calculation using three labels. The x and y axes in the graph represent the matching positions, and the z-axis indicates the image correlation at that position, with values in [-1, 1]. The higher the value, the stronger the correlation. (a) Siamese U-Net match label. (b) Real image normalized correlation. (c) Fake image normalized correlation.
We consider using the normalized correlation of the template image on the real search image as the matching label, as shown in Fig. 4(b), which is continuous and authentic and intuitively advantageous for image generation. However, the normalized correlation of the template image on the fake search image [see Fig. 4(c)] has a high similarity with the real normalized correlation matching label, which makes it difficult to obtain more useful information in the image generation process.
Contrastive learning enhances the power of the network by learning the common features between positive samples and the differences between negative samples. Combining the idea of contrastive learning with the Siamese U-Net, the image block at the correct matching position on the template image and the search image is taken as a positive sample, which is equivalent to 1 in the matching label, and the image block at a wrong matching position is taken as a negative sample, which is equivalent to 0. The advantage of this construction is that it yields a pseudo 0-1 ideal matching label. Compared with the label of Fig. 4(a), the label constructed using contrastive learning introduces a weaker constraint at wrong matching positions, which avoids the collapse of the GAN. Compared with the label in Fig. 4(b), it avoids the problem of being unable to learn more information from the label once the generative network has a certain generative ability.
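The positive and negative sample construction described above can be sketched as follows. This is only an illustration: taking negatives on a regular grid with a fixed `stride` is an assumption of this example, as are the function names, and the article's own sampling scheme may differ.

```python
def crop(image, top, left, size):
    """Cut a size x size block out of a 2-D image stored as nested lists."""
    return [row[left:left + size] for row in image[top:top + size]]


def sample_patches(search, top, left, size, stride):
    """Return the block at the correct matching position (positive sample)
    and the blocks at all other grid positions (negative samples)."""
    height, width = len(search), len(search[0])
    positive = crop(search, top, left, size)
    negatives = []
    for t in range(0, height - size + 1, stride):
        for l in range(0, width - size + 1, stride):
            if (t, l) != (top, left):  # skip the correct position
                negatives.append(crop(search, t, l, size))
    return positive, negatives
```

The positive block would then be compared with the template, and each negative block would supply one wrong-position comparison.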
The MP Loss can be expressed as
\begin{equation*}
l_{\text{MP}} = \sum _{i=1}^{m} \frac{\text{Sim}(I_{i},I_{i}^{+})}{\sum _{j=1}^{n} \text{Sim}(I_{i},I_{i,j}^{-})}, \tag{6}
\end{equation*}
Algorithm 1: PyTorch-style pseudocode for $l_{\text{MP}}$ (fields: Input, Output, Params, Operator; body not recoverable).
Experiments and Results
We describe the metrics and datasets used in the study and present the results. All experiments were conducted on a machine with an Intel Core i9-10980XE CPU @ 3 GHz and an Nvidia RTX 3090 GPU. We conducted an experiment to evaluate the quality of infrared image generation and a template matching experiment using the generated infrared image as a template image on public datasets. In addition, we conducted point feature matching and infrared target detection experiments on the generated infrared images.
A. Datasets
The KAIST dataset [75] contains daily traffic scenes of schools, streets, and villages, which are generally used for pedestrian detection and are important in the field of autonomous driving. The PointGrey Flea3 color camera produces an image of size 640 × 480. The FLIR-A35 infrared camera produces a 320 × 256 image. After camera calibration, a 640 × 512 image pair is obtained. Template matching experiments on the KAIST dataset enabled us to check the usability of the proposed method in autonomous driving. In the experiment, images taken in the daytime were sampled to reduce the number of repeated image pairs. A 256 × 256 image was randomly cut from a 640 × 512 search image for use as a template image. We used 5000 pairs for training and 1247 for testing.
The VEDAI dataset [76] contains spring images collected by the AGRC satellite in Utah, USA, in 2012, including 1024 × 1024 and 512 × 512 images, where a pixel represents an area of 12.5 cm × 12.5 cm and 25 cm × 25 cm, respectively, and the infrared image is near-infrared. The dataset's nine categories are plane, boat, camping car, car, pick-up, tractor, truck, van, and other. VEDAI is important in remote sensing because of its higher resolution and more complex scenes. We verified the usability of the proposed method for remote sensing by performing template matching experiments on the VEDAI dataset, using 1024 × 1024 search images and randomly cropped 256 × 256 images as template images, with 1018 images for training and 250 images for testing.
The AVIID [77] is a dataset for converting aerial visible images into infrared images, containing more than 3000 pairs of visible and infrared images. The dataset is composed of three subdatasets, and the AVIID-3 subset was selected for the experiment due to its inclusion of 1280 pairs of complex scene images, captured at various altitudes and angles, which provides multiscale targets and a variety of backgrounds. The image resolution is 480 × 480, with a total of 1024 image pairs used for training, and 256 pairs for testing in the experiment. In the template matching experiment, the true infrared images were cropped to a size of 240 × 240 and used as template images. Experiments using the AVIID data can evaluate the potential application of the method in the field of low-altitude UAVs.
B. Implementation Details
The proposed improvements are based on the Pix2Pix framework. All experiments were conducted using public code and datasets for training and testing under the same experimental environment, to provide a fair comparison. For Pix2Pix [19], ThermalGAN [14], Pix2Pix-MRFFF [15], InfraGAN [16], IR-GAN [17], TMGAN-NCC, and TMGAN-MP, the Adam optimizer was employed, with
C. Image Quality Evaluation Metrics
Image quality was measured by the peak signal-to-noise ratio (PSNR) [78], structural similarity index measure (SSIM), multiscale structural similarity index measure (MS-SSIM) [79], learned perceptual image patch similarity (LPIPS) [80], and Fréchet inception distance (FID) [81]. We briefly explain these below. PSNR and SSIM are widely used in image quality assessment. PSNR is given by
\begin{equation*}
\text{PSNR}(I,S) = 10\log _{10} \frac{(2^{n}-1)^{2}}{\text{MSE}(I,S)} \tag{7}
\end{equation*}
\begin{equation*}
\text{MSE}(I,S) = \frac{1}{MN} \sum _{p=1}^{M}\sum _{q=1}^{N} (I(p,q)-S(p,q))^{2} \tag{8}
\end{equation*}
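A minimal sketch of (7) and (8) in plain Python is given below, assuming images stored as nested lists of integer gray levels, a bit depth of `bits` (the $n$ in (7)), and a base-10 logarithm as in the usual decibel convention. The function names are illustrative.

```python
import math


def mse(img_a, img_b):
    """Mean squared error between two equally sized 2-D images, as in (8)."""
    fa = [v for row in img_a for v in row]
    fb = [v for row in img_b for v in row]
    return sum((x - y) ** 2 for x, y in zip(fa, fb)) / len(fa)


def psnr(img_a, img_b, bits=8):
    """Peak signal-to-noise ratio in dB, as in (7)."""
    err = mse(img_a, img_b)
    if err == 0:
        return math.inf  # identical images: PSNR is unbounded
    peak = 2 ** bits - 1
    return 10.0 * math.log10(peak ** 2 / err)
```

For 8-bit images, an average error of one gray level per pixel corresponds to roughly 48 dB.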
\begin{equation*}
\begin{aligned} \text{SSIM}(I,S) &= \frac{2 \mu _{I} \mu _{S} + C_{1}}{\mu _{I}^{2} + \mu _{S}^{2} + C_{1}} \frac{2 \sigma _{I} \sigma _{S} + C_{2}}{\sigma _{I}^{2} + \sigma _{S}^{2} + C_{2}} \frac{\sigma _{IS} + C_{3}}{\sigma _{I} \sigma _{S} + C_{3}} \\
&= \frac{2 \mu _{I} \mu _{S} + C_{1}}{\mu _{I}^{2} + \mu _{S}^{2} + C_{1}} \frac{2 \sigma _{IS} + C_{2}}{\sigma _{I}^{2} + \sigma _{S}^{2} + C_{2}} \quad (C_{3} = C_{2}/2) \end{aligned} \tag{9}
\end{equation*}
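The simplified two-factor form of SSIM, which follows from setting $C_{3} = C_{2}/2$, can be sketched as below. This example computes a single global SSIM over the whole image rather than the windowed average used by common implementations, and it assumes the widely used constants $C_{1} = (0.01L)^{2}$ and $C_{2} = (0.03L)^{2}$ with $L = 2^{n}-1$. The constants actually used in the experiments are not stated here.

```python
def ssim_global(img_a, img_b, bits=8):
    """Global (single-window) SSIM using the simplified form with C3 = C2 / 2."""
    fa = [v for row in img_a for v in row]
    fb = [v for row in img_b for v in row]
    n = len(fa)
    mu_a = sum(fa) / n
    mu_b = sum(fb) / n
    var_a = sum((x - mu_a) ** 2 for x in fa) / n
    var_b = sum((y - mu_b) ** 2 for y in fb) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(fa, fb)) / n
    dynamic_range = 2 ** bits - 1
    c1 = (0.01 * dynamic_range) ** 2  # assumed default constant
    c2 = (0.03 * dynamic_range) ** 2  # assumed default constant
    luminance = (2 * mu_a * mu_b + c1) / (mu_a ** 2 + mu_b ** 2 + c1)
    structure = (2 * cov + c2) / (var_a + var_b + c2)
    return luminance * structure
```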
\begin{equation*}
\text{MS-SSIM}(I,S) = [b_{M}(I,S)]^{\alpha _{M}} \prod \limits _{j=0}^{M} [c_{j}(I,S)]^{\beta _{j}} [s_{j}(I,S)]^{\gamma _{j}} \tag{10}
\end{equation*}
LPIPS measures the Euclidean distance between two image feature vectors and is calculated as
\begin{align*}
& \text{LPIPS}(I,S) \\
&= \sum _{l=1}^{L}{\frac{1}{H_{l}W_{l}}} \sum _{h,w}[f_{l}(I)_{h,w} - f_{l}(S)_{h,w}]^{2} \times \omega _{l}. \tag{11}
\end{align*}
FID measures the similarity between two sets of images by comparing statistics of their deep features, that is, the distance between the feature distributions of the real and generated images. The features are extracted with the Inception v3 image classification model. FID is often used to assess the quality of images generated by GANs, and lower scores correlate with higher image quality. An FID of 0 is the best case and indicates that the two sets of images have identical feature statistics. FID is calculated as
\begin{equation*}
\text{FID}(I,S) = ||\mu _{I} - \mu _{S}||^{2} + \text{Tr}(\sigma _{I} + \sigma _{S} - 2 \sqrt{\sigma _{I} \sigma _{S}}) \tag{12}
\end{equation*}
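As a sanity check on the trace term in (12), consider the special case of one-dimensional features, where the covariances $\sigma _{I}$ and $\sigma _{S}$ reduce to scalar variances. Writing these as $s_{I}^{2}$ and $s_{S}^{2}$ (symbols introduced here only for illustration), the trace term becomes
\begin{equation*}
\text{Tr}(\sigma _{I} + \sigma _{S} - 2 \sqrt{\sigma _{I} \sigma _{S}}) = s_{I}^{2} + s_{S}^{2} - 2 s_{I} s_{S} = (s_{I} - s_{S})^{2}
\end{equation*}
so the term vanishes only when the two feature distributions have the same spread, and the mean term vanishes only when they have the same mean.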
D. Simulation Evaluation of Infrared Image Generation Quality
We employed eight comparative algorithms, among which Pix2Pix [19] established the fundamental framework for paired image translation. ThermalGAN [14] introduced a temperature vector as input, but to ensure a fair comparison, it was given only visible images as input here. Pix2Pix-MRFFF [15] proposed a multireceptive field feature fusion network. InfraGAN [16] and IR-GAN [17] improve the image quality by improving the expressive power of the generative and adversarial networks and improve the edge information of the generated image using SSIM loss and gradient vector loss, respectively. EMRT [18] proposed an edge-guided multidomain RGB-to-TIR image translation model. IRformer [52] constructs an implicit multispectral transformer. CycleGAN-turbo [57] uses the LoRA technique to fine-tune the diffusion model to adapt to new tasks.
An observation of the IR images generated from three datasets supports the following conclusions. On the KAIST dataset (see Fig. 5), the infrared images generated by our method have clearer texture information and more realistic gray information (red box, Fig. 5) and retain detail more completely (blue box, Fig. 5). All algorithms perform better on the VEDAI dataset, which is relatively simple for the infrared generation task because its infrared images have a simple structure and little texture information. On the VEDAI dataset (see Fig. 6), the proposed algorithm generates more accurate grayscale information (red box, Fig. 6). On the AVIID dataset, the proposed algorithm generates infrared images that retain more detailed information (red box, Fig. 7). The Pix2Pix, ThermalGAN, and Pix2Pix-MRFFF algorithms all perform poorly on the three datasets. They are designed for visible-to-infrared image translation on 256 × 256 images, which results in poor performance on high-resolution images. InfraGAN and IR-GAN focus on multiscale information of images, resulting in better quality of generated images. EMRT uses an edge-guided approach, but the edges extracted from visible and infrared images are not fully consistent, resulting in blurred edges in the images produced by EMRT and the poorest performance across all datasets. IRformer has a lightweight design, but since the method is not based on GANs, the generated images are somewhat blurry. Due to the complexity of visible-to-infrared translation, the grayscale information in the images generated by CycleGAN-turbo is not well preserved. Overall, the image generation quality of TMGAN-NCC and TMGAN-MP is somewhat better than that of IR-GAN, on which TMGAN is built. We performed a quantitative evaluation to further analyze the performance of TMGAN.
Examples of infrared images generated on the KAIST dataset by our algorithm. (a) Optical image. (b) Infrared image. (c) Pix2Pix. (d) ThermalGAN. (e) Pix2Pix-MRFFF. (f) InfraGAN. (g) IR-GAN. (h) EMRT. (i) IRformer. (j) CycleGAN-turbo. (k) TMGAN-NCC. (l) TMGAN-MP.
Examples of infrared images generated on the VEDAI dataset by our algorithm. (a) Optical image. (b) Infrared image. (c) Pix2Pix. (d) ThermalGAN. (e) Pix2Pix-MRFFF. (f) InfraGAN. (g) IR-GAN. (h) EMRT. (i) IRformer. (j) CycleGAN-turbo. (k) TMGAN-NCC. (l) TMGAN-MP.
Examples of infrared images generated on the AVIID dataset by our algorithm. (a) Optical image. (b) Infrared image. (c) Pix2Pix. (d) ThermalGAN. (e) Pix2Pix-MRFFF. (f) InfraGAN. (g) IR-GAN. (h) EMRT. (i) IRformer. (j) CycleGAN-turbo. (k) TMGAN-NCC. (l) TMGAN-MP.
We use common indicators to evaluate the quality of the generated images, as shown in Table I, where the best values are in boldface. Our methods (TMGAN-NCC and TMGAN-MP) achieved the best results for infrared image generation. In addition, we have summarized the model parameter count and training time on the KAIST dataset for all algorithms in Table II. TMGAN has a moderate number of model parameters, and its training time is acceptable. TMGAN-NCC introduces an NCC coefficient as a kind of matching loss supervision. TMGAN-MP models template matching as a contrastive learning problem. The more similar the generated infrared image is to the real infrared image, the higher the matching success rate, which is why the matching loss also improves the quality of the generated infrared image.
E. Matching Evaluation Metrics
We compared various approaches to assess performance in template matching and point feature matching. Template matching accuracy is defined by the mean L2 error and the percentage of match pairs that have an L2 distance within one pixel. Template matching precision is the standard deviation of matching L2 error and is also called mean average precision (
\begin{equation*}
\text{EPE} = \frac{1}{\mathcal {N}} \sum _{i=1}^{\mathcal {N}} \sqrt{(p_{i} - \hat{p}_{i})^{2} + (q_{i} - \hat{q}_{i})^{2}} \tag{13}
\end{equation*}
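The end-point error in (13) and the share of matches within one pixel can be sketched as follows. Positions are assumed to be `(p, q)` tuples, and treating an error of exactly one pixel as a success is an assumption of this example.

```python
import math


def l2_errors(true_pos, pred_pos):
    """Per-pair L2 distance between true and predicted matching positions."""
    return [math.hypot(p - ph, q - qh)
            for (p, q), (ph, qh) in zip(true_pos, pred_pos)]


def epe(true_pos, pred_pos):
    """Mean end-point error over all pairs, as in (13)."""
    errors = l2_errors(true_pos, pred_pos)
    return sum(errors) / len(errors)


def match_rate(true_pos, pred_pos, tol=1.0):
    """Fraction of pairs whose L2 error is at most tol pixels."""
    errors = l2_errors(true_pos, pred_pos)
    return sum(1 for e in errors if e <= tol) / len(errors)
```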
F. Simulation Evaluation of Template Matching
The NCC was used to evaluate the template matching effect between the generated infrared image and the real template image. The proposed method was compared with Pix2Pix, ThermalGAN, Pix2Pix-MRFFF, InfraGAN, IR-GAN, EMRT, IRformer, and CycleGAN-turbo. Index values were calculated for all IR images generated by each model. We also compared the traditional NCC matching algorithm and the Siamese-UNet matching algorithm based on the Siamese network, both of which use visible and infrared images as input for matching tasks. Statistical results for the KAIST, VEDAI, and AVIID datasets, including averages, are shown in Table III and suggest that our algorithm performed best across all three metrics.
Comparing the image quality and matching results in Tables I and III, respectively, it can be found that the proposed method performs best in both cases. From the visual contrast effect, the image texture generated by this algorithm is clearer, and the gray information is more valid. According to Table III, the best matching accuracy and precision are obtained when the visible image is converted into the corresponding infrared image by the proposed method. The matching effect of the proposed method is even better than that of our previously proposed deep learning-based multimodal matching algorithm (Siamese-UNet), which shows that the generated infrared image has high reliability in downstream matching tasks and suggests that using our joint training methods, the template matching and image generation tasks can benefit each other.
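The NCC-based evaluation used in this subsection amounts to sliding the template over the search image and keeping the position with the highest NCC. A brute-force sketch on small nested-list images is shown below. It is meant only to make the procedure concrete, the function names are illustrative, and practical implementations use much faster formulations.

```python
import math


def ncc(a, b, eps=1e-8):
    """Normalized cross-correlation between two equally sized 2-D blocks."""
    fa = [v for row in a for v in row]
    fb = [v for row in b for v in row]
    mu_a = sum(fa) / len(fa)
    mu_b = sum(fb) / len(fb)
    num = sum((x - mu_a) * (y - mu_b) for x, y in zip(fa, fb))
    ss_a = sum((x - mu_a) ** 2 for x in fa)
    ss_b = sum((y - mu_b) ** 2 for y in fb)
    return num / math.sqrt(ss_a * ss_b + eps)


def match_template(search, template):
    """Return the (top, left) position with the highest NCC and its score."""
    th, tw = len(template), len(template[0])
    best_pos, best_score = (0, 0), -math.inf
    for top in range(len(search) - th + 1):
        for left in range(len(search[0]) - tw + 1):
            window = [row[left:left + tw] for row in search[top:top + th]]
            score = ncc(window, template)
            if score > best_score:
                best_pos, best_score = (top, left), score
    return best_pos, best_score
```

The distance between the returned position and the ground-truth crop position is the per-pair L2 error that feeds the matching metrics.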
G. Simulation Evaluation of Point Feature Matching
The scale-invariant feature transform (SIFT) algorithm [82] was employed to evaluate the performance of point feature matching between the generated infrared images and the real ones. Our proposed method was compared against eight other methods. The statistical outcomes from the KAIST, VEDAI, and AVIID datasets, including the mean values, are displayed in Table IV. These results demonstrate that our algorithm achieved the best score in the EPE metric. The results from Tables III and IV collectively demonstrate that the infrared images generated by our proposed method exhibit the best performance in both template matching and point feature matching tasks.
H. Object Detection Evaluation Metrics
Infrared object detection is an important research area in the field of remote sensing [83], [84], [85], [86]. We compared the results of different methods on the VEDAI dataset. We used common indicators in object detection, including precision (P), recall (R), and mean average precision (mAP). We paid special attention to mAP50 and mAP50-95. mAP50 is the mAP computed at an IoU threshold of 0.50, while mAP50-95 is the average of the ten mAP values computed at IoU thresholds from 0.50 to 0.95 in steps of 0.05.
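To make the thresholds concrete, the sketch below computes the IoU of two axis-aligned boxes and averages a per-threshold score over the ten IoU thresholds from 0.50 to 0.95. The box format `(x1, y1, x2, y2)` and the callable `map_at` are assumptions of this example; computing the mAP at a single threshold is left to the detection toolkit.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_thresholds():
    """The ten thresholds 0.50, 0.55, ..., 0.95."""
    return [(50 + 5 * k) / 100 for k in range(10)]


def map50_95(map_at):
    """Average map_at(threshold) over the ten thresholds.
    map_at is a hypothetical callable returning the mAP at one IoU threshold."""
    thresholds = iou_thresholds()
    return sum(map_at(t) for t in thresholds) / len(thresholds)
```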
I. Simulation Evaluation of Object Detection
Infrared small target detection has always been a challenging issue in the field of remote sensing. The UIU-Net [83] proposes a simple and effective “U-Net in U-Net” framework. The ISNet [84] introduces a Taylor finite difference-inspired edge block and a two-orientation attention aggregation block. The GCI-Net [85] presents a novel Gaussian curvature inspired network. IRSAM [86] enables the application of an advanced segment anything model for the detection of small infrared targets. We performed infrared object detection using the common YOLOv8 [87]. The results are presented in Table V, summarizing the performance metrics obtained in the infrared object detection task. The table provides a quantitative evaluation, illustrating the effectiveness of our generated images in enhancing the precision and reliability of infrared object detection. The results from Tables I, III, and V collectively indicate that the infrared images generated by our proposed method are highly reliable in downstream tasks and can be effectively used for visible-infrared heterogeneous template matching and infrared object detection tasks.
J. Ablation Study
The effectiveness of the matching loss was verified through ablation experiments on the KAIST dataset. In the baseline, UConvNeXt served as the generator, PatchGAN as the discriminator, and the LSGAN and L1 losses as the loss functions. The NCC algorithm was used to evaluate the matching results, and five indicators were used to evaluate the quality of image generation.
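For reference, NCC-based template matching of the kind used for this evaluation can be sketched as follows. This is a minimal illustration that assumes the zero-mean variant of NCC and an exhaustive sliding-window search over grayscale arrays; the function name `ncc_match` is hypothetical, and the exact NCC formulation of [7] may differ.

```python
import numpy as np


def ncc_match(image, template):
    """Return ((row, col), score) of the best zero-mean NCC match.

    Assumption: both inputs are 2-D grayscale arrays and the template is
    no larger than the image. The score lies in [-1, 1].
    """
    image = np.asarray(image, dtype=np.float64)
    template = np.asarray(template, dtype=np.float64)
    th, tw = template.shape
    ih, iw = image.shape

    # Zero-mean template and its norm are computed once.
    t = template - template.mean()
    t_norm = np.sqrt((t ** 2).sum())

    best_score, best_pos = -np.inf, (0, 0)
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            window = image[y:y + th, x:x + tw]
            w = window - window.mean()
            denom = t_norm * np.sqrt((w ** 2).sum())
            # Flat windows have zero variance; treat them as uncorrelated.
            score = (w * t).sum() / denom if denom > 0 else 0.0
            if score > best_score:
                best_score, best_pos = score, (y, x)
    return best_pos, float(best_score)
```

The matching error can then be measured as the distance between the returned position and the ground-truth position of the template. The double loop is kept for readability; a practical implementation would typically use an FFT-based or library routine instead.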
Table VI shows the image matching and generation results after adding the two matching losses (
In TMGAN-MP, L1 loss and MP loss represent the quality of image generation and the effectiveness of image matching, respectively. Based on our experience with previous working parameter settings [17], we set
Evaluation metrics for various values of
Discussion
A. Overfitting in Task-Oriented Training
A potential disadvantage of task-oriented image generation is that the generated images may achieve good results only with the matching method involved in training and generalize poorly to other methods.
MP loss (6) has an important operation,
From the results in Table VII, we can see that different
In the MP loss, there are two hyperparameters,
B. Relationship Between Template Matching and Image Generation Tasks
From Tables I and III, it can be observed that the image generation and image matching tasks are complementary.
We further explored their relationship by calculating the Pearson correlation coefficient. Table IX shows the relationship between the image evaluation metrics and the matching results. Because the metrics have different scales and directions, we normalized them to the range 0 to 1 and aligned their trends so that the Pearson correlation coefficients could be compared directly. Larger values of PSNR, SSIM, and MS-SSIM indicate better image generation quality, as do smaller values of LPIPS and FID. There is a strong correlation between the image evaluation metrics and the matching results, with most Pearson correlation coefficients exceeding 0.9. However, the correlation between the FID metric and the matching metric is relatively poor.
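The normalization and correlation steps described above can be sketched in Python as follows. The helper names `min_max` and `pearson_r` are hypothetical, and the numbers in the usage example are made-up placeholders, not values from Table IX.

```python
import math


def min_max(values, higher_is_better=True):
    """Scale values to [0, 1]; flip lower-is-better metrics (e.g. LPIPS, FID)
    so that a larger normalized value always means better quality."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalize a constant sequence")
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


# Illustrative placeholder values, one entry per compared method.
psnr = [20.0, 22.0, 24.0, 26.0]       # higher is better
lpips = [0.40, 0.30, 0.20, 0.10]      # lower is better
match_rate = [0.50, 0.60, 0.70, 0.80]  # higher is better

r_psnr = pearson_r(min_max(psnr), min_max(match_rate))
r_lpips = pearson_r(min_max(lpips, higher_is_better=False), min_max(match_rate))
```

Min-max scaling alone does not change the magnitude of Pearson's r, since the coefficient is invariant to positive linear rescaling; the direction flip is what gives every metric the same sign convention.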
Regarding loss function design, Fig. 9 shows a positive correlation between the L1 loss and the MP loss. A linear fit of the L1 loss against the MP loss gave a Pearson's r of 0.98813, indicating an extremely strong positive correlation. The strong correlation between image generation quality and task execution quality is the prerequisite for applying this type of method, and the positive correlation between the image quality loss and the matching task loss is a prerequisite for the convergence of the neural network. In addition, as shown in Fig. 8, a good weight ratio between the image generation loss and the matching loss is key to ensuring the best performance of this method. When the ratio of
Summarizing the results of Table IX and Fig. 9, we can conclude that the image generation and image template matching tasks are complementary. The organic combination of the two tasks can simultaneously improve the quality of image generation and the accuracy of image matching.
Conclusion
We proposed TMGAN-NCC and TMGAN-MP to convert visible images into infrared images, solving the problem of template matching between visible and infrared images due to modal differences. We combined the template matching and image generation tasks. Experimental results on the VEDAI, KAIST, and AVIID datasets showed that these tasks can promote each other. The method outperforms state-of-the-art methods in terms of image translation quality and template matching accuracy.
We took the template matching task as an example to explore the relationship between image translation and image matching tasks. There was a strong positive correlation between these tasks, showing that the “image translation + downstream task” pattern is successful and supporting the reliability of translated image data in downstream tasks. Our future work will extend this pattern to more downstream tasks, such as multisource image object detection and data association.
Code and Data Availability
The code and data information in the manuscript are as follows.
Disclosures
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this article.
ACKNOWLEDGMENT
The authors would like to thank the National Natural Science Foundation of China, the China Postdoctoral Science Foundation, and the Young Talent Fund of the University Association for Science and Technology in Shaanxi for their support, as well as all those who have provided valuable comments on this article.