Visible-to-Infrared Image Translation for Matching Tasks | IEEE Journals & Magazine | IEEE Xplore

Visible-to-Infrared Image Translation for Matching Tasks


Abstract:

Visible-to-infrared image translation is an important way to enrich infrared data. However, the reliability of the data generated by image translation in downstream tasks has always been controversial. This article proposes a method that integrates visible-to-infrared image translation with multimodal template matching. The image generation network is based on a generative adversarial network (GAN), and network training is supervised by L1 loss, GAN loss, and matching loss, where the matching loss includes normalized cross-correlation (NCC) loss and match patch (MP) loss. NCC loss is constructed based on the NCC matching algorithm. MP loss is calculated by modeling template matching as a contrastive learning problem. In experiments on the KAIST, VEDAI, and AVIID datasets, this method outperforms state-of-the-art methods in terms of image generation quality and template matching accuracy. Our method incorporates the image matching process into image-to-image translation, demonstrating the usability of GAN-based image generation for critical downstream tasks. This research addresses the practical controversy over GAN-generated images and provides a theoretical reference for task-oriented image generation, such as multisource image object detection and data association.
Page(s): 18199 - 18213
Date of Publication: 25 September 2024


SECTION I.

Introduction

With the development of computer vision technology, various image sensors have come into use. Many advanced artificial intelligence technologies are applied in the processing of multimodal remote sensing data [1], [2], [3], [4], [5], [6]. Because visible and infrared images carry highly complementary information, they have become a focus of attention in remote sensing imaging. Image-to-image (I2I) translation aims to convert images from one domain to another with content preserved as much as possible. Although visible-to-infrared image translation can provide infrared data support for downstream tasks, such as template matching, the reliability of the generated data has always been a contentious issue.

In classical computer vision, region-based template matching methods, represented by the normalized cross-correlation (NCC) algorithm [7], solve the matching problem between homologous images well. Some matching methods based on hand-designed features have been proposed to address the difficulty in image matching caused by the large differences between multimodal images. However, these algorithms rely heavily on handcrafted features, so each has a specific usage domain, and prior information is needed to decide which method to use for image matching.

Deep learning technology provides a new way to solve matching between multimodal images. One approach uses a deep learning network to extract common deep features from the visible and infrared images and performs a correlation calculation to obtain the matching position. Examples include the normalized cross-correlation network (NCCNet) [8], pseudo-Siamese CNN (PSiam) [9], CorrASL [10], Siamese U-Net [11], and multiscale fusion SiamUNet-7 (MSF SiamUNet-7) [12].

Another method is to convert the visible and infrared images into the same modality data for the matching task. After obtaining the same modality data, classical template matching methods can then be used to solve the visible and infrared images matching problem. Most image translation algorithms are built on conditional generative adversarial networks (CGANs) [13]. ThermalGAN [14], Pix2Pix-MRFFF [15], InfraGAN [16], IR-GAN [17], and edge-guided multidomain RGB-to-TIR (EMRT) [18] based on Pix2Pix [19] and CycleGAN [20] are improved for infrared image generation tasks. However, these algorithms evaluate the quality of infrared generated images from subjective and objective perspectives, without considering the downstream application reliability of tasks, such as image matching.

In our earlier work, Siamese U-Net [11] aimed to solve the matching problem between visible and synthetic aperture radar (SAR) images. IR-GAN [17] aims to obtain infrared images corresponding to visible images. This article combines the work of Siamese U-Net [11] and IR-GAN [17] to convert visible images into infrared images with high confidence in image matching tasks. Fig. 1 shows the comparison of our proposed I2I task and other I2I tasks.

Fig. 1. - Comparison of GAN-based I2I translation alone and I2I translation trained jointly with a specific task. The former generates images that are often used for direct human interaction or evaluated using human-set criteria but are rarely considered for direct feeding into downstream tasks based on neural networks or models. We propose a task-oriented approach that trains the task jointly with image translation and illustrate it using multimodal image matching.

I2I translation, as an important aspect of image generation, is growing rapidly in the field of remote sensing because many vision or image problems can be viewed as I2I translation problems [21], such as shadow removal [22], [23], image dehazing [24], [25], image segmentation [26], [27], and SAR-to-optical image translation [28], [29], [30]. These tasks have great potential in remote sensing, mainly to help people better recognize remote sensing imagery.

I2I translation has been gradually developed in the medical field [31], [32], as well as firefighting [33], [34], and autonomous driving [35], [36], which have high reliability and accuracy requirements for I2I translation. Template matching is an essential foundation for visual localization and navigation. Heterogeneous template matching can likewise be viewed as an I2I translation problem with similar requirements. Visible-to-infrared image translation is an important branch of image translation. The task of visible-to-infrared template matching can be facilitated by image translation, converting images into the same modality, thereby addressing the challenge posed by significant modality differences. Visible-to-infrared image translation can be utilized to augment infrared data for the purposes of heterogeneous image matching and multimodal data fusion research.

Based on the requirements of heterogeneous template matching applications, common I2I translation methods find it difficult to ensure the realism of the generated images and the reliability of downstream task applications. An important reason is that the training process is not integrated with the downstream tasks. Therefore, unlike current image translation methods that primarily focus on the visual effects of the generated images, we pay more attention to the reliability of the generated images in downstream tasks. We propose to incorporate the template matching task into the I2I translation training process. Our method generates high-quality images and performs well on the template matching task. In experiments, the quality of image generation and the template matching task are complementary in generative adversarial network (GAN)-generated images, and their effectiveness can be improved together.

We propose an infrared image generation method based on a CGAN [13] for matching tasks, named TMGAN. The main contributions of this article are as follows.

  1. Integrating I2I translation with multimodal image matching tasks to ensure the reliability of the generated images in downstream matching tasks.

  2. Two types of matching losses are proposed, one based on the NCC matching algorithm and the other on the idea of contrastive learning, modeling the matching problem as a contrastive learning process.

  3. Experiments are conducted on three large benchmark datasets. Our method outperforms state-of-the-art methods in terms of both image generation quality and template matching accuracy.

The rest of this article is organized as follows. Section II introduces the related work. In Section III, TMGAN is described in detail. The experiments and results are presented in Section IV followed by the discussions and analyses in Section V. Finally, Section VI concludes this article.

SECTION II.

Related Work

A. Multimodal Image Matching Based on Network Features

NCCNet [8] maximizes the contrast between true and false matching NCC values, transforming image features using a trained Siamese convolutional network, thereby improving the robustness of the algorithm. PSiam [9] is a pseudo-Siamese convolutional neural network architecture that uses eight convolutional layers and a fully connected layer to fuse features and identify corresponding patches in multimodal images. CorrASL [10] is an end-to-end deep learning framework that applies a feature-space cross-correlation operator to learned features to generate a correspondence heat map. Pixelwise deep dense features (PDDFs) [37], Siamese U-Net [11], and MSF SiamUNet-7 [12] use a Siamese network architecture to extract common features between images. PDDFs define a loss function based on the sum of squared differences between learned local pixelwise features. The Siamese U-Net generates a heat map of image pairs using a cross-correlation layer and uses a balance loss between true and predicted values to obtain trained model weights. MSF SiamUNet-7 extracts phase structure convolutional features to fully fuse local texture information at a large scale and global structure information at a small scale.

B. Visible-to-Infrared Image Translation

With the excellent performance of I2I translation based on GANs in colorization [38], [39], deblurring [40], [41], and inpainting [42], [43], it has a wide range of applications in advertising, games, entertainment, media, and medicine. Photo animation [44], [45], makeup transfer [46], [47], [48], and photograph to sketch image translation [49], [50], [51] based on GANs can provide good emotional value. However, in critical industries such as medicine and firefighting, we must consider its reliability. I2I translation of medical images offers a range of valuable applications for medical physicists. However, the scarcity of external validation studies of I2I models limits the immediate applicability of the proposed methods [31]. Template matching faces the same dilemma.

ThermalGAN [14], Pix2Pix-MRFFF [15], InfraGAN [16], IR-GAN [17], EMRT [18], and IRformer [52] convert the visible image into a corresponding infrared image. With the rapid development of diffusion models in image generation, UNIT-DDPM [53], DDIB [54], Cycle-Diffusion [55], VI-Diff [56], and CycleGAN-Turbo [57] have achieved significant performance in various image translation tasks based on diffusion models. However, these algorithms have not been widely applied to the task of translating visible images to infrared images, and they suffer from slow inference speed and high equipment requirements.

From the perspective of the realism of image translation, algorithms based on CGAN enhance the generative network or adversarial network, and they improve the texture and structure information of generated infrared images by adding new losses. Algorithms based on diffusion models utilize denoising diffusion probabilistic models to provide stable model training. Both families usually evaluate the quality of the generated infrared image with common metrics and target images intended for human viewing. Since the generated images rarely interact directly with downstream tasks based on neural networks or models, the usability of the generated images remains controversial. Unlike GANs that typically focus solely on the quality of the generated images, our approach integrates the image translation model training process with downstream tasks, incorporating the matching process into the training to enhance the reliability of the generated images.

C. Contrastive Learning

Contrastive learning is often used in self-supervised learning [58]; it aims to narrow the distance between positive examples and expand the distance between negative examples [59]. SimCLR v1 [60], SimCLR v2 [61], MoCo v1 [62], MoCo v2 [63], BYOL [64], [65], SwAV [66], and SimSiam [67] enable the network to show stronger performance without expanding the data. They generally use data augmentation to generate positive samples and images of different categories as negative samples. They enhance the network's ability to learn common features from positive samples and to distinguish differences using negative samples. Based on the contrastive learning method, CUT [68], DCLGAN [69], and EnCo [70] achieve image translation by maximizing the mutual information between the input and output image patches. We apply contrastive learning to the matching task, taking correctly matched image blocks as positive samples and incorrectly matched image blocks as negative samples, which casts template matching as a contrastive learning problem.

SECTION III.

Method

The proposed method converts visible images into infrared images to solve the problem of difficult template matching between visible and infrared images due to modal differences. Fig. 2 shows a flowchart of this GAN-based method. A generative network converts visible images into infrared images. L1 loss and GAN loss are common losses used for image generation. Combined with a multimodal template matching task, NCC loss and match patch (MP) loss are proposed. In TMGAN-NCC (see Sections III-A–III-C), an NCC coefficient is used as a kind of matching loss supervision. In TMGAN-MP (see Sections III-A, III-B, and III-D), we model template matching as a contrastive learning problem.

Fig. 2. - Algorithm structure. During the training process, L1 loss is derived by calculating L1 distance between real and fake infrared images. GAN loss is based on judgment of discriminator on truth or falseness of a fake image. We incorporate a multimodal image template match task into I2I translation and two match losses, based on NCC and MP loss. During inference, we only use visible images as input to the generative network.

A. Infrared Image Generation Structure

TMGAN is based on an improved CGAN. The generative network uses only visible images as input. The generative network has a U-Net architecture, the encoder has a ConvNeXt [71] architecture, and the decoder has a ConvNeXt-like architecture. The specific structure is shown in Fig. 3. The generative network is used to convert the visible image into a corresponding infrared image. This is referred to as the infrared image generation structure. We use PatchGAN [72] as the discriminator, which outputs a discriminant matrix to distinguish between real and generated images. PatchGAN has a lower computational burden than multiscale PatchGAN [73] and UNetGAN [74].

Fig. 3. - Infrared image generation structure. The generative network is mainly composed of three modules: ConvNeXt block, double residual block, and upsampling. The encoder of the generative network uses the ConvNeXt_tiny architecture. In the decoder, the double residual block performs upsampling by adding to the low-level features from the encoder. Finally, the upsampling module is used to obtain the infrared image.

B. Loss

The loss in this article includes LSGAN loss, L1 loss, and the proposed matching loss. LSGAN and L1 loss are used to promote the authenticity of the generated image, and the matching loss ensures a high degree of reliability in the matching task:
\begin{align*} l_{\text{CGAN}}(G,D) = & E_{x,y \sim P_{\text{data}}}[(D(x,y))^{2}]\\ & + E_{x \sim P_{\text{data}}}[(1-D(x,G(x,z)))^{2}] \tag{1}\\ l_{L1} =& ||y-G(x,z)||_{1} \tag{2}\\ \mathcal {L} =& \mu l_{\text{CGAN}}(G,D) + \lambda l_{L1} + \varphi l_{M} \tag{3} \end{align*}
where $\mathcal {L}$ is the total loss, $l_{\text{CGAN}}(G, D)$ is the loss of the CGAN, $l_{L1}$ is the L1 loss, and $l_{M}$ is the match loss; $E(\cdot)$ denotes the expected value; $x \sim P_{\text{data}}$ denotes data taken from the visible images; $x,y \sim P_{\text{data}}$ denotes data taken from the visible images and the corresponding real infrared images; $y$ is the label information (real infrared image), and $D(x,y)$ is the discriminator's estimate of the probability that the real data are real; $G(x,z)$ is the target domain (infrared) image generated from the source domain (visible) image; $D(x, G(x,z))$ is the discriminator's estimate of the probability that the generated data are real; and $\mu$, $\lambda$, and $\varphi$ are the respective weights for LSGAN, L1, and matching loss. In this article, two types of matching losses are included, namely NCC loss and MP loss.
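As an illustration of how (1)–(3) combine, the following NumPy sketch evaluates the three terms on arrays. It is a sketch under stated assumptions rather than the authors' training code: the function names are hypothetical, the L1 term is averaged over pixels (an assumption, since (2) does not specify a normalization), and the default weights are the values reported in Section IV-B.

```python
import numpy as np

def lsgan_term(d_real, d_fake):
    """Eq. (1) as written in the text: E[D(x,y)^2] + E[(1 - D(x,G(x,z)))^2]."""
    return float(np.mean(d_real ** 2) + np.mean((1.0 - d_fake) ** 2))

def l1_term(real_ir, fake_ir):
    """Eq. (2); averaged over pixels (assumption) so the value is size independent."""
    return float(np.mean(np.abs(real_ir - fake_ir)))

def total_loss(d_real, d_fake, real_ir, fake_ir, match_loss,
               mu=1.0, lam=100.0, phi=100.0):
    """Eq. (3): weighted sum of the LSGAN, L1, and matching terms.

    match_loss is a precomputed scalar (NCC loss or MP loss).
    The default weights are the settings reported in Section IV-B.
    """
    return (mu * lsgan_term(d_real, d_fake)
            + lam * l1_term(real_ir, fake_ir)
            + phi * match_loss)
```

With the reported weights, the L1 and matching terms are each scaled 100 times more strongly than the adversarial term, so reconstruction and matching fidelity dominate the objective.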

C. NCC Loss

NCC is robust to changes in lighting and other imaging conditions, is insensitive to sudden changes in local pixel values, and has been widely used in template matching. NCC can be expressed as
\begin{equation*} \text{NCC}(T,I) = \frac{\sum _{p} \sum _{q} (T(p,q)-\mu _{T})(I(p,q)-\mu _{I})}{\sqrt{\sum _{p} \sum _{q} (T(p,q)-\mu _{T})^{2} \sum _{p} \sum _{q} (I(p,q)-\mu _{I})^{2}}} \tag{4} \end{equation*}
where $T$ is the template image, $I$ is the search image (the region covered by the template), $(p,q)$ are pixel coordinates, and $\mu _{T}$, $\mu _{I}$ are the respective mean values of $T$ and $I$. The correlation coefficient takes values ranging from –1 to 1. The closer it is to 1, the more similar the images.
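To make the use of NCC in template matching concrete, the sketch below slides a template over a search image and returns the position with the highest NCC coefficient. It uses the standard zero-mean NCC coefficient and a brute-force loop for clarity; the function names are illustrative, and practical implementations typically use FFT-based or library routines instead.

```python
import numpy as np

def ncc(template, window):
    """Zero-mean normalized cross-correlation of two equally sized patches."""
    t = template.astype(np.float64) - template.mean()
    w = window.astype(np.float64) - window.mean()
    denom = np.sqrt((t ** 2).sum() * (w ** 2).sum())
    # A constant patch has zero variance; report no correlation in that case.
    return float((t * w).sum() / denom) if denom > 0 else 0.0

def match_template(search, template):
    """Return ((row, col), score) of the best NCC match of template in search."""
    H, W = search.shape
    h, w = template.shape
    best_score, best_pos = -np.inf, (0, 0)
    for r in range(H - h + 1):
        for c in range(W - w + 1):
            score = ncc(template, search[r:r + h, c:c + w])
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_pos, best_score
```

When the template is cut from the search image itself, the best score is 1 at the crop position; in the cross-modal setting the peak is lower and may be displaced, which is the difficulty the translation step is meant to reduce.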

In visible-to-infrared image translation, the closer the NCC coefficient between the generated and real IR images is to 1, the better the generated image performs in the template matching task. Based on (4), the match loss can be defined as
\begin{equation*} l_{\text{NCC}} = 1 - \frac{\sum _{p} \sum _{q} (f(p,q)-\mu _{f})(r(p,q)-\mu _{r})}{\sqrt{\sum _{p} \sum _{q} (f(p,q)-\mu _{f})^{2} \sum _{p} \sum _{q} (r(p,q)-\mu _{r})^{2} + \text{eps}}} \tag{5} \end{equation*}
where $f$ is a generated IR image, $r$ is a real IR image, and $\mu _{f}$, $\mu _{r}$ are the respective mean values of $f$ and $r$. To prevent gradient explosion due to a zero denominator, a small offset eps is added to the denominator of (5). We refer to the method of adding $l_{\text{NCC}}$ as TMGAN-NCC.
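A minimal NumPy sketch of the NCC loss follows, computed over a whole image pair with the standard NCC coefficient. The value eps = 1e-6 is an assumption (the text does not state the value used for the NCC loss), and a training implementation would use a differentiable tensor library rather than NumPy.

```python
import numpy as np

def ncc_loss(fake_ir, real_ir, eps=1e-6):
    """1 - NCC(fake, real), with eps inside the square root as in Eq. (5).

    eps = 1e-6 is an assumed value; it keeps the denominator nonzero
    when one of the images is constant.
    """
    f = fake_ir.astype(np.float64) - fake_ir.mean()
    r = real_ir.astype(np.float64) - real_ir.mean()
    denom = np.sqrt((f ** 2).sum() * (r ** 2).sum() + eps)
    return float(1.0 - (f * r).sum() / denom)
```

The loss is near 0 for a perfectly correlated pair, 1 for an uncorrelated or constant output, and near 2 for an inverted image, so it penalizes contrast reversal more heavily than a lack of structure.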

D. MP Loss

NCC loss is a simple way to incorporate matching loss into the network training. We use another type of matching loss based on Siamese U-Net and contrastive learning. When constructing a match loss, a simple idea is to directly apply the loss of Siamese U-Net to image generation, but naively using the given hard matching labels is detrimental to image generation. The ideal matching label is shown in Fig. 4(a), where the template image has different correlations at different positions on the search image. Simply using 0 and 1 as matching labels introduces beneficial information at the correct matching position but harmful information at wrong matching positions, which is not conducive to image generation and can lead to the collapse of the GAN.

Fig. 4. - Similarity calculation using three labels. The x and y axes in the graph represent the matching positions, and the z-axis indicates the image correlation at that position, with values in the range [–1, 1]. The higher the value, the stronger the correlation. (a) Siamese U-Net match label. (b) Real image normalized correlation. (c) Fake image normalized correlation.

We consider using the normalized correlation of the template image on the real search image as the matching label, as shown in Fig. 4(b), which is continuous and authentic and intuitively advantageous for image generation. However, the normalized correlation of the template image on the fake search image [see Fig. 4(c)] has a high similarity with the real normalized correlation matching label, which makes it difficult to obtain more useful information in the image generation process.

Contrastive learning enhances the power of the network by learning the common features between positive samples and the differences between negative samples. Combining the idea of contrastive learning and the Siamese U-Net, the pair formed by the template image and the image block at the correct matching position on the search image is taken as a positive sample, which is equivalent to 1 in the matching label, and the image block at a wrong matching position is taken as a negative sample, which is equivalent to 0. The result is a pseudo 0–1 ideal matching label. Compared with the label in Fig. 4(a), the label constructed using contrastive learning introduces a weaker constraint at the wrong matching positions, which avoids the collapse of the GAN. Compared with the label in Fig. 4(b), it avoids the problem that little further information can be learned from the label once the generative network has acquired a certain generative ability.

The MP loss can be expressed as
\begin{equation*} l_{\text{MP}} = \sum _{i=1}^{m} \frac{\text{Sim}(I_{i},I_{i}^{+})}{\sum _{j=1}^{n} \text{Sim}(I_{i},I_{i,j}^{-})} \tag{6} \end{equation*}
where $l_{\text{MP}}$ is the MP loss in this article, and $\text{Sim}(\cdot)$ is the similarity term, whose value is smaller the more similar its two arguments are, and vice versa. We use cosine similarity. $I_{i}$ refers to the template image. $I_{i}^{+}$ refers to the image block (positive sample) at the correct matching position in the search image. $I_{i,j}^{-}$ refers to the image block (negative sample) at the $j$th mismatched position in the search image. $m$ is the total number of matches performed, and $n$ is the number of blocks selected at mismatched locations. In this article, $m$ and $n$ are set to 5 and 20, respectively. We refer to the method of adding $l_{\text{MP}}$ as TMGAN-MP. Algorithm 1 illustrates the operational process of $l_{\text{MP}}$.

Algorithm 1: PyTorch-Style Pseudocode for $l_{\text{MP}}$.

Input: X, fake infrared image; Y, real infrared image

Output: Out, value of MP loss

Params: eps, epsilon, 1e-6; m, positive sample number; n, negative sample number

Operator: S_+, randomly select a positive sample; S_-, randomly select a negative sample; CS, calculate similarity

Out = 0

for i in range(m):

    loss_+ = CS(S_+(X, Y))

    loss_- = 0

    for j in range(n):

        loss_- += CS(S_-(X, Y))

    Out += loss_+ / (loss_- + eps)
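The loop in Algorithm 1 can be sketched in runnable form as follows. Several details are assumptions that the text does not fix: the template is cropped from the real infrared image and compared with blocks of the fake image, the patch size is arbitrary, negative positions are drawn uniformly at random, and the similarity term is implemented as one minus cosine similarity so that a closer pair gives a smaller value, consistent with the description of $\text{Sim}(\cdot)$. NumPy is used for readability; a training implementation would need a differentiable tensor library.

```python
import numpy as np

def cosine_distance(a, b, eps=1e-6):
    """1 - cosine similarity of two patches (assumed form of CS)."""
    a = a.ravel().astype(np.float64)
    b = b.ravel().astype(np.float64)
    cos = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
    return 1.0 - cos

def mp_loss(fake_ir, real_ir, patch=8, m=5, n=20, eps=1e-6, rng=None):
    """Sketch of Eq. (6): sum over m matches of positive / sum of n negatives.

    The image must be larger than the patch so that mismatched
    positions exist.
    """
    rng = np.random.default_rng() if rng is None else rng
    H, W = real_ir.shape
    out = 0.0
    for _ in range(m):
        # S_+: template from the real image, block at the same place in the fake image.
        r = int(rng.integers(0, H - patch + 1))
        c = int(rng.integers(0, W - patch + 1))
        template = real_ir[r:r + patch, c:c + patch]
        loss_pos = cosine_distance(template, fake_ir[r:r + patch, c:c + patch])
        loss_neg = 0.0
        for _ in range(n):
            # S_-: block of the fake image at a position other than (r, c).
            while True:
                rr = int(rng.integers(0, H - patch + 1))
                cc = int(rng.integers(0, W - patch + 1))
                if (rr, cc) != (r, c):
                    break
            loss_neg += cosine_distance(template, fake_ir[rr:rr + patch, cc:cc + patch])
        out += loss_pos / (loss_neg + eps)
    return out
```

Under these assumptions the loss decreases when the generated block at the correct position resembles the template and the blocks at wrong positions do not, which mirrors the pseudo 0–1 label described above.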

SECTION IV.

Experiments and Results

We describe the metrics and datasets used in the study and present the results. All experiments were conducted on a machine with an Intel Core i9-10980XE CPU @ 3 GHz and an Nvidia RTX 3090 GPU. We conducted an experiment to evaluate the quality of infrared image generation and a template matching experiment using the generated infrared image as a template image on public datasets. In addition, we conducted point feature matching and infrared target detection experiments on the generated infrared images.

A. Datasets

The KAIST dataset [75] contains daily traffic scenes of schools, streets, and villages, which are generally used for pedestrian detection and are important in the field of autonomous driving. The PointGrey Flea3 color camera produces an image of size 640 × 480. The FLIR-A35 infrared camera produces a 320 × 256 image. After camera calibration, a 640 × 512 image pair is obtained. Template matching experiments on the KAIST dataset enabled us to check the usability of the proposed method in autonomous driving. In the experiment, images taken in the daytime were sampled to reduce the number of repeated image pairs. A 256 × 256 image was randomly cut from a 640 × 512 search image for use as a template image. We used 5000 pairs for training and 1247 for testing.
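The template sampling described for KAIST can be sketched as below: a square crop is taken at a uniformly random position inside the search image, and the crop offset is kept as the ground-truth matching position. The function name and the choice to return the offset are illustrative assumptions.

```python
import numpy as np

def random_template(search, size=256, rng=None):
    """Crop a size x size template at a random position inside search.

    Returns the crop and its (top, left) offset, which can serve as the
    ground-truth position when scoring template matching.
    """
    rng = np.random.default_rng() if rng is None else rng
    H, W = search.shape[:2]
    top = int(rng.integers(0, H - size + 1))
    left = int(rng.integers(0, W - size + 1))
    return search[top:top + size, left:left + size], (top, left)
```

For a 640 × 512 KAIST image this yields offsets in [0, 256] vertically and [0, 384] horizontally.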

The VEDAI dataset [76] contains spring images collected by the AGRC satellite in Utah, USA, in 2012, including 1024 × 1024 and 512 × 512 images, where a pixel represents an area of 12.5 cm × 12.5 cm and 25 cm × 25 cm, respectively, and the infrared image is near-infrared. The dataset's nine categories are plane, boat, camping car, car, pick-up, tractor, truck, van, and other. VEDAI is important in remote sensing because of its higher resolution and more complex scenes. We verified the usability of the proposed method for remote sensing by performing template matching experiments on the VEDAI dataset, using 1024 × 1024 search images and randomly cropped 256 × 256 images as template images, with 1018 images for training and 250 images for testing.

The AVIID [77] is a dataset for converting aerial visible images into infrared images, containing more than 3000 pairs of visible and infrared images. The dataset is composed of three subdatasets, and the AVIID-3 subset was selected for the experiment due to its inclusion of 1280 pairs of complex scene images, captured at various altitudes and angles, which provides multiscale targets and a variety of backgrounds. The image resolution is 480 × 480, with a total of 1024 image pairs used for training, and 256 pairs for testing in the experiment. In the template matching experiment, the true infrared images were cropped to a size of 240 × 240 and used as template images. Experiments using the AVIID data can evaluate the potential application of the method in the field of low-altitude UAVs.

B. Implementation Details

The proposed improvements are based on the Pix2Pix framework. All experiments were conducted using public code and datasets for training and testing under the same experimental environment, to provide a fair comparison. For Pix2Pix [19], ThermalGAN [14], Pix2Pix-MRFFF [15], InfraGAN [16], IR-GAN [17], TMGAN-NCC, and TMGAN-MP, the Adam optimizer was employed, with $\beta _{1}$ and $\beta _{2}$ set to 0.5 and 0.999, respectively. Training ran for a total of 200 epochs to ensure model convergence. The learning rates of the generative and adversarial networks remained fixed at 0.0002 and 0.000002, respectively, for the first 100 epochs, and then both decreased linearly to zero. All networks were trained from scratch with random initialization. We set $\mu$ = 1, $\lambda$ = 100, and $\varphi$ = 100. Specifically, in MP loss, we set the hyperparameters $m$ and $n$ to 5 and 20, respectively, and use cosine similarity as the $\text{Sim}(\cdot)$ operation. IRformer, EMRT, and CycleGAN-turbo follow the setups of [18], [52], and [57].
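The learning-rate schedule described above can be written as a small function. This is a sketch under one assumption: the rate decays linearly from its base value at epoch 100 to exactly zero at epoch 200. The text does not give the exact decay formula, and public Pix2Pix code uses a slightly different endpoint convention.

```python
def learning_rate(epoch, base_lr, n_fixed=100, n_decay=100):
    """Constant for n_fixed epochs, then linear decay to zero over n_decay epochs.

    epoch is 0-based. The endpoint convention (zero reached at
    epoch n_fixed + n_decay) is an assumption.
    """
    if epoch < n_fixed:
        return base_lr
    return base_lr * max(0.0, 1.0 - (epoch - n_fixed) / n_decay)

# Generator and discriminator base rates as reported in the text.
generator_lr = [learning_rate(e, 2e-4) for e in range(200)]
discriminator_lr = [learning_rate(e, 2e-6) for e in range(200)]
```

Both networks share the same schedule shape and differ only in the base rate, so the ratio between the two rates stays at 100 during the decay phase.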

C. Image Quality Evaluation Metrics

Image quality was measured by the peak signal-to-noise ratio (PSNR) [78], structural similarity index measure (SSIM), multiscale structural similarity index measure (MS-SSIM) [79], learned perceptual image patch similarity (LPIPS) [80], and Fréchet inception distance (FID) [81]. We briefly explain these below. PSNR and SSIM are widely used in image quality assessment. PSNR is given by
\begin{equation*} \text{PSNR}(I,S) = 10\log _{10} \frac{(2^{n}-1)^{2}}{\text{MSE}(I,S)} \tag{7} \end{equation*}
where
\begin{equation*} \text{MSE}(I,S) = \frac{1}{MN} \sum _{p=1}^{M}\sum _{q=1}^{N} (I(p,q)-S(p,q))^{2} \tag{8} \end{equation*}
where $I$ is a real IR image, $S$ is a simulated IR image, $n$ is the number of bits per pixel, and $M$ and $N$ are the respective height and width of the image. SSIM is established using image brightness, contrast, and structural similarity and can be defined as
\begin{equation*} \begin{aligned} \text{SSIM}(I,S) &= \frac{2 \mu _{I} \mu _{S} + C_{1}}{\mu _{I}^{2} + \mu _{S}^{2} + C_{1}} \frac{2 \sigma _{I} \sigma _{S} + C_{2}}{\sigma _{I}^{2} + \sigma _{S}^{2} + C_{2}} \frac{\sigma _{IS} + C_{3}}{\sigma _{I} \sigma _{S} + C_{3}} \\ &= \frac{2 \mu _{I} \mu _{S} + C_{1}}{\mu _{I}^{2} + \mu _{S}^{2} + C_{1}} \frac{2 \sigma _{IS} + C_{2}}{\sigma _{I}^{2} + \sigma _{S}^{2} + C_{2}} \end{aligned} \tag{9} \end{equation*}
where $C_{1} = (0.01L)^{2}$, $C_{2} = (0.03L)^{2}$, and $C_{3}=C_{2}/2$ are constants that keep the divisions numerically stable; $\mu _{I}$, $\mu _{S}$ are the respective means of $I$ and $S$; $\sigma _{I}$, $\sigma _{S}$ are their respective standard deviations; $\sigma _{IS}$ is the covariance of $I$ and $S$; and $L$ is the dynamic range of the pixel values.
MS-SSIM maintains stable performance for images of different resolutions and is defined as
\begin{equation*} \text{MS-SSIM}(I,S) = [b_{M}(I,S)]^{\alpha _{M}} \prod \limits _{j=1}^{M} [c_{j}(I,S)]^{\beta _{j}} [s_{j}(I,S)]^{\gamma _{j}} \tag{10} \end{equation*}
where $b_{M}(I, S)$, $c_{j}(I, S)$, and $s_{j}(I, S)$ are the brightness, contrast, and structural similarity, respectively; and $\alpha _{M}$, $\beta _{j}$, and $\gamma _{j}$ are weight coefficients for $b_{M}(I,S)$, $c_{j}(I,S)$, and $s_{j}(I,S)$, respectively. To simplify parameter selection, $\alpha _{j} = \beta _{j} = \gamma _{j}$, $\sum _{j=1}^{M} \gamma _{j} = 1$, and $M=5$.
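For a concrete reading of the PSNR and SSIM definitions, the following NumPy sketch evaluates both on a pair of images. Two simplifications are assumed: the images are 8-bit by default, and SSIM is computed once from global image statistics, whereas common implementations average SSIM over local windows. The function names are illustrative.

```python
import numpy as np

def psnr(real, fake, bits=8):
    """PSNR in dB for images with the given bit depth."""
    mse = np.mean((real.astype(np.float64) - fake.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    peak = 2 ** bits - 1
    return float(10.0 * np.log10(peak ** 2 / mse))

def global_ssim(real, fake, bits=8):
    """SSIM from whole-image statistics (no sliding window; a simplification)."""
    L = 2 ** bits - 1
    c1, c2 = (0.01 * L) ** 2, (0.03 * L) ** 2
    x = real.astype(np.float64)
    y = fake.astype(np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    contrast_structure = (2 * cov + c2) / (var_x + var_y + c2)
    return float(luminance * contrast_structure)
```

A uniform error of one gray level on an 8-bit image gives an MSE of 1 and therefore a PSNR of 10 log10(255²), about 48.13 dB, which is a convenient sanity check.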

LPIPS measures the Euclidean distance between two image feature vectors and is calculated as
\begin{align*} & \text{LPIPS}(I,S) \\ &= \sum _{l}^{L}{\frac{1}{H_{l}W_{l}}} \sum _{H_{l},W_{l}}[f_{l}(I)_{H_{l},W_{l}} - f_{l}(S)_{H_{l},W_{l}}]^{2} \times \omega _{l} \tag{11} \end{align*}
where $f_{l}(\cdot)$ denotes the features of layer $l$, of spatial size $H_{l} \times W_{l}$, and $\omega _{l}$ is the weight of that layer. To calculate this, comparative features were obtained from a convolutional neural network backbone pretrained on ImageNet (the AlexNet model was used in the experiment). Because it relies on deep network features, LPIPS agrees more closely with human judgments of the similarity between target and reference patches than traditional metrics, such as SSIM and PSNR.

FID measures the similarity between two sets of images by comparing the statistics of their deep features; that is, it calculates the distance between the feature distributions of the real and generated images. The features are extracted using the Inception v3 image classification model. FID scores are often used to assess the quality of images generated by GANs, with lower scores correlating strongly with higher quality images. FID is 0 in the best case, indicating that both sets of images are identical. FID is calculated as \begin{equation*} \text{FID}(I,S) = \sqrt{||\mu _{I} - \mu _{S}||^{2} + \text{Tr}(\sigma _{I} + \sigma _{S} - 2 \sqrt{\sigma _{I} \sigma _{S}})} \tag{12} \end{equation*} where $\mu _{I}$ and $\mu _{S}$ are the respective mean vectors of the real and generated data distributions, with respective covariance matrices $\sigma _{I}$ and $\sigma _{S}$; Tr is the trace of the matrix; and $||\cdot ||$ is the Euclidean ($\ell _{2}$) norm of the vector.
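
A minimal NumPy sketch of Eq. (12) is given below, assuming the Inception v3 features have already been extracted into arrays of shape (number of images, feature dimension). The function names are hypothetical. Eq. (12) as printed applies a square root to the whole expression, whereas widely used FID implementations report the value without that outer root, so the sketch exposes both quantities. The trace of the matrix square root is obtained from the eigenvalues of the covariance product, which is one of several possible ways to compute it.

```python
import numpy as np


def frechet_distance_squared(feat_real, feat_gen):
    """||mu_I - mu_S||^2 + Tr(S_I + S_S - 2 (S_I S_S)^(1/2)).

    feat_real, feat_gen: arrays of shape (num_images, feature_dim) holding
    precomputed features (assumption: feature extraction happens elsewhere).
    """
    feat_real = np.asarray(feat_real, dtype=np.float64)
    feat_gen = np.asarray(feat_gen, dtype=np.float64)
    mu_i, mu_s = feat_real.mean(axis=0), feat_gen.mean(axis=0)
    cov_i = np.atleast_2d(np.cov(feat_real, rowvar=False))
    cov_s = np.atleast_2d(np.cov(feat_gen, rowvar=False))
    # For positive semidefinite covariances, the eigenvalues of S_I S_S are
    # real and non-negative, and Tr((S_I S_S)^(1/2)) is the sum of their roots.
    eigenvalues = np.linalg.eigvals(cov_i @ cov_s)
    trace_sqrt = np.sqrt(np.clip(eigenvalues.real, 0.0, None)).sum()
    mean_term = np.sum((mu_i - mu_s) ** 2)
    value = mean_term + np.trace(cov_i) + np.trace(cov_s) - 2.0 * trace_sqrt
    return float(max(value, 0.0))  # clamp tiny negative rounding errors


def fid_eq12(feat_real, feat_gen):
    """Eq. (12) as printed: square root of the expression above."""
    return float(np.sqrt(frechet_distance_squared(feat_real, feat_gen)))
```

When both feature sets are the same, the two functions return values close to 0; shifting every feature by a constant leaves the covariance term unchanged and increases only the mean term.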

D. Simulation Evaluation of Infrared Image Generation Quality

We compared our method against eight algorithms. Pix2Pix [19] established the fundamental framework for paired image translation. ThermalGAN [14] introduced a temperature vector as input; to ensure a fair comparison, it used only visible images as input in our experiments. Pix2Pix-MRFFF [15] proposed a multireceptive field feature fusion network. InfraGAN [16] and IR-GAN [17] improve the image quality by improving the expressive power of the generative and adversarial networks and by improving the edge information of the generated image using SSIM loss and gradient vector loss, respectively. EMRT [18] proposed an edge-guided multidomain RGB-to-TIR image translation model. IRformer [52] constructed an implicit multispectral transformer. CycleGAN-turbo [57] uses the LoRA technique to fine-tune a diffusion model to adapt it to new tasks.

Visual inspection of the IR images generated on the three datasets supports the following conclusions. On the KAIST dataset (see Fig. 5), the infrared images generated by our method have clearer texture information and more realistic gray information (red box, Fig. 5). In addition, they retain detail information more completely (blue box, Fig. 5). All algorithms perform better on the VEDAI dataset, which is relatively simple for the infrared generation task because its infrared images have a simple structure and little texture information. On the VEDAI dataset (see Fig. 6), the proposed algorithm generates more accurate grayscale information (red box, Fig. 6). On the AVIID dataset (see Fig. 7), the proposed algorithm generates infrared images that retain more detailed information (red box, Fig. 7). The Pix2Pix, ThermalGAN, and Pix2Pix-MRFFF algorithms all perform poorly on the three datasets. They were designed for visible-to-infrared image translation on 256 × 256 images, which results in poor performance on high-resolution images. The InfraGAN and IR-GAN algorithms focus on multiscale information of images, resulting in better quality of generated images. EMRT uses an edge-guided approach, but the edges extracted from visible and infrared images are not fully consistent, resulting in blurred edges in the images produced by EMRT and the poorest performance across all datasets. The IRformer model has a lightweight design, but since the method is not based on GANs, the generated images are somewhat blurry. Because visible-to-infrared conversion is complex, the grayscale information in the images generated by CycleGAN-turbo is not well preserved. Overall, the image generation quality of TMGAN-NCC and TMGAN-MP is somewhat superior to that of IR-GAN, on which TMGAN is built. We performed a quantitative evaluation to further analyze the performance of TMGAN.

Fig. 5. - Examples of infrared images generated on the KAIST dataset by our algorithm. (a) Optical image. (b) Infrared image. (c) Pix2Pix. (d) ThermalGAN. (e) Pix2Pix-MRFFF. (f) InfraGAN. (g) IR-GAN. (h) EMRT. (i) IRformer. (j) CycleGAN-turbo. (k) TMGAN-NCC. (l) TMGAN-MP.

Fig. 6. - Examples of infrared images generated on the VEDAI dataset by our algorithm. (a) Optical image. (b) Infrared image. (c) Pix2Pix. (d) ThermalGAN. (e) Pix2Pix-MRFFF. (f) InfraGAN. (g) IR-GAN. (h) EMRT. (i) IRformer. (j) CycleGAN-turbo. (k) TMGAN-NCC. (l) TMGAN-MP.

Fig. 7. - Examples of infrared images generated on the AVIID dataset by our algorithm. (a) Optical image. (b) Infrared image. (c) Pix2Pix. (d) ThermalGAN. (e) Pix2Pix-MRFFF. (f) InfraGAN. (g) IR-GAN. (h) EMRT. (i) IRformer. (j) CycleGAN-turbo. (k) TMGAN-NCC. (l) TMGAN-MP.

We use common indicators to evaluate the quality of the generated images, as shown in Table I, where the best values are in boldface. Our methods (TMGAN-NCC and TMGAN-MP) achieved the best results for infrared image generation. In addition, we have summarized the model parameter count and training time on the KAIST dataset for all algorithms in Table II. The TMGAN algorithm has a moderate number of model parameters, and its training time is acceptable. TMGAN-NCC introduces the NCC coefficient as a form of matching loss supervision, whereas TMGAN-MP models template matching as a contrastive learning problem. The more similar the generated infrared image is to the real infrared image, the higher the matching success rate, so the matching loss also improves the quality of the generated infrared images.

TABLE I Infrared Image Generation Quality Evaluation Results
TABLE II Network Parameter Count and Training Time Consumption on KAIST Dataset

E. Matching Evaluation Metrics

We compared various approaches to assess performance in template matching and point feature matching. Template matching accuracy is defined by the mean L2 error and the percentage of match pairs that have an L2 distance within one pixel. Template matching precision is the standard deviation of matching L2 error and is also called mean average precision ($\text{mAP}_{m}$). The point feature matching experimental results are evaluated using the matching end point error (EPE), which is defined as \begin{equation*} \text{EPE} = \frac{1}{\mathcal {N}} \sum _{i=1}^{\mathcal {N}} \sqrt{(p_{i} - \hat{p}_{i})^{2} + (q_{i} - \hat{q}_{i})^{2}} \tag{13} \end{equation*} where $\mathcal {N}$ represents the number of successfully matched points, $(p_{i}, q_{i})$ represents the position of the point feature on the target image, and $(\hat{p}_{i}, \hat{q}_{i})$ represents the position of the corresponding point feature after transformation by the true homography matrix.
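
The metrics above can be sketched in NumPy as follows. All function names are hypothetical, the one-pixel tolerance is taken as "less than or equal to 1", and the homography is assumed to be a 3 × 3 matrix acting on (p, q) coordinates; none of these details are specified in the article.

```python
import numpy as np


def l2_errors(points_a, points_b):
    """Per-pair Euclidean distance between two (N, 2) coordinate arrays."""
    a = np.asarray(points_a, dtype=np.float64)
    b = np.asarray(points_b, dtype=np.float64)
    return np.linalg.norm(a - b, axis=1)


def template_matching_summary(pred_xy, true_xy, tol=1.0):
    """Mean L2 error, percentage of matches within `tol` pixels, and the
    standard deviation of the L2 error."""
    err = l2_errors(pred_xy, true_xy)
    return {
        "mean_l2": float(err.mean()),
        "pct_within_tol": float(100.0 * np.mean(err <= tol)),
        "std_l2": float(err.std()),
    }


def apply_homography(H, points):
    """Map (N, 2) points through a 3x3 homography using homogeneous coordinates."""
    pts = np.asarray(points, dtype=np.float64)
    homogeneous = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homogeneous @ np.asarray(H, dtype=np.float64).T
    return mapped[:, :2] / mapped[:, 2:3]


def end_point_error(matched_points, reference_points):
    """EPE of Eq. (13): mean distance over the successfully matched points."""
    return float(l2_errors(matched_points, reference_points).mean())
```

In this sketch, the reference positions for `end_point_error` would come from `apply_homography` applied to the source keypoints with the ground-truth homography.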

F. Simulation Evaluation of Template Matching

NCC was used to evaluate the template matching effect between the generated infrared image and the real template image. The proposed method was compared with Pix2Pix, ThermalGAN, Pix2Pix-MRFFF, InfraGAN, IR-GAN, EMRT, IRformer, and CycleGAN-turbo. Index values were calculated for all IR images generated by each model. We also compared the traditional NCC matching algorithm and the Siamese-UNet matching algorithm based on the Siamese network, both of which use visible and infrared images as input for matching tasks. Statistical results for the KAIST, VEDAI, and AVIID datasets, including averages, are shown in Table III and suggest that our algorithm performed best across all three metrics.
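
For readers unfamiliar with NCC template matching, a brute-force sketch is shown below. It uses the zero-mean variant of NCC and an exhaustive sliding-window search, both of which are assumptions made for illustration; the exact NCC formulation and search strategy used in the experiments are not restated in this section, and the function name is hypothetical.

```python
import numpy as np


def ncc_match(image, template):
    """Exhaustive zero-mean NCC search over a 2-D grayscale image.

    Returns the (row, col) of the best-matching top-left corner and the
    full score map, whose values lie in [-1, 1].
    """
    image = np.asarray(image, dtype=np.float64)
    template = np.asarray(template, dtype=np.float64)
    th, tw = template.shape
    ih, iw = image.shape
    t = template - template.mean()
    t_norm = np.sqrt(np.sum(t ** 2))
    scores = np.full((ih - th + 1, iw - tw + 1), -1.0)
    for r in range(scores.shape[0]):
        for c in range(scores.shape[1]):
            window = image[r:r + th, c:c + tw]
            w = window - window.mean()
            denom = t_norm * np.sqrt(np.sum(w ** 2))
            if denom > 0:  # flat windows keep the default score of -1
                scores[r, c] = np.sum(w * t) / denom
    best = np.unravel_index(np.argmax(scores), scores.shape)
    return (int(best[0]), int(best[1])), scores
```

Because the score is normalized by the mean and energy of each window, a template that differs from the image only by a gain and an offset still peaks at the correct location, which is the property that makes NCC attractive for gray-level comparisons.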

TABLE III Template Matching Evaluation Results

Comparing the image quality and matching results in Tables I and III, respectively, shows that the proposed method performs best in both cases. Visually, the image texture generated by this algorithm is clearer, and the gray information is more realistic. According to Table III, the best matching accuracy and precision are obtained when the visible image is converted into the corresponding infrared image by the proposed method. The matching effect of the proposed method is even better than that of our previously proposed deep learning-based multimodal matching algorithm (Siamese-UNet), which shows that the generated infrared image has high reliability in downstream matching tasks and suggests that, with our joint training method, the template matching and image generation tasks can benefit each other.

G. Simulation Evaluation of Point Feature Matching

The scale-invariant feature transform (SIFT) algorithm [82] was employed to evaluate the performance of point feature matching between the generated infrared images and the real ones. Our proposed method was compared against eight other methods. The statistical outcomes from the KAIST, VEDAI, and AVIID datasets, including the mean values, are displayed in Table IV. These results demonstrate that our algorithm achieved the best score in the EPE metric. The results from Tables III and IV collectively demonstrate that the infrared images generated by our proposed method exhibit the best performance in both template matching and point feature matching tasks.

TABLE IV Point Feature Matching Evaluation Results

H. Object Detection Evaluation Metrics

Infrared object detection is an important research area in the field of remote sensing [83], [84], [85], [86]. We compared the results of different methods on the VEDAI dataset. We used common indicators in object detection, including precision (P), recall (R), and mean average precision (mAP). We paid special attention to mAP50 and mAP50-95. mAP50 is the mAP computed at an IoU threshold of 0.50, while mAP50-95 is the average of the ten mAP values computed at IoU thresholds from 0.50 to 0.95 in steps of 0.05.
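
The quantities behind these indicators can be sketched as follows. The helper names and the (x1, y1, x2, y2) box convention are assumptions, and the per-threshold mAP values are taken as given inputs, since computing them requires a full detection evaluation that is outside the scope of this sketch.

```python
import numpy as np

# The ten IoU thresholds 0.50, 0.55, ..., 0.95 used by mAP50-95.
IOU_THRESHOLDS = np.linspace(0.50, 0.95, 10)


def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(tp, fp, fn):
    """Precision and recall from true positive, false positive, and false negative counts."""
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall


def map50_95(map_per_threshold):
    """Average of the mAP values obtained at the ten IoU thresholds."""
    values = np.asarray(map_per_threshold, dtype=np.float64)
    if values.shape != IOU_THRESHOLDS.shape:
        raise ValueError("expected one mAP value per IoU threshold")
    return float(values.mean())
```

A detection counts as a true positive at a given threshold when its IoU with a ground-truth box reaches that threshold, so mAP50-95 rewards tighter localization than mAP50.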

I. Simulation Evaluation of Object Detection

Infrared small target detection has always been a challenging issue in the field of remote sensing. The UIU-Net [83] proposes a simple and effective “U-Net in U-Net” framework. The ISNet [84] introduces a Taylor finite difference-inspired edge block and a two-orientation attention aggregation block. The GCI-Net [85] presents a novel Gaussian curvature inspired network. IRSAM [86] enables the application of an advanced segment anything model for the detection of small infrared targets. We performed infrared object detection using the common YOLOv8 [87]. The results are presented in Table V, summarizing the performance metrics obtained in the infrared object detection task. The table provides a quantitative evaluation, illustrating the effectiveness of our generated images in enhancing the precision and reliability of infrared object detection. The results from Tables I, III, and V collectively indicate that the infrared images generated by our proposed method are highly reliable in downstream tasks and can be effectively used for visible-infrared heterogeneous template matching and infrared object detection tasks.

TABLE V Comparison of the Performance of Different Models in Infrared Object Detection

J. Ablation Study

The effectiveness of the matching loss was verified through ablation experiments on the KAIST dataset. In the baseline, UConvNeXt was used as the generation network, PatchGAN as the adversarial network, and LSGAN loss and L1 loss as the loss functions. The NCC algorithm was used to evaluate the matching results, and five indicators were used to evaluate the quality of image generation.

Table VI shows the image matching and generation results after adding the two matching losses ($l_{\text{NCC}}$ and $l_{\text{MP}}$). Compared with the baseline, both $l_{\text{NCC}}$ and $l_{\text{MP}}$ produce a great improvement in the image generation quality and the accuracy of template matching.

TABLE VI Template Matching and Image Generation Results of two Matching Losses

In TMGAN-MP, L1 loss and MP loss represent the quality of image generation and the effectiveness of image matching, respectively. Following the parameter settings of our previous work [17], we set $\mu$ (GAN loss weight) to 1 and $\lambda$ (L1 loss weight) to 100. We then varied $\varphi$ (MP loss weight) and observed the change in image generation quality and image matching accuracy. Fig. 8 shows that the two follow basically the same trend: as $\varphi$ gradually increases, both first rise and then fall. This indicates that incorporating MP loss benefits both the quality of image generation and the accuracy of image matching. However, when $\varphi$ is poorly chosen, L1 loss and MP loss cannot cooperate effectively, and the model results begin to deteriorate. When $\varphi = 100$, the model achieves the best results in both image generation and image matching. We also plot the relationship between the change in L1 loss and MP loss during training when $\varphi = 100$ (see Fig. 9).

Fig. 8. - Evaluation metrics for various values of $\varphi$. Because the metrics have different scales, the values of each metric were normalized between 0 and 1. The metrics perform best when $\varphi$ is 100.

SECTION V.

Discussion

A. Overfitting in Task-Oriented Training

A potential disadvantage of task-oriented image generation is that the generated images may only achieve good results for the matching method involved in training and apply poorly to other methods.

MP loss (6) relies on an important operation, $\text{Sim}(\cdot)$, which computes the similarity of corresponding image blocks. The choice of similarity calculation may cause the generated image to be effective only for a particular matching algorithm. For validation, we chose three classical template matching algorithms: mean absolute difference (MAD), mean square difference (MSD), and NCC. For our $\text{Sim}(\cdot)$ operation, we chose three calculation methods: L1 distance, L2 distance, and cosine similarity.
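
The two groups of measures named above can be written compactly in NumPy. This is a sketch under assumptions: the function names are hypothetical, MAD and MSD are taken as per-pixel means of the absolute and squared differences, and the three candidate $\text{Sim}(\cdot)$ measures are shown as plain functions of flattened patches. How the distances are converted into a similarity score inside the MP loss is not described in this section and is therefore not shown.

```python
import numpy as np


def _flat(x):
    """Flatten a patch (or feature block) to a 1-D float vector."""
    return np.asarray(x, dtype=np.float64).ravel()


def mad(patch, template):
    """Mean absolute difference matching cost (lower is better)."""
    return float(np.mean(np.abs(_flat(patch) - _flat(template))))


def msd(patch, template):
    """Mean square difference matching cost (lower is better)."""
    return float(np.mean((_flat(patch) - _flat(template)) ** 2))


def l1_distance(a, b):
    """L1 distance between two patches (lower means more similar)."""
    return float(np.sum(np.abs(_flat(a) - _flat(b))))


def l2_distance(a, b):
    """L2 distance between two patches (lower means more similar)."""
    return float(np.linalg.norm(_flat(a) - _flat(b)))


def cosine_similarity(a, b):
    """Cosine similarity between two patches (higher means more similar)."""
    va, vb = _flat(a), _flat(b)
    denom = np.linalg.norm(va) * np.linalg.norm(vb)
    return float(np.dot(va, vb) / denom) if denom > 0 else 0.0
```

Note the opposite orientations: the two distances decrease as patches become more alike, whereas cosine similarity increases and is insensitive to a global scaling of either patch.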

The results in Table VII show that the choice of $\text{Sim}(\cdot)$ operation affects the final matching result. Compared with the baseline, every $\text{Sim}(\cdot)$ operation used as a matching loss effectively improves the matching accuracy of the NCC matching algorithm: the percentage of match pairs that have an L2 distance within one pixel increases by approximately 14.5%, while the mean L2 error is reduced by about 1 pixel. The same behavior is observed with the MAD and MSD matching algorithms. Comparing the different $\text{Sim}(\cdot)$ operations under the same matching algorithm, their matching accuracies differ little: the percentage of match pairs that have an L2 distance within one pixel differs by approximately 1%, and the mean L2 error differs by about 0.1 pixels. When the $\text{Sim}(\cdot)$ operation is calculated using cosine similarity, the resulting matching loss achieves the best matching precision. Therefore, cosine similarity is used as the $\text{Sim}(\cdot)$ operation in this article. These results show that the template matching loss constructed by contrastive learning applies well to different matching methods. Considering the results from Tables III and VII, we recommend NCC loss in practical applications when the matching algorithm is NCC, and MP loss when the matching algorithm is unknown.

TABLE VII Relationship Between $\text{Sim}(\cdot)$ Operations and Matching Results in KAIST Dataset

In the MP loss, there are two hyperparameters, $m$ and $n$. To explore the impact of these two on the final result of template matching, we set $\text{Sim}(\cdot)$ to cosine similarity and conducted template matching experiments on the AVIID dataset using the NCC matching algorithm. We mainly considered the final matching results and the increased training time. Table VIII demonstrates the effects of varying $m$ and $n$ on the matching outcomes and the additional time consumed by training. When only increasing $m$, the improvement in matching results is minimal, and the increase in training time is also relatively small. When $m=5$, as $n$ increases, there is a significant improvement in matching results, achieving a matching accuracy of 87.50% when $n=20$. After that, increasing $n$ does not help much in improving the matching results but will increase the training time. When $n=20$, increasing $m$ to 10 also results in a small improvement in matching results, but the increase in training time is relatively large. Considering the comprehensive situation of matching results and the increase in training time, we finally set $m=5$ and $n=20$.

TABLE VIII Relationship Between $m$ and $n$ in MP Loss and Matching Results in AVIID Dataset

B. Relationship Between Template Matching and Image Generation Tasks

From Tables I and III, it can be observed that the image generation and image matching tasks are complementary.

We further explored their relationship by calculating the Pearson correlation coefficient. Table IX shows the relationship between image evaluation metrics and matching results. Because the metrics have different scales, we normalized them between 0 and 1 and aligned their directions of change before calculating the Pearson correlation coefficient. Larger values of PSNR, SSIM, and MS-SSIM indicate better quality of image generation, as do smaller values of LPIPS and FID. There is a strong correlation between the image evaluation metrics and the matching results, with most Pearson correlation coefficients exceeding 0.9. However, the correlation between the FID metric and the matching metrics is relatively weak.
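
The normalization and correlation steps described above can be sketched as follows. The function names and the `higher_is_better` flag are illustrative assumptions, and any values passed to them in practice would come from Tables I and III; no values from the article are reproduced here.

```python
import numpy as np


def min_max(values, higher_is_better=True):
    """Scale values to [0, 1]; flip metrics where smaller is better
    (e.g., LPIPS, FID) so that all metrics share the same direction."""
    v = np.asarray(values, dtype=np.float64)
    span = v.max() - v.min()
    if span == 0:
        return np.zeros_like(v)  # constant input carries no ordering
    scaled = (v - v.min()) / span
    return scaled if higher_is_better else 1.0 - scaled


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))
```

Since Pearson's r is unchanged by positive rescaling and shifting, the min-max step mainly fixes the sign convention: flipping the smaller-is-better metrics makes a beneficial relationship appear as a positive coefficient for every metric.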

TABLE IX Relationship Between Metrics for Image Quality Assessment and Image Matching

Regarding loss function design, Fig. 9 shows a positive correlation between L1 loss and MP loss. A linear fit to the L1 loss and MP loss gives a Pearson's r of 0.98813, indicating an extremely strong positive correlation. The strong correlation between image generation quality and task execution quality is the prerequisite for applying this type of method, and the positive correlation between the generated image quality loss and the matching task loss is a prerequisite for the convergence of the neural network. In addition, as shown in Fig. 8, a good weight ratio between the image generation loss and the matching loss is key to obtaining the best performance from this method. When the weight ratio of $l_{1}$ to $l_{\text{MP}}$ is 1:1, the best image generation quality and matching results are obtained. It is worth noting that, in general, the smaller the value of $l_{1}$, the better the quality of image generation, and the smaller the value of $l_{\text{MP}}$, the better the matching effect of the generated image.

Fig. 9. - Relationship between changes in L1 and MP loss during training.

Summarizing the results of Table IX and Fig. 9, we can conclude that the image generation and image template matching tasks are complementary. The organic combination of the two tasks can simultaneously improve the quality of image generation and the accuracy of image matching.

SECTION VI.

Conclusion

We proposed TMGAN-NCC and TMGAN-MP to convert visible images into infrared images, solving the problem of template matching between visible and infrared images due to modal differences. We combined the template matching and image generation tasks. Experimental results on the VEDAI, KAIST, and AVIID datasets showed that these tasks can promote each other. The method outperforms state-of-the-art methods in terms of image translation quality and template matching accuracy.

We took the template matching task as an example to explore the relationship between image translation and image matching tasks. There was a strong positive correlation between these tasks, showing that the “image translation + downstream task” pattern is successful and supporting the reliability of image translation data in downstream tasks. Our future work will extend this pattern to more downstream tasks, such as multisource image object detection and data association.

Code and Data Availability

The addresses of the code and data used in this article are listed in Table X.

TABLE X Code and Data Address

Disclosures

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this article.

ACKNOWLEDGMENT

The authors would like to thank the National Natural Science Foundation of China, the China Postdoctoral Science Foundation, and the Young Talent Fund of the University Association for Science and Technology in Shaanxi for their support, as well as all those who have provided valuable comments on this article.
