Introduction
Image synthesis has long been a key research area in computer science and image processing, with widespread applications in various real-world scenarios [1], [2], [3]. The objective of image synthesis is to generate new images or visual scenes through algorithms. By leveraging existing image data, these techniques can create visually appealing, novel, or task-specific images. However, due to limitations in input data, the key challenge in generating high-quality images lies in the model’s ability to capture detailed features.
Semantic image synthesis, which uses semantic layouts as input, is an important area of research within image synthesis. This approach allows flexible control over the generated image content by editing or drawing the input semantic map, enabling the creation of photorealistic images. A semantic layout is a labeled map where each pixel is categorized into a specific class, providing a clear blueprint for positioning and classifying objects in the synthesized image. However, since semantic layouts contain only contour information and lack detailed object features, synthesizing realistic images from these layouts presents significant challenges.
Since its introduction, Generative Adversarial Networks (GANs) [4] have revolutionized the field of image synthesis. By leveraging adversarial training between a generator and a discriminator, GANs have demonstrated the ability to produce high-quality, realistic images. While GANs have made substantial progress in semantic image synthesis, several issues persist: (1) the extensive use of adaptive normalization layers to compensate for semantic information increases the number of parameters, leading to complex network structures that are difficult to train; (2) features are directly extracted from the semantic layout via convolutional kernels, leading to potential confusion in internal semantic information and unclear boundaries between classes, causing important information to be lost during synthesis; (3) feature fusion often overlooks the interactions between different scales, and simple fusion techniques fail to emphasize the importance of multi-level features; and (4) upsampling operations may introduce noise, impacting the quality of the final synthesized image.
To address these challenges, this paper proposes a novel GAN architecture based on the Laplacian pyramid. The generator produces multi-scale Laplacian sub-images, which are combined through the pyramid reconstruction formula to form the final synthesized image, reducing the number of network parameters and simplifying the backbone structure. To handle the complex scene information in semantic maps, a multi-scale channel attention mechanism is introduced to capture intricate relationships within the semantic information. Inspired by [5], a new feature fusion module is proposed that integrates local and global features to guide the synthesis of sub-images more effectively. Finally, to mitigate checkerboard noise introduced during upsampling, an unconditional guidance mechanism is incorporated, inspired by unconditional diffusion models [6], in which empty labels and semantic labels are fed to the generator in the same batch and share its weights.
The main contributions of this paper are as follows:
A generator network utilizing a Laplacian pyramid architecture is proposed, which effectively reduces network parameters and simplifies overall complexity.
A multi-scale channel attention mechanism (MSCA) is introduced to improve the model’s ability to capture complex scene information from semantic maps.
A novel feature fusion block (FFBL) is proposed to enable interaction across different scales and integrate local and global features more effectively.
To mitigate checkerboard noise during upsampling, a combined conditional and unconditional training approach is adopted.
The proposed method was evaluated on three challenging datasets. Experimental results show that the approach outperforms several state-of-the-art methods in both evaluation metrics and visual quality.
This paper is organized as follows. Section II reviews related work, Section III describes the proposed model and loss functions, Section IV presents the experimental details and ablation studies, and Section V concludes the paper.
Related Work
A. Semantic Image Synthesis
Image synthesis has been a central area of research, and the introduction of Generative Adversarial Networks (GANs) marked a significant breakthrough in this field, particularly in the application of deep learning. Since then, GANs and their numerous variants have achieved remarkable success in image synthesis [7], [8], [9], [10], [11], [12]. Although some recent non-GAN-based approaches have emerged [13], [14], [15], [16], [17], [18], GANs remain the dominant method for various image synthesis tasks. In the specific case of semantic image synthesis, GANs have excelled, with the key challenge being to transform sparse semantic information into dense, detailed representations, making optimal use of the available semantic inputs. Numerous GAN-based methods have been developed for semantic image synthesis, broadly categorized into direct convolution-based methods and normalization-constrained generation methods. Both approaches typically synthesize the image as a whole.
Pix2PixHD [19] is a representative example of direct convolution-based methods. It uses a multi-scale generator with a U-Net-like architecture and a multi-scale discriminator. In addition to semantic layouts, Pix2PixHD incorporates instance-level boundary maps to separate different instances, ensuring sharper boundaries. EdgeGAN [20] uses edge information as an intermediate representation and introduces an attention-guided edge transfer module to enhance image generation. SIMS [21] introduces a repository of image patches from training images, allowing the retrieval of photographic references for synthesis. CC-FPSE [22] generates intermediate feature maps from noise using convolution kernels, which are then transformed into the final image. DPGAN [23] proposes a dual-pyramid generator architecture that adaptively adjusts to the size of input objects, preventing artifacts between objects of different sizes. ECGAN [24] incorporates edge information as a prior, guiding the generator to synthesize more detailed images. By leveraging edge information, the model better preserves the geometric structure of the input semantic map, enhancing contour and detail quality in the generated images.
In contrast, normalization-constrained generation methods rely on normalization techniques to integrate semantic layouts into the synthesis process. SPADE [25] is a classic example, introducing spatially-adaptive normalization, where semantic layouts modulate activations within normalization layers. SEAN [26] builds on this by proposing semantic region-adaptive normalization, allowing independent control over the style of each semantic region. RESAIL [27] takes a retrieval-based approach to spatially adaptive normalization, guiding feature normalization with retrieved patches, providing fine-grained pixel-level guidance. OASIS [28] further advances this by redesigning the discriminator as a semantic segmentation network, using a combination of 3D noise tensors and semantic layouts as inputs to the generator.
Direct convolution methods [19], [20], [21], [22], [23], [24] often fail to capture fine-grained details, maintain global consistency, and utilize multi-scale information effectively, resulting in subpar image quality. Normalization-based methods [25], [26], [27], [28], meanwhile, can suppress essential features, reduce diversity, and inadequately leverage semantic maps, leading to feature loss and limited adaptability in complex scenes.
LMCGAN overcomes these limitations by introducing a Laplacian pyramid generator for progressive detail refinement and consistent global structure, a multi-scale channel attention mechanism (MSCA) to dynamically prioritize critical features and enhance diversity, and a feature fusion block (FFBL) to integrate cross-scale information and fully utilize semantic maps. Together, these innovations enable LMCGAN to produce highly detailed, semantically consistent, and visually realistic images, addressing the shortcomings of both approaches.
B. Laplacian Image Pyramid
An image can essentially be regarded as a combination of signals at various frequencies. The Laplacian pyramid [29] decomposes the source image into different spatial frequency bands. By performing the fusion process at each frequency layer separately, it allows specialized fusion operators to be tailored to the features and details of each band. This approach enhances specific characteristics within targeted frequency bands, enabling the seamless integration of features and details from multiple images.
C. Channel Attention
The channel attention mechanism boosts the performance of visual models by assigning varying levels of importance to each channel in a convolutional neural network. A prime example is SENet [30], which uses global average pooling to compress feature maps and fully connected layers to compute weights for each channel, enabling adaptive adjustment of their importance. Following SENet, many variants have been developed [31], [32], [33], further exploring the dependencies between channels and between channels and spatial dimensions, enhancing the model’s capacity to extract and utilize feature information.
Method
This section details the proposed method. First, the overall network architecture is presented, followed by the loss functions employed. Lastly, the proposed unconditional guidance structure is introduced.
A. Generator and Discriminator Architecture
We propose a generator architecture that leverages the Laplacian pyramid structure, breaking the image synthesis task into multiple sub-image synthesis tasks. This reduces the network's learning complexity, while each layer of the Laplacian pyramid captures image details at a different scale, enabling the generator to focus more precisely on each level of detail. The architecture is shown in Figure 1.
Initially, the semantic map is one-hot encoded into a class matrix and fed into the encoder. Due to the complexity of scene information in semantic maps, standard convolutional neural networks struggle to retain critical features. To address this, we introduce a novel multi-scale channel attention mechanism (MSCA) that aggregates semantic maps across different scales, facilitating better interpretation of semantic relationships in complex scenes. Through the encoder phase, the semantic map is transformed into a feature map rich in semantic content.
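For concreteness, the snippet below sketches one common way to perform this one-hot encoding step in PyTorch. The class count and spatial size are illustrative placeholders, not the settings used in the paper.

```python
import torch
import torch.nn.functional as F

# Hypothetical label map: a batch of 4 semantic layouts with integer class ids.
num_classes = 35                      # placeholder class count; set per dataset
labels = torch.randint(0, num_classes, (4, 256, 512))

# F.one_hot places the class dimension last, so move it to the channel position
# to obtain the (B, C, H, W) class matrix fed to the encoder.
onehot = F.one_hot(labels, num_classes).permute(0, 3, 1, 2).float()
print(onehot.shape)                   # torch.Size([4, 35, 256, 512])
```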
The decoder has four branches, each producing feature maps at three different scales via three layers of transposed convolutions; the feature map from the final layer retains the original resolution, resulting in four feature maps at different scales. However, these feature maps only contain scale-specific semantic information and lack contextual understanding. To address this, we introduce a feature fusion block (FFBL), which progressively fuses features from lower to higher scales. Each layer incorporates global information from the previous layers, effectively guiding the synthesis of sub-images at their respective scales. In the end, four sub-images at different scales are generated, and the final synthesized image is reconstructed using the Laplacian pyramid method. The reconstruction formula is as follows:
\begin{equation*} I_{final} = I_{0} + \sum _{i=1}^{n} upsample(I_{i}) \tag {1}\end{equation*}
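The following is a minimal PyTorch sketch of the reconstruction in Eq. (1): each coarser sub-image is upsampled to the base resolution and added to I_0. The bilinear upsampling filter is an assumption; the paper does not specify the interpolation used here.

```python
import torch
import torch.nn.functional as F

def reconstruct(bands):
    """Eq. (1): I_final = I_0 + sum_i upsample(I_i).
    bands[0] is the full-resolution sub-image; bands[1:] are progressively coarser."""
    target = bands[0].shape[-2:]
    final = bands[0]
    for band in bands[1:]:
        # bring each coarser sub-image up to the base resolution before summing
        final = final + F.interpolate(band, size=target, mode='bilinear',
                                      align_corners=False)
    return final
```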
The discriminator architecture, as shown in Figure 2, processes sub-images at different scales through corresponding discriminators. The real image is first decomposed using the Laplacian pyramid, yielding a series of sub-images at different scales.
Each discriminator, composed of convolutional layers, processes the real and generated sub-images at its respective scale. The final output provides a discrimination result, determining whether the input sub-images are real or synthesized, ensuring both global structures and fine-grained details are evaluated across all scales.
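As an illustration, a PatchGAN-style convolutional discriminator for a single pyramid scale could look like the sketch below; the layer count, channel widths, and normalization are assumptions rather than the paper's exact configuration.

```python
import torch.nn as nn

def make_scale_discriminator(in_ch=3, base=64):
    """A simple convolutional discriminator for one pyramid level (assumed design)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, base, 4, stride=2, padding=1),
        nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(base, base * 2, 4, stride=2, padding=1),
        nn.InstanceNorm2d(base * 2),
        nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(base * 2, base * 4, 4, stride=2, padding=1),
        nn.InstanceNorm2d(base * 4),
        nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(base * 4, 1, 4, stride=1, padding=1),  # per-patch real/fake scores
    )

# One discriminator per pyramid level, applied to the real and generated
# sub-images at the corresponding scale.
discriminators = nn.ModuleList([make_scale_discriminator() for _ in range(4)])
```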
B. Multi-Scale Channel Attention
In semantic image generation tasks, effectively leveraging semantic map information is key to producing realistic images. Each channel in a semantic map represents a different category, but not all categories carry equal importance. Redundant information can exist, which may interfere with the dependencies between classes, leading to confusion in semantic features and ultimately degrading image quality. To tackle this issue, we propose a novel multi-scale channel attention (MSCA) mechanism. By capturing both detailed and global features through multi-scale convolutional kernels, MSCA enhances the network’s ability to understand complex interactions between categories.
As illustrated in Figure 3(a), the MSCA structure consists of multi-scale convolutional layers, a global average pooling layer, and an MLP. The multi-scale convolutional layers use dilated convolutions [34] to extract features at various scales from the input feature map. Dilated convolutions expand the receptive field by inserting spaces (zeros) between the standard convolution kernel elements, enabling the network to gather broader contextual information without increasing the parameter count or computational cost. In our method, we employ convolution kernels of size 3 with dilation rates of 1, 2, and 4, corresponding to effective kernel sizes of 3, 5, and 9, respectively.
The multi-scale branch outputs are summed and compressed by global average pooling. The pooled descriptor is then processed by the MLP, which restores the channel dimension and refines the multi-scale information. Finally, the resulting channel weights are passed through an activation function and used to weight and sum the branch features into the final activated feature map.\begin{align*} z & = D_{1}(x) + D_{2}(x) + D_{3}(x) \\ \theta _{i} & = S_{i}(F_{ci}(Avg(z))) \\ x' & = \theta _{1} * D_{1}(x) + \theta _{2} * D_{2}(x) + \theta _{3} * D_{3}(x) \tag {2}\end{align*}
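A minimal PyTorch sketch of MSCA following Eq. (2) is shown below. The MLP reduction ratio and the use of a sigmoid for S_i are assumptions; only the overall structure (dilated branches, pooled descriptor, per-branch channel weights) follows the description above.

```python
import torch.nn as nn

class MSCA(nn.Module):
    """Multi-scale channel attention, sketched after Eq. (2)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        # three 3x3 dilated branches D_1, D_2, D_3 (dilation 1, 2, 4); padding keeps size
        self.branches = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=d, dilation=d) for d in (1, 2, 4)])
        self.pool = nn.AdaptiveAvgPool2d(1)                      # Avg(.)
        # one small MLP per branch: F_ci followed by the activation S_i (sigmoid assumed)
        self.mlps = nn.ModuleList([
            nn.Sequential(nn.Linear(channels, channels // reduction),
                          nn.ReLU(inplace=True),
                          nn.Linear(channels // reduction, channels),
                          nn.Sigmoid())
            for _ in range(3)])

    def forward(self, x):
        feats = [b(x) for b in self.branches]                    # D_1(x), D_2(x), D_3(x)
        z = sum(feats)
        pooled = self.pool(z).flatten(1)                         # channel descriptor of z
        out = 0
        for f, mlp in zip(feats, self.mlps):
            theta = mlp(pooled).unsqueeze(-1).unsqueeze(-1)      # per-channel weights
            out = out + theta * f                                # weighted sum, Eq. (2)
        return out
```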
C. Feature Fusion Block
In the Laplacian pyramid, each sub-image is relatively independent, with each layer representing the details at its corresponding scale. However, there is a lack of direct interaction between the coarser features of higher layers and the finer details of lower layers, leading to suboptimal utilization of information. To address this, inspired by [5], we propose a novel feature fusion block (FFBL). FFBL integrates global features (using global average pooling and global max pooling) and local features (via convolution), allowing it to effectively weigh the importance of feature maps across different scales. The final output is obtained by weighting and combining these feature maps through an activation function.
The fusion process begins from the bottom layer feature map, progressively merging it with the higher-level feature maps. This hierarchical fusion ensures that each sub-image incorporates information from the previous scale, facilitating the capture of more details during image generation and ultimately enhancing the quality of the generated image. The architecture of the module is depicted in Figure 3(b), where x and y denote the input feature maps, and z represents the fused output.\begin{align*} F_{L} & = Avgp(x+y) + Maxp(x+y) \\ F_{G} & = DWConv(x+y) \\ \beta & = S(F_{L} \bullet F_{G}) \tag {3}\end{align*}
\begin{equation*} z = \beta * x + (1-\beta ) * y \tag {4}\end{equation*}
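The sketch below gives a minimal PyTorch version of FFBL following Eqs. (3)-(4). The depth-wise kernel size and the interpretation of the bullet in Eq. (3) as an element-wise product are assumptions about details the text leaves open.

```python
import torch.nn as nn

class FFBL(nn.Module):
    """Feature fusion block, sketched after Eqs. (3)-(4)."""
    def __init__(self, channels):
        super().__init__()
        self.avg = nn.AdaptiveAvgPool2d(1)
        self.max = nn.AdaptiveMaxPool2d(1)
        # depth-wise convolution (groups == channels); kernel size 3 is assumed
        self.dwconv = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x, y):
        s = x + y
        f_pool = self.avg(s) + self.max(s)       # pooled descriptor, Eq. (3)
        f_conv = self.dwconv(s)                  # convolutional descriptor, Eq. (3)
        beta = self.sigmoid(f_pool * f_conv)     # fusion weights (broadcast over H, W)
        return beta * x + (1 - beta) * y         # Eq. (4)
```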
D. Loss Function
Since the band-pass components and low-frequency residual image convey distinct types of information, multiple loss functions are integrated within the generator. For the band-pass components, perceptual loss is employed, utilizing a pre-trained VGG19 network [35] as the perceptual feature extractor. Meanwhile, for the low-frequency residual image, pixel loss is applied to ensure accurate reconstruction of low-frequency details. The perceptual loss can be succinctly defined as follows:\begin{equation*} L_{G1} = \sum _{h=1}^{N-1}(Gram(VGG(f_{h})) - Gram(VGG(r_{h}))) \tag {5}\end{equation*}
\begin{equation*} L_{G2} = | f_{N} - r_{N}|^{2} \tag {6}\end{equation*}
\begin{align*} L_{G3} & = (Gram(VGG(f_{0})) - Gram(VGG(r_{0}))) + | f_{0} - r_{0}|^{2} \\ & \quad \qquad \qquad \qquad \qquad + E_{s \sim P_{data}(Z)} [log(D(G(z)))] \tag {7}\end{align*}
\begin{equation*} L_{G} = L_{G1} + L_{G2} + L_{G3} \tag {8}\end{equation*}
The total loss of the discriminator is:\begin{align*} L_{D} & = \sum _{h=1}^{N}[E_{s \sim P_{data}} [log(1-D(G(f_{h})))] \\ & \qquad \qquad \qquad \qquad + E_{s \sim P_{data}} [log(D(r_{h}))]] \tag {9}\end{align*}
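To make the band-pass term concrete, the sketch below computes a Gram-matrix distance on pretrained VGG19 features, in the spirit of Eq. (5). The choice of feature depth, the absolute-difference reduction, and the omission of ImageNet normalization are assumptions, not the paper's exact implementation.

```python
import torch
from torchvision.models import vgg19

def gram(feat):
    """Gram matrix of a (B, C, H, W) feature map, normalized by its size."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

# Frozen VGG19 feature extractor (depth chosen for illustration only).
vgg_features = vgg19(weights='IMAGENET1K_V1').features[:21].eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)

def band_pass_perceptual_loss(fake_band, real_band):
    """Gram-based distance between generated and real band-pass sub-images."""
    return torch.mean(torch.abs(gram(vgg_features(fake_band)) -
                                gram(vgg_features(real_band))))
```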
E. Unconditional Guidance
Since our decoder uses transposed convolutions (deconvolutions) for upsampling, it can introduce checkerboard artifacts. To address this issue, and inspired by the recent rise of diffusion models, which, like GANs, are generative models, we incorporate ideas from the unconditional diffusion model [6]. Specifically, we introduce unconditional guidance to mitigate the noise introduced during training and to disentangle the generated image from the semantic label guidance.
The goal of unconditional guidance is to eliminate the noise artifacts that arise during training and to make the generation process less dependent on the semantic labels. To achieve this, we include an empty label alongside the semantic labels in the same batch, so that both pass through the generator. The key idea is that both types of input, semantic and empty labels, share the same generator weights. Denoting the empty label by \phi and the guidance scale by \delta, the outputs are combined as follows:\begin{equation*} f = (G(x) - G(\phi )) \cdot \delta + G(\phi ) \tag {10}\end{equation*}
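The snippet below sketches how Eq. (10) can be applied: the empty label is assumed to be an all-zero map with the same shape as the one-hot semantic input, and the guidance scale delta is a free hyperparameter.

```python
import torch

def guided_generate(generator, semantic_onehot, delta=1.5):
    """Apply Eq. (10). The all-zero empty label and delta=1.5 are assumptions."""
    empty = torch.zeros_like(semantic_onehot)
    # conditional and unconditional inputs run in the same batch, sharing weights
    out = generator(torch.cat([semantic_onehot, empty], dim=0))
    cond, uncond = out.chunk(2, dim=0)
    return (cond - uncond) * delta + uncond   # f = (G(x) - G(phi)) * delta + G(phi)
```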
Experiments
A. Dataset
The experiments in this paper were conducted on three challenging datasets: Cityscapes [37], ADE20K [38], and CelebA-HQ [39]. The Cityscapes dataset consists of street scene images from German cities, with 3,000 images used for training and 500 images used for validation. The ADE20K dataset contains 20,210 images for training and 2,000 images for validation, featuring challenging scenes with 150 semantic categories. The CelebA-HQ dataset is a large-scale facial image dataset comprising 30,000 high-resolution face images, with 19 semantic categories. Except for the Cityscapes dataset, the images in the other datasets were resized to
This paper employs the Laplacian pyramid to pre-decompose the dataset images into multiple sub-images at different frequency levels. These sub-images are then used as input labels for each branch of the generator and discriminator. The results of the decomposition are illustrated in Figure 5.
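A minimal sketch of this pre-decomposition is shown below, using bilinear down/upsampling; the paper's exact low-pass filter (e.g. a Gaussian kernel) may differ.

```python
import torch.nn.functional as F

def laplacian_pyramid(img, levels=4):
    """Decompose a (B, C, H, W) image tensor into `levels` sub-images:
    band-pass details at each scale plus a low-frequency residual."""
    bands, current = [], img
    for _ in range(levels - 1):
        down = F.interpolate(current, scale_factor=0.5, mode='bilinear',
                             align_corners=False)
        up = F.interpolate(down, size=current.shape[-2:], mode='bilinear',
                           align_corners=False)
        bands.append(current - up)    # band-pass detail at this scale
        current = down
    bands.append(current)             # low-frequency residual
    return bands
```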
B. Experimental Details and Evaluation Metrics
Given the adaptive learning rate capability of the Adam optimizer [40], which adjusts the learning rate based on the gradient variations of each parameter, this paper adopts Adam to optimize the network parameters. The batch size is set to 4, with the generator's learning rate set at 0.0001 and the discriminator's at 0.0004. During training, both learning rates are decayed linearly. All experiments were conducted using the PyTorch framework on an Nvidia GeForce RTX 3090 GPU.
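The optimizer setup can be reproduced roughly as follows; the Adam betas and the epoch at which linear decay begins are assumptions, and `generator` / `discriminators` are placeholder names for the networks described above.

```python
import torch

# Two learning rates as described above; betas are an assumed choice.
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.0, 0.999))
opt_d = torch.optim.Adam(discriminators.parameters(), lr=4e-4, betas=(0.0, 0.999))

# Linear decay of both learning rates (schedule lengths are assumptions).
total_epochs, decay_start = 200, 100
def linear_decay(epoch):
    if epoch < decay_start:
        return 1.0
    return 1.0 - (epoch - decay_start) / (total_epochs - decay_start)

sched_g = torch.optim.lr_scheduler.LambdaLR(opt_g, lr_lambda=linear_decay)
sched_d = torch.optim.lr_scheduler.LambdaLR(opt_d, lr_lambda=linear_decay)
```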
The Fréchet Inception Distance (FID) measures the similarity between two sets of images by computing the distance between the feature distributions of generated and real images; a lower FID indicates better image quality. Mean Intersection-over-Union (mIoU) is a key metric for assessing segmentation accuracy. In this paper, state-of-the-art segmentation networks are used for each dataset: DRN-D-105 [41] for Cityscapes, UperNet-101 [42] for ADE20K, and BiSeNetV1 [43] for CelebA-HQ. A higher mIoU represents better segmentation performance.
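For reference, mIoU can be computed from integer label maps with a standard confusion-matrix routine such as the sketch below; it is generic and not tied to the specific segmentation networks listed above.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean IoU over classes, given integer-valued prediction and ground-truth maps."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    valid = (target >= 0) & (target < num_classes)
    np.add.at(conf, (target[valid], pred[valid]), 1)     # confusion matrix
    inter = np.diag(conf)
    union = conf.sum(0) + conf.sum(1) - inter
    iou = inter / np.maximum(union, 1)
    return iou[union > 0].mean()                         # average over present classes
```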
C. Experimental Results
To assess the effectiveness of the proposed method, we compared it against several state-of-the-art GAN-based semantic image synthesis models that have shown excellent performance. The baseline models include SIMS [21], Pix2pixHD [19], CC-FPSE [22], SPADE [25], OASIS [28], DPGAN [23], and ECGAN [24].
Table 1 compares the baseline models with LMCGAN, showing that our method outperforms the others in both mIoU and FID across all three datasets. For the CelebA-HQ dataset, our method achieves an mIoU of 74.2, which is 0.9 higher than the state of the art, with an average improvement of 4.26. The FID is 28.6, outperforming ECGAN, with an average improvement of 0.98. For the ADE20K dataset, our method achieves an mIoU of 53.8, improving by 1.1 over the best-performing method, with an average improvement of 11. The FID is 24.3, which is 0.4 better than the top-performing method, with an average improvement of 13.45. For the Cityscapes dataset, our method achieves an mIoU of 74.1, 0.8 higher than the state of the art, with an average improvement of 9.8. The FID is 42.1, comparable to ECGAN, with an average improvement of 15.7.
These results demonstrate that our proposed method outperforms benchmark models, effectively capturing information closer to real images and generating more realistic outputs. Additionally, in terms of network parameters, except for SIMS, our method uses significantly fewer parameters than other models while still delivering higher image quality. This highlights the efficiency of our architecture in reducing both parameter count and network complexity without compromising the quality of the generated images.
Figures 6, 7, and 8 illustrate the visual comparisons between our method and other approaches on the ADE20K, Cityscapes, and CelebA-HQ datasets. For the ADE20K dataset, our method demonstrates superior semantic consistency and detail preservation. DPGAN and ECGAN often result in blurred object edges and unnatural transitions between objects and backgrounds. Our method excels at preserving object boundaries and rendering clearer details, especially in complex scenes like buildings and objects. In the Cityscapes dataset, DPGAN and ECGAN struggle with boundary clarity, particularly in cars, roads, and pedestrians. Our method improves the relationship between objects, producing sharper details and more natural textures. For CelebA-HQ, our method outperforms others in generating facial details. OASIS and DPGAN generate blurred facial features, especially in areas like eyes, mouths, and skin textures, with unrealistic lighting. Our method delivers more refined skin textures, realistic lighting, and overall more lifelike facial features.
Visual comparison of our LMCGAN method with other methods on the Cityscapes dataset.
Visual comparison of our LMCGAN method with other methods on the CelebA-HQ dataset.
D. Impact of Each Component on LMCGAN
To evaluate the effectiveness of the proposed MSCA and FFBL components, experiments were conducted using various configurations, with results summarized in Table 2. LGAN refers to the baseline without either component, LGAN+ uses only FFBL, LMCGAN- applies only MSCA, and LMCGAN employs both MSCA and FFBL.
From Table 2, we see that using only the Laplacian pyramid-based GAN (LGAN) results in mIoU scores lower by (3.2, 8.1, 7) and FID scores higher by (5.4, 8.8, 6) compared to LMCGAN. When adding FFBL to the decoder, the mIoU gaps decrease to (1.8, 5.9, 3.2) and the FID gaps to (3.7, 5.8, 3.2). Adding MSCA to the encoder further reduces the mIoU gaps to (1.1, 3, 1.9) and the FID gaps to (2.5, 2.1, 1.4). Without MSCA and FFBL, the generator struggles to properly utilize semantic information, leading to entangled class boundaries and poor image quality.
Using only FFBL improves global and local focus but still fails to utilize semantic maps effectively. MSCA helps retain important channel information and reduces the impact of redundant information, but ignoring inter-layer information in the decoder results in less optimal outcomes.
The visual comparison in Figure 9 highlights that LGAN produces the least detailed images across all three datasets, with poor representation of key features such as vehicles, faces, and curtains. LGAN+ and LMCGAN- improve on detail but still fall short in finer aspects. In contrast, LMCGAN generates the most natural and realistic images, excelling in detail quality, such as road markings, eye colors, and curtain textures. Overall, the combination of MSCA and FFBL greatly enhances image quality, achieving the best results when used together.
E. Ablation Study
1) Number of Layers
To assess the effect of pyramid depth and determine the optimal configuration, three experiments were conducted on the Cityscapes dataset. As shown in Table 3, the proposed method achieves its best results with a four-layer pyramid. This indicates that simply using more or fewer pyramid layers is not necessarily better; instead, there is an optimal number of layers that best suits the proposed network.
The visual results are shown in Figure 10, where the headlights of the preceding vehicle and the zebra crossing appear more realistic with the four-layer configuration.
2) Combination of Loss Functions
In this section, comparative experiments were conducted on the Cityscapes dataset to select the loss function for the four-layer branch. The following combinations of loss functions were evaluated: all layers using pixel loss, all layers using perceptual loss, pixel loss for the first three layers combined with perceptual loss for the last layer, and perceptual loss for the first three layers combined with pixel loss for the last layer.
As shown in Table 4, the best results were obtained when using perceptual loss for the texture (band-pass) sub-images and pixel loss for the low-frequency residual map. This indicates that perceptual loss is more suitable for fitting the texture images, while pixel loss is more effective for fitting the low-resolution residual image in the coarsest layer.
3) Unconditional Guidance Experiment
In this section, two sets of experiments were conducted on three different datasets to assess the advantages of unconditional guidance. The experimental results are presented in Table 5.
The visual effects are shown in Figure 11. In the Cityscapes dataset, the color of the sky appears more natural; in the ADE20K dataset, the details of the furniture are clearer; and in the CelebA-HQ dataset, the facial color appears more realistic. The advantages of unconditional guidance are therefore reflected in both the quantitative metrics and the visual results.
Experiments were also conducted on the guidance-scale hyperparameter \delta introduced in Eq. (10).
Conclusion
While Generative Adversarial Networks (GANs) have achieved remarkable progress in image synthesis, generating high-quality images from semantic maps remains challenging. Existing methods can produce realistic images to some extent but often struggle with detail preservation, class separation, and global consistency. To address these issues, this paper proposes a novel GAN architecture, LMCGAN, which integrates a Laplacian pyramid structure with a multi-scale channel attention mechanism to improve the transformation of semantic maps into realistic images.
The key innovation of LMCGAN is its use of the Laplacian pyramid to progressively generate sub-images at multiple scales, refining semantic map details step by step. Specifically, the network generates sub-images at different scales through multiple branches, with each branch focusing on a different level of detail within the semantic map. This Laplacian pyramid structure facilitates a coarse-to-fine transformation, where the semantic map evolves into a high-quality image. The divide-and-conquer approach enhances object separation, improves class representations, and optimizes detail processing, while reducing the burden on the network. Additionally, we introduce a multi-scale channel attention (MSCA) mechanism that extracts rich semantic features from different scales, enabling the network to make better use of the input semantic map. This mechanism captures both fine details and global context, improving the overall quality of the synthesized images. To further enhance sub-image generation, we propose a feature fusion block (FFBL), which effectively combines features from various levels of sub-images. This module allows each sub-image to incorporate global context, preserving key image details. By providing more accurate guidance at each level, the FFBL ensures that the final image is both visually realistic and detailed.
We evaluated LMCGAN on three public datasets. The results show that LMCGAN outperforms multiple baselines in terms of mIoU and surpasses state-of-the-art models in FID. Our method produces images that are highly similar to real-world scenes, closely matching the input semantic maps. Overall, LMCGAN not only delivers significant improvements in visual quality but also excels in optimizing details and maintaining global consistency, demonstrating its effectiveness for semantic image synthesis tasks.