Journals & Magazines >Intelligent and Converged Net... >Volume: 5 Issue: 4

Honeycomb Lung Segmentation Network Based on P2T with CNN Two-Branch Parallelism

Abstract:

Aiming at the problem that honeycomb lung lesions are difficult to accurately segment due to diverse morphology and complex distribution, a network with parallel two-bran...Show More

Metadata

Abstract:

Aiming at the problem that honeycomb lung lesions are difficult to accurately segment due to diverse morphology and complex distribution, a network with parallel two-branch structure is proposed. In the encoder, the Pyramid Pooling Transformer (P2T) backbone is used as the Transformer branch to obtain the global features of the lesions, the convolutional branch is used to extract the lesions' local feature information, and the feature fusion module is designed to effectively fuse the features in the dual branches; subsequently, in the decoder, the channel prior convolutional attention is used to enhance the localization ability of the model to the lesion region. To resolve the problem of model accuracy degradation caused by the class imbalance of the dataset, an adaptive weighted hybrid loss function is designed for model training. Finally, extensive experimental results show that the method in this paper performs well on the Honeycomb Lung Dataset, with Intersection over Union (IoU), mean Intersection over Union (mloU), Dice coefficient, and Precision (Pre) of 0.8750,0.9363,0.9298, and 0.9012, respectively, which are better than other methods. In addition, its IoU and Dice coefficient of 0.7941 and 0.8875 on the Covid dataset further prove its excellent performance.

Published in: Intelligent and Converged Networks ( Volume: 5, Issue: 4, December 2024)

Page(s): 336 - 355

Date of Publication: December 2024

Electronic ISSN: 2708-6240

DOI: 10.23919/ICN.2024.0023

Funding Agency:

Contents

SECTION 1

Introduction

Among respiratory diseases, Interstitial Lung Disease (ILD) is an important cause of patient mortality, and its incidence is significantly on the rise in China[1]. Typical imaging features of ILD include fibrillar and lattice-like changes, which eventually evolve into honeycomb-like pathological remodeling during the advanced stages of the disease. Computed Tomography (CT) scanning serves an indispensable function in accurately identifying honeycomb lung patterns and in providing a comprehensive evaluation of the extent and distribution of the associated lesions. By scrutinizing the area, distribution, and other pathological features within honeycomb lung CT images, physicians can efficiently gauge the activity of the lesion and subsequently offer tailored therapy. Hence, precise segmentation of honeycomb lung tissue is of paramount importance for clinical diagnosis. Nevertheless, due to the intricate lesion distribution and diverse anatomical structures present in CT images of honeycomb lungs, attaining highly accurate segmentation remains a formidable challenge.

In recent years, investigations into the application of Artificial Intelligence (AI) in honeycomb lung image analysis have garnered substantial interest. While considerable strides have been made in honeycomb lung recognition and automatic classification tasks[2], enhancing their diagnostic precision and accuracy significantly, the research continues to confront a challenge when addressing the intricate and detailed issue of accurately segmenting the lesion area. To tackle this conundrum, Wei et al.[3] ingeniously proposed the MCSC-UTNet automatic segmentation algorithm, rooted in SepViT, which bolsters the representation capacity of the lesion component by incorporating a global information enhancement module within the model's bottleneck layer, thereby yielding enhanced segmentation outcomes. Nonetheless, it encounters certain constraints when confronted with fine-structured boundary localization. Li et al.[4] conceived MCAFNet, a honeycomb lung segmentation network integrating a multiscale cross-attention mechanism and incorporating a bidirectional attentional gating technique, with the objective of more accurately pinpointing the target segmentation region. Despite MCAFNet achieving a leap in segmentation precision via numerous sophisticated attention mechanisms, its relatively intricate architecture gives rise to a relatively sluggish optimization pace and convergence efficiency during the training phase. Furthermore, Zhao et al.[5] put forth an uncertainty-guided semi-supervised segmentation network framework for honeycomb lung. Although harnessing uncertainty information can efficaciously augment segmentation quality and broaden the extent of dataset exploitation, inaccurate uncertainty estimation or substantial noise interference might result in the production of erroneous pseudo-labels, which, in turn, impairs the learning trajectory of the model, diminishing both learning efficacy and the robustness of the ultimate segmentation performance.

Although the algorithms discussed above have achieved notable progress in honeycomb lung image segmentation, they continue to face several limitations and challenges. (1) Lack of effective integration of global and local contexts: Existing approaches frequently struggle to efficiently incorporate both global contexts and local details. Global contexts, crucial for discerning inter-lesion relationships and overall tissue organization, and local details, vital for precise boundary demarcation, are often inadequately harmonized. This shortcoming impairs the accurate representation of lesions exhibiting diverse sizes and distributions[6]. (2) Inadequate addressing of category imbalance: The inherent imbalance in pixel categories within medical datasets, characterized by a preponderance of healthy tissue pixels compared to lesion pixels, frequently results in biased model learning and compromised segmentation accuracy[7]. (3) Insufficient resilience to noise and image variability: Honeycomb lung images commonly contain substantial noise, which can significantly impair segmentation performance[8]. Current algorithms may not exhibit sufficient robustness against such variations, necessitating the development of more resilient models.

To address the above issues, this paper proposes an innovative network architecture that combines the powerful global information capturing capability of Transformer with the efficiency of Convolutional Neural Networks (CNN) in local feature extraction. The network constructs a dual-encoder system that systematically integrates the global contextual information captured by the Transformer branch with the precise local details refined by the CNN branch through the proposed feature integration strategy. Specifically, instead of using simple feature concatenation, the features extracted from the two branches are deployed with a Transformer Merge CNN (TMC) module at the encoding stage. This module not only facilitates the deep fusion of cross-branch features, but also effectively mitigates the problems of information redundancy and feature conflicts that may result from direct fusion. TMC ensures the effective complementarity of global context and local structure by hierarchically combining feature mappings at different scales, which enhances the ability of the model to comprehend the complex scene. Channel Prior Convolutional Attention (CPCA) is applied to the decoder, a mechanism that dynamically adjusts the importance of features in the channel dimension according to the needs of the prediction task, while refining the prediction output in the spatial dimension, thus greatly improving the accuracy of the segmentation maps and boundary localisation. CPCA suppresses extraneous information by emphasising the feature channels that are critical to the final prediction, enabling the model to focus more on key image details. The contributions of this study are organised as follows:

A parallel two-branch network architecture is devised, synergistically combining convolutional and Transformer features, to optimize honeycomb lung image segmentation performance.
In the encoder component of the model, the Pyramid Pooling Transformer (P2T) backbone serves as the Transformer branch, dedicated to extracting global lesion features. Concurrently, the convolutional branch is assigned the task of extracting local feature information. A feature fusion module is incorporated to efficaciously harness the complementary nature of the dual branches, effectively suppressing irrelevant noise and preserving the integrity of the semantic structure. Moreover, the decoder section is fortified through the employment of the channel prior convolutional attention mechanism, thereby augmenting the model's capacity to precisely localize the lesion region.
An adaptive weighted hybrid loss function based on Focal and Binary Cross Entropy (BCE) loss function is proposed to enhance the contribution of lesion information in the loss value, which effectively solves the problem of model accuracy degradation due to the imbalance of dataset categories.

The subsequent sections are structured as follows: Section 2 presents a detailed discussion of the relevant methodologies. Section 3 elucidates the methodology and implementation specifics of the proposed approach. Section 4 sequentially presents the dataset, the experimental protocol, and the ensuing outcomes. Section 5 delves into a thorough discussion of the findings. The conclusions are presented in Section 6.

SECTION 2

Related Work

2.1 CNNs for Image Segmentation

CNN have found widespread application in medical image processing ^[9]–[12]. In the context of various medical image segmentation tasks, UNet[13] and numerous innovative networks based on it have been introduced. Among these, Xu et al.[14] proposed an end-to-end multi-scale feature extraction and fusion network MEF-UNet. This network incorporates a selective feature extraction encoder, bifurcated into detail extraction and structure extraction stages, aimed at capturing lesion edge details and overall shape characteristics. Additionally, a contextual information storage module is integrated into the skip connection, tasked with assimilating feature map information from adjacent layers. Furthermore, a multi-scale feature fusion module is designed within the decoder to consolidate feature maps across disparate scales. Iqbal et al.[15] put forth a U-shaped pyramidal expansion network, PDF-UNet, tailored for breast tumor image segmentation as an enhanced UNet variant. More classical enhancements to UNet include ResUNet by Alom et al.[16] and Attention-UNet by Oktay et al.[17]. ResUNet effectively introduces residual connection into UNet, enhancing segmentation performance when training deep architectures. Meanwhile, Attention-UNet pioneers the integration of attention gating into a U-shaped architecture, empowering the model to automatically learn to focus on target structures of diverse shapes and sizes. Han et al.[18] employed large convolutional kernels and depth-separable convolutions to enhance UNet's convolutional blocks, followed by the addition of residual connection in both the encoder and decoder. They further devised a lightweight attentional mechanism to filter out low-level semantic noise and suppress irrelevant features. Xia et al.[19] augmented UNet with a shape-enhancing branch to compute discriminative representations. Through this branch, shape boundaries are learned, allowing the model to selectively focus on pertinent boundary features, thereby enhancing the accuracy of tumor segmentation. Inspired by these pioneering works, this paper constructs a fundamental CNN-based encoder-decoder framework, and introduces skip connection in the network. These connections facilitate the direct transfer of feature maps from diverse layers of the encoder to their corresponding layers in the decoder. Consequently, this design not only preserves the intricacy of local details and the integrity of shape information but also enables the decoder to leverage the nuanced, fine-grained features extracted in the initial stages to guide the reconstruction of more abstract, high-level features.

2.2 Transformer for Image Segmentation

While the above CNN-based methods have demonstrated advancements in segmentation accuracy, intricate boundary detection, and computational efficiency, inherent limitations of the convolutional operation[20] render CNN-based networks' attention capabilities inadequate when confronted with extremely complex or minuscule target regions, falling short compared to Transformer and other models based on the self-attention mechanism. These limitations pose challenges in meeting the stringent demands of accurately segmenting honeycomb lungs. Compared with CNN, Transformer performs well in capturing remote feature dependencies[21], [22]. In particular, it demonstrates stronger feature capture and modeling capabilities when dealing with medical images that are highly nonlocal correlated and have drastic scale variations[23]. Wang et al.[24] proposed the first pyramid structure model PVT applying Transformer, demonstrating their viability as backbone components in semantic segmentation networks. Despite this, PVT encounters practical challenges, including computational overhead, rigid positional encoding, and other constraints. To counter these obstacles, Wu et al.[25] devised P2T, which applies pyramid pooling to multi-head self-attention, thereby capturing potent contextual features while concurrently curtailing input sequence lengths. Relative to earlier Vision Transformer (ViT) backbones, P2T demonstrates superior performance in downstream tasks like segmentation. Nevertheless, the hierarchical nature of Transformers imparts limitations on generalization capabilities, leading to an incomplete engagement with neighboring feature information during local feature modeling[26], [27].

2.3 Channel-Spatial Attention Mechanism

In the field of computer vision, research on attention mechanisms has made great progress, with two main research directions: channel attention and spatial attention. Channel attention is responsible for focusing on the channel information in the CNN to ensure that the model can focus more on the channel features carrying key information, while spatial attention focuses on identifying and highlighting the most task-relevant spatial regions in the input image to improve the relevance of the feature representation. SENet[28] is a typical channel attention, which calibrates the channel feature responses of the CNN and enhances the ability of the network to make discriminative decisions. Spatial Transformer Networks (STN)[29] represents spatial attention, which can transform various deformation data in space and automatically capture important regional features, demonstrating the strong potential of the spatial attention mechanism. Woo et al.[30] proposed Convolutional Block Attention Module (CBAM), which combines channel attention and spatial attention in a concatenated manner to capture the complex dependencies between channel and space, further enhancing the capacity of the model to understand complex scenes. In this context, this paper introduces CPCA[31] into the network, which not only integrates the channel and spatial attention mechanisms, but also dynamically adjusts the distribution of the attentional weights, this strategy effectively strengthens the ability of the network to perceive the subtle structure and pathological changes of the honeycomb lung.

SECTION 3

Proposed Network

The network proposed in this paper is based on the unique characteristics of honeycomb lung lesions. Its complexity arises from the morphological variability, inconsistent spatial dispersion, and heterogeneity of the lesions, posing challenges that conventional segmentation methods struggle to address effectively. The modelling approach in this paper focuses on addressing the following key challenges.

Morphological heterogeneity: the atypical, cystic structures exhibit diverse sizes and arrangements, necessitating a model with heightened sensitivity to local variations and the capability to discern intricate architectures. Spatial anisotropy and distribution: the uneven scattering of lesions throughout the pulmonary tissue, from scattered instances to dense conglomerates, calls for a feature extraction methodology capable of integrating both broad contextual understanding and local idiosyncrasies. Structural complexity and interconnectivity: the intricate interconnectivity of lesions and their variable abundance pose challenges in capturing long-term dependencies and maintaining spatial coherence in segmentation.

The theoretical foundation of the approach rests upon recognizing that, while traditional CNNs excel in processing structured data, they inherently fall short in comprehensively grasping global context. Conversely, Transformer models are less efficient in handling the nuanced, locally-focused features characteristic of the heterogeneous lung lesion landscape. This prompted the integration of a two-branch framework that harmoniously combines the strengths of CNN for local feature extraction with Transformer for capturing global dependencies.

3.1 General Framework

Due to the complex morphological changes and irregular distribution of honeycomb lung lesions, it is difficult to obtain accurate segmentation results with ordinary segmentation models. The method proposed in this paper can be well applied to the variation of honeycomb lung characteristics to obtain ideal lesion segmentation results. Figure 1 shows the general framework of the honeycomb lung segmentation network based on P2T and CNN two-branch parallelism. The encoder consists of three parts: the CNN branch, the Transformer branch, and the feature fusion module TMC. The CNN branch is used to extract the local features of the honeycomb lung, which contains four convolutional layers, each consisting of a base convolutional block containing a convolution with a window of $3\times 3$ , a batch normalization operation with a ReLU activation function; the Transformer branch consists of a P2T backbone and focuses on extracting the global features of the honeycomb lung; the TMC module is responsible for organically fusing the features extracted by the two branches in stages to enrich the feature information captured by the model. The semantic features are recovered layer by layer through the decoder, which uses bilinear interpolation to gradually restore the low-resolution feature maps output in the encoder to the original image size, and uses a $1\times 1$ convolution in the last layer to perform pixel-level segmentation prediction, during which the output features of the TMC are spliced with the decoder features of the corresponding layers using skip connection to compensate for the information loss[32] during the decoding process; at the same time, the lightweight and efficient CPCA is used to refine the honeycomb lung prediction map and obtain accurate segmentation results.

Fig. 1

Honeycomb lung segmentation network with parallel two-branch structure.

Show All

3.2 Transformer Branch

Honeycomb lung lesions exhibit profound heterogeneity and nonuniformity stemming from their distinctive pathological attributes. Characterized by aberrant vesicular morphology, uneven spatial dispersion, marked size disparities, fluctuating counts, and intricate interconnections among certain vesicles, these lesions manifest throughout the pulmonary parenchyma, in myriad locations, sizes, and configurations. Their abundance ranges from sparsely scattered to densely packed, significantly exacerbating the challenge of accurate identification and segmentation. While conventional CNN-based segmentation frameworks generally excel in handling lesions of a regular and consistent nature, they struggle to fully apprehend and synthesize the vast array of both global and local features inherent in highly diverse honeycomb lung lesions, thus compromising segmentation precision and model generalization prowess. To solve this problem, the Transformer can be used as a functional branch for extracting global information to increase the network context-sensitive domain and capture the correlations between different regions. Consequently, this study uses the P2T backbone, characterized by its efficacious global feature learning prowess, as the Transformer branch within the model architecture.

The P2T backbone is shown in Fig. 2a, where the original input image is segmented into patches of $\frac{H}{4}\times\frac{W}{4}$ . Learnable positional coding is added and a sequence of patches is generated via the patch embedding module. It is then fed into the P2T network stacked by pyramid pooling Transformer blocks for global dependency capture. In this case, the output feature maps of the four stages are of sizes $\frac{H}{4}\times\frac{W}{4}, \frac{H}{8}\times\frac{W}{8}, \ \frac{H}{16}\times\frac{W}{16}$ , and $\frac{H}{32}\times\frac{W}{32}$ . The pyramid pooling Transformer block is specifically shown in Fig. 2b, where the input features are used to establish associations between patches by pooling multi-head self-attention operation and feature projection using a feed-forward network after layer normalization operation. The complex distribution of lesions is learned efficiently to improve the accuracy of segmentation of honeycomb lung lesions. The above process is shown in Eq. (1):

$\begin{align*}X_{\text{out}}= & \text{LN}(\text{LN}(X+\text{PMHSA}(X))+\\ & \text{FFN}(\text{LN}(X+\text{PMHSA}(X)))\tag{1}\end{align*}$ View Source

where

$X$

and

$X_{\text{out}}$

are the inputs and outputs of the pyramid pooling Transformer block,

$\text{PMHSA} (\cdot)$

is the pooled multi-head self-attention,

$\text{FFN} (\cdot)$

is the feed-forward network, and

$\text{LN} (\cdot)$

is the layer normalization operation, respectively.

In the pooling-based multi-head self-attention mechanism, exemplified in Fig. 2c, an initial average pooling operation is performed on the input feature map $X$ to yield a pyramid-shaped set of pooled feature maps. This procedure diminishes spatial dimensions while conserving crucial global and local statistical characteristics by averaging over distinct regions within the feature map. Equation (2) elucidates the manner in which these pooled feature maps are utilized to generate novel feature representations incorporating relative position coding:

$\begin{equation*}P= \text{DWConv} (\text{AvgPool}_{i}(X))+ \text{AvgPool}_{i}(X)\tag{2}\end{equation*}$ View Source

where

$\text{DWConv} (\cdot)$

denotes the

$3\times 3$

depthwise separable convolution[33], which is employed for local feature extraction and spatial association modeling on the

$i-\text{th}$

layer pooled feature map

$\text{AvgPool}_{i}(X)$

, thereby introducing relative position information among adjacent feature elements. Subsequently, the feature map processed via depthwise separable convolution is element-wise summed with the feature map

$\text{AvgPool}_{i}(X)$

obtained through direct pooling without depthwise separable convolution. This ensures that the resultant feature representation

$P$

integrates both intricate local feature structures and global contextual awareness, with the relative position relationships between feature elements being thoroughly accounted for.

Fig. 2

Transformer branch of the proposed model, (a) the architecture of P2T stem, (b) the pyramid pooling Transformer block, and (c) the pooling-based multi-head self-attention mechanism.

Show All

After that, the feature map $P$ is reshaped into a sequence containing contextual abstract information for multi-head self-attention computation by layer normalization operation. In multi-head self-attention of P2T, the linear transformation weight matrices of Query $(Q)$ , Key $(K)$ , and Value $(V)$ are generated by $W_{q},\ W_{k}$ , and $W_{v}$ , respectively, which are then fed into the multi-head self-attention mechanism as shown in Eq. (3):

$\begin{equation*}\text{Attn} = \text{SoftMax} \left(\frac{Q\times K^{\mathrm{T}}}{\sqrt{d_{k}}}\right)\times V\tag{3}\end{equation*}$ View Source

where

$d_{k}$

represents the number of channels of

$K$

, while

$\sqrt{d_{k}}$

denotes the approximate normalization. The SoftMax function acts upon the dot product of

$Q$

and

$K$

to produce a probability distribution that elucidates the strength of interdependencies, or attention weights, among the positions within the input sequence. This function has a propensity to emphasize those positions demonstrating the strongest correlation with the current query vector

$Q$

. Additionally, the SoftMax function generates a distinct probability distribution for each position in the input sequence, each revealing the interdependency strengths between respective positions. Through this mechanism, locations in the input sequence are differentially weighted according to their varying significance, enabling effective modeling of contextual information.

Via the pooling operation, the lengths of $K$ and $V$ are reduced, resulting in sequence lengths that are shorter than those of the input $X$ . Consequently, the pooled multi-head self-attention mechanism introduced by P2T exhibits reduced computational complexity compared to conventional self-attention. Furthermore, as $K$ and $V$ conserve essential contextual abstract information, this pooling-based multi-head self-attention demonstrates heightened proficiency in modeling global contextual dependencies. This enhanced capability contributes to a deeper understanding of intricate honeycomb lung images by the model.

3.3 Feature Fusion Module (TMC)

Given the size heterogeneity of honeycomb lung lesions, the model needs to use diverse feature representations from both Transformer and CNN branches to enhance segmentation precision. However, simple concatenation of features derived from these two branches may engender feature redundancy[34], which in turn affects the model performance. As depicted in Fig. 3, to circumvent feature redundancy and ensure judicious exploitation of feature information, this work proposes the Transformer Merged CNN module.

Firstly, the multiscale attention computation is performed on the feature maps $T$ and $C$ , respectively, and the output feature maps of one of the depthwise separable convolutions are Sigmoid-activated. Then, the activated feature map is multiplied element-by-element with the output of the other depthwise separable convolution to encode the positional information of neighbouring pixels in the space. Finally, the attention feature maps $T^{\ast}$ and $C^{\ast}$ are output by Concat with convolution operation.

To enhance the feature representation of $T^{\ast}$ and $C^{\ast}$ , they are merged into the convolutional layer for feature weight calculation to obtain the weight $W$ , and then $W$ is nonlinearly transformed by the GeLU activation function to obtain the fused feature weight $\overline{W}$ . The GeLU activation function, grounded in the Gaussian distribution, excels at preserving the extensive dynamic range of the original features and mitigating information loss due to excessive smoothing, outperforming activation functions like ReLU in this regard. GeLU's ability to maintain feature diversity, contributed by both the Transformer and CNN during fusion, is thus ensured. Moreover, GeLU displays near-linear properties around zero values, with its slope diminishing as it moves away from zero. This non-monotonic characteristic aids in suppressing overly dominant or extraneous feature signals, effectively mitigating the influence of redundant information.

Fig. 3

Proposed Transformer merged CNN module.

Show All

In this module, $\overline{W}$ is multiplied with $T^{\ast}$ and $C^{\ast}$ , respectively, and then multiplied element-by-element with $W$ to suppress irrelevant features. The above process is shown in Eqs. (4)–(6):

$\begin{gather*}W= \text{Conv} (\text{Concat} (T^{\ast},C^{\ast}))\tag{4}\\ \overline{W}=\text{GeLU}(W)\tag{5}\\ T^{\ast\prime}=W\cdot(T^{\ast}\cdot\overline{W}),C^{\ast\prime}=W\cdot(C^{\ast}\cdot\overline{W})\tag{6}\end{gather*}$ View Source

$T$ and $C$ are residual-connected to $T^{\ast\prime}$ and $C^{\ast\prime}$ , respectively, and are fed into the MLP for nonlinear feature transformation to obtain the output $\text{TMC}_{\text{out}}$ that fuses global and local features, $\text{TMC}_{\text{out}}$ will be fed into the CNN branch at the next level of hierarchy for feature encoding. In addition, it will also be fed into the decoder of the corresponding level for feature information supplementation through skip connection.

3.4 Channel Prior Convolutional Attention

Attention mechanisms are widely used in medical image segmentation tasks to enable models to focus on the most relevant regions for feature learning[35], but existing attention mechanisms cannot often adapt to changes in image content[36]. Honeycomb lung lesions possess complex morphology and texture, and if only channel or spatial attention is used, it will not only result in the model's lack of attention to features in another dimension but also lead to the degradation of segmentation performance due to the lack of adaptive ability of the attention mechanism[37], [38]. To solve this problem, CBAM integrating channel and spatial attention, guiding the model to emphasize crucial regions in both dimensions. However, the adaptive ability of the model is limited by the consistent spatial attention weight distribution. Specifically, its spatial attention feature maps are computed following channel compression, rendering the weights non-dynamically adjustable based on channel information. In contrast, CPCA exhibits dynamic weight distribution across spatial and channel dimensions and uses multi-scale deep convolutions for efficacious feature extraction. Thus, the proposed method incorporates CPCA into the decoder, enabling dynamic enhancement of honeycomb lung structural information and refinement of lesion segmentation outcomes.

As shown in Fig. 4, CPCA consists of channel attention and spatial attention. In the channel attention part, the information in the feature map $F$ is firstly aggregated using average pooling and maximum pooling operations, and then the feature information is processed by MLP and the output is summed at the element level, and finally activated by the Sigmoid function to obtain the attention feature map $\text{CA}_{\text{out}}$ . In addition, the prior information CP in the spatial attention part is obtained by the element-by-element multiplication of the input features $F$ and $CA_{out}$ , and the above procedure as shown in Eqs. (7) and (8):

$\begin{gather*}\text{CA}_{\text{out}}=\zeta (\text{MLP} (\text{AvgPool} (F))+ \text{MLP}(\text{MaxPool} (F)))\tag{7}\\ \text{CP} =F\cdot \text{CA}_{\text{out}}\tag{8}\end{gather*}$ View Source

where

$\zeta$

is the Sigmoid activation function.

The spatial attention feature maps for this module are implemented instead by extracting the spatial mapping of CP. The key spatial features of each channel are extracted using depthwise separable convolution with a multi-scale structure, thus generating dynamically distributed spatial attention maps on each channel. These dynamically distributed spatial attention maps are very close to the actual feature distribution, effectively improving the network segmentation performance. Then a $1\times 1$ convolution is used to complete the channel feature mixing and generate the feature maps $\text{SA}_{\text{out}}$ . The above process is shown in Eqs. (9)–(11):

$\begin{gather*}\text{SA}_{\text{out}}^{(1)}= \text{DWConv}_{5 \times 5}\left(\sum\limits_{i=0}^{3} \text{DWConv}_{i}(\text{CP})\right)\tag{9}\\ \text{SA}_{\text{out}}^{(2)}=\text{DWConv}_{5\times 5} (\text{CP})\tag{10}\\ \text{SA}_{\text{out}}= \text{Conv}_{1\times 1}(\text{SA}_{\text{out}}^{(1)}+\text{SA}_{\text{out}}^{(2)})\tag{11}\end{gather*}$ View Source

Finally the spatial attention feature map $\text{SA}_{\text{out}}$ is multiplied element-by-element with the channel prior CP to refining the features and enhancing the representation of the feature map.

3.5 Loss Function

The honeycomb lung CT image dataset is marred by a pronounced class imbalance phenomenon, wherein the pixel count associated with honeycomb lung lesions is substantially lower than that of background pixels. This skewed distribution poses a formidable challenge in image analysis. To tackle this issue, an adaptive weighted hybrid loss function is devised for model training, blending the Focal loss function with the BCE loss function. The Focal loss function, tailored for imbalanced classification tasks, downweights the contribution of readily classifiable instances to the overall loss. The mathematical expressions for the Focal loss function and the BCE loss function are presented in Eqs. (12) and (13):

$\begin{gather*}\text{Loss}_{\text{Focal}}=- \sum\limits_{i=0}^{n}[\alpha(1-\hat{y}_{i})^{\gamma} y_{i}\log\hat{y}_{i}+\\ \qquad\qquad\qquad (1 -\alpha)\hat{y}_{i}^{\gamma}(1- y_{i})\log(1-\hat{y}_{i})]\tag{12}\\ \text{Loss}_{\text{BCE}}=-\sum\limits_{i=0}^{n}[y_{i}\log\hat{y}_{i}+(1- y_{i})\log(1-\hat{y}_{i})]\tag{13}\end{gather*}$ View Source

where

$\hat{y}_{i}$

is the predicted segmentation result,

$y_{i}$

is the labeled value,

$n$

is the number of samples, and

$\alpha$

and

$\gamma$

are hyperparameters in the Focal loss function.

Fig. 4

Channel prior convolutional attention module.

Show All

The newly designed adaptive weighted hybrid loss function effectively counteracts the detrimental effect of category imbalance on segmentation accuracy. By flexibly integrating these two complementary loss functions, it attenuates the interference of background regions on the model prediction and thus improves the ability of the model to accurately segment lesions. The mathematical expression for this custom loss function is given in Eq. (14):

$\begin{equation*}\text{Loss} =w_{1}\text{Loss}_{\text{Focal}}+w_{2}\text{Loss}_{\text{BCE}}\tag{14}\end{equation*}$ View Source

where

$w_{1}$

and

$w_{2}$

are the weight coefficients that are adaptively assigned during the training process.

SECTION 4

Experiment

4.1 Dataset Acquisition and Pre-Processing

The Honeycomb Lung Dataset was retrospectively collected and provided by the CT Div of Shanxi Bethune Hospital, containing 7050 CT images of the honeycomb lungs acquired from 163 patients, each of which was jointly annotated by two experienced thoracic radiologists guided by uniform criteria. The CT images underwent a series of preprocessing procedures: sequential desensitization, Gaussian noise reduction, and resizing to a uniform dimension of $256\times 256$ pixels. Subsequently, data augmentation was employed through adjustments to image saturation, brightness, and contrast, thereby mitigating the risk of overfitting without necessitating additional image acquisitions. In the dataset partitioning phase, it was ensured that augmented images were exclusively assigned to the training and validation subsets, while the test subset contained unaltered, independent instances to facilitate a rigorous evaluation of the model's generalization capabilities on unseen data. Ultimately, the dataset was stratified into training (70%), validation (15%), and test (15%) partitions.

In this paper, Kaggle's publicly available COVID-19 CT scans dataset is used to construct a dataset for testing the stability of the proposed method. This original dataset covers high-quality CT scans of 20 patients diagnosed with COVID-19 and their corresponding lesion segmentation labels. These complete CT scans were meticulously sliced during the creation of the Covid dataset, resulting in a resource containing 1350 CT images of the lungs in the COVID-19 infected state. To further ensure the effectiveness and fairness of model training, the pre-processing steps of the previous Honeycomb Lung Dataset were borrowed, and these images were resized to a uniform resolution of $256\times 256$ pixels and the data were augmented by a series of image enhancement techniques, which improved the diversity and robustness of model training and generalization. In the dataset partitioning stage, this paper maximally excludes the interference of intra-patient data correlation within the same patient on the accuracy of model performance assessment by avoiding the crossover of consecutive CT slices from the same patient between the training, validation, and test sets. Finally, the Covid dataset was stratified into training (70%), validation (15%), and test (15%) partitions.

4.2 Evaluation Indicator

To quantitatively analyze the model segmentation results, the commonly used Intersection over Union (IoU), Dice coefficient, Mean Intersection over Union (mIoU), and Precision (Pre) are selected as the evaluation metrics of the model. They are used to assess the similarity between predicted and actual values. IoU is used to assess the model segmentation ability, and takes the value in the range of [0, 1], with the best result being 1 and the worst result being 0. mIoU denotes the average of the IoU values for each category on the dataset. The Dice coefficient is usually used to calculate the similarity between two samples and takes a value in the range of [0, 1], the higher scores the better. Pre is the number of pixels with a positive predictive percentage of pixels, and the closer the value is to 1 in the range of [0, 1], the better the prediction. The equations are expressed as

$\begin{gather*}\text{IoU}=\frac{\text{TP}}{\text{FP}+\text{FN}+\text{TP}}\tag{15}\\ \text{Dice} = \frac{2\text{TP}}{\text{FP}+\text{FN}+2\text{TP}}\tag{16}\\ \text{mIoU}=\frac{1}{k+1} \sum\limits_{i=0}^{k}\frac{\text{TP}}{\text{FP}+\text{FN}+\text{TP}}\tag{17}\\ \text{Pre} = \frac{\text{TP}}{\text{FP}+\text{TP}}\tag{18}\end{gather*}$ View Source

where TP is the number of true positives, FP is the number of false positives, FN is the number of true negatives, and

$k$

is the number of segmentation categories.

4.3 Validation of Model Effect

4.3.1 Experimental Environment

The programming environment in this paper is implemented using Python 3.9 under the PyTorch framework, the operating system is Ubuntu 20, and the models are all trained on an NVIDIA RTX 3090 (24 GB) graphics card. During training, the optimizer is chosen to use Stochastic Gradient Descent (SGD)[39], with the initial learning rate set to 0.001, momentum set to 0.9, and batch size set to 16. The model ends the training after 200 iterations, during which the best-performing model weights are automatically saved based on the training results. To be fair, the loss function in the comparison experiments uses the adaptive weight mixing loss function proposed in Eq. (14).

4.3.2 Module Ablation Experiment

To rigorously validate the effectiveness of each constituent module within the proposed model, ablation studies are conducted on the Honeycomb Lung Dataset. Specifically, the performance contributions of the P2T backbone, TMC module, CPCA module, and custom loss function are individually assessed through this systematic approach. The outcomes of these ablation experiments are summarized in Table 1, where the best-performing results are emphasized in bold.

Table 1 Ablation experimental results of P2T backbone, TMC module, CPCA module with weighted adaptive hybrid loss function.

Specifically, a baseline network architecture is first established for baseline experiments, which integrates a CNN branch encoder, decoder, and skip connection, aims to provide a performance reference standard for subsequent experiments. Next, to verify the usefulness of the CPCA module, the CPCA module is integrated on the baseline network, and the results show that it improves the IoU by 0.82% compared to the baseline, effectively confirming the value of the CPCA module. To further optimize the feature extraction strategy, the P2T backbone is introduced as the Transformer branch in the encoder and feature maps from the dual branches are directly concatenated in the network. This adjustment drives the IoU and mIoU to increase by 1.35% and 1.2%, respectively, over the baseline network, highlighting the contribution of the P2T backbone to improving segmentation accuracy. On this basis, the use of the TMC module instead of the direct concatenation operation achieves another 1.3% gain in Pre compared to the previous experiment, which emphasises the superiority of the TMC module in feature integration. Considering the above improvements together, we construct a composite network integrating the CPCA module, and a two-branch encoder with the P2T backbone, CNN branch, and TMC module, which achieves improvements in all four key evaluation metrics compared to the baseline network, with the IoU, mIoU, and Dice coefficient improved by 3.05%, 2.08%, and 2.25%, respectively, which comprehensively validates the synergistic effect among the modules. Finally, to explore the effect of loss function on model training, adaptive weighting hybrid loss function is adopted for training, and the experimental results show that compared with the previous experimental results, this adjustment brings further improvement in all the four evaluation metrics, which once again proves the comprehensiveness and innovativeness of the proposed method.

Figure 5 illustrates the visual segmentation results of the ablation study, visualising the performance differences between the networks in each configuration. The baseline network performed the worst in segmenting lesions, especially subtle lesion edges, with occasional edge misclassification. The network incorporating the CPCA module still falls short in accurately outlining lesion edges, occasionally misclassifying lesion areas as healthy tissue. The network with the addition of the P2T backbone but without the introduction of the TMC module can roughly define the extent of the lesion, but the improvement in the accuracy of the edge definition is limited, and a certain amount of noise interference can still be seen in the image. The model that uses the TMC module to replace the directly concatenated two-branch features makes some progress in lesion localisation, but the problem of marginal noise still exists and the improvement is not significant enough. In contrast, the composite model combining the CPCA module, and the two-branch encoder consisting of the P2T backbone, the CNN branch, and the TMC module provides a more refined segmentation output, and demonstrates higher performance especially in the identification of tiny lesions. The final network, i.e., the network trained with the adaptive weights hybrid loss function based on the aforementioned composite model, has the highest segmentation results with the best fit to the real labels, and not only performs excellently in lesion edge prediction, but also significantly reduces the noise in the prediction maps, with optimal overall performance. In summary, the gradual optimization of the experimental sequence from initial to final, especially the excellent performance of the final model, strongly verifies the effectiveness of the design of the modules proposed in this paper and the efficiency of the synergies among them, which together contribute to the high-quality and low-noise lesion segmentation results.

Fig. 5

Qualitative results of module ablation experiments on the Honeycomb Lung Dataset.

Show All

4.3.3 Comparative Experiment

To verify the effectiveness of the proposed method in this paper, the current segmentation networks UNet[13], ResUNet[16], ConvUNext[18], TransUNet[40], SwinUNet[41], ScsoNet[42], and PVTFormer[43] are selected as the comparison models to conduct experiments on the Honeycomb Lung Dataset. To ensure the fairness of the experiment, the experimental environment and hyperparameters of each model are uniformly configured.

The segmentation results are shown in Table 2, where the bolded data are the optimal results. In IoU metrics, the proposed method improves by 5.14%, 3.73%,2.48%,0.91%,2.99%,3.98%, and 1.27% over UNet, ResUNet, ConvUNext, TransUNet, SwinUNet, ScsoNet, and PVTFormer, respectively. Similarly, in mIoU, Dice coefficient, and Precision metrics, the proposed method has the highest scores of 0.9363, 0.9268, and 0.9012, which are 4.02%, 4.31%, and 2.96% higher than that of UNet, respectively, which is a significant improvement.

As shown in Fig. 6, the proposed method demonstrates outstanding performance on the Honeycomb Lung Dataset. Compared to other segmentation approaches, it consistently achieves either the highest or second-highest median scores across all evaluation metrics, indicating superior performance on these measures. Concurrently, the method's box length is notably shorter, signifying a smaller range of variation in the corresponding metrics and hence greater stability in its performance. Furthermore, the upper and lower whiskers of the proposed method are also among the shortest or second shortest, evidencing enhanced consistency and stability in the scores for the respective indicators. By examining the relative positioning within the box plots, it can be seen that the proposed method consistently outperforms its counterparts. In conclusion, the analysis presented in Fig. 6 substantiates that the proposed method excels on the test set, manifesting higher median scores, shorter box lengths, and reduced upper and lower whisker lengths in each metric, thus demonstrating clear advantages in performance, stability, and consistency.

Fig. 6

Metrics score performance of different segmentation methods on the Honeycomb Lung Dataset.

Show All

Table 2 Quantitative results of comparative experiments performed with different segmentation methods on the Honeycomb Lung Dataset.

Figure 7 visually presents qualitative results from the comparative experiments, where the green line demarcates Ground Truth, while the red line signifies the predicted output of each respective network. UNet emerges as the poorest performer, manifesting not only in its susceptibility to noise interference in focal predictions, but also in its tendency to overlook genuine lesion areas and erroneously classify normal tissue as diseased. These observations underscore UNet's inadequacies in noise suppression, lesion detection, and edge localization. ResUNet and SwinUNet, while demonstrating improved segmentation performance over UNet, still exhibit several issues. Despite their ability to mitigate noise impact in predictions, the lesion edges in their output remain discontinuous, pointing to their misinterpretation of intricate boundaries and imprecise lesion localization. ScsoNet, although deploying a strategy of deep fusion between spatial and channel features to boost overall segmentation, remains insufficient in capturing and leveraging global contextual information from minute, scattered, and intricate lesions. This limitation engenders a pronounced divergence between ScsoNet's predictions and true labels, particularly in the fine-grained portrayal of lesion edges. ConvUNext and TransUNet, in contrast, display enhanced noise suppression and smoother lesion edges in their predictions. However, they too fall short of achieving a perfect overlay with the labels, suggesting scope for refinement in accurately reproducing lesion morphology. PVTFormer, utilizing a PVT backbone as its encoder and coupling advanced residual up-sampling techniques with a hierarchical decoder, is designed to precisely delineate fine lesion boundaries through meticulous feature processing and staged reconstruction. Despite generally outlining lesions better, PVTFormer's predictions still exhibit label inconsistencies. This may stem from PVT's relatively coarser positional encoding, which struggles to satisfactorily accommodate the task of fine lesion segmentation, particularly when confronted with lesions replete with intricate details and complex textures, leading to potential missed or inaccurate segmentation. Compared to these networks, the novel method proposed herein exhibits superior noise suppression and edge refinement. Although not entirely congruent with labels in isolated regions, the lesions predicted by this method generally align more closely with real lesion labels in terms of shape and extent, with higher prediction accuracy, particularly when dealing with complex honeycomb lung lesions. This underscores the method's segmentation prowess in tackling such intricate pathological patterns.

Fig. 7

Qualitative results of comparative experiments with different segmentation methods on the Honeycomb Lung Dataset.

Show All

4.3.4 Validity Experiment

To deeply validate the effectiveness of the proposed method, the model is re-trained and tested on the Covid dataset, and not only a comprehensive experimental procedure is implemented, but also an exhaustive comparison of the proposed method with other methods is carried out in this paper. Table 3 presents the detailed results of this series of comparison experiments, where the best performance metrics are marked in bold.

Due to the small number of samples and tiny scattered lesions in this dataset, the results of most methods are not ideal, but the proposed method still has a good performance, with the IoU and Dice coefficient of 0.7941 and 0.8875, which are 1.54% and 1.13% higher than that of the better-performing TransUNet, respectively. Meanwhile, the mIoU with Pre of 0.8907 and 0.8734 are also better than other methods, which further proof of its excellent performance. The segmentation results are shown in Fig. 8, and the performance of proposed method of this paper in terms of image contour and detail restoration is more suitable to the Ground Truth compared to other methods, which further proves its effectiveness.

Fig. 8

Qualitative results of validity experiments conducted by different segmentation methods on the Covid dataset.

Show All

Table 3 Quantitative results of validity experiments with different segmentation methods on the Covid dataset.

4.3.5 Analysis of Computational Complexity

The present investigation examines the intricate correlation among computational complexity (quantified in Giga Floating-point Operations Per second (GFLOPs)), the number of parameters, and segmentation efficacy to holistically evaluate the proposed method's effectiveness and operational efficiency. Notably, the parameter is intimately tied to storage demands and the likelihood of overfitting, whereas GFLOPs gauge the computational intensity of the model operations. As evidenced in Table 4, the proposed model features 38.12 million parameters, entailing a computational load of 16.93 GFLOPs, positioning its parameter size intermediate between those of UNet and SwinUNet, and its computational intensity falls neatly between the less demanding ResUNet and the higher TransUNet. This design choice underscores the method's meticulous balancing act between model scalability and computational expedience, all while pursuing high performance.

Of note, the proposed method excels over the majority of comparative models in achieving higher IoU and Dice coefficient in both honeycomb lung and COVID-19 segmentation tasks, despite the fact that the proposed method does not have the least parameters. This observation underscores how a meticulously crafted network architecture can yield exceptional segmentation outcomes with a moderate parameter budget, thereby validating the efficacy of the algorithmic design. In addition, comparing models with similar computational complexity, such as the PVTFormer with larger computation, the proposed method can still achieve similar or even better segmentation results while reducing about half of the GFLOPs.

Ultimately, the method put forth herein strikes equilibrium between the number of model parameters and computational intricacy, successfully enhancing segmentation performence while upholding a pragmatically manageable storage requirements and computationally economical demands.

Table 4 Comparison of parameters, computational complexity and performance of different models on honeycomb lung and Covid-19 segmentation tasks.

SECTION 5

Discussion

This study proposes a honeycomb lung segmentation network adopting a parallel two-branch architecture integrating P2T and CNN. Within this architecture, the P2T backbone serves as the Transformer branch, leveraging its potent global feature extraction capabilities to effectively discern contextual information and spatial correlations within lesions, particularly those displaying diverse morphologies and intricate distributions, to which the model demonstrates strong adaptability. Concurrently, the CNN branch is dedicated to extracting local features of the lesions, meticulously depicting fine structural details and edge characteristics. The design of the TMC module facilitates the efficient fusion of features from both branches, ensuring that the model comprehends the overall lesion features while simultaneously capturing local nuances. The CPCA module strengthens the model's capability to localize lesion regions, with particular emphasis on edges and small lesions. Unlike traditional attention mechanisms, CPCA emphasizes achieving a dynamically tailored distribution of information weights across channel and spatial dimensions, guiding the model to concentrate on lesion-relevant features, thereby reducing background noise interference and enhancing segmentation boundary precision. Lastly, the introduced adaptive weight hybrid loss function strategically attenuates the model's excessive focus on background regions by adaptively adjusting the weighting of positive and negative samples, contributing to a decrease in the probability of missed diagnoses.

Ablation experiments conducted on the Honeycomb Lung Dataset validate the pivotal contributions of the two-branch architecture, feature fusion, channel prior convolutional attention mechanism, and adaptive loss function in addressing the complexities inherent in honeycomb lung segmentation. Comparative evaluations reveal that the proposed method consistently enhances the segmentation accuracy of honeycomb lung pathologies when juxtaposed against established approaches. It surpasses conventional CNN-based models, such as UNet, ResUNet, and ScsoNet, across all assessed performance metrics. While traditional convolutional layers struggle to efficiently capture distant contextual cues, the employed P2T backbone excels at extracting global contextual information, with the subsequent feature fusion module consolidating and optimizing the utilization of both global and local feature representations. This synergy results in a more precise and holistic segmentation of honeycomb lung abnormalities. Furthermore, when benchmarked against techniques incorporating the Transformer mechanism or its underlying principles, such as ConvUNext, TransUnet, SwinUNet, and PVTFormer, the present method exhibits superiority due to its innovative design. Beyond the adept harnessing of global context via the P2T backbone and the optimized information fusion via the feature fusion module, the inclusion of the CPCA module enables the model to adaptively prioritize salient features, further enhancing its discriminatory capacity. The method's exceptional performance on the Covid dataset serves as an additional testament to its generalizability and effectiveness. The quantitative and qualitative analyses of the experimental results show that compared with other existing methods, the proposed method has improved in terms of robustness and accuracy.

Although the proposed method has achieved remarkable results in the experiments, it still has some limitations. In terms of the integrity and smoothness processing of lesion edges, this method is still deficient, which may be because the method is based on 2D images to carry out the analysis, and cannot make full use of the continuity information of lesions in 3D space. Therefore, future work aims to extend the existing segmentation method to a 3D model, which comprehensively analyses data information from multiple views such as cross-sectional, sagittal, and coronal planes, to better capture and preserve lesion edge features during lesion segmentation. Second, the relatively high computational complexity of the model may limit its deployment and application in resource-limited environments. Therefore, future work will be devoted to exploring and integrating efficient computational methods, to improve the operational efficiency of the algorithm while maintaining the segmentation accuracy, to meet the requirements of fast response and low resource consumption in clinical practice. In addition, incorporating more medical knowledge or multimodal data may further enhance the model segmentation performance, which is an aspect worth exploring in depth in the future.

SECTION 6

Conclusion

In this study, an innovative honeycomb lung segmentation network based on the two-branch parallel structure of P2T and CNN is proposed, which successfully fuses the global feature extraction capability of the Transformer and the local feature capture capability of CNN. Through the implementation and introduction of the feature fusion module, the channel prior convolutional attention module, and the adaptive weighted hybrid loss function, the segmentation performance of the model for honeycomb lung lesions with different sizes, complicated textures, and uneven distributions is improved. The experimental results show the advantages of the method in dealing with complex lesion distribution, tiny lesion segmentation, and noise suppression. This is expected to provide a powerful tool for the accurate segmentation of honeycomb lung lesions, assisting physicians to improve the accuracy and efficiency of clinical diagnosis, which is of great value for personalized treatment decisions and disease management.

ACKNOWLEDGMENT

This study was jointly supported by the Central Leading Local Science and Technology Development Fund (Nos. YDZJSX2021C004 and YDZJSX20231C004) and the Natural Science Foundation of Shanxi Province (No. 20210302124554).

References is not available for this document.

MIT Libraries

MIT Libraries

Honeycomb Lung Segmentation Network Based on P2T with CNN Two-Branch Parallelism

Abstract:

Metadata

Abstract:

Funding Agency:

Introduction