Introduction
While unmanned aerial vehicles (UAVs) have been used in many recreational, photography, commercial, and military applications, their flight safety may be threatened by widespread power lines (PLs) [1]. Hitting a PL may not only destroy the UAV but also damage power grids and electrical infrastructure. Given their very thin structures, however, PLs are prone to being missed by many detection sensors. To enable UAVs to detect and localize PLs during flight, this paper presents a new computer-vision approach that aims to accurately segment PLs from aerial images taken by cameras mounted on UAVs.
PL segmentation from aerial images is very challenging. From a bird's-eye view, the background of aerial images can be any place, e.g., desert, lakes, mountains, and cities, which shows significant variety and complexity. Moreover, PLs and their surrounding background may share a very similar color in many cases and, therefore, are difficult to distinguish from local image information. Finally, PLs have very thin structures and cover only a very small portion of the image, e.g., one or a few pixels wide in aerial images. As a result, PL segmentation results are prone to fragmentation, leading to poor segmentation performance.
There have been many deep-learning based algorithms developed for achieving state-of-the-art performance on general-purpose line-segment detection [2], [3], [4], [5], most of which rely on the saliency of lines and joint inference of junctions. Neither of these properties, however, holds for PLs in most aerial images. The recent AFM model [6] detects line segments by constructing an attraction field map instead of inferring junctions. Nevertheless, it cannot handle complex backgrounds in aerial images, as validated in our later experiments. PL segmentation can be treated as a kind of semantic image segmentation, for which many advanced deep neural networks, such as FCN [7] and DeepLab [8], [9], have been developed with state-of-the-art performance on public image datasets such as Cityscapes and PASCAL VOC. However, without considering the shape and inter-pixel relations, these semantic segmentation networks cannot accurately capture very thin PLs whose color is similar to the surrounding background in aerial images.
To find the inter-pixel relations and enforce the global consistency between pixels, in this paper, we propose employing generative adversarial networks (GANs) as a backbone for PL segmentation. The main motivation is to leverage the min-max loss of GANs to help 1) generate a natural (real) image that accurately reflects the relationship between adjacent pixels, and 2) create a high-quality feature embedding for semantic image segmentation. Specifically, this paper presents a new PLGAN (PL Generative Adversarial Networks) to segment PLs from aerial images by employing adversarial learning. In the proposed PLGAN, we first include a multi-task encoder-decoder network to generate an image with highlighted PLs. Then, we extract the last feature representation (i.e., the one right before the output layer) of the decoder network and embed it in a semantic segmentation network to improve PL segmentation. We define comprehensive loss functions, including adversarial, geometry, and cross-entropy ones, for PLGAN training. Furthermore, we include a loss function in the Hough transform parameter space to highlight the long-thin nature of PLs. Extensive experiments, including ablation studies and comparison experiments with prior methods, on the public TTPLA dataset [10] and Massachusetts roads dataset for road segmentation [11], validate the effectiveness of the proposed method.
Our main contributions are summarized as follows.
A novel PLGAN network has been developed to segment extremely thin lines, such as power lines (PLs) from aerial images, under complex backgrounds. To the best of our knowledge, this is the first generative adversarial network (GAN) developed for line structure segmentation. The novelty of this approach lies in utilizing PL-highlighted images for the discrimination process and incorporating a semantic decoder, which takes highly representative embedding vectors as inputs to generate semantic images. The purpose of employing adversarial training is to produce realistic PL-highlighted images that further differentiate PL pixels from complex backgrounds. Leveraging the advantages of GANs [12], an embedding vector can be produced through adversarial training to capture the features and structural information of the input image, followed by a semantic decoder to learn and perform semantic segmentation based on this embedding vector.
A new loss function is introduced in the Hough transform parameter space. We use the Hough transform to map each pixel in the segmentation image to a sinusoidal curve in the parameter space. If a pixel on PLs is missing in the segmentation-image space, multiple points on the related sinusoidal curve will be missing in the parametric space. The Hough transform loss is defined to penalize those missing points in the parameter space, instead of one missing pixel in segmentation-image space. By doing so, the penalty for missing pixels in the segmentation-image space will be amplified, which forces the model to correct the flawed pixels. Meanwhile, the intersection of sinusoidal curves at the same points in the parameter space indicates that the associated pixels belong to the same PL. If any curves are missing in the parameter space, it will lead to a reduction in the intensity of the intersection points. Such a reduction implies that some pixels are missing in the segmentation-image space and the network must be penalized to learn how to identify and recover those missing curves. Thus, the proposed Hough transform loss can enhance global consistency for PLs in the segmentation-image space.
Extensive experiments have been conducted to evaluate the performance of PLGAN on the TTPLA dataset and the Massachusetts Roads dataset. PLGAN outperforms the state-of-the-art semantic segmentation models under most evaluation metrics. In particular, compared with models with similar sizes, PLGAN achieves the best scores under all metrics.
Related Work
The related work is discussed in four parts: power lines (PLs), line segment detection, semantic segmentation, and GANs.
A. Power Lines
Most existing PL-related datasets were designed with specific properties to simplify PL detection, such as synthetic PLs [13], manually cropped aerial images that yield sub-images focused on PLs [14], and images captured from the ground [15], to name a few. Compared with these datasets, the TTPLA dataset we use in this paper is more challenging and practical. It includes aerial images with very complex backgrounds and a wide variety of zoom levels, view angles, times of day, and weather conditions [10].
Most existing work on PL detection adopts traditional computer vision methods [16], [17], [18], [19], [20], which have multiple drawbacks. First, it is often assumed that the PLs are parallel and straight so that context-assisted information can be used to extract PLs [17], [19], while this assumption may not hold in practice. Second, extracting edge maps with traditional approaches requires good contrast between the PLs and the surrounding background, which can only be achieved in ideal cases [21]. In practice, the color of the PLs and the background could be very similar in aerial images. Third, traditional methods usually rely on predefined hyper-parameters to generate meaningful results. However, defining these hyper-parameters is very challenging, especially for those datasets with images taken in a wide range of conditions (e.g., different zoom levels, points of view, background, light, and contrast).
Recently, deep-learning based methods were investigated [13], [14], [22], [23], [24], [25], [26] for PL detection. Yetgin et al. [22] proposed an end-to-end CNN architecture with a randomly initialized softmax layer for jointly fine-tuning the feature extraction and binary classification – PL and non-PL background are classified at the image level. Yetgin et al. further developed a feature classification method for PL segmentation, where features are extracted from the intermediate stages of the CNN. In [27], a CNN-based classifier is developed to identify the input-image patches that contain PLs, and then the Hough transform is used as post-processing to localize the PLs in each patch. In [28], a deep CNN architecture with fully connected layers is proposed for PL segmentation, where the CNN inputs are histogram-of-gradient features – a sliding window is moved over each patch to classify it as PL or not. In [29], a UNET architecture is trained to segment PLs based on a generalized focal loss function that uses the Matthews correlation coefficient [30] to address the class imbalance problem. In [23], an attentional convolutional network is proposed for pixel-level PL detection; it consists of an encoder-decoder information fusion module and an attention module, where the former fuses the semantic information and the location information while the latter focuses on PLs. In [24], dilated convolutional networks with different architectures are explored to find the best architecture over a finite space of model parameters. Choi et al. [25] proposed a weakly supervised learning network for pixel-level PL detection using only image-level classification labels. However, besides the simplicity of the datasets as mentioned before, most of these CNN-based works formulate the problem as pixel-wise classification with convolutional neural networks (CNNs) and do not sufficiently consider global consistency in detection, which is essential for detecting very thin structures [26].
B. Line Segment Detection
Significant progress has been achieved in line segment detection using deep neural networks in recent years. Most deep line detection approaches rely on junction information to locate valid line segments: some methods jointly detect the junctions and line segments [2], [3], while others detect only the junctions and then employ sampling methods to deduce the line segments [4], [5]. However, these methods are not applicable to our task since PLs in aerial images may not always be straight and often lack junctions.
C. Semantic Segmentation
Deep neural networks for semantic segmentation [31], [32] rely on pooling layers to reduce the spatial resolution in the deepest FCN layers. Consequently, predictions around segmentation boundaries often suffer from inadequate contextual information [26], [33], [34], [35]. Dilated convolutions have been introduced to capture larger contextual information [8], [31], [36], [37], [38], which, however, still cannot generate global context just from a few neighboring pixels [34]. Encoder-decoder structures have emerged to overcome the drawbacks of atrous convolutions [39], [40], [41]. However, the prediction accuracy is still limited when recovered from the fused features [42]. In addition, the softmax cross-entropy loss limits semantic segmentation performance [26], [33] by ignoring the correlation between pixels. Many of these limitations can be observed in segmenting very thin PLs, and we will include several of the above methods in our comparison experiments.
D. GAN-Based Semantic Segmentation
Generative Adversarial Networks (GANs) [43] have been widely used in image translation [44], [45], super-resolution [46], inpainting [47], salient object detection [48], and image editing/manipulation [49]. There are also models that utilize GANs for creating semantic segmentation images. In the early research presented in [50], the authors introduced an approach that utilizes adversarial networks for performing semantic segmentation of the colored input image. In [51], the authors employ GANs and transfer learning for the segmentation of biomedical cell images. Hung et al. [52] proposed an adversarial learning scheme for semi-supervised semantic segmentation. It is worth mentioning, however, that directly applying GANs to generate/discriminate semantic images may not be desirable: the discriminator pushes the generator to produce semantic images with sharp zeros/ones, yet there always remains a small value gap between the distributions of the true labels and the continuous predictions that the discriminator can exploit [53], which may hurt adversarial training, as further discussed in our comparison experiments.
PLGAN Approach
Notations: Let $I_{r}$ denote the input aerial image, $I_{s}$ the ground-truth semantic (segmentation) image, $\hat I_{s}$ the predicted semantic image, and $I_{p}$ the PL-highlighted image. Let $G$ denote the PL-aware generator, $E_{m}(I_{r})$ the feature embedding produced by $G$ right before its output layer, $S$ the semantic decoder, and $D$ the adversarial discriminator.
An illustration of the PLGAN framework. PLGAN consists of the PL-aware generator $G$, the semantic decoder $S$, and the adversarial discriminator $D$.
A. PLGAN Structure ($G_{PL}$)
Our objective is to develop a deep neural network that predicts the semantic image $\hat I_{s}$ of the PLs from the input aerial image $I_{r}$.
The PL-aware generator ($G$) is a multi-task encoder-decoder network that takes the aerial image $I_{r}$ as input and generates an image with highlighted PLs.
The semantic decoder ($S$) takes the feature embedding $E_{m}(I_{r})$, i.e., the last feature representation of the generator's decoder right before its output layer, and produces the predicted semantic image $\hat I_{s} = S\circ E_{m}(I_{r})$.
The adversarial discriminator ($D$) distinguishes the generated PL-highlighted images from real ones, pushing the generator to produce realistic PL-highlighted images that further differentiate PL pixels from complex backgrounds.
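To make the data flow among these three components concrete, below is a minimal PyTorch sketch of how they could be wired together. The layer counts, channel widths, and module names (PLGenerator, SemanticDecoder, PatchDiscriminator) are illustrative assumptions for exposition, not the exact PLGAN architecture.

```python
import torch
import torch.nn as nn

class PLGenerator(nn.Module):
    """PL-aware generator: an encoder-decoder that outputs a PL-highlighted image.

    The decoder also exposes its last feature map (the embedding right before
    the output layer), which is fed to the semantic decoder.
    """
    def __init__(self, in_ch=3, feat_ch=64):
        super().__init__()
        self.encoder = nn.Sequential(  # E_m: downsampling path (illustrative depth)
            nn.Conv2d(in_ch, feat_ch, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch * 2, 4, stride=2, padding=1), nn.ReLU(inplace=True))
        self.decoder = nn.Sequential(  # upsampling path back to input resolution
            nn.ConvTranspose2d(feat_ch * 2, feat_ch, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(feat_ch, feat_ch, 4, stride=2, padding=1), nn.ReLU(inplace=True))
        self.to_image = nn.Conv2d(feat_ch, 3, 3, padding=1)  # output layer -> PL-highlighted image

    def forward(self, x):
        emb = self.decoder(self.encoder(x))          # last feature representation (embedding)
        return torch.tanh(self.to_image(emb)), emb

class SemanticDecoder(nn.Module):
    """Semantic decoder S: maps the shared embedding to a 1-channel PL mask."""
    def __init__(self, feat_ch=64):
        super().__init__()
        self.head = nn.Sequential(nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
                                  nn.Conv2d(feat_ch, 1, 1))
    def forward(self, emb):
        return torch.sigmoid(self.head(emb))          # predicted semantic image in [0, 1]

class PatchDiscriminator(nn.Module):
    """Adversarial discriminator D on (generated vs. real) PL-highlighted images."""
    def __init__(self, in_ch=3, feat_ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(feat_ch, 1, 4, stride=2, padding=1))
    def forward(self, x):
        return self.net(x)                             # patch-wise real/fake scores

# Forward pass: aerial image I_r -> PL-highlighted image and semantic mask.
G, S, D = PLGenerator(), SemanticDecoder(), PatchDiscriminator()
I_r = torch.randn(2, 3, 256, 256)
I_p_hat, emb = G(I_r)
I_s_hat = S(emb)
```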
B. Objective Formulation
The loss functions for different modules in PLGAN are defined as follows.
1) Adversarial Loss:
The adversarial loss is applied to encourage $G$ to generate realistic PL-highlighted images that the discriminator $D$ cannot distinguish from the real PL-highlighted images $I_{p}$:
\begin{align*} &\mathcal {L}_{adv}(G,D;I_{r},I_{p}) \\ &\qquad =\frac {1}{2} \mathbb {E}_{I_{p}} \left [{ (D(I_{p}))^{2} }\right]+\frac {1}{2} \mathbb {E}_{I_{r}} \left [{(1-D\circ G(I_{r}))^{2}}\right]. \tag{1}\end{align*}
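The following is a minimal sketch that evaluates Eq. (1) literally; how the sign is handled between the generator and discriminator updates follows the usual alternating min-max scheme (with the generator output detached during the discriminator step) and is not spelled out here.

```python
import torch

def adversarial_loss(D, G, I_r, I_p):
    """Least-squares adversarial loss of Eq. (1).

    In the min-max game the generator minimizes this quantity while the
    discriminator maximizes it (implemented by alternating updates).
    """
    d_real = D(I_p)       # discriminator score on real PL-highlighted images
    d_fake = D(G(I_r))    # score on generated PL-highlighted images
    return 0.5 * (d_real ** 2).mean() + 0.5 * ((1.0 - d_fake) ** 2).mean()
```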
2) Semantic Loss:
The cross-entropy loss between the ground-truth semantic image $I_{s}$ and the prediction $\hat I_{s} = S\circ E_{m}(I_{r})$ is defined over the set of all pixels $\mathcal N$ as
\begin{align*} & \mathcal {L}_{spl}(E_{m},S;I_{r},I_{s}) \\ &= -\frac {1}{|\mathcal N|}\sum _{(i,j) \in \mathcal N} \left ({[I_{s}]_{ij} \log ([\hat I_{s}]_{ij}) + (1-[I_{s}]_{ij}) \log (1-[\hat I_{s}]_{ij})}\right). \tag{2}\end{align*}
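Eq. (2) is the standard pixel-averaged binary cross entropy, e.g.:

```python
import torch
import torch.nn.functional as F

def semantic_loss(I_s_hat, I_s):
    """Pixel-wise binary cross entropy of Eq. (2).

    I_s_hat: predicted semantic image in [0, 1]; I_s: binary ground-truth mask
    (float tensor). The default mean reduction gives the 1/|N| normalization.
    """
    return F.binary_cross_entropy(I_s_hat, I_s)
```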
3) Hough Transform Loss:
The motivation for using the Hough transform loss is to force PLGAN to find and correct the flawed pixels along PLs so as to ensure global consistency for each PL. Each pixel in the semantic image is mapped to a sinusoidal curve in the parameter space by the modified Hough transform, as shown in Figure 2:
\begin{equation*} \mathcal {HT}([I_{s}]_{ij})= p_{ij} (i \cos \theta + j \sin \theta), \tag{3}\end{equation*}
where $p_{ij}$ is the value of pixel $(i,j)$ in the semantic image and $\theta$ ranges over a set of sampled angles.
The Hough transform loss is then defined as the $\ell_{1}$ distance between the transforms of the ground-truth and predicted semantic images:
\begin{equation*} \mathcal {L}_{\mathcal {HT}}(E_{m},S;I_{r},I_{s})= \mathbb {E}_{I_{r}} \left [{\| {\mathcal {HT} (I_{s}) - \mathcal {HT} (\hat I_{s})}\|_{1}}\right]. \tag{4}\end{equation*}
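A minimal sketch of Eqs. (3)-(4) is given below: every pixel votes along its sinusoidal curve $\rho = i\cos\theta + j\sin\theta$, weighted by its value $p_{ij}$, and the loss is the (mean-normalized) $\ell_1$ distance between the vote maps of the ground truth and the prediction. The angle/$\rho$ resolutions and the hard binning are illustrative assumptions; the transform is differentiable with respect to the pixel values, which is what the loss gradient needs.

```python
import math
import torch

def hough_transform(img, num_angles=60, num_rho=200):
    """Discretized Hough transform of a soft mask, differentiable w.r.t. pixel values.

    img: (B, 1, H, W) tensor with values in [0, 1]. Returns a (B, num_angles, num_rho)
    accumulator where pixel (i, j) adds its value p_ij to the bin of
    rho = i*cos(theta) + j*sin(theta) for every sampled angle theta (cf. Eq. (3)).
    """
    B, _, H, W = img.shape
    device = img.device
    thetas = torch.linspace(0, math.pi, num_angles, device=device)
    ii, jj = torch.meshgrid(torch.arange(H, device=device, dtype=torch.float32),
                            torch.arange(W, device=device, dtype=torch.float32),
                            indexing="ij")
    rho_max = math.sqrt(H ** 2 + W ** 2)
    # rho for every (angle, pixel): shape (num_angles, H*W)
    rho = (ii.reshape(-1)[None, :] * torch.cos(thetas)[:, None]
           + jj.reshape(-1)[None, :] * torch.sin(thetas)[:, None])
    bins = ((rho + rho_max) / (2 * rho_max) * (num_rho - 1)).round().long()
    votes = img.reshape(B, 1, H * W).expand(B, num_angles, H * W)   # p_ij per pixel
    acc = torch.zeros(B, num_angles, num_rho, device=device)
    acc.scatter_add_(2, bins.unsqueeze(0).expand(B, -1, -1), votes)
    return acc

def hough_loss(I_s_hat, I_s):
    """L1 distance between Hough accumulators of prediction and ground truth, Eq. (4)."""
    return (hough_transform(I_s) - hough_transform(I_s_hat)).abs().mean()
```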
Illustration of applying our proposed Hough transform loss in the parameter space. The pixels located on the same line (red, green, and blue) in the segmentation-image space should intersect at the same point in the parameter space.
It is worth noting that our proposed HT loss function differs from the HT loss function proposed in [58], although both loss functions are applied in the HT parameter space. The HT loss function presented in [58] is restricted by two assumptions: a pre-defined number of lines in the scene, and a single lane line predicted in each output channel. The intersecting points are identified in the parameter space for each channel separately, and the loss function is calculated based on these intersection points. Additionally, that HT loss function optimizes only when the predicted probability of a segmented lane is larger than a specified threshold, which means that it may not be applied to all pixels in the segmentation space. In contrast, in our method, all segmented PLs are predicted in a single channel and are mapped into a single HT parameter space without constraints on the number of power lines per scene. Moreover, our proposed HT loss function is applied to the whole sinusoidal curves in the parameter space and optimizes regardless of the model's confidence level.
4) Geometry Loss:
According to [10], the PLs, on average, take 1.68% of the total pixels in an aerial image. In addition, the color of PLs in aerial images may be close to the background. Both facts indicate that the visual evidence of PLs is very weak, so there is a possibility of generating trivial outputs that ignore the thin PL structures. To discourage such trivial solutions, we define geometry losses that enforce the generator and the semantic branch to be equivariant to a geometric transformation $\phi$ (with inverse $\phi^{-1}$) of the input image:
\begin{align*} & \mathcal {L}_{pgeo}(G;I_{r}) \\ &= \mathbb {E}_{I_{r}} \left [{\| G(I_{r}) - \phi ^{-1}\circ G\circ \phi (I_{r})\|_{1}}\right] \\ &\quad + \mathbb {E}_{I_{r}} \left [{\| G\circ \phi (I_{r}) - \phi \circ G(I_{r})\|_{1}}\right], \\ &\mathcal {L}_{sgeo}(E_{m},S;I_{r}) \\ &= \mathbb {E}_{I_{r}} \left [{\| S\circ E_{m}(I_{r}) -\phi ^{-1}\circ S \circ E_{m} \circ \phi (I_{r})\|_{1}}\right]\\ &\quad + \mathbb {E}_{I_{r}} \left [{\| S\circ E_{m} \circ \phi (I_{r}) - \phi \circ S\circ E_{m}(I_{r})\|_{1}}\right].\end{align*}
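Both geometry terms share the same equivariance pattern, as in the sketch below. Here $\phi$ is taken to be image transposition (swapping the spatial axes, its own inverse) purely for illustration; any invertible geometric transform fits the same code.

```python
import torch

def phi(x):
    """Illustrative geometric transform: transpose the two spatial axes (self-inverse)."""
    return x.transpose(-2, -1)

def geometry_loss(net, I_r):
    """Equivariance penalty used for both L_pgeo (net = G) and L_sgeo (net = S o E_m).

    First term:  || net(I_r) - phi^{-1}(net(phi(I_r))) ||_1
    Second term: || net(phi(I_r)) - phi(net(I_r)) ||_1
    """
    out = net(I_r)
    out_t = net(phi(I_r))
    term1 = (out - phi(out_t)).abs().mean()   # phi^{-1} == phi for a transpose
    term2 = (out_t - phi(out)).abs().mean()
    return term1 + term2
```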
The overall loss function can be defined as follows:
\begin{align*} & \mathcal {L}_{G_{PL}}(G,D,D^{t},S;I_{r},I_{s},I_{p}) \\ &= \mathcal {L}_{adv}(G,D;I_{r},I_{p}) + \mathcal {L}_{adv}(G,D^{t};I_{r}^{t},I_{p}^{t}) \\ &\quad + \lambda _{spl} \left ({\mathcal {L}_{spl}(E_{m},S;I_{r},I_{s}) + \mathcal {L}_{spl}(E_{m},S;I_{r}^{t},I_{s}^{t})}\right) \\ &\quad + \lambda _{\mathcal {HT}} \left ({\mathcal {L}_{\mathcal {HT}}(E_{m},S;I_{r},I_{s}) + \mathcal {L}_{\mathcal {HT}}(E_{m},S;I_{r}^{t},I_{s}^{t}) }\right) \\ &\quad + \lambda _{geo}\left ({\mathcal {L}_{pgeo}(G;I_{r})+\mathcal {L}_{sgeo}(E_{m},S;I_{r})}\right), \tag{5}\end{align*}
where the superscript $t$ denotes the counterparts of the corresponding quantities for the geometrically transformed input images, $D^{t}$ is the discriminator for those images, and $\lambda _{spl}$, $\lambda _{\mathcal {HT}}$, and $\lambda _{geo}$ are weights balancing the loss terms.
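The individual terms can be combined as in Eq. (5). The sketch below reuses the loss functions from the previous snippets and covers only the untransformed branch; the superscript-$t$ terms with $D^{t}$ follow the same pattern, and the $\lambda$ values shown are placeholders rather than tuned weights.

```python
def plgan_generator_objective(G, D, S_Em, I_r, I_s, I_p,
                              lambda_spl=1.0, lambda_ht=1.0, lambda_geo=1.0):
    """One branch of the overall objective in Eq. (5).

    G    : callable returning the PL-highlighted image for an input aerial image
    S_Em : callable returning the predicted semantic mask (the composition S o E_m)
    The transformed-image branch (D^t, I_r^t, ...) is analogous and omitted here.
    """
    I_s_hat = S_Em(I_r)
    loss = adversarial_loss(D, G, I_r, I_p)
    loss = loss + lambda_spl * semantic_loss(I_s_hat, I_s)
    loss = loss + lambda_ht * hough_loss(I_s_hat, I_s)
    loss = loss + lambda_geo * (geometry_loss(G, I_r) + geometry_loss(S_Em, I_r))
    return loss
```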
Experiments
The experimental results are presented in this section, with comparisons to the state-of-the-art methods.
A. Datasets
TTPLA [10] is a public dataset that contains aerial images of PLs captured from different zoom levels and view angles, collected at different times and locations with different backgrounds. The TTPLA dataset contains 8,083 instances of PLs, which take only 154M pixels, 1.68% of the total number of pixels in this dataset [10]. This dataset contains about 1,100 images. We used 905 training images, augmented by vertical/horizontal flipping, and 217 images for the test set. Each instance of PL is carefully annotated by a polygon using LabelMe [63]. TTPLA also provides polygonal annotations of all the transmission towers present in each image, and an instance of PL is usually considered to be ended when it enters the annotated polygon of the transmission tower, as shown in the second column of Figure 3. Since there are few public PL datasets available [10], [64], we also consider the Massachusetts Roads dataset [11] to further evaluate the performance of PLGAN. Although the Massachusetts Roads dataset has different context and features compared to TTPLA, it shares key similarities with PL datasets: roads are thin, elongated structures that occupy only a small fraction of the image pixels.
Sample PL segmentation results produced by the proposed PLGAN and comparison methods on the TTPLA dataset. The blue and red colors indicate the missing and false predictions, respectively. The appearance of both colors on the same line means that the line has a slight curvature that is not detected correctly. A two-pixel relaxation is used for all models to make the visualization clearer.
B. Implementation Details
The proposed PLGAN is implemented in PyTorch. The weights of all sub-nets are initialized from a normal distribution using the Xavier method with zero mean and a gain of 0.02. They are jointly optimized using Adam with the first and second momentum terms set to 0.5 and 0.999, respectively. The entire model is trained for 200 epochs with the image size of
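A sketch of the initialization and optimizer setup described above (Xavier normal initialization with gain 0.02 and Adam with momenta 0.5/0.999) is given below. The learning rate is a placeholder since it is not stated here, and in practice the discriminator update is typically alternated with the generator update.

```python
import itertools
import torch
import torch.nn as nn

def init_weights(m):
    """Normal (Xavier) initialization with zero mean and gain 0.02, as described above."""
    if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d, nn.Linear)):
        nn.init.xavier_normal_(m.weight, gain=0.02)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

def configure_training(G, S, D, lr=2e-4):
    """Initialize all sub-nets and build a joint Adam optimizer (betas 0.5 / 0.999).

    The learning rate is an assumed placeholder; it is not specified in the text above.
    """
    for net in (G, S, D):
        net.apply(init_weights)
    params = itertools.chain(G.parameters(), S.parameters(), D.parameters())
    return torch.optim.Adam(params, lr=lr, betas=(0.5, 0.999))
```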
C. Evaluation Metrics
We adopt a total of eight metrics to evaluate the detection performance of our model. Precision, recall, and intersection-over-union (IoU) are the widely used metrics in semantic segmentation [65]. Also, we consider
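For reference, the following is a minimal sketch of the three pixel-wise metrics named above (precision, recall, and IoU) computed from binary masks; the remaining metrics used in this subsection are not reproduced here.

```python
import numpy as np

def pixel_metrics(pred, gt):
    """Precision, recall, and IoU for binary masks (numpy arrays of 0/1 values)."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    precision = tp / (tp + fp + 1e-8)
    recall = tp / (tp + fn + 1e-8)
    iou = tp / (tp + fp + fn + 1e-8)
    return precision, recall, iou
```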
D. Comparison With Existing Methods on TTPLA Dataset
We compare the performance of PLGAN on TTPLA with a number of existing methods that can be grouped into three categories: deep semantic segmentation models, GAN-based models, and line segment detectors.
Table I shows the quantitative results of the proposed PLGAN and all the above comparison methods on the test set of the TTPLA dataset. Figure 3 shows the segmentation results of sample images from both the proposed PLGAN and the comparison methods.
1) Comparison With Deep Semantic Segmentation Models:
It is shown in Table I that PLGAN outperforms most of the baselines. Compared with PLGAN, we found that those baseline models produce more false positives in PL segmentation. For instance, UNET and FPN (columns 3 and 4 in Figure 3) misclassify many non-PL structures, such as sidewalks and lanes, as parts of PLs. This observation can be interpreted from two aspects. First, most of these models are built upon encoder-decoder structures, whose decoders fail to appropriately incorporate the complex background information when making pixel-wise predictions from the low-resolution feature maps generated by the encoder [42]. Second, the networks are trained with the softmax cross-entropy loss and ignore the interconnections between pixels, as discussed earlier [26], [33]. Therefore, it is hard to preserve global consistency [53]. Even though Focal-UNET [29] uses the focal loss function instead of the BCE loss function to address the class imbalance in PL segmentation, it still suffers from the same limitation by not capturing the relation between pixels. We also notice that, although UNET++ and MaNet with ResNet-34 outperform PLGAN in recall and correctness, respectively, they do so at the cost of many more parameters than PLGAN.
2) Comparison With GANs:
As shown in column 5 of Figure 3, using pix2pix GAN to directly generate the semantic segmentation images reduces the performance by missing many PL pixels and generating false positives, resulting in many gaps along the segmented PLs. This is also reflected in the quantitative results shown in Table I. As discussed in the Related Work Section, this is the inherent limitation when generating/discriminating the semantic images directly: the discriminator pushes the generator to produce semantic images with sharp zeroes/ones and leaves a permanent possibility for the discriminator to examine the small, but always existing, value gap between the distributions of true labels and the predictions [53], which may hurt the performance of adversarial training. As shown in Table I, instead of directly using GAN to generate semantic images, the proposed PLGAN embeds features from GAN to a semantic segmentation network and can achieve much higher quality in PL segmentation.
3) Comparison With Line Segment Detector:
As shown in Figure 3 (column 6), most of the line detectors can capture many PLs with very clean segmentation. This is reasonable since PLs are very thin line structures, and line detectors take full advantage of this geometric prior to ensure global consistency in PL segmentation. However, in using deep neural networks to boost the capability of line segment detection, most line detectors conduct spatial-region partitioning for network computation and feature representation. This inherently reduces the spatial resolution of features and may cause dislocation between the segmented PLs and their corresponding ground truths. As a result, a group of lines can be missing in Figure 3. In addition, the line segment detectors cannot handle curved power lines, as shown in the image in column 6 and row 4. Therefore, while most line segment detectors produce quite clean PL segmentation in some cases, their quality is still much lower than that of our PLGAN, as shown in Table I.
4) Comparison Considering Parameter Scale:
It is worth highlighting that our model employs only half of the parameters used by the second-best models on the TTPLA dataset. In addition, when comparing our model (14.9M parameters) with models of similar scale (10.6M-18.5M parameters), our model outperforms all of them under every evaluation metric.
E. Comparison on Massachusetts Roads Dataset
Due to the lack of public PL datasets, we evaluate PLGAN on Massachusetts roads dataset for road extraction, which has the same nature as thin objects. We first follow the experiment setting in [65] and evaluate PLGAN using precision, recall, IoU, and
It can be found from both tables that PLGAN outperforms the state-of-the-art methods under most evaluation metrics. Our PLGAN achieves the highest precision, IoU, and
Road extraction by our proposed PLGAN on the Massachusetts Roads dataset. The blue and red colors indicate the missing and false predictions, respectively. A two-pixel relaxation is used for all models to make the visualization clearer.
F. Ablation Study
We conducted an ablation study on the TTPLA dataset to evaluate the performance of different variants of our proposed PLGAN and to demonstrate the usefulness of its various components. For the first two variants, the PL-aware generator directly generated semantic segmentation images instead of PL-highlighted images since the semantic decoder was not included. The results of the first variant, including a PL-aware generator (
Ablation study for different variants of PLGAN. The blue and red colors indicate the missing and false predictions, respectively.
Conclusion
This paper introduces a novel GAN framework, PLGAN, specifically designed for power line segmentation in aerial images. PLGAN leverages adversarial training and effectively captures context, geometry, and appearance information for accurate prediction. In PLGAN, the generated PL-highlighted images are utilized by the discriminator, which compels PLGAN to emphasize power line regions within the images. By learning a joint representation in a shared latent space derived from the PL-highlighted image and the semantic image, PLGAN can generate more precise semantic images than state-of-the-art methods, as demonstrated through comprehensive experiments. For future work, it is worth noting that only a few small public datasets on thin objects are available, which may not be sufficient to fully train PLGAN. To address this issue, weakly supervised, semi-supervised, or unsupervised learning techniques are promising. For instance, the work in [77] considers a region-to-region graph to capture spatial dependencies and local context. The method in [78] integrates recurrent layers to effectively capture temporal information and incorporates attention mechanisms that allow the model to focus on relevant regions. We will investigate these methods in our future work to enhance PLGAN's performance with limited data. Another direction for future research involves extending the applicability of PLGAN to diverse applications, such as video object detection [79] and salient object detection [80], while also exploring its potential to reduce the dependency on manually annotated pixel-level saliency masks through the use of limited pixel-level labeled data [48].