Introduction
In recent decades, automated interpretation of 3-D scenes with LiDAR has been a popular research topic in the fields of photogrammetry [1], remote sensing [2], computer vision [3], architecture [4], civil engineering [5], cadastral investigation [6], and robotics [7]. As a flexible and portable measurement technique, LiDAR can obtain point clouds via different acquisition platforms and systems, such as airborne laser scanning (ALS) using aerial platforms like airplanes or UAVs, terrestrial laser scanning (TLS) using fixed tripod platforms, and mobile laser scanning (MLS) using movable platforms like vehicles or boats. ALS measures point clouds at a far observation distance and a relatively low density, and is usually used for mapping and monitoring large areas, since the flying aerial platform can easily cover a large investigation area. For accurate analysis and interpretation of 3-D urban scenes, however, TLS and MLS are more competent, providing a higher scanning density at close observation distances and more precise points from static carrier platforms or stations.
However, applicable platforms and systems alone are far from enough to accomplish the analysis of 3-D scenes. To fully understand and extract 3-D information from the scene, the acquired 3-D points should be marked with unique labels of given categories, revealing the semantic meanings of the objects represented by the points. Therefore, assigning semantic tags to points belonging to objects of different categories is crucial in the overall 3-D scene analysis workflow [8]. Supervised classification is one solution to this labeling task. Although much related research has been presented, this task remains a challenge. The majority of supervised classification methods require a well-trained classifier and, in turn, a large amount of training data [9]. Nevertheless, the conventional way of generating training data for point clouds still depends on manual work, which is even more challenging than manually labeling 2-D images due to the complex 3-D structures. Fortunately, 3-D point clouds encode more precise topological and geometric information of the real 3-D scene than 2-D images do. If we can provide additional constraints or assumptions based on the topological relations between structures and the geometric characteristics of points, we can achieve a supervised classification without a large amount of training data. In particular, point clouds of urban scenes encapsulate some form of regularity with specific structures, which can be exploited to improve the accuracy of semantic labels [10].
To this end, we present a supervised semantic labeling method designed for classifying 3-D points, with supervoxel-based detrended features calculated in the local neighborhood and graph-based optimization. The effort spent on reducing the required training data considers two aspects. On the one hand, we conduct a preclustering of points with geometric homogeneity, which provides a constraint at the local level: points of the same cluster are forced to share the same label. In our work, this preclustering is achieved by supervoxels with refined boundaries. For the estimated geometric features, we propose a novel detrending algorithm removing redundant and insignificant information in the local environment of the points. On the other hand, a regularization of labeled points is applied to improve the accuracy of the initial labels estimated by the classifier. In our work, this is achieved by a global graph optimization of labels. Here, we assume that the perception-weighted graphical model naturally relates to the representation of the 3-D scene, reflecting both its geometric and topological information. Therefore, by constructing and optimizing the graphical model, wrong labels can be corrected.
The innovative contributions of our proposed approach are as follows. 1) A novel detrended geometric feature is proposed, removing the local tendency of the geometric characteristics of the local neighborhood. This feature extraction method is effective for delineating the local geometry in the 3-D scene. Different from our previous work [11], apart from the geometric and height features, we also take surface and contextual features into consideration, which are concatenated in various ways. Moreover, the intensity values of the points are not used; instead, our solution is purely geometric. 2) Instead of using points as fundamental elements, a supervoxel-based context is designed in the local vicinity to encode geometric attributes of points, achieving a preclustering of homogeneous points. Conventional segment-based classification methods heavily rely on the quality of the obtained segments; we mitigate this issue by using refined supervoxels. To this end, a boundary refined supervoxelization algorithm is developed, improving the classic voxel cloud connectivity segmentation (VCCS) algorithm by refining the boundaries between supervoxels. 3) A perception-weighted graphical model is constructed and optimized to improve the initial classification result, reassigning wrongly labeled fragments with correct semantic labels. The remainder of this article is organized as follows. A brief literature review of point cloud classification is given in Section II. The methodology of the proposed point cloud classification method is provided in Section III. Subsequently, the tested point cloud datasets, experimental results, and related discussions are given in Sections IV and V. Finally, Section VI concludes the article.
Related Work
Semantic labeling of 3-D point clouds, aiming at assigning a unique semantic label to each individual point, is a fundamental task for urban mapping and remote sensing. Currently, a well-designed point cloud classification method typically involves three core steps: Extraction of features, classification using extracted features, and smoothing of labels. Based on the various derived features, initial labels can be assigned to points by applying classifiers. Then, benefiting from label-smoothing techniques, the initial label of every point can be further refined according to external information or constraints, with wrong labels corrected.
A. Extraction of Features
The extraction of features aims to abstract the local geometric information of a given point and encapsulate it into feature vectors [12]. Generally, there are two key factors influencing the extraction of features: 1) the selection of neighborhoods for the investigated element (e.g., point or segment), and 2) the parameterization of geometric characteristics with appropriate descriptors for generating discriminative feature vectors.
It is essential to select an appropriate neighborhood describing the details in the vicinity of a given point [8]. For various purposes, it is necessary to rely on different objective details of all points within the selected neighborhood. Commonly used neighborhood definitions are divided into two categories, namely, the single-scale neighborhood and the multiscale neighborhood. The former extracts features using a constant neighborhood, while the latter utilizes multiple neighborhoods with flexible sizes and forms. A common way to generate the local neighborhood is to define it by a specific number of neighbors (e.g., k-nearest neighbors (KNN) or neighbors within a given distance) [13] or by a 2-D projective distance [14]. Besides, spherical [15] and cylindrical [16] neighborhoods are also representatives of the single-scale neighborhood. For multiscale neighborhoods, different features are separately extracted from various neighborhoods with different forms and sizes and then combined into the output feature vectors. In [8], a neighborhood selection approach based on multiple individually optimized neighborhoods is proposed. Similarly, in [17], multiscale neighborhoods for selecting features are introduced to enhance the performance of 3-D point cloud classification. Although different classifiers will certainly influence the performance of the entire classification workflow, the features also matter a lot, especially when the classifier is fixed and has insufficient training samples. In [18], for detecting vehicles from the scene, the authors implement a multilayer model for feature generation consisting of partitioned octree structures of several levels. Moreover, in [19], latent Dirichlet allocation is adopted to generate features derived from point-based hierarchical clusters, in order to classify objects of varying sizes.
For parameterizing geometric characteristics with appropriate descriptors, using 3-D shape descriptors to encode local or global geometric information around the point of interest is a commonly adopted solution. Based on the spatial distribution of 3-D points around the point of interest, both local and global features of this point can be quantized into a vector with a fixed number of bins by means of the 3-D shape descriptor. Here, the 3-D shape descriptor is essentially an algorithm transforming 3-D coordinates to features via mathematical formulations, describing the geometry of the local area around the point of interest. In the recent decade, a number of representative feature descriptors have been developed [20], for instance, the 3-D shape context [21], the Fast Point Feature Histogram (FPFH) [7], and the Signature of Histograms of Orientations (SHOT) descriptor as well as its variants [22]. However, all these descriptors rely mainly on a detailed description of geometric properties and seldom focus on generalized structural features. Thus, they are not immune to noise and lack a strong geometric and topological relationship. Therefore, for characterizing the general geometry of an object, novel methods have been proposed. Instead of considering detailed geometric and texture distributions, they estimate eigenvalues calculated from the 3-D coordinates of points, which can reflect and express essential geometric information and topological relationships between the point and its neighbors. In this respect, the geometry description based on eigenvalues [23], [24] is an example; the eigenvalues of the tensor of coordinates characterize the 3-D features of the shape. In addition to the original spatial coordinates, attributes such as the number of returns and the intensity can further enrich the description of the features. For instance, the RGB color [25] and thermal information [26] have been used for featuring 3-D point clouds.
B. Classification Using Extracted Features
Point-based and segment-based methods are typical strategies for classification using extracted features. Point-based classification assigns a label to each point when conducting the classification [8], [27], while segment-based classification applies a segmentation-like preclustering to the point cloud to obtain primitives with homogeneity [10], [28], [29] before the further classification. The segment-based strategy has the advantage that it can simultaneously separate individual objects from the scene. No matter which strategy is used, classifiers also contribute greatly to the classification performance. Here, AdaBoost [30], support vector machines [31], [32], random forests (RF) [24], conditional random fields [33] and their high-order variations [34], and deep neural networks [35], [36] are representative ones.
The voxel-based data structure is a popular way of organizing point clouds. Unlike the conventional point-based data structure, the voxel structure can easily cope with nonuniform point density and the problems caused by varying observation distances of large-scale point clouds. The octree-based structure [37] is an example, simplifying the entire dataset and suppressing noise and outliers with rasterized 3-D grids. The octree can identify the neighboring relations of the created voxels and index the inside points simultaneously; the tree structure facilitates the search for neighbors as well. Voxels can also be aggregated into larger supervoxels by methods such as the VCCS [38] algorithm, which exploits the potential of voxel structures. A supervoxel is generated by grouping neighboring voxels via a graphical model or k-means clustering. One of the major advantages of using supervoxel structures is that they can precisely discover the boundaries between objects or between different parts of an object. When supervoxels are utilized in a classification task, the generation of supervoxels is effectively a preclustering of voxels with common properties, like normal vectors, colors, or other geometric attributes [39]. After such a clustering, the boundaries between voxels of different clusters likely correspond to the edges of different objects. Since this clustering is an unsupervised process, the size and the shape of a cluster are adjusted only by the differences in the local geometry of voxels belonging to different objects. Namely, voxels in the same cluster always have similar local geometric characteristics, which better reflect the geometry of the local area. Using such clusters as the neighborhood for the extraction of features is therefore a well-suited solution. Based on this oversegmentation idea, supervoxels can also be further merged into larger local patches utilizing a predefined radius of vicinity [19], [29], to delineate geometric features in a local vicinity more completely.
C. Smoothing of Labels
Spatial regularity is elementary prior knowledge about measured data of the real world [40], [41]. To derive a regularized labeling, smoothing of labels can be designed to refine the initial tags obtained from the classification results, under the assumption that adjacent pixels or points are likely to share the same object label (i.e., class). Generally, the smoothing approaches follow two different strategies, i.e., local label smoothing and global label smoothing [34].
Local smoothing of labels focuses on the weight assignment of adjacent entities, which is mainly implemented via local filtering, local graph optimization, or preclustering of points. In [11], the initial labels of the classification results are corrected by a voting-based filtering process via a local first-order graph. In [42], the segmentation of a local graphical model is adapted to optimize initial labels. Moreover, in [10], instead of directly labeling individual points, a nonparametric segmentation is first conducted to aggregate points into segments sharing common labels, which can be regarded as a geometric constraint between labels and points. However, the performance of local label smoothing largely depends on the quality of the initial classification and the definition of the neighborhood [34]. To be specific, the major assumption of local smoothing is that wrong labels are mainly surrounded by correct ones [11], but such a constraint is difficult to satisfy for irregularly shaped point clouds, so that large incorrectly labeled regions remain after the smoothing. Besides, designing appropriate neighborhoods is still a challenging task.
As an alternative to the local label smoothing strategy, global label smoothing methods, which consider the labels of all points in the entire scene simultaneously, have also been explored in a wide variety of studies. Global label smoothing is mainly achieved via optimization or regularization under graphical models or Markov networks [41]. In [43], the initial classification is globally optimized by adopting a multiclass graph cut algorithm, followed by a refinement with a local optimization using an object-oriented decision tree. In [36], a regularization framework using a global graph structure is proposed and employed for smoothing the labels of point cloud segments, with impressive results achieved. Similarly, in [34], the initial labels with relaxed probabilities are optimized via graph-structured regularization. As with the local label smoothing strategy, the quality of the initial labels matters, since they usually serve as the prior knowledge for smoothing. Besides, the construction of the graphical models or Markov networks also plays a vital role in the optimization or regularization, which should consider both the spatial relationships (e.g., topology) and the defined weights (e.g., similarity or proximity) between points. The graph can be built not only on a KNN structure [36], but also on one considering the manifold structure, like the Riemannian graph [44]. The optimization or regularization is achieved by solving the cost function formulated from the graphical models or Markov networks. However, the selection of an appropriate solver for the cost function still needs to be addressed when dealing with complex urban areas [41].
Methodology
In Fig. 1, a brief overview of the workflow is displayed, with key steps and representative results illustrated. The entire workflow includes four essential procedures: Supervoxelization and local neighborhood selection, extraction of detrended geometric features, supervised classification, and graph-based optimization. First, an oversegmentation is implemented by the use of the VCCS method [38], but with refined boundaries. A local neighborhood is defined for every supervoxel, considering its directly connected neighbors. Second, the geometric features of every supervoxel and of the connected neighbors in its local neighborhood are calculated. Afterward, a local tendency of the geometry is estimated for each supervoxel in the feature space, and the geometric features of the center supervoxel are detrended by this estimated local tendency. In the supervised classification step, an RF classifier is used to distinguish points of objects utilizing the detrended features, with initial labels obtained. Finally, a graph-based optimization is conducted to refine the initial labels with a perception-weighted graphical model.
A. Boundary Refined Supervoxelization
To generate the voxel structure for point clouds, the entire space is partitioned into small cubic grids employing an octree, splitting each parent node into eight equal child nodes. Here, the octree structure is built using the approximate nearest neighbor [45] searching algorithm. The supervoxelization is conducted via the VCCS algorithm. However, the supervoxels of VCCS always suffer from the “zig-zag” effect, namely, the edges are not smooth; instead, they are twisted, conforming to the squared edges of the basic voxels, because the fundamental element of VCCS is the cubic-shaped voxel [11]. To be specific, the “zig-zag” effect that we want to remove is the one between/across two adjacent objects. For supervoxels generated within one single object (e.g., a wall or the ground), there would be no effect on the feature extraction. However, the edge between two objects should at the same time also be an edge between two supervoxels, which means that this effect originates from the “zig-zag” edges of each supervoxel generated by VCCS. Since the edges of a supervoxel are formed by cuboid-shaped voxels, the edges between/across two adjacent objects cannot reach point-level accuracy. In such situations, a supervoxel located at the edge of one object may be contaminated by points of another adjacent object, which may affect the feature extraction. To overcome this problem, we propose a boundary refined supervoxel clustering algorithm for creating supervoxels with point-level accuracy of their boundaries.
Our proposed boundary refined supervoxelization is based on the original VCCS supervoxels and consists of two major steps, namely, the detection of boundary points and the refinement of boundary points. In the first step, all points of one supervoxel are measured by their distance to the center of the supervoxel, considering the local curvature [44] and exploring the spatial proximity of adjacent supervoxels in geodetic space.
In the refinement step, each detected boundary point $i$, with position $X_i$ and normal vector $N_i$, is reassigned by a local k-means clustering to the adjacent supervoxel $b$ (with centroid $X_b$ and centroid normal $N_b$) minimizing the combined normal and spatial distance
\begin{equation*}
D_{\text{proj}}=\sqrt{w_n \cdot ||N_i-N_b||_2 +w_d \cdot ||X_i-X_b||_2} \tag{1}
\end{equation*}
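As an illustration of this refinement, a minimal Python sketch of the reassignment step is given below. It assumes numpy arrays for the per-point positions and normals, per-supervoxel centroids and centroid normals, and a precomputed candidate map neighbor_ids; all names and the candidate construction are hypothetical, not the exact implementation.

import numpy as np

def refine_boundary_points(points, normals, labels, centroids,
                           centroid_normals, neighbor_ids, w_n=0.5, w_d=0.5):
    # Reassign each detected boundary point to the candidate supervoxel
    # minimizing the combined normal/spatial distance of Eq. (1).
    # neighbor_ids[s] is assumed to contain supervoxel s itself and its
    # first-order neighbors.
    refined = labels.copy()
    for i, (x, n) in enumerate(zip(points, normals)):
        best_d, best_s = np.inf, labels[i]
        for s in neighbor_ids[labels[i]]:
            d = np.sqrt(w_n * np.linalg.norm(n - centroid_normals[s])
                        + w_d * np.linalg.norm(x - centroids[s]))
            if d < best_d:
                best_d, best_s = d, s
        refined[i] = best_s
    return refined

Repeating this reassignment and updating the centroids until convergence corresponds to the local k-means clustering of boundary points illustrated in the figure below.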
Boundary refined VCCS supervoxelization. (a) Local k-means clustering of boundary points in neighboring supervoxels. (b) Refined boundaries between supervoxels.
B. Detrended Feature Extraction
Considering a large number of 3-D points containing only spatial coordinates, we need to extract geometric features from the 3-D coordinates to describe the geometry of the object. Since the supervoxel and its local neighborhood are already known, it is necessary to use them properly to represent the local geometry. Therefore, geometric features based on eigenvalues [8], [24], as well as additional structural features, are introduced to delineate both the geometric and structural characteristics in the local area of the point of interest. The eigenvalue-based geometric features represent the local geometry of the object (e.g., the size and shape of the local area), while the additional structural features describe the basic structure of the local neighborhood (e.g., height and topological information).
Eigenvalue-based features, including linearity, planarity, and sphericity, are derived from the eigenvalues of the local 3-D structure tensor formed by the coordinates of the points.
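For reference, a compact sketch of such eigenvalue-based features is shown below, using the standard formulations from the literature; the exact feature subset of our implementation may differ.

import numpy as np

def eigen_features(neighborhood_xyz):
    # Covariance (3-D structure tensor) of the points in the neighborhood.
    centered = neighborhood_xyz - neighborhood_xyz.mean(axis=0)
    cov = centered.T @ centered / len(neighborhood_xyz)
    # Eigenvalues sorted as l1 >= l2 >= l3 and normalized to sum to one.
    e = np.sort(np.linalg.eigvalsh(cov))[::-1]
    e = np.clip(e, 1e-12, None)
    e = e / e.sum()
    l1, l2, l3 = e
    return {
        "linearity": (l1 - l2) / l1,
        "planarity": (l2 - l3) / l1,
        "sphericity": l3 / l1,
        "omnivariance": (l1 * l2 * l3) ** (1.0 / 3.0),
        "anisotropy": (l1 - l3) / l1,
        "eigenentropy": -np.sum(e * np.log(e)),
        "change_of_curvature": l3 / (l1 + l2 + l3),
    }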
1) Local Neighborhood of the Supervoxel
Although supervoxel structures precluster voxels at a lower level, supervoxels tend to partition objects into small pieces, which leads to dissimilarities among patches belonging to the same surface; as a consequence, the decision trees cannot be trained perfectly. To address this issue, we follow the idea given in [19], utilizing the information of all supervoxels in the first-order graph of a given supervoxel. To be specific, for each supervoxel, we define a local neighborhood to gain contextual information, as sketched after this paragraph. In Fig. 4, we show an illustration of the defined local neighborhood for a given supervoxel.
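A minimal sketch of deriving such a first-order neighborhood from the supervoxel centroids follows; linking centroids within roughly twice the seed resolution is a hypothetical proxy for the true voxel-level adjacency of supervoxels.

import numpy as np
from scipy.spatial import cKDTree

def first_order_neighbors(centroids, seed_resolution):
    # Link supervoxels whose centroids are closer than ~2x the seed
    # resolution, approximating the first-order adjacency graph.
    tree = cKDTree(centroids)
    neighbors = {i: set() for i in range(len(centroids))}
    for i, j in tree.query_pairs(r=2.0 * seed_resolution):
        neighbors[i].add(j)
        neighbors[j].add(i)
    return neighbors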
2) Detrending Local Tendency of Supervoxel-Based Context
Regarding the interpretation of complex 3-D scenes, there are typically various objects, and it is necessary to identify the exact boundaries between them. Furthermore, according to the analysis performed in [10], the contribution of each vector from the local descriptor to the generated vector of features (i.e., the feature histogram) varies even for objects of the same type. This results in ambiguity when generating features for different types of objects. For instance, the geometric features of natural ground and artificial ground can be quite similar if we consider only linearity, flatness, and the orientation of normal vectors. In such a case, we need additional features, like surface smoothness and roughness, to distinguish them. For the implemented vector of features, we therefore enhance the useful and salient feature vectors and weaken the trivial ones.
Inspired by the edge detection operator (i.e., difference of Gaussians) from the domain of image processing, we use the idea given in [11] to estimate the local trend of the 3-D geometry of each supervoxel in its local environment and then remove this local tendency to obtain salient information about objects representing unique structures and detail elements. Additionally, the local trend of the supervoxel background also plays a vital role when finding supervoxels near the real boundary of objects having different semantic labels. In Fig. 5, we show a 1-D outline illustrating the estimation of the local trend of the geometric surface of the object. It is obvious that after removing the local trends, two curves with originally similar layouts become distinguishable.
Illustration of local tendency of geometric shapes. (a) For objects with smooth surface. (b) For objects with rough surface.
The geometric features based on eigenvalues basically reflect the general geometry, relating to the “low-frequency” components of the geometry. In contrast, the detrending acts as a kind of “high-pass” filter, removing background geometry information from nearby parts and leaving the “high-frequency” components. If we describe the complete geometry of the object by integrating these two components, better uniqueness can be reached.
3) Detrended Geometric Features in the Local Neighborhood
For creating geometric features with the local tendency detrended, we calculate dimensionality features, statistical features, height features, orientation features, and surface features from the points of the supervoxels in the local neighborhood. The dimensionality, statistical, and surface features mainly reflect the 3-D shape of the object, namely, its relatively detailed components. In contrast, the structural, height, and orientation features provide contextual information of the object relating to its fundamental components. In the meantime, the contextual features encapsulate the interaction within the context. Thus, by combining these components, better distinctiveness can be achieved for representing the geometry of objects.
The local tendency is expressed in the feature space and removed for each supervoxel. Denoting the vector of features of a given supervoxel by $H_{v}$ and the local tendency estimated from its local neighborhood by $H_{l}$, the detrended vector of features $H_{d}$ is obtained as
\begin{equation*}
H_{d}=H_{v}-H_{l}. \tag{2}
\end{equation*}
The final vector of features $H$ is then constructed by concatenating the original features $H_{v}$, the detrended features $H_{d}$ weighted by a factor $k$, and the contextual features $H_{r}$
\begin{equation*}
H=[ H_{v}^T, k \cdot H_{d}^T, H_{r}^T] \tag{3}
\end{equation*}
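The following sketch summarizes (2) and (3); for illustration, the local tendency $H_l$ is assumed here to be the mean feature vector over the local neighborhood, which is one plausible reading of the detrending step, and all variable names are placeholders.

import numpy as np

def detrended_feature(h_v, neighborhood_features, h_r, k=1.0):
    # h_v: features of the supervoxel itself; neighborhood_features:
    # stacked feature vectors of the supervoxels in its local
    # neighborhood; h_r: contextual features.
    h_l = neighborhood_features.mean(axis=0)    # assumed local tendency
    h_d = h_v - h_l                             # detrended features, Eq. (2)
    return np.concatenate([h_v, k * h_d, h_r])  # combined vector, Eq. (3)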
It is noteworthy that, compared with the method given in [11], our approach does not use radiometric features (e.g., RGB color or intensity); only the 3-D coordinates are utilized. To some extent, our proposed detrended geometric feature follows a similar strategy to the difference of normals feature presented in [49], which generates the difference of angles between normal vectors estimated from neighborhoods of various sizes. The difference is that we use more than normal vectors; we also compute the differences of local geometries, height values, verticalities, and densities. Besides, we also consider the contextual features representing the interaction between the supervoxel and its context.
C. Initial Labeling With RF
The supervised classification uses the classical RF algorithm [50] to distinguish supervoxels with different predefined labels. An RF classifier combines a number of decision trees created from random vectors sampled independently of the input vector of features. Each decision tree votes to elect the most probable label for the input feature vector [50]. Besides, the RF grows each tree with a randomized feature selection at each node, and, following the strong law of large numbers, its generalization error converges as more trees are added, making it insensitive to overfitting. During training, the bagging method is applied to each combination of features, generating the training dataset of each tree by randomly drawing samples with replacement. After training, the soft label of supervoxel $i$ for class $k$ is given as the fraction of trees voting for this class, with $N_k$ the number of trees voting for class $k$ and $N_t$ the total number of trees
\begin{equation*}
P_{i,k}=\frac{N_k}{N_t} \tag{4}
\end{equation*}
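As a sketch of this step with random placeholder data, the soft labels of (4) can be obtained from a standard RF implementation; note that scikit-learn's predict_proba averages the per-tree class probabilities, which for fully grown trees corresponds to the vote fractions in (4).

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 40))        # placeholder feature vectors
y_train = rng.integers(0, 8, size=200)      # placeholder labels, 8 classes
X_test = rng.normal(size=(50, 40))

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

P = rf.predict_proba(X_test)                # soft labels P_{i,k}, Eq. (4)
initial_labels = P.argmax(axis=1)           # hard initial labeling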
D. Graph Structured Global Smoothing of Labels
To refine the results of the RF, we adopt a global optimization for spatial smoothing based on the perception-weighted graphical model, as advocated in [51]. This optimization aims to find an improved labeling result for all supervoxels of the scene simultaneously.
1) Construction of Perception-Weighted Graphical Model
The idea of using the perception-weighted graphical model is that we assume the graphical model can naturally represent the spatial space of 3-D scenes. Here, perceptual grouping, referring to the determination of regions and parts that, from visualization and perception, should belong to the same higher-level perceptual element [52], is adopted. This is because all the points of the same object are likely to form a smooth, continuous, and convex surface. Thus, to encode the spatial constraint from the perspective of perceptual grouping laws, we consider the cues of proximity, continuation, and similarity, measured by the spatial distance, the angle difference of the normal vectors, and the difference between the feature vectors, respectively.
More specifically, to construct the objective function of this optimization, we first build a weighted graphical model whose nodes are the labeled supervoxels and whose edges connect adjacent supervoxels. For two adjacent nodes $i$ and $j$ with centers $\vec{X_i}$ and $\vec{X_j}$, the proximity cue is measured by their spatial distance
\begin{equation*}
\Delta {X_{ij}}={||\vec{X_i}-\vec{X_j}||}_2 \tag{5}
\end{equation*}
The continuation cue is measured by the angle between the normal vectors $\vec{N_i}$ and $\vec{N_j}$ of the two nodes
\begin{equation*}
\Delta {A_{ij}}=\angle {(\vec{N_i}, \vec{N_j})}. \tag{6}
\end{equation*}
The similarity cue compares the feature histograms $h_i$ and $h_j$ of the two nodes bin by bin
\begin{equation*}
\Delta {H_{ij}}=\sum _{k=1}^{8}{\left(\frac{h_i(k)-h_j(k)}{h_i(k)+h_j(k)}\right)}^2. \tag{7}
\end{equation*}
These three cues are finally combined into the weight of the edge between nodes $i$ and $j$
\begin{equation*}
{W}(i,j)=\text{exp}{\left(- \frac{\delta _x \Delta X_{ij} + \delta _a \Delta A_{ij} + \delta _h \Delta H_{ij}}{2\theta ^2}\right)} \tag{8}
\end{equation*}
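A direct transcription of (5)-(8) for a single edge could look as follows; the trade-off parameters and the eight-bin histogram length are placeholders.

import numpy as np

def edge_weight(x_i, x_j, n_i, n_j, h_i, h_j,
                d_x=1.0, d_a=1.0, d_h=1.0, theta=1.0):
    dx = np.linalg.norm(x_i - x_j)                          # proximity, Eq. (5)
    cos_a = np.dot(n_i, n_j) / (np.linalg.norm(n_i) * np.linalg.norm(n_j))
    da = np.arccos(np.clip(cos_a, -1.0, 1.0))               # continuation, Eq. (6)
    dh = np.sum(((h_i - h_j) / (h_i + h_j + 1e-12)) ** 2)   # similarity, Eq. (7)
    return np.exp(-(d_x * dx + d_a * da + d_h * dh) / (2.0 * theta ** 2))  # Eq. (8)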
Global graph structure. (a) 3-D scene. (b) Supervoxelized 3-D space. (c) Generated global graph of labeled supervoxels.
2) Optimization of Graphical Model
The above-mentioned graphical model can be formulated as an optimization problem, searching for an optimal labeling $P^\ast$ that minimizes the sum of a data term and a smoothness term
\begin{equation*}
P^\ast \in \arg \min _{Q \in \Omega } {\sum _{i \in V} \phi (P_i, Q_i) + \sum _{(i,j) \in E} \lambda \cdot \psi (Q_i, Q_j)} \tag{9}
\end{equation*}
where $\Omega$ denotes the set of feasible labelings, $\phi$ is the data term measuring the disagreement between the initial soft labels $P$ and the candidate labeling $Q$, and $\psi$ is the pairwise smoothness (Potts) term
\begin{equation*}
\psi (Q_i, Q_j)= {\begin{cases}0 &\quad \text{if } Q_i=Q_j \\
1 &\quad \text{if } Q_i \ne Q_j \end{cases}} \tag{10}
\end{equation*}
The factor $\lambda$ weights the smoothness term adaptively according to the spatial distance $d_{ij}$ between the adjacent nodes
\begin{equation*}
\lambda = \text{exp} \left({- \frac{(d_{ij})^2}{\delta^2}} \right) \tag{11}
\end{equation*}
and the data term $\phi$ is a cross-entropy-like cost over the label set $\mathcal {K}$, with $p_k$ denoting the initial probability of class $k$ and $q_k$ the candidate assignment
\begin{equation*}
\phi (p, q) = -\sum _{k \in \mathcal {K}} q_k \text{log} \left(\frac{\alpha }{k}+\alpha p_k\right) \tag{12}
\end{equation*}
The minimization problem is solved by a graph-cut strategy using alpha expansion, which quickly finds an approximate solution within a few graph-cut iterations. The alpha expansion is implemented by the use of the GCO-V3.0 library [54]–[56]. Here, the labeling cost is not considered, since we assume that the labels of all objects are independent, so that all elements of the labeling cost matrix are set to one, except the diagonal ones, which are set to zero. The optimization results automatically adapt to the underlying scene without the need for predefined features of certain potential objects.
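For illustration only, the objective of (9)-(11) can be evaluated for a candidate labeling with the short sketch below; the actual minimization is delegated to the alpha expansion of GCO, so this snippet merely shows how the data and smoothness terms interact.

import numpy as np

def labeling_energy(Q, unary, edges, lam):
    # Q: candidate label per node; unary[i, k]: data cost phi(P_i, k);
    # edges: list of adjacent node pairs (i, j); lam: per-edge weights
    # lambda of Eq. (11).
    data_term = unary[np.arange(len(Q)), Q].sum()
    smooth_term = sum(l for (i, j), l in zip(edges, lam) if Q[i] != Q[j])
    return data_term + smooth_term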
Experiments
A. Test Datasets
Experiments are conducted on two different urban-scene datasets, the TUM city campus dataset [57] and the Semantic3D dataset [58]. For the TUM MLS dataset, the test site is the area of the city campus of the Technical University of Munich, covering around 80 000 m².
For the evaluation process, the ground truth is generated by manually labeling the point clouds. As a consequence, a highly precise reference of the entire city campus is obtained. In Fig. 8, the labeled scene with eight semantic classes is rendered in eight different colors, including building, high vegetation, low vegetation, vehicles, man-made terrain, natural terrain, hardscape, and scanning artefacts [27].
The TLS dataset is used to further test the versatility of our classification method on point clouds with varying point density. Here, we use the popular Semantic3D dataset published by ETH Zürich [27]. This dataset is manually labeled into eight different classes, namely, building, high vegetation, low vegetation, vehicles, man-made terrain, natural terrain, hardscape, and scanning artefacts. In this experiment, the scans we use are Bildstein and Untermaederbrunnen. Each scene has three scans, of which we use two as the training dataset and the remaining one as the test dataset. In Fig. 8, we show the manually labeled reference of these two scenes, with labels provided by www.semantic3d.net.
B. Evaluation Metric
For the evaluation of the classification, we follow the Pascal VOC challenges [59] and use the Intersection over Union (IoU) averaged over all classes. The evaluation measure IoU for class $i$ is defined as
\begin{equation*}
\text{IoU}_i=\frac{\text{TP}_i}{\text{TP}_i+\text{FP}_i+\text{FN}_i}. \tag{13}
\end{equation*}
The mean IoU over all $N$ classes is then
\begin{equation*}
\overline{\text{IoU}}=\frac{1}{N}\sum \limits _{i=1}^{N}{\text{IoU}_i} \tag{14}
\end{equation*}
and the overall accuracy (OA) is computed as
\begin{equation*}
\text{OA}=\sum \limits _{i=1}^{N}\left(\frac{\text{TP}_i}{\text{TP}_i+\text{TN}_i+\text{FP}_i+\text{FN}_i}\right). \tag{15}
\end{equation*}
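Given an N x N confusion matrix, (13)-(15) can be computed directly, as in the following sketch.

import numpy as np

def iou_and_oa(confusion):
    # confusion[i, j]: number of points of reference class i predicted
    # as class j.
    tp = np.diag(confusion).astype(float)
    fp = confusion.sum(axis=0) - tp
    fn = confusion.sum(axis=1) - tp
    iou = tp / (tp + fp + fn)          # per-class IoU, Eq. (13)
    mean_iou = np.nanmean(iou)         # averaged IoU, Eq. (14)
    oa = tp.sum() / confusion.sum()    # overall accuracy, Eq. (15)
    return iou, mean_iou, oa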
Results and Discussion
A. Preprocessing
Although the dataset suits the work well, it has some drawbacks. For instance, the original dataset is exceptionally dense, including more than one billion points for the Arcisstrasse alone, which cannot be handled efficiently. Besides, the alignment between scans is not accurate enough, so that object surfaces often appear with variable thickness. Therefore, appropriate processing for the later work is necessary. The original raw point cloud is first preprocessed by a statistical outlier removal filter and then downsampled. The preprocessed points are reduced to around 50 million, namely, only about 5% of the original point cloud. In Fig. 9, a comparison between a subscene of the raw and the preprocessed point clouds is illustrated. It shows that apparent noise and outliers are removed, while the structures of the objects are well preserved.
Preprocessing of point clouds. (a) and (d) Original point clouds. (b) and (c) Preprocessed point clouds.
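As a rough illustration of this preprocessing chain, both steps can be reproduced, e.g., with Open3D; the file names and parameter values below are placeholders, not the settings used in our experiments.

import open3d as o3d

pcd = o3d.io.read_point_cloud("tum_campus_raw.pcd")   # hypothetical file

# Statistical outlier removal: drop points whose mean distance to their
# neighbors deviates strongly from the global average.
pcd, _ = pcd.remove_statistical_outlier(nb_neighbors=20, std_ratio=2.0)

# Voxel-grid downsampling to thin the exceptionally dense raw scan.
pcd = pcd.voxel_down_sample(voxel_size=0.05)

o3d.io.write_point_cloud("tum_campus_preprocessed.pcd", pcd)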
B. Experimental Results
In the semantic labeling experiments, our algorithms for feature extraction and initial classification are implemented in C++ and run on an Intel i7-6700 CPU @ 3.4 GHz with 32.0 GB RAM.
1) Results of Using TUM Datasets
When using the TUM dataset for conducting a supervised classification, we use only the area along the Arcisstrasse as the training set, nearly 30% of the entire dataset, while the rest of the dataset (around 70%) is used as the testing set. For assessing the effectiveness of our proposed detrended features and the graph-structured optimization, we compared the method using only the features of the single supervoxel (termed as SV) with the methods using the detrended features alone and using the combined vector of features.
With respect to the OA, the methods using our detrended geometric features outperform the other methods, with improvements of around 1% and 7%, respectively. This means that using the features with the local tendency removed achieves equivalent or even better performance than using features from the single supervoxel alone, with merely a 1% improvement. However, when combining these two vectors of features, the accuracy improves drastically, by 7%. This is because, in the combined situation, the original geometric features are kept while the details are enhanced, which facilitates the feature expression. For the IoU measures, our approach outperforms the others as well. In particular, for scanning artefacts and vehicles, which are easily labeled as building facades and low vegetation, our detrended features gain better IoU measures. The effectiveness of our detrended features is backed by the analysis of the features as well. In Fig. 10, we provide a summary of the feature importance of the different vectors in the RF process. As shown in the figure, the last three vectors, representing the contextual features, play an important role in the decision trees. Besides, the vectors of detrended features from bins 15 to 30 of the vector of features are also essential to the creation of the decision trees, with a higher averaged importance compared with the features from the supervoxel itself.
Comparisons of the classification results using different graph cut thresholds are conducted as well.
Benefiting from the preclustering, one advantage of segmentation-based classification methods over classical point-based classification methods is their insensitivity to outliers and noise. Meanwhile, the supervoxel structure also has some disadvantages. For instance, choosing an appropriate voxel size is a compromise between noise suppression, detail preservation, and unifying the uneven density: the larger the voxel, the smoother the details. Obviously, for regular boundaries, such as the right-angled corners formed by smooth walls, our boundary refined supervoxels can accurately find these boundaries. However, when irregular edges are involved, for example, the edges between walls and the ground in the presence of French windows, the boundaries found by the supervoxels are biased. Besides, for small objects like minor scanning artefacts, if the size of the supervoxel is too large, they are easily blurred by their background and cannot be described correctly. This can be seen from the initial result, where large objects like buildings and high vegetation always achieve a good IoU value.
To get a complete view of the classification performance, we apply the trained classifier mentioned above to the entire TUM dataset, which is nearly four times larger than the training data of the Arcisstrasse. The classification result of our method is given in Fig. 11. As seen from the figure, corresponding to the performance shown in Table III, the semantic labeling result of the entire experimental area is given. In the result, we find that the majority of buildings, man-made terrain, and high vegetation are labeled correctly. Besides, our proposed method reveals excellent potential in positioning stationary vehicles, although this is only achieved after the optimization. In Fig. 12, we give a detailed view of the optimization for vehicles and buildings. The initial classification result of the vehicles is questionable because parts of the points of vehicles are occluded due to the view direction of the observation. These occluded observations make the supervoxels of cars and buses look like parts of hardscapes. After the optimization, those wrongly labeled supervoxels are corrected. However, if the initial labels of a local area are biased, as in the lower-left corner of the campus scene in Fig. 12, the optimization may make the situation worse. In this area, large parts of the road surface are initially labeled as natural terrain, so that the optimization fails, because for these nodes in the graphical model, all their neighbors have wrong labels. Besides, due to the complexity of the large testing area, there are misclassifications such as “sparkling effects” [11], where the area within the facade of the building is incompletely measured and too sophisticated for the classification; thus, certain parts of the inner building structure are considered to be disturbing objects. Although part of the fragmented wrong labels is regularized, there are still large areas with incorrect labels.
Classification results. (a) Initial classification result of TUM campus. (b) Optimized classification result of TUM campus. (c) Optimized classification result of Bildstein. (d) Optimized classification result of Untermaederbrunnen.
Comparison of results before and after the optimization (the area in the boxes of Fig. 12). (a) Initial classification result. (b) Optimized classification result. (c) Ground truth of TUM campus. (d) Initial classification result. (e) Optimized classification result. (f) Ground truth of Untermaederbrunnen. (g) Initial classification result. (h) Optimized classification result. (i) Ground truth of Bildstein.
For a further evaluation of the performance of our proposed method, we also compare it with two baseline methods, PointNet [60] and PointNet++ [61], which are both renowned deep learning based classification methods for 3-D points. In Table IV, we illustrate the comparison of the classification results. Here, the training and testing datasets used for PointNet and PointNet++ are the same as those used in our method. To fulfill the input requirements of PointNet and PointNet++, the entire point cloud is subdivided into thousands of point chips, each containing 10 000 points. These chips are downsampled to 8192 points, which represent the main structure of each chip, and the downsampled chips serve as the input for PointNet and PointNet++. Each point in a chip is represented by a 3-D vector containing the coordinates $x$, $y$, and $z$.
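A sketch of this input preparation is given below; the crude sort-by-x chunking only stands in for the actual spatial subdivision, while the chip size of 10 000 and the downsampling to 8192 points follow the text.

import numpy as np

def make_chips(points, chip_size=10_000, n_out=8192, seed=0):
    # points: (N, 3) array of x, y, z coordinates.
    rng = np.random.default_rng(seed)
    order = np.argsort(points[:, 0])    # crude spatial grouping along x
    chips = []
    for s in range(0, len(points) - chip_size + 1, chip_size):
        chip = points[order[s:s + chip_size]]
        keep = rng.choice(chip_size, size=n_out, replace=False)
        chips.append(chip[keep])        # (8192, 3) chip for PointNet(++)
    return np.stack(chips)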
Comparison of classification results using different methods. (a) Classification result using PointNet. (b) Classification result using PointNet++. (c) Classification result using our proposed method.
2) Results of Using ETH Datasets
We also tested our method using the Semantic3D dataset provided by ETH Zürich [27], in order to explore the full potential of our feature extraction method. It is worth noting that the Semantic3D dataset is measured by TLS, so that the density of the points varies with the distance between the station and the object. Here, the voxel size is 0.3 m and the supervoxel seed resolution is 1.0 m. For the weight factors in the boundary refinement process,
The evaluation on the two scenes is given in Tables V and VI. In the experiments on the Bildstein scene, we used the scans Bildstein 3 and 5 for training. Then, the point cloud of Bildstein 1 is used for validation, classifying points into eight classes. As given in the table, our method can still reach an OA of 0.811. Interestingly, the classification of vehicles, low vegetation, and hardscapes almost fails in this test. The primary reason is the insufficient number of training samples, which is a frequent problem for segment-based point cloud classification: for objects without enough training samples, like hardscapes and vehicles, the labeling often fails. In contrast, for objects having enough training samples, for instance, buildings and man-made terrain [see Fig. 12(e)], satisfying results can be achieved.
In the experiments on the Untermaederbrunnen scene, the scan used for training is Untermaederbrunnen 1. For the evaluation, the scan Untermaederbrunnen 3 is used, involving all eight classes. As shown in the table, our proposed method achieves an OA of 0.804. Similar to the result of the previous scene, buildings, man-made terrain, and high vegetation (see Fig. 12) obtain a good OA, but for natural terrain and vehicles, both the initial labeling and the optimization fail due to insufficient training samples. It is noted that for vehicles, TLS scanning may cause more severe occlusions, so that the scanned points of vehicles have a confusing geometry similar to that of hardscapes.
Indeed, the results on the ETH dataset cannot compete with methods reaching an OA of 0.9, but considering that our proposed method is supervised and requires much less training data, the results are satisfying. Besides, for the object of our primary concern (i.e., buildings), both our proposed detrended features and the graph-structured optimization perform well, and the voxel-based data structure significantly accelerates the processing. For practical applications, LiDAR points often come textured with reliable intensity or RGB color, so the performance of our method could be further improved by this radiometric information, as in [11]. Moreover, in our proposed method, we only use the RF classifier for obtaining the initial labels, but in fact, any classifier providing soft labels can be used in our method. An excellent initial labeling result can significantly improve the effectiveness of the graph-structured optimization. In future work, a better initial labeling method could be utilized.
Conclusion
In this work, we proposed a supervised point cloud classification method using novel detrended geometric features, removing redundant and insignificant information in the local neighborhood of the supervoxels. The proposed feature extraction method uses the supervoxel-based local neighborhood instead of points as basic elements, encapsulating the geometric features of local points. Based on the initial classification results, a graph-based optimization is used to spatially smooth the labeling results, based on a graphical model with perception-weighted edges. The optimization of the labeling can be naturally represented in terms of energy minimization [62]. We minimize an energy function constructed from a data term and a piecewise smoothness term via a graph cut algorithm, i.e., the alpha expansion proposed in [54], which finds a good approximate solution by iteratively running min-cut/max-flow algorithms on an appropriate graph. This move-making algorithm iteratively selects a label and considers expansion moves that grow the set of elements assigned this label whenever the move lowers the energy. Besides, the constructed energy function can be justified in the context of maximum a posteriori estimation of Markov random fields (MRFs); the discrete multilabel MRF is solved by applying min-cut/max-flow algorithms iteratively to binary-labeled piecewise MRFs [63]. At the moment, our results on the Semantic3D dataset cannot yet be compared with the state-of-the-art methods with an OA reaching 0.9, but for the object of our primary concern (i.e., buildings), both our proposed detrended features and the graph-structured optimization perform well, and the voxel-based data structure significantly reduces the number of elements to be processed, similar to a downsampling procedure. However, we should also admit that handcrafted features have natural drawbacks compared with those generated by supervised learning, because it is always difficult to manually design the feature expression algorithm and tune its parameters, although they may perform well in some specific situations. In the future, features generated by learning-based methods should be more promising and of wider use.
ACKNOWLEDGMENT
This work was carried out within the frame of Leonhard Obermeyer Center (LOC) at Technische Universität München (TUM).