Introduction
As a next-generation telecommunication technology, 5G brings a novel perspective and innovative solutions to the increasing demands of humans and autonomous devices [1]. This technology mainly focuses on five areas: dense device structures, high carrier frequency bands such as millimeter wave (mmWave), multi-connectivity such as massive MIMO, smart devices, and massive machine-type communications. To fulfill these requirements in such a dynamic digital world, next-generation cellular networks must be adaptable, with predictive capabilities, due to the ever-changing environment of nested services interacting with each other [2]. Hence, artificial intelligence (AI) has garnered increased interest as a potential 5G technology for handling this dynamic environment by analyzing and contributing to the execution of the network. Examples include intelligent applications for massive machine-type communications (mMTC) in 5G and massive MIMO in wireless sensor networks (WSNs), which achieve better service quality through improved IoT connectivity as well as extended battery life and boosted spectral efficiency by utilizing a channel-aware decision fusion methodology [3], [4].
In previous mobile networks, such as 4G, autonomous mobile networks or self-organizing networks (SONs) with capabilities such as self-planning, self-configuring, self-optimizing, and self-healing have been shown to significantly reduce network failures and boost performance without human intervention, as with AirHop’s eSON [5]. However, current solutions on the market generally lack smart functionalities, especially for cell outage management (COM) with autonomic healing [6]. An absence of traffic due to a specific network issue signals an anomaly in which one or more cells may be in outage, and it is vital to detect the outage so that service can be resumed within the shortest time possible. For instance, Ericsson lost at least
This paper has four main contributions to the field as summarized below:
We present a comprehensive analysis of a conventional machine learning method for anomaly detection in self-organizing 5G networks (5G-SONs) and compare it with a popular deep learning alternative using different learning representations, including one-class and binary learning.
We claim state-of-the-art performance on a publicly available dataset [8], which investigates multiple use case scenarios for anomaly detection in 5G-SONs. We demonstrate an average improvement of 15% over the best recent performance which was achieved by a deep auto-encoder-based setup.
We demonstrate for the first time that data augmentation methods can further boost anomaly detection performance in binary mode, even when utilizing conventional algorithmic methods such as support vector machines on a sufficiently large dataset.
Finally, we achieve nearly two orders of magnitude improvement in computational speed and an order of magnitude reduction in trainable parameters using conventional machine learning to provide a robust alternative for 5G self-organizing networks especially when the execution and detection times are critical.
The rest of this paper is organized as follows. Section II provides a brief summary of prior work on anomaly detection in current mobile and self-organizing networks. Section III introduces the methods used in this study. Section IV describes the experimental setup in detail and provides the hyper-parameters, dataset characteristics, implementation, evaluation metrics, and necessary details for repeatability. Finally, the results and discussions are presented in Section V, followed by the conclusions in Section VI. The abbreviations used in this paper have been provided in Table 1 for easy reference.
Related Research
Anomaly detection in communications has been an active research area over the last decade. For instance, in [9], abnormal activity in the wireless spectrum has been explored. Specifically, the authors used power spectral density (PSD) information to identify and pinpoint anomalies in the form of either undesired signals present in the licensed band or the absence of a desired signal. The information obtained from the PSD was processed using a combination of adversarial auto-encoders, convolutional neural networks, and long short-term memory recurrent neural networks.
In another example, [6] utilizes the measurements and handover statistics (inHO) from adjacent cells in a mobile communications network to expose abnormalities and outages. Monitored in this way, an inHO count that drops to zero indicates a potential cell outage. In [10], a novel online anomaly detection system was proposed for mobile networks to identify anomalies in key performance indicators (KPIs). The proposed system consists of a training block and a detection/tracking block. The system learns the most detrimental anomalies in the training block as each recent KPI arrives, and monitors their status until the end of the second block. Thus, anomaly detection is tuned to favor highly probable anomalies over the long term. Moreover, the system aims to report the minimum number of anomalies by maintaining a low false positive rate, so that network operators spend their effort only on real anomalies. In addition, the system can be extended to next-generation networks through automatic adaptation to a new network behavior profile.
The authors in [11] proposed unsupervised learning to detect anomalies related to mobility using mobility robustness optimization (MRO), which is an important SON use case for modern 4G and 5G networks. A similar study in [12] brings a different perspective by using reinforcement learning to reduce call drop rates.
In [13], the authors used an autoencoder-based network anomaly detection method that captures non-linear changes in the features to increase detection accuracy. The authors used a convolutional autoencoder for dimensionality reduction and outperformed more conventional methods in wireless communications for detecting cyberattacks.
Artificial intelligence (AI) applications supporting 5G technologies can be implemented in both the physical and network layers. For instance, the authors in [14] present an excellent overview of deep learning (DL) applications for the physical layer. One of the applications is a novel modification of the standard autoencoder called the “channel autoencoder.” In a typical autoencoder, the goal is to find the most compressed representation of the inputs in the encoding layer with minimal loss at the output layer. In [14], the authors instead aim to find the most robust representations of the input (messages) that account for channel degradation by adding redundancies rather than removing them. The authors further extend this concept to an adversarial network of multiple transmitter/receiver pairs aiming for increased capacity. Furthermore, they discuss augmented DL models using radio transformer networks (RTNs). The RTNs carry channel domain information and simplify the receiver’s job of symbol detection by using a neural network estimator to obtain the best parameters for symbol detection. Experiments with the augmented DL models on complex I/Q samples for modulation classification demonstrated that the DL models outperformed classification methods based on expert features. Another study, [15], introduces the design of mobile traffic classifiers based on DL. A systematic framework of new DL-based traffic classification (TC) architectures is introduced and analyzed. Rather than mobility, that study addresses encrypted TC over a wide range of applications.
These studies generally rely on multi-class applications of deep neural network methods learning from features such as key performance indicators, handover statistics, reference signal received power and quality (RSRP and RSRQ), and the number of connection drops and failures. However, there is an ongoing discussion in the research community as to whether other learning representations, such as one-class learning and deep unsupervised methods, have the potential to become strong alternatives for anomaly detection in SONs [8]. Similarly, an ever-present debate concerns the comparative effectiveness of empirical deep learning models and conventional approaches, such as feature engineering, for a range of temporal applications [16]. A categorical analysis of the literature discussed in this section can be found in Table 2, which displays important characteristics and features such as the dataset type, the topology of the network used, and which learning methodology is applied to which application.
Methods
A. Anomaly Detection Problem Definitions
In this paper, we look at anomaly detection scenarios where a total of 105 mobile users are uniformly spread around 7 base stations (BS), and a BS can fail to communicate with a user by performing below the expected key performance indicator (KPI) margins. Each BS is represented as a circle containing 3 regularly spaced cells, as shown in Fig. 1, for a total of 21 cells. A minimization of drive test (MDT) report is generated for each user at a 5 kHz sampling rate (i.e., once every 0.2 ms) over a total recording time of 50 seconds to obtain the associated data. The KPIs, including the RSRP and RSRQ measurements, are collected from each of the 7 base stations during this period. The data for each MDT report is then assigned a class label from the set {0, 1, 2, 3}.
The labels 0, 1, and 2 indicate normal network status with varying levels of discrete transmitter power, whereas label 3 indicates that the base transceiver station (BTS) has failed to communicate with the user, which is classified as an anomaly in the dataset. Three other scenarios are implemented in a similar way, the only differences being the numbers of users (210, 105, 105) and cells (57, 57, 57), in order to consider the anomaly detection (AD) application for a wider range of networks and users with different numbers of observations.
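To make the label convention concrete, the short Python sketch below (illustrative only, not part of the dataset tooling) binarizes the four status labels into an anomaly flag, treating label 3 as the anomalous class.

```python
import numpy as np

# Labels 0, 1, and 2 denote normal operation at different transmit power levels,
# while label 3 denotes a BTS that failed to communicate with the user.
# For anomaly detection the problem is binarized (label 3 -> anomaly).
labels = np.array([0, 2, 1, 3, 0, 3])        # example per-report labels
is_anomaly = (labels == 3).astype(int)       # 1 = anomaly (outage), 0 = normal
print(is_anomaly)                            # [0 0 0 1 0 1]
```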
Fig. 1 shows the structure of the communications network for the case of 7 BS and 3 cells within each grid used in the first scenario for outage detection. This figure is provided to help with the visualization of the AD application and is similar for different scenarios using a range of BS, cell and user numbers.
B. Support Vector Machine
The support vector machine (SVM) [17] is a popular machine learning algorithm that plays important roles in binary classification and unsupervised feature learning. An SVM model represents and separates the various classes by building a set of hyperplanes in a multi-dimensional space. The classes are separated iteratively by the SVM algorithm through finding the optimal hyperplane (the decision boundary between groups of objects from different classes). In this context, the optimal hyperplane maximizes the margin, i.e., the distance between the hyperplane and the closest data points from each class, which are known as the support vectors.
We implement SVM in two different ways:
1) One Class Anomaly Detection With SVM
One-class SVM is a popular method for unsupervised anomaly detection [18]–[21]. In general, an SVM is applied to binary classification tasks. In the one-class anomaly detection case, the algorithm is trained only with observations from the majority class to learn the “normal” samples. When new data are presented to the SVM algorithm, the “anomaly” samples generate a lower decision probability than observations from the majority class under ideal conditions.
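As a concrete illustration, the following Python sketch (not the authors' MATLAB implementation) trains an RBF-kernel one-class SVM on synthetic stand-in data; the feature dimensions, signal levels, and the value of nu are illustrative assumptions only.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Stand-in for 20-dimensional MDT feature vectors (10 RSRP + 10 RSRQ).
X_normal = rng.normal(loc=-90, scale=5, size=(1000, 20))      # "normal" training samples
X_test = np.vstack([
    rng.normal(loc=-90, scale=5, size=(50, 20)),              # normal-like test samples
    rng.normal(loc=-125, scale=5, size=(5, 20)),              # outage-like (anomalous) samples
])

# nu bounds the fraction of training points treated as outliers; 0.01 is an assumed value.
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.01)
ocsvm.fit(X_normal)                           # trained on the majority ("normal") class only

scores = ocsvm.decision_function(X_test)      # larger score -> more "normal"
is_anomaly = scores < 0                       # threshold at the learned boundary
print(f"flagged {is_anomaly.sum()} of {len(X_test)} test samples as anomalous")
```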
2) Binary Anomaly Detection With SVM
A binary SVM classifier provides the most accurate results when trained on balanced datasets. When the dataset is imbalanced, the SVM classifier is inclined to overfit to the majority class and shows poor performance for the minority class with reduced generalization. To overcome imbalance in a dataset, different techniques have been introduced in the past, one of which, the synthetic minority oversampling technique (SMOTE), has gained increased popularity in cases of severe imbalance such as anomaly detection.
C. Smote Algorithm
The synthetic minority oversampling technique, or SMOTE [22], is one of the most popular oversampling methods for overcoming the imbalance problem. Its goal is to balance the class distribution by randomly increasing the minority class through “synthetic” patterns generated in feature space rather than by duplicating raw data. The oversampling process generates a synthetic sample X' from a minority-class sample X and one of its k nearest minority-class neighbors X_k as \begin{equation*} X' = X+rand(0,1)\times |X-X_{k}|\tag{1}\end{equation*} where rand(0,1) denotes a uniform random number between 0 and 1.
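The interpolation rule in Eq. (1) can be sketched in a few lines of Python; the function below is a simplified illustration, not a full SMOTE implementation (libraries such as imbalanced-learn provide one).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_oversample(X_min, n_synthetic, k=5, seed=0):
    """Generate synthetic minority samples following Eq. (1):
    X' = X + rand(0,1) * |X - X_k|, with X_k one of X's k nearest minority neighbors.
    Simplified illustration only; not a full SMOTE implementation."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)   # +1: each point is its own neighbor
    _, idx = nn.kneighbors(X_min)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))                       # random minority sample X
        j = idx[i][rng.integers(1, k + 1)]                 # one of its k nearest neighbors X_k
        gap = rng.random()                                 # rand(0, 1)
        synthetic.append(X_min[i] + gap * np.abs(X_min[i] - X_min[j]))
    return np.vstack(synthetic)

# Example: triple the minority class (N = 300%) with k = 4 neighbors.
X_minority = np.random.default_rng(1).normal(size=(60, 20))
X_new = smote_oversample(X_minority, n_synthetic=180, k=4)
print(X_new.shape)                                         # (180, 20)
```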
D. Autoencoder
An autoencoder is a particular artificial neural network topology in which the input and output layers have the same size and training is performed by presenting the same input data to both layers simultaneously. The general structure of the auto-encoder consists of a visible input layer, one or more hidden layers that encode the input into a compressed (coding) representation, and a decoding path whose output layer has the same dimensionality as the input.
During training, the auto-encoder maps the input x to a hidden representation H through the encoding function f and reconstructs it as the output z through the decoding function g:\begin{align*} H\equiv&f_{W_{H}}(x) = f(W_{H}x + b_{H})\tag{2}\\ z\equiv&g_{W_{z}}(H)= g(W_{z}H + b_{z})\tag{3}\end{align*}
where W and b denote the weights and biases of the encoding and decoding layers. Training minimizes the average reconstruction error over the N training samples:\begin{equation*} {L_{AE}=\min \frac {1}{N} \sum _{k=1}^{N}|| x_{k} - z_{k} ||^{2}}\tag{4}\end{equation*}
Within the context of this application, the autoencoder is used for anomaly detection by training the network on normal observations and, once trained, comparing the reconstruction errors of normal and anomalous samples against a threshold for detection. It is hypothesized that normal samples will yield smaller reconstruction errors than anomalous samples, which can easily be characterized using the receiver operating characteristics (ROC). More specifically, in [8], the autoencoder model structure starts with an input vector of size 20 (corresponding to 10 RSRP and 10 RSRQ measurements), followed by four hidden layers of 12, 6, 6, and 12 neurons, implementing the overall topology 20-12-6-6-12-20.
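A hedged sketch of this reconstruction-error approach is given below, using scikit-learn's MLPRegressor trained to reproduce its own input as a stand-in for the 20-12-6-6-12-20 autoencoder of [8]; the activation function, training data, and threshold rule are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(2000, 20))                  # stand-in for normalized normal-class vectors
X_test = np.vstack([
    rng.normal(size=(100, 20)),                        # normal-like samples
    rng.normal(loc=3.0, scale=1.0, size=(10, 20)),     # shifted rows mimic anomalies
])

# Hidden layers (12, 6, 6, 12) with a 20-dimensional input/output reproduce
# the 20-12-6-6-12-20 topology; the activation and solver settings are assumptions.
ae = MLPRegressor(hidden_layer_sizes=(12, 6, 6, 12), activation="tanh",
                  max_iter=2000, random_state=0)
ae.fit(X_train, X_train)                               # train the network to reconstruct its input

train_err = np.mean((ae.predict(X_train) - X_train) ** 2, axis=1)
test_err = np.mean((ae.predict(X_test) - X_test) ** 2, axis=1)
threshold = np.percentile(train_err, 99)               # an assumed thresholding rule
print("samples flagged as anomalous:", int((test_err > threshold).sum()))
```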
Experimental Setups
A. Datasets
In this work, we use the dataset created by a SON simulator previously introduced in [8] and made available to the public for further research [23]. The dataset includes four different application scenarios where the data is collected periodically (with a 5 kHz sampling rate) from a minimization of drive test report [24], which contains mobile user information regarding the user activities recorded in the enclosed regions around the base stations (cells) in certain measurement periods. There are four different datasets with different numbers of users and use cases. Fig. 2 demonstrates the basic structure of the dataset for the first use case (dataset 1), which consists of the measurement time, a unique ID assigned to each user, the coordinates of the user location in two dimensions at the time of measurement, the reference signal received power (RSRP), the reference signal received quality (RSRQ), and the label that shows whether the associated entry (i.e., the collection of measurements associated with that user) is anomalous (1) or not (0). RSRP and RSRQ are significant measures of signal level and quality in LTE networks, designated as key performance indicators (KPIs), for identifying whether the collected information is anomalous. The feature vector consists of the RSRP and RSRQ measurements from different cell locations (i.e., RSRP1, RSRQ1, RSRP2, RSRQ2,…, RSRP10, RSRQ10). The dataset has 11674 observations, of which only 60 are anomalous (i.e., roughly 1 in 200), and 25 features, including the feature vector, user ID, location, and class label.
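For illustration, a loading step along the lines described above might look as follows in Python; the file name and column names are assumptions and may differ from the published dataset's actual schema.

```python
import pandas as pd

# Hypothetical file and column names; the public dataset's actual schema may differ.
df = pd.read_csv("mdt_dataset1.csv")

feature_cols = [f"RSRP{i}" for i in range(1, 11)] + [f"RSRQ{i}" for i in range(1, 11)]
X = df[feature_cols].to_numpy()                    # 20-dimensional KPI feature vector per entry
y = (df["label"] == 1).astype(int).to_numpy()      # 1 = anomalous entry, 0 = normal

# Time, user ID, and location columns are left out of the feature matrix,
# mirroring the preprocessing described in Section IV-C.
print(X.shape, y.mean())                           # e.g., (11674, 20) and an ~0.5% anomaly rate
```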
The second dataset is similar to the first one in terms of the number of features and measurements, except that it has a much lower anomaly rate: of its 8382 observations, only eight are anomalous (i.e., approximately 1 in 1000).
Datasets 3 and 4 represent different use cases, both of which have 114 features and a longer (80 s) recording time, resulting in a much larger observation base (42000). As in datasets 1 and 2, they represent different anomaly rates: dataset 3 has a much larger number of anomalous measurements (9635) compared to dataset 4 (22), because only the entries below the -120 dB RSRP measurement threshold were labeled as such.
B. Hyper-Parameters
The most significant hyper-parameters in this study are the oversampling rate N and the number of nearest neighbors k, specifically for the SMOTE algorithm in applying oversampling for the binary classification case. In applying SMOTE, we tested two different sets of hyper-parameters with N = 300, k = 4 and N = 500, k = 5.
C. Implementation
All implementations were performed using MATLAB 2020b. We tested both one-class and binary SVM models on all datasets. The SMOTE algorithm was used with the binary SVM model to adjust for the imbalanced datasets. The datasets were preprocessed so that the time, UserID, and location features were excluded from the training process for fairness. Approximately 10% of the normal and anomalous samples were set aside for testing. Both the training and testing samples were normalized to the range (0, 1].
The one-class SVM model was trained with only normal samples using Gaussian RBF kernels. After training, we generated the SVM probability outputs with test samples, including both normal and anomalous samples, to obtain receiver operating characteristic (ROC) curves along with area-under-the-curve (AUC) scores as performance metrics.
The binary SVM model was trained in exactly the same fashion, except that, in addition to the above process, the anomaly samples were oversampled with SMOTE to generate balanced datasets prior to training and testing the algorithm.
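A hedged Python sketch of this training pipeline is shown below, using imbalanced-learn's SMOTE and scikit-learn's SVC on synthetic stand-in data rather than the paper's MATLAB toolbox; the anomaly rate and k_neighbors value mirror dataset 1 and one of the SMOTE settings above, but the details are illustrative.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))                      # synthetic stand-in features
y = np.zeros(2000, dtype=int)
y[rng.choice(2000, size=10, replace=False)] = 1      # ~1-in-200 anomaly rate, as in dataset 1

# Roughly 10% of normal and anomalous samples held out for testing.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, stratify=y, random_state=0)

# SMOTE resampling is applied only while fitting; k_neighbors=4 mirrors one of the settings above.
clf = Pipeline([
    ("scale", MinMaxScaler()),                       # map features to the unit range
    ("smote", SMOTE(k_neighbors=4, random_state=0)),
    ("svm", SVC(kernel="rbf", probability=True)),
])
clf.fit(X_tr, y_tr)
anomaly_scores = clf.predict_proba(X_te)[:, 1]       # probability of the anomaly class
```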
D. Evaluation Metrics
In this study, we evaluated performance by looking at ROC curves and AUC scores. The ROC is a probability curve that shows the model’s ability to identify the positive class appropriately. It is plotted with the true positive rate (TPR) on the y-axis and the false positive rate (FPR) on the x-axis, where TPR is the proportion of actual positives that are correctly classified and FPR is the proportion of actual negatives that are incorrectly classified as positive, as expressed below:\begin{equation*} TPR=\frac {TP}{TP+FN},\quad FPR=\frac {FP}{FP+TN}\tag{5}\end{equation*}
The AUC summarizes the ROC curve as the area under it,\begin{equation*} AUC=\int _{0}^{1} TPR \; d(FPR)\tag{6}\end{equation*} ranging from 0.5 for a random detector to 1 for a perfect one.
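The following short Python sketch (illustrative only) computes the ROC curve and AUC from classifier scores using scikit-learn; the labels and scores are synthetic.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y_true = np.array([0] * 95 + [1] * 5)                        # synthetic labels (5% anomalies)
scores = np.concatenate([rng.normal(0.2, 0.1, 95),           # scores for normal samples
                         rng.normal(0.7, 0.15, 5)])          # higher scores for anomalies

fpr, tpr, thresholds = roc_curve(y_true, scores)             # points along the ROC curve
auc = roc_auc_score(y_true, scores)                          # area under that curve
print(f"AUC = {auc:.3f}")
```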
Results and Discussion
The ROC curves for each of the four datasets are shown in Fig. 3. The red curve in each figure represents the latest state-of-the-art reported on this dataset using a deep autoencoder [8]. For comparison, the different SVM implementations are represented in different colors, including binary and one-class combinations with and without augmentation using SMOTE. The results are generally consistent except in the case of dataset 4, where all SVM combinations outperformed the deep autoencoder at all levels of TPR and FPR. In the case of the first three datasets, the deep autoencoder outperformed the SVM implementations without augmentation. However, when SMOTE is used, both one-class and binary modalities of the SVM demonstrate higher performance compared to the original paper, and in some cases significantly so.
Fig. 3. The ROC curves for the four datasets for the comparative analysis of modern and conventional machine learning with the results reported in the source paper [8].
The results are further summarized in Fig. 4, which provides an overview of AUC scores across the different datasets and algorithms, where both binary and one-class SVMs (the three curves at the top) clearly outperform the deep autoencoder approach shown in red. Table 3 provides the absolute numbers in terms of the AUC at the 5% significance level. The best combination of SMOTE and SVM modality outperformed the deep autoencoder by 19.75%, 15.5%, 15.5%, and 13% for datasets 1, 2, 3, and 4, respectively. This corresponds to an average performance improvement of over 15% across all the application scenarios. It is important to note that even without artificial augmentation of the dataset, conventional machine learning using SVM still outperforms the previous state-of-the-art methods reported in the literature; however, the difference in performance was noticeably smaller.
In order to study whether the SMOTE algorithm would provide a similar performance boost for the autoencoder setup, we focused on the first two datasets, where the imbalance is much more significant, as described in the Asghar et al. [8] study. We observed that the average detection accuracies with the SMOTE hyper-parameters N = 300, k = 4 and N = 500, k = 5 were 64.03% and 57%, respectively, on dataset 1 and 68.87% and 51.64%, respectively, on dataset 2. These results are either nonsignificant improvements over the baseline performance or, in fact, worse than not applying SMOTE to the dataset at all. There are many possible explanations, but the most likely one is that the operational structure of the latent space in an autoencoder is very similar to the way SMOTE calculates augmented samples; in other words, the advantage of using SMOTE is negated in the latent coding layer of the autoencoder itself.
It is important to discuss the nonsignificant, or in some cases negative, impact of SMOTE on the performance of the one-class SVM topology. SMOTE works by generating samples based on the nearest-neighbor similarities of intraclass samples and the differences between interclass samples. This method works best when training a binary classifier where one class is less represented than another, as evidenced by the performance boost observed in binary SVM training. However, in the one-class learning representation, only the majority class is used in training, which means that SMOTE can only have an indirect effect on the number and quality of the samples generated for the normal class and consequently does not play a direct role in boosting performance. The drop in performance in some cases can similarly be linked to the quality of the generated anomaly-class samples (which affects testing performance) not making up for the additional information that can no longer be used in the training process.
A. Computational Complexity Analysis
Computational complexity analysis is an important step in identifying the strengths and weaknesses of conventional algorithms, such as one-class and binary SVMs, when compared to more modern approaches, such as the autoencoders used in anomaly detection. In this paper, we focused on both the raw computational times, specifically for the testing phase in each of the four dataset scenarios, and the number of trainable parameters for both algorithms. We also compared how SMOTE affected the complexity.
All measurements were done using the latest version of MATLAB at the time of this writing (2021b) using its standard time measurement scripts. On the first dataset, the one-class SVM testing took 90 ms without SMOTE and 140 ms with SMOTE, whereas the autoencoder took 9000 ms, almost a 100-fold increase. The additional complexity introduced by SMOTE is more than offset by its significant contribution to accuracy. We observed similar computational times for the rest of the use-case scenarios (e.g., for the second dataset, one-class SVM testing times were 110 to 130 ms (with SMOTE) compared to 7120 ms for the AE), corresponding to orders-of-magnitude improvements in testing times. We performed the computational speed analysis using the one-class SVM because its implementation was a native script, whereas the binary SVM used a GUI toolbox with better visualization capabilities, which affects speed. However, there should be no difference in testing times since the SVM topologies are identical, the only difference being the way the data are presented to the algorithms.
In terms of the trainable parameters as an indicator of computational complexity, the SVM models on average had 23 support vectors compared to the 660 weight and 76 bias parameters (total of 736) that need to be trained for the autoencoder. Based on the above measurements for computational times and complexity, SVM-based approaches are less complex even when using SMOTE, have competitive AUC scores, and are thus more suitable for time-critical scenarios such as anomaly detection / outage recovery compared to the AE.
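As a hedged illustration of how such measurements can be reproduced outside MATLAB, the sketch below times the testing phase of a one-class SVM on synthetic stand-in data and reports its number of support vectors as a simple complexity indicator; the figures it produces are not the values reported above.

```python
import time
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(size=(10000, 20))        # stand-in for normal-class training vectors
X_test = rng.normal(size=(1200, 20))          # roughly a 10% held-out test set

model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.01).fit(X_train)

t0 = time.perf_counter()                      # a Python analogue of MATLAB's tic/toc timing
_ = model.decision_function(X_test)
elapsed_ms = (time.perf_counter() - t0) * 1e3

print(f"testing time: {elapsed_ms:.1f} ms")
print(f"number of support vectors: {len(model.support_)}")   # a simple complexity indicator
```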
Conclusion
In this study, we explored the promise of conventional machine learning compared to deep learning for anomaly detection in SONs. Anomaly detection for cell outages in communication networks has been a popular application area for deep learning. However, as in other domains, conventional methods can still provide strong statistical alternatives given the right learning representations. In this paper, we focused on SVMs with one-class and binary learning scenarios on a previously published and publicly available dataset. We found that while deep learning was highly competitive, standard SVMs using RBF kernels can be trained to outperform a deep autoencoder approach. Both one-class and binary classification can benefit immensely from synthetic augmentation of the dataset using SMOTE, with improvements in detection accuracy of as much as 15% on average over four different application scenarios.
Future work will study the impact of augmentation on other learning algorithms, specifically statistical deep learning, such as variational auto-encoders. Work presented in this paper can further be extended to other applications beyond anomaly or outage detection. Specifically, there has been increased attention to modulation detection in next generation mobile wireless networks where fast, robust, and light machine learning models could enable time-critical applications in signal classification and modulation detection. Improvements in speed can be realized both at the algorithm level and data preprocessing stages using techniques such as principal component analysis to identify the most relevant features for classification and detection. Finally, statistical learning algorithms, such as Gaussian Process Regression, which have gained immense popularity as alternatives to deep learning can be applied to different scenarios especially when data is not present in sufficiently large volumes to properly train DL models with many parameters.