
An Online Network Intrusion Detection Model Based on Improved Regularized Extreme Learning Machine



Abstract:

Extreme learning machine (ELM) is a novel single-hidden layer feedforward neural network to obtain fast learning speed by randomly initializing weights and deviations. Due to its extremely fast learning speed, it has been widely used in training of massive data in recent years. In order to adapt to the real network environment, based on the ELM, we propose an improved particle swarm optimized online regularized extreme learning machine (IPSO-IRELM) intrusion detection algorithm model. First, the model replaces the traditional batch learning with sequential learning by dynamically adapting the new data obtained in the training network instead of training all collected samples in an offline manner; second, we improve the particle swarm optimization algorithm and compare it with typical improved algorithms to prove its effectiveness; finally, to solve the random initialization problem of IRELM, we use IPSO to optimize the initial weights and deviations of IRELM to improve the classification ability of IRELM. The experimental results show that IPSO-IRELM algorithm has better generalization ability, which not only improves the accuracy of intrusion detection, but also has certain recognition ability for minority class samples.
Published in: IEEE Access (Volume: 9)
Page(s): 94826 - 94844
Date of Publication: 29 June 2021
Electronic ISSN: 2169-3536

This article is licensed under a Creative Commons Attribution 4.0 License (CC BY); see https://creativecommons.org/licenses/by/4.0/.
SECTION I.

Introduction

The widespread use of information technology and the emergence and development of cyberspace have greatly contributed to economic and social prosperity and progress, but they have also brought new security risks and challenges. A network intrusion detection system is the first step of network security situation awareness and an important part of comprehensive network security defense: by analyzing the traffic data of key nodes in the network, it perceives whether there are behaviors and signs of network intrusion, so that network defenses can be fully prepared [1]. Traditional network intrusion detection systems build models based on pattern matching methods [2], [3], first collecting attack samples and then training on all collected samples in an offline manner; this has the inherent drawbacks of failing to detect emerging types of attacks and lacking adaptiveness and scalability. With the development of network automation and intelligence, new types of adaptive and dynamically scalable intrusion detection systems have emerged [4], [5]. With the advantages of adaptiveness, self-learning, self-organization, good fault tolerance, and the ability to perform massively parallel computation and nonlinear mapping, neural networks are well suited to such variable intrusion detection systems [6]. Moreover, the most important issue in switching from offline to online learning is time [7]: if learning takes too long, an online intrusion detection system cannot detect attacks in time, which can paralyze the system. Therefore, this paper builds on the extreme learning machine (ELM), which is characterized by fast learning.

The extreme learning machine [8] is a typical classification algorithm in machine learning, characterized by randomly selected input layer weights and hidden layer deviations. It is based on a single hidden layer feedforward neural network and computes the output layer weights analytically according to Moore-Penrose generalized inverse matrix theory, and thus has the advantages of few training parameters, fast learning speed, and strong generalization ability. Its good generalization performance and extremely fast learning speed have led to successful applications in many real-world problems [9], [10]. However, ELM is vulnerable to interference from outlier sample points during training, which affects the classification accuracy of the model. Therefore, José et al. [11] proposed a regularized extreme learning machine (RELM) based on ELM. RELM considers the structural error while solving the least squares error, which effectively avoids the overfitting problem caused by an excessive number of hidden nodes and can further improve classification performance. Although RELM alleviates overfitting compared with ELM, it has its own problems; for example, the selection of an appropriate regularization factor is random and time-consuming. Several scholars have studied improvements of RELM: to automatically select a satisfactory regularization factor, Zhang et al. [12] proposed an adaptive RELM with a function instead of a fixed regularization factor; Gautam et al. [13] proposed an ELM with a regularization kernel, used a one-class ELM classifier based on the regularized kernel to detect outliers, and extended it to adaptive online learning, with experimental results showing that the classifier learns faster and is more suitable for real-time anomaly detection; reference [14] proposed a new binary grey wolf optimization-regularized extreme learning machine wrapper; reference [15] used an adaptive whale optimization algorithm (AWOA) to determine the input weights and hidden layer deviations of ELM; Kumar et al. [16] and Zhi et al. [17] proposed a biogeography-based optimization ELM (BBO-ELM) model and a GAPSO-enhanced ELM method, respectively, and compared them with genetic algorithm (GA)-based ELM and particle swarm optimization (PSO)-based ELM to verify the effectiveness of the proposed methods. Kanimozhi and Singaravel [18] and Wang et al. [19] combined PSO with ELM and applied it to different scenarios, both with good results. However, if the improved RELM is used directly for intrusion detection, it still relies on batch learning, training all obtained samples at once; it subsequently learns no new knowledge and still fails to detect new attacks in the network.

To address this problem, this paper further proposes an online regularized ELM (IRELM) based on RELM, which has the ability of dynamic sequential learning and adaptively learns from the constant flow of traffic in the network for intrusion detection. Meanwhile, we improve the PSO algorithm to reduce the probability of falling into local extrema, which in turn better optimizes IRELM. As the experiments show, IPSO-IRELM has better classification performance throughout. The research in this paper improves the adaptiveness and expandability of intrusion detection systems.

The main contributions of this paper are as follows:

  1. Proposing an online regularized ELM based on RELM, which dynamically trains on newly added data instead of using traditional batch learning.

  2. Adding a perturbation mechanism to PSO; the improved PSO is compared with other classical improved PSO variants to prove its effectiveness.

  3. Using IPSO to optimize the initial weights and deviations of IRELM, avoiding the impact of randomly generated initial weights and biases on the final results.

The complete model is applied to intrusion detection and shows better recognition capability.

The remainder of this paper is organized as follows.

In Section II, we present the current studies related to our work and summarize the parameters used by other techniques, and their limitations, in Table 1. In Section III, we describe the IPSO-IRELM algorithm in detail. In Section IV, we present comparison results between IPSO-IRELM and other algorithms on the UCI datasets, the NSL-KDD binary classification dataset, the NSL-KDD multiclass classification dataset, and the UNSW-NB15 multiclass classification dataset. In Section V, we summarize the research of this paper and point out future research directions.

TABLE 1. Comparison of Various Improved ELM Models

SECTION II.

Related Work

The RELM was proposed mainly to solve the problem that the standard ELM is affected by outlier points, resulting in low generalization ability and lack of stability [20]. Regularization is essentially a structural risk minimization strategy that adds a regularization term representing the complexity of the model to the empirical risk. The standard mathematical model of RELM is represented as follows:\begin{equation*} \min \frac{1}{2}\left\| \boldsymbol{\beta} \right\|_{p}^{\sigma_{1}} +\frac{C}{2}\left\| \boldsymbol{\xi} \right\|_{q}^{\sigma_{2}}\tag{1}\end{equation*} where $\sigma_{1}>0$, $\sigma_{2}>0$; $p,q=0,\frac{1}{2},1,2,\ldots,+\infty$; $\|\cdot\|_{p}$ is the $L_{p}$ norm of a vector or matrix; $\left\| \boldsymbol{\beta} \right\|_{p}^{\sigma_{1}}$ is the regularization term, indicating the complexity of the model; $\left\| \boldsymbol{\xi} \right\|_{q}^{\sigma_{2}}$ is the total training error and represents the empirical risk; $C$ is a regularization parameter that balances the empirical risk and the model complexity. When $\sigma_{1}=\sigma_{2}=2$, (1) is a quadratic programming problem under equality constraints. When the regularization term is the $L_{2}$ norm or the $L_{1}$ norm of the parameter vector, the model is called ridge regression ($L_{2}$ regularization) or LASSO ($L_{1}$ regularization), respectively; these are the two most typical regularization methods. Deng et al. [21] studied $L_{2}$ RELM with Sigmoid hidden layer neurons and proposed the Unweighted RELM and Weighted RELM (WRELM) algorithms for datasets containing noise. WRELM uses weighted least squares to calculate the output weights, which provides some resistance to noise, but the training process adds the computation of error weights, which increases time consumption when the training data is large. Huang et al. [22] proposed the Semi-Supervised ELM (SS-ELM) and Unsupervised ELM (US-ELM) algorithms based on manifold regularization theory to exploit the relationships among unlabeled samples, which greatly extended the applicability of ELM. Yu and Sun [23] proposed a Sparse coding ELM algorithm (ScELM), which uses a sparse coding technique instead of random mapping to map the input feature vector to the hidden layer to improve classification accuracy; an optimization method based on gradient projection and the $L_{2}$ norm is used in the coding stage, while the output weights are derived by the Lagrange multiplier method. Zhao et al. [24] proposed the Robust ELM algorithm (RRELM) by introducing both the deviation and the variance of the model into the objective function for optimization while keeping the $L_{2}$ penalty term constant; RRELM considers both the variance and the deviation of the model and seeks the best compromise between them to enhance generalization performance and robustness. For the classification of imbalanced data, Xiao et al. [25] proposed the Class-specific Cost Regulation ELM algorithm (CCR-ELM), which achieves a compromise between the number of misclassified samples and the generalization ability of the model by applying different penalty factors to misclassified samples of different classes.
However, since the number of hidden layer nodes, positive and negative sample weights, and kernel parameters in CCR-ELM have a large impact on the performance of the model, how to develop a more effective method to determine these parameters needs to be further investigated.

To address the problem that the input weights and hidden layer deviations of ELM are randomly generated, some scholars have also optimized the network structure of ELM. Zhu et al. [26] introduced differential evolution into ELM and proposed an Evolutionary ELM algorithm (E-ELM), which uses structurally simple differential mutation and crossover operators and searches for optimal input weights and hidden layer deviations through dynamic adjustment of the population, obtaining a more compact network structure. Xu and Shu [27] used the good global search ability of particle swarm optimization to optimize the hidden layer neurons of ELM and proposed a PSO-ELM algorithm, which encodes the input weights and hidden layer deviations of ELM as particles in the PSO search space; in each iteration, every particle updates its position using its own historical best solution and the current global best solution of the whole population, thereby searching the solution space for the optimal value. The effectiveness of PSO optimization depends on its topology. Figueiredo and Ludermir [28] studied the effect of eight different topologies on PSO-ELM performance, and the results did not identify a single topology that is best for all problems.

The above research on RELM shows that introducing regularization into ELM can alleviate overfitting to a certain extent and improve the robustness and generalization ability of the model. However, the learning efficiency of the algorithm is reduced because the regularization parameter added to the objective function must also be optimized. Studies on optimizing the network structure of ELM show that there is no topology that is optimal for all problems, but there is always a topology suitable for a particular problem. Meanwhile, none of the above studies trains the data in a dynamic way. Therefore, this paper combines RELM with optimization of the ELM network structure and proposes an online regularized extreme learning machine model, which not only solves the initialization problem of ELM and improves the generalization ability of the model, but also transforms offline learning into online learning without reducing the learning efficiency of the algorithm, making it more adaptable to modern intrusion detection. We summarize the current partial RELM models, the optimized ELM network structure models, and our proposed model in Table 1, comparing the algorithmic ideas, regularization methods used, feature mapping, robustness, evaluation data, and drawbacks.

SECTION III.

IPSO-IRELM Algorithm

The idea of RELM is to solve the ill-conditioning of the hidden layer matrix, which makes ELM fail, by constraining the norm of the output weights of ELM, sacrificing unbiasedness to alleviate overfitting and thus improving overall generalization ability [29]. However, RELM still suffers from the generic problem of ELM: random input weights and hidden layer deviations can be unreliable and unstable, affecting classification performance. Therefore, we use an improved classical PSO algorithm for the initialization of RELM and, to match the practical setting, replace the batch learning idea with sequential learning, so that the matrix multiplications and inverse decompositions in the output weight calculation are updated incrementally, while adding mechanisms to reduce the computational load and maintain the efficiency of RELM.

A. IRELM Based on Sequential Learning

RELM adds a regularization parameter to ELM to adjust the coefficient $\boldsymbol{\beta}$. The objective function is as follows:\begin{equation*} \min \left\| \boldsymbol{H\beta}-\boldsymbol{T} \right\|^{2}+\frac{1}{C}\left\| \boldsymbol{\beta} \right\|^{2}\tag{2}\end{equation*} where $\boldsymbol{H}$ is the hidden layer output matrix, $\boldsymbol{\beta}$ is the output weight matrix, and $\boldsymbol{T}$ is the target matrix. Therefore, the output weights can be expressed as:\begin{align*} \boldsymbol{\beta}&=\boldsymbol{H}^{\dagger}\boldsymbol{T}=\left(\boldsymbol{H}^{T}\boldsymbol{H}+\frac{\boldsymbol{I}}{C}\right)^{-1}\boldsymbol{H}^{T}\boldsymbol{T} \\ &=\boldsymbol{H}^{T}\left(\boldsymbol{H}\boldsymbol{H}^{T}+\frac{\boldsymbol{I}}{C}\right)^{-1}\boldsymbol{T}\tag{3}\end{align*}

When the input data size $n$ is less than the number of hidden layer neurons $L$, the dimension of $\left(\boldsymbol{H}\boldsymbol{H}^{T}+\frac{\boldsymbol{I}}{C}\right)^{-1}$ is $n\times n$, whereas the dimension of $\left(\boldsymbol{H}^{T}\boldsymbol{H}+\frac{\boldsymbol{I}}{C}\right)^{-1}$ is $L\times L$. Therefore, when $n < L$, IRELM is constructed from $\boldsymbol{H}^{T}\left(\boldsymbol{H}\boldsymbol{H}^{T}+\frac{\boldsymbol{I}}{C}\right)^{-1}\boldsymbol{T}$.
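For illustration, the two equivalent closed forms in (3) can be computed as follows. This is a minimal NumPy sketch under our reading of the notation ($\boldsymbol{H}$ is the $n\times L$ hidden layer output matrix, $\boldsymbol{T}$ the $n\times m$ target matrix), not the implementation used in this paper.

import numpy as np

def relm_output_weights(H, T, C):
    # Sketch of (3): beta = (H^T H + I/C)^{-1} H^T T, or equivalently
    # beta = H^T (H H^T + I/C)^{-1} T; pick whichever inverse is smaller.
    n, L = H.shape
    if n < L:
        # n x n inverse is cheaper when there are fewer samples than hidden nodes
        K = H @ H.T + np.eye(n) / C
        return H.T @ np.linalg.solve(K, T)
    # L x L inverse is cheaper once n >= L
    K = H.T @ H + np.eye(L) / C
    return np.linalg.solve(K, H.T @ T)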

Define a recursive hidden layer output matrix $\boldsymbol{H}_{k}$, where $k$ is 0 or any positive integer, representing the number of sequential updates of the matrix. Assuming that the first output matrix is $\boldsymbol{H}_{0}$, according to (3) the inverse matrix is obtained as follows:\begin{equation*} \boldsymbol{K}_{0}^{-1} =\left(\boldsymbol{H}_{0} \boldsymbol{H}_{0}^{T} +\frac{\boldsymbol{I}}{C}\right)^{-1}\tag{4}\end{equation*} Therefore, the output weight matrix $\boldsymbol{\beta}_{0}$ is:\begin{equation*} \boldsymbol{\beta}_{0} =\boldsymbol{H}_{0}^{T} \boldsymbol{K}_{0}^{-1} \boldsymbol{T}_{0}\tag{5}\end{equation*}

When a block of $\gamma N_{1}$ new samples arrives, $\gamma N_{1} =1,2,3,\ldots$, the new hidden layer output matrix $\boldsymbol{H}_{1}$ becomes:\begin{equation*} \boldsymbol{H}_{1} =\begin{bmatrix} \boldsymbol{H}_{0} \\ \gamma \boldsymbol{H}_{1} \end{bmatrix}\tag{6}\end{equation*} where $\gamma \boldsymbol{H}_{1}$ is the hidden layer output matrix of the new block. Similarly, according to (3), the inverse matrix $\boldsymbol{K}_{1}^{-1}$ can be constructed as:\begin{align*} \boldsymbol{K}_{1}^{-1}&=\left(\boldsymbol{H}_{1} \boldsymbol{H}_{1}^{T} +\frac{\boldsymbol{I}}{C}\right)^{-1} \\ &=\begin{bmatrix} \boldsymbol{H}_{0} \boldsymbol{H}_{0}^{T} +\dfrac{\boldsymbol{I}}{C} & \boldsymbol{H}_{0} \gamma \boldsymbol{H}_{1}^{T} \\ \gamma \boldsymbol{H}_{1} \boldsymbol{H}_{0}^{T} & \gamma \boldsymbol{H}_{1} \gamma \boldsymbol{H}_{1}^{T} +\dfrac{\boldsymbol{I}}{C} \end{bmatrix}^{-1}\tag{7}\end{align*}

Therefore, the output weight matrix can be expressed as $\boldsymbol{\beta}_{1} =\boldsymbol{H}_{1}^{T} \boldsymbol{K}_{1}^{-1} \boldsymbol{T}_{1}$. With the successive arrival of data blocks, the updates of $\boldsymbol{K}$ and $\boldsymbol{\beta}$ can be implemented as in Algorithm 1.

Algorithm 1 IRELM Update Formula When $n < L$

1. Input the first batch of samples;

2. $\boldsymbol{K}_{0}^{-1} =\left(\boldsymbol{H}_{0} \boldsymbol{H}_{0}^{T} +\frac{\boldsymbol{I}}{C}\right)^{-1}$; $\boldsymbol{\beta}_{0} =\boldsymbol{H}_{0}^{T} \boldsymbol{K}_{0}^{-1} \boldsymbol{T}_{0}$ (if necessary);

3. Assuming that $n\ge L$ first holds in step $m$:

for $i=1:m-1$ do
\begin{align*}\boldsymbol{S}_{i+1}&=\gamma \boldsymbol{H}_{i+1} \gamma \boldsymbol{H}_{i+1}^{T} +\frac{\boldsymbol{I}}{C}-\gamma \boldsymbol{H}_{i+1} \boldsymbol{H}_{i}^{T} \boldsymbol{K}_{i}^{-1} \boldsymbol{H}_{i} \gamma \boldsymbol{H}_{i+1}^{T}; \\ \boldsymbol{A}&=\boldsymbol{K}_{i}^{-1} +\boldsymbol{K}_{i}^{-1} \boldsymbol{H}_{i} \gamma \boldsymbol{H}_{i+1}^{T} \boldsymbol{S}_{i+1}^{-1} \gamma \boldsymbol{H}_{i+1} \boldsymbol{H}_{i}^{T} \boldsymbol{K}_{i}^{-1}; \\ \boldsymbol{B}&=-\boldsymbol{K}_{i}^{-1} \boldsymbol{H}_{i} \gamma \boldsymbol{H}_{i+1}^{T} \boldsymbol{S}_{i+1}^{-1}; \\ \boldsymbol{C}&=-\boldsymbol{S}_{i+1}^{-1} \gamma \boldsymbol{H}_{i+1} \boldsymbol{H}_{i}^{T} \boldsymbol{K}_{i}^{-1}; \\ \boldsymbol{D}&=\boldsymbol{S}_{i+1}^{-1}; \\ \boldsymbol{K}_{i+1}^{-1}&=\begin{bmatrix} \boldsymbol{A} & \boldsymbol{B} \\ \boldsymbol{C} & \boldsymbol{D} \end{bmatrix};\\ \boldsymbol{\beta}_{i+1}&=\boldsymbol{H}_{i+1}^{T} \boldsymbol{K}_{i+1}^{-1} \boldsymbol{T}_{i+1} \quad (\text{if necessary});\end{align*}
end
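To make the loop body of Algorithm 1 concrete, the following NumPy sketch performs one sequential step for the $n < L$ case; it reflects our reading of the block formulas above (function and variable names are ours, not the authors'). Only the small $b \times b$ matrix $\boldsymbol{S}$ is inverted, where $b$ is the size of the new block.

import numpy as np

def irelm_step_small_n(K_inv, H_prev, H_new, C):
    # One iteration of Algorithm 1 (n < L): grow K^{-1} by inverting only
    # the small (b x b) matrix S instead of the full (n+b) x (n+b) matrix.
    b = H_new.shape[0]
    G = H_prev @ H_new.T                                   # H_i (gamma H_{i+1})^T
    S = H_new @ H_new.T + np.eye(b) / C - G.T @ K_inv @ G  # S_{i+1}
    S_inv = np.linalg.inv(S)
    A = K_inv + K_inv @ G @ S_inv @ G.T @ K_inv
    B = -K_inv @ G @ S_inv
    # C = B^T because S (and hence S^{-1}) is symmetric
    K_inv_next = np.block([[A, B], [B.T, S_inv]])
    H_next = np.vstack([H_prev, H_new])                    # beta = H^T K^{-1} T if needed
    return K_inv_next, H_next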

With continuous learning, the dimension of $\boldsymbol{K}$ grows larger and larger. When $n>L$, the speed advantage of $\left(\boldsymbol{H}\boldsymbol{H}^{T}+\frac{\boldsymbol{I}}{C}\right)^{-1}$ disappears, so the update scheme is changed to $\left(\boldsymbol{H}^{T}\boldsymbol{H}+\frac{\boldsymbol{I}}{C}\right)^{-1}$. Assuming that $n\ge L$ first holds in step $m$, $\boldsymbol{K}_{m+1}^{-1}$ can be expressed as:\begin{equation*} \boldsymbol{K}_{m+1}^{-1} =\left(\boldsymbol{H}_{m+1}^{T} \boldsymbol{H}_{m+1} +\frac{\boldsymbol{I}}{C}\right)^{-1}\tag{8}\end{equation*} where:\begin{equation*} \boldsymbol{H}_{m+1} =\begin{bmatrix} \boldsymbol{H}_{m} \\ \gamma \boldsymbol{H}_{m+1} \end{bmatrix}\tag{9}\end{equation*}

When new data arrives in step $i+1$ with $i>m$, $\boldsymbol{H}_{i+1}$ is:\begin{equation*} \boldsymbol{H}_{i+1} =\begin{bmatrix} \boldsymbol{H}_{i} \\ \gamma \boldsymbol{H}_{i+1} \end{bmatrix}\tag{10}\end{equation*}

Therefore, the inverse matrix $\boldsymbol{K}_{i+1}^{-1}$ can be calculated by the Sherman-Morrison-Woodbury identity as follows:\begin{align*} \boldsymbol{K}_{i+1}^{-1} &=\left(\boldsymbol{K}_{i} +\gamma \boldsymbol{H}_{i+1}^{T} \gamma \boldsymbol{H}_{i+1}\right)^{-1} \\ &=\boldsymbol{K}_{i}^{-1} -\boldsymbol{K}_{i}^{-1} \gamma \boldsymbol{H}_{i+1}^{T} \left(\boldsymbol{I}+\gamma \boldsymbol{H}_{i+1} \boldsymbol{K}_{i}^{-1} \gamma \boldsymbol{H}_{i+1}^{T}\right)^{-1}\gamma \boldsymbol{H}_{i+1} \boldsymbol{K}_{i}^{-1} \tag{11}\end{align*}

When $n>L$ , the sequential updating formula of IRELM can be implemented according to Algorithm 2.

Algorithm 2 IRELM Update Formula When $n > L$

1. Continue from Algorithm 1;

2. Assuming that $n\ge L$ first holds in step $m$:

3. $\boldsymbol{H}_{m+1} =\begin{bmatrix} \boldsymbol{H}_{m} \\ \gamma \boldsymbol{H}_{m+1} \end{bmatrix}$;

4. $\boldsymbol{K}_{m+1}^{-1} =\left(\boldsymbol{H}_{m+1}^{T} \boldsymbol{H}_{m+1} +\frac{\boldsymbol{I}}{C}\right)^{-1}$;

5. $i=m+1$;

6. while no termination do
\begin{align*}\boldsymbol{H}_{i+1}&=\begin{bmatrix} \boldsymbol{H}_{i} \\ \gamma \boldsymbol{H}_{i+1} \end{bmatrix}; \\ \boldsymbol{K}_{i+1}^{-1}&=\boldsymbol{K}_{i}^{-1} -\boldsymbol{K}_{i}^{-1} \gamma \boldsymbol{H}_{i+1}^{T} \left(\boldsymbol{I}+\gamma \boldsymbol{H}_{i+1} \boldsymbol{K}_{i}^{-1} \gamma \boldsymbol{H}_{i+1}^{T}\right)^{-1}\gamma \boldsymbol{H}_{i+1} \boldsymbol{K}_{i}^{-1};\\ \boldsymbol{\beta}_{i+1}&=\boldsymbol{K}_{i+1}^{-1} \boldsymbol{H}_{i+1}^{T} \boldsymbol{T}_{i+1} \quad (\text{if necessary});\\ i&=i+1;\end{align*}
end
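The while-loop of Algorithm 2 is exactly the Sherman-Morrison-Woodbury update of (11); a minimal NumPy sketch of one step (our naming, not the authors' code) is:

import numpy as np

def irelm_step_large_n(K_inv, H_new):
    # One iteration of Algorithm 2 (n > L): rank-b update of the (L x L)
    # inverse K^{-1} = (H^T H + I/C)^{-1} per (11); only the small (b x b)
    # matrix M is inverted, where b is the size of the new block gamma H_{i+1}.
    b = H_new.shape[0]
    M = np.eye(b) + H_new @ K_inv @ H_new.T
    return K_inv - K_inv @ H_new.T @ np.linalg.solve(M, H_new @ K_inv)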

Algorithm 3 The Framework of IPSO

1. for each particle $k$ do
  Initialize velocity $v_{kd}$ and position $s_{kd}$ for particle $k$;
  Evaluate particle $k$ and set $p_{kd}=s_{kd}$;
end
$p_{gd} =\min \left\{p_{kd}\right\}$;

2. while not stop
  for $k=1:a$
   Update the velocity and position of particle $k$ using (14) and (13);
   Evaluate particle $k$;
   if fit($s_{kd}$) < fit($p_{kd}$) then $p_{kd}=s_{kd}$;
   if fit($p_{kd}$) < fit($p_{gd}$) then $p_{gd}=p_{kd}$;
  end

3. Select the top $S$ elite particles and apply adaptive mutation to them by (16);

4. Get the current new position;

5. If the new position is better than before, update it;
end

According to Algorithm 1 and Algorithm 2, the advantages of IRELM are as follows:

  1. Dimension reduction of the inverse matrix. In RELM, the most computationally expensive step is the calculation of $(\boldsymbol{H}^{T}\boldsymbol{H})^{-1}$ and $\boldsymbol{K}_{i+1}^{-1}$. In IRELM's updating scheme, the dimension of the inverse matrix $\boldsymbol{S}_{i+1}^{-1}$ is only $\gamma N_{i} \times \gamma N_{i}$, which is far smaller than $L\times L$, avoiding the most complex part of RELM.

  2. Computing $\boldsymbol{\beta}$ only when necessary. The calculation of $\boldsymbol{\beta}_{i+1}$ does not depend on its previous value, so it is performed only when a prediction is needed, which reduces the computation cost.

  3. Batch learning is replaced by sequential learning, which better matches the actual situation and maintains both the generalization ability of RELM and the efficiency of the algorithm.

B. Design of the PSO Algorithm

Particle swarm optimization [30] is a classic heuristic algorithm whose main idea comes from the predatory behavior of a flock of birds. The core is to use the information shared by individuals in the group so that the movement of the whole group evolves from disorder to order in the problem-solving space, yielding the optimal solution of the problem. Suppose there is a group of particles in a $D$-dimensional search space, and each particle has an initial velocity $v_{k}$, initial position $s_{k}$, and fitness value $g_{k}$. In each iteration, each particle continually updates its own velocity and position, and the fitness value is used to determine the updated optimal position $p_{k}$ of the individual and the optimal position $p_{g}$ of the population. Assuming that in the first iteration the optimal position is the initial position of the particles, the update formulas for the velocity and position of the particles in the population are as follows:\begin{align*} v_{kd} (u+1)&=zv_{kd} (u)+c_{1} r_{1} (p_{kd} (u)-s_{kd} (u)) \\&\quad + c_{2} r_{2} (p_{gd} (u)-s_{kd} (u))\tag{12}\\ s_{kd} (u+1)&=s_{kd} (u)+v_{kd} (u+1)\tag{13}\end{align*} where $k$ denotes the $k$th particle in the population; $u$ is the current iteration number; $z$ is the inertia factor, whose value is nonnegative: when $z$ is large, the global search ability is strong, and when $z$ is small, the global search ability weakens; $c_{1}$ and $c_{2}$ are the individual and social learning factors of the particles, both nonnegative constants; $r_{1}$ and $r_{2}$ are independent random numbers in the range [0, 1].
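As a concrete illustration of (12) and (13), one vectorized PSO iteration can be sketched as follows; the parameter values are illustrative defaults, not the settings used in the experiments.

import numpy as np

def pso_step(v, s, p_best, g_best, z=0.7, c1=1.5, c2=1.5, rng=None):
    # One iteration of (12)-(13). v, s, p_best: (num_particles, D) arrays;
    # g_best: (D,) global best position. z, c1, c2 are assumed values.
    rng = np.random.default_rng() if rng is None else rng
    r1, r2 = rng.random(s.shape), rng.random(s.shape)
    v_next = z * v + c1 * r1 * (p_best - s) + c2 * r2 * (g_best - s)
    return v_next, s + v_next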

To make the parameters of PSO adapt with the number of iterations, many scholars have proposed a variety of autonomous particle swarm optimization algorithms, which effectively balance the global and local search ability of particles but still cannot overcome the defect that particles easily fall into local optima. Therefore, this paper proposes a Cauchy-Gaussian mutation strategy to let particles that fall into a local optimum escape stagnation and continue searching. Second, following Clerc's constriction factor idea [31], a single global constriction factor replaces the other adaptive parameters. The improved particle velocity update formula and the Cauchy-Gaussian mutation strategy are as follows:\begin{align*} v_{kd} (u+1)&=K[v_{kd} (u)+c_{1} r_{1} (p_{kd} (u)-s_{kd} (u)) \\&\quad + c_{2} r_{2} (p_{gd} (u)-s_{kd} (u))] \tag{14}\\ K&=\frac{2}{\left|2-\varphi -\sqrt{\varphi^{2}-4\varphi}\right|},\quad \varphi =c_{1} +c_{2},~\varphi >4\tag{15}\\ x_{kd} (u)&=s_{kd} (u)\left[1+\lambda_{1} Cauchy(0,\sigma^{2})+\lambda_{2} Gauss(0,\sigma^{2})\right]\tag{16}\\ \sigma &=\begin{cases} 1,& f(s_{best})< f(s_{kd} (u)) \\ \exp \left(\dfrac{f(s_{best})-f(s_{kd})}{\left|f(s_{best})\right|}\right),& \text{otherwise} \end{cases}\tag{17}\end{align*} where $s_{best}$ is the optimal individual position; $x_{kd}(u)$ is the position of the optimal individual after mutation; $\sigma$ is the standard deviation; $\lambda_{1}$ and $\lambda_{2}$ are adaptively adjusted dynamic parameters. During the search, $\lambda_{1}$ is larger in the initial stage and then gradually decreases, so that the algorithm can explore a wide range with a large mutation step; $\lambda_{2}$ increases, which helps the algorithm search near the optimal solution.
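A small sketch of the constriction factor (15) and the Cauchy-Gaussian mutation (16)-(17) follows; treating $\sigma$ as the scale of both distributions and guarding the division by $|f(s_{best})|$ are our assumptions, not details from the paper.

import numpy as np

def constriction_factor(c1=2.05, c2=2.05):
    # Clerc's constriction factor K from (15); requires phi = c1 + c2 > 4.
    phi = c1 + c2
    return 2.0 / abs(2.0 - phi - np.sqrt(phi * phi - 4.0 * phi))

def mutate_elite(s_elite, f_best, f_elite, lam1, lam2, rng=None):
    # Cauchy-Gaussian mutation of an elite position per (16)-(17).
    rng = np.random.default_rng() if rng is None else rng
    if f_best < f_elite:
        sigma = 1.0
    else:
        # epsilon guards f(s_best) = 0 (our addition, not in the paper)
        sigma = np.exp((f_best - f_elite) / (abs(f_best) + 1e-12))
    noise = (lam1 * sigma * rng.standard_cauchy(s_elite.shape)
             + lam2 * rng.normal(0.0, sigma, s_elite.shape))
    return s_elite * (1.0 + noise)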

The IPSO pseudo code is given in Algorithm 3 above.

C. Intrusion Detection Algorithm Based on IPSO-IRELM

Based on the above analysis and derivation, we introduce IPSO into the IRELM algorithm and establish an intrusion detection model based on IPSO-IRELM. The role of IPSO is to find the particle swarm position with the best fitness and, after the iterations, assign it as the optimal initial weights and deviations of IRELM to establish the intrusion detection model. The model is shown in Fig. 1.
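To make the coupling concrete, the sketch below shows one way a particle could be decoded into IRELM's input weights and deviations and scored; the fitness definition (validation error rate) and the helper names train_irelm and predict are our assumptions, since the paper's exact fitness function is not reproduced here.

import numpy as np

def decode_particle(particle, n_features, n_hidden):
    # Split a flat particle position into input weights W (n_features x n_hidden)
    # and hidden layer deviations b (n_hidden,).
    split = n_features * n_hidden
    W = particle[:split].reshape(n_features, n_hidden)
    return W, particle[split:]

def fitness(particle, X_val, y_val, n_hidden, train_irelm, predict):
    # Hypothetical fitness: train IRELM with the decoded W, b fixed and
    # return the validation error rate that IPSO minimizes.
    W, b = decode_particle(particle, X_val.shape[1], n_hidden)
    model = train_irelm(W, b)
    return np.mean(predict(model, X_val) != y_val)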

FIGURE 1. IPSO-IRELM algorithm intrusion detection framework.

SECTION IV.

Experimental Results

A. Parameter Setting

Experiment 1: The parameters of the classic PSO and its derived algorithms (MPSO [32], TACPSO [33], AGPSO [34]) compared with IPSO are as follows:

Among them, MPSO incorporates an asymmetric time-varying acceleration coefficient adjustment strategy, which maintains the balance between global and local search and offers better convergence and robustness than the basic PSO algorithm. In TACPSO, random velocities are added to reinitialize particle velocities so that particles do not prematurely reach zero velocity during the search; in addition, to enhance early exploration and later exploitation, exponential time-varying acceleration coefficients are introduced, giving the algorithm a better probability of finding the global optimum and the average optimum than other algorithms. The main idea of AGPSO is inspired by the diversity of individuals in bird or insect flocks: it uses different functions with different slopes, curvatures, and intercepts to tune the social and cognitive parameters of the particle swarm, endowing particles with behaviors different from those of the natural population and further alleviating the problems of getting stuck in local minima and slow convergence on high-dimensional problems. MPSO, TACPSO, and AGPSO were chosen for comparison because these three improvements are classical and effective.

In the experiments, the number of particles is set to 30 and the number of iterations to 1000; results are averaged over 50 runs.

Experiments 2-4: The IRELM neural network structure is 41-39-1, and the number of iterations is 100. The model parameters of IRELM are determined by the binary classification experiment with GA-IRELM. First, the number of hidden layer nodes is set to 10, 20, 30, 40, and 50; the results show that the model with 40 hidden units has the best detection accuracy. Then the number of hidden layer nodes is set to 36, 37, 38, 39, 40, 41, and 42; the results show that the model with 39 hidden units has the best detection accuracy. When the number of hidden layer nodes increases from 39 to 42, the intrusion detection accuracy decreases. With the hidden layer nodes unchanged, increasing the number of iterations only causes overfitting and makes the detection performance of the model fluctuate.

B. Benchmark Functions and Data Sets

Experiment 1: We choose 13 benchmark functions for testing, of which F1-F7 are single-peak (unimodal) benchmark functions and F8-F13 are multimodal benchmark functions [35].

Experiment 2: The UCI data sets [36] are as follows:

Experiment 3: The NSL-KDD data set [37] is as follows:

Experiment 4: The UNSW-NB15 data set [38] is as follows:

C. IPSO Experiment Results

To verify that IPSO has a better optimization effect, we run experiments on the 13 benchmark functions in Table 3 and check whether IPSO is applicable in any dimension, from both the low-dimensional and high-dimensional perspectives. Fig. 2 shows the thirteen benchmark functions and the convergence curves of the algorithms in 30 dimensions. Table 7 and Table 8 show the best result, average result, and standard deviation of the optimization results for the unimodal and multimodal functions, respectively, in the low-dimensional case; Table 9 and Table 10 show the same statistics in the high-dimensional case. The best result represents the optimization ability of the algorithm, while the average result and the standard deviation reflect its stability and robustness. Note that the best values in Tables 7-10 are marked in bold.

TABLE 2. Updating Strategies
TABLE 3. Benchmark Functions
TABLE 4. UCI Datasets Details
TABLE 5. NSL-KDD Classification Situation
TABLE 6. UNSW-NB15 Classification Situation
TABLE 7. The Experimental Results of Single Peak When Dim = 30
TABLE 8. The Experimental Results of Multimodal When Dim = 30
TABLE 9. The Experimental Results of Single Peak When Dim = 300
TABLE 10. The Experimental Results of Multimodal When Dim = 300
FIGURE 2. Convergence curve.

According to Tables 7-10, when the dimension of the benchmark functions is 30, IPSO first obtains the minimum value on 11 of the 13 benchmark functions (all except F8 and F9); second, it obtains 8 of the 13 best average values; at the same time, IPSO's standard deviation of the optimization results is also significantly better than that of the other algorithms. When the dimension of the benchmark functions is 300, IPSO obtains 6 minimum values, 3 best average values, and 2 best standard deviations, while PSO obtains 5 minimum values, 8 best average values, and 7 best standard deviations. Comparing Table 7 with Table 8, IPSO obtains better optimization results on low-dimensional benchmark functions, but its effect on multimodal functions is not as good as on unimodal ones. Comparing Table 9 with Table 10, on high-dimensional benchmark functions the other improved PSO algorithms are inferior to the original PSO, and among the improved variants the overall effect of IPSO is relatively the best. Comparing Table 7 with Table 9 and Table 8 with Table 10, we conclude that the higher the dimensionality of the benchmark function, the worse the optimization effect; the original PSO performs better on high-dimensional benchmark functions, and the improved PSO variants are slightly inferior.

D. UCI Data Set Experimental Results

To verify the effectiveness of the proposed algorithm on real classification data sets, IPSO-IRELM is compared with PSO-IRELM, GA-IRELM, IRELM, and RELM on the UCI data sets. The performance evaluation indexes of each algorithm are listed in Tables 11-13, and the detection results of each algorithm are given in Fig. 3. As Fig. 3 shows, on both the Iris and Wine data sets the predicted values of IPSO-IRELM coincide perfectly with the true values, and the classification accuracy reaches 100%. Tables 11-13 show that the performance evaluation indexes of the other comparison algorithms are also good, but not as good as those of IPSO-IRELM; the two selected data sets are balanced, so all detection results are strong. IPSO-IRELM has the best performance evaluation indexes, which indicates that IPSO has better optimization ability than the PSO and GA algorithms and confirms the necessity of introducing IPSO into IRELM; it also verifies that the IPSO-IRELM algorithm has better classification performance. Therefore, it is further applied to network intrusion detection to verify its feasibility.

TABLE 11. Accuracy of Each Algorithm on UCI Dataset (%)
TABLE 12. Performance Evaluation Index of Each Algorithm on Iris Dataset
TABLE 13. Performance Evaluation Index of Each Algorithm on Wine Dataset
FIGURE 3. The detection results of each algorithm on UCI datasets.

E. NSL-KDD Data Set Binary Classification Experiment Results

We merge the four attack types in the NSL-KDD dataset into one Abnormal class, denoted 2, with Normal denoted 1, turning the multiclass problem into a binary classification problem. The experimental results of each algorithm are shown in Table 14 and Table 15; Fig. 4 compares the confusion matrices and ROC curves of the binary classification results. From Table 14, IPSO-IRELM has the highest accuracy among the compared algorithms, up to 91.13%; however, the NSL-KDD data set is imbalanced, so in addition to accuracy, the precision, true positive rate (TPR), false positive rate (FPR), F-score, and area under the curve (AUC) are used to evaluate classification [39]. Table 15 shows that for category 1, the TPR of SVM, ELM, RELM, IRELM, and GA-IRELM is good, above 97%, while the TPR of PSO-IRELM and IPSO-IRELM is a little lower. This is because the first five algorithms misjudge a large amount of attack data as normal data, so for category 2 their TPR is very low compared with IPSO-IRELM, whose TPR reaches 94.24%. On F-score and AUC, IPSO-IRELM is the best. In terms of precision on category 2, IPSO-IRELM is slightly worse than the other algorithms, but the difference is small.
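For reference, the per-class indexes used here follow their standard definitions and can be computed directly from confusion-matrix counts; the snippet below is a generic sketch, not the paper's evaluation code.

def binary_metrics(tp, fp, fn, tn):
    # Standard definitions of the indexes reported in Table 15.
    precision = tp / (tp + fp)
    tpr = tp / (tp + fn)   # true positive rate (recall)
    fpr = fp / (fp + tn)   # false positive rate
    f_score = 2 * precision * tpr / (precision + tpr)
    return precision, tpr, fpr, f_score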

TABLE 14. Accuracy of Each Algorithm on NSL-KDD Dataset (%)
TABLE 15. Performance Evaluation Index of Each Algorithm on NSL-KDD Dataset
FIGURE 4. Binary classification confusion matrix and ROC curve comparison diagram.

The confusion matrix summarizes the records in the data set according to the actual and predicted results to visualize performance. As Fig. 4 shows, IPSO-IRELM performs best in predicting category 2; its value in the second quadrant is the largest, which corresponds to IPSO-IRELM's TPR for category 2 in Table 15. The ROC curve reflects the relationship between TPR and FPR; it divides the plot into two parts, and the area below the curve is the AUC, which measures prediction accuracy. Fig. 4 shows that the AUC of IPSO-IRELM is the highest for both category 1 and category 2, which fully demonstrates its excellent performance and verifies that IPSO-IRELM has a better binary classification detection effect than the other algorithms. Overall, IPSO-IRELM performs best.
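ROC points and AUC values of this kind are typically computed from per-class decision scores; a toy sketch assuming scikit-learn is available (the inputs are illustrative, not the paper's data):

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1])              # toy binary labels
y_score = np.array([0.1, 0.4, 0.35, 0.8])    # toy decision scores
fpr, tpr, _ = roc_curve(y_true, y_score)     # points of the ROC curve
auc = roc_auc_score(y_true, y_score)         # area under the curve (0.75 here)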

F. NSL-KDD Data Set Multiclass Classification Experiment Results

We divide the data into five categories, Normal, Dos, Probe, U2R, and R2L, denoted 1, 2, 3, 4, and 5 respectively, turning the experiment from binary into multiclass classification. The experimental results are shown in Tables 16-21, and the confusion matrices and ROC curves of the multiclass classification are given in Fig. 5.

TABLE 16. Accuracy of Different Algorithms (%)
TABLE 17. Performance Evaluation Index of Each Algorithm on Normal
TABLE 18. Performance Evaluation Index of Each Algorithm on Dos
TABLE 19. Performance Evaluation Index of Each Algorithm on Probe
TABLE 20. Performance Evaluation Index of Each Algorithm on U2R
TABLE 21. Performance Evaluation Index of Each Algorithm on R2L
FIGURE 5. Multiple classification confusion matrix and ROC curve comparison diagram.

From Table 16, IPSO-IRELM has the highest accuracy, 85.58%. Since the NSL-KDD multiclass data set is still imbalanced, the intrusion detection capability of each algorithm is further analyzed in terms of precision, TPR, FPR, F-score, and AUC.

Table 17 shows that for Normal data, the TPR of every algorithm is good, all above 97%. For the other performance indexes, the precision, FPR, F-score, and AUC of IPSO-IRELM are the best among the comparison algorithms, indicating that IPSO-IRELM classifies Normal data better.

Table 18 shows that for Dos data, IPSO-IRELM has the highest precision, TPR, and F-score, but its FPR is a little worse than SVM, ELM, RELM, and IRELM. This is because IPSO-IRELM predicts too many samples of other types as Dos, resulting in a slightly worse FPR. On AUC, IPSO-IRELM is second only to IRELM. In general, IPSO-IRELM has a strong ability to identify Dos attacks.

From Table 19, for Probe data the TPR of IPSO-IRELM is the highest, but so is its FPR. This is because, in addition to IPSO-IRELM's better detection of Probe data, other types of data are also mistakenly detected as Probe, resulting in the highest FPR and a low precision; nevertheless, its F-score and AUC are still the highest.

Table 20 shows that for U2R data there are only 11 training samples and 200 test samples, so SVM, ELM, RELM, IRELM, GA-IRELM, and PSO-IRELM all recognize U2R data poorly. Although the TPR of IPSO-IRELM is only 5%, it is still the best, and its F-score and AUC are also the highest, with a precision of 76.92%, which no other algorithm achieves. This indicates that IPSO-IRELM also has a certain ability to detect minority-class data, though it still needs improvement.

From Table 21, for R2L data, SVM and GA-IRELM still cannot recognize R2L because it is also a minority class. The other algorithms have some recognition capability, but it is not prominent. IPSO-IRELM's TPR, F-score, and AUC are second only to RELM, but its precision is higher than RELM's. This is because RELM classifies more samples of other types as R2L, which leads to a higher TPR than IPSO-IRELM but a precision that is not as good.

From the analysis of Tables 17-21, we conclude that in multiclass classification one cannot consider individual performance indexes in isolation. When TPR conflicts with precision, comparing models becomes complicated; the F-score reconciles TPR and precision. In Tables 17-21, the F-score of IPSO-IRELM is better, indicating that IPSO-IRELM has a better classification effect than the other algorithms.

In analyzing Tables 17-21, we found that when classification errors occur, precision and TPR alone only indicate which categories are over- or under-predicted; they do not reveal into which specific categories the errors fall. Therefore, we again use the confusion matrix, whose color and brightness differences make the results easy to read: dark areas indicate more overlap between true and predicted values, and light areas the opposite. Fig. 5 shows that for each algorithm the dark areas of the first three categories are concentrated on the diagonal, while those of the last two categories are concentrated in the lower left corner, which is caused by the imbalance of the data: the last two classes have too few samples for the algorithms to recognize them well, so the recognition results tend toward category 1. Comparing the algorithms, for category 1 the classification accuracy differs little. For categories 2 and 3, the good classification performance of IPSO-IRELM stands out as the best among all comparison algorithms. For the minority categories 4 and 5, IPSO-IRELM retains some recognition ability, while that of the other algorithms is essentially zero.

The ROC curve has a major advantage: when the distribution of positive and negative samples changes, its shape remains basically unchanged, so it reduces the interference caused by different test sets and measures the performance of the model itself more objectively. Fig. 5 shows that IPSO-IRELM has the best AUC for Normal, Dos, Probe, and U2R; for R2L it is second only to RELM, with a difference of 0.00472.

In summary, on the balanced UCI data sets, the performance of IPSO-IRELM is better than that of IRELM and the traditional RELM algorithm, which shows the necessity of the improvements. On the NSL-KDD binary and multiclass classification data sets, the IPSO-IRELM algorithm is likewise verified to have better classification performance.

G. UNSW-NB15 Data Set Multiclass Classification Experiment Results

The UNSW-NB15 dataset contains patterns of modern network traffic with nine types of attacks, so this paper also uses it to verify whether the proposed algorithm classifies well in the current network environment. Since the UNSW-NB15 dataset is also imbalanced, neither the proposed algorithm nor the comparison algorithms can identify four of the attacks (Analysis, Backdoor, Shellcode, and Worms) in the experiment; the final experimental results are shown in Tables 22-28 and Fig. 6.

TABLE 22. Accuracy of Different Algorithms (%) and Time (s)
TABLE 23. Performance Evaluation Index of Each Algorithm on Normal
TABLE 24. Performance Evaluation Index of Each Algorithm on Dos
TABLE 25. Performance Evaluation Index of Each Algorithm on Exploits
TABLE 26. Performance Evaluation Index of Each Algorithm on Fuzzers
TABLE 27. Performance Evaluation Index of Each Algorithm on Generic
TABLE 28. Performance Evaluation Index of Each Algorithm on Reconnaissance
FIGURE 6. UNSW-NB15 dataset multiple classification confusion matrix.

From Table 22, IPSO-IRELM has the highest accuracy, 88.53%. IRELM has the shortest training time, which indicates that IRELM is more efficient. The training time of IPSO-IRELM increases because of the IPSO optimization, but it is no longer than that of GA-IRELM and PSO-IRELM, which indicates that our improvement of PSO does not increase its complexity. Since the UNSW-NB15 multiclass dataset is also imbalanced, the intrusion detection capability of each algorithm is again analyzed in terms of precision, TPR, FPR, F-score, and AUC.

From Fig. 6, in the final classification results of all algorithms a portion of the Exploits data is mistaken for Dos, a portion of the Fuzzers data is mistaken for Exploits, and a portion of the Normal data is mistaken for Fuzzers, which leads to the low precision in Tables 24-26. For Generic data, as shown in Table 27, every algorithm detects it well, but IPSO-IRELM has the smallest FPR, indicating that IPSO-IRELM detects Generic without mistaking other classes for it. For Normal data, as shown in Table 23, the TPR, F-score, and AUC of IPSO-IRELM are the highest, indicating that IPSO-IRELM classifies Normal data better. In summary, IPSO-IRELM not only has the highest accuracy but also no increase in training time over the other optimized variants. Although some classes of the UNSW-NB15 dataset are not recognized, it achieves the best classification results on the remaining six classes, indicating that IPSO-IRELM retains good classification ability in modern network environments.

SECTION V.

Conclusion

We propose an IPSO-IRELM intrusion detection model, which overcomes the offline learning shortcomings of traditional intrusion detection models. It uses sequential learning to classify the types of attacks in the network and jointly optimizes the initial weights and deviations of IRELM through an improved particle swarm. The experimental results show that IPSO-IRELM has clear advantages in all evaluation indexes, whether on the balanced UCI data sets, the NSL-KDD intrusion detection data set, or the UNSW-NB15 data set, which contains many new types of attacks. The next step is to study how to achieve a higher recognition rate for minority-class samples on imbalanced data sets and to apply the IPSO-IRELM algorithm to actual dynamic intrusion detection networks to test its classification effect in a real environment.
