Introduction
Pre-trained transformer-based models (PMs) have attracted significant attention in domains including computer vision and natural language processing. By pre-training in a self-supervised manner on large-scale datasets, models such as BERT [1], BART [2], and GPT2 [3] handle downstream tasks better. In the field of natural language generation, the remarkable success of ChatGPT [4] demonstrates that PMs are able to handle complex tasks. Unlike natural language, network traffic is structured, semantically explicit, and represented in binary form. Network traffic generators are used for network stress testing [5], [6], replay attacks [7], and application-layer payload generation [8]. Existing network traffic generation methods are mainly classified into script-based and model-based methods. Script-based generators allow the user to manually specify packet formats and data content through complex scripting, such as Scapy [9], DPDK pktgen [10], and Moongen [11]. Model-based generators analyze network traffic logs, create models by fitting random distributions to traffic parameters, and then generate packets that statistically resemble actual packets. Examples include MGEN [12], Swing [13], and Harpoon [14].
However, existing traffic generation tools and methods require the explicit protocol format of the target traffic and therefore cannot generate traffic whose protocol format is unknown. The goal of proprietary protocol network traffic generation is to generate traffic payloads that conform to the unknown protocol format and carry the specified semantics. This task presents several challenges: (1) real-world datasets of proprietary protocol network traffic are extremely scarce; (2) proprietary protocol formats are highly heterogeneous and varied, with different formats typically used to distinguish between functionalities and convey distinct semantic information; and (3) proprietary protocol payloads are often composed of binary data whose semantic content includes function codes and parameters consisting of real numbers, symbols, and messages. Existing tokenization methods are inadequate for maintaining the semantic integrity of binary data and real numbers.
In this paper, we present PNetGPT (Proprietary Protocol Network Traffic Generation with Pre-trained Transformer), the first attempt to significantly improve the performance of the network traffic generation task using a transformer-based pre-trained model. We collected over 700,000 pieces of network traffic data from the EzSocket and FOCAS protocols, real-world proprietary protocols used in the Industrial Internet by Mitsubishi (EzSocket) and FANUC (FOCAS) industrial manufacturing equipment. Using this dataset, we pre-trained a transformer-based model on the Masked Language Modeling (MLM) and Autoregressive Language Model (ALM) tasks, and then fine-tuned it specifically for the network traffic generation task. Experiments on these two real-world proprietary protocol datasets demonstrate that PNetGPT, which incorporates both the encoder and the decoder, achieves SOTA performance on the proprietary protocol generation task, surpassing existing natural language pre-training models. Our key contributions are as follows:
We present the first large-scale dataset of real-world proprietary protocol network traffic.
We introduce a novel tokenization method that fully preserves the semantic information of function codes, real numbers, and binary payloads.
We propose PNetGPT, the first pre-trained model specifically designed for proprietary protocol network traffic, which achieves SOTA performance in the traffic generation task.
This article is organized as follows. Section II introduces the dataset composition and the PNetGPT model construction methodology. In Section III we evaluate the quality of the dataset and the performance of PNetGPT. In Section IV we discuss and conclude this paper.
Method
In this section, we present the proprietary protocol dataset, a novel tokenization method, as well as the pre-training and fine-tuning processes of PNetGPT.
A. Dataset
We construct a real-world proprietary protocol network traffic dataset with the aim of exploring the performance of pre-trained models on the proprietary protocol generation task. This dataset also provides foundational data support for future research in areas such as proprietary protocol analysis and security assessment. The data was sourced from the Smart Manufacturing Network Experimental Platform. We used Wireshark with port filtering to specifically capture traffic of the EzSocket and FOCAS protocols, the real-world proprietary protocols used in the Industrial Internet by Mitsubishi (EzSocket) and FANUC (FOCAS) industrial manufacturing equipment. For each piece of traffic, we recorded the function name and all parameters of the corresponding Application Programming Interface (API) call. Over 120 hours (5 days) of capture, we collected a total of 700,000 pieces of traffic data covering 102 functions with varying random parameters. All captured data was initially filtered and desensitised to ensure that it does not contain any sensitive information.
The dataset contains 700,179 pieces of traffic data in total. EzSocket is a TCP-based application-layer protocol. This dataset focuses on the proprietary format carried in the TCP data field and disregards the TCP header. As shown in Table I, the EzSocket protocol can be divided into 11 function classes, covering system status reading, parameter reading and writing, sending control commands, and other functions. Each class contains several sub-functions, as shown in the second column of Table I. The payloads come in three lengths: 160, 168, and 176 bytes. We present this first large-scale dataset of proprietary protocol network traffic and will continue to enrich it with additional proprietary protocols in subsequent work.
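To make the preprocessing concrete, the sketch below shows one way the TCP payloads could be extracted from a capture with Scapy. The capture file name is a placeholder and the length filter simply mirrors the three payload sizes listed above, so this is an illustration rather than the exact pipeline used to build the dataset.

```python
from scapy.all import rdpcap, TCP, Raw

# Placeholder capture file; the real dataset was captured with Wireshark port filters.
packets = rdpcap("ezsocket_capture.pcap")

# Keep only raw TCP payloads; the EzSocket format lives in the TCP data field.
payloads = [bytes(p[Raw].load) for p in packets if p.haslayer(TCP) and p.haslayer(Raw)]

# Retain the three payload lengths described in Table I (160, 168, 176 bytes).
payloads = [p for p in payloads if len(p) in (160, 168, 176)]
print(f"{len(payloads)} candidate EzSocket payloads")
```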
B. Tokenization of Network Traffic
In NLP, the three most common subword tokenization algorithms used with Transformer models are Byte Pair Encoding (BPE) [15], WordPiece [16] and Unigram [17]. BPE is an unsupervised subword segmentation method that iteratively merges high-frequency character pairs. The Unigram method segments text by selecting the subword sequence with the highest probability, which makes it suitable for handling unknown words and multiple languages. WordPiece is a greedy subword segmentation method that splits words into smaller, meaningful subword units by dynamically constructing a subword vocabulary, and is commonly used in pre-trained models such as BERT. However, these three methods are designed primarily for textual data and lack the sensitivity and ability to handle numerical data. Real and binary data have specific numerical structures and patterns, and these methods cannot capture and utilise this information effectively. For example, for the sequence of real numbers [3.14, 2.71, 1.41], BPE may split it into meaningless subword units such as [‘3.’, ‘14’, ‘2.’, ‘71’, ‘1.’, ‘41’], which cannot reflect the correlation between values. In traffic data analysis, such segmentation results may lead to information loss, affecting the accuracy of subsequent processing.
We propose a parsimonious approach that uniformly tokenises real numbers and binary data. Binary data is converted to hexadecimal and split into subwords of two hexadecimal characters (one byte). Real numbers in text are recognised by regular expressions, converted to a hexadecimal representation, and tagged with tokens according to the type of value (integer or float). Integers are converted to hexadecimal directly and split into two-character subwords. For floating point numbers, the integer and decimal parts are concatenated and converted to hexadecimal, while the length of the decimal part is recorded and also converted to hexadecimal. All hexadecimal data is split into subwords of two characters each. Finally, we introduce [[NUM], [POS], +, -] as special symbols to denote the boundaries of a real number, the position of the decimal point, and the sign of the number. For example, ‘-3.14159’ is tokenised as ‘[NUM] - 04 cb 2f [POS] 05 [NUM]’. This approach provides a unified and structured tokenisation for real numbers and binary data, requiring only 260 subwords [00-ff, [NUM], [POS], +, -].
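A minimal Python sketch of this tokenisation scheme is given below. It is reconstructed from the description and the ‘-3.14159’ example above; details such as the sign token for positive numbers and the exact handling of plain integers are assumptions inferred from that description rather than the authors' implementation.

```python
import re

# Pattern that would be used to recognise real numbers embedded in text.
NUMBER_RE = re.compile(r"[+-]?\d+(?:\.\d+)?")

def hex_bytes(value: int) -> list:
    """Hex-encode a non-negative integer and split it into two-character (one-byte) subwords."""
    h = format(value, "x")
    if len(h) % 2:
        h = "0" + h
    return [h[i:i + 2] for i in range(0, len(h), 2)]

def tokenise_number(text: str) -> list:
    """Tokenise one recognised number, e.g. '-3.14159' -> [NUM] - 04 cb 2f [POS] 05 [NUM]."""
    sign = "-" if text.startswith("-") else "+"   # handling of positive signs is an assumption
    digits = text.lstrip("+-")
    if "." in digits:
        int_part, dec_part = digits.split(".")
        return ["[NUM]", sign, *hex_bytes(int(int_part + dec_part)),
                "[POS]", *hex_bytes(len(dec_part)), "[NUM]"]
    return ["[NUM]", sign, *hex_bytes(int(digits)), "[NUM]"]

def tokenise_binary(payload: bytes) -> list:
    """Binary payloads become one subword (two hexadecimal characters) per byte."""
    return [format(b, "02x") for b in payload]

print(tokenise_number("-3.14159"))
# ['[NUM]', '-', '04', 'cb', '2f', '[POS]', '05', '[NUM]']
```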
C. Model Architecture
We introduce PNetGPT and its detailed implementation in this section. Training the traffic generation model involves two steps: pre-training and fine-tuning. As shown in Fig. 1, we designed three Transformer-based models with different structures: PNetGPT-encoder, consisting of an encoder; PNetGPT-decoder, consisting of a decoder; and PNetGPT-en&decoder, using both an encoder and a decoder. In the pre-training phase, the model is trained on unlabelled data with the Masked Language Model (MLM) and Autoregressive Language Model (ALM) objectives. In the fine-tuning phase, the PNetGPT model is initialised with the pre-trained parameters, and all parameters are then fine-tuned using the annotated data of the traffic generation task. Since the use of the Transformer has become very common and our implementation is almost identical to the original, we omit an exhaustive description of the model architecture and refer the reader to the guide [18]. As shown in Table II, PNetGPT-encoder consists of 12 encoder blocks with 12 self-attention heads, a hidden layer size of 768, and about 43.8M parameters. We set the same number of blocks, attention heads, and hidden layer size for PNetGPT-decoder and PNetGPT-en&decoder so that the three models have the same number of parameters.
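As an illustration, the snippet below sketches how the PNetGPT-en&decoder variant could be instantiated in PyTorch with the Table II settings (12 blocks, 12 attention heads, hidden size 768). The embedding, the omitted positional encoding, the output head, the exact vocabulary size, and the encoder/decoder layer split are assumptions for readability, not the authors' implementation.

```python
import torch.nn as nn

VOCAB = 262                     # the 260 subwords plus assumed [PAD] and [MASK] tokens
D_MODEL, HEADS, LAYERS = 768, 12, 12

class PNetGPTEnDecoder(nn.Module):
    """Encoder-decoder variant; positional encodings are omitted for brevity."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        self.transformer = nn.Transformer(
            d_model=D_MODEL, nhead=HEADS,
            num_encoder_layers=LAYERS, num_decoder_layers=LAYERS,
            batch_first=True)
        self.lm_head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, src_ids, tgt_ids):
        # Causal mask so each target position attends only to earlier target positions.
        tgt_mask = self.transformer.generate_square_subsequent_mask(
            tgt_ids.size(1)).to(tgt_ids.device)
        hidden = self.transformer(self.embed(src_ids), self.embed(tgt_ids), tgt_mask=tgt_mask)
        return self.lm_head(hidden)
```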
D. Pre-training PNetGPT
We pre-train PNetGPT using two unsupervised tasks, MLM [1] and ALM [19], described in this section.
1) Task 1: Masked Language Model (MLM)
In the MLM phase, input sequences undergo preprocessing where 15% of tokens are randomly selected and replaced with a special [MASK] token. PNetGPT-encoder and PNetGPT-en&decoder have attention layers capable of capturing bi-directional contextual information. The primary objective of this task is to predict the original values of the masked tokens based on the surrounding unmasked context. The cross-entropy loss function is employed to compute prediction errors, with model parameters subsequently updated through a back-propagation algorithm.
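The masking step can be summarised with the small sketch below. It applies the plain 15% [MASK] replacement described above (no 80/10/10 split), and the ignore index and token IDs are illustrative assumptions.

```python
import torch

def mask_tokens(input_ids, mask_token_id, mask_prob=0.15, ignore_index=-100):
    """Randomly replace 15% of tokens with [MASK]; labels keep originals only at masked positions."""
    labels = input_ids.clone()
    masked = torch.rand(input_ids.shape) < mask_prob
    labels[~masked] = ignore_index          # cross-entropy ignores unmasked positions
    corrupted = input_ids.clone()
    corrupted[masked] = mask_token_id
    return corrupted, labels

# Usage: loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1), ignore_index=-100)
```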
2) Task 2: Autoregressive Language Model (ALM)
In the ALM phase, text sequences are input into the model in a strict left-to-right order. PNetGPT-decoder adopts an autoregressive approach, where predictions at each time step are based solely on preceding tokens. The primary objective of this task is to predict the probability distribution of the subsequent token at each step in the sequence. As with the MLM phase, a cross-entropy loss function is used to assess prediction accuracy, and model parameters are refined using an optimization algorithm.
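Equivalently, the ALM objective reduces to shifted next-token prediction, as in the generic sketch below (not the authors' code):

```python
import torch.nn.functional as F

def alm_loss(logits, input_ids):
    """Each position predicts the following token, enforcing strict left-to-right generation."""
    shift_logits = logits[:, :-1, :]          # predictions for positions 0 .. T-2
    shift_labels = input_ids[:, 1:]           # targets are the tokens at positions 1 .. T-1
    return F.cross_entropy(shift_logits.reshape(-1, shift_logits.size(-1)),
                           shift_labels.reshape(-1))
```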
This dual approach aims to leverage the strengths of both strategies, potentially enhancing the model’s contextual understanding and sequential generation capabilities.
E. Fine-tuning PNetGPT
We address the proprietary protocol network traffic generation task by fine-tuning PNetGPT. Given the function name and parameters provided by the proprietary protocol API, the task is to generate the payload of the corresponding TCP packet. We fine-tune for 3 epochs with a learning rate of 2e-5 and a batch size of 64.
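A schematic of the fine-tuning loop with these hyperparameters is shown below; `model`, `encode_api_call`, `encode_payload`, and `train_pairs` are hypothetical names standing in for the PNetGPT model, the tokeniser from Section II-B, and the annotated (API call, payload) pairs.

```python
from torch.optim import AdamW
from torch.utils.data import DataLoader

# Hypothetical dataset of (API call, payload) pairs produced by the tokeniser above.
loader = DataLoader(train_pairs, batch_size=64, shuffle=True)
optimizer = AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):
    for api_calls, payloads in loader:
        # Input: function name and parameters; target: the corresponding TCP payload tokens.
        loss = model(input_ids=encode_api_call(api_calls),
                     labels=encode_payload(payloads)).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```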
Evaluation
In this section, we present a comprehensive evaluation of the effectiveness of PNetGPT on the large-scale proprietary protocol network traffic dataset. We compare our method with a state-of-the-art generative pre-training model. To ensure a sound assessment, we evaluate across several dimensions and metrics. All experiments were conducted on NVIDIA V100 32 GB GPUs, and approximately 50 GPU-days were used for model training and evaluation.
A. Baselines and Datasets
We compare PNetGPT-encoder, PNetGPT-decoder, and PNetGPT-en&decoder with T5 [20]. We use the large-scale proprietary protocol network traffic dataset presented in this paper. The 700,000 pieces of data are divided into three subsets: 500,000 for pre-training, 100,000 for fine-tuning, and 100,000 for evaluating the model.
B. Evaluation Metrics
We select METEOR [21], BLEU [22], and ROUGE [23], three common evaluation metrics used in NLP, to measure generation quality. BLEU measures the precision of the output by comparing the n-gram overlap between the generated output and the reference. As shown in Equation 1, $\text{len}_{\text{ref}}$ is the length of the reference, $\text{len}_{\text{MT}}$ is the length of the output, $\text{Num}_{\text{clip}}(n\text{-gram})$ is the clipped number of n-grams in the output that also occur in the reference, $\text{Num}_{\text{MT}}(n\text{-gram})$ is the total number of n-grams in the output, and $N$ is the maximum n-gram length.
\begin{equation*}\text{BLEU} = \exp\left(\min\left(0,\ 1 - \frac{\text{len}_{\text{ref}}}{\text{len}_{\text{MT}}}\right)\right)\prod_{n=1}^{N}\left(\frac{\text{Num}_{\text{clip}}(n\text{-gram})}{\text{Num}_{\text{MT}}(n\text{-gram})}\right)^{\frac{1}{N}} \tag{1}\end{equation*}
ROUGE primarily emphasizes recall, assessing how effectively the n-grams in the reference are captured. METEOR integrates precision, recall, and fluency while also considering the sequence of words.
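For reference, these three metrics can be computed with standard open-source implementations, as in the sketch below (NLTK for BLEU and METEOR, the `rouge-score` package for ROUGE); the example token strings are illustrative, not taken from the dataset.

```python
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.meteor_score import meteor_score   # requires the NLTK WordNet corpus
from rouge_score import rouge_scorer

reference = "53 02 00 a0 01 00".split()   # illustrative tokenised payloads
candidate = "53 02 00 a0 01 00".split()

bleu = sentence_bleu([reference], candidate)
meteor = meteor_score([reference], candidate)
rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(
    " ".join(reference), " ".join(candidate))["rougeL"].fmeasure

print(bleu, meteor, rouge_l)
```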
C. Effectiveness of PNetGPT
We evaluated each class of EzSocket functions separately. As shown in Table III, ‘POS, SYS, …, ATC’ are the EzSocket function classes, corresponding one-to-one to Table I, and ‘ALL’ represents the overall evaluation of the model. We compare the performance of PNetGPT with T5 (the baseline) on the proprietary protocol network traffic generation task using three metrics: METEOR, BLEU, and ROUGE. We fine-tuned the T5-base model from Hugging Face [24] on the proprietary protocol dataset using conventional tokenisation. The evaluation results show that T5 performs poorly on the proprietary protocol traffic generation task, demonstrating that conventional tokenisation methods cannot effectively preserve semantic information when dealing with real numbers and binary data. We then processed the dataset with the tokenisation method proposed in this paper and used it to pre-train PNetGPT. The results show that PNetGPT far outperforms T5, demonstrating that the proposed tokenisation method preserves the complete semantic information of real numbers and binary data. The best model is PNetGPT-en&decoder, because the proprietary protocol network traffic generation task is essentially a sequence-to-sequence task. The evaluations on different classes of EzSocket functions show that the performance of PNetGPT is stable across heterogeneous unknown formats.
Conclusion
We present PNetGPT, a pre-trained model with a novel tokenisation method for generating proprietary protocol network traffic, together with the first real-world dataset of proprietary protocol network traffic. Through a series of comprehensive benchmarks, we demonstrate that our model achieves state-of-the-art generation quality in the domain of proprietary protocol traffic generation and a superior understanding of real numbers and binary data. The scalability of our model for rapid application to the analysis of proprietary protocols is a significant advance.