I. Introduction
Large language models (LLMs), trained on massive amounts of data, exhibit strong language understanding and generation capabilities [1]. Built on deep learning techniques, these models can comprehend and produce human language, driving significant advances in machine translation, question-answering systems, and text summarization. In 2017, Vaswani et al. at Google proposed the Transformer, a model based on the self-attention mechanism that handles long-range dependencies more efficiently [1]. The Transformer's debut laid the groundwork for large-scale pre-trained language models. In 2018, Devlin et al. introduced BERT (Bidirectional Encoder Representations from Transformers), a pre-trained language representation model built on the Transformer architecture that captures linguistic context more effectively through bidirectional training [2]. BERT not only achieved unprecedented results across a wide range of NLP tasks but also laid the foundation for subsequent, more advanced research and model development. Since the emergence of BERT and its variants, large-scale language models have become essential tools in both research and practical applications within natural language processing.
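As a brief illustration of the mechanism referenced above, the scaled dot-product attention at the core of the Transformer [1] can be written as

Attention(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,

where Q, K, and V denote the query, key, and value matrices derived from the input tokens and d_k is the key dimension. Because every token attends to every other token in a single step, dependencies between distant positions can be modeled without the sequential propagation required by recurrent architectures.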