CodeFuse-13B: A Pretrained Multi-Lingual Code Large Language Model


Abstract:

Code Large Language Models (Code LLMs) have gained significant attention in the industry due to their wide applications in the full lifecycle of software engineering. However, the effectiveness of existing models in understanding non-English inputs for multi-lingual code-related tasks is still far from well studied. This paper introduces CodeFuse-13B, an open-source pre-trained code LLM. It is specifically designed for code-related tasks with both English and Chinese prompts and supports over 40 programming languages. CodeFuse achieves its effectiveness by utilizing a high-quality pre-training dataset that is carefully filtered by program analyzers and optimized during the training process. Extensive experiments are conducted using real-world usage scenarios, the industry-standard benchmark HumanEval-x, and the specially designed CodeFuseEval for Chinese prompts. To assess the effectiveness of CodeFuse, we actively collected valuable human feedback from AntGroup's software development process, where CodeFuse has been successfully deployed. The results demonstrate that CodeFuse-13B achieves a HumanEval pass@1 score of 37.10%, positioning it as one of the top multi-lingual code LLMs with similar parameter sizes. In practical scenarios, such as code generation, code translation, code comments, and test case generation, CodeFuse performs better than other models when confronted with Chinese prompts.
Date of Conference: 14-20 April 2024
Date Added to IEEE Xplore: 18 June 2024
Conference Location: Lisbon, Portugal

1 Introduction

Code Large Language Models (Code LLMs) have attracted substantial attention in the industry owing to their broad applications throughout the entire software engineering lifecycle. The release of Copilot, powered by Codex [7], signaled the arrival of the era of intelligent code. ChatGPT [6], [27] attracted over 100 million users within two months of its launch. In recent code models such as AlphaCode [21], InCoder [13], SantaCoder [1], StarCoder [20], and Code Llama [30], the incorporation of fill-in-the-middle capabilities has proven particularly valuable for practical code completion scenarios.
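
To make the fill-in-the-middle idea concrete, the following is a minimal sketch of how a code snippet might be rearranged so that a left-to-right model learns to generate a missing middle span from its surrounding prefix and suffix. The sentinel strings and helper names here are hypothetical and chosen only for illustration; each model defines its own special tokens, and this is not the formatting used by CodeFuse or by any specific model cited above.

```python
import random

# Hypothetical sentinel strings, assumed for illustration only.
# Real models define their own special tokens in the tokenizer.
PREFIX_TOK = "<fim_prefix>"
SUFFIX_TOK = "<fim_suffix>"
MIDDLE_TOK = "<fim_middle>"


def split_for_fim(code: str, rng: random.Random) -> tuple[str, str, str]:
    """Cut `code` at two random character offsets into (prefix, middle, suffix)."""
    a, b = sorted(rng.randint(0, len(code)) for _ in range(2))
    return code[:a], code[a:b], code[b:]


def to_fim_example(prefix: str, middle: str, suffix: str) -> str:
    """Build a training string in prefix-suffix-middle order.

    The middle span is moved to the end, so ordinary next-token
    prediction teaches the model to produce it given both sides.
    """
    return f"{PREFIX_TOK}{prefix}{SUFFIX_TOK}{suffix}{MIDDLE_TOK}{middle}"


def build_infill_prompt(prefix: str, suffix: str) -> str:
    """Build an inference prompt; the model's continuation is the infill."""
    return f"{PREFIX_TOK}{prefix}{SUFFIX_TOK}{suffix}{MIDDLE_TOK}"


if __name__ == "__main__":
    snippet = "def add(a, b):\n    return a + b\n"
    p, m, s = split_for_fim(snippet, random.Random(0))
    print(to_fim_example(p, m, s))
```

In a completion scenario, the text before the cursor would play the role of the prefix and the text after it the suffix, so the model can take code on both sides of the edit point into account.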
