
Hermes: Memory-Efficient Pipeline Inference for Large Models on Edge Devices

Abstract:

The application of Transformer-based large models has achieved numerous successes in recent years. However, the exponential growth in the parameters of large models introduces a formidable memory challenge for edge deployment. Prior works addressing this challenge mainly focus on optimizing the model structure and adopting memory swapping methods. However, the former reduces inference accuracy, and the latter increases inference latency. This paper introduces PIPELoAD, a novel memory-efficient pipeline execution mechanism. It reduces memory usage by incorporating dynamic memory management and minimizes inference latency by employing parallel model loading. Based on the PIPELoAD mechanism, we present Hermes, a framework optimized for large model inference on edge devices. We evaluate Hermes on Transformer-based models of different sizes. Our experiments illustrate that Hermes achieves up to a 4.24× increase in inference speed and 86.7% lower memory consumption than the state-of-the-art pipeline mechanism for BERT and ViT models, and a 2.58× increase in inference speed and 90.3% lower memory consumption for GPT-style models.
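The core idea of overlapping layer loading with computation can be illustrated with a minimal sketch. This is a hypothetical toy, not the authors' implementation: `load_weights` and `compute` are placeholder functions, and a bounded queue stands in for the dynamic memory management described above, so that only a limited number of layers' weights are resident at once while a loader thread prefetches the next layer in parallel.

```python
import threading
from queue import Queue

NUM_LAYERS = 4

def load_weights(layer_id):
    # Placeholder for reading one layer's weights from storage.
    return {"layer": layer_id, "w": [layer_id] * 3}

def compute(x, weights):
    # Placeholder for one layer's forward pass.
    return x + sum(weights["w"])

def pipelined_inference(x):
    # A queue of size 1 bounds prefetch depth: at most one layer
    # beyond the current one is held in memory at any time.
    q = Queue(maxsize=1)

    def loader():
        for i in range(NUM_LAYERS):
            q.put(load_weights(i))  # blocks until the consumer frees a slot

    t = threading.Thread(target=loader)
    t.start()
    for _ in range(NUM_LAYERS):
        weights = q.get()   # loading of the next layer overlaps this compute
        x = compute(x, weights)
        del weights         # release the layer's memory promptly
    t.join()
    return x

print(pipelined_inference(0))  # 0 + 0 + 3 + 6 + 9 = 18
```

The `maxsize` of the queue is the knob that trades peak memory against the risk of the compute thread stalling on I/O; the paper's mechanism manages this dynamically rather than with a fixed bound.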
Date of Conference: 18-20 November 2024
Date Added to IEEE Xplore: 02 January 2025
Conference Location: Milan, Italy

I. Introduction

The Transformer architecture has profoundly transformed the landscape of deep learning and brought forward large models, with applications spreading from data centers [1] to edge devices. Large models are generally categorized into Natural Language Processing (NLP), Computer Vision (CV), and Multimodal models. NLP models are widely applied on mobile devices [2], [3], from intelligent personal assistants, like Google Assistant and Apple Siri, to real-time language translation [4]. CV models play a pivotal role in the field of autonomous driving [5], [6], where they are utilized for tasks such as real-time object detection [7], [8], lane recognition [9], and traffic signal detection [10]. By enriching robots' perception and decision-making capabilities through the integration of diverse data types [11], such as visual, auditory [12], and tactile [13] information, Multimodal large models are revolutionizing the field of robotics [14], [15].
