I. Introduction
With the rise of Machine Learning (ML) and Inference as a Service [17], [23], [56], [61], GPUs play a significant role in delivering performance. Training machine learning models is computationally heavy and runs for sustained periods of time. Inference workloads, in contrast, are short-running, which leads to under-utilization of GPU resources [27], [28], [36]. Figure 1 (left) illustrates such a scenario where two inference models temporally share a single GPU while executing, resulting in significant GPU resource under-utilization.