I. Introduction
The last decade has seen the emergence of Deep Neural Network (DNN) training as an important workload on parallel systems, including High-Performance Computing and Cloud hardware. Owing to their high accuracy, DNNs have proven very useful in many applications, such as Computer Vision and Natural Language Processing. Compared with the significant successes [1]–[3] realized in training such DNNs, relatively little attention has been paid to deploying them for inference on edge devices. Deploying these large models for inference on commodity servers, as well as in resource-constrained environments, is vital for the successful democratization of Artificial Intelligence (AI) models.