1. Introduction
Machine learning as a service (MLaaS) lets many users leverage the benefits of artificial intelligence (AI) augmented applications on their private data. However, due to growing concerns over model intellectual property (IP) protection [14], service providers often prefer to retain the model on their end rather than share it, even as a black box, with the user. Users, in turn, are often reluctant to share their personal data due to data privacy concerns. To address both concerns, various private inference (PI) methods [7], [18], [20], [21] have been proposed that leverage techniques such as homomorphic encryption (HE) [1] and secure multi-party computation (MPC) protocols to preserve the privacy of the client's data as well as the model's IP. Popular PI frameworks, including Gazelle [9], DELPHI [18], CryptoNAS [5], and Cheetah [19], build on these privacy-preserving mechanisms. However, unlike traditional inference, the latency of the non-linear ReLU operation in PI can be up to two orders of magnitude higher. In particular, PI methods generally evaluate ReLUs with Yao's Garbled Circuits (GC) [22], which demand orders of magnitude more latency and communication than linear multiply-accumulate (MAC) operations.