I. Introduction
Traditionally, the most compact hardware implementation of the modular operation considered here has been believed to be the bit-serial approach [1], in which one operand is processed one bit at a time while all bits of the other operand are accessed in parallel, so that the result is computed iteratively. In this work, we demonstrate that more compact hardware implementations of this operation are possible. Further, by exploiting the advantages of a hardware implementation, our design outperforms its software counterparts. Compact implementations may be preferred in applications where area is a major concern, for example in lightweight cryptography [2].