I. Introduction
The proliferation of highly parallel workloads such as graph processing, data analytics, and machine learning is driving the demand for massively parallel high-performance systems with a large number of processing cores, extensive memory capacity, and high memory bandwidth [1,2]. Often these workloads are run on systems composed of many discrete packaged processors connected using conventional off-package communication links. These off-package links have inferior bandwidth and energy efficiency compared to their on-chip counterparts and have been scaling poorly compared to silicon scaling [3]. As a result, the overhead of inter-package communication has been growing at an alarming pace.