I. Introduction
The Message Passing Interface (MPI) has remained the most popular parallel programming model for high-performance scientific applications for decades. However, recent studies [1] have shown that the bulk-synchronous model of MPI may not be well suited to a variety of emerging applications and communication patterns. Partitioned Global Address Space (PGAS) [2] programming models have been proposed as an attractive alternative, owing to their lightweight one-sided communication and synchronization semantics for irregular applications [3], but they remain limited in their ability to express many-task computations.