I. Introduction
The increasing number of transistors on a chip, as predicted by Moore's law, has provided a continuous, dependable improvement in processor performance for several decades. Traditionally, these additional transistors were used to increase clock speeds, but since the mid-2000s they have instead been used to increase the number of cores on a single die. Future exascale machines will continue this multicore trend but, constrained by power and heat, will need to comprise a much larger number of lower-power, lower-performance cores. Current architectures that offer this style of parallelism include Graphics Processing Units (GPUs), Intel's Xeon Phi, and Accelerated Processing Units (APUs) such as AMD's Fusion. Programming for the large number of lightweight cores these devices offer means departing from the traditional flat, distributed MPI approach in favour of a tiered programming model designed to harness both coarse- and fine-grained parallelism.