I. Introduction
Advances in multi- and many-core architectures are driving powerful changes in computing. Efficient and scalable solutions for several challenge problems in parallel computing, such as FFT [17] and sorting [18], are now available on varied architectures, including Intel CPUs [16], GPUs [18], and the IBM Cell [11]. Of these, CPU- and GPU-based solutions stand out for contrasting reasons. Current-generation GPUs offer the best performance for the price, delivering more than 1 TFLOP for as little as $400. Modern multicore CPUs are not far behind, and in some computations nearly match GPU performance. In fact, the authors of a recent work [33] present evidence that, on a class of throughput-oriented problems, the GPU is on average only three times faster than a 6-core CPU.