1 Introduction
The parallel architecture of the graphics processing unit (GPU) often allows data-parallel computations to be carried out at rates orders of magnitude greater than those offered by a traditional CPU. Enabled by increased programmability and single-precision floating-point support, the use of graphics hardware for solving non-graphical (general-purpose) computational problems began gaining widespread popularity in the early part of the last decade [10], [20], [23]. However, early approaches were limited in scope and flexibility because non-graphical algorithms had to be mapped to languages developed exclusively for graphics. Graphics hardware manufacturers recognized the market opportunity in better supporting general-purpose computations on GPUs (GPGPU) and released language extensions and runtime environments, eliminating many of the limitations found in early GPGPU solutions. Notable platforms include the Compute Unified Device Architecture (CUDA) from NVIDIA [4], Stream from AMD/ATI [2], OpenCL from Apple and the Khronos Group [9], and DirectCompute from Microsoft [8]. Since the release of these second-generation GPGPU technologies, both graphics hardware and runtime environments have grown in generality, broadening the applicability of GPGPU across domains. Today, GPUs can be found integrated on-chip in mobile devices and laptops [1], [3], [6], as discrete cards in higher-end consumer computers and workstations, and within some of the world's fastest supercomputers [22].

[Figure: Historical trends in processor performance in terms of approximate peak floating-point operations per second (FLOPS).]