I. Introduction
GPUs need high-bandwidth memory systems to support their massively parallel execution model. Current DRAM solutions such as GDDR5 [1] and 3D-stacked memory [2], [3] deliver high theoretical bandwidth. Unfortunately, contemporary GPU-compute workloads rarely reach this potential, which leads to suboptimal bandwidth utilization, performance, and power efficiency [4]. To maximize bandwidth, DRAM interfaces are organized in a four-dimensional structure of channels, banks, rows, and columns. How an application's memory access stream maps onto this structure has a significant impact on performance and power consumption. The row bits of the address should change as rarely as possible between consecutive requests to the same bank, so that accesses hit the already open row and row buffer locality stays high. The channel and bank bits, in contrast, should vary as much as possible, so that memory requests are distributed uniformly across channels and banks [5].
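To make the two mapping goals concrete, the following minimal sketch decodes addresses into channel, bank, row, and column fields and measures how an access stream behaves under one fixed mapping. The field widths, the bit order (column, channel, bank, row, from least to most significant), and the helper names `decode`, `channel_histogram`, and `row_switches` are assumptions chosen for illustration; they do not describe any particular device or the mapping of any cited work. Addresses are treated as burst-granular units rather than bytes.

```python
from collections import Counter

# Hypothetical field widths, chosen only for illustration.
# Assumed layout, least-significant bits first: column | channel | bank | row
COLUMN_BITS = 6
CHANNEL_BITS = 2
BANK_BITS = 4
ROW_BITS = 12


def decode(addr):
    """Split an address into (channel, bank, row, column) under the assumed layout."""
    column = addr & ((1 << COLUMN_BITS) - 1)
    addr >>= COLUMN_BITS
    channel = addr & ((1 << CHANNEL_BITS) - 1)
    addr >>= CHANNEL_BITS
    bank = addr & ((1 << BANK_BITS) - 1)
    addr >>= BANK_BITS
    row = addr & ((1 << ROW_BITS) - 1)
    return channel, bank, row, column


def channel_histogram(addresses):
    """Count how many requests in the stream land on each channel."""
    return Counter(decode(a)[0] for a in addresses)


def row_switches(addresses):
    """Count requests that find a different row open in their (channel, bank)."""
    open_row = {}  # (channel, bank) -> most recently accessed row
    switches = 0
    for a in addresses:
        channel, bank, row, _ = decode(a)
        key = (channel, bank)
        if key in open_row and open_row[key] != row:
            switches += 1
        open_row[key] = row
    return switches
```

Under these assumptions, a sequential stream such as `range(4096)` spreads evenly over all four channels and never switches rows, whereas a stream with a stride of `1 << 12` sends every request to channel 0, bank 0 and changes the row on every access after the first. The second pattern is the kind of mapping pathology the paragraph above describes: low entropy in the channel and bank bits combined with high entropy in the row bits.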