I. Introduction
In the past few years, the architecture community has sought to sustain the performance improvements once delivered by regular clock frequency increases through a similarly regular increase in the number of cores. Today, many industrial and academic roadmaps mention tens and often hundreds of cores per chip [18], [37]. The main chip manufacturers are already shipping processors with up to 8 cores and GPGPUs with tens of cores, each comprising many small SIMD execution units. While adding a small number of cores to a general-purpose CPU has proven profitable for a large class of applications, it is still unclear whether many-cores will be able to sustain the growth in processing performance that is expected for the next decade. The increasing number of cores on a die creates hardware design difficulties and considerably enlarges the design space, in particular for memory hierarchies and on-chip interconnects. Moreover, the performance of future many-cores will largely depend on parallel programming languages and the efficiency of their supporting run-time environments. For these reasons, researchers and engineers urgently need fast simulators to explore the design of such architectures, and even faster ones to evaluate programming models and their implementations, such as hybrid hardware/software solutions. At the same time, the increasing number of cores per chip and the rising complexity of the run-time systems of these platforms pose a major challenge to fast and practical simulation.