I. Introduction
The number of transistors per chip keeps growing due to technology scaling, while increasing the clock rate of processors is becoming technologically less viable [1]. The current trend is therefore to integrate a growing number of processing cores on a chip, which forces parallelizing compilers to mature rapidly and to generate efficient code for multi-core processors. Most parallelizing compilers focus on loop parallelization, since most of the execution time is spent in loops. In many cases, however, scalable parallelism is not realizable because memory accesses and interprocessor communication are the bottlenecks. Recent research shows that memory accesses and data transfers account for the majority of the power consumption [2] [3]; they therefore need to be addressed and handled more explicitly in order to achieve (power-)efficient performance. This paper presents a tool chain that parallelizes an application based on the data flowing inside it, and shows how this helps in manually mapping the algorithm onto the architecture using intuitive parallel constructs. We present one use case, Canny Edge Detection, in detail, and report performance numbers for a second application, fluid animate.