I. Introduction
Multi-cores are becoming mainstream hardware platforms for embedded and real-time systems. To fully utilize their processing capacity, software should be fully parallelized: not only inter-task parallelism but also intra-task parallelism needs to be exploited, so that an individual task (the abstraction of a parallel program) can use more than one core at the same time during its execution. Parallel tasks are supported by almost all modern parallel programming languages and libraries, e.g., the Cilk family, OpenMP, and Intel's Threading Building Blocks. The primitives in these languages and libraries, such as parallel for, omp task, and spawn/sync, result in intra-task parallelism structures that are well represented by Directed Acyclic Graph (DAG) task models, which have gained much attention in the past few years [1]-[27].
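To make the DAG abstraction concrete, the following minimal sketch models a single fork-join region, such as one produced by spawn/sync, as a DAG whose vertices carry execution times and whose edges are precedence constraints. It then computes two quantities commonly associated with such a graph: the total work and the length of the longest path. The vertex names, the execution times, and the helper `work_and_span` are hypothetical and chosen only for illustration; they are not taken from any specific model in the cited works.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def work_and_span(wcet, edges):
    """Return (total work, longest-path length) of a DAG task.

    wcet  : dict mapping each vertex to its execution time (assumed values)
    edges : iterable of (u, v) pairs meaning u must finish before v starts
    """
    # Build the predecessor sets expected by TopologicalSorter.
    preds = {v: set() for v in wcet}
    for u, v in edges:
        preds[v].add(u)

    # Earliest finish time of each vertex on unlimited cores.
    finish = {}
    for v in TopologicalSorter(preds).static_order():
        finish[v] = wcet[v] + max((finish[u] for u in preds[v]), default=0)

    return sum(wcet.values()), max(finish.values(), default=0)


# Hypothetical fork-join region: spawn two children, then sync.
wcet = {"fork": 1, "a": 3, "b": 2, "join": 1}
edges = [("fork", "a"), ("fork", "b"), ("a", "join"), ("b", "join")]

work, span = work_and_span(wcet, edges)
print(work, span)  # 7 5
```

Here the work (7) is the time the task needs on a single core, whereas the longest path (5, through fork, a, join) bounds its execution time from below regardless of how many cores are available; the gap between the two is the intra-task parallelism that a multi-core can exploit.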