1 Introduction
Task-based programming models have become ubiquitous in scientific computing. Over the last decade, they have demonstrated their ability to extract performance across the software stack, from numerical libraries [1], [2], [3], [4], [5], [6], [7] all the way up to computational simulations and applications [8], [9], [10], [11]. Thanks to their fine-grained computations, task-based numerical libraries and applications have proven able to reduce idle time caused by load imbalance, to hide data movement behind computation, and to relax artifactual synchronization points between processing units. Although they mitigate many overheads, these models rely on dynamic engines, or runtime systems, to abstract the underlying hardware complexity from end-users. These runtime systems efficiently track task data dependencies and schedule the corresponding computational kernels on the available hardware resources. A myriad of dynamic runtime systems exist to support task-based programming models on shared- and distributed-memory systems, possibly equipped with hardware accelerators [12]. The lack of API standardization, however, makes it cumbersome for developers of task-based applications and libraries to exploit different runtimes and their respective features: the original code must be modified to port it to a specific task-based engine before it can execute on a given hardware system.
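To make the task-based style concrete, the sketch below uses OpenMP tasks as one illustrative example of such a programming model; it is not tied to any of the runtime systems cited above. The declared data dependencies let the runtime overlap independent work and avoid a global barrier, which is precisely the kind of runtime-specific syntax that must be rewritten when porting to a different task-based engine.

```c
/* Minimal sketch of a task graph with data dependencies, expressed with
 * OpenMP tasks. Other runtimes expose the same concepts (tasks, in/out
 * dependencies, asynchronous scheduling) through their own, incompatible APIs. */
#include <stdio.h>

int main(void) {
    double a = 0.0, b = 0.0, c = 0.0;

    #pragma omp parallel
    #pragma omp single
    {
        /* Two independent producer tasks: the runtime may execute them
         * concurrently on any available processing units. */
        #pragma omp task depend(out: a)
        a = 1.0;

        #pragma omp task depend(out: b)
        b = 2.0;

        /* Consumer task: the declared dependencies make it wait for both
         * producers, without introducing an explicit synchronization point. */
        #pragma omp task depend(in: a, b) depend(out: c)
        c = a + b;

        /* Wait for all tasks spawned in this region before printing. */
        #pragma omp taskwait
        printf("c = %f\n", c);
    }
    return 0;
}
```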