Conferences >2016 International Conference...

POSTER: ξ-TAO: A cache-centric execution model and runtime for deep parallel multicore topologies

Download PDF
Download References
Request Permissions
Save to
Alerts

Abstract:

We have analyzed the ξ-TAO model and runtime with three benchmarks: a parallel hybrid quicksort/mergesort, a 2D Jacobi stencil, and the Unbalanced Tree Search (UTS) bench...Show More

Metadata

Abstract:

We have analyzed the ξ-TAO model and runtime with three benchmarks: a parallel hybrid quicksort/mergesort, a 2D Jacobi stencil, and the Unbalanced Tree Search (UTS) benchmark. We run ξ-TAO implementations of these benchmarks on a Dell PowerEdge R815 server with four AMD Opteron 6348 processors, totalling 8 NUMA nodes and 48 cores. Figure 2 shows the scalability of UTS+ξ-TAO compared to thread-centric runtimes based on work stealing (MassiveThreads [6], Intel TBB) and hierarchical WS+PDF (Qthreads [10]). UTS was implemented in ξ-TAO by grouping sibling nodes into a TAO and attaching a static scheduler. UTS has a very small working set, hence the best performance is achieved when each TAO is mapped to a single core (ξ-TAO-w1). The combination of tight reuse, pre-built task groups and static scheduling results in high scalability for UTS+ξ-TAO. Unlike UTS, the parallel sorting and 2D Jacobi benchmarks are memory intensive benchmarks. By selecting assemblies of width two (i.e., core-width of the L2 caches) and six (i.e., core-width of the L3 cache) ξ-TAO is able to outperform competing approaches thanks to better management of available memory bandwidth and shared cache capacity.

Published in: 2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)

Date of Conference: 11-15 September 2016

Date Added to IEEE Xplore: 01 December 2016

ISBN Information:

DOI: 10.1145/2967938.2974052

Conference Location: Haifa, Israel

Keywords assist with retrieval of results and provide a means to discovering other relevant content. Learn more.

IEEE Keywords
- Processor scheduling ,
- Runtime ,
- Bandwidth ,
- Scheduling ,
- Multicore processing ,
- Benchmark testing
Index Terms
- Model Execution ,
- Scalable ,
- Data Cache ,
- Memory Bandwidth ,
- L2 Cache ,
- System Resources ,
- Resource Requirements ,
- Load Balancing ,
- Scheduling Algorithm ,
- Low Overhead ,
- Task Scheduling ,
- Depth-first ,
- Remote Memory ,
- Topological Map

Contents

1. Introduction

Computational task DAGs are executed on parallel computers by a task scheduling algorithm. Intelligent scheduling is critical for achieving high parallelism, low overheads and reduced communication. A key technique for load balancing task DAGs is work stealing (WS), which Blumofe et al. popularized for fork-join computations [2]. In scenarios of high parallel slackness, WS's distributed nature allows it to scale to a large number of cores with low overhead [4]. However, the space of a WS computation grows proportionally to the number of cores. Targeting a lower bound, Blelloch et al. proposed the parallel-depth-first (PDF) scheduler [1]. PDF schedules tasks by following the depth-first (serial) order of computation and has space requirements closer to the serial execution. PDF has been shown to provide constructive cache sharing in modern multicore architectures [3]. However, implementing PDF requires a centralized scheduler which limits scalability. Targeting NUMA architectures, Olivier et al. proposed to load balance multiple PDF schedulers via WS [8]. While enabling scalability to larger systems, such approach still suffers from centralized scheduling of fine-grained parallelism [9]. Furthermore, for applications in which the amount of parallelism varies greatly, a fixed hierarchy of PDF queues is not enough.

Keywords assist with retrieval of results and provide a means to discovering other relevant content. Learn more.

IEEE Keywords
- Processor scheduling ,
- Runtime ,
- Bandwidth ,
- Scheduling ,
- Multicore processing ,
- Benchmark testing
Index Terms
- Model Execution ,
- Scalable ,
- Data Cache ,
- Memory Bandwidth ,
- L2 Cache ,
- System Resources ,
- Resource Requirements ,
- Load Balancing ,
- Scheduling Algorithm ,
- Low Overhead ,
- Task Scheduling ,
- Depth-first ,
- Remote Memory ,
- Topological Map

References is not available for this document.

POSTER: ξ-TAO: A cache-centric execution model and runtime for deep parallel multicore topologies

Abstract:

Metadata

Abstract:

1. Introduction

References

IEEE Account

Purchase Details

Profile Information

Need Help?

POSTER: ξ-TAO: A cache-centric execution model and runtime for deep parallel multicore topologies

Alerts

Abstract:

Metadata

Abstract:

1. Introduction

References

IEEE Account

Purchase Details

Profile Information

Need Help?