Spark Meets MPI: Towards High-Performance Communication Framework for Spark using MPI

Abstract:

There are several popular Big Data processing frameworks including Apache Spark, Dask, and Ray. The Apache Spark software provides an easy-to-use high-level API in different languages including Scala, Java, and Python. Spark supports parallel and distributed execution of user workloads by supporting communication using an event-driven framework called Netty. Past efforts, including RDMA-Spark and SparkUCX, sought to optimize Apache Spark on High-Performance Computing (HPC) systems equipped with high-performance interconnects like InfiniBand. In the HPC community, Message Passing Interface (MPI) libraries are widely adopted for parallelizing science and engineering applications. This paper presents MPI4Spark, which uses MPI for communication in a parallel and distributed setting on HPC systems. MPI4Spark can launch the Spark ecosystem using MPI launchers to utilize MPI communication inside the Big Data framework. It also maintains isolation for application execution on worker nodes by forking new processes using Dynamic Process Management (DPM). It bridges the semantic differences between the event-driven communication in Spark and the application-driven communication engine in MPI. MPI4Spark also provides portability and performance benefits as it is capable of utilizing popular HPC interconnects including InfiniBand, Omni-Path, Slingshot, and others. The performance of MPI4Spark is evaluated against RDMA-Spark and Vanilla Spark using OSU HiBD Benchmarks (OHB) and Intel HiBench, which contain a variety of Resilient Distributed Dataset (RDD), Graph Processing, and Machine Learning workloads. This evaluation is done on three HPC systems: TACC Frontera, TACC Stampede2, and an internal cluster. MPI4Spark outperforms Vanilla Spark and RDMA-Spark by 4.23x and 2.04x, respectively, on the TACC Frontera system using 448 processing cores (8 Spark workers) for the GroupByTest benchmark in OHB.
The communication performance of MPI4Spark is 13.08x and ...

Published in: 2022 IEEE International Conference on Cluster Computing (CLUSTER)

Date of Conference: 05-08 September 2022

Date Added to IEEE Xplore: 18 October 2022

DOI: 10.1109/CLUSTER51413.2022.00022

Conference Location: Heidelberg, Germany

IEEE Keywords
- Message passing
- Semantics
- Cluster computing
- Machine learning
- Big Data
- Benchmark testing
- Software

Author Keywords
- Apache Spark
- Netty
- MPI

I. Introduction

The global Internet population and the number of unique mobile phone users continue to grow at an accelerated rate [1]. This rise in the digital footprint of the human population is triggering the generation of large amounts of data: the global datasphere is expected [2] to reach 175 zettabytes by 2025. It is becoming increasingly challenging for organizations to manage and process this large amount of data, also known as Big Data.
