Spark Meets MPI: Towards High-Performance Communication Framework for Spark using MPI | IEEE Conference Publication | IEEE Xplore

Abstract:

There are several popular Big Data processing frameworks including Apache Spark, Dask, and Ray. The Apache Spark software provides an easy-to-use high-level API in different languages including Scala, Java, and Python. Spark supports parallel and distributed execution of user workloads, with communication handled by an event-driven framework called Netty. Some efforts - including RDMA-Spark and SparkUCX - were made in the past to optimize Apache Spark on High-Performance Computing (HPC) systems equipped with high-performance interconnects like InfiniBand. In the HPC community, Message Passing Interface (MPI) libraries are widely adopted for parallelizing science and engineering applications. This paper presents MPI4Spark, which uses MPI for communication in a parallel and distributed setting on HPC systems. MPI4Spark can launch the Spark ecosystem using MPI launchers to utilize MPI communication inside the Big Data framework. It also maintains isolation for application execution on worker nodes by forking new processes using Dynamic Process Management (DPM). It bridges semantic differences between the event-driven communication in Spark and the application-driven communication engine in MPI. MPI4Spark also provides portability and performance benefits as it is capable of utilizing popular HPC interconnects including InfiniBand, Omni-Path, Slingshot, and others. The performance of MPI4Spark is evaluated against RDMA-Spark and Vanilla Spark using OSU HiBD Benchmarks (OHB) and Intel HiBench, which contain a variety of Resilient Distributed Dataset (RDD), Graph Processing, and Machine Learning workloads. This evaluation is done on three HPC systems: TACC Frontera, TACC Stampede2, and an internal cluster. MPI4Spark outperforms Vanilla Spark and RDMA-Spark by 4.23x and 2.04x, respectively, on the TACC Frontera system using 448 processing cores (8 Spark workers) for the GroupByTest benchmark in OHB.
The communication performance of MPI4Spark is 13.08x and ...
Date of Conference: 05-08 September 2022
Date Added to IEEE Xplore: 18 October 2022
Conference Location: Heidelberg, Germany

I. Introduction

The global Internet population and the number of unique mobile phone users continue to grow at an accelerating rate [1]. This growth in the digital footprint of the human population is driving the generation of large amounts of data - the global datasphere is expected to reach 175 zettabytes by 2025 [2]. It is becoming increasingly challenging for organizations to manage and process data at this scale, also known as Big Data.
