Muses: Distributed Data Migration System for Polystores

Abstract:

Large datasets can originate from various sources and are stored in heterogeneous formats, schemas, and locations. Typical data science tasks need to combine those datasets in order to increase their value and extract knowledge. This is done in various data processing systems with diverse execution engines. To take advantage of each execution engine's characteristics and APIs, data scientists need to migrate and transform their datasets at a very high cost in both computation and manual labor. Data migration is challenging for two main reasons: i) execution engines expect specific types/shapes of the data as input; ii) there are various physical representations of the data (e.g., partitions). Therefore, migrating data efficiently requires knowledge of each system's internals and assumptions. In this paper we present Muses, a distributed, high-performance data migration engine that can forward, transform, repartition, and broadcast data between distributed engines' instances efficiently. Muses does not require any changes in the underlying execution engines. In an experimental evaluation, we show that migrating data from one execution engine to another (in order to take advantage of faster, native operations) can increase a pipeline's performance by 30%.

Published in: 2019 IEEE 35th International Conference on Data Engineering (ICDE)

Date of Conference: 08-11 April 2019

Date Added to IEEE Xplore: 06 June 2019

DOI: 10.1109/ICDE.2019.00152

Conference Location: Macao, China

I. Introduction

Polystores [1], [4], [10] combine a set of specialized data processing engines (e.g., graph engines, dataflow engines, array databases) in order to perform data analysis at scale, using each specialized engine according to its characteristics. Each engine uses its own format and storage location for the data it processes, which often makes data migration between engines the bottleneck of the pipeline. Thus, the decision to use two different execution engines in a single data pipeline depends on whether the overhead of data migration is smaller than the speedup gained from the specialized execution engine.
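
The trade-off described above can be written as a simple inequality: migration pays off when the migration time plus the run time on the specialized engine is lower than the run time on the original engine. The following Python sketch only illustrates that rule. The function names, the parameters, and the numbers in the usage note are hypothetical and are not taken from the paper or from Muses.

```python
def migration_pays_off(single_engine_s, specialized_engine_s, migration_s):
    """Return True when migrating and then running on the specialized engine
    is faster than staying on the original engine.

    All arguments are wall-clock times in seconds (hypothetical inputs).
    """
    return migration_s + specialized_engine_s < single_engine_s


def relative_gain(single_engine_s, specialized_engine_s, migration_s):
    """Fraction of the original run time saved by migrating.

    A negative value means that migration made the pipeline slower.
    """
    total_with_migration_s = migration_s + specialized_engine_s
    return (single_engine_s - total_with_migration_s) / single_engine_s
```

For example, with made-up timings of 100 s on the original engine, 50 s on the specialized engine, and 20 s of migration, `migration_pays_off(100.0, 50.0, 20.0)` is true. Raising the migration time to 60 s makes it false, because the overhead then exceeds the speedup.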
