I. Introduction
The Semantic Web extends the World Wide Web from a “Web of Documents” to an open, inter-linked “Web of Data” that attaches semantics to information and services on the web, providing machine-processable data that software agents can consume and interpret. The Resource Description Framework (RDF) [1] [2] is a data model proposed by the W3C to represent metadata about Web resources, enabling search engines to precisely locate and extract information on the Semantic Web. RDF has recently gained popularity due to its flexible data model for publishing data on the Web. A growing number of organizations, institutions, and companies are adopting Semantic Web technologies to represent data in a semantically structured way, thereby contributing to the “Web of Data”. This steady growth of RDF data necessitates efficient RDF management solutions for storing and querying very large RDF graphs. Over the last decade, many RDF data management systems have been designed to provide scalable, highly available, and fault-tolerant RDF stores with efficient SPARQL [3] query processing for distributed environments (e.g., RDF-3X [4], Partout [5], DREAM [6]). In recent years, many distributed RDF management systems have been built on Big Data technologies such as Hadoop (e.g., Rya [7], H2RDF+ [8], SHARD [9], CliqueSquare [10], PigSPARQL [11], Sempala [12], S2RDF [13], SPARQLGX [14]). These RDF data processing systems rely on cluster computing engines based on MapReduce [15] as an execution layer or on in-memory frameworks such as Spark [16] and Impala [17]. In most cases, these systems are optimized for particular query patterns, and some of them trade longer data loading times for better querying performance.
Therefore, it is necessary to implement a distributed RDF management system that delivers efficient query performance over a wide range of query patterns while minimizing loading cost by reducing extensive pre-processing overhead. In this paper, we propose new RDF data partitioning schemes based on two existing approaches: Vertical Partitioning (VP) [18] and Property Table (PT) [19]. We extend the PT approach by further partitioning the PT into subsets and then combine it with the VP approach to propose a new storage scheme. For storing and querying RDF data we use Spark, an in-memory cluster computing framework and one of the most popular components of the Hadoop ecosystem. It utilizes in-memory caching and an advanced directed acyclic graph (DAG) execution engine to create efficient query plans for data transformations. Spark runs programs up to 100 times faster than Hadoop MapReduce when processing in memory, and up to 10 times faster when processing on disk. Spark SQL is a Spark module for structured data processing that allows running SQL-like queries on Spark data; it includes a cost-based optimizer and code generation to make queries faster.
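To make the two underlying storage layouts concrete, the following is a minimal sketch in plain Python (not Spark) of how a toy triple set is organized under Vertical Partitioning versus a Property Table. The triple data, function names, and the single-valued-predicate simplification are illustrative assumptions, not part of any of the cited systems:

```python
from collections import defaultdict

# Toy RDF triples as (subject, predicate, object) tuples -- illustrative data only.
triples = [
    ("Alice", "type", "Person"),
    ("Alice", "name", "Alice Smith"),
    ("Alice", "knows", "Bob"),
    ("Bob", "type", "Person"),
    ("Bob", "name", "Bob Jones"),
]

def vertical_partitioning(triples):
    """VP: one two-column (subject, object) table per distinct predicate."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return dict(tables)

def property_table(triples):
    """PT: one row per subject with one column per predicate
    (predicates assumed single-valued here for simplicity)."""
    rows = defaultdict(dict)
    for s, p, o in triples:
        rows[s][p] = o
    return dict(rows)

vp = vertical_partitioning(triples)
pt = property_table(triples)

# A star-shaped query on one subject reads a single PT row ...
print(pt["Alice"])   # {'type': 'Person', 'name': 'Alice Smith', 'knows': 'Bob'}
# ... while a query bound to one predicate scans one small VP table.
print(vp["knows"])   # [('Alice', 'Bob')]
```

The contrast motivates combining the two layouts: VP keeps per-predicate scans small, while PT answers multi-predicate star queries on a subject without joins.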