Parallel clustering method for non-disjoint partitioning of large-scale data based on spark framework

Abstract:

Clustering large-scale data has become an important challenge that has motivated several recent works. While the emphasis has been on organizing massive data into disjoint groups, this work considers the identification of non-disjoint groups instead. In this setting, a data object can belong to several groups simultaneously, since many real-world applications of clustering require non-disjoint partitioning to fit the structure of the data. For this purpose, we propose the Parallel Overlapping k-means method (POKM), which performs parallel clustering processes leading to a non-disjoint partitioning of the data. The proposed method is implemented within the Spark framework to distribute the work across the computation nodes. Experiments performed on simulated and real-world multi-labeled datasets show both faster execution times and high clustering quality compared to existing methods.
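To make the idea of non-disjoint assignment concrete, the following is a minimal sketch of one greedy heuristic in the style of overlapping k-means. It is an illustration under stated assumptions, not the POKM procedure itself: the abstract does not give the assignment rule, so the function name `overlapping_assign`, the toy centroids, and the stopping criterion are all hypothetical. A point joins its nearest cluster, then keeps joining the next nearest cluster as long as the mean of the chosen centroids moves closer to the point.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def overlapping_assign(point, centroids):
    """Return the sorted indices of every cluster the point is assigned to.

    Hypothetical greedy rule (an assumption, not taken from the paper):
    add clusters from nearest to farthest while the mean of the chosen
    centroids gets strictly closer to the point.
    """
    # Rank cluster indices from nearest to farthest centroid.
    order = sorted(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    chosen = [order[0]]
    best = dist(point, centroids[order[0]])
    for i in order[1:]:
        candidate = chosen + [i]
        # Represent the point by the mean of the candidate centroids.
        image = [
            sum(centroids[j][d] for j in candidate) / len(candidate)
            for d in range(len(point))
        ]
        err = dist(point, image)
        if err < best:
            chosen, best = candidate, err
        else:
            break  # adding this cluster does not improve the fit, so stop
    return sorted(chosen)


# Toy centroids, chosen only for illustration.
centroids = [(0.0, 0.0), (4.0, 0.0), (0.0, 10.0)]
print(overlapping_assign((2.0, 0.0), centroids))  # midway between clusters 0 and 1
print(overlapping_assign((0.1, 0.0), centroids))  # close to cluster 0 only
```

In a distributed setting, a per-point function like this would typically be applied independently to each data partition, which is what makes the assignment step straightforward to parallelize.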

Published in: 2016 IEEE International Conference on Big Data (Big Data)

Date of Conference: 05-08 December 2016

Date Added to IEEE Xplore: 06 February 2017

DOI: 10.1109/BigData.2016.7840708

Conference Location: Washington, DC, USA

I. Introduction

Data clustering has become a challenging task in data mining and machine learning, since many real-life applications need to organize data into groups based on similar descriptive characteristics. Examples include image segmentation [1], market segmentation [2], customer segmentation [3], document summarization [4], and many others. The problem of organizing data into groups has been studied for the last three decades, and several clustering methods have been proposed in the literature. Existing methods follow several approaches to look for groups in data, such as partitioning, hierarchical, density-based, and graph-based methods [3], [5]. However, given the exponential growth of data captured from different sources (social media, sites, mobile devices, online videos, etc.), most existing methods cannot be applied to large-scale data. Scalability, that is, the ability of a method to cluster large volumes of data, has therefore become a necessary requirement.
