Parallel clustering method for non-disjoint partitioning of large-scale data based on spark framework | IEEE Conference Publication | IEEE Xplore

Abstract:

Clustering large-scale data has become an important challenge and has motivated several recent works. While the emphasis has been on organizing massive data into disjoint groups, this work considers the identification of non-disjoint groups instead. In this setting, a data object can belong to several groups simultaneously, since many real-world applications of clustering require non-disjoint partitioning to fit the structure of the data. For this purpose, we propose the Parallel Overlapping k-means method (POKM), which performs parallel clustering processes leading to a non-disjoint partitioning of the data. The proposed method is implemented within the Spark framework to distribute the work across the computation nodes. Experiments performed on simulated and real-world multi-labeled datasets show both faster execution times and high clustering quality compared to existing methods.
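
To make the notion of non-disjoint partitioning concrete, the following is a minimal single-machine sketch of the kind of assignment step commonly used in overlapping k-means: a point joins additional clusters as long as the mean of its assigned centroids (its "image") moves closer to it. This is an illustration under stated assumptions, not the POKM implementation, which the abstract does not detail. The function name `overlapping_assign` and the toy data are hypothetical, and no Spark distribution is shown.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def overlapping_assign(point, centroids):
    """Return the indices of the centroids a point is assigned to.

    Assumed heuristic (not taken from the paper): visit centroids from
    nearest to farthest and keep adding one while the mean of the
    assigned centroids gets strictly closer to the point.
    """
    # Rank centroids from nearest to farthest.
    order = sorted(range(len(centroids)),
                   key=lambda i: dist(point, centroids[i]))
    assigned = [order[0]]
    best = dist(point, centroids[order[0]])
    for i in order[1:]:
        candidate = assigned + [i]
        # "Image" of the point: coordinate-wise mean of candidate centroids.
        image = [sum(centroids[j][d] for j in candidate) / len(candidate)
                 for d in range(len(point))]
        error = dist(point, image)
        if error < best:
            assigned, best = candidate, error
        else:
            break  # stop at the first centroid that does not help
    return assigned


# Hypothetical toy data: three centroids and three points in 2-D.
centroids = [(0.0, 0.0), (4.0, 0.0), (0.0, 10.0)]
points = [(2.0, 0.0), (0.1, 0.0), (0.0, 9.0)]
memberships = [overlapping_assign(p, centroids) for p in points]
print(memberships)
```

In this toy example the first point lies midway between two centroids and is assigned to both, while the other two points each stay in a single cluster. Because each point is processed independently, a step of this shape is the sort of work that could be spread across computation nodes.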
Date of Conference: 05-08 December 2016
Date Added to IEEE Xplore: 06 February 2017
Conference Location: Washington, DC, USA

I. Introduction

Data clustering has become a challenging task in data mining and machine learning, since many real-life applications require organizing data into groups based on similar descriptive characteristics. Examples of these applications include image segmentation [1], market segmentation [2], customer segmentation [3], and document summarization [4], among many others. The problem of organizing data into groups has been studied over the last three decades, and several clustering methods have been proposed in the literature. Existing methods follow several approaches to look for groups in data, such as partitioning, hierarchical, density-based, and graph-based methods [3], [5]. However, given the exponential growth of data captured from the different available sources (social media, websites, mobile devices, online videos, etc.), most existing methods cannot be applied to large-scale volumes of data. Scalability, that is, the ability of a method to perform clustering on large volumes of data, has therefore become a necessary requirement.
