Introduction
Clustering, also called cluster analysis or data segmentation [1], is commonly defined as the grouping of similar objects into classes called clusters [2], or more specifically as an unsupervised learning approach that classifies patterns into groups (clusters) based on similarity, where a pattern is a representation of the features or attributes of an object [3]. Clustering methods are unsupervised because, unlike classification methods, we do not know the classification parameters, the characteristics of the data, or even the number of clusters. Clustering-based techniques therefore try to estimate and learn these parameters from the given data. There are usually two ways to perform this process: an offline method applied to a saved batch of data, and an online method applied to data arriving sequentially. Offline methods achieve better accuracy but are not suitable for very massive or real-time data.
A. Background
Today, clustering is considered one of the vital data-mining tools for analyzing data. It is used in many standard application fields, such as the following:
Social network analysis
Collaborative filtering
Data summarizing
Multimedia data analysis
Customer segmentation
Biological data analysis
As data grow in size and speed, handling big data has become inevitable. There are many definitions of big data; one of them is the amount of data just beyond technology's capability to store, manage, and process efficiently [4]. Big data are seen as vast, complex, and growing from multiple or autonomous sources. As a result of the fast improvement of communication, data storage methods, and the high ability to collect data, big data are growing rapidly in all fields and domains, such as science, engineering, and the physical, biological, and biomedical sciences [5]. Many new applications can also generate vast amounts of data in a short time; for example, social networks provide incredible opportunities for social connection along with an enormous volume of data [6].
In the same direction, data streams refer to massive amounts of data generated at very high speed, such as network traffic, web click streams, and sensor networks [7]. Hence, data-stream mining has become a hot research topic because of the introduction of advanced technologies and applications that regularly generate data streams.
Although big data and streaming data add big challenges, the data type also significantly affects the clustering problem. As a result, the kind of data plays the primary role in choosing the techniques used for clustering analysis. Accordingly, there is a wide range of clustering models with different methodologies. Figure 2 shows the general clustering techniques classified by method, drawn from various studies. The significant growth of data poses many future challenges and often requires specialized techniques [8].
Algorithm scalability is the ability of an algorithm to handle a growing amount of input by adding additional resources to the system [9]. In this case, the system can be scaled up or down based on the workload. Scaling resources has become an essential consideration because of the cost of adding resources to the system, which is why research has tended towards methods for scalable systems, especially for big data and for weighing real-time requirements against cost. According to this definition of scalability, we classify the techniques in this review into two types: traditional and scalable. The traditional techniques consist of clustering algorithms that do not consider the system's scalability. In contrast, the scalable techniques consist of clustering algorithms that exploit the system's scalability.
Due to the limitations of traditional clustering algorithms, either in output speed or in processing capacity, research has moved in two directions to face these challenges. The first direction improves traditional algorithms so they can work with large data sizes; the other proposes new methodologies that benefit from new technologies such as parallel computing, cloud computing, and MapReduce.
B. Current Issues
The significant challenges for data miners and data analysts come from finding the best method to extract useful information from high-dimensional and ever-growing datasets [10]. Nowadays, big data is an exciting area for scientific research. It is becoming a common data source for many business applications, which require a range of data mining operations [11]. However, applying data mining techniques to big data is difficult because of the new challenges that big data brings [12]. Scalability, complexity, and the presence of mixed data are the main challenges of big data analysis and clustering [13]. Parallelism is therefore introduced by many clustering algorithms because it supports a 'divide and conquer' strategy that reduces the time complexity associated with big data [14]. The main challenges in clustering extensive data fall into the following points:
Clusters usually have non-uniform shapes,
Lack of knowledge or ability to determine the number of input parameters or to choose good values for them, and
Scalability and the incredible size of extensive data.
On the other hand, one of the most critical aspects of the data mining problem is how similarity is defined and measured. Similarity measures affect clustering and classification depending on the type of data. Clustering techniques for categorical data differ greatly from those for numerical data in how the similarity measure is defined. It is also rare to find clear cluster boundaries and avoid overlap between clusters, which constrains researchers when choosing an optimal similarity measure that applies to a wide range of data types.
Focusing on scalable clustering techniques (Figure 3), the algorithms are classified according to how they deal with the data, either as a batch or as real-time streams. In streaming clustering, most algorithms come from traditional methods with some modifications to work with stream data. In contrast, most big-data techniques are based on new algorithms. They are classified into two strategies based on the number of machines the algorithm can use during processing (a single machine or multiple machines).
C. Contributions
In this paper, we focus on the issues and challenges of scalable clustering techniques designed for big data or streaming data. The main contributions of this paper are as follows:
Present a general overview of different clustering algorithms based on their clustering techniques.
Review the different clustering algorithms and their main features, categorizing them according to new scalability methods.
Present a feature comparison between the studied algorithms based on their clustering techniques.
Traditional Clustering Techniques
In this part, we review the traditional clustering techniques. Algorithms that do not treat system scalability or data size as a metric in the clustering process are covered under the traditional term. Traditional methods do not handle the volume of data by distributing it across devices or dividing it to deal with extensive data.
A. Hierarchical Clustering
In a hierarchical clustering algorithm, data are grouped through a sequence of nested partitions, either from singleton clusters to a cluster including all individuals or vice versa. The former is known as agglomerative, and the latter as divisive. Agglomerative methods use the 'bottom-up' approach; they begin with each object as a separate cluster and merge them into successively larger clusters. Divisive methods, on the other hand, use the 'top-down' approach; they begin with the whole set of objects in one cluster and proceed to divide it into successively smaller clusters. Figure 4 demonstrates the difference between the two approaches in process direction. In practice, agglomerative techniques are commonly used, while divisive techniques are limited by their prohibitive computational burden. The output of hierarchical clustering is usually represented as a dendrogram, or as a Voronoi diagram when visualizing big data (see Figure 5), which clearly describes the proximity among data objects and their clusters and provides good visualization. Although the classical hierarchical clustering methods are conceptually easy to understand, they suffer from high computational complexity, which limits their application to large-scale data sets [15]. Many algorithms fall in this category, such as Chameleon [16], ROCK [17], LIMBO [18], F-Tree [19], MTMDP [20], and MGR [21].
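As an illustration only, the following minimal Python sketch runs the agglomerative (bottom-up) process on toy data using SciPy's hierarchical clustering routines; the data, the linkage criterion, and the number of flat clusters are illustrative assumptions, not part of any surveyed algorithm.

# Minimal agglomerative clustering sketch (illustrative parameters).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),      # toy data: two well-separated blobs
               rng.normal(3, 0.3, (20, 2))])

Z = linkage(X, method="average")                  # repeatedly merge the closest clusters
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 flat clusters
print(labels)
# dendrogram(Z) would draw the nested-merge tree discussed above.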
B. Partition-Based Clustering
In partitioning-based algorithms, the data are recursively divided into partitions until the partition criterion reaches a specified value, with each partition representing a cluster. K-means [22] and K-medoids [23] are the most famous partitioning-based algorithms. K-means iteratively updates the cluster centres until the assignments converge. Several versions of K-means have been proposed to improve its time complexity, such as [24]. Other algorithms fall in this category, such as PAM [25], CLARA [26], CLARANS [27], COOLCAT [28], and Squeezer [29].
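To make the iterative centre update concrete, here is a minimal NumPy sketch of the standard K-means (Lloyd) iteration; k, the random initialisation, and the tolerance are illustrative assumptions.

# Minimal K-means (Lloyd) iteration sketch.
import numpy as np

def kmeans(X, k, n_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]   # random initial centres
    for _ in range(n_iter):
        # assignment step: label each point with its nearest centre
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: move each centre to the mean of its assigned points
        new_centres = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centres[j] for j in range(k)])
        if np.linalg.norm(new_centres - centres) < tol:       # centres stopped moving
            break
        centres = new_centres
    return labels, centres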
C. Density-Based Clustering
Density-based clustering aims to discover clusters of arbitrary shape. In this type, the data are numerical so that they can be grouped based on dimensional distances. Initially, the data are divided into three types of points: core, boundary, and noise points. A point is considered a core point if it has at least a minimum number of neighbouring points (MinPts) within a given radius (Eps); boundary points lie within the neighbourhood of a core point without being dense enough themselves, and the remaining points are treated as noise.
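The core/boundary/noise distinction can be illustrated with scikit-learn's DBSCAN, a minimal sketch under assumed parameter values (eps corresponds to the radius, min_samples to MinPts).

# Minimal density-based clustering sketch with DBSCAN (illustrative parameters).
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (50, 2)),        # dense blob
               rng.normal(4, 0.2, (50, 2)),        # second dense blob
               rng.uniform(-2, 6, (10, 2))])       # sparse scattered points

db = DBSCAN(eps=0.5, min_samples=5).fit(X)         # eps = radius, min_samples = MinPts
print(db.labels_)                                   # label -1 marks points left as noise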
D. Probabilistic and Generative Clustering
In model-based algorithms, the data are clustered using various strategies such as statistical and conceptual methods. There are two common approaches to model-based algorithms: the neural network approach and the analytical approach. Neural network approaches are usually supervised techniques; however, Kohonen's SOM is a neural model used for clustering [34]. Algorithms that fall in this category include SVC [35] and Ensemble [36].
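As one concrete example of the statistical (generative) side of this family, not of the cited algorithms themselves, the sketch below fits a Gaussian mixture with EM and assigns each point to its most probable component; the component count and toy data are assumptions.

# Minimal generative clustering sketch with a Gaussian mixture (illustrative parameters).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (100, 2)),
               rng.normal(3, 0.5, (100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)   # EM fit of the mixture
labels = gmm.predict(X)        # each point goes to its most probable component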
E. Grid-Based Clustering
Grid-based clustering algorithms have attracted great interest because of their ability to discover clusters with different shapes and sizes. There are mainly two methods of this type: the fix-up and the adaptive grid partition methods. The idea of the fix-up grid partition method is to divide each dimension of the data space into equal lengths, which cross to form rectangular cells of the same size. Since the points in the same cell belong to a group, they are treated as a single object, and all clustering runs on these grid cells. The idea of the adaptive grid partition method, in contrast, is to divide the data space into non-crossed grid cells of different sizes according to the data distribution, so the total number of grid cells is significantly reduced compared to fix-up partition methods. Still, determining the splitting points requires massive computational power [37]. Some algorithms, such as CLIQUE [38], STING [39], WaveCluster [40], and DENCLUE [41], use the fix-up grid method, while others, such as OptiGrid [42] and MAFIA [43], use the adaptive grid partition method.
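The fix-up idea can be illustrated with a short sketch that bins points into equal-width cells and keeps only the occupied cells with their counts; later clustering would then operate on these cells instead of the raw points. The cell count per dimension is an illustrative assumption.

# Minimal fix-up grid partition sketch (illustrative parameters).
import numpy as np

def fixed_grid_cells(X, cells_per_dim=10):
    mins, maxs = X.min(axis=0), X.max(axis=0)
    # map every point to the index of the rectangular cell that contains it
    idx = np.floor((X - mins) / (maxs - mins + 1e-12) * cells_per_dim).astype(int)
    idx = np.clip(idx, 0, cells_per_dim - 1)
    cells, counts = np.unique(idx, axis=0, return_counts=True)
    return cells, counts        # subsequent steps cluster these cells, not the raw points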
Scalable Clustering Techniques
We explore some of the scalable clustering techniques. Usually, two kinds of methods are developed to handle big data: techniques that reduce the size of the data, either vertically or horizontally (Figure 7), and methods that speed up execution by relying on multiple physical processors. With the latter, big data can be cut into smaller pieces and processed on different devices simultaneously. Multi-machine clustering algorithms are classified into two categories according to the technique used to run multiple processes: MapReduce, or parallel and distributed processing. Commonly, these methods address the clustering time challenge of big data. The survey in [44] presents parallel clustering algorithms in more detail according to the big-data platform; in general, a scalable clustering algorithm is designed to fit a specific scaled platform. The survey in [45] covers stream clustering algorithms in more detail in comparison with traditional algorithms. In this section, we review scaling algorithms based on their techniques rather than the architecture of the platform, across streaming and non-streaming algorithms.
A. Sampling-Based
This method was one of the first attempts to overcome the problems of data volume and processing speed; the primary idea is to cluster a sample of the data rather than the entire data set (Figure 8). After processing, the clustering results are generalized to the whole data set [12]. These methods accelerate clustering by minimizing the data size and thus the time and complexity. On the other hand, they add time and complexity in the pre-processing required for the sampling operations. Moreover, clustering a subset of the data does not give the same accuracy as clustering the entire data set. Algorithms that fall in this category include BIRCH [46], CURE [47], and CLARANS [27].
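The sample-then-generalize idea can be sketched in a few lines: cluster a random sample, then assign every remaining point to the nearest sample-derived centre. The sample fraction, k, and the use of K-means as the base algorithm are illustrative assumptions.

# Minimal sampling-based scaling sketch (illustrative parameters).
import numpy as np
from sklearn.cluster import KMeans

def cluster_via_sample(X, k=5, sample_frac=0.1, seed=0):
    rng = np.random.default_rng(seed)
    sample_idx = rng.choice(len(X), size=max(k, int(sample_frac * len(X))),
                            replace=False)
    model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X[sample_idx])
    return model.predict(X)     # generalize the sample clustering to the whole data set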
B. Reduction and Projection-Based
The high dimensionality of data adds challenges to most clustering algorithms, such as noisy features or sparse data [48]. While sampling-based techniques reduce the data size vertically, they are not always the best solution for high-dimensional data sets. Projection-based methods instead reduce the data size horizontally. Current approaches use procedures such as feature selection, feature extraction, approximation, and random projection. Like sampling-based techniques, these techniques also require data pre-processing. Feature selection lowers the dimensionality by filtering the data attributes based on data dependency, while feature extraction constructs new, more useful features. Algorithms that fall in this category include Ensemble [48], Colibri [49], and RP [50].
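A minimal sketch of the horizontal-reduction idea, assuming a random projection followed by K-means (the target dimension, k, and toy data are illustrative choices, not those of the cited algorithms):

# Minimal projection-based scaling sketch (illustrative parameters).
import numpy as np
from sklearn.random_projection import GaussianRandomProjection
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 500))                     # high-dimensional toy data

X_low = GaussianRandomProjection(n_components=20,    # project 500 features down to 20
                                 random_state=0).fit_transform(X)
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_low)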
C. Subspace Clustering
Subspace clustering is a technique that searches for clusters in different subspaces. The basic idea is to discover clusters using a subset of the dimensions. Generally, there are two types of subspace clustering based on the search strategy: bottom-up and top-down. Bottom-up methods start by finding clusters in lower dimensions and iteratively merge them to process higher-dimensional spaces. Top-down methods start by finding clusters in the full dimensionality and then evaluate the subspace of each cluster. In general, subspace clustering addresses the high-dimensionality problem faced by most grid-based approaches. Algorithms that fall in this category include SSSC [51], TNNLS [52], and StructAE [53].
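A highly simplified sketch of the bottom-up strategy is given below: dense one-dimensional grid units are found per dimension and then combined across dimensions when their members still overlap sufficiently. This is only a CLIQUE-like illustration under assumed bin and density thresholds, not any of the cited algorithms.

# Simplified bottom-up subspace search sketch (illustrative thresholds).
import numpy as np
from itertools import combinations

def dense_units_1d(X, bins=10, density=20):
    units = {}                                   # (dimension, bin index) -> member indices
    for d in range(X.shape[1]):
        edges = np.linspace(X[:, d].min(), X[:, d].max(), bins + 1)
        bin_idx = np.clip(np.digitize(X[:, d], edges) - 1, 0, bins - 1)
        for b in range(bins):
            members = np.where(bin_idx == b)[0]
            if len(members) >= density:
                units[(d, b)] = set(members)
    return units

def dense_units_2d(units, density=20):
    # candidate 2-D units are intersections of dense 1-D units from different dimensions
    out = {}
    for (u1, m1), (u2, m2) in combinations(units.items(), 2):
        if u1[0] != u2[0] and len(m1 & m2) >= density:
            out[(u1, u2)] = m1 & m2
    return out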
D. Map-Reduced Based
Since a single processor with one memory cannot handle extensive data at an adequate speed, there is a clear need for algorithms that run on multiple devices. The MapReduce framework (displayed in Figure 9) removes many of the concerns involved in running algorithms on multiple devices, such as network connection, data distribution, and load balancing. These advantages allow many researchers to easily apply and improve their algorithms on parallel processing systems. A body of research re-applies or re-implements existing clustering techniques using MapReduce, such as the work in [54], which presented an integrated approach to CURE clustering using the MapReduce method. Other algorithms fall in this category, such as DisCo [55], PKMeans [56], BoW [57], wkPCM [58], and KIM [59].
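To show how a clustering step maps onto this framework, the sketch below expresses one K-means iteration as map and reduce phases, simulated locally in plain Python; on a real MapReduce platform the same logic would run on distributed workers. The data chunks and centres are illustrative assumptions.

# One K-means iteration written in map/reduce style (local simulation).
import numpy as np

def map_phase(chunk, centres):
    # mapper: emit (nearest-centre id, (partial point sum, count)) for its data chunk
    dists = np.linalg.norm(chunk[:, None, :] - centres[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    return [(j, (chunk[labels == j].sum(axis=0), int((labels == j).sum())))
            for j in np.unique(labels)]

def reduce_phase(pairs, centres):
    # reducer: combine partial sums per centre id and recompute each centre
    new = centres.copy()
    sums = {}
    for j, (s, n) in pairs:
        acc = sums.setdefault(j, [np.zeros_like(s), 0])
        acc[0] += s
        acc[1] += n
    for j, (s, n) in sums.items():
        new[j] = s / n
    return new

def kmeans_mapreduce_step(chunks, centres):
    # one iteration over data split into chunks (one chunk per mapper)
    pairs = [p for chunk in chunks for p in map_phase(chunk, centres)]
    return reduce_phase(pairs, centres)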
E. Parallel and Distributed-Based
The data-driven path identification approach (DDPI) [60] is a concurrent representation of k-means with neural network batch training. Instead of implementing a distributed system, the authors presented a data-parallel interface that permits a parallel implementation of the k-means algorithm using a neural network, although it adds a supervised step. The three parallelization steps of the algorithm are partitioning and distributing the data, computing on the distributed data, and finally assembling the local computational results. The concurrent structure provides a way to reduce the computational demand of neural techniques. Density-based distributed clustering (DBDC) [32] uses distributed techniques to speed up the clustering process on large-scale data, based on a density-partitioning clustering technique (Figure 10). Distributed clustering in this algorithm operates at two levels (local and global). At the local level, DBDC uses an independent clustering algorithm, which carries out the process on the partitioned data. At the global level, it uses a density-based clustering algorithm, DBSCAN, to generate the results at the main site; DBSCAN is applicable only to metric data spaces. To meet real-time constraints in clustering data streams using parallelization in the cloud, Cloud DIKW [61] introduced a cloud framework integrating parallel batch and stream processing.
Moreover, clustering social media data streams was used as the test application domain. Recently, [62] presented a parallel implementation of multi-objective feature selection. The algorithm makes it possible to use high-dimensional datasets in which there are many more features than data items; feature selection is most helpful in classification.
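A simplified illustration of the local/global pattern behind DBDC-style algorithms follows: each partition is clustered independently and only a few representatives are sent to the main site, which clusters them with DBSCAN. The partitioning, the choice of K-means locally, and the eps/min_samples values are illustrative assumptions; in a real deployment each local call would run on a separate site rather than sequentially.

# Simplified local/global distributed clustering sketch (illustrative parameters).
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

def local_cluster(partition):
    # local level: cheap clustering on one partition, return its cluster centres
    km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(partition)
    return km.cluster_centers_

def distributed_cluster(partitions):
    # each call below would run on a separate worker; here it is simulated sequentially
    reps = np.vstack([local_cluster(p) for p in partitions])
    # global level: density-based clustering of the collected representatives
    return reps, DBSCAN(eps=0.8, min_samples=3).fit_predict(reps)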
Evaluations
A. Results and Discussion
In this paper, we surveyed 101 algorithms of various clustering types, including most of the relevant clustering algorithms. We collected a large number of algorithms of different types to identify the common themes of each clustering type (such as scalability, complexity, and data type) without focusing on the special considerations of each algorithm (such as application area and data processing). The main criteria for paper selection were as follows: the main paper proposes a clustering algorithm, and the algorithm is comparable with competitors in the same clustering area.
Figure 11 shows the percentage of each clustering method among the studied algorithms. The partition-based method had the most significant proportion, with 25% of all algorithms, and 20% of the studied algorithms applied a scalable method. Figure 12 shows the proportion of scalable versus non-scalable methods. Projection-based algorithms were the strongest research trend, due to the ease with which this method can be applied compared with the other scalable technologies.
Table 3 shows a comparison of the clustering algorithms, classified according to their type and arranged chronologically within each class. The data were collected directly from the algorithms' references and validated against other surveys [14], [63]–[72]. We evaluated the techniques based on the following:
Data Size: determines the ability of the algorithm to operate on very large or only limited data sizes.
High Dimensions: determines how well the algorithm works with high-dimensional data.
Streaming: defines the way the algorithm consumes data, either in a batch or a streaming manner.
Spatial Data Processing: the algorithm's ability to deal with complex data types such as spatial data.
Different Data Types: determines whether the algorithm can work on more than one type of data at the same time.
Handling Noise: the algorithm's ability to overcome outliers.
Arbitrary Shape: the cluster shape the algorithm can output.
Scalable: determines whether the algorithm includes any of the four scalable methods: sampling, projection, parallel, and MapReduce.
Complexity: the time complexity of the algorithm, classified into three classes: if the complexity is linear or semi-linear, the class is Low; if it is below quadratic, the class is Middle; and if it is quadratic or above, the class is High.
Number of Parameters: the number of parameters the algorithm needs to operate.
Data Type: the type of data that is suitable for the specific algorithm.
B. Summary
Figure 13 presents a summary of the characteristics of the studied algorithms listed in Table 3. The percentages in the table represent the average of each feature within each algorithm category. From these statistics, the following is observed:
Grid-based algorithms, followed by density-based algorithms, are most often able to handle extensive, high-dimensional data. For managing stream data, most algorithms rely on density-based, grid-based, and then partitioning-based methods, while hierarchical and model-based methods are among the least researched here.
Most of the algorithms have high implementation complexity, except grid-based and density-based algorithms, most of whose implementations are between medium and low complexity.
Partitioning- and model-based research is usually based on numeric data types, which limits its use in many fields. When the data type is a complex structure, grid-based methods are more suitable.
Conclusion
In this paper, we surveyed the literature to analyze and evaluate traditional and scalable clustering algorithms on big and large-scale datasets. First, we studied the different types of clustering and summarized the studied algorithms based on their characteristics, scalability, noise handling, data type, and complexity. Second, we compared the general characteristics of the clustering types. Finally, we summarized the strengths and weaknesses of each variety of clustering methods and scalable techniques. The main results obtained from the selected studies are:
Research on new scalable techniques is decreasing, as the research direction moves toward implementing algorithms on cloud-based infrastructure rather than developing new practical methods.
The most commonly used methods among the scalable techniques in the studies were partitioning-based algorithms, followed by hierarchical-based algorithms.
Density-based and grid-based algorithms were found to be the most common techniques for handling large, noisy data in the literature. However, we observed that very few studies use mixed-datatype datasets to evaluate the effectiveness of algorithm results.
Grid-based algorithms were the fastest techniques for handling high-dimensional data.
The most commonly used methods for handling stream data in the studies were density-based algorithms.
Future Work
The large dimensionality of data, the speed of data processing, data noise, and data complexity combine to form critical challenges in the field of data mining and analysis. These challenges are addressed using scalable techniques such as cloud computing and parallelism. However, a wide range of scalable techniques is used across current methods, and choosing and implementing the best model adds further challenges for researchers. Beyond the data problems, designing new intelligent techniques that auto-adapt to the data type, automatically partition and distribute the data across multiple methods, and collect or sort the results at the same time is considered an open research point.