I. Introduction
Preprocessing is an important part of data mining: it organizes raw big data and lays the foundation for subsequent data analysis. Reference [1] quantifies the similarity between data points with a similarity measure, evaluates the quality of the clustering result with a criterion function, and applies the K-means clustering algorithm so that the distance from each data point to the center of its cluster is as small as possible while the distance between different clusters is as large as possible. However, this algorithm converges slowly, takes a long time to cluster, and has low accuracy [1]. Reference [2] uses an unsupervised learning method to measure the similarity of data without category labels and divides big data into clusters to group the data. However, the grouping distance in this method is not standardized, the time cost of database clustering is high, and the accuracy is low [2]. To address these problems, and building on the work above, this paper designs an efficient distributed database clustering algorithm for big data processing that reveals the differences between data, discovers their internal relationships, and provides a reliable basis for deeper data analysis.
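For orientation, the K-means procedure summarized above from [1] can be sketched in a few lines of Python. This is a minimal illustration of the standard (Lloyd-style) K-means iteration, not the implementation used in [1] and not the algorithm proposed in this paper. It assumes Euclidean distance as the similarity measure and the sum of squared errors (SSE) as the criterion function; the function names `euclidean`, `sse`, and `kmeans`, and the sample data, are chosen here for illustration only.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance, assumed here as the (dis)similarity measure."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sse(points, centers, labels):
    """Criterion function (assumed): sum of squared distances from each
    point to the center of the cluster it is assigned to."""
    return sum(euclidean(p, centers[k]) ** 2 for p, k in zip(points, labels))


def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means: alternate assignment and center update until the
    assignment stops changing or max_iter is reached."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers drawn from the data
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(k), key=lambda j: euclidean(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:  # no change, so the iteration has converged
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels


# Illustrative data: two well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(points, k=2)
print(labels, sse(points, centers, labels))
```

Each iteration of this sketch compares every point with every center, which gives a sense of why plain K-means can become slow on large data sets, as noted for [1].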