I. Introduction
Nowadays, owing to the advent of technology, a high volume of text data is being generated through web pages, social media, and other sources, making it very time- and energy-consuming for humans to categorize [1]. Text data is unstructured and can be analyzed using text mining methods [2]. The text mining framework, as described in [3], consists of three sequential steps: text preprocessing, text representation, and knowledge discovery. According to [3], classification is a knowledge discovery technique, while feature extraction methods belong to text representation. Text classification, and more specifically document classification, is a supervised task in which a classifier is trained on pre-categorized documents; the trained classifier is then expected to assign an unseen document to one of the existing categories. Many classification algorithms exist for text data, including Support Vector Machines (SVM), Naïve Bayes [4], Logistic Regression, K-Nearest Neighbors (KNN) [5], and neural network models [6], [7]. Furthermore, ensemble classifiers, which combine several base classifiers, are another technique for text classification; Gradient Boosting is a popular ensemble classifier for this task [8]. Feature extraction is a crucial step that should be carried out before classification. It can enhance the prediction accuracy of the classifier by finding more discriminant representations or via dimensionality reduction techniques [9]. Although much research has been done on English document classification, the developed methods do not necessarily perform well for Persian document classification [10]. In what follows, we review previous work on feature extraction methods and Persian text classification. TF-IDF is a commonly used feature extraction method; it measures the importance of a word within a document relative to the whole dataset or corpus [11].
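To make the TF-IDF weighting concrete, the following is a minimal, self-contained sketch of the computation. The toy corpus and variable names are ours, not from the cited works; in practice one would use a library implementation such as scikit-learn's TfidfVectorizer, which also applies smoothing and normalization.

```python
import math

# Toy corpus standing in for a real dataset (illustrative only).
docs = [
    "text mining extracts knowledge from text",
    "classification assigns a document to a category",
    "feature extraction improves classification accuracy",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

def tf(term, doc):
    # Term frequency: raw count normalized by document length.
    return doc.count(term) / len(doc)

def idf(term):
    # Inverse document frequency: terms appearing in fewer
    # documents receive a higher weight.
    df = sum(1 for doc in tokenized if term in doc)
    return math.log(n_docs / df)

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

# "text" occurs twice in the first document and nowhere else,
# so it gets a high weight there.
print(round(tf_idf("text", tokenized[0]), 4))  # → 0.3662
```

A word that appears in every document gets an IDF of log(1) = 0 and is thus weighted out, which is why frequent function words contribute little even before explicit stop word removal.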
In [1], Farhoodi and Yari applied the TF-IDF technique to the Hamshahri dataset to determine whether SVM or KNN is more efficient for Persian document classification. They reported that the KNN algorithm outperforms the SVM classifier, and that its efficiency can be further improved by increasing the number of selected features and using the cosine similarity measure. In [10], researchers classified the Hamshahri news dataset using TF-IDF as the feature extraction method, employing entropy instead of stop word lists to remove stop words in the preprocessing step; they applied KNN and Naïve Bayes classifiers to examine the accuracy of their method. In [12], TF and TF-IDF are used for feature extraction when classifying articles from the IRNA news website. The authors applied Gaussian Naïve Bayes, Multinomial Naïve Bayes, Bernoulli Naïve Bayes, and SVM classifiers to identify the most accurate algorithm for Persian text classification, and reported that Multinomial Naïve Bayes is the most accurate, reaching a micro F1-score of 0.838530 using TF-IDF with stop word removal. In [13], researchers proposed an ensemble classifier consisting of SVM, KNN, and MLP classifiers to categorize two datasets, Reuters and Hamshahri, again using TF-IDF for feature extraction. The ensemble yielded better accuracy and efficiency on both datasets than SVM, KNN, and MLP individually. Jafari, Ezadi, Hossennejad, and Noohi [14] also worked on four news categories of the Hamshahri dataset, using a representative vector to explore its impact on the accuracy of Persian document classification. They found that high precision and recall can be achieved by removing more extraneous words and inserting a few words into the representative vector.
In [15], topic models are utilized for Persian text classification, and it is concluded that topic models can improve accuracy with respect to bag-of-words-based representations such as TF-IDF. Although the TF-IDF method is very common, it does not capture semantic relations between words. As a result, another feature extraction method was proposed to address this problem: the Word2Vec model, introduced by Mikolov et al. (2013) [16]. Researchers in [17] introduced a novel combined method that benefits from Word2Vec and Latent Dirichlet Allocation (LDA) to extract document features. This model considers both the relations between words and documents and the relations between topics and documents. They evaluated the model on the 20 Newsgroups dataset [18] with an SVM classifier and concluded that it outperforms the TF-IDF, Word2Vec, and LDA methods.
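The combined representation idea can be sketched as follows: a document vector is formed by concatenating an averaged word-embedding vector (Word2Vec-style, capturing word-document relations) with a topic-proportion vector (LDA-style, capturing topic-document relations). The tiny embeddings and topic mixture below are made-up values for illustration; in a real pipeline they would come from trained Word2Vec and LDA models, and the exact combination scheme in [17] may differ.

```python
# Hypothetical 3-dimensional word embeddings (illustrative values only).
embeddings = {
    "text":    [0.2, 0.1, 0.4],
    "mining":  [0.3, 0.0, 0.5],
    "methods": [0.1, 0.2, 0.3],
}

doc = ["text", "mining", "methods"]
dim = 3

# Word2Vec-style document vector: average of the word vectors.
avg = [sum(embeddings[w][i] for w in doc) / len(doc) for i in range(dim)]

# LDA-style topic proportions for the same document (hypothetical).
topics = [0.7, 0.2, 0.1]

# Combined feature vector that would be fed to a classifier such as SVM.
features = avg + topics
print([round(x, 2) for x in features])  # → [0.2, 0.1, 0.4, 0.7, 0.2, 0.1]
```

Concatenation is the simplest way to merge the two views; the embedding part supplies word-level semantics that TF-IDF lacks, while the topic part supplies corpus-level thematic structure.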