Text Clustering with Feature Selection by Using Statistical Data


Abstract:

Feature selection is an important method for improving the efficiency and accuracy of text categorization algorithms by removing redundant and irrelevant terms from the corpus. In this paper, we propose a new supervised feature selection method, named CHIR, which is based on the χ² statistic and new statistical data that can measure the positive term-category dependency. We also propose a new text clustering algorithm, named text clustering with feature selection (TCFS). TCFS can incorporate CHIR to identify relevant features (i.e., terms) iteratively, so that the clustering becomes a learning process. We compared TCFS and the K-means clustering algorithm in combination with different feature selection methods on various real data sets. Our experimental results show that TCFS with CHIR has better clustering accuracy in terms of the F-measure and the purity.
Published in: IEEE Transactions on Knowledge and Data Engineering (Volume: 20, Issue: 5, May 2008)
Page(s): 641 - 652
Date of Publication: 31 March 2008


1 Introduction

How to explore and make use of the huge number of text documents is a major question in information retrieval and text mining. Document clustering is one of the most important text mining methods developed to help users navigate, summarize, and organize text documents effectively. By organizing a large number of documents into a number of meaningful clusters, document clustering can be used to browse a collection of documents or to organize the results that a search engine returns in response to a user's query. It can significantly improve precision and recall in information retrieval systems [18], and it is an efficient way to find the nearest neighbors of a document [3]. The problem of document clustering is generally defined as follows: given a set of documents, partition them into a predetermined or automatically derived number of clusters such that the documents assigned to each cluster are more similar to each other than to the documents assigned to different clusters. In other words, the documents in one cluster share the same topic, and the documents in different clusters represent different topics.
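
The problem definition above can be illustrated with a small sketch. It is not TCFS or CHIR, whose details are not given in this excerpt. It is plain K-means (the baseline named in the abstract) over bag-of-words vectors with cosine similarity, followed by a purity score computed with the common textbook definition, which may differ in detail from the one used in the paper. The function names, the four toy documents, their class labels, and the choice of seed documents are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def tf_vector(text):
    """Bag-of-words term-frequency vector for one document."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)  # Counter.update adds counts term by term
    return {term: weight / len(vectors) for term, weight in total.items()}


def kmeans(vectors, seed_indices, max_iter=20):
    """Plain K-means with cosine similarity; the caller picks the seed documents."""
    centroids = [dict(vectors[i]) for i in seed_indices]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each document goes to its most similar centroid.
        new_labels = [
            max(range(len(centroids)), key=lambda j: cosine(vec, centroids[j]))
            for vec in vectors
        ]
        if new_labels == labels:  # assignments stable, so stop
            break
        labels = new_labels
        # Update step: recompute each centroid from its current members.
        for j in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = centroid(members)
    return labels


def purity(labels, classes):
    """Fraction of documents that fall in the majority class of their cluster."""
    majority_total = 0
    for cluster in set(labels):
        members = [c for lab, c in zip(labels, classes) if lab == cluster]
        majority_total += Counter(members).most_common(1)[0][1]
    return majority_total / len(labels)


# Toy corpus (invented for illustration): two topics, two documents each.
docs = [
    "stock market shares trading",
    "market trading stock prices",
    "football match goal team",
    "team goal football league",
]
classes = ["finance", "finance", "sport", "sport"]

vectors = [tf_vector(d) for d in docs]
labels = kmeans(vectors, seed_indices=[0, 2])
print(labels, purity(labels, classes))  # prints: [0, 0, 1, 1] 1.0
```

In this toy case each cluster contains documents from a single topic, so the purity is 1.0. Putting all four documents into one cluster would give a purity of 0.5, which shows how the measure penalizes clusters that mix topics.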
