Text Clustering with Feature Selection by Using Statistical Data

Abstract:

Feature selection is an important method for improving the efficiency and accuracy of text categorization algorithms by removing redundant and irrelevant terms from the corpus. In this paper, we propose a new supervised feature selection method, named CHIR, which is based on the χ² statistic and a new statistical measure of positive term-category dependency. We also propose a new text clustering algorithm, named text clustering with feature selection (TCFS). TCFS can incorporate CHIR to identify relevant features (i.e., terms) iteratively, so that the clustering becomes a learning process. We compared TCFS and the K-means clustering algorithm in combination with different feature selection methods on various real data sets. Our experimental results show that TCFS with CHIR achieves better clustering accuracy in terms of the F-measure and the purity.
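As a rough illustration of why a measure of positive dependency is useful alongside the χ² statistic, the sketch below computes the standard χ² value for a 2×2 term/category contingency table together with an observed-to-expected co-occurrence ratio. The function names, the ratio, and the toy counts are assumptions made for this example; they are not taken from the paper and should not be read as the definition of CHIR.

```python
def chi_square(a, b, c, d):
    """Standard chi-square statistic for a 2x2 term/category table.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table: no evidence of dependency
    return n * (a * d - b * c) ** 2 / denom


def observed_over_expected(a, b, c, d):
    """Hypothetical helper: observed co-occurrence count divided by the
    count expected under independence. Values above 1 suggest a positive
    dependency, values below 1 a negative one."""
    n = a + b + c + d
    expected = (a + b) * (a + c) / n
    return a / expected if expected else 0.0


# Toy counts (invented for illustration).
positive = (30, 10, 20, 40)  # term appears mostly inside the category
negative = (10, 30, 40, 20)  # term appears mostly outside the category

chi_pos = chi_square(*positive)
chi_neg = chi_square(*negative)
ratio_pos = observed_over_expected(*positive)
ratio_neg = observed_over_expected(*negative)
```

In this toy case both tables yield the same χ² value, while the ratio separates them (1.5 versus 0.5). The χ² statistic alone measures the strength of a dependency but not its direction.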

Published in: IEEE Transactions on Knowledge and Data Engineering ( Volume: 20, Issue: 5, May 2008)

Page(s): 641 - 652

Date of Publication: 31 March 2008

DOI: 10.1109/TKDE.2007.190740

1 Introduction

How to explore and utilize the large and growing volume of text documents is a major question in information retrieval and text mining. Document clustering is one of the most important text mining methods developed to help users effectively navigate, summarize, and organize text documents. By organizing a large number of documents into a number of meaningful clusters, document clustering can be used to browse a collection of documents or to organize the results returned by a search engine in response to a user's query. It can significantly improve the precision and recall of information retrieval systems [18], and it is an efficient way to find the nearest neighbors of a document [3]. The problem of document clustering is generally defined as follows: given a set of documents, we would like to partition them into a predetermined or automatically derived number of clusters, such that the documents assigned to each cluster are more similar to each other than to the documents assigned to different clusters. In other words, the documents in one cluster share the same topic, and the documents in different clusters represent different topics.
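The clustering criterion stated above can be checked on a toy example. The sketch below assumes a bag-of-words representation and cosine similarity, neither of which is prescribed by the text at this point; the documents, the labels, and the helper names are invented for illustration.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words term counts for a whitespace-tokenized document."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_pair_similarity(docs, labels, same_cluster):
    """Average similarity over document pairs that share a cluster
    (same_cluster=True) or lie in different clusters (False)."""
    sims = [
        cosine(docs[i], docs[j])
        for i in range(len(docs))
        for j in range(i + 1, len(docs))
        if (labels[i] == labels[j]) == same_cluster
    ]
    return sum(sims) / len(sims) if sims else 0.0


# Toy corpus with a hypothetical two-cluster assignment.
texts = [
    "stock market prices fall",
    "market prices rise on stock rally",
    "team wins football match",
    "football team loses match",
]
labels = [0, 0, 1, 1]

docs = [bow(t) for t in texts]
within = mean_pair_similarity(docs, labels, same_cluster=True)
between = mean_pair_similarity(docs, labels, same_cluster=False)
```

For a partition that satisfies the definition, `within` should exceed `between`. Here the two topics share no terms, so the between-cluster similarity is zero.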
