Improving Chinese text categorization by outlier learning

Abstract:

Text categorization is a typical machine learning task that suffers from incomplete training data. A main cause is the presence of outliers in the training data, such as nonsense documents, documents that are mislabeled or lie on the border between categories, and documents that fall outside the defined categories. Outlier learning techniques can therefore be adopted to improve text categorization. This paper proposes an outlier-learning-based text categorization system in which the AdaBoost algorithm is used to identify outliers. Simulation results show that the new system improves learning performance for text categorization.

Published in: 2005 International Conference on Natural Language Processing and Knowledge Engineering

Date of Conference: 30 October 2005 - 01 November 2005

Date Added to IEEE Xplore: 27 February 2006

Print ISBN: 0-7803-9361-9

DOI: 10.1109/NLPKE.2005.1598808

Conference Location: Wuhan, China

I. Introduction

Text categorization, which assigns one or more predefined categories to a free text according to its content, has become an important and basic component of text information management. It has been studied for several years, and a number of efficient machine learning approaches have been applied to it, including Bayesian classifiers [1], nearest neighbor classifiers [2], decision trees [1], rule learning [3], support vector machines (SVM) [4], ensemble learning methods [5], and neural networks [6].

In machine learning, incomplete data is a major problem. Training data can be incomplete for many reasons, such as mislabeling, bias, omissions, insufficiency, imbalance, noise, and outliers. As a typical machine learning task, text categorization consistently suffers from incomplete training data. Among these causes, outliers may be the most important factor for text categorization. An outlier is a pattern that was either mislabeled in the training data or is inherently ambiguous and hard to recognize [7]. In text categorization, the training data usually contain many outliers, for example nonsense documents, documents that are mislabeled or lie on the border between categories, and documents that fall outside the defined categories. Text categorization must therefore address the outlier problem to reach high performance. However, the traditional machine learning approaches listed above generally do not explicitly account for outliers.
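The excerpt does not spell out how AdaBoost is used to identify outliers, so the following is only a minimal sketch of one common heuristic, not the authors' method. It relies on the fact that AdaBoost increases the weight of training examples that its weak learners keep misclassifying, so mislabeled or ambiguous documents tend to accumulate unusually large weights. The toy data, the function names (`train_stump`, `adaboost_weights`, `flag_outliers`), the decision-stump weak learner, and the `factor` cutoff are all assumptions made for illustration.

```python
import math

def train_stump(X, y, w):
    """Return (weighted_error, feature, threshold, polarity) of the best decision stump."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for pol in (1, -1):
                # The stump predicts `pol` when the feature exceeds thr, else `-pol`.
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if (pol if row[j] > thr else -pol) != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, pol)
    return best

def stump_predict(stump, row):
    _, j, thr, pol = stump
    return pol if row[j] > thr else -pol

def adaboost_weights(X, y, rounds=5):
    """Run discrete AdaBoost and return the final normalized example weights."""
    n = len(X)
    w = [1.0 / n] * n
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = max(stump[0], 1e-10)          # guard against log/division by zero
        if err >= 0.5:                      # weak learner no better than chance
            break
        alpha = 0.5 * math.log((1 - err) / err)
        # Misclassified examples (y * prediction == -1) gain weight.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return w

def flag_outliers(X, y, rounds=5, factor=3.0):
    """Flag examples whose weight exceeds `factor` times the uniform weight 1/n."""
    w = adaboost_weights(X, y, rounds)
    cutoff = factor / len(X)
    return [i for i, wi in enumerate(w) if wi > cutoff]

# Hypothetical toy data: one feature per document (count of a category keyword),
# label +1 = in category, -1 = not. Document 6 is deliberately mislabeled.
DOCS = [[0], [1], [2], [3], [7], [8], [9], [10]]
LABELS = [-1, -1, -1, -1, 1, 1, -1, 1]

# A single boosting round is enough to expose the mislabeled document here.
print(flag_outliers(DOCS, LABELS, rounds=1))  # -> [6]
```

On real corpora the weights fluctuate from round to round, so the number of rounds and the cutoff would need tuning, and flagged documents would typically be reviewed, relabeled, or down-weighted before retraining the categorizer.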

Cites in Patents (1)

Renders, Jean-Michel; Privault, Caroline; Menuge, Ludovic, "INTERACTIVE CLEANING FOR AUTOMATIC DOCUMENT CLUSTERING AND CATEGORIZATION," Patent No. 7711747
