I. Introduction
Text categorization is the task of automatically assigning one or more predefined categories to a free text document. As a fundamental component of text information management, it has been studied extensively in recent years, and a number of effective machine learning approaches have been introduced, such as Bayesian classifiers [1], nearest neighbor classifiers [2], decision trees [1], rule learning [3], support vector machines (SVM) [4], ensemble learning methods [5], and neural networks [6]. Regardless of which learning technique is adopted, the first and most basic step of a text categorization task is to transform the text document into a representation more suitable for learning, a process called document representation. It usually comprises two steps: feature extraction and feature selection. The most commonly used feature extraction scheme is the vector space model (VSM) [7], in which a document is represented as a vector of terms. According to the term frequency (TF) or other useful statistics of each term, a document can thus be represented as a feature vector. The terms used to represent the documents form a dictionary, which is usually very large because of the inherent properties of natural language. For a given document, most terms in the dictionary do not appear, so most elements of its feature vector are zero. As a consequence, document representations suffer from sparsity and high dimensionality, and a feature selection procedure therefore becomes very important. Several feature selection methods have been proposed, such as document frequency (DF) thresholding, information gain (IG), mutual information (MI), and the χ² statistic [2]. 
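To make the VSM representation and its sparsity problem concrete, the following is a minimal sketch, assuming a small hypothetical corpus (not from the paper): each document becomes a term-frequency vector over the dictionary, and DF thresholding is shown as one of the feature selection methods mentioned above.

```python
from collections import Counter

# Toy corpus (hypothetical example for illustration only).
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell as the market closed",
]
docs = [doc.split() for doc in corpus]

# The dictionary: every distinct term occurring anywhere in the corpus.
dictionary = sorted({term for doc in docs for term in doc})

# VSM feature extraction: a document becomes a vector of term frequencies
# over the dictionary.
def tf_vector(doc):
    counts = Counter(doc)
    return [counts[term] for term in dictionary]

vectors = [tf_vector(doc) for doc in docs]

# Most dictionary terms do not appear in a given document, so most
# entries of each feature vector are zero (sparsity / high dimensionality).
zeros = sum(v.count(0) for v in vectors)
total = len(vectors) * len(dictionary)
print(f"{zeros}/{total} entries are zero")

# DF thresholding: keep only terms that occur in at least min_df documents.
def df_threshold(docs, min_df=2):
    df = Counter(term for doc in docs for term in set(doc))
    return sorted(t for t, n in df.items() if n >= min_df)

print(df_threshold(docs))  # only "cat" and "the" occur in >= 2 documents
```

Even on this toy corpus, well over half of the feature entries are zero; on a real corpus with tens of thousands of dictionary terms, the fraction is far higher, which is why feature selection matters.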
Two basic principles are usually followed when building a feature vector: (i) the more frequently a term occurs in a document, the more important it is to the category that document belongs to; (ii) the more frequently a term occurs across all documents in the corpus, the less important it is [8].