I. Introduction
Text categorization is the task of automatically assigning one or more predefined categories to a free text document. As a fundamental component of text information management, it has been studied extensively in recent years, and a number of effective machine learning approaches have been introduced, such as Bayesian classifiers [1], nearest neighbor classifiers [2], decision trees [1], rule learning [3], support vector machines (SVM) [4], ensemble learning methods [5], and neural networks [6]. Regardless of which learning technique is adopted, the first and most basic step of a text categorization task is to transform the text document into a representation more suitable for learning, a process called document representation. It usually comprises two steps: feature extraction and feature selection. The most commonly used feature extraction scheme is the vector space model (VSM) [7], in which a document is represented as a vector of terms. According to the term frequency (TF) or other useful statistics of each term, a document can thus be represented as a feature vector. The terms used to represent the documents form a dictionary, which is usually very large because of the inherent properties of natural language. For a given document, most terms in the dictionary do not appear, so most elements of its feature vector are zero. As a consequence, document representations suffer from sparsity and high dimensionality, and a feature selection procedure therefore becomes very important. Several feature selection methods have been proposed, such as document frequency (DF) thresholding, information gain (IG), mutual information (MI), and the χ² statistic [2]. 
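To make the VSM representation and its sparsity problem concrete, the following is a minimal sketch, assuming a small hypothetical corpus (not from the paper): each document becomes a term-frequency vector over the dictionary, and DF thresholding is shown as one of the feature selection methods mentioned above.

```python
from collections import Counter

# Toy corpus (hypothetical example for illustration only).
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell as the market closed",
]
docs = [doc.split() for doc in corpus]

# The dictionary: every distinct term occurring anywhere in the corpus.
dictionary = sorted({term for doc in docs for term in doc})

# VSM feature extraction: a document becomes a vector of term frequencies
# over the dictionary.
def tf_vector(doc):
    counts = Counter(doc)
    return [counts[term] for term in dictionary]

vectors = [tf_vector(doc) for doc in docs]

# Most dictionary terms do not appear in a given document, so most
# entries of each feature vector are zero (sparsity / high dimensionality).
zeros = sum(v.count(0) for v in vectors)
total = len(vectors) * len(dictionary)
print(f"{zeros}/{total} entries are zero")

# DF thresholding: keep only terms that occur in at least min_df documents.
def df_threshold(docs, min_df=2):
    df = Counter(term for doc in docs for term in set(doc))
    return sorted(t for t, n in df.items() if n >= min_df)

print(df_threshold(docs))  # only "cat" and "the" occur in >= 2 documents
```

Even on this toy corpus, well over half of the feature entries are zero; on a real corpus with tens of thousands of dictionary terms, the fraction is far higher, which is why feature selection matters.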
Two basic principles are usually followed when building a feature vector: (i) the more frequently a term occurs in a document, the more important it is to the category that document belongs to; (ii) the more frequently a term occurs across all documents in the corpus, the less important it is [8].