Clustering based two-stage text classification requiring minimal training data

Abstract:

Clustering-aided classification methods assume that clusters learned under the guidance of the initial training data can roughly characterize the underlying distribution of the data set. However, our experiments show that whether this assumption holds depends on both the separability of the data set and the size of the training set. The assumption is often violated on data sets with poor separability, especially when the initial training data are too few, and in that case clustering-based methods perform worse. In this paper, we propose a clustering-based two-stage text classification approach to address this problem. In the first stage, labeled and unlabeled data are clustered under the guidance of the labeled data, and a self-training style clustering strategy is then used to iteratively expand the training data under the guidance of an oracle or expert. In the second stage, discriminative classifiers are trained on the expanded labeled data set. Unlike other clustering-based methods, the proposed clustering strategy can cope effectively with data of poor separability. Furthermore, the proposed framework converts the problem of sparsely labeled text classification into a supervised one. Supervised classification models such as SVM can therefore be applied, and techniques proposed for supervised learning, such as feature selection, sampling methods, and data editing or noise filtering, can be used to further improve classification accuracy. Our experimental results demonstrate the effectiveness of the proposed approach, especially when the training data set is very small.
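The two stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the clustering algorithm or how the oracle is consulted, so this sketch assumes cosine similarity to class centroids for the guided clustering and an `oracle(index, guess)` callback that confirms or corrects the proposed label of the most confident candidates. A simple perceptron stands in for the SVM named in the abstract to keep the example dependency-free. All function names, the `rounds` and `per_round` parameters, and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def vectorize(text):
    """Turn a document into an L2-normalized bag-of-words vector."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {term: c / norm for term, c in counts.items()}

def dot(a, b):
    """Sparse dot product; equals cosine similarity for unit vectors."""
    return sum(w * b.get(term, 0.0) for term, w in a.items())

def class_centroids(vectors, labels):
    """One unit-length centroid per class, built from the labeled documents."""
    sums = {}
    for i, y in labels.items():
        acc = sums.setdefault(y, Counter())
        for term, w in vectors[i].items():
            acc[term] += w
    cents = {}
    for y, acc in sums.items():
        norm = math.sqrt(sum(w * w for w in acc.values())) or 1.0
        cents[y] = {term: w / norm for term, w in acc.items()}
    return cents

def expand_labels(vectors, seed_labels, oracle, rounds=3, per_round=2):
    """Stage 1 (assumed form): cluster around labeled data, grow the labeled set."""
    labels = dict(seed_labels)
    for _ in range(rounds):
        cents = class_centroids(vectors, labels)
        candidates = []
        for i, v in enumerate(vectors):
            if i in labels:
                continue
            sims = {y: dot(v, c) for y, c in cents.items()}
            guess = max(sorted(sims), key=sims.get)
            candidates.append((sims[guess], i, guess))
        if not candidates:
            break
        candidates.sort(reverse=True)  # most confident assignments first
        for _, i, guess in candidates[:per_round]:
            labels[i] = oracle(i, guess)  # oracle confirms or corrects the guess
    return labels

def predict(weights, v):
    """Return the class whose weight vector scores the document highest."""
    return max(sorted(weights), key=lambda y: dot(v, weights[y]))

def train_perceptron(vectors, labels, epochs=10):
    """Stage 2 stand-in: multi-class perceptron on the expanded labeled set."""
    weights = {y: Counter() for y in set(labels.values())}
    for _ in range(epochs):
        for i in sorted(labels):
            y, pred = labels[i], predict(weights, vectors[i])
            if pred != y:
                for term, w in vectors[i].items():
                    weights[y][term] += w
                    weights[pred][term] -= w
    return weights

# Toy corpus (hypothetical): two seed labels, four unlabeled documents.
docs = [
    "goal match team score",
    "stock market price trade",
    "team score win match",
    "market trade stock bank",
    "goal team win league",
    "bank price market loan",
]
truth = ["sports", "finance", "sports", "finance", "sports", "finance"]
vectors = [vectorize(d) for d in docs]
seed = {0: "sports", 1: "finance"}
expanded = expand_labels(vectors, seed, lambda i, guess: truth[i])
weights = train_perceptron(vectors, expanded)
print(predict(weights, vectorize("team match goal")))
```

Because the oracle sees only the few candidates proposed each round, the labeling effort stays bounded by `rounds * per_round` while the second-stage classifier trains on a larger labeled set than the original seeds.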

Published in: 2012 International Conference on Systems and Informatics (ICSAI2012)

Date of Conference: 19-20 May 2012

Date Added to IEEE Xplore: 25 June 2012

DOI: 10.1109/ICSAI.2012.6223496

Conference Location: Yantai, China

I. Introduction

The goal of automatic text classification is to assign documents to a number of predefined categories. It is of great importance because of the ever-growing number of text documents available in digital form in many real-world applications, such as web-page classification and recommendation, and email processing and filtering. Text classification has long been treated as a supervised learning task, and a large number of supervised learning algorithms have been developed, such as Support Vector Machines (SVM) [1], Naive Bayes [2], Nearest Neighbor [3], and neural networks [4]. A comparative study is given in [5]. SVM has been recognized as one of the most effective text classification methods. Furthermore, a number of techniques suited to supervised learning have been proposed to improve classification accuracy, such as feature selection, data editing and noise filtering, and sampling methods to counter bias.

References

1. Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In C. Nédellec and C. Rouveirol (Eds.), Proceedings of the European Conference on Machine Learning (pp. 137-142). Berlin: Springer.

2. Lewis, D. D. (1998). Naïve Bayes at forty: The independence assumption in information retrieval. ECML98.

3. Masand, B., Linoff, G., Waltz, D. (1992). Classifying news stories using memory based reasoning. 15th ACM SIGIR Conference, 59-64.

4. Ng, T. H., Goh, W. B., Low, K. L. (1997). Feature selection, perceptron learning, and a usability case study for text categorization. 20th ACM SIGIR Conference.

5. Yang, Y., Liu, X. (1999). A re-examination of text categorization methods. 22nd ACM SIGIR Conference.

6. Joachims, T. (1999). Transductive inference for text classification using support vector machines. In Proceedings of the 16th International Conference on Machine Learning (pp. 200-209). San Francisco: Morgan Kaufmann.

7. Blum, A., Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. In Proceedings of the 11th Annual Conference on Computational Learning Theory (pp. 92-100).

8. Nigam, K., McCallum, A. K., Thrun, S., Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3), 103-134.

9. Seeger, M. (2001). Learning with labeled and unlabeled data. Technical report, Edinburgh University.

10. Zeng, H. J., Wang, X. H., Chen, Z., Ma, W. Y. (2003). CBC: Clustering based text classification requiring minimal labeled data. Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM 2003).

11. Kyriakopoulou, A. (2008). Text classification aided by clustering: A literature review. Tools in Artificial Intelligence, 233-252.
