1. Introduction
Text categorization techniques classify natural language texts into categories defined in advance by the researcher [1]. Rapid changes in information technology and the skyrocketing use of the internet have turned the available information into a huge mass of unorganized data. Consequently, sorting through large amounts of data has become a very important task. For this reason, many automatic algorithms have been developed in the past several years, such as the support vector machine [2]–[6], Naïve Bayes [3], [7], the artificial neural network [8]–[11], K-Nearest Neighbor [3], [12]–[13], and the scene labeling by relaxation algorithm [12]. Among them, the support vector machine is one of the more recent computational techniques, and the research of Joachims [4] indicated that it is well suited to text categorization because the input space is high dimensional, few features are irrelevant, document vectors are sparse, and most text categorization problems are linearly separable. Many studies have shown that this method achieves outstanding classification effectiveness [4], [6], [15]–[16].
For patents, the most widely used classification system is the International Patent Classification (IPC), established by the Strasbourg Agreement (1971). With the rapid increase in the number of patent applications, the workload of patent classification staff has increased as well. If computer technology could be used to handle the technical portions of patent classification work automatically, the efficiency and accuracy of patent classification would be greatly improved. However, automatic patent classification systems are used only in the patent offices of major countries, and there mainly as research projects; they are seldom used in practice [17].
Most previously published studies [17]–[19] on this subject considered the IPC system and developed automatic patent categorization models using artificial intelligence techniques (e.g., neural networks). However, most of them focused on computational models for the IPC system rather than on real-world cases of patent classification. In contrast, our approach to patent classification is based on expert screening, in which experts read the patent documents in detail. As a result, the support vector machine algorithm keeps modifying its classification logic for the patent documents under expert monitoring. Therefore, this research proposes a novel two-stage process that combines the support vector machine with an expert screening technique to build an automatic patent categorization system that overcomes these problems.
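To make the idea of combining a support vector machine with expert screening more concrete, the following is a minimal sketch and not the system developed in this research. It assumes a single binary category (+1 for "belongs to the category", -1 otherwise), bag-of-words features, and a simple Pegasos-style linear SVM written in plain Python as a stand-in for a full SVM implementation. The function names, the `threshold` parameter, and the `expert` callable that represents the human reviewer are all hypothetical and introduced only for illustration.

```python
from collections import Counter


def featurize(text):
    """Bag-of-words term counts for one document."""
    return Counter(text.lower().split())


def score(weights, feats):
    """Signed decision value of a linear classifier (no bias term)."""
    return sum(weights.get(term, 0.0) * count for term, count in feats.items())


def train_linear_svm(docs, labels, epochs=50, lam=0.01):
    """Pegasos-style subgradient descent on the hinge loss (labels are +1 / -1)."""
    weights = {}
    step = 0
    for _ in range(epochs):
        for text, y in zip(docs, labels):
            step += 1
            eta = 1.0 / (lam * step)            # decaying learning rate
            feats = featurize(text)
            margin = y * score(weights, feats)  # computed before this step's update
            for term in weights:                # regularization shrink
                weights[term] *= 1.0 - 1.0 / step
            if margin < 1.0:                    # hinge loss is active: push toward y
                for term, count in feats.items():
                    weights[term] = weights.get(term, 0.0) + eta * y * count
    return weights


def two_stage_classify(labeled, unlabeled, expert, threshold=1.0):
    """Stage 1: the SVM labels documents it scores confidently.
    Stage 2: low-confidence documents go to the expert, and the
    expert's labels are fed back into a retrained SVM."""
    docs = [text for text, _ in labeled]
    labels = [y for _, y in labeled]
    weights = train_linear_svm(docs, labels)

    results, reviewed = {}, []
    for text in unlabeled:
        s = score(weights, featurize(text))
        if abs(s) < threshold:          # assumption: small |score| means "uncertain"
            y = expert(text)            # hypothetical human screening step
            reviewed.append(text)
            docs.append(text)
            labels.append(y)
        else:
            y = 1 if s > 0 else -1
        results[text] = y

    if reviewed:                        # update the classifier with expert feedback
        weights = train_linear_svm(docs, labels)
    return results, reviewed, weights


# Toy data: +1 = mechanical-engineering patents, -1 = other patents.
labeled = [
    ("engine piston cylinder", 1),
    ("protein enzyme sequence", -1),
    ("engine crankshaft valve", 1),
    ("enzyme antibody protein", -1),
]
unlabeled = ["piston valve engine", "turbine blade rotor"]


def simulated_expert(text):
    """Stand-in for a human expert; always answers +1 in this toy example."""
    return 1


results, reviewed, weights = two_stage_classify(labeled, unlabeled, simulated_expert)
```

In this toy run, the document "turbine blade rotor" shares no vocabulary with the training set, so its score is zero and it is routed to the simulated expert; after retraining, the classifier has positive weights for those terms. A practical system would need multi-class handling for the patent categories, a better confidence measure, and a real review workflow.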