I. Introduction
Today, a vast amount of information is accessible on the internet, and it is essential that users be able to search for information according to their needs. However, traditional search engines typically respond to a user's short query with thousands of pages, which may even be ranked according to commercial goals. Such a system returns a large set of candidate answers ranked by the keywords of the query, and it is the user who must browse through this massive set to find the correct answer, if one exists. Frequently, the retrieved information differs greatly from the user's intended meaning. Moreover, it is difficult for most users to find appropriate answers among this mass of information, and doing so requires the skill and experience to condense a question into a few keywords. In contrast, a question answering system (QAS) accepts the user's question in natural language and extracts the appropriate answer with minimum redundancy and maximum accuracy. From one perspective, QASs can be divided into open-domain and domain-specific systems. Advances in open-domain QASs are reported yearly at the Text REtrieval Conference (TREC); by definition, such a system should be able to answer general questions by referring to a predetermined large collection of texts. In contrast, domain-specific systems are used in specialized fields such as medicine.

Question classification (QC) is one of the important processes in a QAS: it semantically categorizes each question according to the type of its expected answer. For example, the question "Who first went to the moon?" is assigned the "human" answer type, because that is the type of entity its answer refers to. The same holds for questions about places, colors, animals, and so on. Once a question has been classified, the search component looks only for paragraphs associated with the corresponding answer type, from which the appropriate answer is extracted.

This study presents a question classification system (QCS) for the Persian language. First, a dataset is needed to train the machine learning model; therefore, after Persian questions were gathered, an expert classified each question into coarse and fine categories, as required for supervised learning. In the feature extraction step, several features are extracted from each question: part of speech (POS), question informer (QI), question word, and the question tokens together with their positions in the sentence. The labels used in the current study are listed in Table I. A conditional random field (CRF) model is used to train the QCS. Each word of a question is placed on a separate line, with its corresponding features in the following columns; the last column holds the answer type chosen by the expert, repeated for every token of the question. The next question follows the same layout after a blank line, so throughout the dataset the answer-type label is repeated for every token of a question. This, however, introduces ambiguity when the classifier tries to predict a unique label for each question: under uncertainty, different answer types may be predicted for the tokens of the same question. As a solution, we apply majority voting over the predicted token labels, choosing the most frequent label as the final one, so that each question receives a unique answer-type label. Finally, once the data are prepared, the CRF model is trained and its accuracy is evaluated on the test set. A minimal sketch of this data layout and the voting step is given below.
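To make the data layout and the voting step concrete, the Python sketch below renders one labeled question in the token-per-line, feature-per-column format described above and then collapses per-token CRF predictions into a single answer type by majority voting. The column order, the feature values, and the example labels are illustrative assumptions, not the exact format of our dataset.

```python
from collections import Counter

def question_to_rows(tokens, pos_tags, informer_flags, answer_type):
    """Render one labeled question: one token per line, its features in the
    following columns, and the expert-chosen answer type repeated in the
    last column (column order here is an illustrative assumption)."""
    rows = []
    for position, (token, pos, informer) in enumerate(zip(tokens, pos_tags, informer_flags)):
        # columns: token, POS tag, question-informer flag, token position, answer-type label
        rows.append(f"{token}\t{pos}\t{informer}\t{position}\t{answer_type}")
    rows.append("")  # a blank line separates consecutive questions in the dataset
    return rows

def majority_vote(predicted_token_labels):
    """Collapse the CRF's per-token predictions into a single answer-type
    label for the whole question (the most frequent label wins)."""
    return Counter(predicted_token_labels).most_common(1)[0][0]

# Hypothetical training example for "Who first went to the moon?"
tokens = ["Who", "first", "went", "to", "the", "moon", "?"]
pos    = ["WP",  "RB",    "VBD",  "IN", "DT",  "NN",   "."]
qi     = ["O",   "O",     "O",    "O",  "O",   "QI",   "O"]  # question-informer flag
print("\n".join(question_to_rows(tokens, pos, qi, "Human")))

# At prediction time, mixed per-token labels are resolved by majority voting:
print(majority_vote(["Human", "Human", "Entity", "Human", "Location", "Human", "Human"]))
```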
Wei [1] proposed a classifier based on a support vector machine (SVM) and lists the features used in the feature extraction phase as "interrogative word; primary sememe, which is in HowNet, of first-degree and second-degree dependent word of interrogative word and named entity and singular/plural features". Dongwei [2] presented a QC method based on improved rules, selecting seven common question types. Zhang [3] used an SVM model for QC, training the classifier with sentence-word, POS, named-entity, and semantic features. Xia [4] adopted three strategies for extracting classification features: (1) using the focus words of a question, (2) using a domain attribute, and (3) using the combination of two domain attributes; they reported a two-step classifier combining a rule-based classifier with an SVM. Nguyen [5] used bag-of-words features in all their experiments and proposed semi-supervised learning to improve QC accuracy. Hejazi [6] used an ontological rule-based classifier to determine answer-type labels for questions; the labels are treated as question targets, and the identification of question types is grounded in a Persian ontology. Mohammadi-janghara [7] combined unigram and bigram models in a QAS for the biography domain: words with high discriminative power that relate to special categories, such as birth dates, are extracted from the question after careful weighting, collected into special dictionary files, and stored as keywords for each question type; Persian questions are then classified according to their answer types, namely short, descriptive, and listing. In [8], the VSNOW machine learning model is used for a hierarchical classification of English questions with six coarse and 50 fine categories. Wang [9] used semantic grams and an SVM for classifying Chinese questions, reporting a 20 percent increase in accuracy with semantic unigrams and bigrams compared with ordinary n-grams (unigrams and bigrams). Instead of a binary vector, Huang [10] used an SVM based on a word-weighting method, where word weighting is applied as a pre-processing step on the data following the notion of entropy in information retrieval. Lee [11] classified questions using an SVM with features such as the question informer (QI), bigrams, the first word of the question, the first two words of the question, and wh-question words. In [12], word segmentation, keyword and head-phrase extraction, semantic features such as HowNet, and some syntactic features are suggested as important steps in feature extraction. Metzler [13] suggested n-gram features, part of speech (POS), WordNet semantics, and named entity recognition as important features for system training. In [14], a QCS is trained using a two-layer feed-forward neural network with back-propagation; the significant features used in that study are query-text relevance, average word and phrase frequency, question length, and word and phrase variance diffusion.
The classification of what-type questions is proposed in [21], where the authors consider only nouns as semantic words. Each word in the question is tagged with a label by a conditional random field model, and the head noun's label is chosen as the question category. Features such as words, part of speech, chunker and parser information, question length, named entities, hypernyms, synsets, and transition features are used to train the English what-type question classifier.
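For contrast with the majority voting used in this work, the short sketch below illustrates the label-selection idea described for [21]: the CRF labels every token, and the label of the head noun becomes the question category. The head-noun index and the per-token labels here are hypothetical; in [21] the head noun is identified from parser and chunker information.

```python
def head_noun_label(predicted_labels, head_noun_index):
    """Illustrative only: take the label the CRF assigned to the head noun
    as the category of the whole what-type question (the strategy in [21]).
    Locating the head noun is assumed to be done upstream by a parser/chunker."""
    return predicted_labels[head_noun_index]

# Hypothetical example: "What river flows through Paris?"
labels = ["O", "Location", "O", "O", "O", "O"]  # made-up per-token CRF output
print(head_noun_label(labels, head_noun_index=1))  # -> Location
```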
TABLE I. Coarse- and fine-grained question categories.
Coarse category (count) | Fine categories (count)
Abbreviation (14) | abbreviation (3), explanation (11)
Description (2781) | manner (464), reason (728), definition (437), description (1152)
Human (412) | title (14), person (238), job (1), speech (31), group (111), other (17)
Location (404) | city (60), country (53), sea (33), side (1), state (9), mountain (27), other (221)
Numeric (437) | num (114), date (155), period (30), length (27), percent (37), weight (4), money (28), temperature (6), size (2), rank (3), height (1), distance (4), count (12), code (5), other (9)
Entity (952) | word (12), vehicle (4), tools (39), term (67), object (3), multimedia (137), material (102), linguistic (18), language (23), knowledge (1), literature (1), symbol (9), symbol (8), religion (64), product (33), plant (18), food (20), event (37), color (8), body (48), animal (34), medicine (33), method (32), other (201)