Phrase Ranking and Wikipedia Based Cluster Labeling


Abstract:

Automatically labeling document clusters with words that indicate their topics is a relatively new and active research field. The most frequently used process, labeling with the most frequent words in the clusters, tends to select several words that are virtually void of descriptive power even after traditional stop words are eliminated. Another procedure, labeling with the most anticipated words, often yields rather obscure results. We present Phrase Rank, a variation of the PageRank algorithm based on a relational graph representation of the content of web document collections. Phrase Rank separates phrases into discriminative, ambiguous, and common phrases, ranking discriminative phrases highest, followed by ambiguous phrases and then common phrases. Thus, a set of important text features is first extracted from the cluster documents. We then use these features to extract cluster labels from external knowledge sources such as the pre-categorized knowledge of Wikipedia. We experiment with a test dataset to demonstrate the efficacy of the Phrase Rank algorithm.
Date of Conference: 21-23 December 2013
Date Added to IEEE Xplore: 09 October 2014
Electronic ISBN: 978-0-7695-5013-8
Conference Location: Katra, India

I. Introduction

The amount of electronic information at one's disposal is increasing swiftly with the growth in digital processing. This huge quantity of textual information has created a need for efficient procedures that can organize the data into manageable arrangements. One of the most popular techniques for organizing textual information is the use of clustering algorithms, which group a set of documents into coherent clusters. These algorithms create clusters in which documents within a cluster are as similar as possible and documents in different clusters are as dissimilar as possible. In many applications of clustering, particularly user-interface-based applications, human users interact directly with the created clusters. In such settings we must label the clusters so that users can understand what each cluster is about. A great deal of research addresses clustering algorithms and their applications in information retrieval and data mining. However, comparatively little research has been done on labeling clusters. Using all phrases in a document to identify the most appropriate label explodes the feature space and gives rise to a crucial problem, the “curse of dimensionality”. To avoid this issue, a popular approach to cluster labeling is to employ a statistical approach for feature selection. The quality of the top features needs to be assured by a robust phrase ranking method.
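
To make the limitation of frequency-based labeling concrete, the following is a minimal sketch of that baseline: it labels a cluster with its most frequent words after stop-word removal. It is not the Phrase Rank algorithm. The function names, the tiny stop-word list, and the sample documents are all assumptions introduced here for illustration.

```python
from collections import Counter
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for illustration).
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to", "for", "on", "with"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def frequent_word_labels(cluster_docs, k=3, stop_words=STOP_WORDS):
    """Return the k most frequent non-stop words across a cluster's documents."""
    counts = Counter(
        token
        for doc in cluster_docs
        for token in tokenize(doc)
        if token not in stop_words
    )
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


# Made-up sample cluster (an assumption, not data from the paper).
cluster = [
    "The new engine of the car is new",
    "A new car with a hybrid engine",
    "New results on car engine design",
]
labels = frequent_word_labels(cluster, k=3)
print(labels)
```

In this toy cluster the top-ranked word is "new", which survives stop-word removal yet says little about the topic. This is the kind of weakly descriptive label that motivates a more robust phrase ranking method.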
