Clustering Documents using the Document to Vector Model for Dimensionality Reduction

Abstract:

The TF-IDF model is the most common way of representing documents in the vector space. However, its representations are high-dimensional, which poses problems for classic clustering algorithms because of the curse of dimensionality. Recent techniques based on word embeddings can reduce the dimensionality of document representations while preserving the semantic relationships between words. In this paper, we analyze the accuracy of four classical clustering algorithms (K-Means, Spherical K-Means, LDA, and DBSCAN) in combination with the Document to Vector model.

Published in: 2020 IEEE International Conference on Automation, Quality and Testing, Robotics (AQTR)

Date of Conference: 21-23 May 2020

Date Added to IEEE Xplore: 01 July 2020

DOI: 10.1109/AQTR49680.2020.9129967

Conference Location: Cluj-Napoca, Romania

I. Introduction

Clustering is a data mining task that aims to group similar objects based on a dissimilarity measure. One of its many uses is textual data analysis through document clustering, which plays an important role in document retrieval, web search, and spam filtering [9]. Many methods for clustering documents have been proposed, most of which represent the corpus as a term frequency-inverse document frequency (TF-IDF) matrix to which a chosen clustering method is applied [3]. However, the TF-IDF model has several drawbacks: it considers neither the semantic similarities between words nor the word order, and it produces a high-dimensional representation that often must be reduced using Principal Component Analysis or similar techniques [9]. The Paragraph Vector, or Document to Vector (Doc2Vec), model [9] overcomes these disadvantages by representing words as n-dimensional vectors learnt from the word context.
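
The two TF-IDF drawbacks named above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names (tfidf_matrix, cosine), the toy corpus, the whitespace tokenizer, and the unsmoothed weighting tf * log(N / df) are all assumptions chosen for illustration. The sketch shows that each document vector has one dimension per vocabulary term, and that two documents expressing the same idea with synonyms receive a similarity of zero.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocabulary, rows); each row is a dense TF-IDF vector.

    Assumes every document contains at least one token.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        # Relative term frequency times unsmoothed inverse document frequency.
        rows.append([
            (counts[term] / len(toks)) * math.log(n_docs / df[term])
            for term in vocab
        ])
    return vocab, rows


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical toy corpus: the first two documents are near-paraphrases.
docs = [
    "the car drives fast",
    "the automobile moves quickly",
    "the cat sleeps",
]
vocab, rows = tfidf_matrix(docs)

# One dimension per distinct term, so dimensionality grows with the vocabulary.
print(len(vocab), "dimensions for", len(docs), "documents")

# The paraphrases share only "the", whose weight is log(3/3) = 0.
print("similarity of paraphrases:", cosine(rows[0], rows[1]))
```

On a realistic corpus the vocabulary runs to tens of thousands of terms, which is the dimensionality problem that a fixed-size embedding such as Doc2Vec is meant to avoid.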
