Analogy-based Assessment of Domain-specific Word Embeddings


Abstract:

The ability of word embeddings to identify shared semantic regularities between word pair categories such as capital-country has led to the use of analogies as a method of validating word embedding models. Further research has shown that, relative to the complete breadth of possible analogy categories, only a limited subset of categories is accessible, in terms of accuracy, to current analogy equations executed against word embeddings trained on generalized, non-domain-specific text corpora. As most, if not all, domain-specific, scientific analogy pairs belong to problematic analogy categories (i.e., the lexicographical and the encyclopedic), we examine the degree to which a domain-specific text corpus and vocabulary improve analogy predictions from word embeddings. Our findings demonstrate that, in comparison to analogy-based tests performed against general word embeddings, predictions by domain-specific word embeddings outperform in exactly those analogy categories that are highly problematic and in which domain knowledge resides.
Published in: 2020 SoutheastCon
Date of Conference: 28-29 March 2020
Date Added to IEEE Xplore: 13 November 2020
Conference Location: Raleigh, NC, USA

I. Introduction

Analogies are recognized as a method for validating word embeddings [16], [22]. Typically, both the word embeddings and the analogy test sets are built from generalized text corpora and generalized vocabularies. Recent research has examined the performance of word embeddings built from domain-specific text corpora and trained using domain-specific vocabularies [7]. Our research tests the hypothesis that word embeddings built from a domain-specific Earth science corpus and trained using a domain-specific vocabulary will better predict domain-specific Earth science analogies than tests of non-domain-specific analogies against word embeddings produced from generalized corpora. Further, we test the hypothesis that the improvement in predictions will occur in precisely those categories of analogical relationships in which most, if not all, domain knowledge is found.
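
For concreteness, the sketch below shows how such an analogy prediction is typically computed with the 3CosAdd formulation of Mikolov et al. [13], [14], using the Gensim library [17] cited in the references. It is a minimal illustration rather than the paper's evaluation code: the embedding file name and the example analogy pairs are assumptions made only for demonstration.

from gensim.models import KeyedVectors

# Load pre-trained word vectors in word2vec format. The file name is an
# illustrative assumption; any general or domain-specific embedding in this
# format (e.g., word2vec [14] or converted GloVe [16] vectors) would work.
vectors = KeyedVectors.load_word2vec_format("earth_science_vectors.bin", binary=True)

def predict_analogy(a, a_star, b, topn=5):
    # 3CosAdd: answer "a is to a_star as b is to ?" by ranking words w
    # that maximize cos(w, a_star - a + b) [13].
    return vectors.most_similar(positive=[a_star, b], negative=[a], topn=topn)

# A capital-country analogy of the kind found in general-purpose test sets;
# "norway" should appear near the top of the returned list.
print(predict_analogy("athens", "greece", "oslo"))

# A hypothetical domain-flavored analogy (water : liquid :: ice : ?), shown
# only to illustrate how a lexicographic-style pair would be queried.
print(predict_analogy("water", "liquid", "ice"))

Running the same query against embeddings trained on a general corpus and on a domain-specific corpus permits the per-category comparison of prediction accuracy that this paper reports.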

References

1. AMS Glossary, 2019, [online] Available: http://glossary.ametsoc.org/wiki/Special:AllPages.
2. G. Bilder, CrossRef API, 2019, [online] Available: https://github.com/CrossRef/rest-api-doc.
3. S. Bird, E. Loper and E. Klein, Natural Language Processing with Python, Sebastopol, CA: O'Reilly Media, 2009.
4. M. A. Boden, The Creative Mind: Myths and Mechanisms, London: Routledge, 2004.
5. A. Clark, "Language, embodiment, and the cognitive niche", Trends in Cognitive Sciences, vol. 10, no. 8, pp. 370-374, Aug. 2006.
6. S. Federici, S. Montemagni and V. Pirrelli, "Inferring semantic similarity from distributional evidence: an analogy-based approach to word sense disambiguation", Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications, 1997.
7. S. Ghosh, P. Chakraborty, E. Cohn, J. S. Brownstein and N. Ramakrishnan, "Characterizing Diseases from Unstructured Text: A Vocabulary Driven Word2vec Approach", Proceedings of the 25th ACM International Conference on Information and Knowledge Management (CIKM '16), pp. 1129-1138, 2016.
8. A. Gladkova, A. Drozd and S. Matsuoka, "Analogy-based detection of morphological and semantic relations with word embeddings: what works and what doesn't", Proceedings of the NAACL Student Research Workshop, pp. 8-15, 2016.
9. D. A. Jurgens, S. M. Mohammad, P. D. Turney and K. J. Holyoak, "SemEval-2012 Task 2: Measuring Degrees of Relational Similarity", Proceedings of the 6th International Workshop on Semantic Evaluation (SemEval 2012), pp. 356-364, 2012.
10. A. Khatua, A. Khatua and E. Cambria, "A tale of two epidemics: Contextual Word2Vec for classifying twitter streams during outbreaks", Information Processing & Management, vol. 56, no. 1, pp. 247-257, Jan. 2019.
11. Y. Lepage and C. L. Goh, "Towards automatic acquisition of linguistic features", Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA 2009), pp. 118-125, 2009.
12. S. McGregor, M. Purver and G. Wiggins, "Words, Concepts, and the Geometry of Analogy", Proceedings of the 2016 Workshop on Semantic Spaces at the Intersection of NLP, Physics and Cognitive Science, pp. 39-48, 2016.
13. T. Mikolov, W. Yih and G. Zweig, "Linguistic regularities in continuous space word representations", Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746-751, 2013.
14. T. Mikolov, K. Chen, G. Corrado and J. Dean, "Efficient estimation of word representations in vector space", Proceedings of the International Conference on Learning Representations (ICLR), 2013.
15. Global Change Master Directory, 2019, [online] Available: https://gcmd.nasa.gov/index.html.
16. J. Pennington, R. Socher and C. D. Manning, "GloVe: Global Vectors for Word Representation", Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543, 2014.
17. R. Rehurek and P. Sojka, "Software Framework for Topic Modelling with Large Corpora", Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45-50, 2010.
18. L. Richardson, BeautifulSoup, 2019, [online] Available: https://www.crummy.com/software/BeautifulSoup/.
19. J. C. Ross, R. Murthy, K. K. Ganguli and P. Bhattacharyya, "Identifying Raga Similarity in Hindustani Classical Music through Distributed Representation of Raga Names", Proceedings of the 13th International Symposium on CMMR, 2017.
20. R. G. Raskin, "SWEET 2.1 Ontologies", AGU Fall Meeting Abstracts, 2010.
21. P. D. Turney and M. L. Littman, "Corpus-based Learning of Analogies and Semantic Relations", Machine Learning, vol. 60, pp. 251-278, Sep. 2005.
22. P. D. Turney, "Similarity of Semantic Relations", Computational Linguistics, vol. 32, no. 3, pp. 379-416, Sep. 2006.
23. P. D. Turney, "The latent relation mapping engine: algorithm and experiments", Journal of Artificial Intelligence Research, vol. 33, pp. 615-655, 2008.
24. E. Vylomova, L. Rimell, T. Cohn and T. Baldwin, "Take and Took, Gaggle and Goose, Book and Read: Evaluating the Utility of Vector Differences for Lexical Relation Learning", Aug. 2016, [online] Available: .
25. G. A. Wiggins, "A preliminary framework for description, analysis and comparison of creative systems", Knowledge-Based Systems, vol. 19, no. 7, pp. 449-458, Nov. 2006.