Analogy-based Assessment of Domain-specific Word Embeddings


Abstract:

The ability of word embeddings to identify shared semantic regularities between word pair categories such as capital-country has led to the use of analogies as a method of validating word embedding models. Further research has shown that, relative to the complete breadth of possible analogy categories, only a limited subset of categories is accessible, in terms of accuracy, to current analogy equations executed against word embeddings trained on generalized, non-domain-specific text corpora. As most, if not all, domain-specific, scientific analogy pairs belong to problematic analogy categories (i.e., the lexicographical and the encyclopedic), we examine the degree to which a domain-specific text corpus and vocabulary improve analogy predictions from word embeddings. Our findings demonstrate that, in comparison to analogy-based tests performed against general word embeddings, predictions by domain-specific word embeddings outperform in exactly those analogy categories that are highly problematic and in which domain knowledge resides.
Published in: 2020 SoutheastCon
Date of Conference: 28-29 March 2020
Date Added to IEEE Xplore: 13 November 2020
Conference Location: Raleigh, NC, USA

I. Introduction

Analogies are recognized as a method for validating word embeddings [16], [22]. Typically, both the word embeddings and the analogy test sets are built from generalized text corpora and generalized vocabularies. Recent research has examined the performance of word embeddings built from domain-specific text corpora and trained using domain-specific vocabularies [7]. Our research tests the hypothesis that word embeddings built from a domain-specific Earth science corpus and trained using a domain-specific vocabulary will better predict domain-specific Earth science analogies than tests of non-domain-specific analogies against word embeddings produced from generalized corpora. Further, we test the hypothesis that the improvement in predictions will occur in precisely those categories of analogical relationships in which most, if not all, domain knowledge is found.
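
For concreteness, the sketch below shows how such an analogy prediction is typically computed with the 3CosAdd formulation of Mikolov et al. [13], [14], using the Gensim library [17] cited in the references. It is a minimal illustration rather than the paper's evaluation code: the embedding file name and the example analogy pairs are assumptions made only for demonstration.

from gensim.models import KeyedVectors

# Load pre-trained word vectors in word2vec format. The file name is an
# illustrative assumption; any general or domain-specific embedding in this
# format (e.g., word2vec [14] or converted GloVe [16] vectors) would work.
vectors = KeyedVectors.load_word2vec_format("earth_science_vectors.bin", binary=True)

def predict_analogy(a, a_star, b, topn=5):
    # 3CosAdd: answer "a is to a_star as b is to ?" by ranking words w
    # that maximize cos(w, a_star - a + b) [13].
    return vectors.most_similar(positive=[a_star, b], negative=[a], topn=topn)

# A capital-country analogy of the kind found in general-purpose test sets;
# "norway" should appear near the top of the returned list.
print(predict_analogy("athens", "greece", "oslo"))

# A hypothetical domain-flavored analogy (water : liquid :: ice : ?), shown
# only to illustrate how a lexicographic-style pair would be queried.
print(predict_analogy("water", "liquid", "ice"))

Running the same query against embeddings trained on a general corpus and on a domain-specific corpus permits the per-category comparison of prediction accuracy that this paper reports.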

References

1. AMS Glossary, 2019, [online] Available: http://glossary.ametsoc.org/wiki/Special:AllPages.
2. G. Bilder, CrossRef API, 2019, [online] Available: https://github.com/CrossRef/rest-api-doc.
3. S. Bird, E. Loper and E. Klein, Natural Language Processing with Python, Sebastopol, CA: O'Reilly Media, 2009.
4. M. A. Boden, The Creative Mind: Myths and Mechanisms, London: Routledge, 2004.
5. A. Clark, "Language, embodiment, and the cognitive niche", Trends in Cognitive Sciences, vol. 10, no. 8, pp. 370-374, Aug. 2006.
6. S. Federici, S. Montemagni and V. Pirrelli, "Inferring semantic similarity from distributional evidence: an analogy-based approach to word sense disambiguation", Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications, 1997.
7. S. Ghosh, P. Chakraborty, E. Cohn, J. S. Brownstein and N. Ramakrishnan, "Characterizing Diseases from Unstructured Text: A Vocabulary Driven Word2vec Approach", Proceedings of the 25th ACM International Conference on Information and Knowledge Management (CIKM '16), pp. 1129-1138, 2016.
8. A. Gladkova, A. Drozd and S. Matsuoka, "Analogy-based detection of morphological and semantic relations with word embeddings: what works and what doesn't", Proceedings of the NAACL Student Research Workshop, pp. 8-15, 2016.
9. D. A. Jurgens, S. M. Mohammad, P. D. Turney and K. J. Holyoak, "SemEval-2012 Task 2: Measuring Degrees of Relational Similarity", Proceedings of the 6th International Workshop on Semantic Evaluation (SemEval 2012), pp. 356-364, 2012.
10. A. Khatua, A. Khatua and E. Cambria, "A tale of two epidemics: Contextual Word2Vec for classifying twitter streams during outbreaks", Information Processing & Management, vol. 56, no. 1, pp. 247-257, Jan. 2019.
11. Y. Lepage and C. L. Goh, "Towards automatic acquisition of linguistic features", Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA 2009), pp. 118-125, 2009.
12. S. McGregor, M. Purver and G. Wiggins, "Words, Concepts, and the Geometry of Analogy", Proceedings of the 2016 Workshop on Semantic Spaces at the Intersection of NLP, Physics and Cognitive Science, pp. 39-48, 2016.
13. T. Mikolov, W. Yih and G. Zweig, "Linguistic regularities in continuous space word representations", Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746-751, 2013.
14. T. Mikolov, K. Chen, G. Corrado and J. Dean, "Efficient estimation of word representations in vector space", Proceedings of the International Conference on Learning Representations (ICLR), 2013.
15. Global Change Master Directory, 2019, [online] Available: https://gcmd.nasa.gov/index.html.
16. J. Pennington, R. Socher and C. D. Manning, "GloVe: Global Vectors for Word Representation", Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543, 2014.
17. R. Rehurek and P. Sojka, "Software Framework for Topic Modelling with Large Corpora", Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45-50, 2010.
18. L. Richardson, BeautifulSoup, 2019, [online] Available: https://www.crummy.com/software/BeautifulSoup/.
19. J. C. Ross, R. Murthy, K. K. Ganguli and P. Bhattacharyya, "Identifying Raga Similarity in Hindustani Classical Music through Distributed Representation of Raga Names", Proceedings of the 13th International Symposium on CMMR, 2017.
20. R. G. Raskin, "SWEET 2.1 Ontologies", AGU Fall Meeting Abstracts, 2010.
21. P. D. Turney and M. L. Littman, "Corpus-based Learning of Analogies and Semantic Relations", Machine Learning, vol. 60, pp. 251-278, Sep. 2005.
22. P. D. Turney, "Similarity of Semantic Relations", Computational Linguistics, vol. 32, no. 3, pp. 379-416, Sep. 2006.
23. P. D. Turney, "The latent relation mapping engine: algorithm and experiments", Journal of Artificial Intelligence Research, vol. 33, pp. 615-655, 2008.
24. E. Vylomova, L. Rimell, T. Cohn and T. Baldwin, "Take and Took, Gaggle and Goose, Book and Read: Evaluating the Utility of Vector Differences for Lexical Relation Learning", Aug. 2016, [online] Available: .
25. G. A. Wiggins, "A preliminary framework for description, analysis and comparison of creative systems", Knowledge-Based Systems, vol. 19, no. 7, pp. 449-458, Nov. 2006.