Measuring Enrichment of Word Embeddings with Subword and Dictionary Information

Felipe Urra, The University of Western Ontario

Abstract

Word embedding models have been an important contribution to natural language processing. Following the distributional hypothesis, that "words used in similar contexts tend to have similar meanings," these models are trained on large corpora of natural text, using the contexts in which words appear to learn a mapping from each word to a vector. More recent models have tried to incorporate external information to enhance these vectors. The goal of this thesis is to evaluate three word embedding models and measure how these enhancements improve the resulting embeddings. The models, along with a hybrid variant, are evaluated on several word similarity tasks and a word analogy task. Results show that fine-tuning the vectors with semantic information dramatically improves performance on word similarity; conversely, enriching the vectors with syntactic information increases performance on word analogy tasks, with the hybrid approach finding a solid middle ground.
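As a rough illustration of the two evaluation settings mentioned above, the sketch below uses toy 4-dimensional vectors (hypothetical values, not embeddings produced by any of the evaluated models) to compute cosine similarity for a word pair and to answer an analogy query with the standard 3CosAdd offset method.

```python
import numpy as np

# Toy embeddings (hypothetical 4-d vectors, for illustration only).
vectors = {
    "king":  np.array([0.8, 0.6, 0.1, 0.2]),
    "queen": np.array([0.7, 0.6, 0.8, 0.2]),
    "man":   np.array([0.9, 0.1, 0.1, 0.3]),
    "woman": np.array([0.8, 0.1, 0.8, 0.3]),
}

def cosine(u, v):
    """Cosine similarity: the score typically used in word similarity tasks."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(a, b, c):
    """3CosAdd: return the word d maximizing cos(d, b - a + c),
    excluding the three query words (word analogy task)."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = {w: cosine(v, target) for w, v in vectors.items()
                  if w not in (a, b, c)}
    return max(candidates, key=candidates.get)

print(cosine(vectors["king"], vectors["queen"]))  # pairwise similarity score
print(analogy("man", "king", "woman"))            # expected: "queen"
```

In benchmarks of this kind, the cosine scores over a dataset of word pairs are usually compared against human similarity judgments (for example, via rank correlation), while analogy accuracy counts how often the predicted word matches the expected answer.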