Thesis Format

Integrated Article

Measuring Enrichment Of Word Embeddings With Subword And Dictionary Information

Felipe Urra, The University of Western Ontario

Degree

Master of Science

Program

Computer Science

Supervisor

Mercer, Robert E.

Abstract

Word embedding models have been an important contribution to natural language processing; following the distributional hypothesis, "words used in similar contexts tend to have similar meanings", these models are trained on large corpora of natural text and use the contexts where words are used to learn the mapping of a word to a vector. Some later models have tried to incorporate external information to enhance the vectors. The goal of this thesis is to evaluate three models of word embeddings and measure how these enhancements improve the word embeddings. These models, along with a hybrid variant, are evaluated on several tasks of word similarity and a task of word analogy. Results show that fine-tuning the vectors with semantic information improves performance in word similarity dramatically; conversely, enriching word vectors with syntactic information increases performance in word analogy tasks, with the hybrid approach finding a solid middle ground.

Summary for Lay Audience

Word embedding models have been a rapidly growing field in Natural Language Processing in the last few years. Thanks to improvements in computational power and the development of neural networks, previously proposed models can now be implemented on modern computers. In simple terms, a word embedding is a mapping of words to vectors of real numbers, while a word embedding model is a structure or a program that builds said mapping. Being able to represent words as vectors of a hundred or so dimensions makes these models well suited to real applications. Moreover, word embeddings have been shown to have several interesting properties, such as placing words with similar meanings close together in the vector space and representing word analogies through vector operations, for example, king - man + woman ≈ queen.
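The analogy property can be illustrated with a minimal sketch. The three-dimensional vectors below are hypothetical values chosen by hand so that the arithmetic works out; they are not taken from any trained model, and the helper names `cosine` and `analogy` are introduced here only for illustration. Real embeddings have on the order of a hundred dimensions and are learned from a corpus.

```python
import math

# Hypothetical toy vectors, hand-picked for illustration only.
TOY_VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.2, 0.8],
    "apple": [0.5, 0.5, 0.5],
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' using the offset b - a + c.

    The three query words are excluded from the candidates, which is
    the usual convention in word analogy evaluations.
    """
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# king - man + woman lands closest to queen in this toy space.
answer = analogy("man", "king", "woman", TOY_VECTORS)
print(answer)
```

Cosine similarity is used here because it compares the direction of two vectors rather than their length, which is also how word similarity between embeddings is commonly scored.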

These models learn this information from word frequencies and from analyzing contexts to see which words are used together. The learning is founded on the distributional hypothesis: "words that occur in similar contexts tend to have similar meanings", also known as "a word is characterized by the company it keeps". In addition to using this hypothesis, some models derive extra information from the data or incorporate external information from other sources. One thing that makes the models difficult to analyze is that each of them uses different hyperparameter values during learning, such as the number of vector dimensions or the size of the contexts to analyze, among many others. The situation is made worse by the fact that the hyperparameter values are not universal: the performance of a model changes depending on the type or size of the data and the problem to solve.
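To make the idea of a context and its size concrete, here is a small sketch that counts which words appear within a fixed window around each word. This is a simple counting illustration of the distributional hypothesis, not the training procedure of word2vec or of the models evaluated in the thesis; the function name `context_counts`, the `window` parameter, and the two-sentence corpus are assumptions made for the example.

```python
from collections import Counter, defaultdict


def context_counts(tokens, window=2):
    """Count, for every word, the words seen within `window` positions of it."""
    counts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                counts[word][tokens[j]] += 1
    return counts


# Hypothetical toy corpus, already split into tokens.
corpus = "the cat drinks milk . the dog drinks water .".split()

narrow = context_counts(corpus, window=1)
wide = context_counts(corpus, window=2)

# With a window of 1, "cat" and "dog" share exactly the same contexts
# ("the" on the left, "drinks" on the right).
print(narrow["cat"] == narrow["dog"])

# With a window of 2, their contexts differ ("milk" versus "water", among others).
print(wide["cat"] == wide["dog"])
```

In this toy corpus the window size alone decides whether "cat" and "dog" look interchangeable, which gives a sense of why results can depend on hyperparameter choices.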

This thesis aims to evaluate several word embedding models based on a popular one called word2vec. Two of these models are novel: a simplified variant of an existing model and a hybrid combination of two other models. Word similarity and word analogy tasks are used in this evaluation. This thesis also aims to replicate some previously published results and hyperparameter values and to check whether the novel models can obtain competent results on both types of tasks.

Recommended Citation

Urra, Felipe, "Measuring Enrichment Of Word Embeddings With Subword And Dictionary Information" (2019). Electronic Thesis and Dissertation Repository. 6262.
https://ir.lib.uwo.ca/etd/6262

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
