Electronic Thesis and Dissertation Repository

Thesis Format

Integrated Article


Master of Science


Computer Science


Mercer, Robert E.


Word embedding models have been an important contribution to natural language processing. Following the distributional hypothesis, "words used in similar contexts tend to have similar meanings", these models are trained on large corpora of natural text and use the contexts in which words occur to learn a mapping from each word to a vector. Some later models incorporate external information to enhance the vectors. The goal of this thesis is to evaluate three word embedding models and measure how these enhancements improve the word embeddings. These models, along with a hybrid variant, are evaluated on several word similarity tasks and a word analogy task. Results show that fine-tuning the vectors with semantic information dramatically improves performance on word similarity; conversely, enriching word vectors with syntactic information increases performance on word analogy tasks, with the hybrid approach finding a solid middle ground.

Summary for Lay Audience

Word embedding models have become a rapidly growing area of Natural Language Processing in the last few years. Thanks to advances in computational power and the development of neural networks, previously proposed models can now be implemented on modern computers. In simple terms, a word embedding is a mapping of words to vectors of real numbers, while a word embedding model is a structure or a program that builds that mapping. The ability to represent words as vectors of a hundred or so dimensions makes these models practical for real applications. Moreover, word embeddings have been shown to have several interesting properties, such as placing words with similar meanings close together in the vector space and representing word analogies through vector operations, for example, king - man + woman ≈ queen.
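
The analogy property mentioned above can be sketched with a toy example. The 3-dimensional vectors below are invented purely for illustration; real embeddings such as those produced by word2vec have hundreds of dimensions learned from large corpora.

```python
import math

# Invented toy vectors for illustration only; not learned from data.
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "man":   [0.5, 0.1, 0.1],
    "woman": [0.5, 0.1, 0.9],
    "queen": [0.9, 0.8, 0.9],
}

def cosine(u, v):
    # Cosine similarity: 1.0 means the vectors point the same way.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# king - man + woman should land near queen in the vector space.
target = [k - m + w for k, m, w in
          zip(embeddings["king"], embeddings["man"], embeddings["woman"])]
nearest = max(embeddings, key=lambda word: cosine(embeddings[word], target))
print(nearest)  # queen
```

In practice the nearest neighbour is found among hundreds of thousands of vocabulary words, and the query words themselves are usually excluded from the search, but the arithmetic is the same.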

These models learn this information from word frequencies and from analyzing contexts to discover which words are used together. The learning is founded on the distributional hypothesis: "words that occur in similar contexts tend to have similar meanings", also summarized as "a word is characterized by the company it keeps". In addition to using this hypothesis, some models derive extra information from the data or incorporate external information from other sources. One thing that makes these models difficult to compare is that each uses different hyperparameter values during learning, such as the number of vector dimensions or the size of the contexts to analyze, among many others. The problem is compounded by the fact that hyperparameter values are not universal: a model's performance changes with the type and size of the data and the problem to be solved.

This thesis evaluates several word embedding models based on a popular one called word2vec. Two of these models are novel: a simplified variant of an existing model and a hybrid combination of two other models. Word similarity and word analogy tasks are used in this evaluation. The thesis also aims to replicate some previously published results and hyperparameter values, and to check whether the novel models can obtain competitive results on both types of tasks.

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.