Electronic Thesis and Dissertation Repository

Thesis Format

Integrated Article


Doctor of Philosophy


Computer Science


Lizotte, Daniel J

2nd Supervisor

Sedig, Kamran



This dissertation explores how machine learning, natural language processing, and information retrieval may assist the exploratory search task. Exploratory search is a search in which the ideal outcome is unknown, and thus the ideal language to use in a retrieval query to match it is unavailable. Three algorithms represent the contribution of this work. Archetype-based Modeling and Search provides a way to use previously identified archetypal documents to form a notion of similarity and find related documents that match the defined archetype. This is beneficial for exploratory search as it can generalize beyond standard keyword matching. By training word embeddings to generate vector representations of all words in the archetypal document vocabulary, and then training an author representation that is a conglomeration of these word representations, a similarity metric can be constructed for searching. Unclassified author representations from new corpora can then be directly classified by machine learning algorithms, compared, and ranked, allowing this technique to be used to search document collections. Archetype-based Information Retrieval provides a way to extract the keywords most associated with archetypal author representations. This allows integration with keyword-based information retrieval systems that use a probabilistic relevancy score to retrieve more pertinent results. Lastly, Archetype-based Temporal Language Adaptive Stratification makes use of the scoring of the previous algorithms and adapts it to transitions over time between archetypal states, such as depressive episodes. This algorithm is specialized to find these temporal transitions between archetypes (e.g., depressed and not depressed) and identify language associated with the transition.
In concert with Public Health Ottawa, these techniques have been used to 1) identify the language online that is related to the opioid epidemic and the individuals suffering from addiction, and 2) estimate the number of individuals matching this archetype within the catchment area for Public Health Ottawa. These techniques have also been used to identify language associated with depression on social media, and a partially synthetic case study describes how to use this to look for transitions between depressive states.
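
The idea behind Archetype-based Modeling and Search can be illustrated with a minimal sketch. It assumes that an author representation is simply the mean of that author's word vectors and that similarity is measured with cosine similarity; the dissertation's "conglomeration" and trained representations may differ, and the classification step is not shown. The word vectors, author names, and function names below are hypothetical toy values chosen for illustration, not data or code from this work.

```python
import math

# Toy 2-dimensional word vectors standing in for trained word embeddings.
# These values are hypothetical and chosen only to make the example readable.
WORD_VECTORS = {
    "pain": [0.9, 0.1],
    "dose": [0.8, 0.2],
    "withdrawal": [0.95, 0.05],
    "game": [0.1, 0.9],
    "score": [0.2, 0.8],
}


def author_vector(tokens, word_vectors):
    """Average the vectors of the tokens that have an embedding.

    Averaging is an assumption made for this sketch. Returns None when
    none of the tokens are in the vocabulary.
    """
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_authors(archetype_tokens, candidates, word_vectors):
    """Rank candidate authors by similarity to the archetype representation."""
    archetype = author_vector(archetype_tokens, word_vectors)
    if archetype is None:
        return []
    scored = []
    for name, tokens in candidates.items():
        vec = author_vector(tokens, word_vectors)
        if vec is not None:
            scored.append((name, cosine(archetype, vec)))
    # Most similar authors first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical archetypal vocabulary and unclassified authors.
ranking = rank_authors(
    ["pain", "dose", "withdrawal"],
    {
        "author_a": ["dose", "pain"],
        "author_b": ["game", "score"],
        "author_c": ["pain", "game"],
    },
    WORD_VECTORS,
)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy setting, authors whose language overlaps with the archetypal vocabulary rank above those whose language does not, even when they share no exact keyword sequence with the archetype, which is the sense in which the approach can generalize beyond keyword matching.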

Summary for Lay Audience

Advances in machine learning and natural language processing have changed the way that we can search data. When performing a search, such as an online web search using sites such as Google, we are required to input keywords to retrieve content that is relevant to us. This is an example of a look-up style search, where the relevant language is known and there is some idea of what the results should look like. On the other hand, there is exploratory search, which has less definite results and in most cases the relevant vocabulary is partially known at best.

One of the techniques in this dissertation, Archetype-based Modeling and Search, provides a way for a machine learning model to learn the relevant vocabulary from documents that were previously identified as being relevant. By using notions of similarity approximated through machine learning, the system is able to find complex associations between words, and an approximation of the concepts they express, and use these to find relevant documents. However, this process is computationally demanding, and can be a bit of a ‘black box’ when trying to understand the decisions made by the machine learning system. The next technique, Archetype-based Information Retrieval, builds upon the first by extracting the keywords which best explain the decisions being made. We then show how these keywords can be used in a normal information retrieval system, which means that the task of forming a query has been changed from thinking of relevant words to identifying sets of documents which contain keywords thought to be relevant. The last technique, Archetype-based Temporal Language Adaptive Stratification, extends the previous two techniques to better identify behaviours that change over time, and then analyzes the transition to see what language is associated with that change.

The first two techniques were demonstrated in coordination with Public Health Ottawa to examine the state of the opioid epidemic in their local catchment area as it was represented on social media. We estimate the number of individuals active on Reddit that match this profile and use this to estimate the population prevalence. The last technique was developed while working on analyzing depression on social media during the COVID-19 pandemic, and uses partially synthetic data to avoid the ethical complexities of analyzing and reporting on an individual’s experience with depression.

Creative Commons License

Creative Commons Attribution-Noncommercial 4.0 License
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 4.0 License.