
Exploratory Search with Archetype-based Language Models
Abstract
This dissertation explores how machine learning, natural language processing, and information retrieval can assist the exploratory search task. Exploratory search is search in which the ideal outcome is unknown, so the ideal language for a retrieval query that would match it is unavailable. Three algorithms represent the contribution of this work. Archetype-based Modeling and Search uses previously identified archetypal documents to form a notion of similarity and find related documents that match the defined archetype. This benefits exploratory search because it generalizes beyond standard keyword matching. By training word embeddings to generate vector representations of all words in the archetypal document vocabulary, and then training an author representation that is a conglomeration of these word representations, a similarity metric can be constructed for searching. Unclassified author representations from new corpora can then be directly classified by machine learning algorithms, compared, and ranked, allowing this technique to be used to search document collections. Archetype-based Information Retrieval provides a way to extract the keywords most associated with archetypal author representations. This allows integration with keyword-based information retrieval systems that use a probabilistic relevance score to retrieve more pertinent results. Lastly, Archetype-based Temporal Language Adaptive Stratification builds on the scoring of the previous algorithms and accounts for transitions over time between archetypal states, such as depressive episodes. This algorithm is specialized to find these temporal transitions between archetypes (e.g., depressed and not depressed) and identify language associated with the transition.
In concert with Public Health Ottawa, these techniques have been used to 1) identify online language related to the opioid epidemic and to individuals suffering from addiction, and 2) estimate the number of individuals matching this archetype within the catchment area of Public Health Ottawa. These techniques have also been used to identify language associated with depression on social media, and a partially synthetic case study illustrates how this can be used to look for transitions between depressive states.
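
The archetype-based ranking idea summarized above can be illustrated with a minimal sketch. It is not the dissertation's implementation: the toy two-dimensional word vectors, the use of a plain average as the author representation (the abstract describes a trained representation), the centroid used as the archetype, and all names such as `WORD_VECTORS` and `rank_authors` are assumptions made purely for illustration.

```python
import math

# Hypothetical toy embeddings; in practice these would be trained
# on the vocabulary of the archetypal documents.
WORD_VECTORS = {
    "pain": [0.90, 0.10],
    "pills": [0.80, 0.20],
    "withdrawal": [0.95, 0.05],
    "hockey": [0.10, 0.90],
    "weather": [0.05, 0.95],
}


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def author_vector(tokens):
    """Assumed author representation: the mean of the known word vectors."""
    known = [WORD_VECTORS[t] for t in tokens if t in WORD_VECTORS]
    return mean_vector(known) if known else None


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_authors(archetypal_docs, candidates):
    """Rank candidate authors by similarity to the archetype centroid."""
    archetype_vectors = [author_vector(doc) for doc in archetypal_docs]
    archetype = mean_vector([v for v in archetype_vectors if v is not None])
    scored = []
    for name, tokens in candidates.items():
        vec = author_vector(tokens)
        if vec is not None:  # skip authors with no in-vocabulary words
            scored.append((name, cosine(archetype, vec)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


archetypal_docs = [["pain", "pills"], ["withdrawal", "pills", "pain"]]
candidates = {
    "author_a": ["pills", "withdrawal"],
    "author_b": ["hockey", "weather"],
    "author_c": ["pain", "weather"],
}
ranking = rank_authors(archetypal_docs, candidates)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy setup, the author whose words sit closest to the archetypal vocabulary ranks first even without sharing every keyword with the archetypal documents, which is the sense in which such a similarity metric can generalize beyond exact keyword matching.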