Electronic Thesis and Dissertation Repository

Thesis Format

Monograph

Degree

Master of Science

Program

Biostatistics

Supervisor

Lizotte, Daniel J.

Abstract

The Structural Topic Model (STM) incorporates external information about expected document-topic proportions into topic estimation. Motivated by focus groups, whose transcripts represent text data inherently grouped by session, we propose three extensions to the STM: 1) mean document-topic proportion estimation using a regression with random effects; 2) partitioned estimation of group-specific topic covariance matrices; and 3) a post hoc mixed effects regression on topic prevalence that incorporates latent variable uncertainty into the coefficient estimates. We explore the utility of these modifications through simulated examples and apply them to focus group transcripts from a pan-Canadian study on homelessness. The new methods, collectively the “mixSTM”, improved topic model fit when there was complex group-related variation in topic prevalence and provided new avenues for interpretation. These methods may better represent analyst beliefs about qualities of grouped text data, although there is a risk of over-complicating the estimation given small, qualitative data sources.

Summary for Lay Audience

Health and social research often produces text data, for example when conducting focus groups. Topic modelling is a quantitative method capable of extracting information rapidly from large collections of text. To do so, topic models propose a theoretical generation mechanism for text which assumes the existence of “topics”: groups of words that tend to co-occur and thus share elements of meaning. Each document (text segment) in a collection is assumed to be composed of multiple topics in different amounts; the goal of topic modelling is to find both the topics and the amounts. The Structural Topic Model (STM) of Roberts et al. (2014) lets external information about each document (e.g., the author/speaker) influence how much the topics are expected to be used in each document. This thesis proposes modifications to the STM for settings where documents are grouped. One example of grouped documents is focus group transcripts: a transcript is composed of text segments that are more related to each other than to segments of other transcripts, so it forms an inherent grouping. We provide two extensions to the generative model of the STM and one for exploring the results of the topic model. We showed with simulated data that our changes to the generative model could estimate topic proportions closer to the truth than the STM, particularly if there were many external quantities whose relationship with topics varied between groups. We then applied our methods to focus group transcripts from a pan-Canadian study on homelessness, and showed that our results aligned with previous analyses of the data and revealed some additional word associations. Thus, these methods have the same advantages as the original STM, with potential benefits related to the incorporation of the grouping structure into estimation and the ability to interpret the output in light of the groups.
These extensions to the STM are applicable to many grouped-document settings in that they may better represent the beliefs that analysts have about how topics are distributed across documents.
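
As a rough illustration of the generation mechanism described above, the following Python sketch simulates grouped documents whose topic amounts depend on one document-level quantity plus a group-specific shift (a random intercept). It is not the thesis implementation: the function names, sizes, and variances are all hypothetical, the softmax over all topics is a simplification of the logistic-normal form used by the STM, and the noise terms are independent rather than drawn from a topic covariance matrix.

```python
import math
import random


def softmax(values):
    """Map real-valued scores to proportions that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def simulate_grouped_corpus(n_groups=3, docs_per_group=4, n_topics=3,
                            vocab_size=8, doc_length=20, seed=0):
    """Simulate documents whose topic proportions share a group-level shift.

    Hypothetical illustration only: all names and settings are assumptions.
    """
    rng = random.Random(seed)
    # Each topic is a distribution over word ids 0..vocab_size-1.
    topics = [softmax([rng.gauss(0.0, 1.0) for _ in range(vocab_size)])
              for _ in range(n_topics)]
    # Fixed effect of one document-level covariate on each topic's prevalence.
    slope = [rng.gauss(0.0, 1.0) for _ in range(n_topics)]
    corpus = []
    for group in range(n_groups):
        # Random intercept: a group-specific shift in each topic's prevalence.
        group_shift = [rng.gauss(0.0, 1.0) for _ in range(n_topics)]
        for _ in range(docs_per_group):
            covariate = rng.choice([0.0, 1.0])  # e.g. a binary speaker attribute
            scores = [group_shift[k] + slope[k] * covariate + rng.gauss(0.0, 0.5)
                      for k in range(n_topics)]
            proportions = softmax(scores)  # the "amounts" of each topic
            words = []
            for _ in range(doc_length):
                # Pick a topic for this word, then a word from that topic.
                topic = rng.choices(range(n_topics), weights=proportions)[0]
                word = rng.choices(range(vocab_size), weights=topics[topic])[0]
                words.append(word)
            corpus.append({"group": group, "covariate": covariate,
                           "proportions": proportions, "words": words})
    return corpus
```

Calling `simulate_grouped_corpus()` returns a list of documents, each recording its group, its covariate value, its true topic proportions, and its word ids. Topic modelling works in the opposite direction: given only the words (and, for the STM, the covariates), it estimates the topics and the proportions.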

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.

Included in

Biostatistics Commons
