Electronic Thesis and Dissertation Repository

A Deep Topical N-gram Model and Topic Discovery on COVID-19 News and Research Manuscripts

Yuan Du, The University of Western Ontario

Abstract

Topic modeling with latent semantic analysis (LSA), latent Dirichlet allocation (LDA) and the biterm topic model (BTM) has been successfully applied in many areas, including movie reviews, recommender systems, and text summarization. However, these models may become computationally intensive when applied to a very large corpus. Considering the wide acceptance of machine learning based on deep neural networks, this research proposes two deep neural network (NN) variants of the LDA modeling technique, a 2-layer NN and a 3-layer NN. The primary goal is to handle large corpora using manageable computational resources.

This thesis analyzes two datasets related to COVID-19 to explore their underlying structures. The first dataset includes over 7,000 CBC COVID-19 related news articles for the period of January 9, 2020 to May 3, 2020. The second dataset, called CORD-19, includes over 100,000 research manuscripts related to COVID-19 for the period of January 2, 2020 to August 1, 2020. For the first dataset, 14 topics, including “traveling”, “lockdown” and “masks”, were identified as the focus of social media attention for the period of January to May of 2020. For the second dataset, 17 topics, including “vaccine”, “treatment” and “social distancing”, were identified as the focus of research articles for the period of January to August of 2020. Compared to the traditional LDA, our proposed model requires less computation time and shows better performance.