Electronic Thesis and Dissertation Repository

Thesis Format

Integrated Article

Degree

Master of Science

Program

Computer Science

Collaborative Specialization

Artificial Intelligence

Abstract

Topic models such as latent semantic analysis (LSA), latent Dirichlet allocation (LDA), and the biterm topic model (BTM) have been successfully applied in many areas, including movie reviews, recommender systems, and text summarization. However, these models can become computationally intensive when applied to a very large corpus. Given the wide acceptance of machine learning based on deep neural networks, this research proposes two deep neural network (NN) variants of the LDA modeling technique: a 2-layer NN and a 3-layer NN. The primary goal is to handle large corpora with manageable computational resources.

This thesis analyzes two datasets related to COVID-19 to explore their underlying structures. The first dataset includes over 7,000 CBC COVID-19 related news articles for the period of January 9, 2020 to May 3, 2020. The second dataset, called CORD-19, includes over 100,000 research manuscripts related to COVID-19 for the period of January 2, 2020 to August 1, 2020. In the first dataset, we identified 14 topics, including “traveling”, “lockdown”, and “masks”, as the focus of social media attention during the period of January to May of 2020. In the second dataset, we identified 17 topics, including “vaccine”, “treatment”, and “social distancing”, as the focus of research articles for the period of January to August of 2020. Compared to the traditional LDA, our proposed model requires less computation time and shows better performance.

Summary for Lay Audience

Topic modeling is an unsupervised machine learning technique that detects the structures of words and phrases in documents. It is one of the most powerful techniques for text mining. Topic modeling provides us with methods to organize, understand, and summarize large amounts of textual information. It helps us discover hidden thematic patterns in a collection and annotate documents. Topic modeling has been widely used in applications, and a large number of articles have been published in various fields such as software engineering, political science, medical science, and linguistics. For example, topic modeling has been applied to analyze information collected by social media websites such as Twitter and Facebook.

In this thesis, we analyze two datasets related to COVID-19. We use natural language processing (NLP) methods to identify topics and keywords related to COVID-19 from news articles and research manuscripts. In the first dataset, we identified 14 topics, including “traveling”, “lockdown”, and “masks”, as the focus of social media attention during the period of January to May of 2020. In the second dataset, we identified 17 topics, including “vaccine”, “treatment”, and “social distancing”, as the focus of research articles for the period of January to August of 2020. Compared to the traditional LDA, our proposed model requires less computation time and shows better performance.
