Electronic Thesis and Dissertation Repository

Thesis Format

Monograph

Degree

Master of Science

Program

Computer Science

Supervisor

Mercer, Robert E.

Abstract

The exponential increase of available documents online makes document classification an important application in natural language processing. The goal of text classification is to automatically assign categories to documents. Traditional text classifiers depend on features, such as, vocabulary and user-specified information which mainly relies on prior knowledge. In contrast, deep learning automatically learns effective features from data instead of adopting human-designed features. In this thesis, we specifically focus on biomedical document classification. Beyond text information from abstract and title, we also consider image and table captions, as well as paragraphs associated with images and tables, which we demonstrate to be an important feature source to our method.

Summary for Lay Audience

Text classification, especially document classification, is a process of assigning categorical labels to each document. In recent years, the number of digital documents, for instance scientific publications, continues to increase exponentially. The huge amount of open data on the internet is critical for research, and it is important to index them in a proper way. Manual indexing is inefficient and costs a lot of time and money, so automatic document indexing is a burgeoning field of research.

Deep learning has become an essential part of artificial intelligence due to the improvements in parallel computing and supporting hardware. Deep learning has performed exceptionally in many domains, such as computer vision and natural language processing. Compared to traditional machine learning algorithms, deep learning is more flexible and requires less domain knowledge when dealing with the tasks. It can automatically learn effective features from data instead of adopting human-designed features.

In this thesis, we deal with large scale document classification in an automatic way using deep learning approaches, and we specifically focus on biomedical document classification. Beyond text information from the abstract and title, we also consider image and table captions, as well as paragraphs associated with images and tables, which we demonstrate to be an important feature source for our method.

Share

COinS