Master of Science
Dr Robert Mercer
During the last decade, the amount of research published in biomedical journals has grown significantly and at an accelerating rate. To fully explore all of this literature, new tools and techniques are needed for both information retrieval and processing. One such tool is the identification and extraction of key claims. In an effort to work toward claim extraction, we aim to identify the key areas in the body of an article that are referred to by text in its abstract. The work in this project is preliminary to that goal: we attempt to match specific clauses in the abstract with the section of the article body to which they refer. For our data, we use journal articles from PubMed with structured abstracts. Our technique is based on the cosine measure of feature vectors using a bag-of-words approach. We refine our technique through the application of five different experimental variables: feature weighting, word- and bi-gram-based feature sets, text pre-processing, fixed-expression filtering, and different classifier heuristics. We found that the choice of classifier dominates all other considerations; while the classifier's performance is synergistic with feature weighting, the other variables were found to have little or no effect.
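As a minimal sketch of the cosine measure over bag-of-words feature vectors described above, the following Python example assigns an abstract clause to the body section whose word counts are most similar. It is illustrative only and is not the implementation used in the thesis: the function names (`bag_of_words`, `cosine`, `best_section`), the tokenizer, the sample texts, and the "pick the highest score" rule are all assumptions, and it uses raw term counts with no feature weighting, bi-grams, pre-processing, or fixed-expression filtering.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count word tokens (raw term frequency).

    The regular-expression tokenizer is an assumption made for this sketch.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def best_section(clause, sections):
    """Return the section name with the highest cosine score, plus all scores.

    Choosing the single highest-scoring section is a hypothetical heuristic.
    """
    clause_vec = bag_of_words(clause)
    scores = {
        name: cosine(clause_vec, bag_of_words(body))
        for name, body in sections.items()
    }
    return max(scores, key=scores.get), scores


# Invented sample data, not taken from the thesis corpus.
sections = {
    "Methods": "Patients were randomly assigned to receive the drug "
               "or a placebo for twelve weeks.",
    "Results": "Mean blood pressure fell significantly in the drug group "
               "compared with placebo.",
}
clause = "Blood pressure fell significantly compared with placebo."

label, scores = best_section(clause, sections)
print(label, scores)
```

In this toy example the clause shares only one word with the Methods text but all of its words with the Results text, so the Results section scores higher.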
Bugorski, Arthur T., "Cosine Similarity for Article Section Classification: Using Structured Abstracts as a Proxy for an Annotated Corpus" (2014). Electronic Thesis and Dissertation Repository. 2154.