Electronic Thesis and Dissertation Repository


Doctor of Philosophy


Computer Science


Robert E. mercer


During the last decade biomedicine has developed at a tremendous pace. Every day a lot of biomedical papers are published and a large amount of new information is produced. To help enable automated and human interaction in the multitude of applications of this biomedical data, the need for Natural Language Processing systems to process the vast amount of new information is increasing. Our main purpose in this research project is to extract the relationships between genotypes and phenotypes mentioned in the biomedical publications. Such a system provides important and up-to-date data for database construction and updating, and even text summarization. To achieve this goal we had to solve three main problems: finding genotype names, finding phenotype names, and finally extracting phenotype--genotype interactions. We consider all these required modules in a comprehensive system and propose a promising solution for each of them taking into account available tools and resources.

BANNER, an open source biomedical named entity recognition system, which has achieved good results in detecting genotypes, has been used for the genotype name recognition task. We were the first group to start working on phenotype name recognition. We have developed two different systems (rule-based and machine-learning based) for extracting phenotype names from text. These systems incorporated the available knowledge from the Unified Medical Language System metathesaurus and the Human Phenotype Onotolgy (HPO). As there was no available annotated corpus for phenotype names, we created a valuable corpus with annotated phenotype names using information available in HPO and a self-training method which can be used for future research. To solve the final problem of this project i.e. , phenotype--genotype relationship extraction, a machine learning method has been proposed. As there was no corpus available for this task and it was not possible for us to annotate a sufficiently large corpus manually, a semi-automatic approach has been used to annotate a small corpus and a self-training method has been proposed to annotate more sentences and enlarge this corpus. A test set was manually annotated by an expert. In addition to having phenotype-genotype relationships annotated, the test set contains important comments about the nature of these relationships. The evaluation results related to each system demonstrate the significantly good performance of all the proposed methods.