Electronic Thesis and Dissertation Repository

Thesis Format



Doctor of Philosophy


Computer Science


Zhang, Kaizhong


The systematic study of proteins has gradually become fundamental to research in molecular biology. Shotgun proteomics uses bottom-up techniques to identify the proteins contained in complex mixtures by coupling high-performance liquid chromatography with mass spectrometry. Current mass spectrometers, with their high sensitivity and accuracy, can produce thousands of tandem mass spectrometry (MS/MS) spectra in a single run. The large amount of data collected in a single LC-MS/MS run requires effective computational approaches to automate spectrum interpretation. De novo peptide sequencing from MS/MS spectra has emerged as an important technology for peptide sequencing in proteomics. However, the low identification rate of the acquired mass spectra limits the efficiency of computational approaches. To increase the accuracy and practicality of de novo sequencing, some previous algorithms used multiple spectra to identify the peptide sequence.

In this thesis, we focus on de novo sequencing from multiple SILAC-labeled tandem mass spectra. In contrast to previous approaches, we develop de novo sequencing algorithms based on a different idea of how to use multiple spectra. SILAC uses media containing different isotope-labeled essential amino acids, usually arginine (R) and lysine (K), to label newly synthesized proteins with stable isotopes during cell growth. When SILAC samples are processed by LC-MS/MS shotgun proteomics, the spectrometer produces multiple MS/MS spectra for the same peptide sequence. Based on factors such as the type of isotope labeling, the retention time, and the precursor ion mass, multiple spectra carrying different SILAC modifications of the same peptide can be grouped and used together to identify the peptide sequence. In this study, we aim not only to identify the peptide sequence with its specific SILAC modifications, but also to pinpoint the locations of those modifications from multiple SILAC-labeled MS/MS spectra. We propose two de novo sequencing algorithms to compute the peptide sequence: one based on the total number of SILAC modifications, and one based on the combinations of SILAC modifications of arginine (R) and lysine (K). Both are dynamic programming algorithms that identify the peptide sequence and locate its SILAC modifications; the resulting candidates are ranked by similarity scores, and refinement algorithms are then applied. Finally, a confidence score is designed to evaluate all of the candidate sequences.
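The role of precursor ion mass in relating differently labeled spectra can be illustrated with a small sketch. It is not the thesis algorithm: it only shows how the mass difference between a light and a heavy precursor constrains the number of labeled lysines and arginines. The label masses (Lys-8 and Arg-10), the function names, and the tolerance are assumptions chosen for illustration.

```python
from itertools import product

# Assumed heavy-label mass shifts in daltons (Lys 13C6 15N2, Arg 13C6 15N4).
# These are common SILAC reagents, not values taken from the thesis.
HEAVY_SHIFT = {"K": 8.014199, "R": 10.008269}


def precursor_shift(peptide, charge=1):
    """Expected m/z difference between the fully heavy and light forms of a peptide."""
    total = sum(HEAVY_SHIFT[aa] * peptide.count(aa) for aa in HEAVY_SHIFT)
    return total / charge


def candidate_label_counts(observed_shift, max_sites=3, tol=0.01):
    """List (num_K, num_R) pairs whose combined shift matches an observed
    neutral mass difference within tol daltons (hypothetical helper)."""
    hits = []
    for num_k, num_r in product(range(max_sites + 1), repeat=2):
        if num_k + num_r == 0:
            continue  # an unlabeled pair carries no information
        expected = num_k * HEAVY_SHIFT["K"] + num_r * HEAVY_SHIFT["R"]
        if abs(expected - observed_shift) <= tol:
            hits.append((num_k, num_r))
    return hits


if __name__ == "__main__":
    print(precursor_shift("AKDR", charge=2))
    print(candidate_label_counts(18.022468))
```

Under these assumptions, a mass difference of about 18.02 Da is consistent only with one labeled lysine plus one labeled arginine, which is the kind of constraint a sequencing algorithm can exploit when choosing among candidate sequences.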

To verify the performance of our algorithms, we compare the experimental results. We also compare the candidates output by our approach with those of PEAKS de novo.

Summary for Lay Audience

The systematic study of proteins has grown rapidly in importance in the field of molecular biology. Knowing the sequence of amino acids in each protein can help us infer its structure, and therefore its role in normal cell development and in diseased tissues. Currently, proteins within complex mixtures can be identified using a combination of techniques, including chromatography and mass spectrometry. The latter can break down and label numerous short amino acid sequences, producing a large body of data that requires an appropriate computational approach to interpret; this interpretation is also the rate-limiting step in improving the efficiency of the method. Here, we developed two de novo sequencing algorithms for multiple tandem mass spectra labeled by stable isotope labeling by amino acids in cell culture (SILAC), a technique that incorporates isotope-labeled amino acids into newly synthesized proteins. We can identify the protein sequence and locate the SILAC modifications at the same time, and both results are validated using currently available algorithms.