Electronic Thesis and Dissertation Repository

Thesis Format

Monograph

Degree

Master of Science

Program

Computer Science

Abstract

De novo peptide sequencing is an efficient approach to identifying peptides from tandem mass spectrometry (MS/MS) data. Compared with database search methods, de novo peptide sequencing is particularly effective at identifying novel peptide sequences. This thesis presents a new de novo sequencing model, DpNovo, comprising two deep learning models and a dynamic programming algorithm. The deep learning component learns features of a spectrum and assigns a score to each peak, and the dynamic programming algorithm determines the amino acid sequence path with the highest accumulated score. The output is a predicted sequence in which mass values stand in for intervals whose residues remain uncertain. DpNovo can reconstruct charge-two peaks at their charge-one positions, which significantly improves prediction accuracy for spectra with high charge states. In addition, the dynamic programming algorithm allows accurate predictions even when some signal peaks are missing. DpNovo has been tested on data from both the NIST and ProteomeXchange databases. The deep learning component demonstrated an excellent ability to identify both signal and noise peaks. The accuracy of the peptide sequence predictions obtained through the dynamic programming algorithm is comparable to that of other proposed de novo sequencing models.
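The two ideas named in the abstract, mapping charge-two peaks to charge-one positions and choosing the highest-scoring path by dynamic programming, can be illustrated with a minimal Python sketch. This is not DpNovo's implementation: the function names, the reduced residue table, the mass tolerance, and the hand-written peak scores (which stand in for the deep learning output) are all assumptions made for illustration. The sketch only links peaks separated by a single residue mass, so it does not reproduce the thesis's handling of missing peaks or uncertain mass intervals.

```python
PROTON = 1.007276  # proton mass in Da

# Monoisotopic residue masses (Da) for a few amino acids only (assumed subset).
RESIDUES = {"G": 57.02146, "A": 71.03711, "S": 87.03203,
            "V": 99.06841, "L": 113.08406}


def to_charge_one(mz, charge):
    """Map an m/z observed at `charge` to the equivalent charge-one m/z."""
    return mz * charge - (charge - 1) * PROTON


def best_path(peaks, tol=0.02):
    """Find the highest-scoring chain of peaks linked by residue masses.

    peaks: list of (charge-one m/z, score) pairs; the scores are assumed
    to come from some upstream peak-scoring model.
    Returns (accumulated score, residue string).
    """
    peaks = sorted(peaks)
    n = len(peaks)
    best = [score for _, score in peaks]  # best score of a path ending at i
    back = [None] * n                     # (previous peak index, residue)
    for j in range(n):
        for i in range(j):
            gap = peaks[j][0] - peaks[i][0]
            for aa, mass in RESIDUES.items():
                candidate = best[i] + peaks[j][1]
                if abs(gap - mass) <= tol and candidate > best[j]:
                    best[j] = candidate
                    back[j] = (i, aa)
    # Trace back from the peak with the highest accumulated score.
    end = max(range(n), key=lambda k: best[k])
    total = best[end]
    residues = []
    k = end
    while back[k] is not None:
        k, aa = back[k]
        residues.append(aa)
    return total, "".join(reversed(residues))


if __name__ == "__main__":
    # Three linked peaks (gaps of G and A) plus one unlinked noise peak.
    demo = [(100.0, 0.9), (157.02146, 0.8), (228.05857, 0.7), (180.0, 0.95)]
    print(best_path(demo))
```

In the demo, the noise peak at 180.0 has the highest individual score but cannot be linked to any other peak by a residue mass, so the accumulated score favours the three-peak chain. This is the sense in which an accumulated-score path can be more robust than picking peaks one at a time.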

Summary for Lay Audience

Tandem mass spectrometry (MS/MS) is an important tool for identifying peptides. Peptide identification approaches analyze the tandem mass spectra of peptide fragments to infer the original peptide sequence. There are two main methods for peptide identification: database search and de novo sequencing. Database search matches experimental data against theoretical spectra from a reference database, while de novo sequencing does not rely on a database and can identify novel amino acid sequences. Machine learning has recently been applied to de novo sequencing, and many machine learning-based approaches have been proposed. In this thesis, we propose a new model called DpNovo, which combines deep learning and dynamic programming. The deep learning model assigns a score to each peak, and the dynamic programming algorithm finds the amino acid sequence with the highest accumulated score. DpNovo can reconstruct charge-two peaks at their charge-one positions, which greatly improves prediction accuracy for spectra with high charge states. Additionally, the dynamic programming algorithm allows DpNovo to make accurate predictions even when some signal peaks are absent. The training dataset was acquired from the NIST database, and testing was conducted on data from the NIST and ProteomeXchange databases. The deep learning model has proved proficient at detecting both signal and noise peaks. Moreover, the accuracy of peptide sequence prediction is comparable to that of other proposed de novo sequencing models.
