Electronic Thesis and Dissertation Repository

Thesis Format



Master of Science


Computer Science

Collaborative Specialization



Kaizhong Zhang


In recent years, glycopeptide sequencing from tandem mass spectrometry (MS/MS) data has made remarkable progress. Various software tools have been developed and are widely used for protein identification. Estimating the false discovery rate (FDR) has become essential for evaluating the performance of glycopeptide scoring algorithms. The target-decoy strategy, which involves constructing decoy databases, is currently the most widely used method for FDR calculation. In this study, we applied various decoy construction algorithms to generate decoy glycan databases and proposed a novel approach that calculates the FDR using the expectation-maximization (EM) algorithm and a mixture model.
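As a rough illustration of the target-decoy strategy, the sketch below estimates the FDR at a score threshold as the ratio of decoy matches to target matches above that threshold. It assumes separate searches against target and decoy databases of comparable size and that higher scores are better; the function name `target_decoy_fdr` and the example scores are hypothetical and not taken from this work.

```python
def target_decoy_fdr(target_scores, decoy_scores, threshold):
    """Estimate FDR at `threshold` as (#decoy matches) / (#target matches).

    Assumes higher scores are better and that decoy matches model the
    incorrect target matches (an assumption of the target-decoy strategy).
    """
    n_target = sum(1 for s in target_scores if s >= threshold)
    n_decoy = sum(1 for s in decoy_scores if s >= threshold)
    if n_target == 0:
        # No accepted target matches, so there is nothing to be false.
        return 0.0
    # The ratio can exceed 1 for very low thresholds; cap it at 1.
    return min(1.0, n_decoy / n_target)


# Hypothetical scores, for illustration only.
example_targets = [10.0, 9.0, 8.0, 3.0, 2.0]
example_decoys = [4.0, 2.0, 1.0]
example_fdr = target_decoy_fdr(example_targets, example_decoys, threshold=3.0)
```

With these made-up scores, four target matches and one decoy match pass the threshold, giving an estimated FDR of 0.25.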

Summary for Lay Audience

In recent years, a growing number of glycopeptide identification software tools have been developed that score glycopeptides and identify them from tandem mass spectrometry data. However, because the results may contain mistakes, false discovery rate (FDR) estimation plays a key role in assessing how much confidence to place in them. The target-decoy approach, which requires building a decoy database, is one of the most effective methods for calculating FDR. In this study, we explored a novel method for generating decoy databases based on the probability of glycan compositions in the target database, and compared it with other decoy construction methods. In addition, since the distribution of target matches can be a mixture of correct and incorrect matches, we developed a new FDR estimation approach that uses the EM algorithm with a mixture model.
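The mixture idea can be sketched as follows, under strong simplifying assumptions: target match scores are modeled as a two-component one-dimensional Gaussian mixture (lower-scoring component for incorrect matches, higher-scoring component for correct ones), and the FDR at a threshold is taken as the average posterior probability of the incorrect component among accepted matches. The Gaussian choice, the function names, and the initialization are illustrative assumptions, not the model used in this study.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_two_gaussians(scores, n_iter=200):
    """Fit a two-component Gaussian mixture with EM (illustrative only).

    Returns (pi0, (mu0, s0), (mu1, s1)), where component 0 is initialized
    on the lower half of the scores (assumed incorrect matches) and
    component 1 on the upper half (assumed correct matches).
    """
    xs = sorted(scores)
    n = len(xs)
    lower, upper = xs[: n // 2], xs[n // 2:]
    mu0 = sum(lower) / len(lower)
    mu1 = sum(upper) / len(upper)
    s0 = s1 = (xs[-1] - xs[0]) / 4.0 or 1.0
    pi0 = 0.5
    for _ in range(n_iter):
        # E-step: posterior probability that each score is "incorrect".
        resp = []
        for x in xs:
            a = pi0 * normal_pdf(x, mu0, s0)
            b = (1.0 - pi0) * normal_pdf(x, mu1, s1)
            resp.append(a / (a + b) if a + b > 0.0 else 0.5)
        # M-step: re-estimate weights, means, and standard deviations.
        w0 = max(sum(resp), 1e-12)
        w1 = max(n - sum(resp), 1e-12)
        mu0 = sum(r * x for r, x in zip(resp, xs)) / w0
        mu1 = sum((1.0 - r) * x for r, x in zip(resp, xs)) / w1
        s0 = max(math.sqrt(sum(r * (x - mu0) ** 2 for r, x in zip(resp, xs)) / w0), 1e-6)
        s1 = max(math.sqrt(sum((1.0 - r) * (x - mu1) ** 2 for r, x in zip(resp, xs)) / w1), 1e-6)
        pi0 = w0 / n
    return pi0, (mu0, s0), (mu1, s1)


def mixture_fdr(scores, threshold, params):
    """Mean posterior probability of 'incorrect' among scores >= threshold."""
    pi0, (mu0, s0), (mu1, s1) = params
    accepted = [x for x in scores if x >= threshold]
    if not accepted:
        return 0.0
    total = 0.0
    for x in accepted:
        a = pi0 * normal_pdf(x, mu0, s0)
        b = (1.0 - pi0) * normal_pdf(x, mu1, s1)
        total += a / (a + b) if a + b > 0.0 else 0.0
    return total / len(accepted)
```

A typical use would be to call `fit_two_gaussians` on the target match scores and then evaluate `mixture_fdr` over a range of thresholds to choose a cutoff at the desired FDR level. Unlike the decoy ratio, this estimate depends on how well the assumed component distributions fit the real score distribution.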