Electronic Thesis and Dissertation Repository

Thesis Format



Master of Science


Computer Science


Ilie, Lucian.


Detecting protein sequence similarity is a vital task in biological sequence analysis, as it can help identify homologous sequences and infer their function. Protein sequence alignment widely relies on substitution matrices such as BLOSUM and PAM to score similarity. As good as these matrices are, they lack the contextual information that comes from the position of an amino acid within its sequence. We introduce the E-score, a contextual scoring approach that uses the cosine similarity between the embedding vectors of amino acids produced by deep learning models. We tested our approach on many reference multiple sequence alignments and show that alignments produced using the new method are significantly better than those obtained using BLOSUM matrices. Since protein similarity identification is one of the most widely used procedures in the biological sciences, the impact of the new method is expected to be significant.
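
The scoring idea in the abstract can be illustrated with a minimal sketch. It assumes each residue of a sequence has already been mapped to an embedding vector by some protein language model; the toy three-dimensional vectors and the names `cosine_similarity` and `e_score_matrix` are hypothetical and chosen for illustration only. Any scaling or post-processing the thesis applies to the cosine value is not shown here.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

def e_score_matrix(emb_a, emb_b):
    """Score every residue position of one sequence against every position
    of the other; entry [i][j] compares residue i of A with residue j of B."""
    return [[cosine_similarity(u, v) for v in emb_b] for u in emb_a]

# Toy per-residue embeddings (made up; real model embeddings are much larger).
emb_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
emb_b = [[2.0, 0.0, 0.0], [0.0, -1.0, 0.0], [1.0, 1.0, 0.0]]
scores = e_score_matrix(emb_a, emb_b)
```

Unlike a fixed substitution matrix, which is indexed by amino acid type, this matrix is indexed by position in each sequence, so the same pair of amino acids can receive different scores in different contexts.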

Summary for Lay Audience

This thesis introduces the E-score, a novel contextual scoring system for protein sequence alignment, utilizing cosine similarity between embedding vectors of amino acids. The E-score, derived from various protein embedding models, marks a significant improvement over traditional fixed substitution matrices like BLOSUM. Extensive testing identified ProtT5 as the most effective embedding model, surpassing BLOSUM45, the best among fixed matrices. The E-score's application in alignments yields higher accuracy, demonstrating its superiority in aligning protein sequences.

The research validates the E-score's effectiveness by taking alignments from the Conserved Domain Database of NCBI and realigning them using both the E-score and BLOSUM matrices. Alignments based on the ProtT5-score are notably closer to NCBI's base alignments than those using BLOSUM45, underscoring the E-score's advantage over traditional methods. Beyond pairwise alignment, the E-score's potential extends to multiple sequence alignment algorithms and to other fields that require text or similarity analysis. The continual improvement of language models promises further enhancements in E-score performance, especially with fine-tuning for specific protein species or biological sequences.
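
To show how a position-by-position score matrix can drive a pairwise alignment, here is a hedged sketch of global alignment by dynamic programming (Needleman-Wunsch). The linear gap penalty, the function name `global_align`, and the toy score values are assumptions made for illustration; this page does not state the alignment algorithm or gap model used in the thesis.

```python
def global_align(seq_a, seq_b, scores, gap=-1.0):
    """Global alignment where scores[i][j] is the score for pairing
    residue i of seq_a with residue j of seq_b (linear gap penalty)."""
    n, m = len(seq_a), len(seq_b)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i - 1][j - 1] + scores[i - 1][j - 1],  # pair residues
                           dp[i - 1][j] + gap,                        # gap in seq_b
                           dp[i][j - 1] + gap)                        # gap in seq_a
    # Trace back from the bottom-right cell to recover one optimal alignment.
    i, j = n, m
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + scores[i - 1][j - 1]:
            out_a.append(seq_a[i - 1]); out_b.append(seq_b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and (j == 0 or dp[i][j] == dp[i - 1][j] + gap):
            out_a.append(seq_a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(seq_b[j - 1])
            j -= 1
    return dp[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

# Toy scores standing in for cosine similarities between residue embeddings.
toy_scores = [[1.0, -0.5],
              [-0.5, -0.5],
              [-0.5, 1.0]]
score, aligned_a, aligned_b = global_align("ACD", "AD", toy_scores)
```

With a fixed matrix such as BLOSUM, `scores[i][j]` would instead be looked up by the two amino acid letters, so swapping the scoring scheme only changes how the matrix is filled, not the alignment procedure itself.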

Available for download on Monday, December 01, 2025