Electronic Thesis and Dissertation Repository

Improved Protein Sequence Alignments Using Deep Learning

Seyed Sepehr Ashrafzadeh, Western University

Abstract

Detecting protein sequence similarity is a vital task in biological sequence analysis, as it helps identify homologous sequences and infer their function. When aligning protein sequences, substitution matrices such as BLOSUM and PAM are widely used to score similarity. As good as these matrices are, they assign each pair of amino acids a fixed score and so lack the contextual information that comes from the position of an amino acid within its sequence. We introduce the E-score, a contextual scoring approach that uses the cosine similarity between the embedding vectors of amino acids produced by deep learning models. We tested our approach on many reference multiple sequence alignments and show that alignments produced using the new method are significantly better than those obtained using BLOSUM matrices. Since protein similarity identification is one of the most widely used procedures in the biological sciences, the impact of the new method is expected to be significant.
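To make the idea of a cosine-similarity score concrete, the following is a minimal sketch in Python. It assumes that each residue position already has an embedding vector; the toy three-dimensional vectors, the function names cosine_similarity and e_score_matrix, and the absence of any scaling or offset are illustrative assumptions, not details taken from the thesis.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def e_score_matrix(emb_a, emb_b):
    """Score every position of sequence A against every position of sequence B.

    emb_a and emb_b are lists of per-residue embedding vectors (hypothetical
    inputs; in practice they would come from a protein language model).
    """
    return [[cosine_similarity(u, v) for v in emb_b] for u in emb_a]


# Toy embeddings with made-up values, kept to 3 dimensions for readability.
toy_emb_a = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
toy_emb_b = [[3.0, 0.0, 0.0], [0.0, -1.0, 0.0], [1.0, 1.0, 0.0]]

scores = e_score_matrix(toy_emb_a, toy_emb_b)
for row in scores:
    print([round(s, 3) for s in row])
```

Unlike a lookup in a fixed substitution matrix, the score here depends on the vectors for the two specific positions, so the same pair of amino acids can score differently in different sequence contexts. A matrix of this kind could take the place of substitution-matrix lookups in a standard dynamic-programming aligner.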