
Analyzing sgRNA Cleavage Activities for SaCas9 and SpCas9 in Bacteria
Abstract
CRISPR systems are used for strain-specific bacterial elimination and to enhance bacterial recombineering outcomes. Their effectiveness depends on reliably generating targeted DNA breaks at intended sites using Cas9 directed by a single guide RNA (sgRNA). However, not all Cas9/sgRNA combinations lead to the same degree of cleavage. Many groups have collected datasets to analyze Cas9/sgRNA cleavage activity in eukaryotic organisms, but cleavage datasets for bacteria are limited and largely test only a single Cas9 orthologue. Moreover, prediction models trained on these data do not generalize to activities measured in other assays, or to bacteria other than those in which the data were collected. To overcome these problems, I generated several high-quality cleavage datasets for pools of sgRNAs, using enrichment and depletion experimental setups to map the sgRNA cleavage landscape for (Tev)SpCas9 and (Tev)SaCas9 in bacteria. Activities measured using enrichment experiments were extensively validated by assaying sgRNAs individually. Cleavage activities for identical sgRNAs measured by enrichment and depletion setups are highly correlated, suggesting congruence between the different measurement modalities. I also identified toxic sgRNA phenotypes that were related to the number and position of mismatches to chromosomal DNA. I tested sgRNA pools containing mismatches relative to their targets, identifying off-target cleavage as one potential mechanism of sgRNA-induced toxicity while simultaneously providing position-dependent cleavage information for model training. The machine learning models crisprHAL and crisprHAL2.0, trained on the TevSpCas9 and TevSaCas9 datasets, produce accurate predictions that generalize to relevant organisms such as S. enterica and C. rodentium. I also identified the importance of nucleotides downstream of the PAM sequence for cleavage activity and model predictions.
These models show marked increases in predictive accuracy compared to previous models, indicating that the quality of training data is critical for accurate and generalizable performance. The data collected in this thesis advance the understanding of sgRNA requirements for reliable cleavage in bacteria by orthogonal Cas9 enzymes.