Expression changes confirm predicted single nucleotide variants affecting mRNA splicing

Mutations that cause genetic diseases can be difficult to identify if the mutation does not affect the sequence of the protein, but rather the splice form of the transcript. However, the prediction of deleterious changes caused by genomic variants that affect splicing has been shown to be accurate using information theory-based methods. We made several such predictions of potential splicing changes that could be caused by SNPs found to alter natural and/or cryptic splice site strength. We evaluated a selected set of 22 SNPs that we predicted by information analysis to affect splicing, validated these with targeted expression analysis, and compared the results with genome-scale interpretation of RNAseq data from tumors. Abundances of natural and predicted splice isoforms were quantified by q-RT-PCR and with probeset intensities from exon microarrays using RNA isolated from HapMap lymphoblastoid cell lines containing the predicted deleterious variants. These SNPs reside within the following genes: TMPRSS3 and DERL3. Fifteen of these SNPs showed a significant change in the use of the affected splice site. For 3 of these SNPs, individuals homozygous for the stronger allele had higher transcription of the associated gene than individuals with the weaker allele. Thirteen SNPs had a direct effect on exon inclusion, while 10 altered cryptic site use. In 4 genes, individuals of the same genotype had high expression variability caused by alternate factors which masked effects of the SNP. Targeted expression analyses for 8 SNPs in this study were confirmed by results of genome-wide information theory and expression analyses.

Information theory-based (IT-based) models of donor and acceptor mRNA splice sites reveal the effects of changes in strengths of individual sites (Rogan et al. 1998; Rogan et al. 2003). This facilitates prediction of phenotypic severity (Rogan and Schneider 1995; von Kodolitsch et al. 1999; von Kodolitsch et al. 2006). The effects of splicing mutations can be predicted in silico by information theory (Rogan and Schneider 1995; Rogan et al. 1998; O'Neill et al. 1998; Allikmets et al. 1998; Kannabiran et al. 1998; Khan et al. 1998; von Kodolitsch et al. 1999; Vockley et al. 2000; Svojanovsky et al. 2000; Khan et al. 2002; Rogan et al. 2003; Lamba et al. 2003; Khan et al. 2004; von Kodolitsch et al. 2006; Viner et al. 2014; Dorman et al. 2014; Shirley et al. 2018) and the predictions confirmed in vitro by experimental studies (Vockley et al. 2000; Rogan et al. 2003; Lamba et al. 2003; Susani et al. 2004; Hobson et al. 2006). Strengths of one or more splice sites may be altered, in some instances concomitant with amino acid changes in coding sequences (Rogan et al. 1998). Information analysis has been a successful approach for recognizing non-deleterious variants (Rogan and Schneider 1995), and for distinguishing milder from severe mutations (Rogan et al. 1998; von Kodolitsch et al. 1999).
In the present study, we explicitly predict and validate SNPs that influence mRNA structure and levels of expression of the genes containing them.
The robustness of information analysis in predicting splicing mutations for Mendelian disorders justifies the use of this approach to identify SNPs that are likely to have a measurable impact on mRNA splicing. Others have used exon microarrays to compare different cellular states and then confirm suggested abnormalities from the expression data using q-RT-PCR (Thorsen et al. 2008). We hypothesized that the predicted effect of SNPs on expression of the proximate exon would correspond to the expression of exon microarray probes of genotyped individuals in the HapMap cohort. We used the dose-dependent expression of the minor allele to qualify SNPs for subsequent information analysis consistent with alterations of mRNA splicing.
These predicted mutations were then analyzed by q-RT-PCR to validate the accuracy of the bioinformatic predictions.
We recently described several deleterious single nucleotide polymorphisms in dbSNP that affect splicing and at least one of these is common (Nalla and Rogan 2005). This analysis used the NCBI Entrez query engine which conservatively defines splicing-related SNPs as only those variants involving the dinucleotides immediately adjacent to exon boundaries. Given that constitutive splicing mutations can arise at other locations within pre-mRNA sequences and can involve cryptic splicing, we addressed whether other genomic variants might be a source of common mutation. To test the feasibility of this hypothesis, we used information analysis to examine the potential impact of SNPs mapped predominantly onto the genome sequences of chromosomes 21 and 22 on splicing.

Materials and Methods
exonic probeset based on the genotype of the SNP with which it is associated (SNP within the natural donor/acceptor region of the exon). Probesets displaying a stepwise change in mean SI (where the mean SI of the homozygous rare genotype is < 90% of that of the homozygous common genotype, with a simultaneous decrease in heterozygotes) were found with another script. Splicing Index boxplots were created using R, where the x- and y-axes are genotype and SI, respectively (Supplemental Figure 1). This gives a visual representation of all 176 individuals (if genotyped), and allows one to quickly analyze the effect a SNP has on a particular probeset.
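The stepwise filter described above can be sketched as follows. This is a minimal illustration, not the script used in the study: the function name, the input layout, and the requirement that the heterozygote mean lie strictly between the two homozygote means are assumptions, and SI values are assumed to be positive so that the 90% comparison is meaningful.

```python
from statistics import mean


def stepwise_decrease(si_by_genotype, ratio=0.90):
    """Return True if mean SI falls stepwise from common to rare genotype.

    si_by_genotype maps "common", "het" and "rare" to lists of SI values
    (hypothetical layout). ratio=0.90 encodes the "< 90% of homozygous
    common" threshold stated in the text.
    """
    common = mean(si_by_genotype["common"])
    het = mean(si_by_genotype["het"])
    rare = mean(si_by_genotype["rare"])
    # Rare homozygotes must be below 90% of common homozygotes, and the
    # heterozygotes must sit between the two homozygous groups (assumption).
    return rare < ratio * common and rare < het < common


# Made-up SI values, for illustration only.
dose_dependent = {"common": [1.00, 0.95, 1.05], "het": [0.80, 0.75], "rare": [0.50, 0.55]}
unchanged = {"common": [1.00, 1.00], "het": [0.97, 0.99], "rare": [0.95, 0.96]}

print(stepwise_decrease(dose_dependent))
print(stepwise_decrease(unchanged))
```

A probeset passing this filter is only a candidate; as the paper stresses, the splicing effect still has to be confirmed experimentally.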
SNPs with a range of strength changes (≥ 0.5 bits) were chosen for further analysis by q-RT-PCR. SNPs with ΔR_i < 1 bit were included to determine whether such small changes lead to detectable changes in splicing. Additional SNPs tested for splicing effects were predicted in a previous publication (Nalla and Rogan 2005).
Cell culture & RNA extraction. EBV-transformed lymphoblastoid cell lines of HapMap individuals with our SNPs of interest (homozygous common, heterozygous and homozygous rare when available) were ordered from the Coriell Cell Repositories (CEU: GM07000, GM07019, GM07022, GM07056, GM11992, GM11994, GM11995, GM12872. YRI: GM18855, GM18858, GM18859, GM18860, GM19092, GM19093,
TE buffer) at 37ºC for 15 minutes. The reaction was stopped with EDTA (0.05M; 2.5% v/v), and heated to 65ºC for 20 minutes, followed by ethanol precipitation (resuspended in 0.1% v/v
bioRxiv preprint first posted online Feb. 13, 2019; doi: http://dx.doi.org/10.1101/549089. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
Design of real-time expression assay. Sequences were obtained from UCSC and Ensembl. DNA primers used to amplify a known splice form or one predicted by information analysis were designed using Primer Express (ABI). DNA primers were obtained from IDT (Coralville, IA, USA) and dissolved to 200 µM; sequences are given in Supplemental Table 1. Primers were designed to amplify the wildtype splice form, exon skipping (if a natural site is weakened), and cryptic site splice forms which were either previously reported (UCSC mRNA and EST tracks) or predicted by information analysis (where R_i of the cryptic site ≥ R_i of the weakened natural site).
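The rule for choosing predicted cryptic sites to assay (cryptic R_i at least as large as the R_i of the weakened natural site) can be expressed as a short filter. The sketch below is illustrative only: the function and field names are hypothetical, the first two sites use the R_i values reported later in this paper for rs2835585 (TTC3 exon 3 acceptor, weakened to 4.4 bits), and the third site is invented to show a rejected candidate.

```python
def cryptic_sites_to_assay(weakened_natural_ri, cryptic_sites):
    """Keep cryptic sites whose R_i (bits) is >= the weakened natural site R_i."""
    return [site for site in cryptic_sites if site["ri"] >= weakened_natural_ri]


# Negative offsets denote sites upstream of the natural acceptor (assumed convention).
ttc3_candidates = [
    {"offset_nt": -60, "ri": 6.9},
    {"offset_nt": -87, "ri": 7.2},
    {"offset_nt": 40, "ri": 2.0},  # hypothetical weak site, not from the paper
]

selected = cryptic_sites_to_assay(4.4, ttc3_candidates)
print([site["offset_nt"] for site in selected])
```

Passing this filter only nominates a site for primer design; it does not imply that the site is used in vivo.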
Two types of reference amplicons were used to quantify allele specific splice forms. These consisted of intrinsic products derived from constitutively spliced exons within the same gene, and external genes with high uniformity of expression among HapMap cell lines. Reference primers internal to the genes of interest were designed 1-4 exons away from the affected exon (in exons without evidence of variation in the UCSC Genome Browser), and were placed upstream of the SNP of interest whenever possible. Including an internal reference in the q-RT-PCR experiment has two advantages: it allows potential detection of changes in total mRNA levels, and it accounts for inter-individual variation in expression.
External reference genes were chosen based on consistent PLIER intensities with low coefficients of variation in expression among all 176 HapMap individuals. The following external controls were selected: exon 39 of SI (PLIER intensity 11.4 ± 1.7), exon 9 of FRMPD1 (22 ± 2.81), exon 46 of DNAH1 (78.5 ± 9.54), exon 3 of CCDC137 (224 ± 25) and exon 25 of VPS39 (497 ± 76). The external reference chosen for an experiment was matched to the intensity of the probeset within the exon of interest. This decreased potential errors in C_T values and proved to be accurate and reproducible for most genes.
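As a rough illustration of the external reference selection, the sketch below computes the coefficient of variation for each listed control and picks the control whose intensity is closest to that of a probeset of interest. It assumes that the "±" values are standard deviations and that "matched" means closest on a log2 scale; neither is stated in the text, and the function names are hypothetical.

```python
import math

# (mean PLIER intensity, spread) as listed in the text; spread assumed to be SD.
EXTERNAL_REFERENCES = {
    "SI exon 39": (11.4, 1.7),
    "FRMPD1 exon 9": (22.0, 2.81),
    "DNAH1 exon 46": (78.5, 9.54),
    "CCDC137 exon 3": (224.0, 25.0),
    "VPS39 exon 25": (497.0, 76.0),
}


def coefficient_of_variation(mean_intensity, sd):
    """CV = SD / mean, a unitless measure of expression uniformity."""
    return sd / mean_intensity


def closest_reference(probeset_intensity, references=EXTERNAL_REFERENCES):
    """Return the reference whose mean intensity is nearest on a log2 scale."""
    return min(
        references,
        key=lambda name: abs(math.log2(references[name][0] / probeset_intensity)),
    )


for name, (mean_intensity, sd) in EXTERNAL_REFERENCES.items():
    print(f"{name}: CV = {coefficient_of_variation(mean_intensity, sd):.1%}")

print(closest_reference(100.0))
```

Under these assumptions all five controls have a CV of roughly 11-15%, and a probeset with an intensity near 100 would be paired with the DNAH1 amplicon.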
Primers were placed over junctions of interest (whenever possible) to amplify a single splice form. T_m ranged from 58 to 65ºC, and amplicon lengths varied from 69 to 136nt. BLASTn (refseq_rna database) was used to reduce possible cross-hybridization.
Precipitated cDNA was resuspended in water at 20ng/µL of original RNA concentration.
All designed primer sets were tested with conventional PCR to ensure a single product at the expected size. PCR reactions were prepared with 1.0M Betaine (Sigma-Aldrich), and were heated to 80ºC before adding Taq Polymerase (Invitrogen). The optimal T_m for each primer set was determined to obtain maximum yield.
Quantitative PCR was performed on an Eppendorf Mastercycler ep Realplex 4, a Bio-Rad CFX96, as well as a Stratagene Mx3005P. SYBR Green assays were performed using the KAPA SYBR FAST qPCR kit (Kapa Biosystems) in 10µL reactions using 200µM of each primer and 24ng total of cDNA per reaction. For some tests, SsoFast Eva Green supermix (Bio-Rad) was used with 500µM of each primer instead.
When testing the effect of a SNP, all primers designed for that SNP, as well as the gene-internal reference and external reference, were run simultaneously. C_T values obtained from these experiments were normalized to the external reference using the Relative Expression Software Tool (REST; http://www.gene-quantification.de/rest.html; Pfaffl et al. 2002).
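The efficiency-corrected ratio on which REST is based can be sketched as below. This is a simplified illustration rather than the REST software itself: it computes only the expression ratio for a single pair of mean C_T values, omits REST's randomization-based significance test, and assumes perfect amplification efficiency (2.0) unless measured efficiencies are supplied. All parameter names are hypothetical.

```python
def relative_expression(
    ct_target_control,
    ct_target_sample,
    ct_ref_control,
    ct_ref_sample,
    eff_target=2.0,
    eff_ref=2.0,
):
    """Efficiency-corrected expression ratio of a target normalized to a reference.

    ratio = eff_target ** (Ct_target_control - Ct_target_sample)
            / eff_ref ** (Ct_ref_control - Ct_ref_sample)
    """
    target_change = eff_target ** (ct_target_control - ct_target_sample)
    reference_change = eff_ref ** (ct_ref_control - ct_ref_sample)
    return target_change / reference_change


# Illustrative values: the splice form appears 3 cycles later in the sample
# genotype while the external reference is unchanged.
print(relative_expression(24.0, 27.0, 20.0, 20.0))
```

With these made-up inputs the splice form is an eighth as abundant in the sample as in the control; if the reference also shifted by one cycle, the corrected ratio would be a quarter.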
TaqMan Assay. Two dual-labeled TaqMan probes were designed to detect the two splice forms of XRCC4 (6nt deletion). Probes were placed over the sequence junction of interest where variation would be near the probe middle (Supplemental Figure 1). The assay was performed on an ABI StepOne Real-Time PCR system using ABI Genotyping Master Mix. The experiment was run in 25µL reactions (300nM each primer, 400nM probe [5'-FAM or TET fluorophore with a 3' Black Hole quencher; IDT], and 80ng cDNA total). Probes were tested in separate reactions.

Results
Selection of candidate SNPs affecting splicing. Publicly available exon microarray data were used to find exons affected by splice site strength-altering SNPs. A change in the mean SI among individuals of differing genotypes may suggest altered splicing. A stepwise decrease (where the mean SI for the heterozygote lies between those of the two homozygotes) could reflect a dose-dependent allelic effect. There were 9328 HapMap-annotated SNPs within donor/acceptor regions of known exons which contained at least one probeset. Of 987 SNPs associated with exonic probesets which differ in mean SI between the homozygous common and rare HapMap individuals, 573 caused a decrease in natural site R_i. Leaky mutations (reduction in information content where final R_i ≥ R_i,minimum) comprise 40-60% of the total and also exhibit reduced SI values. These results indicate that the proposed approach will detect severe, as well as moderate, splicing mutations with reduced penetrance and milder phenotypes, consistent with our previous reports (von Kodolitsch et al. 1999; von Kodolitsch et al. 2006).
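The distinction between leaky and abolishing changes can be illustrated with a small classifier. This is a sketch under a stated assumption: the text does not give a value for R_i,minimum, so the default of 1.6 bits below is an assumed threshold taken from earlier information-theory work on splice sites and should be replaced with whatever value the analysis actually uses. The example R_i values are those reported later in this paper.

```python
def classify_natural_site_change(ri_before, ri_after, ri_minimum=1.6):
    """Classify a natural splice site change from its R_i (bits) before and after.

    ri_minimum is an ASSUMED threshold for a functional site; it is not
    stated in this paper.
    """
    if ri_after >= ri_before:
        return "not weakened"
    if ri_after >= ri_minimum:
        return "leaky"  # reduced, but final R_i still >= R_i,minimum
    return "abolished"  # final R_i below the minimum for a functional site


print(classify_natural_site_change(11.5, 3.9))   # rs1805377, XRCC4 exon 8 acceptor
print(classify_natural_site_change(7.3, -0.3))   # rs2243187, IL19 exon 5 acceptor
print(classify_natural_site_change(10.6, 12.8))  # rs1018448, ARFGAP3 exon 12 acceptor
```

Sites classed as leaky would be expected to retain some normal splicing, which is why reduced rather than absent SI values are anticipated for them.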
These SNPs were analyzed by information theory to find those which caused a potential splice-affecting R_i change. Of the 9328 HapMap SNPs within the natural splice sites of exon probeset-containing exons, 112 (1.2%) and 235 (2.5%) were found on chromosome 21 and 22, respectively. Of those, 21 SNPs on chr21 (0.23% of total, 18.8% of chr21) and 34 on chr22 (0.36% of total, 14.5% of chr22) were associated with a stepwise decrease in probeset intensity. 7 of the 21 chr21 SNPs (33.3%) and 9 of the 34 chr22 SNPs (26.5%) caused information changes which satisfied either of the following criteria: a natural site ΔR_i ≥ 0.5 bits, or a change in strength of a potential cryptic site(s) with an R_i comparable to that of the neighbouring natural site, or where mRNA/EST data supported cryptic site use. While a minimum ΔR_i of 0.5 bits (1.4 fold) was chosen, the actual minimum change resulting in a detectable splicing effect is not known (ΔR_i values range from 0.5 to 7.8 bits). The 16 SNPs are listed in order of decreasing ΔR_i: rs2075276
We report q-RT-PCR results for 13 out of the 16 SNPs (primers to test the effect of rs16994182, rs2075276 and rs3950176 were not suitable for q-RT-PCR, or gave ambiguous results), along with 8 other candidate SNPs found in previous publications (Nalla and Rogan, 2005).
The limited number of available individuals with a particular genotype can be insufficient to reach statistical significance (rs2243187; Suppl. Figure 1.4). Although exon microarrays can be used to find potential alternate splicing and give support to our predictions, q-RT-PCR must be employed to confirm the splicing effect.
Accuracy of predictions. A total of 22 SNPs were chosen for analysis by q-RT-PCR. Primers were developed to amplify known and information theory-predicted splice forms. 15 out of 22 SNPs tested showed a measurable change in splicing consistent with information-theory predictions. Of these 15 sites, 10 led to an increase in alternate splice site use (2 of which increased the strength of a cryptic site, and 8 of which increased use of an unaffected pre-existing site), 6 led to a change in exon retention (5 increased exon skipping), 3 increased the use of an alternative exon, and 4 appeared to decrease total mRNA levels of that gene. We did not detect altered splicing in 6 SNPs, 2 of which caused information changes > 1 bit. Three of the four SNPs where ΔR_i < 1 bit were hampered by high variability of gene expression between individuals.
Change in the information content of a splice site (a measure of binding affinity) was used to predict experimentally-derived change in splice isoform levels. In 12 out of the 15 SNPs which caused measurable effects, the change in splice site strength predicted by information theory was consistent with the changes measured by q-RT-PCR. The 3 exceptions are rs2070573 (C21orf2), rs17002806 (WBP2NL), and rs2835585 (TTC3). Changes predicted to reduce strength by > 100 fold were experimentally found to reduce expression by 38 to 58 fold. Predicted changes in strength below 8 fold were not consistently detectable on wildtype splicing, though changes in less abundant splice forms were regularly observed (e.g. rs2835585 altered exon skipping levels by ~3-9 fold, but the wildtype splice form predominated).
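The relationship between a change in information content and the predicted fold change in site strength can be made explicit with a short calculation. The sketch assumes the conversion fold = 2^|ΔR_i|, which agrees with the conversions quoted in this paper (0.5 bits ≈ 1.4 fold, 1 bit = 2 fold); the function name is illustrative.

```python
def fold_change_from_bits(delta_ri_bits):
    """Predicted fold change in splice site strength for a given ΔR_i in bits."""
    return 2.0 ** abs(delta_ri_bits)


# Selection threshold, a 1 bit change, an 8 fold change (3 bits), and the
# XRCC4 exon 8 acceptor change reported in this paper (11.5 to 3.9 bits).
for delta in (0.5, 1.0, 3.0, 11.5 - 3.9):
    print(f"{delta:.1f} bits -> {fold_change_from_bits(delta):.1f} fold")
```

On this scale, the "> 100 fold" class corresponds to changes of more than about 6.6 bits, and the "below 8 fold" class to changes of less than 3 bits.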
Experimental results are shown in Table 1.
SNPs affecting use of cryptic sites by weakening natural splice sites. When natural site strength is changed, the resulting mRNA splicing change depends on the strength and location of cryptic splice sites. We detected an increase in cryptic site use coinciding with a decrease in natural site strength caused by the following SNPs: rs1805377 (XRCC4 exon 8 acceptor; 11.5 bits to 3.9 bits; 221-473 fold increase of 6nt downstream site detected by q-RT-PCR, complete discrimination in dual-labelled probe experiment); rs2243187 (IL19 exon 5 acceptor; 7.3 to -0.3 bits; 1.8 fold increase of 3nt downstream site in heterozygote); rs3747107 (GUSBP11 acceptor of 3' terminal exon of mRNA splice form BX538181; 8.9 to 1.4 bits; 31 and 42.8 fold increase of 114 and 118nt upstream cryptic sites, respectively); rs17002806 (WBP2NL exon 6 donor; 10 to 6.5 bits; 34 fold increase of 25nt downstream site use); rs6003906 (DERL3 exon 5 acceptor; 2.2 to 0.3 bits; double appearance of AK125830 mRNA splice form using a 123nt downstream acceptor); and rs13076750 (LPP, acceptor of rare exon within intron 1; 9.3 to -1.6 bits; 16 fold increase of 7nt downstream cryptic acceptor).
Some pre-existing cryptic sites located near weakened natural splice sites, and with greater predicted information content than those sites, were either not recognized or not significantly altered in use: rs1893592 (UBASH3A exon 10 donor; 9.1 to 4.3 bits) did not increase the use of 7.0 and 6.1 bit sites 29 and 555nt downstream of the affected donor; rs17002806 (WBP2NL, described earlier) did not activate a 5.7 bit site 67nt downstream; rs3747107 (GUSBP11) strengthened a cryptic site 2nt downstream (1.6 to 7.5 bits) but no product was detected; rs2835585 (TTC3 exon 3 acceptor; 6.4 to 4.4 bits) did not activate two stronger cryptic sites (6.9 and 7.2 bits, 60 and 87nt upstream respectively). It is clear that cryptic splice sites proximate to a weakened natural splice site are not guaranteed to be activated, which emphasizes the need for wet-lab experiments to confirm these bioinformatic predictions.
and XRCC4 regions tested showed preference to the upstream acceptor as well, which is congruent with the processive mechanism of detecting acceptor splice sites (Robberson et al. 1990).
SNPs affecting exon retention. SNPs that increased exon skipping reduced natural site strength by 1.6 to 10.9 bits, and led to increases in exon skipping ranging from 3 to 1911 fold between homozygotes of opposing genotypes. These SNPs include: rs2835585 (TTC3 exon 3 acceptor; 6.4 to 4.4 bits; 2-9 fold); rs1018448 (ARFGAP3 exon 12 acceptor; 10.6 to 12.8 bits; 1.5-2.6 fold); rs1333973 (IFI44L exon 2 donor; 9.5 to 5.0 bits; ~15
SNPs which changed natural site strength but were not found to affect exon retention changed information content by 3 to 4.8 bits. Skipping was not detected in weak homozygotes for rs1893592 (UBASH3A) and rs17002806 (WBP2NL), and was detected but not altered by rs2835655 (TTC3 exon 39 donor; 12 to 9 bits). The SNP rs2243187 (IL19, described earlier) was found to decrease exon skipping (halved in the heterozygote) while increasing the use of an alternate 3nt downstream acceptor. This is consistent with Rogan et al. (2003), where the creation of a strong splice site closely situated to a natural site was shown to facilitate an increase in exon skipping, an effect that the A-allele eliminates.
acceptor; 10.7 to 9.4 bits; 32-57% total mRNA compared to homozygous wild type genotype); rs1018448 (ARFGAP3; 12.8 to 10.6 bits; 31.9-68.5% compared to homozygous wild type genotype); In each case, genotypic differences in the exon microarray data follow the expected trend but are not large enough to be statistically significant. Due to the modest differences in the array data, additional individuals must be tested to confirm these effects.
Predicted deleterious SNPs without detectable evidence of alternate splicing. There were 6 SNPs in this study which were predicted to disrupt natural splice sites, but for which there was no detectable effect on splicing. Potential causes include inter-individual expression variability, small (< 1 bit) strength changes, limitations to RT-PCR primer design, and the failure to correctly predict the SNP's splicing effect due to limitations of the splicing models (for example, compensatory splicing regulatory enhancers). Splicing effects were not identified for 3 SNPs where the information change was < 1 bit (2-fold). Genetic variability masked potential splicing effects of these SNPs: rs16802 (BCR exon 14 acceptor; 8.8 to 9.4 bits; individuals of like genotype varied by 8.3 fold), rs2252576 (BACE2 exon 5 acceptor; 9.0 to 9.6 bits; 12 fold) and rs8130564 (TMPRSS3 exon 5 acceptor; 6.3 to 6.8 bits; 112 fold). Various primer sets designed to detect an exon affected by the SNP rs17357592, where ΔR_i ≤ 1 bit (COL6A2 exon 21 acceptor; 8.4 to 7.8 bits), failed to give a single product. Interpreting the results from the SNP rs16994182 (CLDN14 exon 2 donor; 8.6 to 8.1 bits) was complicated by the lack of an adequate internal reference primer set. As this gene consists of only 3 exons, any internal reference must cover the potentially affected second exon. Any difference detected by these primers could be caused by altered splicing, and therefore cannot account for any variation in expression between individuals. Therefore, the splicing differences detected by q-RT-PCR (Table 1)

Discussion
Predicted deleterious SNP alleles that alter constitutive mRNA splicing are confirmed by expression and spliced EST data, and may be common in populations. The preponderance of leaky splicing mutations and cryptic splice sites, which often produce both normal and mutant transcripts, is consistent with balancing selection (Nuzhdin et al. 2004) or possibly with mutant loci that contribute to multifactorial disease. Minor SNP alleles are often present in > 1% of populations (Janosíková et al. 2005). This would be consistent with a bias against finding mutations that abolish splice site recognition in dbSNP. Such mutations are more typical in rare Mendelian disorders (Rogan et al. 1998).
We note that work described in this manuscript was performed several years ago (ca.
The ValidSpliceMut web-beacon (http://validsplicemut.cytognomix.com) is a splicing mutation variant database containing predicted and confirmed splice variants from the Cancer Genome Atlas (TCGA). These have been identified by the Shannon Pipeline (a high-throughput IT-based prediction tool based on ASSEDA), and validated by RNA-Seq data from matched tissues and tumors lacking these mutations (Shirley et al. 2018). rs2070573, rs13076750 and rs10190751 were flagged due to intron inclusion, which is supported by q-RT-PCR results. An independent study focused on cryptic site activating mutations with the same TCGA data (Jayasinghe et al. 2018) failed to identify rs2070573 as a splicing mutation (which strengthens a cryptic splice site 360nt downstream of exon 6 in C21orf2). Furthermore, neither study flagged two other cryptic site-strengthening SNPs described in this manuscript (rs743920 [EMID1] and rs2838010 [FAM3B]). Interestingly, rs2835585 and rs1805377 were flagged due to intron inclusion; however, q-RT-PCR experiments instead showed increased exon skipping and cryptic site use, respectively. Aberrant splicing was not detected experimentally for rs2835655 or rs2072049; however, these SNPs were flagged due to increased intron inclusion. The design of the q-RT-PCR experiments associated with these SNPs was not optimized to detect this form of abnormal splicing.
The splicing impact (and when known, the disease association) of many of the discussed SNPs has been reported subsequent to the development of this study. As previously mentioned, the CFLAR SNP rs10190751 is known to modulate the FLICE-inhibitory protein (c-FLIP) from its S-form to its R-form, and the latter form has been linked to increased lymphoma risk (Ueffing et al. 2009). Furthermore, increased exon skipping due to the IFI44L SNP rs1333973 has been reported in RNAseq experiments (Zhao et al. 2013a) and this alternate splice form has been implicated in a reduction in antibody response to the measles vaccine (Haralambieva et al. 2017). The splicing impact of XRCC4 rs1805377 has been noted previously (Nalla and Rogan 2005), but to our knowledge has not previously been experimentally confirmed. This SNP has been implicated in an increased risk of gastric cancer (Chiu et al.
(Ge and Concannon, 2018). Hiller et al. (2006) described the 3nt deletion caused by IL19 rs2243187, but did not describe the increase in exon skipping seen in this study. The SNP rs743920 (EMID1) was associated with a change in allelic expression using EST data (Ge et al. 2005; impact on splicing not described). Conversely, studies which linked TMPRSS3 variants to hearing loss did not find the SNP rs8130564 to be significant (Lee et al. 2013; Chung et al. 2014). Interestingly, the BACE2 SNP rs2252576 (which was not found to alter splicing in this study) has been associated with Alzheimer's dementia in Down syndrome (Mok et al. 2014).
The compound effects of splicing mutations have been previously described. Krawczak et al. selected 38 genes known to have single-nucleotide mutations within donor and acceptor sites (from HGMD, as of January 2006) which have been associated with various diseases (Krawczak et al. 2007). Using neural networks, 87.4% of the mutations found in splice sites (n=430) were reported to cause exon skipping or cryptic site use. Of these splice-altering mutations (n=376), 56.9% were mutations in donor sites causing exon skipping, 13.6% resulted in the use of a cryptic donor site, 22.3% were acceptor site mutations leading to exon skipping and 7.2% led to cryptic acceptor use. Their data also suggested that exon skipping is less likely in the presence of nearby cryptic sites when a donor is weakened, but not when an acceptor is weakened. A region of only 50 bp surrounding the affected splice sites was used to search for cryptic sites, and therefore there is a strong possibility that sites outside of this range may have been missed.
Why are so few natural splice sites strengthened by SNP-induced information changes?
Most such changes would be thought to be neutral mutations, which are ultimately lost by chance (Fisher 1930). Those variants which are retained are more likely to confer a selective advantage (Li 1967). Indeed, the minor allele in rs2266988, which strengthens a donor splice site by 2.3 bits (29.9 fold) at the 5' end of the open reading frame in PRAME, occurs in 25% of the population (~50% in Europeans). We have shown a number of instances where apparently simple changes in strength of splice sites that would be expected to have little or no impact on splicing of the associated exon in fact alter the degree of skipping of that exon.
SNPs producing significant changes in information at functional binding sites may be useful for selecting tag SNPs in disease association studies. Such an approach would be independent of measures of haplotype block (Zhang et al. 2002; Carlson et al. 2004; Pe'er and Beckmann 2004). Considering the number of constitutive splicing mutations found, it is unlikely that sequence variation alone can account for the extensive heterogeneity in mRNA transcript structures, given the relatively high proportion of genes known to exhibit tissue-specific alternative splicing (Modrek and Lee 2002). Nevertheless, this study raises questions regarding the degree to which alternative splicing is the result of inter-individual genomic sequence differences rather than purely regulatory mechanisms. Because much of the information required for splice site recognition resides within neighboring introns, it would be prudent to consider contributions from intronic and exonic polymorphisms that produce structural exon variation.
While exon microarrays can be used to show alternate splicing differences based on genotype, there are obvious limitations with this technology. Probesets were placed to detect wildtype splice forms with mRNA and EST support, but may not be adequately placed to detect rare splicing events that have not yet been reported or which have little evidence. Smaller nucleotide changes due to cryptic site use (XRCC4, EMID1) seemed to be explicitly avoided in these probesets, which therefore could not detect the splicing change. Indeed, significant effort was required to design TaqMan assays that distinguished the isoforms generated by rs1805377 (XRCC4). While some genotype-specific differences in SI were quite significant (CFLAR, IFI44L), many showed only minor changes (ARFGAP3, LPP) and most had outlier individuals of one genotype with an SI comparable to that of the population of the second genotype (IL19). rs2835585 increased exon skipping in TTC3 nearly 10 fold, but a resulting decrease in total wildtype splicing at the affected exon junction was undetectable, most likely due to the great difference in abundance between the wildtype and skipped splice isoforms. Whether or not this small increase would cross the threshold of allele-specific exon skipping that may contribute to disease predisposition and pathogenesis is in question. In the case of the E3 ubiquitin ligase, TTC3, the third exon does not include any definitively-assigned protein domain (Tsukahara et al. 1996; Suizu et al. 2009).
This study describes the prediction and validation of natural and cryptic splice site alterations caused by common SNPs. Individual information represents a continuous phenotypic measure that is well suited to the analysis of contributions of multiple, incompletely penetrant SNPs in different genes from the same individual, as typically seen in genetically complex diseases. This technique has complemented efforts to identify disease-associated protein coding (and non-coding) mutations in a comprehensive, high-throughput variant interpretation study of
The Veridical and Shannon pipeline software have been developed to perform and validate large-scale analyses of potential splicing variants in complete genomes using RNAseq data (Viner et al. 2014). These resources have been used to evaluate millions of variants in TCGA cancer patient genomes (Shirley et al. 2018). For the SNPs described in this paper, targeted functional splicing analyses, for the most part, reproduce the results of our multigenome-wide surveys of sequence variations affecting mRNA splicing. This concordance increases confidence that these publicly (https://ValidSpliceMut.cytognomix.com) and commercially (https://MutationForecaster.com) available resources can help to identify mutations contributing to patient clinical phenotypes.
Red text indicates a decrease in the abundance of a particular splice form, while green text indicates an increase in abundance. A - Acceptor Splice Site Affected; D - Donor Splice Site Affected; NC - Not detectable (abolished). a Values from comparing heterozygote with homozygote common. b No allele specific difference in expression and splicing. c Change in splicing likely related to change in RNA level. d Intron 2-3 inclusion of TTC3 amplified by PCR, but no allele specific change detected. e mRNA in-frame when alternate exon is used, and out of frame due to cryptic site use. f This splice form not at detectable levels in homozygote. g PRAME is a special case where two SNPs affect splicing of two separate exons. h High variation between individuals of the same genotype found by q-RT-PCR. i Splice form not detected by PCR. j Cryptic acceptor 114nt upstream of affected site / cryptic acceptor 118nt upstream of affected site. k Cryptic donor 555nt downstream of affected site / cryptic donor 29nt downstream of affected site.