Using empirical evidence to predict if and how a DNA variant will disrupt RNA splicing in rare disorders
Background
The diagnostic rate in Mendelian disorders continues to hover around 50% after genomic testing, meaning that around half of families and clinicians are left with no actionable answer. Variants affecting splicing motifs are particularly challenging to interpret. To conclusively link a splicing variant to disease, it is necessary to determine the consequences of altered splicing for the final mRNA transcript and the resulting protein. Consequently, most probable splicing variants are classified as variants of uncertain significance (VUS) and remain unactionable.
A range of powerful but opaque algorithms have proliferated for predicting whether a variant alters splicing. Many are based on machine learning and deep learning, with the data and features used to make a specific prediction usually unavailable to be verified and weighted by clinicians. Without detailing the nature and source(s) of evidence used to make each prediction, these algorithms are relegated to the lowest evidence weighting according to globally-accepted, gold standard variant classification rules, established by the ACMG-AMP.
In addition, most algorithms currently make no attempt to predict mis-splicing outcomes which will occur as the result of a variant, meaning that bespoke functional testing is still required to discover the variant impact on pre-mRNA splicing and allow ACMG-AMP guided variant reclassification for a definitive molecular diagnosis.
There is an urgent need for evidence-based, clinically-validated tools for pathology interpretation of splicing variants.
Aims
To bridge the gap between data science and genetic pathology, by developing methods based on empirical evidence to predict if and how a DNA variant will disrupt RNA splicing in rare disease.
To determine empirical features that accurately inform:
1) spliceosomal selection of a cryptic-donor, in preference to the ‘authentic-donor’ (positioned at the exon-intron junction), and other nearby decoy-donors (any GT or GC) that are not used by the spliceosome, and
2) the mis-splicing events which will occur because of a variant precluding use of the authentic-donor or authentic-acceptor.
Methods
We use empirical and clinically relevant data to define and evaluate measurable features enriched in (1) cryptic-donors selected by the spliceosome vs decoy-donors (any GT/GC motif) which were not selected by the spliceosome and (2) mis-splicing events (exon skipping or cryptic activation) which occurred because of a splicing variant. 
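The candidate set described above (every GT or GC motif near the authentic-donor) can be enumerated with a simple scan. The following is a minimal sketch, not the authors' pipeline: the function name, the 0-based coordinate convention, and the default window of 250 nt are illustrative assumptions that do not come from the abstract.

```python
from typing import List, Tuple


def find_candidate_donors(
    seq: str, authentic_pos: int, window: int = 250
) -> List[Tuple[int, str, int]]:
    """List every GT or GC dinucleotide near the authentic donor.

    seq           -- genomic sequence on the transcribed strand
    authentic_pos -- 0-based index of the G of the authentic donor
    window        -- hypothetical search radius in nt (an assumption)

    Returns (position, dinucleotide, offset from authentic donor) tuples.
    Which candidates are cryptic-donors and which are decoys can only be
    decided from RNA evidence, not from this scan.
    """
    seq = seq.upper()
    start = max(0, authentic_pos - window)
    # Stop one base early so that seq[i:i + 2] is always a full dinucleotide.
    end = min(len(seq) - 1, authentic_pos + window + 1)
    hits = []
    for i in range(start, end):
        dinuc = seq[i:i + 2]
        if dinuc in ("GT", "GC") and i != authentic_pos:
            hits.append((i, dinuc, i - authentic_pos))
    return hits


if __name__ == "__main__":
    # Toy sequence: authentic donor GT at index 3.
    for pos, dinuc, offset in find_candidate_donors("AAGGTAAGTCGCA", 3):
        print(pos, dinuc, offset)
```

Each candidate returned here would then be labelled as a cryptic-donor (used by the spliceosome) or a decoy-donor (not used) before any feature comparison.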
Results
For 1) we evaluated the use of current algorithms to show that while intrinsic splice-site strength and proximity to the authentic-donor strongly influence spliceosomal selection of a cryptic-donor, these factors alone are not sufficient for accurate prediction.
For 2) we find that natural, stochastic mis-splicing events seen in population-based RNA-Seq are remarkably prescient of the mis-splicing events that will occur predominantly after the inactivation of an authentic splice site.
Conclusions
We’ve created an accurate, evidence-based method to predict the nature of variant-induced mis-splicing. The ability to confidently predict the outcome of a splicing variant is a major step forward which will greatly aid in genetic diagnosis of families with Mendelian disorders.
Refining clinically relevant parameters for mis-splicing risk in shortened introns with donor-to-branchpoint space constraint
Intronic deletions that critically shorten donor-to-branchpoint (D-BP) distance of a precursor mRNA impose biophysical space constraint on assembly of the U1/U2 spliceosomal complex, leading to canonical splicing failure. Here we use a series of β-globin (HBB) gene constructs with intron 1 deletions to define D-BP lengths that present low/no risk of mis-splicing and lengths which are critically short and likely elicit clinically relevant mis-splicing. We extend our previous observation in EMD intron 5 of 46 nt as the minimal productive D-BP length, demonstrating spliceosome assembly constraint persists at D-BP lengths of 47-56 nt. We exploit the common HBB exon 1 β-thalassemia variant that strengthens a cryptic donor (NM_000518.5(HBB):c.79G>A) to provide a simple barometer for the earliest signs of space constraint, via cryptic donor activation. For clinical evaluation of intronic deletions, we assert D-BP lengths > 60 nt present low mis-splicing risk while space constraint increases exponentially with D-BP lengths < 55 nt, with critical risk and profound splicing abnormalities with D-BP lengths < 50 nt.
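The three thresholds stated in this abstract can be written as a small lookup. This is a sketch under stated assumptions: the function name and the category labels are hypothetical, and the abstract assigns no category to lengths of 55-60 nt, so that band is returned as "unspecified" here rather than guessed.

```python
def dbp_risk(d_bp_nt: int) -> str:
    """Map a donor-to-branchpoint length (nt) to a coarse risk label.

    Thresholds follow the abstract: > 60 nt low risk, < 55 nt rising
    space constraint, < 50 nt critical risk. Labels are illustrative.
    """
    if d_bp_nt > 60:
        return "low"
    if d_bp_nt < 50:
        return "critical"
    if d_bp_nt < 55:
        return "elevated"
    # 55-60 nt: not categorised in the abstract (assumption: leave open).
    return "unspecified"


if __name__ == "__main__":
    for length in (75, 58, 52, 47):
        print(length, dbp_risk(length))
```

Note that the abstract also reports constraint persisting at 47-56 nt, so lengths in the "unspecified" band should not be read as risk-free.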
De novo variants in the non-coding spliceosomal snRNA gene RNU4-2 are a frequent cause of syndromic neurodevelopmental disorders
De novo variants in the RNU4-2 snRNA cause a frequent neurodevelopmental syndrome
Around 60% of individuals with neurodevelopmental disorders (NDD) remain undiagnosed after comprehensive genetic testing, primarily of protein-coding genes. Large genome-sequenced cohorts are improving our ability to discover new diagnoses in the non-coding genome. Here we identify the non-coding RNA RNU4-2 as a syndromic NDD gene. RNU4-2 encodes the U4 small nuclear RNA (snRNA), which is a critical component of the U4/U6.U5 tri-snRNP complex of the major spliceosome. We identify an 18 base pair region of RNU4-2 mapping to two structural elements in the U4/U6 snRNA duplex (the T-loop and stem III) that is severely depleted of variation in the general population, but in which we identify heterozygous variants in 115 individuals with NDD. Most individuals (77.4%) have the same highly recurrent single base insertion (n.64_65insT). In 54 individuals in whom it could be determined, the de novo variants were all on the maternal allele. We demonstrate that RNU4-2 is highly expressed in the developing human brain, in contrast to RNU4-1 and other U4 homologues. Using RNA sequencing, we show how 5′ splice-site use is systematically disrupted in individuals with RNU4-2 variants, consistent with the known role of this region during spliceosome activation. Finally, we estimate that variants in this 18 base pair region explain 0.4% of individuals with NDD. This work underscores the importance of non-coding genes in rare disorders and will provide a diagnosis to thousands of individuals with NDD worldwide.
Empirical prediction of variant-associated cryptic-donors with 87% sensitivity and 95% specificity
Abstract
Predicting which cryptic-donors may be activated by a genetic variant is notoriously difficult. Through analysis of 5,145 cryptic-donors activated by 4,811 variants (versus 86,963 decoy-donors not used; any GT or GC), we define an empirical method predicting cryptic-donor activation with 87% sensitivity and 95% specificity. Strength (according to four algorithms) and proximity to the authentic-donor appear important determinants of cryptic-donor activation. However, other factors such as auxiliary splicing elements, which are difficult to identify, play an important role and are likely responsible for current prediction inaccuracies. We find that the most frequent mis-splicing events at each exon-intron junction, mined from 40,233 RNA-sequencing samples, predict with remarkable accuracy which cryptic-donor will be activated in rare disease. Aggregate RNA-Sequencing splice-junction data provides an accurate, evidence-based method to predict variant-activated cryptic-donors in genetic disorders, assisting pathology consideration of possible consequences of a variant for the encoded protein and RNA diagnostic testing strategies.
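The core idea, ranking candidate cryptic-donors at one exon-intron junction by how often each is seen as a natural mis-splicing event in aggregate RNA-Seq, can be sketched as follows. This is an illustration only, not the published method: the input layout (offset from the authentic-donor mapped to a count of supporting samples), the tie-break on proximity, and the `top_n` cut-off are all assumptions.

```python
from typing import Dict, List, Tuple


def rank_cryptic_donors(
    junction_counts: Dict[int, int], top_n: int = 4
) -> List[Tuple[int, int]]:
    """Rank candidate donors by observed mis-splicing frequency.

    junction_counts -- hypothetical mapping of donor offset (nt, relative
                       to the authentic donor) to the number of RNA-Seq
                       samples in which that junction was observed
    top_n           -- how many top-ranked candidates to report (assumed)

    Candidates never observed (count 0) are dropped; ties are broken in
    favour of the donor closer to the authentic donor.
    """
    ranked = sorted(
        junction_counts.items(),
        key=lambda item: (-item[1], abs(item[0])),
    )
    return [item for item in ranked if item[1] > 0][:top_n]


if __name__ == "__main__":
    # Invented counts for one junction, for demonstration only.
    counts = {5: 120, -12: 800, 40: 0, -3: 120}
    print(rank_cryptic_donors(counts, top_n=2))
```

The top-ranked offsets would then be the predicted cryptic-donors to check first when a variant inactivates the authentic-donor.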
Gene discovery informatics toolkit defines candidate genes for unexplained infertility and prenatal or infantile mortality
Abstract
Despite a recent surge in novel gene discovery, genetic causes of prenatal-lethal phenotypes remain poorly defined. To advance gene discovery in prenatal-lethal disorders, we created an easy-to-mine database integrating known human phenotypes with inheritance pattern, scores of genetic constraint, and murine and cellular knockout phenotypes, then critically assessed defining features of known prenatal-lethal genes, among 3187 OMIM genes, and relative to 16,009 non-disease genes. While around one-third (39%) of protein-coding genes are essential for murine development, we curate only 3% (624) of human protein-coding genes linked currently to prenatal/infantile lethal disorders. 75% of prenatal-lethal genes are linked to developmental lethality in knockout mice, compared to 54% for all OMIM genes and 34% among non-disease genes. Genetic constraint correlates with inheritance pattern (autosomal recessive << autosomal dominant < X-linked), and is greatest among prenatal-lethal genes. Importantly, >90% of recessive genes show neither missense nor loss-of-function constraint, even for prenatal-lethal genes. Detailed ontology mapping for 624 prenatal-lethal genes shows marked enrichment among dominant genes for nuclear proteins with roles in RNA/DNA biology, with recessive genes enriched in cytoplasmic (mitochondrial) metabolic proteins. We conclude that genes without genetic constraint should not be excluded as potential novel disease genes, especially for recessive conditions (<10% constrained). Prenatal-lethal genes are 5.9-fold more likely to be associated with a lethal murine phenotype than non-disease genes. Cell-essential genes are largely a subset of mouse-lethal genes, notably under-represented among known OMIM genes, and strong candidates for gamete/embryo non-viability.
We therefore curate 3435 ‘candidate developmental lethal’ human genes: essential for murine development or cellular viability, not yet linked to human disorders, presenting strong candidates for unexplained infertility and prenatal/infantile mortality.
Empirical prediction of variant-activated cryptic splice donors using population-based RNA-Seq data
Abstract
Predicting which cryptic-donors may be activated by a splicing variant in patient DNA is notoriously difficult. Through analysis of 5145 cryptic-donors (versus 86,963 decoy-donors not used; any GT or GC), we define an empirical method predicting cryptic-donor activation with 87% sensitivity and 95% specificity. Strength (according to four algorithms) and proximity to the annotated-donor appear important determinants of cryptic-donor activation. However, other factors such as splicing regulatory elements, which are difficult to identify, play an important role and are likely responsible for current prediction inaccuracies. We find that the most frequently recurring natural mis-splicing events at each exon-intron junction, summarised over 40,233 RNA-sequencing samples (40K-RNA), predict with accuracy which cryptic-donor will be activated in rare disease. 40K-RNA provides an accurate, evidence-based method to predict variant-activated cryptic-donors in genetic disorders, assisting pathology consideration of possible consequences of a variant for the encoded protein and RNA diagnostic testing strategies.
