182 research outputs found

    X-CAP improves pathogenicity prediction of stopgain variants

    Abstract: Stopgain substitutions are the third-largest class of monogenic human disease mutations and often examined first in patient exomes. Existing computational stopgain pathogenicity predictors, however, exhibit poor performance at the high sensitivity required for clinical use. Here, we introduce a new classifier, termed X-CAP, which uses a novel training methodology and unique feature set to improve the AUROC by 18% and decrease the false-positive rate 4-fold on large variant databases. In patient exomes, X-CAP prioritizes causal stopgains better than existing methods do, further illustrating its clinical utility. X-CAP is available at https://github.com/bejerano-lab/X-CAP

    AVADA improves automated genetic variant database construction directly from full-text literature

    Purpose: The primary literature on human genetic diseases includes descriptions of pathogenic variants that are essential for clinical diagnosis. Variant databases such as ClinVar and HGMD collect pathogenic variants by manual curation. We aimed to automatically construct a freely accessible database of pathogenic variants directly from full-text articles about genetic disease. Methods: AVADA (Automatically curated VAriant DAtabase) is a novel machine learning tool that uses natural language processing to automatically identify pathogenic variants and genes in the full text of primary literature and converts them to genomic coordinates for rapid downstream use. Results: AVADA automatically curated almost 60% of pathogenic variants deposited in HGMD, a 4.4-fold improvement over the current state of the art in automated variant extraction. AVADA also contains more than 60,000 pathogenic variants that are in HGMD, but not in ClinVar. In a cohort of 245 diagnosed patients, AVADA correctly annotated 38 previously described diagnostic variants, compared to 43 using HGMD, 20 using ClinVar and only 13 (wholly subsumed by AVADA and ClinVar's) using the best automated abstract-only approach. Conclusion: AVADA is the first machine learning tool that automatically curates a variant database directly from full-text literature. AVADA is available upon publication at http://bejerano.stanford.edu/AVADA

    S-CAP extends pathogenicity prediction to genetic variants that affect RNA splicing

    Exome analysis of patients with a likely monogenic disease does not identify a causal variant in over half of cases. Splice-disrupting mutations make up the second largest class of known disease-causing mutations. Each individual (singleton) exome harbors over 500 rare variants of unknown significance (VUS) in the splicing region. The existing relevant pathogenicity prediction tools tackle all non-coding variants as one amorphic class and/or are not calibrated for the high sensitivity required for clinical use. Here we calibrate seven such tools and devise a novel tool called Splicing Clinically Applicable Pathogenicity prediction (S-CAP) that is over twice as powerful as all previous tools, removing 41% of patient VUS at 95% sensitivity. We show that S-CAP does this by using its own features and not via meta-prediction over previous tools, and that splicing pathogenicity prediction is distinct from predicting molecular splicing changes. S-CAP is an important step on the path to deriving non-coding causal diagnoses
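The phrase "removing 41% of patient VUS at 95% sensitivity" describes a score cutoff calibrated on known pathogenic variants and then applied to variants of unknown significance. The following is a minimal sketch of that kind of calibration, not S-CAP's actual procedure: the function names and all scores are hypothetical, and it assumes that higher scores mean "more likely pathogenic".

```python
import math

def threshold_at_sensitivity(pathogenic_scores, target=0.95):
    """Highest cutoff that still flags at least `target` of the known
    pathogenic variants (a variant is flagged when score >= cutoff)."""
    if not pathogenic_scores:
        raise ValueError("need at least one pathogenic score")
    ranked = sorted(pathogenic_scores, reverse=True)
    needed = math.ceil(target * len(ranked))  # variants that must be flagged
    return ranked[needed - 1]

def fraction_removed(vus_scores, cutoff):
    """Fraction of VUS that fall below the cutoff and can be set aside."""
    return sum(s < cutoff for s in vus_scores) / len(vus_scores)

# Hypothetical scores, for illustration only.
pathogenic = [0.91, 0.85, 0.77, 0.64, 0.58, 0.52, 0.47, 0.39, 0.33, 0.12]
vus = [0.05, 0.08, 0.10, 0.20, 0.45, 0.60]

cutoff = threshold_at_sensitivity(pathogenic, target=0.9)
removed = fraction_removed(vus, cutoff)
```

With a fixed sensitivity target, tools can be compared by how large a share of VUS each one removes at its own calibrated cutoff.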

    AMELIE speeds Mendelian diagnosis by matching patient phenotype and genotype to primary literature

    The diagnosis of Mendelian disorders requires labor-intensive literature research. Trained clinicians can spend hours looking for the right publication(s) supporting a single gene that best explains a patient’s disease. AMELIE (Automatic Mendelian Literature Evaluation) greatly accelerates this process. AMELIE parses all 29 million PubMed abstracts and downloads and further parses hundreds of thousands of full-text articles in search of information supporting the causality and associated phenotypes of most published genetic variants. AMELIE then prioritizes patient candidate variants for their likelihood of explaining any patient’s given set of phenotypes. Diagnosis of singleton patients (without relatives’ exomes) is the most time-consuming scenario, and AMELIE ranked the causative gene at the very top for 66% of 215 diagnosed singleton Mendelian patients from the Deciphering Developmental Disorders project. Evaluating only the top 11 AMELIE-scored genes of 127 (median) candidate genes per patient resulted in a rapid diagnosis in more than 90% of cases. AMELIE-based evaluation of all cases was 3 to 19 times more efficient than hand-curated database–based approaches. We replicated these results on a retrospective cohort of clinical cases from Stanford Children’s Health and the Manton Center for Orphan Disease Research. An analysis web portal with our most recent update, programmatic interface, and code is available at AMELIE.stanford.edu
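The reported figures (causative gene ranked first in 66% of cases, within the top 11 in more than 90%) are top-k rates over per-patient gene rankings. A minimal sketch of how such a rate could be computed is shown here; the helper names, gene labels, and cases are hypothetical and are not taken from AMELIE's code or data.

```python
def causal_rank(ranked_genes, causal_gene):
    """1-based position of the causal gene in a ranking, or None if absent."""
    try:
        return ranked_genes.index(causal_gene) + 1
    except ValueError:
        return None

def top_k_rate(cases, k):
    """Share of cases whose causal gene appears within the top k ranks."""
    hits = 0
    for ranked_genes, causal_gene in cases:
        rank = causal_rank(ranked_genes, causal_gene)
        if rank is not None and rank <= k:
            hits += 1
    return hits / len(cases)

# Hypothetical cases: (ranked candidate genes, known causal gene).
cases = [
    (["GENE_A", "GENE_B", "GENE_C"], "GENE_A"),
    (["GENE_D", "GENE_E", "GENE_F"], "GENE_E"),
    (["GENE_G", "GENE_H", "GENE_I"], "GENE_I"),
    (["GENE_J", "GENE_K"], "GENE_Z"),  # causal gene missing from the ranking
]

top1 = top_k_rate(cases, 1)
top3 = top_k_rate(cases, 3)
```

Cases in which the causal gene never appears among the candidates count as misses at every k, so the rate cannot reach 100% for them.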