
    AVADA improves automated genetic variant database construction directly from full-text literature

    Purpose: The primary literature on human genetic diseases includes descriptions of pathogenic variants that are essential for clinical diagnosis. Variant databases such as ClinVar and HGMD collect pathogenic variants by manual curation. We aimed to automatically construct a freely accessible database of pathogenic variants directly from full-text articles about genetic disease. Methods: AVADA (Automatically curated VAriant DAtabase) is a novel machine learning tool that uses natural language processing to automatically identify pathogenic variants and genes in the full text of primary literature and converts them to genomic coordinates for rapid downstream use. Results: AVADA automatically curated almost 60% of pathogenic variants deposited in HGMD, a 4.4-fold improvement over the current state of the art in automated variant extraction. AVADA also contains more than 60,000 pathogenic variants that are in HGMD but not in ClinVar. In a cohort of 245 diagnosed patients, AVADA correctly annotated 38 previously described diagnostic variants, compared to 43 using HGMD, 20 using ClinVar, and only 13 (wholly subsumed by AVADA's and ClinVar's) using the best automated abstracts-only approach. Conclusion: AVADA is the first machine learning tool that automatically curates a variant database directly from full-text literature. AVADA is available upon publication at http://bejerano.stanford.edu/AVADA

    AMELIE speeds Mendelian diagnosis by matching patient phenotype and genotype to primary literature

    The diagnosis of Mendelian disorders requires labor-intensive literature research. Trained clinicians can spend hours looking for the right publication(s) supporting a single gene that best explains a patient’s disease. AMELIE (Automatic Mendelian Literature Evaluation) greatly accelerates this process. AMELIE parses all 29 million PubMed abstracts and downloads and further parses hundreds of thousands of full-text articles in search of information supporting the causality and associated phenotypes of most published genetic variants. AMELIE then prioritizes patient candidate variants for their likelihood of explaining any patient’s given set of phenotypes. Diagnosis of singleton patients (without relatives’ exomes) is the most time-consuming scenario, and AMELIE ranked the causative gene at the very top for 66% of 215 diagnosed singleton Mendelian patients from the Deciphering Developmental Disorders project. Evaluating only the top 11 AMELIE-scored genes of 127 (median) candidate genes per patient resulted in a rapid diagnosis in more than 90% of cases. AMELIE-based evaluation of all cases was 3 to 19 times more efficient than hand-curated database–based approaches. We replicated these results on a retrospective cohort of clinical cases from Stanford Children’s Health and the Manton Center for Orphan Disease Research. An analysis web portal with our most recent update, programmatic interface, and code is available at AMELIE.stanford.edu

    AVADA: toward automated pathogenic variant evidence retrieval directly from the full-text literature

    Purpose: Both monogenic pathogenic variant cataloging and clinical patient diagnosis start with variant-level evidence retrieval followed by expert evidence integration in search of diagnostic variants and genes. Here, we try to accelerate pathogenic variant evidence retrieval by an automatic approach. Methods: Automatic VAriant evidence DAtabase (AVADA) is a novel machine learning tool that uses natural language processing to automatically identify pathogenic genetic variant evidence in full-text primary literature about monogenic disease and convert it to genomic coordinates. Results: AVADA automatically retrieved almost 60% of likely disease-causing variants deposited in the Human Gene Mutation Database (HGMD), a 4.4-fold improvement over the current best open source automated variant extractor. AVADA contains over 60,000 likely disease-causing variants that are in HGMD but not in ClinVar. AVADA also highlights the challenges of automated variant mapping and pathogenicity curation. However, when combined with manual validation, on 245 diagnosed patients, AVADA provides valuable evidence for an additional 18 diagnostic variants, on top of ClinVar’s 21, versus only 2 using the best current automated approach. Conclusion: AVADA advances automated retrieval of pathogenic monogenic variant evidence from full-text literature. Although far from perfect, AVADA is much faster than PubMed/Google Scholar search, and careful curation of AVADA-retrieved evidence can aid both database curation and patient diagnosis.

    Software verification with IC3 by means of abstraction and interpolation (Softwareverifikation mit IC3 mittels Abstraktion und Interpolation)

    Variant title as translated by the author. Abstract in German. Bibliography: pp. 77-81. Software model checking is an approach to formal software verification based on reasoning about the states a program can be in. A software model checker can prove or refute that certain properties hold on a given program. Properties expressing that certain states must never be reached during a run are called safety properties. In this work, we aim to construct a model checker that can prove or refute safety properties on certain programs. The approach for model checking is based on the principle of incremental, inductive model checking. An incremental, inductive model checker proves safety properties by incrementally constructing a description of a set of states that the program can never leave and all of which are safe.
The model of the program that the checker operates on is the transition system. The transition systems we derive from software programs are expressed as first-order formulas over the theory of quantifier-free linear integer arithmetic. Such transition systems operate on infinitely many states, since each first-order constant in the transition system can be interpreted as an arbitrary integer value. The method we develop in this work is based on the IC3 model checker ([10]). IC3 can prove properties only on systems that comprise finitely many states. Since we aim to prove safety properties on systems working with infinitely many states, we reduce the problem to the finite-state case by applying Boolean predicate abstraction to the state space. The reduced state space is called the abstract domain. However, Boolean predicate abstraction makes it difficult to choose a fitting set of predicates that will allow the algorithm to prove safety properties. We approach the problem by starting with a set of heuristically determined predicates and adding new predicates to the abstract domain as needed during the run of the algorithm, thus refining the abstract domain. In the model checker we developed, refinement is predominantly triggered by spurious counterexamples: the abstract domain admits transitions that could not occur in the original state space. In order to eliminate such spurious counterexamples, we refine the abstract domain with predicates extracted from Craig interpolants ([20, 21]) over a formula that describes the infeasibility of the spurious counterexample. Thus, our approach to abstraction is an instance of counterexample-guided abstraction refinement ([18, 19]). The result of this thesis is IC3-CEGAR, an incremental, inductive model checker that is capable of proving or disproving safety properties on certain software programs.
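
The refinement idea in this abstract can be illustrated with a toy sketch. It is not the thesis's IC3-CEGAR algorithm: the transition system (`step`, `INIT`, `is_bad`), the finite `WINDOW` that stands in for solver queries, and the hand-picked refinement predicate `x <= 3` (playing the role an interpolant-derived predicate would play) are all assumptions made for illustration.

```python
from typing import Callable, List, Set, Tuple

Predicate = Callable[[int], bool]
AbstractState = Tuple[bool, ...]

# Assumption: a bounded window replaces the SMT queries a real checker would use.
WINDOW = range(-10, 11)

INIT = 0  # hypothetical initial state


def step(x: int) -> int:
    """Hypothetical transition: count upward, but reset to 0 from 3."""
    return 0 if x == 3 else x + 1


def is_bad(x: int) -> bool:
    """Hypothetical safety property: x must never equal 5."""
    return x == 5


def alpha(x: int, preds: List[Predicate]) -> AbstractState:
    """Boolean predicate abstraction: a state becomes a tuple of truth values."""
    return tuple(p(x) for p in preds)


def abstractly_safe(preds: List[Predicate]) -> bool:
    """Check whether any bad abstract state is reachable in the abstraction."""
    edges = {(alpha(x, preds), alpha(step(x), preds)) for x in WINDOW}
    seen: Set[AbstractState] = {alpha(INIT, preds)}
    frontier = [alpha(INIT, preds)]
    while frontier:
        current = frontier.pop()
        for src, dst in edges:
            if src == current and dst not in seen:
                seen.add(dst)
                frontier.append(dst)
    bad = {alpha(x, preds) for x in WINDOW if is_bad(x)}
    return not (seen & bad)


def concretely_reaches_bad(max_steps: int = 50) -> bool:
    """Simulate the concrete (deterministic) system from INIT."""
    x = INIT
    for _ in range(max_steps):
        if is_bad(x):
            return True
        x = step(x)
    return is_bad(x)


coarse: List[Predicate] = [is_bad]
refined: List[Predicate] = coarse + [lambda x: x <= 3]  # hand-picked refinement

coarse_safe = abstractly_safe(coarse)    # abstraction admits the step 4 -> 5
spurious = (not coarse_safe) and (not concretely_reaches_bad())
refined_safe = abstractly_safe(refined)  # extra predicate rules that step out
```

With only the predicate `x == 5`, the abstraction merges the unreachable state 4 with the initial state, so it reports a counterexample that the concrete system cannot follow. Adding `x <= 3` separates the two, and the abstract reachable set no longer contains a bad state.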