9 research outputs found

    Investigating DNA-, RNA-, and protein-based features as a means to discriminate pathogenic synonymous variants

    Synonymous single-nucleotide variants (SNVs), although they do not alter the encoded protein sequences, have been implicated in many genetic diseases. Experimental studies indicate that synonymous SNVs can lead to changes in the secondary and tertiary structures of DNA and RNA, thereby affecting translational efficiency, cotranslational protein folding as well as the binding of DNA-/RNA-binding proteins. However, the importance of these various features in disease phenotypes is not clearly understood. Here, we have built a support vector machine (SVM) model (termed DDIG-SN) as a means to discriminate disease-causing synonymous variants. The model was trained and evaluated on nearly 900 disease-causing variants. The method achieves robust performance with the area under the receiver operating characteristic curve of 0.84 and 0.85 for protein-stratified 10-fold cross-validation and independent testing, respectively. We were able to show that the disease-causing effects in the immediate proximity to exon–intron junctions (1–3 bp) are driven by the loss of splicing motif strength, whereas the gain of splicing motif strength is the primary cause in regions further away from the splice site (4–69 bp). The method is available as a part of the DDIG server at http://sparks-lab.org/ddig
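The abstract above reports the area under the ROC curve under protein-stratified 10-fold cross-validation. The following is a minimal sketch of that evaluation idea, not the DDIG-SN implementation. It assumes that "protein-stratified" means all variants from one protein are kept in the same fold, so that no protein contributes to both training and test data. The function names `protein_stratified_folds` and `roc_auc` are hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def protein_stratified_folds(proteins, k=10):
    """Assign variant indices to k folds so that every variant of a given
    protein lands in the same fold (assumed meaning of 'protein-stratified')."""
    # Group variant indices by their protein identifier.
    by_protein = defaultdict(list)
    for idx, protein in enumerate(proteins):
        by_protein[protein].append(idx)

    # Greedy balancing: place the largest protein groups first,
    # each into the fold that currently holds the fewest variants.
    folds = [[] for _ in range(k)]
    for protein in sorted(by_protein, key=lambda p: (-len(by_protein[p]), p)):
        smallest = min(range(k), key=lambda f: len(folds[f]))
        folds[smallest].extend(by_protein[protein])
    return folds


def roc_auc(labels, scores):
    """AUC as the probability that a positive example scores higher than
    a negative one; tied scores count as half a win."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In use, a classifier would be trained on nine folds and scored on the held-out fold, with `roc_auc` computed on the held-out predictions. The pairwise AUC here is quadratic in the number of variants, which is acceptable for a dataset of a few thousand examples.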

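    Approche de réseau sémantique probabiliste pour l’étude des relations génotype-phénotype dans le cadre des maladies génétiques humaines [A probabilistic semantic network approach for studying genotype-phenotype relationships in human genetic diseases]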

    This thesis develops a method for modeling complex systems using knowledge graphs and automated reasoning algorithms. The modeling method was applied to rare diseases to predict their causes from the genetic to the cellular, physiological, and whole-organism levels. Two ontologies, GO and HPO, were used to create the knowledge graph. Since no database of relationships between these ontologies existed, a machine learning method was developed to infer such relationships and was applied to GO and HPO. The thesis is completed by a machine learning method to infer the deleterious effects of a type of genetic variation called an INDEL. Altogether, the artificial intelligence work presented in this doctoral thesis helps rare disease researchers understand what happens in the human body at various levels of abstraction, from the occurrence of a genetic variation to the development of a rare disease.

    A comprehensive study of small non-frameshift insertions/deletions in proteins and prediction of their phenotypic effects by a machine learning method (KD4i)

    BACKGROUND: Small insertion and deletion polymorphisms (Indels) are the second most common mutations in the human genome, after Single Nucleotide Polymorphisms (SNPs). Recent studies have shown that they have significant influence on genetic variation by altering human traits and can cause multiple human diseases. In particular, many Indels that occur in protein coding regions are known to impact the structure or function of the protein. A major challenge is to predict the effects of these Indels and to distinguish between deleterious and neutral variants. When an Indel occurs within a coding region, it can be either frameshifting (FS) or non-frameshifting (NFS). FS-Indels either modify the complete C-terminal region of the protein or result in premature termination of translation. NFS-Indels insert/delete multiples of three nucleotides, leading to the insertion/deletion of one or more amino acids. RESULTS: In order to study the relationships between NFS-Indels and Mendelian diseases, we characterized NFS-Indels according to numerous structural, functional and evolutionary parameters. We then used these parameters to identify specific characteristics of disease-causing and neutral NFS-Indels. Finally, we developed a new machine learning approach, KD4i, that can be used to predict the phenotypic effects of NFS-Indels. CONCLUSIONS: We demonstrate in a large-scale evaluation that the accuracy of KD4i is comparable to existing state-of-the-art methods. However, a major advantage of our approach is that we also provide the reasons for the predictions, in the form of a set of rules. The rules are interpretable by non-expert humans and they thus represent new knowledge about the relationships between the genotype and phenotypes of NFS-Indels and the causative molecular perturbations that result in the disease.
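The FS/NFS distinction described in the abstract above reduces to whether the net length change is a multiple of three nucleotides. A minimal sketch follows, assuming the indel is given as reference and alternate allele strings (a VCF-like convention that the abstract does not specify). The function name `classify_coding_indel` is hypothetical, and this is not part of KD4i.

```python
def classify_coding_indel(ref, alt):
    """Label a coding-region indel as 'NFS' (non-frameshifting) when the
    net length change is a nonzero multiple of three, otherwise 'FS'."""
    delta = len(alt) - len(ref)  # positive: insertion, negative: deletion
    if delta == 0:
        raise ValueError("not an indel: ref and alt have equal length")
    # A multiple of three adds or removes whole codons and keeps the reading frame.
    return "NFS" if delta % 3 == 0 else "FS"
```

For example, `classify_coding_indel("A", "ACGT")` describes a three-nucleotide insertion and yields `"NFS"`, while `classify_coding_indel("A", "AC")` shifts the reading frame and yields `"FS"`.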

    Variation Interpretation Predictors: Principles, Types, Performance, and Choice
