758 research outputs found

    Tools and annotations for variation

    Since the completion of the Human Genome Project, many next-generation sequencing (NGS), or high-throughput sequencing, platforms have emerged. One of the applications of NGS technology, variant discovery, can serve as a basis for precision medicine. Large sequencing projects are generating huge amounts of genetic variation data, which are stored either in large central databases such as dbSNP or in gene- or disease-centered locus-specific databases (LSDBs). There are many variation databases, with many different formats and varying quality. Apart from storage and analysis pipeline capacity problems, the interpretation of the variation is also an issue. Computational methods for predicting the effects of variants have been and are being developed, since experimental assessment of variation effects is often not feasible. Benchmark datasets are needed for the development and performance assessment of such prediction methods. We studied quality-related aspects of variant databases and benchmark datasets. The online tool VariOtator was developed to aid in the consistent use of the Variation Ontology, which was specifically developed to describe variation. Standardization is one aspect of database quality; the use of an ontology for variant annotation will help improve it. BTKbase is a locus-specific database containing information on variants in BTK, the gene involved in X-linked agammaglobulinemia (XLA), a primary immunodeficiency. If available, phenotypic data, i.e. the variant effects, are also provided. Statistics on variants and variation types showed that there is a wide spectrum of both, and that protein variants are unevenly distributed across the BTK domains. The VariSNP database, containing datasets of neutral (non-pathogenic) variants, was generated by selecting variants from dbSNP and filtering out variants found in the ClinVar, PhenCode and SwissProt databases. Variants in these three databases are considered to be disease-related. The VariSNP database contains 13 datasets following the functional classification of dbSNP, and is updated on a regular basis. To study the sensitivity to variation in different protein and disease groups, we predicted the pathogenicity of all possible single amino acid substitutions (SAASs) in all proteins in these groups, using the well-performing prediction method PON-P2. Large differences in the proportions of harmful, benign and unknown variants were found, along with distinctive patterns of SAAS types, both in the original and variant amino acids. Representativeness is one quality aspect of variation benchmark datasets, and relates to how well a dataset represents the space of variants and their effects. We studied the coverage and distribution of protein features, including structure (CATH) and enzyme classification (EC), Pfam domains and Gene Ontology terms, in established benchmark datasets. None of the datasets is fully representative. Coverage of the features is in general better in the larger datasets, and better in the neutral datasets. At the higher levels of the CATH and EC classifications, all datasets were unbiased, but for the lower levels and other features, all datasets were biased.

    Integrating population variation and protein structural analysis to improve clinical interpretation of missense variation: application to the WD40 domain

    We present a generic, multidisciplinary approach for improving our understanding of novel missense variants in recently discovered disease genes exhibiting genetic heterogeneity, by combining clinical and population genetics with protein structural analysis. Using six new de novo missense diagnoses in TBL1XR1 from the Deciphering Developmental Disorders study, together with population variation data, we show that the β-propeller structure of the ubiquitous WD40 domain provides a convincing way to discriminate between pathogenic and benign variation. Children with likely pathogenic mutations in this gene have severely delayed language development, often accompanied by intellectual disability, autism, dysmorphology and gastrointestinal problems. Amino acids affected by likely pathogenic missense mutations are either crucial for the stability of the fold, forming part of a highly conserved symmetrically repeating hydrogen-bonded tetrad, or located at the top face of the β-propeller, where ‘hotspot’ residues affect the binding of β-catenin to the TBLR1 protein. In contrast, those altered by population variation are significantly less likely to be spatially clustered towards the top face or to be at buried or highly conserved residues. This result is useful not only for interpreting benign and pathogenic missense variants in this gene, but also in other WD40 domains, many of which are associated with disease

    Structural and Computational Characterization of Disease-Related Mutations Involved in Protein-Protein Interfaces

    Keywords: Computational docking; Interface prediction; Protein-protein interactions. One of the known potential effects of disease-causing amino acid substitutions in proteins is to modulate protein-protein interactions (PPIs). To interpret such variants at the molecular level and to obtain useful information for prediction purposes, it is important to determine whether they are located at protein-protein interfaces, which are composed of two main regions, core and rim, with different evolutionary conservation and physicochemical properties. Here we have performed a structural, energetic and computational analysis of interactions between proteins hosting mutations related to diseases detected in newborn screening. Interface residues were classified as core or rim, showing that the core residues contribute the most to the binding free energy of the PPI. Disease-causing variants are more likely to occur at the interface core region than at the interface rim (p < 0.0001). In contrast, neutral variants are more often found at the interface rim or at the non-interacting surface than at the interface core region. We also found that arginine, tryptophan, and tyrosine are over-represented among mutated residues leading to disease. These results can enhance our understanding of disease at the molecular level and thus contribute towards personalized medicine by helping clinicians to provide adequate diagnosis and treatments. This research was funded by the EU European Regional Development Fund (ERDF) through the Program Interreg V-A Spain-France-Andorra (POCTEFA), by the CSIC (intramural grant number 201720I031), and by the Spanish Ministry of Economy and Competitiveness (grants BIO2016-79930-R and SAF2016-80255-R). M.R. is recipient of an FPI fellowship from the Severo Ochoa program.

    The contribution of missense mutations in core and rim residues of protein-protein interfaces to human disease.

    Missense mutations at protein–protein interaction sites, called interfaces, are important contributors to human disease. Interfaces are non-uniform surface areas characterized by two main regions, “core” and “rim”, which differ in terms of evolutionary conservation and physicochemical properties. Moreover, within interfaces, only a small subset of residues (“hot spots”) is crucial for the binding free energy of the protein–protein complex. We performed a large-scale structural analysis of human single amino acid variations (SAVs) and demonstrated that disease-causing mutations are preferentially located within the interface core, as opposed to the rim (p<0.01). In contrast, the interface rim is significantly enriched in polymorphisms, similar to the remaining non-interacting surface. Energetic hot spots tend to be enriched in disease-causing mutations compared to non-hot spots (p=0.05), regardless of their occurrence in core or rim residues. For individual amino acids, the frequency of substitution into a polymorphism or disease-causing mutation differed from that of other amino acids and was related to structural location, as was the type of physicochemical change introduced by the SAV. In conclusion, this study demonstrated the different distribution and properties of disease-causing SAVs and polymorphisms within different structural regions and in relation to the energetic contribution of amino acids in protein–protein interfaces, thus highlighting the importance of a structural systems biology approach for predicting the effect of SAVs.
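As a rough illustration of the kind of core-versus-rim enrichment comparison described in the abstract above, a one-sided Fisher's exact test on a 2x2 table can be sketched as follows. This is only a sketch under stated assumptions: the abstract does not say which statistical test the authors used, the function names are illustrative, and the counts in the usage lines are hypothetical placeholders, not data from the study.

```python
from math import comb


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact p-value for enrichment in cell `a`.

    Assumed table layout (rows: variant class, columns: interface region):

                       core   rim
        disease          a     b
        polymorphism     c     d
    """
    row1 = a + b          # total disease-causing variants
    col1 = a + c          # total variants in the interface core
    n = a + b + c + d     # all variants in the table
    denom = comb(n, col1)
    upper = min(row1, col1)
    # Sum hypergeometric probabilities of all tables with the same margins
    # whose top-left cell is at least as large as the observed one.
    numer = sum(
        comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1)
    )
    return numer / denom


def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Sample odds ratio (a*d)/(b*c); undefined when b or c is zero."""
    return (a * d) / (b * c)


# Hypothetical counts, for illustration only.
a, b, c, d = 30, 10, 15, 25
print("odds ratio:", odds_ratio(a, b, c, d))
print("one-sided p-value:", fisher_exact_greater(a, b, c, d))
```

The exact test is used here because it needs no large-sample approximation and only the standard library; with large counts a chi-squared test would be a common alternative.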

    Spatial Distribution of Disease-associated Variants in Three-dimensional Structures of Protein Complexes

    No full text

    Protein structure and phenotypic analysis of pathogenic and population missense variants in STXBP1.

    This is the final version of the article. Available from Wiley via the DOI in this record. BACKGROUND: Syntaxin-binding protein 1, encoded by STXBP1, is highly expressed in the brain and involved in fusing synaptic vesicles with the plasma membrane. Studies have shown that pathogenic loss-of-function variants in this gene result in various types of epilepsies, mostly beginning early in life. We set out to model pathogenic missense variants on the protein structure to investigate the mechanism of pathogenicity and genotype-phenotype correlations. METHODS: We report 11 patients with pathogenic de novo mutations in STXBP1 identified in the first 4293 trios of the Deciphering Developmental Disorders (DDD) study, including six missense variants. We analyzed the structural locations of the pathogenic missense variants from this study and the literature, as well as population missense variants extracted from the Exome Aggregation Consortium (ExAC). RESULTS: Pathogenic variants are significantly more likely than population variants to occur at highly conserved locations and to be buried inside the protein domain. Pathogenic mutations are also more likely to destabilize the domain structure compared with population variants, increasing the proportion of (partially) unfolded domains that are prone to aggregation or degradation. We were unable to detect any genotype-phenotype correlation, but unlike previously reported cases, most of the DDD patients with STXBP1 pathogenic variants did not present with very early-onset or severe epilepsy and encephalopathy, though all have developmental delay with intellectual disability, and most display behavioral problems and suffered seizures in later childhood. CONCLUSION: Variants across STXBP1 that cause loss of function can result in severe intellectual disability with or without seizures, consistent with a haploinsufficiency mechanism. Pathogenic missense mutations act through destabilization of the protein domain, making it prone to aggregation or degradation. The presence or absence of early seizures may reflect ascertainment bias in the literature as well as the broad recruitment strategy of the DDD study. This study was supported by the Health Innovation Challenge Fund (grant number: HICF-1009-003) and Wellcome Trust Sanger Institute (grant number: WT098051).

    Integration of protein three-dimensional structure into the workflow of interpretation of genetic variants

    Life stores information in large biopolymer molecules, which can be represented as sequences of letters. Computers store information in sequences of zeros and ones. This predestines computers for the automated processing of biological data, and they have been applied to it with great success. Computational biology has produced many methods and tools based on biological sequences. However, reducing life to just sequences radically narrows the whole picture. The functionality of biomolecules, especially proteins, is performed in three-dimensional (3D) space. Thus, limiting methods in computational biology to sequences will never yield sufficient insight into the ways molecular biology operates. In this thesis I present my work on the integration of protein 3D structure information into the methodological workflow of computational biology. We developed an algorithmic pipeline that is able to map protein sequences to protein structures, providing an additional source of information. We used this pipeline to analyze the effects of genetic variants from the perspective of protein 3D structures. We analyzed genetic variants associated with diseases and compared their structural arrangements to those of neutral variants. Additionally, we discussed how structural information can improve methods that aim to predict the consequences of genetic variants.