73 research outputs found

    An integrated computational pipeline and database to support whole-genome sequence annotation

    We describe here our experience in annotating the Drosophila melanogaster genome sequence, in the course of which we developed several new open-source software tools and a database schema to support large-scale genome annotation. We have developed these into an integrated and reusable software system for whole-genome annotation. The key contributions to overall annotation quality are the marshalling of high-quality sequences for alignments and the design of a system with an adaptable and expandable architecture.

    Predicting disease-associated substitution of a single amino acid by analyzing residue interactions

    Background: The rapid accumulation of data on non-synonymous single nucleotide polymorphisms (nsSNPs, also called SAPs) should allow us to further our understanding of the underlying disease-associated mechanisms. Here, we use complex networks to study the role of an amino acid in both local and global structures and determine the extent to which disease-associated and polymorphic SAPs differ in terms of their interactions with other residues. Results: We found that SAPs can be well characterized by network topological features. Mutations are probably disease-associated when they occur at a site with a high centrality value and/or high degree value in a protein structure network. We also discovered that study of the neighboring residues around a mutation site can help to determine whether the mutation is disease-related or not. We compiled a dataset from the Swiss-Prot variant pages and constructed a model to predict disease-associated SAPs based on the random forest algorithm. The values of total accuracy and MCC were 83.0% and 0.64, respectively, as determined by 5-fold cross-validation. With an independent dataset, our model achieved a total accuracy of 80.8% and an MCC of 0.59. Conclusions: The satisfactory performance suggests that network topological features can be used as quantification measures to determine the importance of a site on a protein, and this approach can complement existing methods for prediction of disease-associated SAPs. Moreover, the use of this method in SAP studies would help to determine the underlying linkage between SAPs and diseases through extensive investigation of mutual interactions between residues.
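
The abstract above reports total accuracy and MCC (Matthews correlation coefficient). As a reading aid, the following minimal sketch shows how both metrics follow from a binary confusion matrix. The function name and the counts are hypothetical and are not taken from the paper.

```python
import math


def accuracy_and_mcc(tp, tn, fp, fn):
    """Return (accuracy, MCC) for a binary confusion matrix.

    tp/tn/fp/fn are counts of true positives, true negatives,
    false positives and false negatives.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # MCC denominator: square root of the product of the four marginal sums.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal sum is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, mcc


# Hypothetical counts for illustration only (not the paper's data).
acc, mcc = accuracy_and_mcc(tp=80, tn=85, fp=15, fn=20)
print(f"accuracy={acc:.3f}, MCC={mcc:.3f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when the disease-associated and neutral classes are unbalanced.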

    Physics of Neutron Star Crusts

    The physics of neutron star crusts is vast, involving many different research fields, from nuclear and condensed matter physics to general relativity. This review summarizes the progress achieved over the last few years in modeling neutron star crusts, at both the microscopic and macroscopic levels. The confrontation of these theoretical models with observations is also briefly discussed. Comment: 182 pages, published version available at http://www.livingreviews.org/lrr-2008-10

    Characterizing Mutational Heterogeneity in a Glioblastoma Patient with Double Recurrence

    Human cancers are driven by the acquisition of somatic mutations. Separating the driving mutations from those that are random consequences of general genomic instability remains a challenge. New sequencing technology makes it possible to detect mutations that are present in only a minority of cells in a heterogeneous tumor population. We sought to leverage the power of ultra-deep sequencing to study various levels of tumor heterogeneity in the serial recurrences of a single glioblastoma multiforme patient. Our goal was to gain insight into the temporal succession of DNA base-level lesions by querying intra- and inter-tumoral cell populations in the same patient over time. We performed targeted "next-generation" sequencing on seven samples from the same patient: two foci within the primary tumor, two foci within an initial recurrence, two foci within a second recurrence, and normal blood. Our study reveals multiple levels of mutational heterogeneity. We found variable frequencies of specific EGFR, PIK3CA, PTEN, and TP53 base substitutions within individual tumor regions and across distinct regions within the same tumor. In addition, specific mutations emerge and disappear along the temporal spectrum from tumor at the time of diagnosis to second recurrence, demonstrating evolution during tumor progression. Our results shed light on the spatial and temporal complexity of brain tumors. As sequencing costs continue to decline and deep sequencing technology eventually moves into the clinic, this approach may provide guidance for treatment choices as we embark on the path to personalized cancer medicine.

    Improving the prediction of disease-related variants using protein three-dimensional structure

    Background: Single Nucleotide Polymorphisms (SNPs) are an important source of human genome variability. Non-synonymous SNPs occurring in coding regions result in single amino acid polymorphisms (SAPs) that may affect protein function and lead to pathology. Several methods attempt to estimate the impact of SAPs using different sources of information. Although sequence-based predictors have shown good performance, the quality of these predictions can be further improved by introducing new features derived from three-dimensional protein structures. Results: In this paper, we present a structure-based machine learning approach for predicting disease-related SAPs. We have trained a Support Vector Machine (SVM) on a set of 3,342 disease-related mutations and 1,644 neutral polymorphisms from 784 protein chains. We use SVM input features derived from the protein's sequence, structure, and function. After dataset balancing, the structure-based method (SVM-3D) reaches an overall accuracy of 85%, a correlation coefficient of 0.70, and an area under the receiver operating characteristic curve (AUC) of 0.92. When compared with a similar sequence-based predictor, SVM-3D results in an increase of the overall accuracy and AUC by 3%, and correlation coefficient by 0.06. The robustness of this improvement has been tested on different datasets, and in all cases SVM-3D performs better than previously developed methods, even when compared with PolyPhen2, which explicitly takes protein structure information as input. Conclusion: This work demonstrates that structural information can increase the accuracy of identifying disease-related SAPs. Our results also quantify the magnitude of improvement on a large dataset. This improvement is in agreement with previously observed results, where structure information enhanced the prediction of protein stability changes upon mutation. Although the limited structural information contained in the Protein Data Bank restricts the application and the performance of our structure-based method, we expect that SVM-3D will result in higher accuracy when more structural data become available. © 2011 Capriotti; licensee BioMed Central Ltd.

    wKinMut: An integrated tool for the analysis and interpretation of mutations in human protein kinases

    BACKGROUND: Protein kinases are involved in relevant physiological functions, and a large number of mutations in this superfamily have been reported in the literature to affect protein function and stability. Unfortunately, exploring the phenotypic consequences of each individual mutation remains a considerable challenge. RESULTS: The wKinMut web-server offers direct prediction of the potential pathogenicity of the mutations from a number of methods, including our recently developed prediction method based on the combination of information from a range of diverse sources, including physicochemical properties and functional annotations from FireDB and Swissprot and kinase-specific characteristics such as membership in specific kinase groups, the annotation with disease-associated GO terms or the occurrence of the mutation in PFAM domains, and the relevance of the residues in determining kinase subfamily specificity from S3Det. This predictor yields interesting results that compare favourably with other methods in the field when applied to protein kinases. Together with the predictions, wKinMut offers a number of integrated services for the analysis of mutations. These include: the classification of the kinase, information about associations of the kinase with other proteins extracted from iHop, the mapping of the mutations onto PDB structures, pathogenicity records from a number of databases and the classification of mutations in large-scale cancer studies. Importantly, wKinMut is connected with the SNP2L system, which extracts mentions of mutations directly from the literature and therefore increases the chances of finding interesting functional information associated with the studied mutations. CONCLUSIONS: wKinMut facilitates the exploration of the information available about individual mutations by integrating prediction approaches with the automatic extraction of information from the literature (text mining) and several state-of-the-art databases. wKinMut has been used during the last year for the analysis of the consequences of mutations in the context of a number of cancer genome projects, including the recent analysis of Chronic Lymphocytic Leukemia cases, and is publicly available at http://wkinmut.bioinfo.cnio.es

    Statistical method on nonrandom clustering with application to somatic mutations in cancer

    Background: Human cancer is caused by the accumulation of tumor-specific mutations in oncogenes and tumor suppressors that confer a selective growth advantage to cells. As a consequence of genomic instability and high levels of proliferation, many passenger mutations that do not contribute to the cancer phenotype arise alongside mutations that drive oncogenesis. While several approaches have been developed to separate driver mutations from passengers, few approaches can specifically identify activating driver mutations in oncogenes, which are more amenable to pharmacological intervention. Results: We propose a new statistical method for detecting activating mutations in cancer by identifying nonrandom clusters of amino acid mutations in protein sequences. A probability model is derived using order statistics, assuming that the location of amino acid mutations on a protein follows a uniform distribution. Our statistical measure is the difference between pairwise order statistics, which is equivalent to the size of an amino acid mutation cluster, and the probabilities are derived from exact and approximate distributions of the statistical measure. Using data in the Catalog of Somatic Mutations in Cancer (COSMIC) database, we have demonstrated that our method detects well-known clusters of activating mutations in KRAS, BRAF, PI3K, and β-catenin. The method can also identify new cancer targets as well as gain-of-function mutations in tumor suppressors. Conclusions: Our proposed method is useful to discover activating driver mutations in cancer by identifying nonrandom clusters of somatic amino acid mutations in protein sequences.
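
The order-statistics idea in the abstract above can be illustrated with a standard textbook result. This is not necessarily the exact or approximate distribution the authors derive. For n positions drawn independently from a continuous Uniform(0, 1) distribution, the gap between two order statistics that are m ranks apart follows a Beta(m, n - m + 1) distribution, whose CDF equals a binomial tail sum. The function name and all numbers below are hypothetical. The sketch treats residue positions as continuous and tests a single pre-chosen pair of ranks, with no correction for scanning all pairs.

```python
from math import comb


def spacing_cdf(n, m, d):
    """P(U_(i+m) - U_(i) <= d) for n i.i.d. Uniform(0, 1) positions.

    The spacing is Beta(m, n - m + 1) distributed, and that CDF equals
    the probability that a Binomial(n, d) variable is at least m.
    """
    if not 1 <= m < n:
        raise ValueError("m must satisfy 1 <= m < n")
    if d <= 0:
        return 0.0
    if d >= 1:
        return 1.0
    return sum(comb(n, j) * d**j * (1 - d) ** (n - j) for j in range(m, n + 1))


# Hypothetical example: 20 mutations on a 500-residue protein, with
# 6 of them (ranks i .. i+5, so m = 5) falling within a 10-residue window.
protein_length = 500
window = 10
p = spacing_cdf(n=20, m=5, d=window / protein_length)
print(f"P(cluster this tight under uniformity) = {p:.3e}")
```

A small probability suggests the cluster is unlikely under uniformly placed mutations. A real analysis would also need to account for discrete positions and multiple testing.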

    Discovering cancer genes by integrating network and functional properties

    Background: Identification of novel cancer-causing genes is one of the main goals in cancer research. The rapid accumulation of genome-wide protein-protein interaction (PPI) data in humans has provided a new basis for studying the topological features of cancer genes in cellular networks. It is important to integrate multiple genomic data sources, including PPI networks, protein domains and Gene Ontology (GO) annotations, to facilitate the identification of cancer genes. Methods: Topological features of the PPI network, as well as protein domain compositions, enrichment of gene ontology categories, sequence and evolutionary conservation features were extracted and compared between cancer genes and other genes. The predictive power of various classifiers for identification of cancer genes was evaluated by cross validation. Experimental validation of a subset of the prediction results was conducted using siRNA knockdown and viability assays in human colon cancer cell line DLD-1. Results: Cross validation demonstrated advantageous performance of classifiers based on support vector machines (SVMs) with the inclusion of the topological features from the PPI network, protein domain compositions and GO annotations. We then applied the trained SVM classifier to human genes to prioritize putative cancer genes. siRNA knock-down of several SVM predicted cancer genes displayed greatly reduced cell viability in human colon cancer cell line DLD-1. Conclusion: Topological features of PPI networks, protein domain compositions and GO annotations are good predictors of cancer genes. The SVM classifier integrates multiple features and as such is useful for prioritizing candidate cancer genes for experimental validations.