
    Seeing the Forest for the Trees: Using the Gene Ontology to Restructure Hierarchical Clustering

    Motivation: There is growing interest in improving the cluster analysis of expression data by incorporating prior knowledge, such as the Gene Ontology (GO) annotations of genes, so that the clusters subjected to subsequent scrutiny are more biologically relevant. The structure of the GO is another source of background knowledge that can be exploited through the use of semantic similarity. Results: We propose here a novel algorithm that integrates semantic similarities (derived from the ontology structure) into the procedure of deriving clusters from the dendrogram constructed during expression-based hierarchical clustering. Our approach can handle the multiple annotations, from different levels of the GO hierarchy, which most genes have. Moreover, it treats annotated and unannotated genes in a uniform manner. Consequently, the clusters obtained by our algorithm are characterized by significantly enriched annotations. In both cross-validation tests and when using an external index such as protein–protein interactions, our algorithm performs better than previous approaches. When applied to human cancer expression data, our algorithm identifies, among others, clusters of genes related to immune response and glucose metabolism. These clusters are also supported by protein–protein interaction data. Supplementary information: Supplementary data are available at Bioinformatics online. Lynne and William Frankel Center for Computer Science; Paul Ivanier Center for Robotics Research and Production; National Institutes of Health (R01 HG003367-01A1

    Simplified amino acid alphabets based on deviation of conditional probability from random background

    The primitive data for deducing the Miyazawa-Jernigan contact energy or BLOSUM score matrix consist of pair frequency counts. Each amino acid corresponds to a conditional probability distribution. Based on the deviation of this conditional probability from a random background, a scheme for reducing the amino acid alphabet is proposed. A clear discrepancy is observed between the reduced alphabets obtained from the raw Miyazawa-Jernigan and BLOSUM residue pair counts. Taking the homologous sequence database SCOP40 as a test set, we detect homology with the resulting coarse-grained substitution matrices. The reduced alphabets are verified to preserve well the information contained in the original 20-letter alphabet. Comment: 9 pages,3figure

    Extracting Biomolecular Interactions Using Semantic Parsing of Biomedical Text

    We advance the state of the art in biomolecular interaction extraction with three contributions: (i) We show that deep Abstract Meaning Representations (AMR) significantly improve the accuracy of a biomolecular interaction extraction system when compared to a baseline that relies solely on surface- and syntax-based features; (ii) In contrast with previous approaches that infer relations on a sentence-by-sentence basis, we expand our framework to enable consistent predictions over sets of sentences (documents); (iii) We further modify and expand a graph kernel learning framework to enable concurrent exploitation of automatically induced AMR (semantic) and dependency structure (syntactic) representations. Our experiments show that our approach yields interaction extraction systems that are more robust in environments where there is a significant mismatch between training and test conditions. Comment: Appearing in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16

    Protein structure validation and refinement using amide proton chemical shifts derived from quantum mechanics

    We present the ProCS method for the rapid and accurate prediction of protein backbone amide proton chemical shifts, which are sensitive probes of the geometry of key hydrogen bonds that determine protein structure. ProCS is parameterized against quantum mechanical (QM) calculations and reproduces high-level QM results obtained for a small protein with an RMSD of 0.25 ppm (r = 0.94). ProCS is interfaced with the PHAISTOS protein simulation program and is used to infer statistical protein ensembles that reflect experimentally measured amide proton chemical shift values. Such chemical shift-based structural refinements, starting from high-resolution X-ray structures of Protein G, ubiquitin, and SMN Tudor Domain, result in average chemical shifts, hydrogen bond geometries, and trans-hydrogen bond (h3JNC') spin-spin coupling constants that are in excellent agreement with experiment. We show that the structural sensitivity of the QM-based amide proton chemical shift predictions is needed to refine protein structures to this agreement. The ProCS method thus offers a powerful new tool for refining the structures of hydrogen bonding networks to high accuracy, with many potential applications such as protein flexibility in ligand binding. Comment: PLOS ONE accepted, Nov 201
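
    The two agreement figures quoted in the abstract above (an RMSD in ppm and a correlation coefficient r) can be illustrated with a minimal sketch. The function names and the idea of comparing two plain lists of shifts are assumptions made for illustration; this is not the ProCS code.

```python
import math


def rmsd(predicted, reference):
    """Root-mean-square deviation between two equal-length lists of values."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)
```

    In this sketch, `predicted` would hold the fast model's shifts and `reference` the QM values for the same amide protons, both in ppm.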

    Comparing reverse complementary genomic words based on their distance distributions and frequencies

    In this work we study reverse complementary genomic word pairs in human DNA, by comparing both the distance distribution and the frequency of a word to those of its reverse complement. Several measures of dissimilarity between distance distributions are considered, and it is found that the peak dissimilarity works best in this setting. We report the existence of reverse complementary word pairs with very dissimilar distance distributions, as well as word pairs with very similar distance distributions even when both distributions are irregular and contain strong peaks. The association between distribution dissimilarity and frequency discrepancy is also explored, and it is speculated that symmetric pairs combining low and high values of each measure may uncover features of interest. Taken together, our results suggest that some asymmetries in the human genome go far beyond Chargaff's rules. This study uses both the complete human genome and its repeat-masked version. Comment: Post-print of a paper accepted for publication in "Interdisciplinary Sciences: Computational Life Sciences" (ISSN: 1913-2751, ESSN: 1867-1462
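
    The comparison described in the abstract above can be sketched in a few lines: take a word, form its reverse complement, and compare how far apart consecutive occurrences of each lie in a sequence. All function names here are hypothetical, and `max_abs_difference` is a simple stand-in measure chosen for illustration; it is not the "peak dissimilarity" defined in the paper.

```python
from collections import Counter

# Watson-Crick base pairing used to build the reverse complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Complement every base, then reverse the word."""
    return word.translate(COMPLEMENT)[::-1]


def occurrence_positions(sequence, word):
    """Start positions of all occurrences of word, overlaps included."""
    positions = []
    start = sequence.find(word)
    while start != -1:
        positions.append(start)
        start = sequence.find(word, start + 1)
    return positions


def distance_distribution(sequence, word):
    """Relative frequency of each distance between consecutive occurrences."""
    positions = occurrence_positions(sequence, word)
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    if not gaps:
        return {}
    counts = Counter(gaps)
    return {distance: n / len(gaps) for distance, n in counts.items()}


def max_abs_difference(p, q):
    """Stand-in dissimilarity (an assumption, not the paper's measure):
    the largest absolute gap between the two relative frequencies."""
    distances = set(p) | set(q)
    return max((abs(p.get(d, 0.0) - q.get(d, 0.0)) for d in distances),
               default=0.0)


if __name__ == "__main__":
    toy_sequence = "ACGAACGTTTACG"  # hypothetical toy input, not genomic data
    word = "ACG"
    partner = reverse_complement(word)
    p = distance_distribution(toy_sequence, word)
    q = distance_distribution(toy_sequence, partner)
    print(word, partner, p, q, max_abs_difference(p, q))
```

    On a real genome the distributions would be computed per chromosome or over the whole assembly, and a sequence of that size would call for a streaming or indexed search rather than repeated `str.find` calls.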

    Optimality of the genetic code with respect to protein stability and amino acid frequencies

    How robust is the natural genetic code with respect to mistranslation errors? It has long been known that the genetic code is very efficient in limiting the effect of point mutations. A misread codon will commonly code either for the same amino acid or for a similar one in terms of its biochemical properties, so the structure and function of the coded protein remain relatively unaltered. Previous studies have attempted to address this question more quantitatively, namely by statistically estimating the fraction of randomly generated codes that do better than the genetic code regarding its overall robustness. In this paper, we extend these results by investigating the role of amino acid frequencies in the optimality of the genetic code. When measuring the relative fitness of the natural code with respect to a random code, it is indeed natural to assume that a translation error affecting a frequent amino acid is less favorable than that of a rare one, at equal mutation cost. We find that taking the amino acid frequency into account accordingly decreases the fraction of random codes that beat the natural code, making the latter comparatively even more robust. This effect is particularly pronounced when more refined measures of the amino acid substitution cost are used than hydrophobicity. To show this, we devise a new cost function by evaluating with computer experiments the change in folding free energy caused by all possible single-site mutations in a set of known protein structures. With this cost function, we estimate that on the order of one random code out of 100 million is more fit than the natural code when taking amino acid frequencies into account. The genetic code seems therefore structured so as to minimize the consequences of translation errors on the 3D structure and stability of proteins. Comment: 31 pages, 2 figures, postscript fil
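
    The estimate described in the abstract above (what fraction of random codes beats the natural one) can be sketched as follows. Several choices here are assumptions made for illustration and are not taken from the paper: the substitution cost is the squared difference in Kyte-Doolittle hydropathy rather than the folding free energy cost the authors devise, random codes are generated by shuffling the 20 amino acids among the natural code's synonymous codon blocks with stop codons fixed, and the optional `weights` argument (per-amino-acid frequencies) defaults to uniform. All function names are hypothetical.

```python
import random
from itertools import product

BASES = "TCAG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
# Standard genetic code in TCAG order; "*" marks the three stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
NATURAL = dict(zip(CODONS, AMINO))

# Kyte-Doolittle hydropathy index, used here as a simple stand-in cost scale.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
    "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
    "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
    "Y": -1.3, "V": 4.2,
}


def code_cost(code, weights=None):
    """Weighted mean squared hydropathy change over all single-base
    misreadings between sense codons. weights maps amino acid -> frequency;
    None means every amino acid counts equally."""
    total = 0.0
    norm = 0.0
    for codon, aa in code.items():
        if aa == "*":
            continue
        w = 1.0 if weights is None else weights[aa]
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                new_aa = code[mutant]
                if new_aa == "*":
                    continue  # misreadings to stop codons are ignored here
                total += w * (HYDROPATHY[aa] - HYDROPATHY[new_aa]) ** 2
                norm += w
    return total / norm


def random_code(rng):
    """Shuffle the 20 amino acids among the natural synonymous blocks."""
    amino_acids = sorted(set(AMINO) - {"*"})
    shuffled = amino_acids[:]
    rng.shuffle(shuffled)
    mapping = dict(zip(amino_acids, shuffled))
    mapping["*"] = "*"
    return {codon: mapping[aa] for codon, aa in NATURAL.items()}


def fraction_better(n_samples, seed=0, weights=None):
    """Fraction of sampled random codes with lower cost than the natural code."""
    rng = random.Random(seed)
    natural_cost = code_cost(NATURAL, weights)
    better = sum(code_cost(random_code(rng), weights) < natural_cost
                 for _ in range(n_samples))
    return better / n_samples


if __name__ == "__main__":
    print(fraction_better(1000))
```

    Passing measured amino acid frequencies as `weights` would give the frequency-weighted variant the abstract discusses. Resolving fractions as small as one in 100 million would need far more samples than this sketch draws.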