Seeing the Forest for the Trees: Using the Gene Ontology to Restructure Hierarchical Clustering

Author: Dotan-Cohen Dikla
Kasif Simon
Melkman Avraham A.
Publication venue: 'Oxford University Press (OUP)'
Publication date: 03/06/2009

Motivation: There is growing interest in improving the cluster analysis of expression data by incorporating prior knowledge, such as the Gene Ontology (GO) annotations of genes, in order to improve the biological relevance of the clusters that are subjected to subsequent scrutiny. The structure of the GO is another source of background knowledge that can be exploited through the use of semantic similarity. Results: We propose here a novel algorithm that integrates semantic similarities (derived from the ontology structure) into the procedure of deriving clusters from the dendrogram constructed during expression-based hierarchical clustering. Our approach can handle the multiple annotations, from different levels of the GO hierarchy, which most genes have. Moreover, it treats annotated and unannotated genes in a uniform manner. Consequently, the clusters obtained by our algorithm are characterized by significantly enriched annotations. In both cross-validation tests and when using an external index such as protein–protein interactions, our algorithm performs better than previous approaches. When applied to human cancer expression data, our algorithm identifies, among others, clusters of genes related to immune response and glucose metabolism. These clusters are also supported by protein–protein interaction data. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online. Funding: Lynne and William Frankel Center for Computer Science; Paul Ivanier Center for Robotics Research and Production; National Institutes of Health (R01 HG003367-01A1

Boston University Institutional Repository (OpenBU)

PubMed Central

Simplified amino acid alphabets based on deviation of conditional probability from random background

Author: A. Godzik
A.G. Murzin
C.E. Schafmeister
D.S. Riddle
Di Liu
H.S. Chan
J. Wang
Ji Qi
K.W. Plaxco
L.R. Murphy
M. Munson
S. Henikoff
S. Miyazawa
S.E. Brenner
S.F. Altschul
Wei-Mou Zheng
Xin Liu
Publication venue: 'American Physical Society (APS)'
Publication date: 01/01/2002

The primitive data for deducing the Miyazawa-Jernigan contact energy or the BLOSUM score matrix consist of pair frequency counts. Each amino acid corresponds to a conditional probability distribution. Based on the deviation of this conditional probability from the random background, a scheme for reducing the amino acid alphabet is proposed. An evident discrepancy is observed between the reduced alphabets obtained from the raw Miyazawa-Jernigan and BLOSUM residue pair counts. Taking the homologous sequence database SCOP40 as a test set, we detect homology with the resulting coarse-grained substitution matrices. The reduced alphabets are verified to preserve well the information contained in the original 20-letter alphabet. Comment: 9 pages, 3 figures
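The idea of comparing each letter's conditional distribution with the background can be illustrated with a minimal sketch. It is not the authors' scheme: the toy 4-letter alphabet, the pair counts, the log-ratio profile, and the Euclidean distance used to pick the pair of letters to merge are all assumptions made for illustration.

```python
import math
from itertools import combinations

def deviation_profiles(counts, pseudocount=0.0):
    """Return, per letter a, the log-ratio log(p(b|a) / q(b)) for every b.

    counts[a][b] is a pair frequency count. Zero counts need pseudocount > 0,
    otherwise the logarithm is undefined.
    """
    letters = sorted(counts)
    n = {a: {b: counts[a][b] + pseudocount for b in letters} for a in letters}
    total = sum(n[a][b] for a in letters for b in letters)
    row_sum = {a: sum(n[a].values()) for a in letters}
    # Background q(b): overall share of pairs whose second member is b.
    background = {b: sum(n[a][b] for a in letters) / total for b in letters}
    return {
        a: [math.log((n[a][b] / row_sum[a]) / background[b]) for b in letters]
        for a in letters
    }

def closest_pair(profiles):
    """Pair of letters whose deviation profiles are most alike (merge candidates)."""
    return min(
        combinations(sorted(profiles), 2),
        key=lambda pair: math.dist(profiles[pair[0]], profiles[pair[1]]),
    )

# Hypothetical pair counts for a toy alphabet (not real amino acid data).
toy_counts = {
    "A": {"A": 40, "B": 30, "C": 5, "D": 5},
    "B": {"A": 30, "B": 40, "C": 5, "D": 5},
    "C": {"A": 5, "B": 5, "C": 45, "D": 25},
    "D": {"A": 5, "B": 5, "C": 25, "D": 45},
}

profiles = deviation_profiles(toy_counts)
merge_candidates = closest_pair(profiles)
print(merge_candidates)
```

Repeating the merge step on the pooled counts would give a hierarchy of progressively smaller alphabets; the stopping rule is left open here.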

arXiv.org e-Print Archive

Crossref

CERN Document Server

Extracting Biomolecular Interactions Using Semantic Parsing of Biomedical Text

Author: Galstyan Aram
Garg Sahil
Hermjakob Ulf
Marcu Daniel
Publication date: 04/12/2015

We advance the state of the art in biomolecular interaction extraction with three contributions: (i) We show that deep Abstract Meaning Representations (AMR) significantly improve the accuracy of a biomolecular interaction extraction system when compared to a baseline that relies solely on surface- and syntax-based features; (ii) In contrast with previous approaches that infer relations on a sentence-by-sentence basis, we expand our framework to enable consistent predictions over sets of sentences (documents); (iii) We further modify and expand a graph kernel learning framework to enable concurrent exploitation of automatically induced AMR (semantic) and dependency structure (syntactic) representations. Our experiments show that our approach yields interaction extraction systems that are more robust in environments where there is a significant mismatch between training and test conditions. Comment: Appearing in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16)

arXiv.org e-Print Archive

Association for the Advancement of Artificial Intelligence: AAAI Publications

Protein structure validation and refinement using amide proton chemical shifts derived from quantum mechanics

Author: Boomsma Wouter
Borg Mikael
Christensen Anders S.
Hamelryck Thomas
Jensen Jan H.
Lindorff-Larsen Kresten
Linnet Troels E.
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2013

We present the ProCS method for the rapid and accurate prediction of protein backbone amide proton chemical shifts, which are sensitive probes of the geometry of key hydrogen bonds that determine protein structure. ProCS is parameterized against quantum mechanical (QM) calculations and reproduces high-level QM results obtained for a small protein with an RMSD of 0.25 ppm (r = 0.94). ProCS is interfaced with the PHAISTOS protein simulation program and is used to infer statistical protein ensembles that reflect experimentally measured amide proton chemical shift values. Such chemical shift-based structural refinements, starting from high-resolution X-ray structures of Protein G, ubiquitin, and SMN Tudor Domain, result in average chemical shifts, hydrogen bond geometries, and trans-hydrogen bond (h3JNC') spin-spin coupling constants that are in excellent agreement with experiment. We show that the structural sensitivity of the QM-based amide proton chemical shift predictions is needed to refine protein structures to this agreement. The ProCS method thus offers a powerful new tool for refining the structures of hydrogen bonding networks to high accuracy, with many potential applications such as protein flexibility in ligand binding. Comment: PLOS ONE accepted, Nov 201
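The two agreement figures quoted in the abstract, RMSD and the correlation coefficient r, can be computed as in the following minimal sketch. The chemical shift values are hypothetical placeholders, not data from the paper, and the function names are illustrative.

```python
import math

def rmsd(predicted, reference):
    """Root-mean-square deviation between two equally long lists of values."""
    if not predicted or len(predicted) != len(reference):
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - r) ** 2 for p, r in zip(predicted, reference))
    return math.sqrt(squared / len(predicted))

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical amide proton chemical shifts in ppm (illustrative only).
predicted_ppm = [8.1, 7.9, 8.6, 9.0, 7.4]
reference_ppm = [8.3, 7.8, 8.4, 9.3, 7.5]

print(f"RMSD = {rmsd(predicted_ppm, reference_ppm):.2f} ppm")
print(f"r    = {pearson_r(predicted_ppm, reference_ppm):.2f}")
```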

arXiv.org e-Print Archive

Directory of Open Access Journals

Copenhagen University Research Information System

PubMed Central

FigShare

Comparing reverse complementary genomic words based on their distance distributions and frequencies

Author: Afreixo Vera
Bastos Carlos A. C.
Brito Paula
Pinho Armando
Raymaekers Jakob
Rousseeuw Peter
Silva Raquel M.
Tavares Ana Helena
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 06/10/2017

In this work we study reverse complementary genomic word pairs in human DNA by comparing both the distance distribution and the frequency of a word to those of its reverse complement. Several measures of dissimilarity between distance distributions are considered, and it is found that the peak dissimilarity works best in this setting. We report the existence of reverse complementary word pairs with very dissimilar distance distributions, as well as word pairs with very similar distance distributions even when both distributions are irregular and contain strong peaks. The association between distribution dissimilarity and frequency discrepancy is also explored, and it is speculated that symmetric pairs combining low and high values of each measure may uncover features of interest. Taken together, our results suggest that some asymmetries in the human genome go far beyond Chargaff's rules. This study uses both the complete human genome and its repeat-masked version. Comment: Post-print of a paper accepted for publication in "Interdisciplinary Sciences: Computational Life Sciences" (ISSN: 1913-2751, ESSN: 1867-1462
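The comparison described above can be sketched on a toy sequence. The code below builds the distribution of distances between consecutive occurrences of a word and of its reverse complement. The paper's peak dissimilarity is not defined in this abstract, so the total variation distance is used here as a stand-in; the sequence and function names are assumptions for illustration.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(word):
    """Complement each base, then reverse the word."""
    return word.translate(COMPLEMENT)[::-1]

def occurrence_positions(sequence, word):
    """Start positions of all (possibly overlapping) occurrences of word."""
    positions = []
    start = sequence.find(word)
    while start != -1:
        positions.append(start)
        start = sequence.find(word, start + 1)
    return positions

def distance_distribution(sequence, word):
    """Relative frequencies of distances between consecutive occurrences."""
    positions = occurrence_positions(sequence, word)
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    if not gaps:
        return {}
    return {d: c / len(gaps) for d, c in Counter(gaps).items()}

def total_variation(p, q):
    """Stand-in dissimilarity: half the L1 distance between two distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Toy sequence (illustrative only; the study uses the human genome).
sequence = "ACGTTACGAACGT"
word = "ACG"
partner = reverse_complement(word)

dist_word = distance_distribution(sequence, word)
dist_partner = distance_distribution(sequence, partner)
dissimilarity = total_variation(dist_word, dist_partner)
print(partner, dist_word, dist_partner, dissimilarity)
```

On genome-scale input, a streaming scan over chromosomes would replace the repeated `str.find` calls; the sketch favors readability.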

arXiv.org e-Print Archive

Maastricht University Research Portal

Repositório Institucional da Universidade de Aveiro

Optimality of the genetic code with respect to protein stability and amino acid frequencies

Author: Cerf Nicolas
Gilis Dimitri
Massar Serge
Rooman Marianne
Publication date: 01/01/2001

How robust is the natural genetic code with respect to mistranslation errors? It has long been known that the genetic code is very efficient in limiting the effect of point mutations. A misread codon will commonly code either for the same amino acid or for a similar one in terms of its biochemical properties, so the structure and function of the coded protein remain relatively unaltered. Previous studies have attempted to address this question more quantitatively, namely by statistically estimating the fraction of randomly generated codes that do better than the genetic code regarding its overall robustness. In this paper, we extend these results by investigating the role of amino acid frequencies in the optimality of the genetic code. When measuring the relative fitness of the natural code with respect to a random code, it is indeed natural to assume that a translation error affecting a frequent amino acid is less favorable than one affecting a rare amino acid, at equal mutation cost. We find that taking the amino acid frequency into account accordingly decreases the fraction of random codes that beat the natural code, making the latter comparatively even more robust. This effect is particularly pronounced when measures of the amino acid substitution cost more refined than hydrophobicity are used. To show this, we devise a new cost function by evaluating with computer experiments the change in folding free energy caused by all possible single-site mutations in a set of known protein structures. With this cost function, we estimate that on the order of one random code out of 100 million is fitter than the natural code when amino acid frequencies are taken into account. The genetic code therefore seems structured so as to minimize the consequences of translation errors on the 3D structure and stability of proteins. Comment: 31 pages, 2 figures, postscript file
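The frequency-weighted comparison against alternative codes can be illustrated with a deliberately small sketch. Everything in it is an assumption made for illustration: a toy code with two-letter codons and four amino acids, a squared difference of a made-up property value as substitution cost, transitions weighted twice as heavily as transversions, and exhaustive relabeling of amino acids in place of the random sampling of codes used in such studies.

```python
from itertools import permutations, product

BASES = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def code_cost(code, prop, freq):
    """Frequency-weighted cost of all single-base misreadings of a code."""
    total = 0.0
    for codon, aa in code.items():
        for pos, alt in product(range(len(codon)), BASES):
            if alt == codon[pos]:
                continue
            mutant = codon[:pos] + alt + codon[pos + 1:]
            weight = 2.0 if (codon[pos], alt) in TRANSITIONS else 1.0
            # Errors hitting a frequent amino acid count more.
            total += freq[aa] * weight * (prop[aa] - prop[code[mutant]]) ** 2
    return total

def fraction_better(code, prop, freq, tol=1e-12):
    """Share of amino acid relabelings with strictly lower cost than code."""
    amino_acids = sorted(set(code.values()))
    baseline = code_cost(code, prop, freq)
    relabelings = list(permutations(amino_acids))
    better = 0
    for perm in relabelings:
        relabel = dict(zip(amino_acids, perm))
        shuffled = {codon: relabel[aa] for codon, aa in code.items()}
        if code_cost(shuffled, prop, freq) < baseline - tol:
            better += 1
    return better / len(relabelings)

def block_code(assignment):
    """Toy code: the first base of a two-letter codon decides the amino acid."""
    return {a + b: assignment[a] for a, b in product(BASES, repeat=2)}

# Hypothetical property values and frequencies (not measured data).
prop = {"w": 0.0, "y": 0.1, "x": 1.0, "z": 1.1}
freq = {"w": 0.4, "x": 0.3, "y": 0.2, "z": 0.1}

# Similar amino acids (w/y and x/z) are linked by the likelier transitions.
robust_code = block_code({"A": "w", "G": "y", "C": "x", "T": "z"})
# Dissimilar amino acids (w/x and y/z) are linked by transitions.
fragile_code = block_code({"A": "w", "G": "x", "C": "y", "T": "z"})

print(fraction_better(robust_code, prop, freq))
print(fraction_better(fragile_code, prop, freq))
```

With 20 amino acids exhaustive enumeration is infeasible, which is why estimates of this kind rely on random sampling of alternative codes.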

arXiv.org e-Print Archive

PubMed Central

DI-fusion

Clustering Scatter Plots Using Data Depth Measures.

Author: Borneman James
Braun Jonathan
Cui Xinping
Jeske Daniel R
Li Xiaoxiao
Zhang Zhanpan
Publication venue: eScholarship, University of California
Publication date: 01/01/2011

Clustering is rapidly becoming a powerful data mining technique and has been broadly applied to many domains such as bioinformatics and text mining. However, the existing methods can only deal with a data matrix of scalars. In this paper, we introduce a hierarchical clustering procedure that can handle a data matrix of scatter plots. To reflect the nature of the data more accurately, we introduce a dissimilarity statistic based on "data depth" to measure the discrepancy between two bivariate distributions without oversimplifying the nature of the underlying pattern. We then combine hypothesis testing with hierarchical clustering to simultaneously cluster the rows and columns of the data matrix of scatter plots. We also propose novel painting metrics and construct heat maps to allow visualization of the clusters. We demonstrate the utility and power of our new clustering method through simulation studies and application to a microbe-host-interaction study.

eScholarship - University of California