Seeing the Forest for the Trees: Using the Gene Ontology to Restructure Hierarchical Clustering
Motivation: There is a growing interest in improving the cluster analysis of expression data by incorporating into it prior knowledge, such as the Gene Ontology (GO) annotations of genes, in order to improve the biological relevance of the clusters that are subjected to subsequent scrutiny. The structure of the GO is another source of background knowledge that can be exploited through the use of semantic similarity.
Results: We propose here a novel algorithm that integrates semantic similarities (derived from the ontology structure) into the procedure of deriving clusters from the dendrogram constructed during expression-based hierarchical clustering. Our approach can handle the multiple annotations, from different levels of the GO hierarchy, which most genes have. Moreover, it treats annotated and unannotated genes in a uniform manner. Consequently, the clusters obtained by our algorithm are characterized by significantly enriched annotations. In both cross-validation tests and when using an external index such as protein–protein interactions, our algorithm performs better than previous approaches. When applied to human cancer expression data, our algorithm identifies, among others, clusters of genes related to immune response and glucose metabolism. These clusters are also supported by protein–protein interaction data.
Supplementary information: Supplementary data are available at Bioinformatics online.
Lynne and William Frankel Center for Computer Science; Paul Ivanier center for robotics research and production; National Institutes of Health (R01 HG003367-01A1)
Simplified amino acid alphabets based on deviation of conditional probability from random background
The primitive data for deducing the Miyazawa-Jernigan contact energy or
BLOSUM score matrix consists of pair frequency counts. Each amino acid
corresponds to a conditional probability distribution. Based on the deviation
of such conditional probability from random background, a scheme for reduction
of the amino acid alphabet is proposed. An evident discrepancy is observed
between the reduced alphabets obtained from the raw Miyazawa-Jernigan and
BLOSUM residue pair counts. Taking the homologous sequence database SCOP40
as a test set, we detect homology with the obtained coarse-grained
substitution matrices. It is verified that the reduced alphabets preserve
well the information contained in the original 20-letter alphabet.
Comment: 9 pages, 3 figures
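The central quantity in the abstract above, the conditional distribution attached to each amino acid and its deviation from a random background, can be illustrated with a minimal sketch. The abstract does not say how the deviation is measured, so the Kullback-Leibler divergence used here is an assumption, and the three-letter count table is invented toy data rather than Miyazawa-Jernigan or BLOSUM counts. All function names are hypothetical.

```python
import math

def conditional_distributions(pair_counts):
    """Row-normalize a symmetric pair-count table into P(j | i)."""
    cond = {}
    for a, row in pair_counts.items():
        total = sum(row.values())
        cond[a] = {b: c / total for b, c in row.items()}
    return cond

def background(pair_counts):
    """Marginal frequency of each letter, used as the random background."""
    totals = {a: sum(row.values()) for a, row in pair_counts.items()}
    grand = sum(totals.values())
    return {a: t / grand for a, t in totals.items()}

def deviation(cond_row, bg):
    """KL divergence of P(. | i) from the background (assumed measure)."""
    return sum(p * math.log(p / bg[b]) for b, p in cond_row.items() if p > 0)

# Toy, made-up counts for a three-letter alphabet (not real residue data).
toy_counts = {
    "A": {"A": 8, "B": 1, "C": 1},
    "B": {"A": 1, "B": 8, "C": 1},
    "C": {"A": 1, "B": 1, "C": 8},
}
cond = conditional_distributions(toy_counts)
bg = background(toy_counts)
for letter in toy_counts:
    print(letter, round(deviation(cond[letter], bg), 4))
```

Letters whose conditional distributions deviate from the background in similar ways would be candidates for merging into one reduced-alphabet class; the grouping step itself is not specified in the abstract and is left out here.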
Extracting Biomolecular Interactions Using Semantic Parsing of Biomedical Text
We advance the state of the art in biomolecular interaction extraction with
three contributions: (i) We show that deep, Abstract Meaning Representations
(AMR) significantly improve the accuracy of a biomolecular interaction
extraction system when compared to a baseline that relies solely on surface-
and syntax-based features; (ii) In contrast with previous approaches that infer
relations on a sentence-by-sentence basis, we expand our framework to enable
consistent predictions over sets of sentences (documents); (iii) We further
modify and expand a graph kernel learning framework to enable concurrent
exploitation of automatically induced AMR (semantic) and dependency structure
(syntactic) representations. Our experiments show that our approach yields
interaction extraction systems that are more robust in environments where there
is a significant mismatch between training and test conditions.
Comment: Appearing in Proceedings of the Thirtieth AAAI Conference on
Artificial Intelligence (AAAI-16)
Protein structure validation and refinement using amide proton chemical shifts derived from quantum mechanics
We present the ProCS method for the rapid and accurate prediction of protein
backbone amide proton chemical shifts - sensitive probes of the geometry of key
hydrogen bonds that determine protein structure. ProCS is parameterized against
quantum mechanical (QM) calculations and reproduces high level QM results
obtained for a small protein with an RMSD of 0.25 ppm (r = 0.94). ProCS is
interfaced with the PHAISTOS protein simulation program and is used to infer
statistical protein ensembles that reflect experimentally measured amide proton
chemical shift values. Such chemical shift-based structural refinements,
starting from high-resolution X-ray structures of Protein G, ubiquitin, and SMN
Tudor Domain, result in average chemical shifts, hydrogen bond geometries, and
trans-hydrogen bond (h3JNC') spin-spin coupling constants that are in excellent
agreement with experiment. We show that the structural sensitivity of the
QM-based amide proton chemical shift predictions is needed to refine protein
structures to this agreement. The ProCS method thus offers a powerful new tool
for refining the structures of hydrogen bonding networks to high accuracy with
many potential applications such as protein flexibility in ligand binding.
Comment: PLOS ONE accepted, Nov 201
Comparing reverse complementary genomic words based on their distance distributions and frequencies
In this work we study reverse complementary genomic word pairs in the human
DNA, by comparing both the distance distribution and the frequency of a word to
those of its reverse complement. Several measures of dissimilarity between
distance distributions are considered, and it is found that the peak
dissimilarity works best in this setting. We report the existence of reverse
complementary word pairs with very dissimilar distance distributions, as well
as word pairs with very similar distance distributions even when both
distributions are irregular and contain strong peaks. The association between
distribution dissimilarity and frequency discrepancy is also explored, and it
is speculated that symmetric pairs combining low and high values of each
measure may uncover features of interest. Taken together, our results suggest
that some asymmetries in the human genome go far beyond Chargaff's rules. This
study uses both the complete human genome and its repeat-masked version.
Comment: Post-print of a paper accepted for publication in "Interdisciplinary
Sciences: Computational Life Sciences" (ISSN: 1913-2751, ESSN: 1867-1462)
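As a rough illustration of the comparison described in the abstract above, the sketch below computes the distribution of distances between consecutive occurrences of a word and of its reverse complement. The abstract does not define its "peak dissimilarity", so the total variation distance used here is a stand-in assumption, and the short sequence is an invented example, not genomic data. All function names are hypothetical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(word):
    """Complement each base, then reverse the word."""
    return word.translate(_COMPLEMENT)[::-1]

def occurrences(seq, word):
    """Start positions of all occurrences, overlapping ones included."""
    positions = []
    i = seq.find(word)
    while i != -1:
        positions.append(i)
        i = seq.find(word, i + 1)
    return positions

def distance_distribution(seq, word):
    """Relative frequency of gaps between consecutive occurrences."""
    pos = occurrences(seq, word)
    gaps = Counter(b - a for a, b in zip(pos, pos[1:]))
    total = sum(gaps.values())
    return {d: n / total for d, n in gaps.items()} if total else {}

def total_variation(p, q):
    """Total variation distance (an assumed stand-in dissimilarity)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Invented toy sequence, for illustration only.
seq = "ACGACGTACGCGTTCGTACGT"
word = "ACG"
p = distance_distribution(seq, word)
q = distance_distribution(seq, reverse_complement(word))
print(p, q, total_variation(p, q))
```

A frequency comparison between a word and its reverse complement follows from the same helper, for example by comparing `len(occurrences(seq, word))` with the count for `reverse_complement(word)`.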
Optimality of the genetic code with respect to protein stability and amino acid frequencies
How robust is the natural genetic code with respect to mistranslation errors?
It has long been known that the genetic code is very efficient in limiting the
effect of point mutations. A misread codon will commonly code either for the
same amino acid or for a similar one in terms of its biochemical properties, so
the structure and function of the coded protein remain relatively unaltered.
Previous studies have attempted to address this question more quantitatively,
namely by statistically estimating the fraction of randomly generated codes
that do better than the genetic code regarding its overall robustness. In this
paper, we extend these results by investigating the role of amino acid
frequencies in the optimality of the genetic code. When measuring the relative
fitness of the natural code with respect to a random code, it is indeed natural
to assume that a translation error affecting a frequent amino acid is less
favorable than that of a rare one, at equal mutation cost. We find that taking
the amino acid frequency into account accordingly decreases the fraction of
random codes that beat the natural code, making the latter comparatively even
more robust. This effect is particularly pronounced when more refined measures
of the amino acid substitution cost are used than hydrophobicity. To show this,
we devise a new cost function by evaluating with computer experiments the
change in folding free energy caused by all possible single-site mutations in a
set of known protein structures. With this cost function, we estimate that on
the order of one random code in 100 million is fitter than the natural
code when taking amino acid frequencies into account. The genetic code
therefore seems structured so as to minimize the consequences of translation
errors on the 3D structure and stability of proteins.
Comment: 31 pages, 2 figures, postscript file
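The kind of experiment summarized in the abstract above can be sketched as follows: score a code by the frequency-weighted squared change in an amino acid property over all single-base misreadings, then count how many random codes score lower than the standard code. Several choices here are assumptions, not the authors' method: the Kyte-Doolittle hydropathy scale stands in for the cost function (the paper's folding free energy cost is not reproduced), random codes are made by permuting amino acids among the synonymous codon blocks, each amino acid's frequency is split evenly over its codons, and the uniform frequencies in the demo are placeholders rather than measured values. All function names are hypothetical.

```python
import random
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; "*" marks stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
STANDARD = dict(zip(CODONS, AMINO))

# Kyte-Doolittle hydropathy values (assumed stand-in cost property).
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}

def code_cost(code, prop, freq):
    """Frequency-weighted mean squared property change per misreading."""
    n_codons = Counter(code.values())
    total = weight = 0.0
    for codon, aa in code.items():
        if aa == "*":
            continue
        w = freq[aa] / n_codons[aa]  # split frequency over synonymous codons
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = code[codon[:pos] + base + codon[pos + 1:]]
                if mutant == "*":
                    continue  # misreadings to stop codons are ignored here
                total += w * (prop[aa] - prop[mutant]) ** 2
                weight += w
    return total / weight

def random_code(rng):
    """Permute amino acids among codon blocks; stop codons stay fixed."""
    aas = sorted(set(AMINO) - {"*"})
    shuffled = aas[:]
    rng.shuffle(shuffled)
    mapping = dict(zip(aas, shuffled))
    mapping["*"] = "*"
    return {codon: mapping[aa] for codon, aa in STANDARD.items()}

def fraction_better(n, prop, freq, seed=0):
    """Fraction of n random codes with lower cost than the standard code."""
    rng = random.Random(seed)
    natural = code_cost(STANDARD, prop, freq)
    better = sum(code_cost(random_code(rng), prop, freq) < natural
                 for _ in range(n))
    return better / n

# Placeholder uniform frequencies; real amino acid frequencies would go here.
uniform = {aa: 1 / 20 for aa in KD}
print(fraction_better(200, KD, uniform))
```

A sample of a few hundred random codes cannot resolve fractions anywhere near the one in 100 million reported in the abstract; the sketch only shows the shape of the comparison.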
Clustering Scatter Plots Using Data Depth Measures.
Clustering is rapidly becoming a powerful data mining technique, and has been broadly applied to many domains such as bioinformatics and text mining. However, the existing methods can only deal with a data matrix of scalars. In this paper, we introduce a hierarchical clustering procedure that can handle a data matrix of scatter plots. To more accurately reflect the nature of data, we introduce a dissimilarity statistic based on "data depth" to measure the discrepancy between two bivariate distributions without oversimplifying the nature of the underlying pattern. We then combine hypothesis testing with hierarchical clustering to simultaneously cluster the rows and columns of the data matrix of scatter plots. We also propose novel painting metrics and construct heat maps to allow visualization of the clusters. We demonstrate the utility and power of our new clustering method through simulation studies and application to a microbe-host-interaction study