
    Entropy-scaling search of massive biological data

    Many datasets exhibit a well-defined structure that can be exploited to design faster search tools, but it is not always clear when such acceleration is possible. Here, we introduce a framework for similarity search based on characterizing a dataset's entropy and fractal dimension. We prove that searching scales in time with the metric entropy (number of covering hyperspheres) if the fractal dimension of the dataset is low, and scales in space with the sum of the metric entropy and the information-theoretic entropy (randomness of the data). Using these ideas, we present accelerated versions of standard tools, with no loss in specificity and little loss in sensitivity, for use in three domains: high-throughput drug screening (Ammolite, 150x speedup), metagenomics (MICA, 3.5x speedup of DIAMOND [3,700x BLASTX]), and protein structure search (esFragBag, 10x speedup of FragBag). Our framework can be used to achieve "compressive omics," and the general theory can be readily applied to data science problems outside of biology. Comment: including supplement: 41 pages, 6 figures, 4 tables, 1 box.
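    The paper's tools operate on chemical graphs, reads, and protein structures, but the core idea transfers to any metric space. Below is a minimal sketch of the two-level scheme in plain Euclidean space: cluster the data into covering balls, then answer a query by scanning only clusters whose centers pass a triangle-inequality test. All names and parameters here are illustrative, not the paper's implementation.

```python
import math
import random

def euclid(a, b):
    return math.dist(a, b)

def build_cover(points, radius):
    """Greedy covering: each point joins the first cluster whose
    center is within `radius`, else it founds a new cluster."""
    clusters = []  # list of (center, members)
    for p in points:
        for center, members in clusters:
            if euclid(p, center) <= radius:
                members.append(p)
                break
        else:
            clusters.append((p, [p]))
    return clusters

def coarse_fine_search(clusters, query, threshold, radius):
    """By the triangle inequality, a cluster can contain a hit only
    if its center lies within threshold + radius of the query, so
    most clusters are skipped without touching their members."""
    hits = []
    for center, members in clusters:
        if euclid(query, center) <= threshold + radius:
            hits.extend(p for p in members if euclid(query, p) <= threshold)
    return hits

# Demo: results match a brute-force scan on random 2-D points.
random.seed(1)
points = [(random.random(), random.random()) for _ in range(300)]
clusters = build_cover(points, radius=0.1)
hits = coarse_fine_search(clusters, (0.5, 0.5), threshold=0.15, radius=0.1)
```

    The search time is governed by the number of clusters (the metric entropy of the cover) rather than the number of points, which is the paper's scaling argument in miniature.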

    Maximum Common Subgraph Isomorphism Algorithms

    Maximum common subgraph (MCS) isomorphism algorithms play an important role in chemoinformatics by providing an effective mechanism for the alignment of pairs of chemical structures. This article discusses the various types of MCS that can be identified when two graphs are compared and reviews some of the algorithms that are available for this purpose, focusing on those that are, or may be, applicable to the matching of chemical graphs.
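    One classic approach the review covers is reducing MCS detection to maximum-clique finding in the modular product of the two graphs. The toy sketch below uses exhaustive clique search, which is feasible only for very small graphs; the graphs and labels are hypothetical, not from the article.

```python
from itertools import combinations

def modular_product(g1, g2):
    """Vertices are pairs (u, v); two pairs are adjacent when they
    agree on adjacency vs non-adjacency in both input graphs."""
    nodes = [(u, v) for u in g1 for v in g2]
    adj = {n: set() for n in nodes}
    for (u1, v1), (u2, v2) in combinations(nodes, 2):
        if u1 == u2 or v1 == v2:
            continue  # the vertex mapping must be injective
        if (u2 in g1[u1]) == (v2 in g2[v1]):
            adj[(u1, v1)].add((u2, v2))
            adj[(u2, v2)].add((u1, v1))
    return adj

def max_clique(adj):
    """Brute-force maximum clique; a maximum clique in the modular
    product corresponds to a maximum common induced subgraph."""
    nodes = list(adj)
    for r in range(len(nodes), 0, -1):
        for subset in combinations(nodes, r):
            if all(b in adj[a] for a, b in combinations(subset, 2)):
                return list(subset)
    return []

# Toy graphs as adjacency dicts: a triangle with a pendant vertex
# versus a plain triangle; the MCS is the shared triangle.
g1 = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
g2 = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
mapping = max_clique(modular_product(g1, g2))
```

    Production MCS codes replace the exhaustive clique search with bounded backtracking (e.g. Bron-Kerbosch variants), but the reduction itself is the same.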

    FunTree: a resource for exploring the functional evolution of structurally defined enzyme superfamilies

    FunTree is a new resource that brings together sequence, structure, phylogenetic, chemical and mechanistic information for structurally defined enzyme superfamilies. Gathering together this range of data into a single resource allows the investigation of how novel enzyme functions have evolved within a structurally defined superfamily, as well as providing a means to analyse trends across many superfamilies. This is done not only within the context of an enzyme's sequence and structure but also the relationships of their reactions. Developed in tandem with the CATH database, it currently comprises 276 superfamilies covering ~1800 (70%) of sequence-assigned enzyme reactions. Central to the resource are phylogenetic trees generated from structurally informed multiple sequence alignments, using both domain structural alignments supplemented with domain sequences and whole-sequence alignments based on commonality of multi-domain architectures. These trees are decorated with functional annotations such as metabolite similarity, as well as annotations from manually curated resources such as the Catalytic Site Atlas and MACiE for enzyme mechanisms. The resource is freely available through a web interface: www.ebi.ac.uk/thornton-srv/databases/FunTree.

    EC-BLAST: a tool to automatically search and compare enzyme reactions.

    We present EC-BLAST (http://www.ebi.ac.uk/thornton-srv/software/rbl/), an algorithm and Web tool for quantitative similarity searches between enzyme reactions at three levels: bond change, reaction center and reaction structure similarity. It uses bond changes and reaction patterns for all known biochemical reactions derived from atom-atom mapping across each reaction. EC-BLAST has the potential to improve enzyme classification, identify previously uncharacterized or new biochemical transformations, improve the assignment of enzyme function to sequences, and assist in enzyme engineering.
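    The abstract does not give EC-BLAST's scoring formula, but the bond-change level of comparison can be sketched as a Tanimoto-style coefficient over multisets of bond edits derived from atom-atom mapping. The bond-change labels below are made up for illustration.

```python
from collections import Counter

def bond_change_similarity(changes1, changes2):
    """Tanimoto-style similarity over multisets of bond changes
    (e.g. "C-H broken"); 1.0 means identical sets of edits."""
    a, b = Counter(changes1), Counter(changes2)
    inter = sum((a & b).values())   # multiset intersection size
    union = sum((a | b).values())   # multiset union size
    return inter / union if union else 1.0

# Hypothetical bond-change lists for two reactions.
r1 = ["C-H broken", "C=O formed", "O-H formed"]
r2 = ["C-H broken", "C=O formed", "N-H broken"]
sim = bond_change_similarity(r1, r2)
```

    Reaction-center and reaction-structure similarity would use richer descriptors, but the same set-overlap scoring pattern applies.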

    A practical Java tool for small-molecule compound appraisal

    The increased use of small-molecule compound screening by new users from a variety of different academic backgrounds calls for adequate software to administer, appraise, analyse and exchange information obtained from screening experiments. While software and spreadsheet solutions exist, there is a need for software that can be easily deployed and is convenient to use. The Java application cApp addresses this need and aids in the handling and storage of information on small-molecule compounds. The software is intended for the appraisal of compounds with respect to their physico-chemical properties, analysis in relation to adherence to likeness rules as well as recognition of pan-assay interference components, and cross-linking with identical entries in the PubChem Compound Database. Results are displayed in tabular form in a graphical interface, but can also be written in an HTML or PDF format. The output of data in ASCII format allows for further processing using other suitable programs. Other features include similarity searches against user-provided compound libraries and the PubChem Compound Database, as well as compound clustering based on a MaxMin algorithm. cApp is a personal database solution for small-molecule compounds which can handle all major chemical formats. Being a standalone application, it has no dependency other than the Java virtual machine and is thus conveniently deployed. It streamlines the analysis of molecules with respect to physico-chemical properties and drug discovery criteria. cApp is distributed under the GNU Affero General Public License version 3 and available from http://www.structuralchemistry.org/pcsb/. To download cApp, users will be asked for their name, institution and email address. A detailed manual can also be downloaded from this site, and online tutorials are available at http://www.structuralchemistry.org/pcsb/capp.php.
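    The MaxMin algorithm mentioned above is a standard diversity-selection scheme in chemoinformatics: repeatedly pick the compound whose distance to its nearest already-picked compound is largest. A minimal sketch, with 1-D descriptor values standing in for real molecular fingerprints:

```python
def maxmin_pick(items, dist, k):
    """MaxMin diversity selection: greedily add the item that
    maximises the minimum distance to the items picked so far."""
    picked = [items[0]]  # arbitrary seed compound
    rest = list(items[1:])
    while len(picked) < k and rest:
        best = max(rest, key=lambda x: min(dist(x, p) for p in picked))
        picked.append(best)
        rest.remove(best)
    return picked

# Toy "compounds": scalar descriptor values with absolute difference
# as the (hypothetical) distance measure.
values = [0.0, 1.0, 2.0, 10.0, 11.0, 20.0]
diverse = maxmin_pick(values, lambda a, b: abs(a - b), k=3)
```

    In a real setting `dist` would be, for example, 1 minus the Tanimoto similarity of two fingerprints; the greedy loop is unchanged.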

    Computational biology in the 21st century

    Computational biologists answer biological and biomedical questions by using computation in support of—or in place of—laboratory procedures, hoping to obtain more accurate answers at a greatly reduced cost. The past two decades have seen unprecedented technological progress with regard to generating biological data; next-generation sequencing, mass spectrometry, microarrays, cryo-electron microscopy, and other high-throughput approaches have led to an explosion of data. However, this explosion is a mixed blessing. On the one hand, the scale and scope of data should allow new insights into genetic and infectious diseases, cancer, basic biology, and even human migration patterns. On the other hand, researchers are generating datasets so massive that it has become difficult to analyze them to discover patterns that give clues to the underlying biological processes. National Institutes of Health (U.S.) (grant GM108348); Hertz Foundation.

    The Catalytic Site Atlas 2.0: cataloging catalytic sites and residues identified in enzymes.

    Understanding which residues in an enzyme are catalytic and what function they perform is crucial to many biological studies, particularly those leading to new therapeutics and enzyme design. The original version of the Catalytic Site Atlas (CSA) (http://www.ebi.ac.uk/thornton-srv/databases/CSA), published in 2004, which catalogs the residues involved in enzyme catalysis in experimentally determined protein structures, had only 177 curated entries and employed a simplistic approach to expanding these annotations to homologous enzyme structures. Here we present a new version of the CSA (CSA 2.0), which greatly expands the number of both curated (968) and automatically annotated catalytic sites in enzyme structures, utilizing a new method for annotation transfer. The curated entries are used, along with the variation in residue type from the sequence comparison, to generate 3D templates of the catalytic sites, which in turn can be used to find catalytic sites in new structures. To ease the transfer of CSA annotations to other resources, a new ontology has been developed: the Enzyme Mechanism Ontology, which has permitted the transfer of annotations to the Mechanism, Annotation and Classification in Enzymes (MACiE) and UniProt Knowledge Base (UniProtKB) resources. The CSA database schema has been re-designed, and both the CSA data and search capabilities are presented in a new, modern web interface.

    A Realistic Model under which the Genetic Code is Optimal

    The genetic code has a high level of error robustness. Using values of hydrophobicity scales as a proxy for amino acid character, and the Mean Square measure as a function quantifying error robustness, a value can be obtained for a genetic code which reflects the error robustness of that code. By comparing this value with a distribution of values belonging to codes generated by random permutations of amino acid assignments, the level of error robustness of a genetic code can be quantified. We present a calculation in which the standard genetic code is shown to be optimal. We obtain this result by (1) using recently updated values of polar requirement as input; (2) fixing seven assignments (Ile, Trp, His, Phe, Tyr, Arg, and Leu) based on aptamer considerations; and (3) using known biosynthetic relations of the 20 amino acids. This last point is reflected in an approach of subdivision (restricting the random reallocation of assignments to amino acid subgroups, the set of 20 being divided into four such subgroups). The three approaches to explaining the robustness of the code (specific selection for robustness, amino acid-RNA interactions leading to assignments, or a slow growth process of assignment patterns) are reexamined in light of our findings. We offer a comprehensive hypothesis, stressing the importance of biosynthetic relations, with the code evolving from an early stage with just glycine and alanine, via intermediate stages, towards 64 codons carrying today's meaning. Comment: 22 pages, 3 figures, 4 tables. Journal of Molecular Evolution, July 201
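    The Mean Square measure can be illustrated on a toy code: average the squared difference of an amino-acid property over all codon pairs related by a single point mutation, then compare against randomly permuted assignments. The two-letter codons and property values below are made up; the paper uses the real 64-codon table and polar-requirement values.

```python
import random
from itertools import product

BASES = "ACGU"

def neighbours(codon):
    """All codons reachable by a single point mutation."""
    for i, base in enumerate(codon):
        for nb in BASES:
            if nb != base:
                yield codon[:i] + nb + codon[i + 1:]

def mean_square(code, value):
    """MS = average squared property difference over all
    single-mutation codon pairs; lower means more error robust."""
    diffs = [
        (value[code[c]] - value[code[n]]) ** 2
        for c in code for n in neighbours(c) if n in code
    ]
    return sum(diffs) / len(diffs)

# Toy block-structured code: the "amino acid" is named by the first
# base, so second-position mutations are silent (as in real codon
# families); property values stand in for polar requirement.
code = {a + b: a for a, b in product(BASES, repeat=2)}
value = {"A": 0.0, "C": 1.0, "G": 2.0, "U": 3.0}
ms = mean_square(code, value)

# One random reallocation of the same assignments, for comparison;
# the paper compares against a whole distribution of such codes.
random.seed(0)
shuffled = dict(zip(code, random.sample(list(code.values()), len(code))))
ms_shuffled = mean_square(shuffled, value)
```

    For this block code the silent second position drives MS down to 5/3; ranking the real code's MS within the permutation distribution is what quantifies its optimality.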

    The utility of geometrical and chemical restraint information extracted from predicted ligand-binding sites in protein structure refinement

    Exhaustive exploration of molecular interactions at the level of complete proteomes requires efficient and reliable computational approaches to protein function inference. Ligand docking and ranking techniques show considerable promise in their ability to quantify the interactions between proteins and small molecules. Despite the advances in the development of docking approaches and scoring functions, the genome-wide application of many ligand docking/screening algorithms is limited by the quality of the binding sites in theoretical receptor models constructed by protein structure prediction. In this study, we describe a new template-based method for the local refinement of ligand-binding regions in protein models using remotely related templates identified by threading. We designed a Support Vector Regression (SVR) model that selects correct binding site geometries in a large ensemble of multiple receptor conformations. The SVR model employs several scoring functions that impose geometrical restraints on the Cα positions, account for the specific chemical environment within a binding site and optimize the interactions with putative ligands. The SVR score is well correlated with the RMSD from the native structure; in 47% (70%) of the cases, the Pearson's correlation coefficient is >0.5 (>0.3). When applied to weakly homologous models, the average heavy-atom, local RMSD from the native structure of the top-ranked (best of top five) binding site geometries is 3.1 Å (2.9 Å) for roughly half of the targets; this represents a 0.1 Å (0.3 Å) average improvement over the original predicted structure. Focusing on the subset of strongly conserved residues, the average heavy-atom RMSD is 2.6 Å (2.3 Å). Furthermore, we estimate the upper bound of template-based binding site refinement using only weakly related proteins to be ~2.6 Å RMSD. This value also corresponds to the plasticity of the ligand-binding regions in distant homologues.
    The Binding Site Refinement (BSR) approach is available to the scientific community as a web server that can be accessed at http://cssb.biology.gatech.edu/bsr/. © 2010 Elsevier Inc.
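    The selection step can be pictured as scoring each conformation in the ensemble and ranking so the top-ranked geometry approximates the lowest RMSD. The sketch below replaces the trained SVR with a fixed linear combination of restraint scores; the score names, weights, and models are all hypothetical.

```python
def rank_conformations(ensemble, weights):
    """Combine per-conformation restraint scores linearly (a crude
    stand-in for the paper's trained SVR) and sort: lower combined
    score means predicted closer to the native site geometry."""
    def combined(conf):
        return sum(weights[name] * conf[name] for name in weights)
    return sorted(ensemble, key=combined)

# Hypothetical ensemble: each model carries three restraint scores
# (Calpha geometry, chemical environment, fit to putative ligands).
ensemble = [
    {"id": "m1", "ca_restraint": 0.9, "chem_env": 0.7, "ligand_fit": 0.8},
    {"id": "m2", "ca_restraint": 0.2, "chem_env": 0.3, "ligand_fit": 0.4},
    {"id": "m3", "ca_restraint": 0.5, "chem_env": 0.6, "ligand_fit": 0.2},
]
weights = {"ca_restraint": 1.0, "chem_env": 0.5, "ligand_fit": 0.8}
ranked = rank_conformations(ensemble, weights)
```

    The "best of top five" numbers in the abstract correspond to taking the first few entries of such a ranking rather than only the single top model.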

    Finding Characteristic Substructures for Metabolite Classes

    We introduce a method for finding a characteristic substructure for a set of molecular structures. Different from common approaches, such as computing the maximum common subgraph, the resulting substructure does not have to be contained in its exact form in all input molecules. Our approach is part of the identification pipeline for unknown metabolites using fragmentation trees. Searching databases using fragmentation tree alignment results in hit lists containing compounds with large structural similarity to the unknown metabolite. The characteristic substructure of the molecules in the hit list may be a key structural element of the unknown compound and might be used as a starting point for structure elucidation. We evaluate our method on different data sets and find that it retrieves essential substructures if the input lists are not too heterogeneous. We apply our method to predict structural elements for five unknown samples from Icelandic poppy.
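    The key contrast with maximum common subgraph is the relaxed containment requirement: a characteristic fragment only needs to appear in most of the hit list, not all of it. A minimal frequency-based sketch, with molecules abstracted to sets of fragment labels (the labels and threshold are illustrative, not the paper's algorithm):

```python
from collections import Counter

def characteristic_fragments(molecules, min_frac=0.6):
    """Fragments present in at least min_frac of the molecules;
    unlike a maximum common subgraph, a fragment need not occur
    in every input structure."""
    counts = Counter(f for frags in molecules for f in set(frags))
    n = len(molecules)
    return {f for f, c in counts.items() if c / n >= min_frac}

# Hypothetical hit list described by fragment labels.
hits = [
    {"benzene", "hydroxyl", "amide"},
    {"benzene", "hydroxyl"},
    {"benzene", "ketone"},
]
core = characteristic_fragments(hits, min_frac=0.6)
```

    Heterogeneous hit lists dilute every fragment's frequency below the threshold, which mirrors the paper's observation that retrieval degrades when the input lists are too heterogeneous.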