
    Exploring the epistemological and practical experiences of Arts practitioners engaging with an Educational Development programme

    The aim of the project was to develop an understanding of the epistemological and practical experiences of Arts practitioners when engaging with an Educational Development (ED) programme. Specifically, the aims were threefold: to explore the experiences of Arts practitioners engaging with an ED programme; to assess links between pedagogies and practices in ED and Arts disciplines; and to provide recommendations for further development of ED programmes in ED and Arts disciplines. Background/context to project: The rationale for this research derived from the recognition that ED programmes are tasked with developing academics from all disciplines, which poses challenges for in-house expertise given the diversity of disciplines and associated pedagogies in Higher Education. More broadly, there was recognition of the value of the Arts as a vehicle for formal, informal and interdisciplinary learning (Seagraves et al. 2008), which is increasingly important in the modern university. Thus, clear and mutually beneficial links between ED and the Arts were considered to have the potential to promote and nurture pedagogic possibilities that are presently underexplored. This project therefore sought to investigate the epistemological and practical experiences that Arts practitioners have when engaging with an ED programme, in the hope of developing links and understanding about pedagogies and practices in both ED and the Arts.

    Protein lipograms

    Linguistic analysis of protein sequences is an underexploited technique. Here, we capitalize on the concept of the lipogram to characterize sequences at the proteome level. A lipogram is a literary composition which omits one or more letters. A protein lipogram likewise omits one or more types of amino acid. In this article, we establish a usable terminology for the decomposition of a sequence collection in terms of the lipogram. Next, we characterize Uniref50 using a lipogram decomposition. At the global level, protein lipograms exhibit power-law properties. A clear correlation with metabolic cost is seen. Finally, we use the lipogram construction to assign proteomes to the four branches of the tree-of-life: archaea, bacteria, eukaryotes and viruses. We conclude from this pilot study that the lipogram demonstrates considerable potential as an additional tool for sequence analysis and proteome classification.
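    The core construction is simple enough to sketch in a few lines: a protein's lipogram class is the set of amino-acid types it omits, and a collection decomposes into groups sharing a class. The sequences and groupings below are illustrative toys, not drawn from Uniref50.

```python
# Toy sketch of a lipogram decomposition: group sequences by which of the
# 20 standard amino-acid types they omit. Sequences here are invented.
from collections import defaultdict

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard residues

def lipogram_class(seq):
    """The frozenset of amino-acid types absent from a sequence."""
    return frozenset(AMINO_ACIDS - set(seq))

def decompose(sequences):
    """Group a sequence collection by lipogram class."""
    groups = defaultdict(list)
    for name, seq in sequences.items():
        groups[lipogram_class(seq)].append(name)
    return groups

toy_proteome = {
    "p1": "MKVLAAGG",              # omits W, C, and many others
    "p2": "ACDEFGHIKLMNPQRSTVWY",  # uses all 20: empty lipogram class
    "p3": "MKVLAAGG",
}

groups = decompose(toy_proteome)
for cls, members in sorted(groups.items(), key=lambda kv: kv[1]):
    print(len(cls), members)
```

    A real analysis would run this over millions of Uniref50 entries and study the frequency distribution of the resulting classes.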

    Automating Genomic Data Mining via a Sequence-based Matrix Format and Associative Rule Set

    There is an enormous amount of information encoded in each genome – enough to create living, responsive and adaptive organisms. Raw sequence data alone is not enough to understand function, mechanisms or interactions. Changes in a single base pair can lead to disease, such as sickle-cell anemia, while some large megabase deletions have no apparent phenotypic effect. Genomic features are varied in their data types, and annotation of these features is spread across multiple databases. Herein, we develop a method to automate exploration of genomes by iteratively exploring sequence data for correlations and building upon them. First, to integrate and compare different annotation sources, a sequence matrix (SM) is developed to contain position-dependent information. Second, a classification tree is developed for matrix row types, specifying how each data type is to be treated with respect to other data types for analysis purposes. Third, correlative analyses are developed to analyze features of each matrix row in terms of the other rows, guided by the classification tree as to which analyses are appropriate. A prototype was developed and was successful in detecting coinciding genomic features among genes, exons, repetitive elements and CpG islands.
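    The sequence-matrix idea can be sketched minimally: one row per annotation track, one column per base position, with overlap queries as a stand-in for the correlative analyses. Track names and intervals below are hypothetical, not from any real genome.

```python
# Minimal sketch of a position-dependent "sequence matrix": each annotation
# track becomes a 0/1 row over base positions, and coinciding features fall
# out of simple row-vs-row comparisons. All intervals are invented.

LENGTH = 20

def track_from_intervals(intervals, length=LENGTH):
    """Build a 0/1 row marking the positions an annotation covers."""
    row = [0] * length
    for start, end in intervals:  # half-open [start, end)
        for i in range(start, end):
            row[i] = 1
    return row

matrix = {
    "gene":       track_from_intervals([(2, 12)]),
    "exon":       track_from_intervals([(2, 6), (9, 12)]),
    "cpg_island": track_from_intervals([(4, 8)]),
}

def coinciding(matrix, a, b):
    """Positions where two annotation rows overlap."""
    return [i for i, (x, y) in enumerate(zip(matrix[a], matrix[b])) if x and y]

print(coinciding(matrix, "exon", "cpg_island"))  # -> [4, 5]
```

    The paper's classification tree would sit on top of such a matrix, deciding which row pairs it is meaningful to compare.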

    Stringency of the 2-His–1-Asp Active-Site Motif in Prolyl 4-Hydroxylase

    The non-heme iron(II) dioxygenase family of enzymes contains a common 2-His–1-carboxylate iron-binding motif. These enzymes catalyze a wide variety of oxidative reactions, such as the hydroxylation of aliphatic C–H bonds. Prolyl 4-hydroxylase (P4H) is an α-ketoglutarate-dependent iron(II) dioxygenase that catalyzes the post-translational hydroxylation of proline residues in protocollagen strands, stabilizing the ensuing triple helix. Human P4H residues His412, Asp414, and His483 have been identified as an iron-coordinating 2-His–1-carboxylate motif. Enzymes that catalyze oxidative halogenation do so by a mechanism similar to that of P4H. These halogenases retain the active-site histidine residues, but the carboxylate ligand is replaced with a halide ion. We replaced Asp414 of P4H with alanine (to mimic the active site of a halogenase) and with glycine. These substitutions do not, however, convert P4H into a halogenase. Moreover, the hydroxylase activity of D414A P4H cannot be rescued with small molecules. In addition, rearranging the two His and one Asp residues in the active site eliminates hydroxylase activity. Our results demonstrate a high stringency for the iron-binding residues in the P4H active site. We conclude that P4H, which catalyzes an especially demanding chemical transformation, is recalcitrant to change.

    Fast Mapping of Short Sequences with Mismatches, Insertions and Deletions Using Index Structures

    With few exceptions, current methods for short read mapping make use of simple seed heuristics to speed up the search. Most of the underlying matching models neglect the necessity to allow not only mismatches, but also insertions and deletions. Current evaluations indicate, however, that very different error models apply to the novel high-throughput sequencing methods. While the most frequent error type in Illumina reads is the mismatch, reads produced by 454's GS FLX predominantly contain insertions and deletions (indels). Even though 454 sequencers are able to produce longer reads, the method is frequently applied to small RNA (miRNA and siRNA) sequencing. Fast and accurate matching, in particular of short reads with diverse errors, is therefore a pressing practical problem. We introduce a matching model for short reads that can, besides mismatches, also cope with indels. It addresses different error models. For example, it can handle the problem of leading and trailing contaminations caused by primers and poly-A tails in transcriptomics, or the length-dependent increase of error rates. In these contexts, it thus simplifies the tedious and error-prone trimming step. For efficient searches, our method utilizes index structures in the form of enhanced suffix arrays. In a comparison with current methods for short read mapping, the presented approach shows significantly increased performance not only for 454 reads, but also for Illumina reads. Our approach is implemented in the software segemehl, available at http://www.bioinf.uni-leipzig.de/Software/segemehl/.
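    The core idea of index-based matching can be shown in miniature. segemehl uses enhanced suffix arrays that also tolerate mismatches and indels; the sketch below shows only exact lookup of a short read against a plain suffix array, the data structure underlying such indexes. The reference string is invented.

```python
# Exact short-read lookup via a suffix array: binary search locates the
# range of suffixes that begin with the read. Real mappers build the array
# in O(n) and handle errors; this toy builds it naively for clarity.
import bisect

def build_suffix_array(text):
    """All suffix start positions, sorted lexicographically by suffix."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, read):
    """Binary-search the suffix array for every exact occurrence of `read`."""
    suffixes = [text[i:] for i in sa]  # fine for toy data; real tools avoid this
    lo = bisect.bisect_left(suffixes, read)
    hi = bisect.bisect_right(suffixes, read + "\xff")
    return sorted(sa[lo:hi])

reference = "ACGTACGTTACG"
sa = build_suffix_array(reference)
print(find_occurrences(reference, sa, "ACG"))  # -> [0, 4, 9]
```

    Allowing k mismatches or indels turns each binary-search step into a bounded branching search over the same index, which is where the enhanced suffix array's extra tables pay off.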

    Modeling Structure-Function Relationships in Synthetic DNA Sequences using Attribute Grammars

    Recognizing that certain biological functions can be associated with specific DNA sequences has led various fields of biology to adopt the notion of the genetic part. This concept provides a finer level of granularity than the traditional notion of the gene. However, a method of formally relating how a set of parts relates to a function has not yet emerged. Synthetic biology both demands such a formalism and provides an ideal setting for testing hypotheses about relationships between DNA sequences and phenotypes beyond the gene-centric methods used in genetics. Attribute grammars are used in computer science to translate the text of a program source code into the computational operations it represents. By associating attributes with parts, modifying the value of these attributes using rules that describe the structure of DNA sequences, and using a multi-pass compilation process, it is possible to translate DNA sequences into molecular interaction network models. These capabilities are illustrated by simple example grammars expressing how gene expression rates are dependent upon single or multiple parts. The translation process is validated by systematically generating, translating, and simulating the phenotype of all the sequences in the design space generated by a small library of genetic parts. Attribute grammars represent a flexible framework connecting parts with models of biological function. They will be instrumental for building mathematical models of libraries of genetic constructs synthesized to characterize the function of genetic parts. This formalism is also expected to provide a solid foundation for the development of computer-assisted design applications for synthetic biology.
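    A drastically simplified rendering of the attribute idea: each genetic part carries attributes, and a rule over the part sequence synthesizes a model quantity such as an expression rate. The part names, attribute values, and the single-pass rule below are all invented for illustration; the paper's grammars are richer and multi-pass.

```python
# Toy attribute rule: walk a part sequence left to right, carrying promoter
# strength and RBS efficiency as attributes, and synthesize an expression
# rate for the construct. All parts and numbers are hypothetical.

PARTS = {
    "pStrong": {"type": "promoter", "strength": 2.0},
    "pWeak":   {"type": "promoter", "strength": 0.3},
    "rbs1":    {"type": "rbs", "efficiency": 0.8},
    "gfp":     {"type": "cds"},
    "term1":   {"type": "terminator"},
}

def expression_rate(design):
    """Synthesized attribute: promoter strength x RBS efficiency."""
    strength = efficiency = 0.0
    for name in design:
        part = PARTS[name]
        if part["type"] == "promoter":
            strength = part["strength"]
        elif part["type"] == "rbs":
            efficiency = part["efficiency"]
    return strength * efficiency

print(expression_rate(["pStrong", "rbs1", "gfp", "term1"]))  # -> 1.6
```

    Enumerating every ordering of a small part library and evaluating such rules is exactly the kind of design-space sweep the validation in the abstract describes.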

    Inroads to Predict in Vivo Toxicology—An Introduction to the eTOX Project

    There is a widespread awareness that the wealth of preclinical toxicity data that the pharmaceutical industry has generated in recent decades is not exploited as efficiently as it could be. Enhanced data availability for compound comparison ("read-across"), or for data mining to build predictive tools, should lead to a more efficient drug development process and contribute to the reduction of animal use (3Rs principle). In order to achieve these goals, a consortium approach, grouping numbers of relevant partners, is required. The eTOX ("electronic toxicity") consortium represents such a project and is a public-private partnership within the framework of the European Innovative Medicines Initiative (IMI). The project aims at the development of in silico prediction systems for organ and in vivo toxicity. The backbone of the project will be a database consisting of preclinical toxicity data for drug compounds or candidates, extracted from previously unpublished legacy reports from thirteen pharmaceutical companies based or operating in Europe. The database will be enhanced by incorporation of publicly available, high-quality toxicology data. Seven academic institutes and five small-to-medium size enterprises (SMEs) contribute with their expertise in data gathering, database curation, data mining, chemoinformatics and predictive systems development. The outcome of the project will be a predictive system contributing to early potential hazard identification and risk assessment during the drug development process. The concept and strategy of the eTOX project are described here, together with current achievements and future deliverables.

    Systematic Planning of Genome-Scale Experiments in Poorly Studied Species

    Genome-scale datasets have been used extensively in model organisms to screen for specific candidates or to predict functions for uncharacterized genes. However, despite the availability of extensive knowledge in model organisms, the planning of genome-scale experiments in poorly studied species is still based on the intuition of experts or heuristic trials. We propose that computational and systematic approaches can be applied to drive the experiment planning process in poorly studied species based on available data and knowledge in closely related model organisms. In this paper, we suggest a computational strategy for recommending genome-scale experiments based on their capability to interrogate diverse biological processes to enable protein function assignment. To this end, we use the data-rich functional genomics compendium of the model organism to quantify the accuracy of each dataset in predicting each specific biological process and the overlap in such coverage between different datasets. Our approach uses an optimized combination of these quantifications to recommend an ordered list of experiments for accurately annotating most proteins in the poorly studied related organisms to most biological processes, as well as a set of experiments that target each specific biological process. The effectiveness of this experiment-planning system is demonstrated for two related yeast species: the model organism Saccharomyces cerevisiae and the comparatively poorly studied Saccharomyces bayanus. Our system recommended a set of S. bayanus experiments based on an S. cerevisiae microarray data compendium. In silico evaluations estimate that less than 10% of the experiments could achieve similar functional coverage to the whole microarray compendium. This estimation was confirmed by performing the recommended experiments in S. bayanus, therefore significantly reducing the labor devoted to characterizing the poorly studied genome. This experiment-planning framework could readily be adapted to the design of other types of large-scale experiments, as well as to other groups of organisms.
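    The recommendation step can be caricatured as a coverage problem: given, for each candidate experiment, the set of biological processes it predicts accurately in the model organism, order experiments by marginal gain. The greedy set-cover heuristic below is a stand-in for the paper's optimized combination, and all experiment and process names are hypothetical.

```python
# Hedged sketch of experiment ordering as greedy set cover: repeatedly pick
# the experiment that covers the most not-yet-covered biological processes.
# The experiment/process catalogue is invented for illustration.

EXPERIMENTS = {
    "heat_shock_array": {"stress response", "protein folding"},
    "cell_cycle_array": {"cell cycle", "DNA replication"},
    "nutrient_shift":   {"metabolism", "stress response"},
    "dna_damage_array": {"DNA replication", "DNA repair"},
}

def recommend(experiments):
    """Order experiments so each adds the most new process coverage.

    Ties are broken by name so the ordering is deterministic."""
    remaining = dict(experiments)
    covered, order = set(), []
    while remaining:
        best = max(remaining, key=lambda e: (len(remaining[e] - covered), e))
        if not remaining[best] - covered:
            break  # no experiment adds anything new
        covered |= remaining.pop(best)
        order.append(best)
    return order, covered

order, covered = recommend(EXPERIMENTS)
print(order)
```

    In the paper's setting the "value" of an experiment is weighted by measured prediction accuracy per process rather than a plain set, but the greedy marginal-gain structure is the same family of idea.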

    Context-driven discovery of gene cassettes in mobile integrons using a computational grammar

    Background: Gene discovery algorithms typically examine sequence data for low-level patterns. A novel method to computationally discover higher-order DNA structures is presented, using a context-sensitive grammar. The algorithm was applied to the discovery of gene cassettes associated with integrons. The discovery and annotation of antibiotic resistance genes in such cassettes is essential for effective monitoring of antibiotic resistance patterns and formulation of public health antibiotic prescription policies.
    Results: We discovered two new putative gene cassettes using the method, from 276 integron features and 978 GenBank sequences. The system achieved κ = 0.972 annotation agreement with an expert gold standard of 300 sequences. In rediscovery experiments, we deleted 789,196 cassette instances over 2030 experiments and correctly relabelled 85.6% (α ≥ 95%, E ≤ 1%, mean sensitivity = 0.86, specificity = 1, F-score = 0.93), with no false positives. Error analysis demonstrated that for 72,338 missed deletions, two adjacent deleted cassettes were labeled as a single cassette, increasing performance to 94.8% (mean sensitivity = 0.92, specificity = 1, F-score = 0.96).
    Conclusion: Using grammars we were able to represent heuristic background knowledge about large and complex structures in DNA. Importantly, we were also able to use the context embedded in the model to discover new putative antibiotic resistance gene cassettes. The method is complementary to existing automatic annotation systems, which operate at the sequence level.
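    The role of context can be illustrated with a deliberately tiny rule: label an open reading frame as a putative cassette only when its surroundings match, here when it is flanked on both sides by attC-like recombination sites. The feature tokens and the rule are invented caricatures of what a real context-sensitive grammar over integron structure encodes.

```python
# Toy context-sensitive labelling: an ORF counts as a putative gene cassette
# only if both neighbouring features are attC sites. Tokens are illustrative.

def label_cassettes(features):
    """Indices of ORFs whose neighbours on both sides are attC sites."""
    hits = []
    for i, f in enumerate(features):
        if (f == "orf"
                and 0 < i < len(features) - 1
                and features[i - 1] == "attC"
                and features[i + 1] == "attC"):
            hits.append(i)
    return hits

# A cartoon integron: integrase, attI site, then cassettes between attC sites.
integron = ["intI", "attI", "attC", "orf", "attC", "orf", "attC", "orf", "tniA"]
print(label_cassettes(integron))  # -> [3, 5]
```

    The trailing ORF at index 7 is rejected because its right-hand context is not an attC site, which is exactly the kind of decision a sequence-level pattern matcher cannot make on its own.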

    Directed acyclic graph kernels for structural RNA analysis

    Background: Recent discoveries of a large variety of important roles for non-coding RNAs (ncRNAs) have been reported by numerous researchers. In order to analyze ncRNAs by kernel methods, including support vector machines, we propose stem kernels as an extension of string kernels for measuring the similarities between two RNA sequences from the viewpoint of secondary structures. However, applying stem kernels directly to large data sets of ncRNAs is impractical due to their computational complexity.
    Results: We have developed a new technique based on directed acyclic graphs (DAGs) derived from base-pairing probability matrices of RNA sequences that significantly increases the computation speed of stem kernels. Furthermore, we propose profile-profile stem kernels for multiple alignments of RNA sequences, which utilize base-pairing probability matrices for multiple alignments instead of those for individual sequences. Our kernels outperformed the existing methods with respect to the detection of known ncRNAs and kernel hierarchical clustering.
    Conclusion: Stem kernels can be utilized as a reliable similarity measure of structural RNAs, and can be used in various kernel-based applications.
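    A greatly simplified stand-in for the idea of comparing RNAs through structure rather than sequence: take each RNA's base-pairing probability matrix and use their Frobenius inner product as a similarity score. The actual DAG-based stem kernels are far more elaborate, and the matrices below are invented, not computed from real sequences.

```python
# Simplified structural similarity: the inner product of two base-pairing
# probability matrices. Real stem kernels walk DAGs built from such matrices;
# this sketch only conveys why the matrices are a useful representation.

def bp_kernel(P, Q):
    """Frobenius inner product of two equal-size probability matrices."""
    return sum(p * q for row_p, row_q in zip(P, Q) for p, q in zip(row_p, row_q))

# Hypothetical 3x3 base-pairing probability matrices for two short RNAs:
# both place most probability on pairing position 0 with position 2.
P = [[0.0, 0.0, 0.9],
     [0.0, 0.0, 0.0],
     [0.9, 0.0, 0.0]]
Q = [[0.0, 0.0, 0.8],
     [0.0, 0.1, 0.0],
     [0.8, 0.0, 0.0]]

print(round(bp_kernel(P, Q), 2))  # -> 1.44
```

    Because the score rewards agreement on likely base pairs, two RNAs with different sequences but similar secondary structure can still score highly, which is the property the stem kernels exploit.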