2,339 research outputs found

Integration of Biological Sources: Exploring the Case of Protein Homology

Author: Boerman Tjeerd W.
Keulen Maurice van
Severing Edouard I.
Vet Paul van der
Publication venue: University of Twente, Centre for Telematics and Information Technology
Publication date: 01/01/2011

Data integration is a key issue in bioinformatics, a domain that deals with huge amounts of heterogeneous biological data that grow and change rapidly. This paper serves as an introduction to the field of bioinformatics and the biological concepts it deals with, and as an exploration of the integration problems a bioinformatics scientist faces. We examine ProGMap, an integrated protein homology system used by bioinformatics scientists at Wageningen University, and several use cases related to protein homology. A key issue we identify is the huge manual effort required to unify source databases into a single resource. Uncertain databases are able to contain several possible worlds, and it has been proposed that they can be used to significantly reduce initial integration efforts. We propose several directions for future work where uncertain databases can be applied to bioinformatics, with the goal of furthering the cause of bioinformatics integration.

University of Twente Research Information

Big tranSMART for clinical decision making

Author: Wang Shicai
Publication venue: Computing, Imperial College London
Publication date: 01/05/2016

Patient stratification based on molecular profiling data plays a key role in clinical decision making, such as identifying disease subgroups and predicting the treatment responses of individual subjects. Many existing knowledge management systems, such as tranSMART, enable scientists to do this kind of analysis. But in the big data era, the size of molecular profiling data is increasing sharply due to new biological techniques such as next generation sequencing. None of the existing storage systems works well when the three "V" features of big data (Volume, Variety, and Velocity) are considered. New key-value data stores like Apache HBase and Google Bigtable can provide high-speed queries by the Key. These databases can be modeled as a Distributed Ordered Table (DOT), which horizontally partitions a table into regions and distributes the regions to region servers by the Key. However, none of the existing data models works well for DOT. A Collaborative Genomic Data Model (CGDM) has been designed to solve all these issues. CGDM creates three Collaborative Global Clustering Index Tables to improve the data query velocity. The microarray implementation of CGDM on HBase performed up to 246, 7 and 20 times faster than the relational data model on HBase, MySQL Cluster and MongoDB, respectively. The single nucleotide polymorphism implementation of CGDM on HBase outperformed the relational model on HBase and MySQL Cluster by up to 351 and 9 times, respectively. The raw sequence implementation of CGDM on HBase gains up to 440-fold and 22-fold speedup compared to the sequence alignment map format implemented in HBase and a binary alignment map server, respectively. The integration into tranSMART shows up to 7-fold speedup in the data export function. In addition, a popular hierarchical clustering algorithm in tranSMART has been used as an application to indicate how CGDM can influence the velocity of the algorithm. The optimized method using CGDM performs more than 7 times faster than the same method using the relational model implemented in MySQL Cluster.
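
As a rough illustration of the Distributed Ordered Table idea mentioned in the abstract above (rows kept in key order, split into contiguous regions, each region assigned to a server), a minimal sketch might look like the following. The class name `DistributedOrderedTable`, the split keys, the server names, and the round-robin assignment are all hypothetical choices made for this sketch; they are not taken from the thesis, from HBase, or from Bigtable, and the sketch does not model CGDM itself.

```python
import bisect


class DistributedOrderedTable:
    """Toy model: a key-ordered table split into contiguous key ranges (regions)."""

    def __init__(self, split_keys, servers):
        # split_keys are the lower bounds of every region after the first,
        # so n split keys define n + 1 regions.
        self.split_keys = sorted(split_keys)
        self.servers = servers
        self.regions = [dict() for _ in range(len(self.split_keys) + 1)]

    def region_index(self, key):
        # bisect_right counts the split keys <= key, which is the region index.
        return bisect.bisect_right(self.split_keys, key)

    def server_for(self, key):
        # Assumption for this sketch: regions are assigned round-robin.
        return self.servers[self.region_index(key) % len(self.servers)]

    def put(self, key, value):
        self.regions[self.region_index(key)][key] = value

    def get(self, key):
        return self.regions[self.region_index(key)].get(key)

    def scan(self, start, stop):
        # Range scan over [start, stop): only overlapping regions are visited.
        first = self.region_index(start)
        last = self.region_index(stop)
        for region in self.regions[first:last + 1]:
            for key in sorted(region):
                if start <= key < stop:
                    yield key, region[key]


# Hypothetical row keys and values, used only to exercise the sketch.
table = DistributedOrderedTable(split_keys=["g", "p"], servers=["rs1", "rs2"])
for k in ["brca1", "tp53", "kras", "egfr"]:
    table.put(k, k.upper())
```

Because neighbouring keys land in the same region, a range scan touches only the regions that overlap the requested key range, which is the property that makes row-key design matter in stores of this kind.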

Spiral - Imperial College Digital Repository

Prospects and limitations of full-text index structures in genome analysis

Author: Dawyndt Peter
De Baets Bernard
Fack Veerle
Vyverman Michaël
Publication venue: Oxford University Press (OUP)
Publication date: 01/01/2012

The combination of incessant advances in sequencing technology producing large amounts of data and innovative bioinformatics approaches, designed to cope with this data flood, has led to new interesting results in the life sciences. Given the magnitude of sequence data to be processed, many bioinformatics tools rely on efficient solutions to a variety of complex string problems. These solutions include fast heuristic algorithms and advanced data structures, generally referred to as index structures. Although the importance of index structures is generally known to the bioinformatics community, the design and potency of these data structures, as well as their properties and limitations, are less well understood. Moreover, the last decade has seen a boom in the number of variant index structures featuring complex and diverse memory-time trade-offs. This article provides a comprehensive state-of-the-art overview of the most popular index structures and their recently developed variants. Their features, interrelationships, the trade-offs they impose, but also their practical limitations, are explained and compared.
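
The abstract above does not name a specific index structure. As one well-known example of a full-text index, a suffix array can be sketched as follows. This is a deliberately naive construction (sorting all suffixes) meant only to show the idea on a short string; the function names and the sample sequence are made up for this sketch and are not drawn from the article.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in lexicographic order."""
    # Naive construction by sorting the suffixes; suitable for short strings only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return the sorted start positions of pattern in text via binary search."""
    m = len(pattern)
    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[start:lo])


# Hypothetical toy sequence used only for illustration.
genome = "GATTACAGATTA"
sa = build_suffix_array(genome)
hits = find_occurrences(genome, sa, "ATTA")
```

The memory-time trade-off mentioned in the abstract is visible even here: the index stores one integer per text position, and in exchange each query needs only a logarithmic number of string comparisons instead of a scan of the whole text.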

Ghent University Academic Bibliography

PubMed Central

Data Mining: The Next Generation

Author: Agrawal Rakesh
Bollinger Toni
Clifton Christopher W.
Dzeroski Saso
Freytag Johann-Christoph
Hipp Jochen
Keim Daniel
Kramer Stefan
Kriegel Hans-Peter
Leser Ulf
Liu Bing
Mannila Heikki
Meo Rosa
Morishita Shinichi
Ng Raymond
Pei Jian
Raghavan Prabhakar
Ramakrishnan Raghu
Spiliopoulou Myra
Srivastava Jaideep
Torra Vicenc
Publication venue: Dagstuhl Seminar Proceedings. 04292 - Perspectives Workshop: Data Mining: The Next Generation
Publication date: 01/01/2005

Dagstuhl Research Online Publication Server

Genetic and Computational Identification of a Conserved Bacterial Metabolic Module

Author: Andrew T. Martens
Antal F. Novak
Balaji S. Srinivasan
Cara C. Boutte
Jason A. Flannick
Patrick H. Viollier
Sean Crosson
Serafim Batzoglou
Publication venue: Public Library of Science
Publication date: 01/01/2008

We have experimentally and computationally defined a set of genes that form a conserved metabolic module in the α-proteobacterium Caulobacter crescentus and used this module to illustrate a schema for the propagation of pathway-level annotation across bacterial genera. Applying comprehensive forward and reverse genetic methods and genome-wide transcriptional analysis, we (1) confirmed the presence of genes involved in catabolism of the abundant environmental sugar myo-inositol, (2) defined an operon encoding an ABC-family myo-inositol transmembrane transporter, and (3) identified a novel myo-inositol regulator protein and cis-acting regulatory motif that control expression of genes in this metabolic module. Despite being encoded from non-contiguous loci on the C. crescentus chromosome, these myo-inositol catabolic enzymes and transporter proteins form a tightly linked functional group in a computationally inferred network of protein associations. Primary sequence comparison was not sufficient to confidently extend annotation of all components of this novel metabolic module to related bacterial genera. Consequently, we implemented the Graemlin multiple-network alignment algorithm to generate cross-species predictions of genes involved in myo-inositol transport and catabolism in other α-proteobacteria. Although the chromosomal organization of genes in this functional module varied between species, the upstream regions of genes in this aligned network were enriched for the same palindromic cis-regulatory motif identified experimentally in C. crescentus. Transposon disruption of the operon encoding the computationally predicted ABC myo-inositol transporter of Sinorhizobium meliloti abolished growth on myo-inositol as the sole carbon source, confirming our cross-genera functional prediction. 
Thus, we have defined regulatory, transport, and catabolic genes and a cis-acting regulatory sequence that form a conserved module required for myo-inositol metabolism in select α-proteobacteria. Moreover, this study describes a forward validation of gene-network alignment, and illustrates a strategy for reliably transferring pathway-level annotation across bacterial species.

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Archive ouverte UNIGE

Genetic and Computational Identification of a Conserved Bacterial Metabolic Module

Author: Batzoglou Serafim
Boutte Cara C.
Crosson Sean
Flannick Jason A.
Martens Andrew T.
Novak Antal F.
Srinivasan Balaji S.
Viollier Patrick H.
Publication date: 03/01/2024

Knowledge UChicago

A Molecular Biology Database Digest

Author: Bry François
Kröger Peer
Publication date: 01/01/2000

Computational Biology or Bioinformatics has been defined as the application of mathematical and Computer Science methods to solving problems in Molecular Biology that require large scale data, computation, and analysis [18]. As expected, Molecular Biology databases play an essential role in Computational Biology research and development. This paper introduces current Molecular Biology databases, stressing data modeling, data acquisition, data retrieval, and the integration of Molecular Biology data from different sources. This paper is primarily intended for an audience of computer scientists with a limited background in Biology.

CiteSeerX

Open Access LMU

Automated paleontology of repetitive DNA with REANNOTATE

Author: Pereira Vini
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: Dispersed repeats are a major component of eukaryotic genomes and drivers of genome evolution. Annotation of DNA sequences homologous to known repetitive elements has been mainly performed with the program REPEATMASKER. Sequences annotated by REPEATMASKER often correspond to fragments of repetitive elements resulting from the insertion of younger elements or other rearrangements. Although REPEATMASKER annotation is indispensable for studying genome biology, this annotation does not contain much information on the common origin of fossil fragments that share an insertion event, especially where clusters of nested insertions of repetitive elements have occurred. Results: Here I present REANNOTATE, a computational tool to process REPEATMASKER annotation for automated i) defragmentation of dispersed repetitive elements, ii) resolution of the temporal order of insertions in clusters of nested elements, and iii) estimating the age of the elements, if they have long terminal repeats. I have re-annotated the repetitive content of human chromosomes, providing evidence for a recent expansion of satellite repeats on the Y chromosome and, from the retroviral age distribution, for a higher rate of evolution on the Y relative to autosomes. Conclusion: REANNOTATE is ready to process existing annotation for automated evolutionary analysis of all types of complex repeats in any genome. The tool is freely available under the GPL at http://www.bioinformatics.org/reannotate.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central