2,339 research outputs found

    Integration of Biological Sources: Exploring the Case of Protein Homology

    Get PDF
    Data integration is a key issue in the domain of bioin- formatics, which deals with huge amounts of heteroge- neous biological data that grows and changes rapidly. This paper serves as an introduction in the field of bioinformatics and the biological concepts it deals with, and an exploration of the integration problems a bioinformatics scientist faces. We examine ProGMap, an integrated protein homology system used by bioin- formatics scientists at Wageningen University, and several use cases related to protein homology. A key issue we identify is the huge manual effort required to unify source databases into a single resource. Un- certain databases are able to contain several possi- ble worlds, and it has been proposed that they can be used to significantly reduce initial integration efforts. We propose several directions for future work where uncertain databases can be applied to bioinformatics, with the goal of furthering the cause of bioinformatics integration

    Big tranSMART for clinical decision making

    Get PDF
    Molecular profiling data based patient stratification plays a key role in clinical decision making, such as identification of disease subgroups and prediction of treatment responses of individual subjects. Many existing knowledge management systems like tranSMART enable scientists to do such analysis. But in the big data era, molecular profiling data size increases sharply due to new biological techniques, such as next generation sequencing. None of the existing storage systems work well while considering the three ”V” features of big data (Volume, Variety, and Velocity). New Key Value data stores like Apache HBase and Google Bigtable can provide high speed queries by the Key. These databases can be modeled as Distributed Ordered Table (DOT), which horizontally partitions a table into regions and distributes regions to region servers by the Key. However, none of existing data models work well for DOT. A Collaborative Genomic Data Model (CGDM) has been designed to solve all these is- sues. CGDM creates three Collaborative Global Clustering Index Tables to improve the data query velocity. Microarray implementation of CGDM on HBase performed up to 246, 7 and 20 times faster than the relational data model on HBase, MySQL Cluster and MongoDB. Single nucleotide polymorphism implementation of CGDM on HBase outperformed the relational model on HBase and MySQL Cluster by up to 351 and 9 times. Raw sequence implementation of CGDM on HBase gains up to 440-fold and 22-fold speedup, compared to the sequence alignment map format implemented in HBase and a binary alignment map server. The integration into tranSMART shows up to 7-fold speedup in the data export function. In addition, a popular hierarchical clustering algorithm in tranSMART has been used as an application to indicate how CGDM can influence the velocity of the algorithm. The optimized method using CGDM performs more than 7 times faster than the same method using the relational model implemented in MySQL Cluster.Open Acces

    Prospects and limitations of full-text index structures in genome analysis

    Get PDF
    The combination of incessant advances in sequencing technology producing large amounts of data and innovative bioinformatics approaches, designed to cope with this data flood, has led to new interesting results in the life sciences. Given the magnitude of sequence data to be processed, many bioinformatics tools rely on efficient solutions to a variety of complex string problems. These solutions include fast heuristic algorithms and advanced data structures, generally referred to as index structures. Although the importance of index structures is generally known to the bioinformatics community, the design and potency of these data structures, as well as their properties and limitations, are less understood. Moreover, the last decade has seen a boom in the number of variant index structures featuring complex and diverse memory-time trade-offs. This article brings a comprehensive state-of-the-art overview of the most popular index structures and their recently developed variants. Their features, interrelationships, the trade-offs they impose, but also their practical limitations, are explained and compared

    Genetic and Computational Identification of a Conserved Bacterial Metabolic Module

    Get PDF
    We have experimentally and computationally defined a set of genes that form a conserved metabolic module in the α-proteobacterium Caulobacter crescentus and used this module to illustrate a schema for the propagation of pathway-level annotation across bacterial genera. Applying comprehensive forward and reverse genetic methods and genome-wide transcriptional analysis, we (1) confirmed the presence of genes involved in catabolism of the abundant environmental sugar myo-inositol, (2) defined an operon encoding an ABC-family myo-inositol transmembrane transporter, and (3) identified a novel myo-inositol regulator protein and cis-acting regulatory motif that control expression of genes in this metabolic module. Despite being encoded from non-contiguous loci on the C. crescentus chromosome, these myo-inositol catabolic enzymes and transporter proteins form a tightly linked functional group in a computationally inferred network of protein associations. Primary sequence comparison was not sufficient to confidently extend annotation of all components of this novel metabolic module to related bacterial genera. Consequently, we implemented the Graemlin multiple-network alignment algorithm to generate cross-species predictions of genes involved in myo-inositol transport and catabolism in other α-proteobacteria. Although the chromosomal organization of genes in this functional module varied between species, the upstream regions of genes in this aligned network were enriched for the same palindromic cis-regulatory motif identified experimentally in C. crescentus. Transposon disruption of the operon encoding the computationally predicted ABC myo-inositol transporter of Sinorhizobium meliloti abolished growth on myo-inositol as the sole carbon source, confirming our cross-genera functional prediction. Thus, we have defined regulatory, transport, and catabolic genes and a cis-acting regulatory sequence that form a conserved module required for myo-inositol metabolism in select α-proteobacteria. Moreover, this study describes a forward validation of gene-network alignment, and illustrates a strategy for reliably transferring pathway-level annotation across bacterial species

    A Molecular Biology Database Digest

    Get PDF
    Computational Biology or Bioinformatics has been defined as the application of mathematical and Computer Science methods to solving problems in Molecular Biology that require large scale data, computation, and analysis [18]. As expected, Molecular Biology databases play an essential role in Computational Biology research and development. This paper introduces into current Molecular Biology databases, stressing data modeling, data acquisition, data retrieval, and the integration of Molecular Biology data from different sources. This paper is primarily intended for an audience of computer scientists with a limited background in Biology

    Automated paleontology of repetitive DNA with REANNOTATE

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Dispersed repeats are a major component of eukaryotic genomes and drivers of genome evolution. Annotation of DNA sequences homologous to known repetitive elements has been mainly performed with the program R<smcaps>EPEAT</smcaps>M<smcaps>ASKER</smcaps>. Sequences annotated by R<smcaps>EPEAT</smcaps>M<smcaps>ASKER</smcaps> often correspond to fragments of repetitive elements resulting from the insertion of younger elements or other rearrangements. Although R<smcaps>EPEAT</smcaps>M<smcaps>ASKER</smcaps> annotation is indispensable for studying genome biology, this annotation does not contain much information on the common origin of fossil fragments that share an insertion event, especially where clusters of nested insertions of repetitive elements have occurred.</p> <p>Results</p> <p>Here I present RE<smcaps>ANNOTATE</smcaps>, a computational tool to process R<smcaps>EPEAT</smcaps>M<smcaps>ASKER</smcaps> annotation for automated i) defragmentation of dispersed repetitive elements, ii) resolution of the temporal order of insertions in clusters of nested elements, and iii) estimating the age of the elements, if they have long terminal repeats. I have re-annotated the repetitive content of human chromosomes, providing evidence for a recent expansion of satellite repeats on the Y chromosome and, from the retroviral age distribution, for a higher rate of evolution on the Y relative to autosomes.</p> <p>Conclusion</p> <p>RE<smcaps>ANNOTATE</smcaps> is ready to process existing annotation for automated evolutionary analysis of all types of complex repeats in any genome. The tool is freely available under the GPL at <url>http://www.bioinformatics.org/reannotate</url>.</p
    corecore