17,174 research outputs found

    FITBAR: a web tool for the robust prediction of prokaryotic regulons

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>The binding of regulatory proteins to their specific DNA targets determines the accurate expression of the neighboring genes. The <it>in silico </it>prediction of new binding sites in completely sequenced genomes is a key aspect in the deeper understanding of gene regulatory networks. Several algorithms have been described to discriminate against false-positives in the prediction of new binding targets; however none of them has been implemented so far to assist the detection of binding sites at the genomic scale.</p> <p>Results</p> <p>FITBAR (Fast Investigation Tool for Bacterial and Archaeal Regulons) is a web service designed to identify new protein binding sites on fully sequenced prokaryotic genomes. This tool consists in a workbench where the significance of the predictions can be compared using different statistical methods, a feature not found in existing resources. The Local Markov Model and the Compound Importance Sampling algorithms have been implemented to compute the P-value of newly discovered binding sites. In addition, FITBAR provides two optimized genomic scanning algorithms using either log-odds or entropy-weighted position-specific scoring matrices. Other significant features include the production of a detailed genomic context map for each detected binding site and the export of the search results in spreadsheet and portable document formats. FITBAR discovery of a high affinity <it>Escherichia coli </it>NagC binding site was validated experimentally <it>in vitro </it>as well as <it>in vivo </it>and published.</p> <p>Conclusions</p> <p>FITBAR was developed in order to allow fast, accurate and statistically robust predictions of prokaryotic regulons. This feature constitutes the main advantage of this web tool over other matrix search programs and does not impair its performance. The web service is available at <url>http://archaea.u-psud.fr/fitbar</url>.</p

    The Parallelism Motifs of Genomic Data Analysis

    Get PDF
    Genomic data sets are growing dramatically as the cost of sequencing continues to decline and small sequencing devices become available. Enormous community databases store and share this data with the research community, but some of these genomic data analysis problems require large scale computational platforms to meet both the memory and computational requirements. These applications differ from scientific simulations that dominate the workload on high end parallel systems today and place different requirements on programming support, software libraries, and parallel architectural design. For example, they involve irregular communication patterns such as asynchronous updates to shared data structures. We consider several problems in high performance genomics analysis, including alignment, profiling, clustering, and assembly for both single genomes and metagenomes. We identify some of the common computational patterns or motifs that help inform parallelization strategies and compare our motifs to some of the established lists, arguing that at least two key patterns, sorting and hashing, are missing

    Simultaneous identification of specifically interacting paralogs and inter-protein contacts by Direct-Coupling Analysis

    Full text link
    Understanding protein-protein interactions is central to our understanding of almost all complex biological processes. Computational tools exploiting rapidly growing genomic databases to characterize protein-protein interactions are urgently needed. Such methods should connect multiple scales from evolutionary conserved interactions between families of homologous proteins, over the identification of specifically interacting proteins in the case of multiple paralogs inside a species, down to the prediction of residues being in physical contact across interaction interfaces. Statistical inference methods detecting residue-residue coevolution have recently triggered considerable progress in using sequence data for quaternary protein structure prediction; they require, however, large joint alignments of homologous protein pairs known to interact. The generation of such alignments is a complex computational task on its own; application of coevolutionary modeling has in turn been restricted to proteins without paralogs, or to bacterial systems with the corresponding coding genes being co-localized in operons. Here we show that the Direct-Coupling Analysis of residue coevolution can be extended to connect the different scales, and simultaneously to match interacting paralogs, to identify inter-protein residue-residue contacts and to discriminate interacting from noninteracting families in a multiprotein system. Our results extend the potential applications of coevolutionary analysis far beyond cases treatable so far.Comment: Main Text 19 pages Supp. Inf. 16 page

    A Seeded Genetic Algorithm for RNA Secondary Structural Prediction with Pseudoknots

    Get PDF
    This work explores a new approach in using genetic algorithm to predict RNA secondary structures with pseudoknots. Since only a small portion of most RNA structures is comprised of pseudoknots, the majority of structural elements from an optimal pseudoknot-free structure are likely to be part of the true structure. Thus seeding the genetic algorithm with optimal pseudoknot-free structures will more likely lead it to the true structure than a randomly generated population. The genetic algorithm uses the known energy models with an additional augmentation to allow complex pseudoknots. The nearest-neighbor energy model is used in conjunction with Turner’s thermodynamic parameters for pseudoknot-free structures, and the H-type pseudoknot energy estimation for simple pseudoknots. Testing with known pseudoknot sequences from PseudoBase shows that it out performs some of the current popular algorithms

    SWAPHI: Smith-Waterman Protein Database Search on Xeon Phi Coprocessors

    Full text link
    The maximal sensitivity of the Smith-Waterman (SW) algorithm has enabled its wide use in biological sequence database search. Unfortunately, the high sensitivity comes at the expense of quadratic time complexity, which makes the algorithm computationally demanding for big databases. In this paper, we present SWAPHI, the first parallelized algorithm employing Xeon Phi coprocessors to accelerate SW protein database search. SWAPHI is designed based on the scale-and-vectorize approach, i.e. it boosts alignment speed by effectively utilizing both the coarse-grained parallelism from the many co-processing cores (scale) and the fine-grained parallelism from the 512-bit wide single instruction, multiple data (SIMD) vectors within each core (vectorize). By searching against the large UniProtKB/TrEMBL protein database, SWAPHI achieves a performance of up to 58.8 billion cell updates per second (GCUPS) on one coprocessor and up to 228.4 GCUPS on four coprocessors. Furthermore, it demonstrates good parallel scalability on varying number of coprocessors, and is also superior to both SWIPE on 16 high-end CPU cores and BLAST+ on 8 cores when using four coprocessors, with the maximum speedup of 1.52 and 1.86, respectively. SWAPHI is written in C++ language (with a set of SIMD intrinsics), and is freely available at http://swaphi.sourceforge.net.Comment: A short version of this paper has been accepted by the IEEE ASAP 2014 conferenc

    Spectral Sequence Motif Discovery

    Full text link
    Sequence discovery tools play a central role in several fields of computational biology. In the framework of Transcription Factor binding studies, motif finding algorithms of increasingly high performance are required to process the big datasets produced by new high-throughput sequencing technologies. Most existing algorithms are computationally demanding and often cannot support the large size of new experimental data. We present a new motif discovery algorithm that is built on a recent machine learning technique, referred to as Method of Moments. Based on spectral decompositions, this method is robust under model misspecification and is not prone to locally optimal solutions. We obtain an algorithm that is extremely fast and designed for the analysis of big sequencing data. In a few minutes, we can process datasets of hundreds of thousand sequences and extract motif profiles that match those computed by various state-of-the-art algorithms.Comment: 20 pages, 3 figures, 1 tabl
    corecore