7,901 research outputs found

    Efficient seeding techniques for protein similarity search

    Get PDF
    We apply the concept of subset seeds proposed in [1] to similarity search in protein sequences. The main question studied is the design of efficient seed alphabets to construct seeds with optimal sensitivity/selectivity trade-offs. We propose several different design methods and use them to construct several alphabets.We then perform an analysis of seeds built over those alphabet and compare them with the standard Blastp seeding method [2,3], as well as with the family of vector seeds proposed in [4]. While the formalism of subset seed is less expressive (but less costly to implement) than the accumulative principle used in Blastp and vector seeds, our seeds show a similar or even better performance than Blastp on Bernoulli models of proteins compatible with the common BLOSUM62 matrix

    Efficient seeding techniques for protein similarity search

    Get PDF
    We apply the concept of subset seeds proposed in [1] to similarity search in protein sequences. The main question studied is the design of efficient seed alphabets to construct seeds with optimal sensitivity/selectivity trade-offs. We propose several different design methods and use them to construct several alphabets.We then perform an analysis of seeds built over those alphabet and compare them with the standard Blastp seeding method [2,3], as well as with the family of vector seeds proposed in [4]. While the formalism of subset seed is less expressive (but less costly to implement) than the accumulative principle used in Blastp and vector seeds, our seeds show a similar or even better performance than Blastp on Bernoulli models of proteins compatible with the common BLOSUM62 matrix

    Entropy-scaling search of massive biological data

    Get PDF
    Many datasets exhibit a well-defined structure that can be exploited to design faster search tools, but it is not always clear when such acceleration is possible. Here, we introduce a framework for similarity search based on characterizing a dataset's entropy and fractal dimension. We prove that searching scales in time with metric entropy (number of covering hyperspheres), if the fractal dimension of the dataset is low, and scales in space with the sum of metric entropy and information-theoretic entropy (randomness of the data). Using these ideas, we present accelerated versions of standard tools, with no loss in specificity and little loss in sensitivity, for use in three domains---high-throughput drug screening (Ammolite, 150x speedup), metagenomics (MICA, 3.5x speedup of DIAMOND [3,700x BLASTX]), and protein structure search (esFragBag, 10x speedup of FragBag). Our framework can be used to achieve "compressive omics," and the general theory can be readily applied to data science problems outside of biology.Comment: Including supplement: 41 pages, 6 figures, 4 tables, 1 bo

    A unifying framework for seed sensitivity and its application to subset seeds

    Get PDF
    We propose a general approach to compute the seed sensitivity, that can be applied to different definitions of seeds. It treats separately three components of the seed sensitivity problem -- a set of target alignments, an associated probability distribution, and a seed model -- that are specified by distinct finite automata. The approach is then applied to a new concept of subset seeds for which we propose an efficient automaton construction. Experimental results confirm that sensitive subset seeds can be efficiently designed using our approach, and can then be used in similarity search producing better results than ordinary spaced seeds

    ViCTree: an automated framework for taxonomic classification from protein sequences

    Get PDF
    Motivation: The increasing rate of submission of genetic sequences into public databases is providing a growing resource for classifying the organisms that these sequences represent. To aid viral classification, we have developed ViCTree, which automatically integrates the relevant sets of sequences in NCBI GenBank and transforms them into an interactive maximum likelihood phylogenetic tree that can be updated automatically. ViCTree incorporates ViCTreeView, which is a JavaScript-based visualisation tool that enables the tree to be explored interactively in the context of pairwise distance data. Results: To demonstrate utility, ViCTree was applied to subfamily Densovirinae of family Parvoviridae. This led to the identification of six new species of insect virus. Availability: ViCTree is open-source and can be run on any Linux- or Unix-based computer or cluster. A tutorial, the documentation and the source code are available under a GPL3 license, and can be accessed at http://bioinformatics.cvr.ac.uk/victree_web/

    PLAST: parallel local alignment search tool for database comparison

    Get PDF
    Background: Sequence similarity searching is an important and challenging task in molecular biology and next-generation sequencing should further strengthen the need for faster algorithms to process such vast amounts of data. At the same time, the internal architecture of current microprocessors is tending towards more parallelism, leading to the use of chips with two, four and more cores integrated on the same die. The main purpose of this work was to design an effective algorithm to fit with the parallel capabilities of modern microprocessors. Results: A parallel algorithm for comparing large genomic banks and targeting middle-range computers has been developed and implemented in PLAST software. The algorithm exploits two key parallel features of existing and future microprocessors: the SIMD programming model (SSE instruction set) and the multithreading concept (multicore). Compared to multithreaded BLAST software, tests performed on an 8-processor server have shown speedup ranging from 3 to 6 with a similar level of accuracy. Conclusions: A parallel algorithmic approach driven by the knowledge of the internal microprocessor architecture allows significant speedup to be obtained while preserving standard sensitivity for similarity search problems.

    Compressive genomics for protein databases

    Get PDF
    Motivation: The exponential growth of protein sequence databases has increasingly made the fundamental question of searching for homologs a computational bottleneck. The amount of unique data, however, is not growing nearly as fast; we can exploit this fact to greatly accelerate homology search. Acceleration of programs in the popular PSI/DELTA-BLAST family of tools will not only speed-up homology search directly but also the huge collection of other current programs that primarily interact with large protein databases via precisely these tools. Results: We introduce a suite of homology search tools, powered by compressively accelerated protein BLAST (CaBLASTP), which are significantly faster than and comparably accurate with all known state-of-the-art tools, including HHblits, DELTA-BLAST and PSI-BLAST. Further, our tools are implemented in a manner that allows direct substitution into existing analysis pipelines. The key idea is that we introduce a local similarity-based compression scheme that allows us to operate directly on the compressed data. Importantly, CaBLASTP’s runtime scales almost linearly in the amount of unique data, as opposed to current BLASTP variants, which scale linearly in the size of the full protein database being searched. Our compressive algorithms will speed-up many tasks, such as protein structure prediction and orthology mapping, which rely heavily on homology search. Availability: CaBLASTP is available under the GNU Public License at http://cablastp.csail.mit.edu/ Contact: [email protected]

    Molecular Evolution and Functional Diversification of Replication Protein A1 in Plants

    Get PDF
    Replication protein A (RPA) is a heterotrimeric, single-stranded DNA binding complex required for eukaryotic DNA replication, repair, and recombination. RPA is composed of three subunits, RPA1, RPA2, and RPA3. In contrast to single RPA subunit genes generally found in animals and yeast, plants encode multiple paralogs of RPA subunits, suggesting subfunctionalization. Genetic analysis demonstrates that five Arabidopsis thaliana RPA1 paralogs (RPA1A to RPA1E) have unique and overlapping functions in DNA replication, repair, and meiosis. We hypothesize here that RPA1 subfunctionalities will be reflected in major structural and sequence differences among the paralogs. To address this, we analyzed amino acid and nucleotide sequences of RPA1 paralogs from 25 complete genomes representing a wide spectrum of plants and unicellular green algae. We find here that the plant RPA1 gene family is divided into three general groups termed RPA1A, RPA1B, and RPA1C, which likely arose from two progenitor groups in unicellular green algae. In the family Brassicaceae the RPA1B and RPA1C groups have further expanded to include two unique sub-functional paralogs RPA1D and RPA1E, respectively. In addition, RPA1 groups have unique domains, motifs, cis-elements, gene expression profiles, and pattern of conservation that are consistent with proposed functions in monocot and dicot species, including a novel C-terminal zinc-finger domain found only in plant RPA1C-like sequences. These results allow for improved prediction of RPA1 subunit functions in newly sequenced plant genomes, and potentially provide a unique molecular tool to improve classification of Brassicaceae species
    • …
    corecore