Efficient seeding techniques for protein similarity search

Author: Furletova Eugenia
Gambin Anna
Kucherov Gregory
Lasota Slawomir
Noé Laurent
Roytberg Mikhail
Szczurek Ewa
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2008

We apply the concept of subset seeds proposed in [1] to similarity search in protein sequences. The main question studied is the design of efficient seed alphabets to construct seeds with optimal sensitivity/selectivity trade-offs. We propose several different design methods and use them to construct several alphabets. We then perform an analysis of seeds built over those alphabets and compare them with the standard Blastp seeding method [2,3], as well as with the family of vector seeds proposed in [4]. While the formalism of subset seeds is less expressive (but less costly to implement) than the accumulative principle used in Blastp and vector seeds, our seeds show a similar or even better performance than Blastp on Bernoulli models of proteins compatible with the common BLOSUM62 matrix.
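The abstract's remark that subset seeds are "less costly to implement" can be illustrated with a minimal sketch. It assumes each seed letter is a partition of the 20 amino acids, so that every word maps to a single hash key and hits can be found by direct index lookup. The seed letters `#`, `@`, `_`, the residue groupings, and the function names are hypothetical choices for illustration, not the alphabets designed in the paper.

```python
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical seed letters, each a partition of the amino acids.
# The groupings are illustrative only, not taken from the paper.
EXACT = [aa for aa in AMINO_ACIDS]                         # every residue alone
COARSE = ["AGST", "C", "DENQ", "FWY", "HKR", "ILMV", "P"]  # assumed groups
ANY = [AMINO_ACIDS]                                        # one group: don't care


def letter_table(partition):
    """Map each amino acid to the index of the group that contains it."""
    return {aa: idx for idx, group in enumerate(partition) for aa in group}


SEED_LETTERS = {
    "#": letter_table(EXACT),
    "@": letter_table(COARSE),
    "_": letter_table(ANY),
}


def seed_key(seed, word):
    """Two words get the same key exactly when the seed matches them."""
    return tuple(SEED_LETTERS[s][aa] for s, aa in zip(seed, word))


def index_sequence(seed, seq):
    """Hash every window of the subject sequence by its seed key."""
    index = defaultdict(list)
    span = len(seed)
    for i in range(len(seq) - span + 1):
        index[seed_key(seed, seq[i:i + span])].append(i)
    return index


def find_hits(seed, query, index):
    """Return (query_pos, subject_pos) pairs where the seed hits."""
    span = len(seed)
    hits = []
    for i in range(len(query) - span + 1):
        for j in index.get(seed_key(seed, query[i:i + span]), []):
            hits.append((i, j))
    return hits
```

Because a match reduces to key equality, no per-position score has to be accumulated, which is the cost advantage over the accumulative principle that the abstract alludes to. Under these assumed groupings, `find_hits("#@_#", "ARVW", index_sequence("#@_#", "MMAKLWMM"))` reports a hit between `ARVW` and `AKLW`.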

arXiv.org e-Print Archive

CiteSeerX

HAL - Lille 3

Crossref

INRIA a CCSD electronic archive server

Entropy-scaling search of massive biological data

Author: Berger Bonnie
Daniels Noah M.
Danko David Christian
Yu Y. William
Publication venue: 'Elsevier BV'
Publication date: 01/06/2015

Many datasets exhibit a well-defined structure that can be exploited to design faster search tools, but it is not always clear when such acceleration is possible. Here, we introduce a framework for similarity search based on characterizing a dataset's entropy and fractal dimension. We prove that searching scales in time with metric entropy (number of covering hyperspheres), if the fractal dimension of the dataset is low, and scales in space with the sum of metric entropy and information-theoretic entropy (randomness of the data). Using these ideas, we present accelerated versions of standard tools, with no loss in specificity and little loss in sensitivity, for use in three domains: high-throughput drug screening (Ammolite, 150x speedup), metagenomics (MICA, 3.5x speedup of DIAMOND [3,700x BLASTX]), and protein structure search (esFragBag, 10x speedup of FragBag). Our framework can be used to achieve "compressive omics," and the general theory can be readily applied to data science problems outside of biology.
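The idea of searching over covering hyperspheres can be sketched as a two-stage (coarse, then fine) lookup. This is a minimal illustration assuming a true metric `dist` and a simple greedy covering; the function names and the clustering rule are assumptions made here, not the implementation of Ammolite, MICA, or esFragBag.

```python
def build_clusters(points, dist, cluster_radius):
    """Greedy covering: a point joins the first center within cluster_radius,
    otherwise it becomes a new center. Returns a list of (center, members)."""
    clusters = []
    for p in points:
        for center, members in clusters:
            if dist(center, p) <= cluster_radius:
                members.append(p)
                break
        else:
            clusters.append((p, [p]))
    return clusters


def search(query, clusters, dist, radius, cluster_radius):
    """Return all points within `radius` of `query`.

    Coarse step: by the triangle inequality, a point within `radius` of the
    query can only sit in a cluster whose center is within
    radius + cluster_radius of the query, so other clusters are skipped.
    Fine step: scan only the members of the surviving clusters.
    """
    hits = []
    for center, members in clusters:
        if dist(query, center) <= radius + cluster_radius:
            hits.extend(m for m in members if dist(query, m) <= radius)
    return hits
```

With a genuine metric the coarse step discards no true neighbor, so the result equals that of a full scan while the number of distance evaluations depends on the number of clusters (the metric entropy) plus the size of the clusters that survive the coarse filter. When the distance is only approximately a metric, some neighbors can be missed, which is consistent with the "little loss in sensitivity" mentioned in the abstract.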

arXiv.org e-Print Archive

Elsevier - Publisher Connector

DSpace@MIT

Crossref

PubMed Central

A unifying framework for seed sensitivity and its application to subset seeds

Author: A. Finkelstein
A.V. Aho
B. Brejova
B. Ma
D. Brown
G. Kucherov
I.H. Yang
J. Buhler
J. Xu
J.D. Ullman
K. Choi
K.P. Choi
S. Altschul
S. Burkhardt
W. Chen
W.J. Kent
Publication venue: 'World Scientific Pub Co Pte Lt'
Publication date: 01/01/2004

We propose a general approach to computing seed sensitivity that can be applied to different definitions of seeds. It treats separately three components of the seed sensitivity problem (a set of target alignments, an associated probability distribution, and a seed model), each specified by a distinct finite automaton. The approach is then applied to a new concept of subset seeds, for which we propose an efficient automaton construction. Experimental results confirm that sensitive subset seeds can be efficiently designed using our approach, and can then be used in similarity search, producing better results than ordinary spaced seeds.
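To make the notion of seed sensitivity concrete, the sketch below computes it by brute-force enumeration rather than by the automaton construction the abstract describes. It assumes a three-letter alignment alphabet (`1` match, `h` transition, `0` transversion), seed letters `#`, `@`, `_` that each accept a subset of those letters, and a Bernoulli model in which alignment columns are independent. All of these names and the probabilities in the usage note are illustrative assumptions.

```python
from itertools import product

# Assumed seed alphabet: each seed letter accepts a subset of alignment letters.
ACCEPTS = {
    "#": {"1"},            # match only
    "@": {"1", "h"},       # match or transition
    "_": {"1", "h", "0"},  # anything
}


def seed_hits(seed, alignment):
    """True if the seed matches the alignment at some position."""
    span = len(seed)
    return any(
        all(alignment[i + j] in ACCEPTS[s] for j, s in enumerate(seed))
        for i in range(len(alignment) - span + 1)
    )


def sensitivity(seed, length, probs):
    """Probability that the seed hits a random alignment of the given length.

    `probs` maps each alignment letter to its probability (Bernoulli model).
    Enumeration is exponential in `length`, so this only suits small cases.
    """
    total = 0.0
    for letters in product(probs, repeat=length):
        if seed_hits(seed, letters):
            p = 1.0
            for c in letters:
                p *= probs[c]
            total += p
    return total
```

For example, with `probs = {"1": 0.7, "h": 0.15, "0": 0.15}`, comparing `sensitivity("#@#", 8, probs)` with `sensitivity("###", 8, probs)` shows the gain from relaxing one position; the first value can never be smaller, since every hit of `###` is also a hit of `#@#`. An automaton-based method would reach the same numbers without enumerating every alignment.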

arXiv.org e-Print Archive

CiteSeerX

HAL - Lille 3

Crossref

INRIA a CCSD electronic archive server

PubMed Central

HAL Descartes

Hal-Diderot

ViCTree: an automated framework for taxonomic classification from protein sequences

Author: Adams
Adams
Altschul
Andrew J Davison
Anil S Thanki
Bao
Cotmore
Di Tommaso
Edgar
Fu
Hibbett
Izquierdo-Carrasco
Janet Kelso
Joseph Hughes
Kapli
Katoh
Katoh
Kozlov
Lauber
Löytynoja
Löytynoja
Nishimura
Sejal Modha
Sievers
Simmonds
Simmonds
Smith
Stamatakis
Susan F Cotmore
Thompson
Vilella
Wu
Publication venue: 'Oxford University Press (OUP)'
Publication date: 20/02/2018

Motivation: The increasing rate of submission of genetic sequences into public databases is providing a growing resource for classifying the organisms that these sequences represent. To aid viral classification, we have developed ViCTree, which automatically integrates the relevant sets of sequences in NCBI GenBank and transforms them into an interactive maximum likelihood phylogenetic tree that can be updated automatically. ViCTree incorporates ViCTreeView, which is a JavaScript-based visualisation tool that enables the tree to be explored interactively in the context of pairwise distance data. Results: To demonstrate utility, ViCTree was applied to subfamily Densovirinae of family Parvoviridae. This led to the identification of six new species of insect virus. Availability: ViCTree is open-source and can be run on any Linux- or Unix-based computer or cluster. A tutorial, the documentation and the source code are available under a GPL3 license, and can be accessed at http://bioinformatics.cvr.ac.uk/victree_web/

Crossref

Enlighten

PLAST: parallel local alignment search tool for database comparison

Author: A Jacob
D Lavenier
Dominique Lavenier
GM Amdahl
H Zhang
Hoa Van Nguyen
KM Chao
M Farrar
M Gertz
M Pop
M Roytberg
N Firasta
S Karlin
SF Altschul
T Rognes
TF Smith
V Sachdeva
W Hu
W Liu
WR Pearson
X Fei
YK Yu
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: Sequence similarity searching is an important and challenging task in molecular biology and next-generation sequencing should further strengthen the need for faster algorithms to process such vast amounts of data. At the same time, the internal architecture of current microprocessors is tending towards more parallelism, leading to the use of chips with two, four and more cores integrated on the same die. The main purpose of this work was to design an effective algorithm to fit with the parallel capabilities of modern microprocessors. Results: A parallel algorithm for comparing large genomic banks and targeting middle-range computers has been developed and implemented in PLAST software. The algorithm exploits two key parallel features of existing and future microprocessors: the SIMD programming model (SSE instruction set) and the multithreading concept (multicore). Compared to multithreaded BLAST software, tests performed on an 8-processor server have shown speedup ranging from 3 to 6 with a similar level of accuracy. Conclusions: A parallel algorithmic approach driven by the knowledge of the internal microprocessor architecture allows significant speedup to be obtained while preserving standard sensitivity for similarity search problems.

HAL-CentraleSupelec

CiteSeerX

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

INRIA a CCSD electronic archive server

Compressive genomics for protein databases

Author: A. Gallant
B. Berger
Boratyn
Cameron
Chen
Chen
Gross
Huttenhower
J. Peng
Kahn
Kircher
Kosloff
L. J. Cowen
Loewenstein
Loh
M. Baym
McDonnell
Murzin
N. M. Daniels
Needleman
Remmert
Rost
Schatz
Soding
Tatusov
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/06/2013

Motivation: The exponential growth of protein sequence databases has increasingly made the fundamental question of searching for homologs a computational bottleneck. The amount of unique data, however, is not growing nearly as fast; we can exploit this fact to greatly accelerate homology search. Acceleration of programs in the popular PSI/DELTA-BLAST family of tools will speed up not only homology search directly but also the huge collection of other current programs that primarily interact with large protein databases via precisely these tools. Results: We introduce a suite of homology search tools, powered by compressively accelerated protein BLAST (CaBLASTP), which are significantly faster than and comparably accurate to all known state-of-the-art tools, including HHblits, DELTA-BLAST and PSI-BLAST. Further, our tools are implemented in a manner that allows direct substitution into existing analysis pipelines. The key idea is that we introduce a local similarity-based compression scheme that allows us to operate directly on the compressed data. Importantly, CaBLASTP’s runtime scales almost linearly in the amount of unique data, as opposed to current BLASTP variants, which scale linearly in the size of the full protein database being searched. Our compressive algorithms will speed up many tasks, such as protein structure prediction and orthology mapping, which rely heavily on homology search. Availability: CaBLASTP is available under the GNU Public License at http://cablastp.csail.mit.edu/

DSpace@MIT

Crossref

Harvard University - DASH

PubMed Central

Molecular Evolution and Functional Diversification of Replication Protein A1 in Plants

Author: Aklilu Behailu B.
Culligan Kevin M.
Publication venue: University of New Hampshire Scholars' Repository
Publication date: 01/01/2016

Replication protein A (RPA) is a heterotrimeric, single-stranded DNA binding complex required for eukaryotic DNA replication, repair, and recombination. RPA is composed of three subunits, RPA1, RPA2, and RPA3. In contrast to single RPA subunit genes generally found in animals and yeast, plants encode multiple paralogs of RPA subunits, suggesting subfunctionalization. Genetic analysis demonstrates that five Arabidopsis thaliana RPA1 paralogs (RPA1A to RPA1E) have unique and overlapping functions in DNA replication, repair, and meiosis. We hypothesize here that RPA1 subfunctionalities will be reflected in major structural and sequence differences among the paralogs. To address this, we analyzed amino acid and nucleotide sequences of RPA1 paralogs from 25 complete genomes representing a wide spectrum of plants and unicellular green algae. We find here that the plant RPA1 gene family is divided into three general groups termed RPA1A, RPA1B, and RPA1C, which likely arose from two progenitor groups in unicellular green algae. In the family Brassicaceae the RPA1B and RPA1C groups have further expanded to include two unique sub-functional paralogs, RPA1D and RPA1E, respectively. In addition, RPA1 groups have unique domains, motifs, cis-elements, gene expression profiles, and patterns of conservation that are consistent with proposed functions in monocot and dicot species, including a novel C-terminal zinc-finger domain found only in plant RPA1C-like sequences. These results allow for improved prediction of RPA1 subunit functions in newly sequenced plant genomes, and potentially provide a unique molecular tool to improve classification of Brassicaceae species.

Crossref

Directory of Open Access Journals

Frontiers - Publisher Connector

PubMed Central

UNH Scholars' Repository