Search CORE

7 research outputs found

Fast Spaced Seed Hashing

Author: Comin Matteo
Girotto Samuele
Pizzi Cinzia
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 17th International Workshop on Algorithms in Bioinformatics (WABI 2017)
Publication date: 01/01/2017
Field of study

Hashing k-mers is a common function across many bioinformatics applications and it is widely used for indexing, querying and rapid similarity search. Recently, spaced seeds, a special type of pattern that accounts for errors or mutations, are routinely used instead of k-mers. Spaced seeds allow to improve the sensitivity, with respect to k-mers, in many applications, however the hashing of spaced seeds increases substantially the computational time. Hence, the ability to speed up hashing operations of spaced seeds would have a major impact in the field, making spaced seed applications not only accurate, but also faster and more efficient. In this paper we address the problem of efficient spaced seed hashing. The proposed algorithm exploits the similarity of adjacent spaced seed hash values in an input sequence in order to efficiently compute the next hash. We report a series of experiments on NGS reads hashing using several spaced seeds. In the experiments, our algorithm can compute the hashing values of spaced seeds with a speedup, with respect to the traditional approach, between 1.6x to 5.3x, depending on the structure of the spaced seed

Dagstuhl Research Online Publication Server

Archivio istituzionale della ricerca - Università di Padova

Multiple seeds sensitivity using a single seed with threshold

Author: Egidi Lavinia
Manzini Giovanni
Publication venue: 'World Scientific Pub Co Pte Lt'
Publication date: 01/01/2015
Field of study

Spaced seeds are a fundamental tool for similarity search in biosequences. The best sensitivity/selectivity trade-offs are obtained using many seeds simultaneously: This is known as the multiple seed approach. Unfortunately, spaced seeds use a large amount of memory and the available RAM is a practical limit to the number of seeds one can use simultaneously. Inspired by some recent results on lossless seeds, we revisit the approach of using a single spaced seed and considering two regions homologous if the seed hits in at least t sufficiently close positions. We show that by choosing the locations of the don't care symbols in the seed using quadratic residues modulo a prime number, we derive single seeds that when used with a threshold t > 1 have competitive sensitivity/selectivity trade-offs, indeed close to the best multiple seeds known in the literature. In addition, the choice of the threshold t can be adjusted to modify sensitivity and selectivity a posteriori, thus enabling a more accurate search in the specific instance at issue. The seeds we propose also exhibit robustness and allow flexibility in usage

Crossref

Archivio della Ricerca - Università di Pisa

Archivio Istituzionale della Ricerca- Università del Piemonte Orientale

Hit integration for identifying optimal spaced seeds

Author: B Brejova
B Ma
B Ma
DYF Mak
FP Preparata
G Kucherov
H Anton
I Herms
IH Yang
J Buhler
J Stoer
J Xu
J Yang
KP Choi
KP Choi
L Ilie
L Ilie
L Zhou
M Farach-Colton
M Li
M Li
Seong-Bae Park
SF Altschul
TF Smith
U Keich
WJ Kent
Won-Hyoung Chung
WR Pearson
X Gao
Y Sun
Publication venue: BioMed Central
Publication date: 01/01/2010
Field of study

Crossref

Springer - Publisher Connector

PubMed Central

Improvements on Seeding Based Protein Sequence Similarity Search

Author: Li Weiming
Publication venue: Scholarship@Western
Publication date: 01/01/2012
Field of study

The primary goal of bioinformatics is to increase an understanding in the biology of organisms. Computational, statistical, and mathematical theories and techniques have been developed on formal and practical problems that assist to achieve this primary goal. For the past three decades, the primary application of bioinformatics has been biological data analysis. The DNA or protein sequence similarity search is perhaps the most common, yet vitally important task for analyzing biological data. The sequence similarity search is a process of finding optimal sequence alignments. On the theoretical level, the problem of sequence similarity search is complex. On the applicational level, the sequences similarity search onto a biological database has been one of the most basic tasks today. Using traditional quadratic time complexity solutions becomes a challenge due to the size of the database. Seeding (or filtration) based approaches, which trade sensitivity for speed, are a popular choice among those available. Two main phases usually exist in a seeding based approach. The first phase is referred to as the hit generation, and the second phase is referred to as the hit extension. In this thesis, two improvements on the seeding based protein sequence similarity search are presented. First, for the hit generation, a new seeding idea, namely spaced k-mer neighbors, is presented. We present our effective algorithms to find a good set of spaced k-mer neighbors. Secondly, for the hit generation, a new method, namely HexFilter, is proposed to reduce the number of hit extensions while achieving better selectivity. We show our HexFilters with optimized configurations

CiteSeerX

Scholarship@Western

Novel computational techniques for mapping and classifying Next-Generation Sequencing data

Author: Břinda Karel
Publication venue
Publication date
Field of study

Since their emergence around 2006, Next-Generation Sequencing technologies have been revolutionizing biological and medical research. Quickly obtaining an extensive amount of short or long reads of DNA sequence from almost any biological sample enables detecting genomic variants, revealing the composition of species in a metagenome, deciphering cancer biology, decoding the evolution of living or extinct species, or understanding human migration patterns and human history in general. The pace at which the throughput of sequencing technologies is increasing surpasses the growth of storage and computer capacities, which creates new computational challenges in NGS data processing. In this thesis, we present novel computational techniques for read mapping and taxonomic classification. With more than a hundred of published mappers, read mapping might be considered fully solved. However, the vast majority of mappers follow the same paradigm and only little attention has been paid to non-standard mapping approaches. Here, we propound the so-called dynamic mapping that we show to significantly improve the resulting alignments compared to traditional mapping approaches. Dynamic mapping is based on exploiting the information from previously computed alignments, helping to improve the mapping of subsequent reads. We provide the first comprehensive overview of this method and demonstrate its qualities using Dynamic Mapping Simulator, a pipeline that compares various dynamic mapping scenarios to static mapping and iterative referencing. An important component of a dynamic mapper is an online consensus caller, i.e., a program collecting alignment statistics and guiding updates of the reference in the online fashion. We provide Ococo, the first online consensus caller that implements a smart statistics for individual genomic positions using compact bit counters. Beyond its application to dynamic mapping, Ococo can be employed as an online SNP caller in various analysis pipelines, enabling SNP calling from a stream without saving the alignments on disk. Metagenomic classification of NGS reads is another major topic studied in the thesis. Having a database with thousands of reference genomes placed on a taxonomic tree, the task is to rapidly assign a huge amount of NGS reads to tree nodes, and possibly estimate the relative abundance of involved species. In this thesis, we propose improved computational techniques for this task. In a series of experiments, we show that spaced seeds consistently improve the classification accuracy. We provide Seed-Kraken, a spaced seed extension of Kraken, the most popular classifier at present. Furthermore, we suggest ProPhyle, a new indexing strategy based on a BWT-index, obtaining a much smaller and more informative index compared to Kraken. We provide a modified version of BWA that improves the BWT-index for a quick k-mer look-up

ZENODO

On the complexity of the spaced seeds

Author: Bin Ma
Ming Li
Publication venue
Publication date: 01/01/2007
Field of study

Optimal spaced seeds were introduced by the theoretical computer science community to bioinformatics to effectively increase homology search sensitivity. These seeds are serving many homology queries daily. However the computational complexity of finding the optimal spaced seeds remains to be open. In this paper, we prove that computing hit probability of a spaced seed in a uniform homology region is NP-hard, but it admits a probabilistic PTAS. We also show that the asymptotic hit probability is computable in exponential time in seed length, independent of the homologous region length

CiteSeerX

Elsevier - Publisher Connector

On the complexity of the spaced seeds

Author: Altschul
Bin Ma
Birkhoff
Brejovà
Choi
Choi
Garey
Gotea
Int'l Mouse Genome Sequencing Consortium
Keich
Li
Li
Lipman
Ma
Mahaney
Minc
Ming Li
Needleman
Nicodéme
Smith
Xu
Publication venue: 'Elsevier BV'
Publication date
Field of study

Crossref