65 research outputs found

    A Bloom filter based semi-index on q-grams

    We present a simple q-gram based semi-index, which makes it possible to search for a pattern in what is typically only a small fraction of the text blocks. Several space-time tradeoffs are presented. Experiments on Pizza & Chili datasets show that our solution is up to three orders of magnitude faster than the semi-index of Claude et al. \cite{CNPSTjda10} at comparable space usage.
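
    The abstract does not spell out the layout of the semi-index, so the following is only a minimal sketch of the general idea, under stated assumptions: the text is cut into fixed-size blocks, each block gets its own Bloom filter over its q-grams, and a block is scanned only if every q-gram of the pattern passes its filter. The class names, the parameters (block_size, max_pattern_len, bits_per_block, num_hashes) and the double-hashing scheme are illustrative choices, not the authors' design.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter using double hashing over one BLAKE2b digest."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        digest = hashlib.blake2b(item.encode("utf-8"), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1  # keep the stride non-zero
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all((self.bits[pos // 8] >> (pos % 8)) & 1
                   for pos in self._positions(item))


def qgrams(s, q):
    """All overlapping substrings of length q."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


class QGramSemiIndex:
    """Hypothetical block-wise filter: one Bloom filter of q-grams per block."""

    def __init__(self, text, q=4, block_size=64, max_pattern_len=32,
                 bits_per_block=1024, num_hashes=3):
        self.text, self.q = text, q
        self.block_size, self.max_pattern_len = block_size, max_pattern_len
        self.filters = []
        for start in range(0, len(text), block_size):
            # Assumption: extend each block so that a pattern starting in it
            # has all of its q-grams inside the same window.
            window = text[start:start + block_size + max_pattern_len - 1]
            bf = BloomFilter(bits_per_block, num_hashes)
            for gram in qgrams(window, q):
                bf.add(gram)
            self.filters.append(bf)

    def find(self, pattern):
        m = len(pattern)
        if not self.q <= m <= self.max_pattern_len:
            raise ValueError("pattern length must be in [q, max_pattern_len]")
        grams = set(qgrams(pattern, self.q))
        hits = []
        for b, bf in enumerate(self.filters):
            if not all(gram in bf for gram in grams):
                continue  # block certainly has no occurrence starting in it
            start = b * self.block_size
            window = self.text[start:start + self.block_size + m - 1]
            pos = window.find(pattern)  # verify: the filter may give false positives
            while pos != -1:
                hits.append(start + pos)
                pos = window.find(pattern, pos + 1)
        return hits
```

    Because a Bloom filter has no false negatives, skipping a block is always safe; false positives only cost an unnecessary scan. Larger filters and more hash functions reduce wasted scans at the price of space, which is one form the space-time tradeoff can take.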

    Computation of Sensitive Multiple Spaced Seeds

    Similarity search is one of the most important problems in bioinformatics, with applications in read mapping, homology search, oligonucleotide design, etc. Similarity search is time and memory intensive, hence heuristic methods using multiple spaced seeds are commonly employed. A spaced seed is a string of 1 and *, where 1 represents a match position and * represents a don't-care position. Seeds are used to discover regions of identity; it is therefore imperative to design seeds of high sensitivity, so as to maximize the number of hits. We present SpEED2, a software program that generates multiple spaced seeds of high sensitivity. It uses a novel seed optimization approach and outperforms all the leading programs for designing multiple spaced seeds, such as Iedera, AcoSeeD, and rasbhari. Our algorithm will benefit software that depends on good-quality seeds, such as PatternHunter for similarity search, SHRiMP and BFAST for read mapping, bestPrimer for designing primers, and many more.
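
    To make the definition of a spaced seed concrete, here is a small sketch of how a single seed produces hits between two sequences: only the characters under a 1 are compared, and positions under * are ignored. The function names are hypothetical, and this illustrates seed matching only; it is not SpEED2's seed optimization procedure.

```python
from collections import defaultdict


def seed_key(seq, pos, seed):
    """Characters of seq under the '1' positions of the seed placed at pos."""
    return "".join(seq[pos + i] for i, c in enumerate(seed) if c == "1")


def spaced_seed_hits(a, b, seed):
    """All (i, j) such that a[i:] and b[j:] agree on every '1' position."""
    span = len(seed)
    index = defaultdict(list)
    for i in range(len(a) - span + 1):
        index[seed_key(a, i, seed)].append(i)
    hits = []
    for j in range(len(b) - span + 1):
        for i in index.get(seed_key(b, j, seed), []):
            hits.append((i, j))
    return sorted(hits)


# The two sequences differ only at position 2.
a = "ACGTACGT"
b = "ACTTACGT"
spaced = spaced_seed_hits(a, b, "11*1")      # tolerates the mismatch under '*'
contiguous = spaced_seed_hits(a, b, "1111")  # a plain 4-mer seed does not
```

    Both seeds have the same number of match positions, but only the spaced seed reports a hit at (0, 0), because the mismatch falls under its don't-care position. This is the effect that sensitive seed design tries to exploit.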

    Speeding up index construction with GPU for DNA data sequences

    Advances in technology across the scientific community have produced terabytes of biological data, including DNA sequences. String matching algorithms, traditionally used to match DNA sequences, now take much longer to execute because of the large size of DNA data and the small size of the alphabet. To overcome this problem, indexing methods such as suffix arrays and suffix trees have been introduced. In this study we use suffix arrays as the index because they are more widely applicable, less complex, and use less space than suffix trees. A parallel method is then introduced to speed up index construction: a graphics processing unit (GPU) is used to parallelize the sorting part of the suffix array construction algorithm. Our results show that the GPU accelerates construction of the suffix array index by a factor of 1.68 compared with construction without a GPU.

    Suffix Arrays Construction and Their Use in Bioinformatics

    This work describes a promising data structure called the suffix array. The data structure is described in detail, and the work also presents a taxonomy of suffix array construction algorithms. Several construction algorithms are described, with the most space devoted to the algorithm called qsufsort. Finally, the work shows how the suffix array can be used in practice for exact (binary search) and approximate (QUASAR method) pattern matching in DNA sequences.
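
    As an illustration of exact matching by binary search over a suffix array, the sketch below builds the array by plainly sorting suffixes and then locates the range of suffixes that start with the pattern. The naive construction is a stand-in chosen for brevity; it is not qsufsort, and the function names are assumptions.

```python
def build_suffix_array(text):
    """Naive construction: sort suffix start positions by the suffix itself.

    Quadratic or worse on long inputs; shown only to keep the example short.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Start positions of pattern in text, found by two binary searches."""
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[first:lo])


dna = "GATTACAGATTACA"
sa = build_suffix_array(dna)
positions = find_occurrences(dna, sa, "TTA")
```

    All suffixes that begin with the pattern are adjacent in the suffix array, so each query needs O(log n) string comparisons of at most m characters each, where n is the text length and m the pattern length.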

    Detecting genomic indel variants with exact breakpoints in single- and paired-end sequencing data using SplazerS

    Motivation: The reliable detection of genomic variation in resequencing data is still a major challenge, especially for variants larger than a few base pairs. Sequencing reads crossing boundaries of structural variation carry the potential for their identification, but are difficult to map. Results: Here we present a method for ‘split’ read mapping, where prefix and suffix match of a read may be interrupted by a longer gap in the read-to-reference alignment. We use this method to accurately detect medium-sized insertions and long deletions with precise breakpoints in genomic resequencing data. Compared with alternative split mapping methods, SplazerS significantly improves sensitivity for detecting large indel events, especially in variant-rich regions. Our method is robust in the presence of sequencing errors as well as alignment errors due to genomic mutations/divergence, and can be used on reads of variable lengths. Our analysis shows that SplazerS is a versatile tool applicable to unanchored or single-end as well as anchored paired-end reads. In addition, application of SplazerS to targeted resequencing data led to the interesting discovery of a complete, possibly functional gene retrocopy variant. Availability: SplazerS is available from http://www.seqan.de/projects/splazers

    Circular sequence comparison: algorithms and applications

    Background: Sequence comparison is a fundamental step in many important tasks in bioinformatics, from phylogenetic reconstruction to the reconstruction of genomes. Traditional algorithms for measuring approximation in sequence comparison are based on the notions of distance or similarity, and are generally computed through sequence alignment techniques. As circular molecular structure is a common phenomenon in nature, a caveat of the adaptation of alignment techniques for circular sequence comparison is that they are computationally expensive, requiring from super-quadratic to cubic time in the length of the sequences. Results: In this paper, we introduce a new distance measure based on q-grams, and show how it can be applied effectively and computed efficiently for circular sequence comparison. Experimental results, using real DNA, RNA, and protein sequences as well as synthetic data, demonstrate orders-of-magnitude superiority of our approach in terms of efficiency, while maintaining accuracy that is very competitive with the state of the art.
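
    The abstract does not define the new distance measure, so the sketch below uses the classical q-gram distance (the sum of absolute differences between q-gram counts) applied to circular q-gram profiles. It is meant only to show why q-gram profiles suit circular sequences; it is not the measure introduced in the paper, and the function names are assumptions.

```python
from collections import Counter


def circular_qgram_profile(seq, q):
    """Counts of the q-grams read around a circular sequence.

    The first q - 1 characters are appended so that q-grams wrapping
    around the end are included; there are exactly len(seq) of them.
    """
    if not 1 <= q <= len(seq):
        raise ValueError("q must be between 1 and len(seq)")
    extended = seq + seq[:q - 1]
    return Counter(extended[i:i + q] for i in range(len(seq)))


def qgram_distance(p1, p2):
    """Classical q-gram distance between two profiles."""
    return sum(abs(p1[g] - p2[g]) for g in p1.keys() | p2.keys())


x = "ACGTTGCA"
rotated = x[3:] + x[:3]  # the same circular sequence, cut at another point
same = qgram_distance(circular_qgram_profile(x, 3),
                      circular_qgram_profile(rotated, 3))
```

    Every rotation of a circular sequence yields the same circular profile, so the distance between two rotations is zero and no rotation has to be tried explicitly. Computing a profile takes time linear in the sequence length for fixed q, in contrast to the super-quadratic cost of alignment-based comparison mentioned above.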

    The Gapped-Factor Tree

    We present a data structure to index a specific kind of factors (that is, substrings) called gapped-factors. A gapped-factor is a factor containing a gap that is ignored during indexing. The data structure presented is based on the suffix tree and indexes all the gapped-factors of a text with a fixed gap size, and only those. The data structure is constructed online in linear time and space. Such a data structure may play an important role in various pattern matching and motif inference problems, for instance in text filtration.

    Compact q-gram Profiling of Compressed Strings

    We consider the problem of computing the q-gram profile of a string $T$ of size $N$ compressed by a context-free grammar with $n$ production rules. We present an algorithm that runs in $O(N-\alpha)$ expected time and uses $O(n+q+k_{T,q})$ space, where $N-\alpha \leq qn$ is the exact number of characters decompressed by the algorithm and $k_{T,q} \leq N-\alpha$ is the number of distinct q-grams in $T$. This simultaneously matches the current best known time bound and improves the best known space bound. Our space bound is asymptotically optimal in the sense that any algorithm storing the grammar and the q-gram profile must use $\Omega(n+q+k_{T,q})$ space. To achieve this we introduce the q-gram graph that space-efficiently captures the structure of a string with respect to its q-grams, and show how to construct it from a grammar.
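
    For orientation, the sketch below computes a q-gram profile directly from a grammar without decompressing the whole string. It uses a simpler boundary-counting idea rather than the q-gram graph of the paper: for each rule X -> Y Z, only the q-grams that straddle the boundary between Y and Z are counted, weighted by how often X occurs in the derivation. The grammar encoding (a dict listed bottom-up, with binary rules as tuples) and all names are assumptions made for this example.

```python
from collections import Counter


def expand(rules, name):
    """Fully decompress a nonterminal (used only for checking small cases)."""
    rhs = rules[name]
    if isinstance(rhs, str):
        return rhs
    return expand(rules, rhs[0]) + expand(rules, rhs[1])


def qgram_profile_slp(rules, start, q):
    """q-gram counts of the string derived from `start`.

    Assumption: `rules` maps a name to a single character (terminal rule)
    or to a (left, right) pair, and is listed bottom-up, so every rule
    appears after the rules it refers to.
    """
    k = q - 1
    order = list(rules)

    def tail(s):
        return s[len(s) - min(len(s), k):]  # last k characters, safe for k == 0

    # Bottom-up: prefix and suffix of length at most q - 1 for every rule.
    prefix, suffix = {}, {}
    for name in order:
        rhs = rules[name]
        if isinstance(rhs, str):
            prefix[name], suffix[name] = rhs[:k], tail(rhs)
        else:
            left, right = rhs
            prefix[name] = (prefix[left] + prefix[right])[:k]
            suffix[name] = tail(suffix[left] + suffix[right])

    # Top-down: number of occurrences of each rule in the derivation tree.
    occ = Counter({start: 1})
    profile = Counter()
    for name in reversed(order):
        if occ[name] == 0:
            continue  # rule is not reachable from the start symbol
        rhs = rules[name]
        if isinstance(rhs, str):
            if q == 1:
                profile[rhs] += occ[name]
            continue
        left, right = rhs
        occ[left] += occ[name]
        occ[right] += occ[name]
        # Every q-gram of this short string crosses the left/right boundary.
        boundary = suffix[left] + prefix[right]
        for i in range(len(boundary) - q + 1):
            profile[boundary[i:i + q]] += occ[name]
    return profile


# Hypothetical grammar deriving "abaababa".
rules = {"A": "a", "B": "b", "X1": ("A", "B"), "X2": ("X1", "A"),
         "X3": ("X2", "X1"), "X4": ("X3", "X2")}
profile = qgram_profile_slp(rules, "X4", 2)
```

    Each rule contributes a boundary string of at most 2(q - 1) characters, so this sketch touches O(qn) characters in total, which is consistent with the bound $N-\alpha \leq qn$ quoted in the abstract. It makes no attempt to avoid recounting repeated boundary strings or to meet the paper's space bound.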

    Indices and Applications in High-Throughput Sequencing

    Recent advances in sequencing technology make it possible to produce billions of base pairs per day in the form of reads of length 100 bp and longer, and current developments promise the personal $1,000 genome in a couple of years. The analysis of these unprecedented amounts of data demands efficient data structures and algorithms. One such data structure is the substring index, which represents all substrings, or all substrings up to a certain length, contained in a given text. In this thesis we first propose three substring indices, which we extend to be applicable to millions of sequences. We devise internal and external memory construction algorithms and a uniform framework for accessing the generalized suffix tree. Additionally, we propose different index-based applications, e.g. exact and approximate pattern matching and different repeat search algorithms. Second, we present the read mapping tool RazerS, which aligns millions of single or paired-end reads of arbitrary lengths to their potential genomic origin using either Hamming or edit distance. Our tool can work either losslessly or with a user-defined loss rate at higher speeds. Given the loss rate, we present a novel approach that guarantees not to lose more reads than specified. This enables the user to adapt to the problem at hand and provides a seamless tradeoff between sensitivity and running time. We compare RazerS with other state-of-the-art read mappers and show that it has the highest sensitivity and comparable performance on various real-world datasets. Finally, we propose a general approach for frequency-based string mining, which has many applications, e.g. in contrast data mining. Our contribution is a novel and lightweight algorithm that is faster and uses less memory than the best available algorithms. We show its applicability for mining multiple databases with a variety of frequency constraints. To that end, we use the notion of entropy from information theory to generalize the emerging substring mining problem to multiple databases. To demonstrate the improvement of our algorithm, we compared it to recent approaches in real-world experiments on various string domains, e.g. natural language, DNA, or protein sequences.