Search CORE

1,142 research outputs found

Full-fledged Real-Time Indexing for Constant Size Alphabets

Author: Kucherov Gregory
Nekrich Yakov
Publication venue
Publication date: 06/07/2013
Field of study

In this paper we describe a data structure that supports pattern matching queries on a dynamically arriving text over an alphabet ofconstant size. Each new symbol can be prepended to

T

in O(1) worst-case time. At any moment, we can report all occurrences of a pattern

P

in the current text in

O(|P|+k)

time, where

|P|

is the length of

P

and

k

is the number of occurrences. This resolves, under assumption of constant-size alphabet, a long-standing open problem of existence of a real-time indexing method for string matching (see \cite{AmirN08})

arXiv.org e-Print Archive

HAL Descartes

Hal-Diderot

HAL-Ecole des Ponts ParisTech

HAL - UPEC / UPEM

Reconsidering the significance of genomic word frequency

Author: Csűrös Miklós
Kucherov Gregory
Noé Laurent
Publication venue
Publication date: 14/09/2006
Field of study

We propose that the distribution of DNA words in genomic sequences can be primarily characterized by a double Pareto-lognormal distribution, which explains lognormal and power-law features found across all known genomes. Such a distribution may be the result of completely random sequence evolution by duplication processes. The parametrization of genomic word frequencies allows for an assessment of significance for frequent or rare sequence motifs

arXiv.org e-Print Archive

CiteSeerX

HAL - Lille 3

INRIA a CCSD electronic archive server

Estimating seed sensitivity on homogeneous alignments

Author: Kucherov Gregory
Noe Laurent
Ponty Yann
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2003
Field of study

We address the problem of estimating the sensitivity of seed-based similarity search algorithms. In contrast to approaches based on Markov models [18, 6, 3, 4, 10], we study the estimation based on homogeneous alignments. We describe an algorithm for counting and random generation of those alignments and an algorithm for exact computation of the sensitivity for a broad class of seed strategies. We provide experimental results demonstrating a bias introduced by ignoring the homogeneousness condition

arXiv.org e-Print Archive

HAL-CentraleSupelec

CiteSeerX

Crossref

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot

HAL-Rennes 1

RNF: a general framework to evaluate NGS read mappers

Author: Boeva Valentina
Břinda Karel
Kucherov Gregory
Publication venue: 'Oxford University Press (OUP)'
Publication date: 02/04/2015
Field of study

Aligning reads to a reference sequence is a fundamental step in numerous bioinformatics pipelines. As a consequence, the sensitivity and precision of the mapping tool, applied with certain parameters to certain data, can critically affect the accuracy of produced results (e.g., in variant calling applications). Therefore, there has been an increasing demand of methods for comparing mappers and for measuring effects of their parameters. Read simulators combined with alignment evaluation tools provide the most straightforward way to evaluate and compare mappers. Simulation of reads is accompanied by information about their positions in the source genome. This information is then used to evaluate alignments produced by the mapper. Finally, reports containing statistics of successful read alignments are created. In default of standards for encoding read origins, every evaluation tool has to be made explicitly compatible with the simulator used to generate reads. In order to solve this obstacle, we have created a generic format RNF (Read Naming Format) for assigning read names with encoded information about original positions. Futhermore, we have developed an associated software package RNF containing two principal components. MIShmash applies one of popular read simulating tools (among DwgSim, Art, Mason, CuReSim etc.) and transforms the generated reads into RNF format. LAVEnder evaluates then a given read mapper using simulated reads in RNF format. A special attention is payed to mapping qualities that serve for parametrization of ROC curves, and to evaluation of the effect of read sample contamination

arXiv.org e-Print Archive

HAL-Ecole des Ponts ParisTech

HAL - UPEC / UPEM

On the combinatorics of suffix arrays

Author: Kucherov Gregory
Tóthmérész Lilla
Vialette Stéphane
Publication venue
Publication date: 18/06/2012
Field of study

We prove several combinatorial properties of suffix arrays, including a characterization of suffix arrays through a bijection with a certain well-defined class of permutations. Our approach is based on the characterization of Burrows-Wheeler arrays given in [1], that we apply by reducing suffix sorting to cyclic shift sorting through the use of an additional sentinel symbol. We show that the characterization of suffix arrays for a special case of binary alphabet given in [2] easily follows from our characterization. Based on our results, we also provide simple proofs for the enumeration results for suffix arrays, obtained in [3]. Our approach to characterizing suffix arrays is the first that exploits their relationship with Burrows-Wheeler permutations

arXiv.org e-Print Archive

Crossref

HAL Descartes

Hal-Diderot

HAL-Ecole des Ponts ParisTech

HAL - UPEC / UPEM

Spaced seeds improve k-mer-based metagenomic classification

Author: Brinda Karel
Kucherov Gregory
Sykulski Maciej
Publication venue: 'Oxford University Press (OUP)'
Publication date: 09/07/2015
Field of study

Metagenomics is a powerful approach to study genetic content of environmental samples that has been strongly promoted by NGS technologies. To cope with massive data involved in modern metagenomic projects, recent tools [4, 39] rely on the analysis of k-mers shared between the read to be classified and sampled reference genomes. Within this general framework, we show in this work that spaced seeds provide a significant improvement of classification accuracy as opposed to traditional contiguous k-mers. We support this thesis through a series a different computational experiments, including simulations of large-scale metagenomic projects. Scripts and programs used in this study, as well as supplementary material, are available from http://github.com/gregorykucherov/spaced-seeds-for-metagenomics.Comment: 23 page

arXiv.org e-Print Archive

Crossref

HAL Descartes

Hal-Diderot

HAL-Ecole des Ponts ParisTech

HAL - UPEC / UPEM