39 research outputs found

Computing an Evolutionary Ordering is Hard

Author: Bulteau Laurent
Sacomoto Gustavo
Sinaimeri Blerina
Publication date: 24/10/2014

We prove that computing an evolutionary ordering of a family of sets, i.e. an ordering where each set intersects with (but is not included in) the union of the earlier sets, is NP-hard.
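To make the definition in the abstract concrete, here is a minimal sketch that checks whether a given ordering is evolutionary. It only verifies a candidate ordering, which is easy; the NP-hardness result concerns finding such an ordering. The function name `is_evolutionary_ordering` is hypothetical, and the sketch assumes the condition applies from the second set onward, since the first set has no earlier sets to intersect.

```python
from typing import Sequence, Set


def is_evolutionary_ordering(sets: Sequence[Set[int]]) -> bool:
    """Return True if `sets`, in the given order, is an evolutionary ordering.

    Assumption: the first set is exempt, because the union of the sets
    before it is empty and nothing can intersect the empty set.
    """
    seen: Set[int] = set()  # union of all earlier sets
    for index, current in enumerate(sets):
        if index > 0:
            intersects = not seen.isdisjoint(current)  # shares an element
            brings_new = not current <= seen           # not included in the union
            if not (intersects and brings_new):
                return False
        seen |= current
    return True
```

For example, the ordering `{1, 2}, {2, 3}, {3, 4}` passes the check, while `{1, 2}, {3, 4}, {2, 3}` fails because the second set is disjoint from the first.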

arXiv.org e-Print Archive

Crossref

INRIA a CCSD electronic archive server

HAL Descartes

Efficiently listing bounded length st-paths

Author: Rizzi Romeo
Sacomoto Gustavo
Sagot Marie-France
Publication date: 01/01/2014

The problem of listing the $K$ shortest simple (loopless) $st$-paths in a graph has been studied since the early 1960s. For a non-negatively weighted graph with $n$ vertices and $m$ edges, the most efficient solution is an $O(K(mn + n^2 \log n))$ algorithm for directed graphs by Yen and Lawler [Management Science, 1971 and 1972], and an $O(K(m+n \log n))$ algorithm for the undirected version by Katoh et al. [Networks, 1982], both using $O(Kn + m)$ space. In this work, we consider a different parameterization for this problem: instead of bounding the number of $st$-paths output, we bound their length. For the bounded length parameterization, we propose new non-trivial algorithms matching the time complexity of the classic algorithms but using only $O(m+n)$ space. Moreover, we provide a unified framework such that the solutions to both parameterizations (the classic $K$-shortest and the new length-bounded paths) can be seen as two different traversals of the same tree, a Dijkstra-like and a DFS-like traversal, respectively.
Comment: 12 pages, accepted to IWOCA 201
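As a rough illustration of the length-bounded parameterization described in this abstract, the sketch below enumerates simple st-paths whose total weight does not exceed a bound, using a depth-first traversal. It is a simplified illustration, not the paper's algorithm: it prunes using shortest distances to the target computed once on the whole graph, which does not by itself give the delay guarantees claimed in the abstract. The names `distances_to` and `bounded_paths`, and the adjacency-list representation, are assumptions made for this example.

```python
import heapq
from typing import Dict, Iterator, List, Tuple

# Hypothetical representation: vertex -> list of (neighbor, non-negative weight).
Graph = Dict[str, List[Tuple[str, float]]]
INF = float("inf")


def distances_to(graph: Graph, target: str) -> Dict[str, float]:
    """Dijkstra on the reversed graph: shortest distance from each vertex to `target`."""
    reverse: Graph = {}
    for u, edges in graph.items():
        for v, w in edges:
            reverse.setdefault(v, []).append((u, w))
    dist: Dict[str, float] = {target: 0.0}
    heap: List[Tuple[float, str]] = [(0.0, target)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, INF):
            continue  # stale heap entry
        for v, w in reverse.get(u, []):
            nd = d + w
            if nd < dist.get(v, INF):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def bounded_paths(graph: Graph, s: str, t: str, bound: float) -> Iterator[List[str]]:
    """Yield every simple path from `s` to `t` with total weight at most `bound`."""
    to_t = distances_to(graph, t)
    path: List[str] = [s]
    on_path = {s}

    def extend(u: str, length: float) -> Iterator[List[str]]:
        if u == t:
            yield list(path)
            return
        for v, w in graph.get(u, []):
            if v in on_path:
                continue  # keep the path simple
            if length + w + to_t.get(v, INF) > bound:
                continue  # even the shortest completion would exceed the bound
            path.append(v)
            on_path.add(v)
            yield from extend(v, length + w)
            path.pop()
            on_path.discard(v)

    if to_t.get(s, INF) <= bound:
        yield from extend(s, 0.0)
```

Apart from the graph and the distance table, the traversal stores only the current path, so its working memory is linear in the size of the graph. This matches the spirit of the $O(m+n)$ space bound, though not the exact algorithm behind it.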

arXiv.org e-Print Archive

CiteSeerX

INRIA a CCSD electronic archive server

HAL Descartes

Catalogo dei prodotti della ricerca

A polynomial delay algorithm for the enumeration of bubbles with length constraints in directed graphs and its application to the detection of alternative splicing in RNA-seq data

Author: Lacroix Vincent
Sacomoto Gustavo
Sagot Marie-France
Publication date: 30/07/2013

We present a new algorithm for enumerating bubbles with length constraints in directed graphs. This problem arises in transcriptomics, where the question is to identify all alternative splicing events present in a sample of mRNAs sequenced by RNA-seq. This is the first polynomial-delay algorithm for this problem and we show that in practice, it is faster than previous approaches. This enables us to deal with larger instances and therefore to discover novel alternative splicing events, especially long ones, that were previously overlooked by existing methods.
Comment: Peer-reviewed and presented as part of the 13th Workshop on Algorithms in Bioinformatics (WABI2013)

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

Hal-Diderot

Navigating in a sea of repeats in RNA-seq without drowning

Author: Lacroix Vincent
Marchet Camille
Miele Vincent
Sacomoto Gustavo
Sagot Marie-France
Sinaimeri Blerina
Publication date: 01/01/2014

The main challenge in de novo assembly of NGS data is certainly to deal with repeats that are longer than the reads. This is particularly true for RNA-seq data, since coverage information cannot be used to flag repeated sequences, of which transposable elements are one of the main examples. Most transcriptome assemblers are based on de Bruijn graphs and have no clear and explicit model for repeats in RNA-seq data, relying instead on heuristics to deal with them. The results of this work are twofold. First, we introduce a formal model for representing high copy number repeats in RNA-seq data and exploit its properties for inferring a combinatorial characteristic of repeat-associated subgraphs. We show that the problem of identifying in a de Bruijn graph a subgraph with this characteristic is NP-complete. In a second step, we show that in the specific case of a local assembly of alternative splicing (AS) events, we can implicitly avoid such subgraphs. In particular, we designed and implemented an algorithm to efficiently identify AS events that are not included in repeated regions. Finally, we validate our results using synthetic data. We also give an indication of the usefulness of our method on real data.

arXiv.org e-Print Archive

Crossref

INRIA a CCSD electronic archive server

HAL Descartes

A polynomial delay algorithm for the enumeration of bubbles with length constraints in directed graphs

Author: Lacroix Vincent
Sacomoto Gustavo
Sagot Marie-France
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2015

Background: The problem of enumerating bubbles with length constraints in directed graphs arises in transcriptomics where the question is to identify all alternative splicing events present in a sample of mRNAs sequenced by RNA-seq. Results: We present a new algorithm for enumerating bubbles with length constraints in weighted directed graphs. This is the first polynomial delay algorithm for this problem and we show that in practice, it is faster than previous approaches. Conclusion: This settles one of the main open questions from Sacomoto et al. (BMC Bioinform 13:5, 2012). Moreover, the new algorithm allows us to deal with larger instances and possibly detect longer alternative splicing events.

Crossref

INRIA a CCSD electronic archive server

PubMed Central

Hal-Diderot

A polynomial delay algorithm for the enumeration of bubbles with length constraints in directed graphs

Author: A Dobin
A Mortazavi
E Wang
G Robertson
GAT Sacomoto
Gustavo Sacomoto
Marie-France Sagot
MG Grabherr
MH Schulz
MR Bussieck
P Flicek
RK Ahuja
TH Cormen
Vincent Lacroix
Publication venue: 'Springer Science and Business Media LLC'

Crossref

Computing and Listing st-Paths in Subway Networks

Author: Böhmová Kateřina
Mihalák Matúš
Pröger Tobias
Sacomoto Gustavo
Sagot Marie-France
Widmayer Peter
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2016

Given a set of paths (called lines) L, a subway network is a graph G_L = (V, A) where V and A contain exactly the vertices and arcs of every line l ∈ L. An st-route is a pair (π, γ) where γ = (l_1, ..., l_h) is a line sequence and π is an st-path in G_L which is the concatenation of subpaths of the lines l_1, ..., l_h, in this order. We study three related problems concerning traveling from s to t in G_L. We present an efficient (i.e., polynomial-time) algorithm for computing an st-route (π, γ) where |γ| (i.e., the number of line changes plus one) is minimum among all st-routes. We show that for the problem of finding an st-route (π, γ) that minimizes the number of different lines in γ, even computing an o(log |V|)-approximation is NP-hard. Finally, given an integer β, we present an algorithm for enumerating all st-paths π for which a route (π, γ) with |γ| ≤ β exists, and show that the running time of this algorithm is polynomial with respect to both input and output size.

Maastricht University Research Portal

Crossref

INRIA a CCSD electronic archive server

HAL Descartes

Playing hide and seek with repeats in local and global de novo transcriptome assembly of short RNA-seq reads

Author: Lacroix Vincent
Lima Leandro
Lopez-Maestre Helene
Marchet Camille
Miele Vincent
Sacomoto Gustavo
Sagot Marie-France
Sinaimeri Blerina
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2017

Background: The main challenge in de novo genome assembly of DNA-seq data is certainly to deal with repeats that are longer than the reads. In de novo transcriptome assembly of RNA-seq reads, on the other hand, this problem has been underestimated so far. Even though we have fewer and shorter repeated sequences in transcriptomics, they do create ambiguities and confuse assemblers if not addressed properly. Most transcriptome assemblers of short reads are based on de Bruijn graphs (DBG) and have no clear and explicit model for repeats in RNA-seq data, relying instead on heuristics to deal with them. Results: The results of this work are threefold. First, we introduce a formal model for representing high copy-number and low-divergence repeats in RNA-seq data and exploit its properties to infer a combinatorial characteristic of repeat-associated subgraphs. We show that the problem of identifying such subgraphs in a DBG is NP-complete. Second, we show that in the specific case of local assembly of alternative splicing (AS) events, we can implicitly avoid such subgraphs, and we present an efficient algorithm to enumerate AS events that are not included in repeats. Using simulated data, we show that this strategy is significantly more sensitive and precise than the previous version of KisSplice (Sacomoto et al. in WABI, pp 99–111, 1), Trinity (Grabherr et al. in Nat Biotechnol 29(7):644–652, 2), and Oases (Schulz et al. in Bioinformatics 28(8):1086–1092, 3), for the specific task of calling AS events. Third, we turn our focus to full-length transcriptome assembly, and we show that exploring the topology of DBGs can improve de novo transcriptome evaluation methods. Repeats create complicated regions in a DBG, and assemblers that try to traverse these regions can infer erroneous transcripts. Based on this observation, we propose a measure to flag transcripts traversing such troublesome regions, thereby giving a confidence level for each transcript. The originality of our work when compared to other transcriptome evaluation methods is that we use only the topology of the DBG, and neither read nor coverage information. We show that our simple method gives better results than Rsem-Eval (Li et al. in Genome Biol 15(12):553, 4) and TransRate (Smith-Unna et al. in Genome Res 26(8):1134–1144, 5) on both real and simulated datasets for detecting chimeras, and therefore is able to capture assembly errors missed by these methods.

Crossref

INRIA a CCSD electronic archive server

PubMed Central

Archivio della ricerca- LUISS Libera Università Internazionale degli Studi Sociali Guido Carli di Roma

Hal-Diderot

HAL-Rennes 1

Colib'read on galaxy : a tools suite dedicated to biological information extraction from raw NGS reads

Author: Alves-Carvalho Susete
Andrieux Alexan
Cazaux Bastien
Collin Olivier
El Aabidine Amal Zine
Lacroix Vincent
Le Bras Yvan
Lemaitre Claire
Marchet Camille
Miele Vincent
Monjeaud Cyril
Peterlongo Pierre
Rivals Eric
Sacomoto Gustavo
Salmela Leena
Uricaru Raluca
Publication date: 01/02/2016

Background: With next-generation sequencing (NGS) technologies, the life sciences face a deluge of raw data. Classical analysis processes for such data often begin with an assembly step, needing large amounts of computing resources, and potentially removing or modifying parts of the biological information contained in the data. Our approach proposes to focus directly on biological questions, by considering raw unassembled NGS data, through a suite of six command-line tools. Findings: Dedicated to 'whole-genome assembly-free' treatments, the Colib'read tools suite uses optimized algorithms for various analyses of NGS datasets, such as variant calling or read set comparisons. Based on the use of a de Bruijn graph and Bloom filter, such analyses can be performed in a few hours, using small amounts of memory. Applications using real data demonstrate the good accuracy of these tools compared to classical approaches. To facilitate data analysis and tools dissemination, we developed Galaxy tools and tool shed repositories. Conclusions: With the Colib'read Galaxy tools suite, we enable a broad range of life scientists to analyze raw NGS data. More importantly, our approach allows the maximum biological information to be retained in the data, and uses a very low memory footprint. Peer reviewed

HAL-CentraleSupelec

Crossref

Springer - Publisher Connector

INRIA a CCSD electronic archive server

PubMed Central

Helsingin yliopiston digitaalinen arkisto

Hal-Diderot

Oskar Bordeaux

HAL-Rennes 1