
    Multiple Comparative Metagenomics using Multiset k-mer Counting

    Background. Large-scale metagenomic projects aim to extract biodiversity knowledge across different environmental conditions. Current methods for comparing microbial communities face important limitations. Those based on taxonomic or functional assignment rely on the small subset of sequences that can be associated with known organisms. De novo methods, which compare the whole sets of sequences, either do not scale up to ambitious metagenomic projects or do not provide precise and exhaustive results. Methods. These limitations motivated the development of a new de novo metagenomic comparative method, called Simka. This method computes a large collection of standard ecological distances by replacing species counts with k-mer counts. Simka scales up to today's metagenomic projects thanks to a new parallel k-mer counting strategy applied to multiple datasets. Results. Experiments on public Human Microbiome Project datasets demonstrate that Simka captures the essential underlying biological structure. Simka was able to compute both qualitative and quantitative ecological distances on hundreds of metagenomic samples (690 samples, 32 billion reads) in a few hours. We also demonstrate that analyzing metagenomes at the k-mer level is highly correlated with highly precise de novo comparison techniques that rely on all-versus-all sequence alignment or on taxonomic profiling.
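
    To illustrate the principle of replacing species counts with k-mer counts, here is a minimal Python sketch, not Simka's implementation (which relies on a parallel multi-dataset k-mer counting strategy to reach this scale): it builds k-mer abundance profiles for two toy samples and computes a quantitative Bray-Curtis dissimilarity, one of the standard ecological distances mentioned above. The k-mer size and sample data are purely illustrative.

```python
from collections import Counter
from itertools import combinations

K = 21  # illustrative k-mer size, not necessarily Simka's default

def kmer_counts(reads, k=K):
    """Count all k-mers occurring in a collection of read sequences."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def bray_curtis(counts_a, counts_b):
    """Quantitative Bray-Curtis dissimilarity between two k-mer abundance profiles."""
    shared = sum(min(counts_a[kmer], counts_b[kmer]) for kmer in counts_a.keys() & counts_b.keys())
    total = sum(counts_a.values()) + sum(counts_b.values())
    return 1.0 - 2.0 * shared / total if total else 0.0

# Toy read sets standing in for two metagenomic samples
samples = {
    "sample_A": ["ACGTACGTACGTACGTACGTACGTACGT", "TTGCAATTGCAATTGCAATTGCAATTGC"],
    "sample_B": ["ACGTACGTACGTACGTACGTACGTACGT", "GGGCCCGGGCCCGGGCCCGGGCCCGGGC"],
}
profiles = {name: kmer_counts(reads) for name, reads in samples.items()}
for a, b in combinations(sorted(profiles), 2):
    print(a, b, round(bray_curtis(profiles[a], profiles[b]), 3))
```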

    MindTheGap: integrated detection and assembly of short and long insertions

    Motivation: Insertions play an important role in genome evolution. However, such variants are difficult to detect from short-read sequencing data, especially when they exceed the paired-end insert size. Many approaches have been proposed to call short insertion variants based on paired-end mapping. However, there remains a lack of practical methods to detect and assemble long variants. Results: We propose here an original method, called MINDTHEGAP, for the integrated detection and assembly of insertion variants from re-sequencing data. Importantly, it is designed to call insertions of any size, whether they are novel or duplicated, homozygous or heterozygous in the donor genome. MINDTHEGAP uses an efficient k-mer-based method to detect insertion sites in a reference genome, and subsequently assemble them from the donor reads. MINDTHEGAP showed high recall and precision on simulated datasets of various genome complexities. When applied to real C. elegans and human NA12878 datasets, MINDTHEGAP detected and correctly assembled insertions longer than 1 kb, using at most 14 GB of memory. Availability: http://mindthegap.genouest.org
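
    The k-mer-based detection step can be pictured as follows: for a clean homozygous insertion, the reference k-mers that span the breakpoint are absent from the donor reads, so a run of roughly k-1 consecutive absent reference k-mers marks a candidate insertion site. The Python sketch below is a simplified illustration of that idea, not MINDTHEGAP's actual algorithm; the tolerance value and the toy sequences are assumptions.

```python
def detect_insertion_sites(reference, donor_kmers, k):
    """Scan the reference for runs of k-mers that are absent from the donor reads.

    A run of about k-1 consecutive absent reference k-mers flags a candidate
    insertion breakpoint (simplified model, homozygous insertions only).
    """
    absent = [reference[i:i + k] not in donor_kmers
              for i in range(len(reference) - k + 1)]
    sites, run_start = [], None
    for i, missing in enumerate(absent + [False]):   # sentinel closes a trailing run
        if missing and run_start is None:
            run_start = i
        elif not missing and run_start is not None:
            if abs((i - run_start) - (k - 1)) <= 2:  # tolerate small deviations
                sites.append(run_start + k - 1)      # approximate insertion position
            run_start = None
    return sites

# Toy example: the donor carries a 7 bp insertion at reference position 12
k = 5
reference = "ACGTACGTTTCCAAGGTACGATCG"
donor = "ACGTACGTTTCC" + "GGGGGGG" + "AAGGTACGATCG"
donor_kmers = {donor[i:i + k] for i in range(len(donor) - k + 1)}
print(detect_insertion_sites(reference, donor_kmers, k))  # -> [12]
```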

    SVJedi-graph: Structural Variant genotyping with long-reads using a variation graph

    Structural variants (SVs) are genomic segments of more than 50 bp that have been rearranged in the genome. The advent of third-generation sequencing technologies has increased and enhanced their study, and a great number of SVs have already been discovered in the human genome. Complementary to their discovery, the genotyping of known SVs in newly sequenced individuals is of particular interest for several applications such as trait association and clinical diagnosis. Most of the SV genotypers currently available are designed for second-generation sequencing data, although third-generation sequencing data are better suited to the study of SVs due to their large size range (up to a few megabases). As such, our team previously released SVJedi, the first SV genotyper dedicated to long-read data [1]. The method is based on linear representations of the allelic sequences of each SV, and each SV is represented and genotyped independently of the others. While this is very efficient for distant SVs, the method fails to genotype some closely located or overlapping SVs due to redundancy in the representative allelic sequences.
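
    As an illustration of this linear representation, the Python sketch below builds the two allelic sequences of a deletion with flanking context on each side; the flank size is a purely illustrative parameter, not SVJedi's exact setting. Long reads are then mapped to both sequences and the genotype is inferred from allele support; because each SV is represented independently, overlapping SVs lead to the redundant allelic sequences mentioned above.

```python
def deletion_alleles(reference, del_start, del_end, flank=5000):
    """Build the two linear allelic sequences of a deletion SV.

    The reference allele keeps the deleted segment with `flank` bases of
    context on each side; the alternative allele joins the two flanks
    directly. `flank` is an illustrative value.
    """
    left = reference[max(0, del_start - flank):del_start]
    right = reference[del_end:del_end + flank]
    ref_allele = left + reference[del_start:del_end] + right
    alt_allele = left + right
    return ref_allele, alt_allele
```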

    SVJedi-graph: genotyping close and overlapping structural variants with a variation graph and long-reads

    Structural variants (SVs) are genomic segments of more than 50 bp that have been rearranged in the genome. The advent of long-read sequencing technologies has increased and enhanced their study, and a great number of SVs have already been discovered in many species. Complementary to their discovery, the genotyping of known SVs in newly sequenced individuals is of particular interest for several applications such as trait association and clinical diagnosis. Due to SVs' large size range (up to a few megabases), long reads are better suited to their study than short reads. As such, our team previously released SVJedi [1], one of the first SV genotypers using long-read data. By representing both allelic sequences of each SV independently, SVJedi reduced reference bias in genotyping and showed improved genotyping performance. However, the method failed to genotype closely located or overlapping SVs due to redundancy in the representative allelic sequences.

    To overcome this limitation, we present SVJedi-graph, a long-read SV genotyper based on a variation graph representing the SV alleles. The use of sequence graphs to represent SVs for genotyping is fairly recent [2,3,4,5], but existing methods are restricted to short-read data, and SVJedi-graph is the first graph-based SV genotyper using long reads. In our method, we build the variation graph from a reference genome and a given set of SVs. The genome sequence is split into fragments at each SV's start and end positions, and each fragment becomes a node in the graph. Edges are added between nodes to indicate the reference and alternative paths of each SV, and additional nodes are added for insertions. Then, the long reads are mapped on the variation graph using GraphAligner [6] and the resulting alignments are filtered on their quality and mapping location. Finally, the most likely genotype of each SV is predicted from the ratio between the numbers of reads supporting each allele.

    SVJedi-graph currently genotypes four SV types, namely deletions, insertions, inversions and translocations. Running SVJedi-graph on simulated sets of deletions showed that the variation graph restores genotyping quality on close and overlapping SVs. For instance, on a simulated set of deletions in which each deletion had another deletion 0 to 50 bp away, we obtained a genotyping rate (proportion of SVs with a predicted genotype) of 99.9% and an accuracy (proportion of accurate genotypes among all predicted genotypes) of 99.0%, compared to a genotyping rate of 78.9% and an accuracy of 97.3% with SVJedi on the same dataset. We also tested our method on the real gold-standard dataset from Genome In A Bottle (human individual HG002), and obtained a higher genotyping rate than SVJedi on the same data (97.4% against 90.2%), with a similar or slightly better accuracy (92.9% against 92.2%). SVJedi-graph is distributed under an AGPL license and available on GitHub at https://github.com/SandraLouise/SVJedi-graph
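
    The graph construction and the ratio-based genotyping can be sketched in Python as follows. This is a toy reconstruction of the idea, restricted to deletions and using illustrative thresholds; it does not reproduce SVJedi-graph's actual data structures or parameters.

```python
def build_deletion_graph(reference, deletions):
    """Build a toy variation graph from a reference sequence and deletions.

    `deletions` is a list of (start, end) coordinates on the reference. The
    reference is split into fragments at every breakpoint; each fragment
    becomes a node, consecutive fragments are linked by reference edges, and
    one extra edge per deletion skips the deleted fragment(s).
    """
    breakpoints = sorted({0, len(reference)} | {p for s, e in deletions for p in (s, e)})
    nodes = {i: reference[a:b] for i, (a, b) in enumerate(zip(breakpoints, breakpoints[1:]))}
    node_at = {a: i for i, a in enumerate(breakpoints[:-1])}   # start position -> node id
    edges = {(i, i + 1) for i in range(len(nodes) - 1)}        # reference path
    for s, e in deletions:                                     # alternative (deletion-skipping) edges
        if s > 0 and e < len(reference):                       # ignore SVs touching sequence ends, for simplicity
            edges.add((node_at[s] - 1, node_at[e]))
    return nodes, edges

def genotype(ref_support, alt_support, min_reads=3, hom_threshold=0.8):
    """Predict a genotype from per-allele read counts (thresholds are illustrative)."""
    total = ref_support + alt_support
    if total < min_reads:
        return "./."                     # not enough informative reads
    ratio = alt_support / total
    if ratio >= hom_threshold:
        return "1/1"                     # homozygous for the alternative allele
    if ratio <= 1 - hom_threshold:
        return "0/0"                     # homozygous for the reference allele
    return "0/1"                         # heterozygous
```

    Because two overlapping deletions simply add two skipping edges over the same shared nodes, no allelic sequence is duplicated, which is what removes the redundancy that limited the linear representation.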

    LEVIATHAN: efficient discovery of large structural variants by leveraging long-range information from Linked-Reads data

    Linked-Reads technologies, popularized by 10x Genomics, combine the high quality and low cost of short-read sequencing with long-range information, by adding barcodes that tag reads originating from the same long DNA fragment. Thanks to this high quality and long-range information, such reads are particularly useful for various applications such as genome scaffolding and structural variant calling. As a result, multiple structural variant calling methods have been developed within the last few years. However, these methods were mainly tested on human data, and do not run well on non-human organisms, for which reference genomes are highly fragmented or sequencing data display high levels of heterozygosity. Moreover, even on human data, most tools still require large amounts of computing resources. We present LEVIATHAN, a new structural variant calling tool that aims to address these issues, and especially to scale better and apply to a wide variety of organisms. Our method relies on a barcode index that allows quick comparison of all possible pairs of regions in terms of the number of barcodes they share. Region pairs sharing a sufficient number of barcodes are then considered as potential structural variants, and complementary, classical short-read methods are applied to refine the breakpoint coordinates. Our experiments on simulated data underline that our method compares well to the state of the art, both in terms of recall and precision and in terms of resource consumption. Moreover, LEVIATHAN was successfully applied to a real dataset from a non-model organism, while all other tools either failed to run or required unreasonable amounts of resources. LEVIATHAN is implemented in C++, supported on Linux platforms, and available under the AGPL-3.0 license at https://github.com/morispi/LEVIATHAN
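
    The barcode-index idea can be illustrated with a short Python sketch: each barcode is associated with the genomic windows it covers, and distant window pairs sharing many barcodes are reported as candidate structural-variant signals. The window size, thresholds and input format below are illustrative placeholders, not LEVIATHAN's actual implementation.

```python
from collections import defaultdict
from itertools import combinations

def shared_barcode_pairs(alignments, window=10_000, min_shared=20):
    """Report pairs of distant genomic windows sharing many barcodes.

    `alignments` is an iterable of (barcode, chrom, position) tuples.
    """
    barcode_windows = defaultdict(set)
    for barcode, chrom, pos in alignments:
        barcode_windows[barcode].add((chrom, pos // window))

    pair_counts = defaultdict(int)
    for windows in barcode_windows.values():
        for a, b in combinations(sorted(windows), 2):
            pair_counts[(a, b)] += 1

    # Keep pairs that share enough barcodes and are not simply adjacent windows
    return {pair: n for pair, n in pair_counts.items()
            if n >= min_shared
            and (pair[0][0] != pair[1][0] or abs(pair[0][1] - pair[1][1]) > 1)}
```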

    Décoder le génome : vers la compréhension du fonctionnement du SARS-CoV-2

    A long string of letters: this is how a genome, such as that of SARS-CoV-2, is represented. But how can we make sense of this cryptic succession of A, C, G and T? Where are the genes? What roles do they play? Bioinformatics tools make it possible to build on the knowledge acquired on other coronaviruses and transfer it to SARS-CoV-2.

    LRez: C++ API and toolkit for analyzing and managing Linked-Reads data

    Linked-Reads technologies, such as 10x Genomics, Haplotagging, stLFR and TELL-Seq, partition and tag high-molecular-weight DNA molecules with a barcode using a microfluidic device prior to classical short-read sequencing. In this way, Linked-Reads combine the high quality of short reads with long-range information, which can be inferred by using the barcodes to identify distant reads belonging to the same DNA molecule. This technology can thus be efficiently employed in various applications, such as structural variant calling, but also genome assembly, phasing and scaffolding. To benefit from Linked-Reads data, most methods first map the reads against a reference genome, and then rely on the analysis of the barcode contents of genomic regions, often requiring fetching all reads or alignments carrying a given barcode. However, although various tools and libraries are available for processing BAM files, to the best of our knowledge, no such tool currently exists for managing Linked-Reads barcodes and offering features such as indexing, querying, and comparison of barcode contents. LRez aims to address this issue by providing a complete and easy-to-use API and suite of tools that are directly compatible with various Linked-Reads sequencing technologies. LRez provides various functionalities, such as extracting, indexing and querying Linked-Reads barcodes in BAM, FASTQ, and gzipped FASTQ files (Table 1). The API is compiled as a shared library, facilitating its integration into external projects. Moreover, all functionalities are implemented in a thread-safe fashion. Our experiments show that, on a 70 GB Haplotagging BAM file from Heliconius erato [1], index construction took an hour and resulted in an index occupying 11 GB of RAM. Using this index, querying time per barcode averaged 11 ms, whereas a naive approach without a barcode-based index required about an hour per barcode query.
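
    For illustration, the following Python sketch (using pysam, not LRez's C++ API) shows the kind of barcode indexing and querying LRez provides: it maps each BX barcode of a BAM file to the alignments that carry it and answers per-barcode queries. Unlike this in-memory toy, LRez indexes file offsets so that queries do not require loading the whole file.

```python
from collections import defaultdict
import pysam

def build_barcode_index(bam_path):
    """Map each BX barcode to the (chrom, position) pairs of alignments carrying it."""
    index = defaultdict(list)
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for aln in bam.fetch(until_eof=True):
            if aln.has_tag("BX"):
                index[aln.get_tag("BX")].append((aln.reference_name, aln.reference_start))
    return index

def query_barcode(index, barcode):
    """Return all alignment positions associated with a given barcode."""
    return index.get(barcode, [])
```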

    Comment la bioinformatique a résolu le puzzle du génome du SARS-CoV-2

    Knowing the genome of SARS-CoV-2 was a fundamental step in the fight against the Covid-19 epidemic. It made it possible to quickly identify its proteins, develop tests, study its origin, follow its evolution, and so on. But how, starting from a simple swab covered with a variety of organisms, can we determine the genome of the virus of interest? Bioinformatics offers methods well suited to doing this very efficiently.

    Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph

    The data volumes generated by next-generation sequencing (NGS) technologies are now a major concern for both data storage and transmission. This has triggered the need for more efficient methods than general-purpose compression tools such as the widely used gzip.

    We present a novel reference-free method for compressing data produced by high-throughput sequencing technologies. Our approach, implemented in the software LEON, employs techniques derived from existing assembly principles. The method is based on a reference probabilistic de Bruijn graph, built de novo from the set of reads and stored in a Bloom filter. Each read is encoded as a path in this graph, by memorizing an anchoring k-mer and a list of bifurcations. The same probabilistic de Bruijn graph is used to perform a lossy transformation of the quality scores, which yields higher compression rates without losing information pertinent to downstream analyses.

    LEON was run on various real sequencing datasets (whole genome, exome, RNA-seq and metagenomics). In all cases, LEON achieved higher overall compression ratios than state-of-the-art compression software. On a C. elegans whole-genome sequencing dataset, LEON divided the original file size by more than 20. LEON is an open-source software, distributed under the GNU Affero GPL license, and available for download at http://gatb.inria.fr/software/leon/
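
    The core encoding idea can be sketched as follows: store the k-mers of the read set in a Bloom filter (the probabilistic de Bruijn graph) and encode each read as an anchoring k-mer plus the nucleotide chosen at each bifurcation, so the decoder can rebuild the read by walking the same graph. The Python sketch below is a toy illustration with simplified hashing, filter sizing and encoding choices, not LEON's implementation.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter storing k-mers (sizes and hashing are simplified)."""
    def __init__(self, size=1 << 20, n_hashes=3):
        self.size, self.n_hashes = size, n_hashes
        self.bits = bytearray(size // 8 + 1)

    def _positions(self, item):
        for i in range(self.n_hashes):
            h = int.from_bytes(hashlib.blake2b(f"{i}{item}".encode()).digest()[:8], "big")
            yield h % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

def encode_read(read, kmers, k):
    """Encode a read as (anchor k-mer, list of choices made at graph bifurcations)."""
    anchor, node, bifurcations = read[:k], read[:k], []
    for next_base in read[k:]:
        successors = [b for b in "ACGT" if node[1:] + b in kmers]
        if len(successors) != 1:            # ambiguous step: store the chosen base
            bifurcations.append(next_base)
        node = node[1:] + next_base
    return anchor, bifurcations

def decode_read(anchor, bifurcations, length, kmers, k):
    """Rebuild the read by walking the graph, consuming stored choices at forks."""
    read, node, forks = anchor, anchor, iter(bifurcations)
    while len(read) < length:
        successors = [b for b in "ACGT" if node[1:] + b in kmers]
        base = successors[0] if len(successors) == 1 else next(forks)
        read += base
        node = node[1:] + base
    return read

# Build the filter from all reads, then round-trip one read through the encoding
reads = ["ACGTACGTGCATTGCA", "CGTACGTGCATTGCAA"]
k = 5
bloom = BloomFilter()
for r in reads:
    for i in range(len(r) - k + 1):
        bloom.add(r[i:i + k])
anchor, forks = encode_read(reads[0], bloom, k)
assert decode_read(anchor, forks, len(reads[0]), bloom, k) == reads[0]
```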