30 research outputs found

    MindTheGap: integrated detection and assembly of short and long insertions

    See: http://mindthegap.genouest.org Motivation: Insertions play an important role in genome evolution. However, such variants are difficult to detect from short-read sequencing data, especially when they exceed the paired-end insert size. Many approaches have been proposed to call short insertion variants based on paired-end mapping. However, there remains a lack of practical methods to detect and assemble long variants. Results: We propose here an original method, called MINDTHEGAP, for the integrated detection and assembly of insertion variants from re-sequencing data. Importantly, it is designed to call insertions of any size, whether they are novel or duplicated, homozygous or heterozygous in the donor genome. MINDTHEGAP uses an efficient k-mer based method to detect insertion sites in a reference genome, and subsequently assembles them from the donor reads. MINDTHEGAP showed high recall and precision on simulated datasets of various genome complexities. When applied to real C. elegans and human NA12878 datasets, MINDTHEGAP detected and correctly assembled insertions longer than 1 kb, using at most 14 GB of memory. Availability: http://mindthegap.genouest.or
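
The general idea of comparing reference k-mers with the k-mers seen in donor reads can be illustrated with a toy sketch. This is not the MINDTHEGAP algorithm; the function names `kmers` and `missing_reference_kmers` are hypothetical, the sequences are invented, and reverse complements, sequencing errors and k-mer abundance filtering are all ignored. It only shows that a clean insertion in the donor leaves a run of reference k-mers (those spanning the insertion site) absent from the donor reads.

```python
def kmers(seq, k):
    """Return the set of all k-mers occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def missing_reference_kmers(reference, reads, k):
    """Return start positions of reference k-mers absent from the donor reads.

    Toy assumption: reads are error-free and on the forward strand only.
    """
    donor = set()
    for read in reads:
        donor |= kmers(read, k)
    return [
        i
        for i in range(len(reference) - k + 1)
        if reference[i:i + k] not in donor
    ]


# Invented example: the donor carries the insertion "TATA" after "AAACCC".
reference = "AAACCCGGGTTT"
reads = ["AAACCCTATAGG", "CTATAGGGTTT"]
print(missing_reference_kmers(reference, reads, k=4))  # [3, 4, 5]
```

In this example the k - 1 = 3 missing k-mers are exactly those that overlap the junction between positions 5 and 6 of the reference, which is where the insertion sits in the donor.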

    Improvement of the assembly of heterozygous genomes of non-model organisms, a case study of the genomes of two Spodoptera frugiperda host strains

    The extraction of biological information from the draft genomes of non-model organisms may lead to unattainable, incomplete, or even wrong conclusions. In particular, the combination of a high level of heterozygosity and short-read sequencing may have a major impact on the annotation of genes [1,2]. This incorrect assessment of gene content is usually the consequence of the high fragmentation of the genome sequence, but it may also come from an overestimation of the genome size. The latter arises because the assembly of a heterozygous region with significant divergence between the two haplotypes sometimes leads to the construction of two different contigs instead of one consensus sequence. New assemblers such as Platanus [3] have been developed to handle heterozygous data. However, the complete re-assembly of a genome, which requires new automatic and manual annotation processes, is very costly and may still produce erroneous scaffolds and annotations. Thus, we set up a "soft" method to detect and correct false duplications due to heterozygosity in draft assemblies. In addition to identifying and removing the allelic regions (i.e. unmerged haplotypes), our protocol is able to relocate and merge supernumerary gene annotations. We applied this method as a prerequisite for the comparison of the genomes of 2 Spodoptera frugiperda (Lepidoptera: Noctuidae) strains, within the framework of the WGS project supported by the Fall armyworm International Public Consortium (FAW-IPC). This moth is a well-known pest of crops throughout the Western hemisphere. The species consists of two strains adapted to different larval host plants: the first feeds preferentially on corn, cotton and sorghum, whereas the second is more associated with rice and several pasture grasses.
While the paired-end reads of the rice variant were directly assembled using Platanus [3], we cleaned up and corrected the first release of the corn variant, leading to a drastic reduction of the genome assembly size, with the removal of 88 Mbp (17%) and an increase of the N50 from 39,593 to 52,781 bp. The removed fragments included 3,746 gene predictions; about 80% of them have been either relocated or merged with their complementary allele. Subsequently, to identify new candidate genes or genomic regions involved in host-plant adaptation, we compared the genomes and proteomes of the 2 strains to identify orthologous genes, collinear regions and genome rearrangements, taking into consideration the inflated occurrence of split genes due to the high fragmentation of the genome
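
The N50 statistic quoted above can be computed as in the following minimal sketch, using the standard definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The function name `n50` and the scaffold lengths are illustrative assumptions, not data from the study.

```python
def n50(lengths):
    """Return the N50 of a non-empty collection of sequence lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Invented lengths: removing short redundant scaffolds raises the N50.
before = [100, 80, 60, 10, 10, 10, 10]
after = [100, 80, 60]
print(n50(before), n50(after))  # 80 80
print(n50([2, 3, 4, 5, 6]))     # 5
```

Merging scaffolds by their extremities, as the abstract describes, raises the N50 more directly because it lengthens the sequences at the top of the sorted list.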

    Transcriptome characterization by RNA sequencing identifies a major molecular and clinical subdivision in chronic lymphocytic leukemia

    Chronic lymphocytic leukemia (CLL) has heterogeneous clinical and biological behavior. Whole-genome and -exome sequencing has contributed to the characterization of the mutational spectrum of the disease, but the underlying transcriptional profile is still poorly understood. We have performed deep RNA sequencing in different subpopulations of normal B-lymphocytes and CLL cells from a cohort of 98 patients, and characterized the CLL transcriptional landscape with unprecedented resolution. We detected thousands of transcriptional elements differentially expressed between the CLL and normal B cells, including protein-coding genes, noncoding RNAs, and pseudogenes. Transposable elements are globally derepressed in CLL cells. In addition, two thousand genes, most of which are not differentially expressed, exhibit CLL-specific splicing patterns. Genes involved in metabolic pathways showed higher expression in CLL, while genes related to spliceosome, proteasome, and ribosome were among the most down-regulated in CLL. Clustering of the CLL samples according to RNA-seq derived gene expression levels unveiled two robust molecular subgroups, C1 and C2. C1/C2 subgroups and the mutational status of the immunoglobulin heavy variable (IGHV) region were the only independent variables in predicting time to treatment in a multivariate analysis with main clinico-biological features. This subdivision was validated in an independent cohort of patients monitored through DNA microarrays. Further analysis shows that B-cell receptor (BCR) activation in the microenvironment of the lymph node may be at the origin of the C1/C2 differences

    Extracorporeal Membrane Oxygenation for Severe Acute Respiratory Distress Syndrome associated with COVID-19: An Emulated Target Trial Analysis.

    RATIONALE: Whether COVID-19 patients may benefit from extracorporeal membrane oxygenation (ECMO) compared with conventional invasive mechanical ventilation (IMV) remains unknown. OBJECTIVES: To estimate the effect of ECMO on 90-day mortality vs IMV only. METHODS: Among 4,244 critically ill adult patients with COVID-19 included in a multicenter cohort study, we emulated a target trial comparing the treatment strategies of initiating ECMO vs. no ECMO within 7 days of IMV in patients with severe acute respiratory distress syndrome (PaO2/FiO2 <80 or PaCO2 ≥60 mmHg). We controlled for confounding using a multivariable Cox model based on predefined variables. MAIN RESULTS: 1,235 patients met the full eligibility criteria for the emulated trial, among whom 164 patients initiated ECMO. The ECMO strategy had a higher survival probability at Day-7 from the onset of eligibility criteria (87% vs 83%, risk difference: 4%, 95% CI 0;9%) which decreased during follow-up (survival at Day-90: 63% vs 65%, risk difference: -2%, 95% CI -10;5%). However, ECMO was associated with higher survival when performed in high-volume ECMO centers or in regions where a specific ECMO network organization was set up to handle high demand, and when initiated within the first 4 days of MV and in profoundly hypoxemic patients. CONCLUSIONS: In an emulated trial based on a nationwide COVID-19 cohort, we found differential survival over time of an ECMO compared with a no-ECMO strategy. However, ECMO was consistently associated with better outcomes when performed in high-volume centers and in regions with ECMO capacities specifically organized to handle high demand. This article is open access and distributed under the terms of the Creative Commons Attribution Non-Commercial No Derivatives License 4.0 (http://creativecommons.org/licenses/by-nc-nd/4.0/)

    Identification and correction of genome mis-assemblies due to heterozygosity

    Assembly tools are increasingly efficient at reconstructing a genome from next-generation sequencing data, but some problems remain. One of them is mis-assembly due to heterozygosity. Indeed, the assembly of a heterozygous region with significant divergence between the two haplotypes can lead to the construction of two different contigs instead of one consensus sequence. This problem makes the assembly of a heterozygous genome larger than expected and also causes a loss of information (heterozygous SNPs or indels cannot be found in the erroneous regions). We propose a strategy to detect and correct false duplications in assemblies based on several metrics. We identified two specific cases highlighting problems of heterozygosity. The first case involves scaffolds that match entirely within another scaffold. The second case corresponds to scaffolds that match each other at their extremities. The two sequences involved in the match may actually correspond to two distinct alleles of a specific locus instead of two different locations in the genome. Ideally, an erroneous duplication would involve two divergent but similar assembly parts, not containing any heterozygous polymorphisms, and for which merging the two would lead to the expected read coverage for the resulting consensus sequence. Consequently, to distinguish between true genomic duplications and alleles, we used various metrics: sequence similarity, length of the match, average read coverage, presence/absence of SNPs in the two concerned regions, number of mate pairs with expected (or not) insert size... Selected allelic regions are then used to construct a single sequence, either by removing one of the two alleles or by joining scaffolds at their extremities. This decreases redundancy in the genome assembly, improves the scaffolding and thus increases the N50 statistic.
We applied this method to a 526 Mb highly heterozygous wild-type insect genome assembly for which we expected a genome size of only around 400 Mb. A set of user-validated false duplications in this assembly enabled us to validate the method and to tune the set of criteria used to distinguish between true and artefactual duplications. We took advantage of this study to compare classical assemblers (Minia, Soap) with more recent tools that handle heterozygosity, such as Platanus. This highlighted the advantages of such new assemblers for diploid genomes. However, for already-built assemblies, we showed that our approach is a fast and easy way to discard as many erroneous duplications as possible, allowing their correction without resorting to a complete new assembly that would be more time-consuming

    Improvement of the assembly of heterozygous genomes of non-model organisms

    Whereas the number of non-model organisms being sequenced has drastically increased, the extraction of biological information from such data is hampered by the low quality of the draft assemblies. In particular, the combination of a high level of heterozygosity and short-read sequencing leads to fragmented assemblies and to the overestimation of the gene content and of the genome size. Recently, new assemblers have been developed to better handle heterozygous data. However, the complete re-assembly of a genome involves automatic and manual re-annotation tasks that are very costly. Thus, we present here a novel method to detect and correct false duplications due to heterozygosity (two alleles instead of one consensus sequence) in diploid draft assemblies. In addition, the method is able to relocate and merge supernumerary gene annotations. The method is based on a whole-genome self-alignment (Lastz + AxtChain) allowing the detection of highly similar regions. These can have two origins: either allelic regions or duplicated regions. To distinguish between them, three criteria are used: 1/ their location inside scaffolds: contrary to duplications, unmerged haplotypes come from the same locus and must share the same genomic context, 2/ their cumulative read depth (close to the expected one) and 3/ their level of redundancy in the whole assembly. Next, detected pairs of allelic regions need to be merged into one unique sequence in the assembly: either by the complete deletion of the redundant scaffolds or by the construction of meta-scaffolds (scaffolds joined together), keeping only the allele present in the longest scaffold of the pair. Genes located on the merged alleles need to be correctly re-annotated. This is performed using Exonerate and Augustus. The former identifies the location of the deleted genes on the remaining allele. The latter is used to predict new genes or consensus ones.
We applied this method to a heterozygous wild-type insect genome assembly. This led to a drastic reduction of the genome assembly size (consistent with the expected size estimated by flow cytometry) and to an increase of the N50. Most of the new meta-scaffolds were confirmed by several additional resources: mate pairs, BAC-end sequence mapping and synteny analysis. Moreover, about 80% of gene predictions located in removed fragments have been either relocated or merged with their complementary allele
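
The three criteria listed above can be sketched as a simple decision rule. This is only an illustration under stated assumptions, not the authors' implementation: the `RegionPair` fields, the function name `looks_allelic`, the 25% depth tolerance and the reading of "level of redundancy" as a copy count of at most two are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RegionPair:
    """A pair of highly similar regions found by self-alignment (hypothetical fields)."""
    shares_genomic_context: bool  # criterion 1: placement inside scaffolds
    depth_a: float                # mean read depth of the first region
    depth_b: float                # mean read depth of the second region
    copies_in_assembly: int       # criterion 3: how often the sequence occurs


def looks_allelic(pair, expected_depth, depth_tolerance=0.25, max_copies=2):
    """Return True if the pair behaves like two unmerged haplotypes.

    Assumed thresholds: cumulative depth within 25% of the expected depth,
    and no more than two copies of the sequence in the assembly.
    """
    cumulative = pair.depth_a + pair.depth_b
    depth_ok = abs(cumulative - expected_depth) <= depth_tolerance * expected_depth
    return (
        pair.shares_genomic_context
        and depth_ok
        and pair.copies_in_assembly <= max_copies
    )


# Invented values: each allele carries about half of the expected depth of 50x.
allelic = RegionPair(True, 24.0, 26.0, 2)
duplication = RegionPair(True, 50.0, 52.0, 2)
print(looks_allelic(allelic, 50.0), looks_allelic(duplication, 50.0))  # True False
```

The depth test carries most of the weight here: two unmerged haplotypes split the reads of a single locus between them, whereas each copy of a true duplication attracts the full expected depth.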

    Whole genome re-sequencing : lessons from unmapped reads

    Unmapped reads are often discarded from the analysis of whole-genome re-sequencing data, although new biological information can be discovered by analysing them. In this paper, we investigated these reads from the re-sequencing data of thirty-three aphid genomes. The unmapped reads for each individual were retrieved from the results of mapping the sets of reads against the Acyrthosiphon pisum reference genome, its mitochondrial genome and several known or putative symbiont genomes. These sets of unmapped reads were then cross-compared. This showed that a significant number of these sequences were conserved among individuals, especially when the latter are adapted to the same specific host plant, suggesting that they may share crucial functional material. Moreover, the analysis of the contigs resulting from the assemblies of the unmapped reads gathered by biotype allowed us to discover putative novel sequences absent from the reference genomes and highlighted the possible presence of other symbionts in the pea aphid genome whose existence was not previously known. In conclusion, this study emphasizes that using a default strategy (e.g. for the mapping) may lead to the loss of important information, and must be accompanied by specific analyses depending on the biological model

    Management of a High-Risk Surgery with Emicizumab and Factor VIII in a Child with a Severe Hemophilia A and Inhibitor

    The recent development of a humanized, bispecific monoclonal antibody mimicking the function of activated factor VIII was a revolution in the management of patients suffering from severe hemophilia A with inhibitors. Phase III randomized studies have shown that this subcutaneously administered drug provides more effective prophylaxis in these patients than recombinant FVIIa (rFVIIa) and activated prothrombin complex concentrates (aPCC). Nonetheless, there are “real life” matters that need to be explored in this new era of managing hemophilia patients, such as surgery management under emicizumab, especially in children. Here, we report the first case, to our knowledge, of major orthopedic surgery managed with factor VIII infusions in a child with inhibitor receiving emicizumab

    Data from: Identifying genomic hotspots of differentiation and candidate genes involved in the adaptive divergence of pea aphid host races

    Identifying the genomic bases of adaptation to novel environments is a long-term objective in evolutionary biology. Because genetic differentiation is expected to increase between locally adapted populations at the genes targeted by selection, scanning the genome for elevated levels of differentiation is a first step towards deciphering the genomic architecture underlying adaptive divergence. The pea aphid Acyrthosiphon pisum is a model of choice to address this question, as it forms a large complex of plant-specialized races and cryptic species, resulting from recent adaptive radiation. Here, we characterized genome-wide polymorphisms in three pea aphid races specialized on alfalfa, clover and pea crops, respectively, which we sequenced in pools (poolseq). Using a model-based approach that explicitly accounts for selection, we identified 392 genomic hotspots of differentiation spanning 47.3 Mb and 2,484 genes. Most of these highly differentiated regions were located on the autosomes and overall differentiation was weaker on the X chromosome. High levels of absolute divergence between races within hotspots suggest that these regions experienced less gene flow than the rest of the genome, most likely by contributing to reproductive isolation. Moreover, population-specific analyses showed evidence of selection in every host race, depending on the hotspot considered. These hotspots were significantly enriched for candidate gene categories that control host plant selection and use. These genes encode 48 salivary proteins, 14 gustatory receptors, 10 odorant receptors, five P450 cytochromes and one chemosensory protein, which represent promising candidates for the genetic basis of host plant specialization and ecological isolation in the pea aphid complex

    SNP_refcounts.tab

    Reference allele count data for each of the 1,588,558 SNPs in each BAM file (2 per host race, ALF: alfalfa, CLO: clover, PEA: Pea). Coordinates on the v2 assembly (SCAFFOLD_POSITION) are given as row names