8 research outputs found

    SVenX: A highly parallelized pipeline for structural variation detection using linked read whole genome sequencing data

    Genomic rearrangements larger than 50 bp are called structural variants. As a group, they contribute to phenotypic diversity among humans and have been associated with many human disorders, including neurodevelopmental disorders and cancer. Recent advances in whole genome sequencing (WGS) technologies have made it possible to identify many more disease-causing genetic variants that are relevant in clinical diagnostics and sometimes affect treatment. Numerous approaches have been proposed to detect structural variants, but extracting and filtering the most significant information from the multitude of variants called in the sequencing data has proven to be a challenge. Other obstacles are the high computational cost of data analysis and the difficulty of configuring and operating the software and databases. Here, we present SVenX, a highly automated and parallelized pipeline that analyzes and calls structural variants using linked-read WGS data. It performs variant calling using three different approaches, as well as variant annotation and filtering. We also introduce a new tool, SVGenT, that reanalyzes the called structural variants by performing de novo assembly of the aligned reads at the identified breakpoint junctions. By comparing assembled contigs and analyzing the read coverage between the breakpoint junctions, SVGenT improves variant classification, genotype classification, and breakpoint localization.
    Tool for detection of genomic rearrangements in humans
    Genomic rearrangements larger than 50 base pairs are referred to as structural variants (SVs) and contribute to phenotypic differences between humans. Some of these variants have been associated with human diseases such as cancer and neurodevelopmental disorders. Recent advances in whole genome sequencing (WGS) technologies have made it possible to analyze and identify many structural variants. Yet the existing tools for analyzing these data are not perfect and require a fair amount of bioinformatics knowledge to operate.
    SVenX is a highly parallelized and automated pipeline that executes all steps from whole genome sequencing data to filtered SVs. This includes 1) verifying that all required data exist, 2) making sure no data duplications exist, 3) finding variants using different methods, and 4) annotating and filtering the detected SVs. SVenX performs 10 separate steps, including running 3 different variant detection tools (also known as variant callers). Normally, these steps are performed one by one, with each step waiting for the output of the previous one. This approach not only takes longer to run, it also requires a person to execute each step. Apart from installation, SVenX takes at most a few minutes to set up and launch, and it can analyze multiple WGS samples at the same time. The whole pipeline takes about 4 to 5 days to complete and requires minimal effort and bioinformatics knowledge.
    Another challenge in SV research is not only detecting the variants but also being confident that the detected SVs are true calls. The performance of existing variant callers differs significantly. One tool can perform very well on one dataset and fail completely to detect SVs in another, while a second tool might be good at detecting only a single type of SV. Using multiple bioinformatics methods to detect SVs has been shown to result in a higher detection rate. We have created a novel tool, SVGenT, that re-analyzes already detected SVs by performing de novo assembly. SVGenT classifies the SV type (deletion, duplication, inversion or break-end) and genotype (homozygous or heterozygous), and updates the genomic position of the SV breakpoints. SVGenT has been tested using two datasets: one public large-scale WGS dataset and one simulated dataset with 4000 SVs. Three different variant callers were used to detect the variants before SVGenT was run on the output files. The detection rate was calculated before and after SVGenT was applied. In most cases, SVGenT improved the classification of both SV type and SV genotype.
    Master’s Degree Project in Biology/Molecular Biology/Bioinformatics, 60 credits, 2017. Department of Biology, Lund University. Advisor: Anna Lindstrand, M.D., Ph.D., Karolinska Institutet.

    LEVIATHAN: efficient discovery of large structural variants by leveraging long-range information from Linked-Reads data

    Linked-Reads technologies, popularized by 10x Genomics, combine the high quality and low cost of short-read sequencing with long-range information by adding barcodes that tag reads originating from the same long DNA fragment. Thanks to their high quality and long-range information, such reads are particularly useful for applications such as genome scaffolding and structural variant calling. As a result, multiple structural variant calling methods have been developed within the last few years. However, these methods were mainly tested on human data and do not run well on non-human organisms, for which reference genomes are highly fragmented or sequencing data display high levels of heterozygosity. Moreover, even on human data, most tools still require large amounts of computing resources. We present LEVIATHAN, a new structural variant calling tool that aims to address these issues, and in particular to scale better and apply to a wide variety of organisms. Our method relies on a barcode index that makes it possible to quickly compare all possible pairs of regions by the number of barcodes they share. Region pairs sharing a sufficient number of barcodes are then considered potential structural variants, and complementary, classical short-read methods are applied to further refine the breakpoint coordinates. Our experiments on simulated data show that our method compares well to the state of the art in terms of recall, precision, and resource consumption. Moreover, LEVIATHAN was successfully applied to a real dataset from a non-model organism, while all other tools either failed to run or required unreasonable amounts of resources. LEVIATHAN is implemented in C++, supported on Linux platforms, and available under the AGPL-3.0 License at https://github.com/morispi/LEVIATHAN
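
    The barcode-sharing idea described in this abstract can be illustrated with a minimal Python sketch. It is not LEVIATHAN's C++ implementation: the input format (tuples of chromosome, position, barcode), the function names, and the parameters `region_size`, `min_shared`, and `min_gap` are all assumptions made for illustration. Neighbouring regions naturally share barcodes because they are covered by the same long molecule, so the sketch skips same-chromosome pairs closer than `min_gap` regions.

```python
from collections import defaultdict
from itertools import combinations


def build_barcode_index(alignments, region_size):
    """Map each (chromosome, region number) to the set of barcodes seen there.

    `alignments` is a hypothetical iterable of (chrom, pos, barcode) tuples.
    """
    index = defaultdict(set)
    for chrom, pos, barcode in alignments:
        index[(chrom, pos // region_size)].add(barcode)
    return index


def candidate_region_pairs(index, min_shared, min_gap=1):
    """Return {(region_a, region_b): shared_count} for distant region pairs
    that share at least `min_shared` barcodes."""
    # Invert the index (barcode -> regions) so that only region pairs
    # sharing at least one barcode are ever enumerated.
    regions_by_barcode = defaultdict(set)
    for region, barcodes in index.items():
        for barcode in barcodes:
            regions_by_barcode[barcode].add(region)

    shared = defaultdict(int)
    for regions in regions_by_barcode.values():
        for a, b in combinations(sorted(regions), 2):
            # Skip nearby regions on the same chromosome: they are expected
            # to share barcodes even without a structural variant.
            if a[0] == b[0] and abs(a[1] - b[1]) < min_gap:
                continue
            shared[(a, b)] += 1

    return {pair: n for pair, n in shared.items() if n >= min_shared}


if __name__ == "__main__":
    # Toy data: barcodes A and B both appear near position 0 and near 90 kb.
    toy = [
        ("chr1", 100, "A"), ("chr1", 150, "B"),
        ("chr1", 90500, "A"), ("chr1", 90600, "B"),
        ("chr1", 50000, "C"),
    ]
    idx = build_barcode_index(toy, region_size=1000)
    print(candidate_region_pairs(idx, min_shared=2, min_gap=5))
```

    In this sketch the candidate pairs would still need the short-read refinement step mentioned in the abstract to obtain breakpoint coordinates; that step is not shown.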

    Structural variant calling: the long and the short of it.

    Recent research into structural variants (SVs) has established their importance to medicine and molecular biology, elucidating their role in various diseases, regulation of gene expression, ethnic diversity, and large-scale chromosome evolution, giving rise to the differences within populations and among species. Nevertheless, characterizing SVs and determining the optimal approach for a given experimental design remains a computational and scientific challenge. Multiple approaches have emerged to target various SV classes, zygosities, and size ranges. Here, we review these approaches with respect to their ability to infer SVs across the full spectrum of large, complex variations and present computational methods for each approach.

    Transposable element insertions are associated with Batesian mimicry in the pantropical butterfly Hypolimnas misippus

    Hypolimnas misippus is a Batesian mimic of the toxic African Queen butterfly (Danaus chrysippus). Female H. misippus butterflies use two major wing patterning loci (M and A) to imitate three color morphs of D. chrysippus found in different regions of Africa. In this study, we examine the evolution of the M locus and identify it as an example of adaptive atavism. This phenomenon involves a morphological reversion to an ancestral character that results in an adaptive phenotype. We show that H. misippus has re-evolved an ancestral wing pattern present in other Hypolimnas species, repurposing it for Batesian mimicry of a D. chrysippus morph. Using haplotagging, a linked-read sequencing technology, and our new analytical tool, Wrath, we discover two large transposable element insertions located at the M locus and establish that these insertions are present in the dominant allele responsible for producing the mimetic phenotype. By conducting a comparative analysis involving additional Hypolimnas species, we demonstrate that the dominant allele is derived. This suggests that, in the derived allele, the transposable elements disrupt a cis-regulatory element, leading to the reversion to an ancestral phenotype that is then utilized for Batesian mimicry of a distinct model, a different morph of D. chrysippus. Our findings present a compelling instance of convergent evolution and adaptive atavism, in which the same pattern element has independently evolved multiple times in Hypolimnas butterflies, repeatedly playing a role in Batesian mimicry of diverse model species.

    Human endogenous retrovirus H protects the genome of human embryonic stem cells from mutagenic retroelements activity

    Human pluripotent stem cells (hPSCs), which include human embryonic stem cells (hESCs) and human induced pluripotent stem cells (hiPSCs), self-renew indefinitely and can differentiate into any cell type in the human body [1–3]. hESCs are derived from early human embryos and have become widely used to study the molecular pathways specific to human embryogenesis [1, 4–8]. Considering the ethical challenges of using embryo-derived cells and the possibility of immune rejection, hiPSCs are currently more common in regenerative therapies [3, 9–11]. hiPSCs are reprogrammed from a patient's somatic cell line, genetically modified, and then differentiated into the desired lineage for transplantation back into the patient. hiPSCs are the future of personalized medicine, but because of cell heterogeneity, not every hiPSC line can differentiate into every cell type. A naïve cell state might be a way to reduce this heterogeneity [3]. Whereas cultured hPSCs reside in a primed state, the cells of pre-implantation embryos resemble naïve pluripotency [12–16]. By adjusting culture conditions, it is possible to maintain hPSCs in a naïve state whose gene expression signature is similar to that of early embryos [4, 5, 17–19]. This similarity is also reflected in the transcripts of some L1, Alu, and SVA retroelements (REs) [5]. These REs are phylogenetically young and still-active human transposons, which might be detrimental to the integrity of the genome [20–27]. Our research group previously derived different types of naïve cells that resemble the later stages of pre-implantation development and highly express human endogenous retrovirus H (HERVH) [6]. HERVH is a phylogenetically older endogenous retrovirus that was transposing after the separation of New World and Old World monkeys [28–30]. HERVH can no longer mobilize, but its transcripts have been shown to support pluripotency in later stages of human embryogenesis, in reprogramming, and in cultured primed hPSCs [6, 7, 31, 32].
    Here I show that HERVH controls the transposition of young REs. In HERVH-depleted hESCs, L1 transposition increases, as measured by two transposition assays. The active L1 elements drive the transposition of non-autonomous REs, resulting in the accumulation of de novo Alu and SVA integrations, as shown by whole-genome sequencing of cells undergoing stable HERVH knock-down. A subgroup of HERVH has the potential to control L1 transposition. These HERVHlin loci contain the lin motif, two tandem LIN28A binding sites [33]. HERVHlin is presumably evolutionarily younger than the other HERVH. There are around 100 HERVHlin sequences in the human, chimp, and gorilla genomes, while fewer exist in orangutans and none in other primates. Based on the analysis of previously published CLIP-seq data [33] and on RIP-qPCR experiments, the lin motif allows LIN28A to bind HERVHlin more efficiently than other HERVH transcripts. LIN28A is known to inhibit the maturation of let-7 microRNA [34–37], which in turn controls the transposition of L1 [38]. HERVHlin sponging LIN28A to allow let-7-mediated inhibition of L1 might be the molecular mechanism of HERVH-controlled transposition of young REs. A supporting experiment shows that the transposition activity of a let-7-independent L1-ORFeus reporter does not change in HERVH-depleted cells. HERVHlin embedded itself in a previously conserved pluripotency-specific LIN28A-let-7 pathway to protect the genome of hESCs from the mutagenic activity of REs. This is an example of a new evolutionary event in which the selfish transposon HERVH evolved to compete with other transposable elements that could be harmful to the host.

    Discovering novel human structural variation from diverse populations and disease patients: an exploration of what human genomics misses by relying on reference-based analyses

    Since the completion of the Human Genome Project, the field of genomics has relied on the human reference genome for nearly all analyses. Population genetics, disease association studies, and beyond all begin by comparing an individual’s sequenced genome to the human reference. However, the human reference genome is not only still incomplete, but also not an accurate representation of humanity; it is derived primarily from a single individual and cannot possibly represent the scope of human diversity. By using this genome as a template, we bias our studies. In this thesis we examine large regions of structural variation between individuals that are often missed by comparing solely to the human reference genome. We use multiple strategies to uncover variation, including performing localized assembly on whole genome sequencing reads not matching the reference genome from 910 individuals of African ancestry, and utilizing new, long-read sequencing technologies in disease patients. We demonstrate that vast amounts of sequence present in human populations, nearly 300 megabases in the case of the African ancestry dataset, are missing from the reference genome, and that many non-reference sequences are present in breast cancer and Mendelian disease patients, which could have yet-to-be-discovered disease relevance. We find evidence of novel non-reference sequences that are genic and transcribed in many individuals and may have functional relevance. Finally, we present strategies for integrating the wealth of short-read sequencing data currently available with the limited but growing number of newer, long-read sequenced samples to gain new insights previously inaccessible using short-read data alone.