6 research outputs found

    Single-cell strand sequencing for structural variant analysis and genome assembly

    Rapid advances in DNA sequencing technologies and the development of computational tools to analyze sequencing data have started a revolution in the field of genetics. DNA sequencing has applications in medical research, disease diagnosis and treatment, and population genetic studies. Different sequencing techniques have their own advantages and limitations, and they can be combined to address genome assembly and genetic variant detection. The focus of this thesis is on a specific single-cell sequencing technology called strand sequencing. With its chromosome- and haplotype-specific strand information, this technique provides very powerful signals for the discovery of genomic structural variations, haplotype phasing, and chromosome clustering. We developed statistical and computational tools to exploit this information from strand sequencing technology. I first present a computational framework for detecting structural variations in single cells using strand sequencing data. The presented tool is able to detect different types of structural variations in single cells, including copy number variations, inversions, and inverted duplications, as well as more complex biological events such as translocations and breakage-fusion-bridge (BFB) cycles. These variations and genomic rearrangements have been observed in cancer; the discovery of such events within cell populations can therefore lead to a more accurate picture of cancer genomes and help in diagnosis. In the remainder of this thesis, I elaborate on two computational pipelines for clustering long DNA sequences by their original chromosome and haplotype in the absence of a reference genome. These pipelines were developed to facilitate genome assembly and de novo haplotype phasing in a fast and accurate manner. The resulting haplotype assemblies can be useful for studying genomic variations without reference bias, gaining insights into population genetics, and detecting compound heterozygosity.

    Fully phased human genome assembly without parental data using single-cell strand sequencing and long reads

    Human genomes are typically assembled as consensus sequences that lack information on parental haplotypes. Here we describe a reference-free workflow for diploid de novo genome assembly that combines the chromosome-wide phasing and scaffolding capabilities of single-cell strand sequencing with continuous long-read or high-fidelity sequencing data. Employing this strategy, we produced a completely phased de novo genome assembly for each haplotype of an individual of Puerto Rican descent (HG00733) in the absence of parental data. The assemblies are accurate (quality value > 40) and highly contiguous (contig N50 > 23 Mbp) with low switch error rates (0.17%), providing fully phased single-nucleotide variants, indels and structural variants. A comparison of Oxford Nanopore Technologies and Pacific Biosciences phased assemblies identified 154 regions that are preferential sites of contig breaks, irrespective of sequencing technology or phasing algorithms
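The accuracy figure in this abstract can be read more concretely with a small conversion. The sketch below assumes that "quality value" follows the usual Phred convention, QV = -10 * log10(p), where p is the per-base error probability; the abstract itself does not state the formula, and the function names are illustrative only.

```python
import math


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error probability.

    Assumes the Phred convention QV = -10 * log10(p).
    """
    return 10 ** (-qv / 10)


def error_rate_to_qv(p):
    """Convert a per-base error probability to a Phred-scaled quality value."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


# Under this convention, QV 40 corresponds to about one error per 10,000 bases.
print(qv_to_error_rate(40))
print(error_rate_to_qv(1e-4))
```

Under that assumption, "quality value > 40" means fewer than roughly one erroneous base per 10,000 assembled bases.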

    Strand-seq Enables Reliable Separation of Long Reads by Chromosome via Expectation Maximization

    The given file contains FASTQ-formatted Strand-seq reads originating from 132 single-cell libraries. The FASTQ file was generated from mapped BAM files containing paired-end reads. Only the first mate of each pair is reported, and the chromosome location, flag and library name are appended to each read name.

    Single-cell analysis of structural variations and complex rearrangements with tri-channel processing

    Structural variation (SV), involving deletions, duplications, inversions and translocations of DNA segments, is a major source of genetic variability in somatic cells and can dysregulate cancer-related pathways. However, discovering somatic SVs in single cells has been challenging, with copy-number-neutral and complex variants typically escaping detection. Here we describe single-cell tri-channel processing (scTRIP), a computational framework that integrates read depth, template strand and haplotype phase to comprehensively discover SVs in individual cells. We surveyed SV landscapes of 565 single cells, including transformed epithelial cells and patient-derived leukemic samples, to discover abundant SV classes, including inversions, translocations and complex DNA rearrangements. Analysis of the leukemic samples revealed four times more somatic SVs than cytogenetic karyotyping, submicroscopic copy-number alterations, oncogenic copy-neutral rearrangements and a subclonal chromothripsis event. Advancing current methods, single-cell tri-channel processing can directly measure SV mutational processes in individual cells, such as breakage-fusion-bridge cycles, facilitating studies of clonal evolution, genetic mosaicism and SV formation mechanisms, which could improve disease classification for precision medicine

    Fully phased human genome assembly without parental data using single-cell strand sequencing and long reads.

    Human genomes are typically assembled as consensus sequences that lack information on parental haplotypes. Here we describe a reference-free workflow for diploid de novo genome assembly that combines the chromosome-wide phasing and scaffolding capabilities of single-cell strand sequencing with continuous long-read or high-fidelity sequencing data. Employing this strategy, we produced a completely phased de novo genome assembly for each haplotype of an individual of Puerto Rican descent (HG00733) in the absence of parental data. The assemblies are accurate (quality value > 40) and highly contiguous (contig N50 > 23 Mbp) with low switch error rates (0.17%), providing fully phased single-nucleotide variants, indels and structural variants. A comparison of Oxford Nanopore Technologies and Pacific Biosciences phased assemblies identified 154 regions that are preferential sites of contig breaks, irrespective of sequencing technology or phasing algorithms

    Haplotype-resolved diverse human genomes and integrated analysis of structural variation.

    Long-read and strand-specific sequencing technologies together facilitate the de novo assembly of high-quality haplotype-resolved human genomes without parent-child trio data. We present 64 assembled haplotypes from 32 diverse human genomes. These highly contiguous haplotype assemblies (average minimum contig length needed to cover 50% of the genome: 26 million base pairs) integrate all forms of genetic variation, even across complex loci. We identified 107,590 structural variants (SVs), of which 68% were not discovered with short-read sequencing, and 278 SV hotspots (spanning megabases of gene-rich sequence). We characterized 130 of the most active mobile element source elements and found that 63% of all SVs arise through homology-mediated mechanisms. This resource enables reliable graph-based genotyping from short reads of up to 50,340 SVs, resulting in the identification of 1526 expression quantitative trait loci as well as SV candidates for adaptive selection within the human population
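The contiguity measure described in this abstract (the minimum contig length needed to cover 50% of the genome) is commonly reported as N50. A minimal sketch of that calculation follows, assuming the 50% threshold is taken against the total assembled length rather than an independent genome-size estimate; the contig lengths are made up for illustration and are not from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are sorted from longest to shortest and accumulated until the
    running total reaches half of the total assembled length; the length of
    the contig that crosses that threshold is the N50.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths (arbitrary units), total = 300.
contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(contigs))  # 80 + 70 = 150 reaches half of 300, so this prints 70
```

If the threshold were taken against an estimated genome size instead of the assembly total, the same loop would apply with `half_total` replaced by half of that estimate (a statistic usually called NG50).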