15 research outputs found
Combination of short-read, long-read, and optical mapping assemblies reveals large-scale tandem repeat arrays with population genetic implications
Accurate and contiguous genome assembly is key to a comprehensive understanding of the processes shaping genomic diversity and evolution. Yet, it is frequently constrained by constitutive heterochromatin, usually characterized by highly repetitive DNA. As a key feature of genome architecture associated with centromeric and subtelomeric regions, it locally influences meiotic recombination. In this study, we assess the impact of large tandem repeat arrays on the recombination rate landscape in an avian speciation model, the Eurasian crow. We assembled two high-quality genome references using single-molecule real-time sequencing (long-read assembly [LR]) and single-molecule optical maps (optical map assembly [OM]). A three-way comparison including the published short-read assembly (SR) constructed for the same individual allowed assessing assembly properties and pinpointing misassemblies. By combining information from all three assemblies, we characterized 36 previously unidentified large repetitive regions in the proximity of sequence assembly breakpoints, the majority of which contained complex arrays of a 14-kb satellite repeat or its 1.2-kb subunit. Using whole-genome population resequencing data, we estimated the population-scaled recombination rate (ρ) and found it to be significantly reduced in these regions. These findings are consistent with an effect of low recombination in regions adjacent to centromeric or subtelomeric heterochromatin and add to our understanding of the processes generating widespread heterogeneity in genetic diversity and differentiation along the genome. By combining three different technologies, our results highlight the importance of adding a layer of information on genome structure that is inaccessible to each approach independently.
CRISPR/Cas9-targeted enrichment and long-read sequencing of the Fuchs endothelial corneal dystrophy–associated TCF4 triplet repeat
PURPOSE: To demonstrate the utility of an amplification-free long-read sequencing method to characterize the Fuchs endothelial corneal dystrophy (FECD)-associated intronic TCF4 triplet repeat (CTG18.1). METHODS: We applied an amplification-free method, utilizing the CRISPR/Cas9 system, in combination with PacBio single-molecule real-time (SMRT) long-read sequencing, to study CTG18.1. FECD patient samples displaying a diverse range of CTG18.1 allele lengths and zygosity status (n = 11) were analyzed. A robust data analysis pipeline was developed to effectively filter, align, and interrogate CTG18.1-specific reads. All results were compared with conventional polymerase chain reaction (PCR)-based fragment analysis. RESULTS: CRISPR-guided SMRT sequencing of CTG18.1 provided accurate genotyping information for all samples and phasing was possible for 18/22 alleles sequenced. Repeat length instability was observed for all expanded (≥50 repeats) phased CTG18.1 alleles analyzed. Furthermore, higher levels of repeat instability were associated with increased CTG18.1 allele length (mode length ≥91 repeats), indicating that expanded alleles behave dynamically. CONCLUSION: CRISPR-guided SMRT sequencing of CTG18.1 has revealed novel insights into CTG18.1 length instability. Furthermore, this study provides a framework to improve the molecular diagnostic accuracy for CTG18.1-mediated FECD, which we anticipate will become increasingly important as gene-directed therapies are developed for this common age-related and sight-threatening disease.
Targeted Long-read Sequencing : Development and Applications in Medical Genetics
Targeted sequencing has the advantage of providing pinpointed DNA information, while costs and data-analysis efforts are reduced. If targeted sequencing is combined with single-molecule long-read sequencing, it can become a powerful tool to investigate genomic regions that are traditionally difficult to study with the predominant short-read sequencing platforms, including repetitive regions and large structural variants. The aim of this thesis has been to develop and apply novel targeted long-read sequencing protocols to solve research questions of biomedical and clinical interest. In Paper I we utilized a new amplification-free targeted long-read sequencing method to study trinucleotide repeats in the huntingtin (HTT) gene, associated with Huntington's disease. This method generated reads spanning the entire repeats, and we could accurately determine the repeat sizes in patient samples. Moreover, we could discover somatic variation of HTT repeat elements as a result of sequencing single, unamplified DNA molecules. In Paper II we present the Xdrop technology, a microfluidic-based system for targeted enrichment of large DNA molecules in droplets from low-input samples. We applied the Xdrop technology to detect human papilloma virus 18 (HPV18) integration sites in the human genome of a cervical cancer cell line by targeting the virus genome. We also demonstrated its utility in detecting and phasing SNVs in the tumor suppressor gene TP53 in leukemia cells. In Paper III we employed targeted long-read sequencing to identify CRISPR-Cas9 off-target mutations in vitro with our two novel methods Nano-OTS and SMRT-OTS. Importantly, we were able to identify Cas9 cleavage sites in regions of the human genome that are difficult or impossible to assess using short-read sequencing. The aim of Paper IV was to investigate large structural variants (SVs) induced by CRISPR-Cas9 at on-target and off-target sites in genome-edited zebrafish and their offspring.
Nano-OTS was used to identify Cas9 off-target sites for four guide RNAs, which were also used for genome editing of fertilized fish eggs. Aided by long-read re-sequencing, we showed that Cas9 can induce large SVs at both on-target and off-target sites in vivo, and that these adverse variants can be passed on to the next generation. This thesis has highlighted a diversity of targeted long-read sequencing methods and some of their applications in medical genetics. We believe these methods could have an important place in future research and clinical diagnostics, and that the scope of their utility will be far beyond the applications demonstrated in this work.
Clonal distribution of BCR-ABL1 mutations and splice isoforms by single-molecule long-read RNA sequencing
Background: The evolution of mutations in the BCR-ABL1 fusion gene transcript renders CML patients resistant to tyrosine kinase inhibitor (TKI)-based therapy. Thus, screening for BCR-ABL1 mutations is recommended, particularly in patients experiencing poor response to treatment. Herein we describe a novel approach for the detection and surveillance of BCR-ABL1 mutations in CML patients. Methods: To detect mutations in the BCR-ABL1 transcript we developed an assay based on the Pacific Biosciences (PacBio) sequencing technology, which allows for single-molecule long-read sequencing of BCR-ABL1 fusion transcript molecules. Samples from six patients with poor response to therapy were analyzed both at diagnosis and follow-up. cDNA was generated from total RNA and a 1.6 kb fragment encompassing the BCR-ABL1 transcript was amplified using long-range PCR. To estimate the sensitivity of the assay, a serial dilution experiment was performed. Results: Over 10,000 full-length BCR-ABL1 sequences were obtained for all samples studied. Through the serial dilution analysis, mutations in CML patient samples could be detected down to a level of at least 1%. Notably, the assay was determined to be sufficiently sensitive even in patients with low BCR-ABL1 levels. The PacBio sequencing successfully identified all mutations seen by standard methods. Importantly, we identified several mutations that escaped detection by the clinical routine analysis. Resistance mutations were found in all but one of the patients. Due to the long reads afforded by PacBio sequencing, compound mutations present in the same molecule were readily distinguished from independent alterations arising in different molecules. Moreover, several isoforms of the BCR-ABL1 transcript were identified in two of the CML patients. Finally, our assay allowed for a quick turnaround time, with samples reported within 2 days.
Conclusions: In summary, the PacBio sequencing assay can be applied to detect BCR-ABL1 resistance mutations in both diagnostic and follow-up CML patient samples using a simple protocol applicable to routine diagnosis. Besides its sensitivity, the method gives a complete view of the clonal distribution of mutations, which is of importance when making therapy decisions.
A novel quantitative targeted analysis of X-chromosome inactivation (XCI) using nanopore sequencing
X-chromosome inactivation (XCI) analyses often assist in diagnostics of X-linked traits; however, accurate assessment remains challenging with current methods. We developed a novel strategy using amplification-free Cas9 enrichment and Oxford Nanopore Technologies sequencing, called XCI-ONT, to investigate and rigorously quantify XCI in the human androgen receptor gene (AR) and the human X-linked retinitis pigmentosa 2 gene (RP2). XCI-ONT measures methylation over 116 CpGs in AR and 58 CpGs in RP2, and separates the parental X-chromosomes without PCR bias. We show the usefulness of the XCI-ONT strategy over the PCR-based gold standard XCI technique, which only investigates one or two CpGs per gene. The results highlight the limitations of the gold standard technique when the XCI pattern is partially skewed and the advantages of XCI-ONT for rigorously quantifying XCI. This study provides a universal DNA-based XCI method, which is highly valuable in clinical and research settings for X-linked traits.
CRISPR-Cas9 induces large structural variants at on-target and off-target sites in vivo that segregate across generations
CRISPR-Cas9 genome editing has the potential to cure diseases without current treatments, but therapies must be safe. Here we show that CRISPR-Cas9 editing can introduce unintended mutations in vivo, which are passed on to the next generation. By editing fertilized zebrafish eggs using four guide RNAs selected for off-target activity in vitro, followed by long-read sequencing of DNA from >1100 larvae, juvenile and adult fish across two generations, we find that structural variants (SVs), i.e., insertions and deletions ≥50 bp, represent 6% of editing outcomes in founder larvae. These SVs occur both at on-target and off-target sites. Our results also illustrate that adult founder zebrafish are mosaic in their germ cells, and that 26% of their offspring carry an off-target mutation and 9% an SV. Hence, pre-testing for off-target activity and SVs using patient material is advisable in clinical applications, to reduce the risk of unanticipated effects with potentially large implications.
Detailed analysis of HTT repeat elements in human blood using targeted amplification-free long-read sequencing
Amplification of DNA is a mandatory step during library preparation in most targeted sequencing protocols. This can be a critical limitation when targeting regions that are highly repetitive or have extreme guanine-cytosine (GC) content, including repeat expansions associated with human disease. Here, we used an amplification-free protocol for targeted enrichment utilizing the CRISPR/Cas9 system (No-Amp Targeted sequencing) in combination with single-molecule, real-time (SMRT) sequencing for studying repeat elements in the huntingtin (HTT) gene, where an expanded CAG repeat is causative for Huntington disease. We also developed a robust data analysis pipeline for repeat element analysis that is independent of alignment of reads to a reference genome. The method was applied to 11 diagnostic blood samples, and for all 22 alleles the resulting CAG repeat count agreed with previous results based on fragment analysis. The amplification-free protocol also allowed for studying somatic variability of repeat elements in our samples, without the interference of PCR stutter. In summary, with No-Amp Targeted sequencing in combination with our analysis pipeline, we could accurately study repeat elements that are difficult to investigate using PCR-based methods.
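To illustrate what counting repeat units without alignment to a reference genome can look like, here is a minimal Python sketch. It is not the authors' pipeline: the function names, the choice to take the longest uninterrupted CAG run, and the check of both strands are all assumptions made for illustration, and a real analysis would also need to handle sequencing errors and interrupted repeats.

```python
import re

# Translation table for complementing DNA bases (assumes plain A/C/G/T reads).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def longest_cag_run(read: str) -> int:
    """Number of repeat units in the longest uninterrupted CAG run of a read."""
    runs = re.findall(r"(?:CAG)+", read.upper())
    return max((len(run) // 3 for run in runs), default=0)


def count_cag_repeats(read: str) -> int:
    """Hypothetical helper: check both strands, since read orientation is unknown."""
    return max(longest_cag_run(read), longest_cag_run(reverse_complement(read)))


if __name__ == "__main__":
    # Toy read: 17 CAG units followed by a CAA interruption and one more CAG.
    toy_read = "TTGA" + "CAG" * 17 + "CAACAG" + "CCGCCA"
    print(count_cag_repeats(toy_read))
```

Counting directly on each read, rather than on an alignment, avoids the mapping problems that long repeats cause, and applying it per read gives a distribution of counts per sample rather than a single value.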
De Novo Assembly of Two Swedish Genomes Reveals Missing Segments from the Human GRCh38 Reference and Improves Variant Calling of Population-Scale Sequencing Data
The current human reference sequence (GRCh38) is a foundation for large-scale sequencing projects. However, recent studies have suggested that GRCh38 may be incomplete and give a suboptimal representation of specific population groups. Here, we performed a de novo assembly of two Swedish genomes that revealed over 10 Mb of sequences absent from the human GRCh38 reference in each individual. Around 6 Mb of these novel sequences (NS) are shared with a Chinese personal genome. The NS are highly repetitive, have an elevated GC content, and are primarily located in centromeric or telomeric regions. Up to 1 Mb of NS can be assigned to chromosome Y, and large segments are also missing from GRCh38 at chromosomes 14, 17, and 21. Inclusion of NS into the GRCh38 reference radically improves the alignment and variant calling from short-read whole-genome sequencing data at several genomic loci. A re-analysis of a Swedish population-scale sequencing project yields >75,000 putative novel single nucleotide variants (SNVs) and removes >10,000 false positive SNV calls per individual, some of which are located in protein-coding regions. Our results highlight that the GRCh38 reference is not yet complete and demonstrate that personal genome assemblies from local populations can improve the analysis of short-read whole-genome sequencing data.