De novo Nanopore read quality improvement using deep learning.
BACKGROUND: Long-read sequencing technologies such as Oxford Nanopore can greatly decrease the complexity of de novo genome assembly and large structural variation identification. Currently, Nanopore reads have high error rates, and the errors often cluster into low-quality segments within the reads. The limited sensitivity of existing read-based error correction methods can cause large-scale mis-assemblies in the assembled genomes, motivating further innovation in this area. RESULTS: Here we developed a Convolutional Neural Network (CNN) based method, called MiniScrub, for identification and subsequent "scrubbing" (removal) of low-quality Nanopore read segments to minimize their interference in the downstream assembly process. MiniScrub first generates read-to-read overlaps via MiniMap2, then encodes the overlaps into images, and finally builds CNN models to predict low-quality segments. Applying MiniScrub to real-world control datasets under several different parameter settings, we show that it robustly improves read quality and improves read error correction in the metagenome setting. Compared to raw reads, de novo genome assembly with scrubbed reads produces many fewer mis-assemblies and large indel errors. CONCLUSIONS: MiniScrub is able to robustly improve the read quality of Oxford Nanopore reads, especially in the metagenome setting, making it useful for downstream applications such as de novo assembly. We propose MiniScrub as a tool for preprocessing Nanopore reads for downstream analyses. MiniScrub is open-source software and is available at https://bitbucket.org/berkeleylab/jgi-miniscrub
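The scrubbing step described above can be illustrated with a minimal, hypothetical sketch: given per-position low-quality probabilities (such as a CNN like MiniScrub's might emit), drop windows whose mean predicted error exceeds a cutoff and keep the remaining high-quality fragments. The window size, cutoff, and function name are illustrative assumptions, not MiniScrub's actual interface.

```python
def scrub_read(seq, qual_probs, window=4, cutoff=0.5):
    """Split `seq` into high-quality fragments.

    qual_probs[i] is the predicted probability that position i is
    low quality; windows with mean probability >= cutoff are removed.
    (Illustrative sketch only; parameters are assumptions.)
    """
    # Decide, per fixed-size window, whether to keep it.
    keep = []
    for start in range(0, len(seq), window):
        chunk = qual_probs[start:start + window]
        keep.append(sum(chunk) / len(chunk) < cutoff)

    # Emit the kept stretches as separate read fragments.
    fragments, current = [], []
    for i, base in enumerate(seq):
        if keep[i // window]:
            current.append(base)
        elif current:
            fragments.append("".join(current))
            current = []
    if current:
        fragments.append("".join(current))
    return fragments
```

For example, a read whose middle window is predicted low quality is split into the two flanking fragments, which then enter assembly independently.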
Comparing assembly strategies for third-generation sequencing technologies across different genomes
The recent advent of long-read sequencing technologies, such as Pacific Biosciences (PacBio) and Oxford Nanopore Technology (ONT), has led to substantial improvements in accuracy and computational cost. However, de novo whole-genome assembly still presents significant challenges related to computational cost and the quality of the results. Meanwhile, sequencing accuracy and throughput continue to improve, and new tools are constantly emerging. Therefore, selecting the correct sequencing platform, the proper sequencing depth, and the right assembly tools is necessary to produce a high-quality assembly. This paper evaluates primary assembly reconstruction by recent hybrid and non-hybrid pipelines on different genomes. We find that PacBio high-fidelity (HiFi) long reads play an essential role in haplotype construction compared with ONT reads. However, we observe a substantial improvement in assembly correctness from high-fidelity ONT datasets, and from combining them with HiFi or short reads. Funding for open access charge: Universidad de Málaga / CBUA
Methods for Epigenetic Analyses from Long-Read Sequencing Data
Epigenetics, particularly the study of DNA methylation, is a cornerstone field for our understanding of human development and disease.
DNA methylation has been included in the "hallmarks of cancer" due to its important function as a biomarker and its contribution to carcinogenesis and cancer cell plasticity.
Long-read sequencing technologies, such as the Oxford Nanopore Technologies platform, have evolved the study of structural variations, while at the same time allowing direct measurement of DNA methylation on the same reads.
With this, new avenues of analysis have opened up, such as long-range allele-specific methylation analysis, methylation analysis on structural variations, or relating nearby epigenetic modalities on the same read to another.
Basecalling and methylation calling of Nanopore reads is a computationally expensive task which requires complex machine learning architectures.
Read-level methylation calls require different approaches to data management and analysis than those developed for methylation frequencies measured from short-read technologies or array data.
Because DNA methylation calls are two-dimensional, associated with both a read and a genomic position, and include methylation-caller uncertainties, they are much more costly to store than one-dimensional methylation frequencies.
Methods for storage, retrieval, and analysis of such data therefore require careful consideration.
Downstream analysis tasks, such as methylation segmentation or differential methylation calling, have the potential to benefit from read-level information and to allow uncertainty propagation.
These avenues had not been considered in existing tools.
In my work, I explored the potential of long-read DNA methylation analysis and tackled some of the challenges of data management and downstream analysis using state-of-the-art software architecture and machine learning methods.
I defined a storage standard for reference-anchored, read-assigned DNA methylation calls, including methylation-calling uncertainties and read annotations such as haplotype or sample information.
This storage container is defined as a schema for the Hierarchical Data Format version 5 (HDF5), includes an index for rapid access by genomic coordinate, and is optimized for parallel computing with even load balancing.
It further includes a Python API for creation, modification, and data access, with convenience functions for extracting important quality statistics via a command-line interface.
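The access pattern such a container supports can be sketched in miniature. The thesis uses an HDF5 schema; here, a plain in-memory store with a sorted coordinate index illustrates the same idea: each call keeps its read id, haplotype, and the caller's uncertainty (e.g. a log-likelihood ratio), and a binary-searched index makes genomic range queries fast. All names below are illustrative assumptions, not the actual API.

```python
import bisect
from collections import defaultdict

class MethylationStore:
    """Toy reference-anchored, read-assigned call store (illustrative)."""

    def __init__(self):
        self.calls = defaultdict(list)   # chrom -> list of (pos, read, hap, llr)
        self.index = {}                  # chrom -> sorted positions

    def add(self, chrom, pos, read_id, haplotype, llr):
        # llr carries the caller's uncertainty into downstream analysis.
        self.calls[chrom].append((pos, read_id, haplotype, llr))

    def build_index(self):
        # Sort once, then range queries cost O(log n) via bisect.
        for chrom in self.calls:
            self.calls[chrom].sort()
            self.index[chrom] = [c[0] for c in self.calls[chrom]]

    def query(self, chrom, start, end):
        """Return all calls with start <= pos < end."""
        positions = self.index.get(chrom, [])
        lo = bisect.bisect_left(positions, start)
        hi = bisect.bisect_left(positions, end)
        return self.calls[chrom][lo:hi]
```

An HDF5 layout would additionally chunk the sorted call arrays so that parallel workers can load disjoint coordinate ranges with even balancing, but the query logic is the same.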
Furthermore, I developed software solutions for the segmentation and differential methylation testing of DNA methylation calls from Nanopore sequencing.
This implementation takes advantage of the performance benefits provided by my high performance storage container.
It includes a Bayesian methylome segmentation algorithm which allows for the consensus instance segmentation of multiple sample- and/or haplotype-assigned DNA methylation profiles while accounting for methylation-calling uncertainties.
Based on this segmentation, the software can then perform differential methylation testing and provides a large number of options for statistical testing and multiple testing correction.
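A greatly simplified stand-in for such a per-segment test can show how read-level uncertainty propagates: instead of hard 0/1 calls, each read contributes its methylation probability, so the per-sample methylated counts are expected counts, and a two-proportion z-test compares the samples. The actual software offers many more tests and multiple-testing corrections; this sketch, including the function name, is an assumption for illustration.

```python
import math

def diff_meth_test(probs_a, probs_b):
    """Compare methylation between two samples over one segment.

    probs_* : per-read P(methylated) for calls inside the segment.
    Returns (difference in methylation rate, two-sided p-value).
    Illustrative sketch; not the thesis's actual test.
    """
    n_a, n_b = len(probs_a), len(probs_b)
    # Summing probabilities (not hard calls) propagates caller uncertainty
    # into the expected methylated counts.
    m_a, m_b = sum(probs_a), sum(probs_b)
    p_a, p_b = m_a / n_a, m_b / n_b
    p_pool = (m_a + m_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal tail
    return p_a - p_b, p_value
```

With confident calls (probabilities near 0 or 1) this reduces to an ordinary two-proportion test; with uncertain calls the effective counts shrink toward the middle and the test becomes appropriately more conservative.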
I benchmarked all tools on both simulated and publicly available real data and show their performance benefits compared with previously existing and concurrently developed solutions.
Next, I applied the methods to a cancer study on a chromothriptic cancer sample from a patient with Sonic Hedgehog Medulloblastoma.
Here, I report regulatory genomic regions that are differentially methylated before and after treatment, allele-specific methylation in the tumor, and methylation on chromothriptic structures.
Finally, I developed specialized methylation callers for the combined DNA methylation profiling of CpG, GpC, and context-free adenine methylation.
These callers can be used to measure chromatin accessibility in a NOMe-seq like setup, showing the potential of long-read sequencing for the profiling of transcription factor co-binding.
In conclusion, this thesis presents and subsequently benchmarks new algorithmic and infrastructural solutions for the analysis of DNA methylation data from long-read sequencing.
High quality genome assemblies of Mycoplasma bovis using a taxon-specific Bonito basecaller for MinION and Flongle long-read nanopore sequencing
Implementation of third-generation sequencing approaches for whole-genome sequencing (WGS) all-in-one diagnostics in human and veterinary medicine requires the rapid and accurate generation of consensus genomes. Over the last years, Oxford Nanopore Technologies (ONT) released various new devices (e.g. the Flongle R9.4.1 flow cell) and bioinformatics tools (e.g. the Bonito basecaller, released in 2019), allowing cheap, user-friendly introduction into various NGS workflows. While single-read accuracy, overall consensus accuracy, and the completeness of genome sequences have improved dramatically, further improvements are required when working with infrequently sequenced organisms like Mycoplasma bovis. As an important primary respiratory pathogen in cattle, rapid M. bovis diagnostics are crucial to allow timely and targeted disease control and prevention. Current complete diagnostics (including identification, strain typing, and antimicrobial resistance (AMR) detection) require combined culture-based and molecular approaches, of which the first can take 1–2 weeks. At present, cheap and quick long-read all-in-one WGS approaches can only be implemented if increased accuracies and genome completeness can be obtained.
A Framework for Designing Efficient Deep Learning-Based Genomic Basecallers
Nanopore sequencing generates noisy electrical signals that need to be converted into a standard string of DNA nucleotide bases using a computational step called basecalling. The accuracy and speed of basecalling have critical implications for all later steps in genome analysis. Many researchers adopt complex deep learning-based models to perform basecalling without considering the compute demands of such models, which leads to slow, inefficient, and memory-hungry basecallers. Therefore, there is a need to reduce the computation and memory cost of basecalling while maintaining accuracy. Our goal is to develop a comprehensive framework for creating deep learning-based basecallers that provide high efficiency and performance. We introduce RUBICON, a framework to develop hardware-optimized basecallers. RUBICON consists of two novel machine-learning techniques that are specifically designed for basecalling. First, we introduce the first quantization-aware basecalling neural architecture search (QABAS) framework to specialize the basecalling neural network architecture for a given hardware acceleration platform while jointly exploring and finding the best bit-width precision for each neural network layer. Second, we develop SkipClip, the first technique to remove the skip connections present in modern basecallers to greatly reduce resource and storage requirements without any loss in basecalling accuracy. We demonstrate the benefits of RUBICON by developing RUBICALL, the first hardware-optimized basecaller that performs fast and accurate basecalling. Compared to the fastest state-of-the-art basecaller, RUBICALL provides a 3.96x speedup with 2.97% higher accuracy. We show that RUBICON helps researchers develop hardware-optimized basecallers that are superior to expert-designed models.
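The per-layer bit-width idea behind a quantization-aware search such as QABAS can be sketched with a hypothetical example of uniform symmetric weight quantization. A search framework would choose `bits` per layer to trade accuracy against hardware cost; the sketch below only shows what a given bit-width does to a layer's weights, and the function name is an illustrative assumption.

```python
def quantize_weights(weights, bits):
    """Uniformly quantize `weights` to signed `bits`-bit integers,
    returning the dequantized (rounded) values and the scale factor.
    Illustrative sketch of per-layer bit-width quantization."""
    qmax = 2 ** (bits - 1) - 1               # e.g. 127 for 8 bits
    scale = max(abs(w) for w in weights) / qmax
    quantized = [round(w / scale) for w in weights]
    return [q * scale for q in quantized], scale
```

Lower bit-widths shrink the integer grid (e.g. 4 bits gives only 15 signed levels), so the rounding error per weight grows; a quantization-aware search measures that accuracy loss per layer and assigns wider precision only where the network is sensitive to it.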
Examining bacterial variation with genome graphs and Nanopore sequencing
A bacterial species' genetic content can be remarkably fluid. The collection of genes found within a given species is called the pan-genome and is generally much larger than the gene repertoire of a single cell. A consequence of this pan-genome is that bacterial genomes are highly adaptable and thus variable.
The dominant paradigm for analysing genetic variation relies on a central idea: all genomes in a species can be described as minor differences from a single reference genome, which serves as a coordinate system. As an introduction to this thesis, we outline why this approach is inadequate for bacteria and describe a new approach using genome graphs.
In the first chapter, we present algorithms for de novo variant discovery within such genome graphs and evaluate their performance with empirical data. The remaining chapters address a question relating to a critical bacterial pathogen: can Nanopore sequencing of Mycobacterium tuberculosis provide high-quality public health information? We collect data from Madagascar, South Africa, and England to help answer this question. First, we assess outbreaks identified using single-reference and genome graph methods. Second, we evaluate antimicrobial resistance predictions and introduce a framework for using genome graphs to improve current methods. Lastly, we train an M. tuberculosis-specific Nanopore basecalling model with considerable accuracy improvement.
Together, this thesis provides general methods for uncovering bacterial variation and applies them to an important global public health question. EMBL International PhD Programme.