7 research outputs found

    Toward a Dynamic Threshold for Quality-Score Distortion in Reference-Based Alignment

    The intrinsically high-entropy metadata known as quality scores are largely responsible for the substantial size of sequence data files. Yet there is no consensus on a viable reduction of the resolution of the quality-score scale, arguably because of collateral side effects. In this paper we leverage the penalty functions of the HISAT2 aligner to rebin the quality-score scale in such a way as to avoid any impact on sequence alignment, identifying a distortion threshold along the way. We tested our findings on whole-genome and RNA sequencing data, and contrasted the results with three methods for lossy distortion of quality scores.
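    Rebinning maps the full Phred scale onto a few representative values, shrinking the quality alphabet before compression. A minimal sketch of the idea follows; the bin boundaries and representatives below are hypothetical placeholders, not the thresholds the paper derives from HISAT2's penalty functions.

```python
# Illustrative quality-score rebinning: collapse the Phred scale (0-41)
# onto a few representative values. The bins below are hypothetical;
# the paper derives its bins from HISAT2's alignment penalty functions.
BINS = [(0, 9, 6), (10, 19, 15), (20, 29, 24), (30, 41, 36)]

def rebin(q: int) -> int:
    """Return the representative value for Phred score q."""
    for lo, hi, rep in BINS:
        if lo <= q <= hi:
            return rep
    raise ValueError(f"Phred score out of range: {q}")

def rebin_read(quals: str) -> str:
    """Rebin an ASCII-encoded (Phred+33) quality string."""
    return "".join(chr(rebin(ord(c) - 33) + 33) for c in quals)
```

    With fewer distinct symbols, a downstream entropy coder sees lower per-symbol entropy and achieves better compression; the paper's contribution is choosing bins that leave the aligner's decisions unchanged.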

    Better quality score compression through sequence-based quality smoothing

    Current NGS techniques are becoming exponentially cheaper. As a result, genomic data are growing exponentially, a growth unfortunately not matched by storage capacity, making compression a necessity. Most of the entropy of NGS data lies in the quality values associated with each read, and those values are often more diversified than necessary. For this reason, many tools, such as Quartz or GeneCodeq, try to change (smooth) quality scores in order to improve compressibility without altering the important information they carry for downstream analyses such as SNP calling.
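    Dictionary-based smoothing in the spirit of Quartz replaces the quality of a base with a fixed high value whenever its surrounding k-mer is "trusted" (frequent in a reference corpus), shrinking the quality alphabet. This is a toy sketch; `K`, `TRUSTED`, and the fill value are illustrative assumptions, not any tool's actual dictionary or parameters.

```python
# Toy dictionary-based quality smoothing: if a base's surrounding k-mer
# is in a trusted set, replace its quality with a fixed high value.
# TRUSTED and K are illustrative; real tools (e.g. Quartz) build the
# dictionary from large corpora of frequent k-mers.
K = 3
TRUSTED = {"ACG", "CGT", "GTA"}

def smooth(seq: str, quals: list[int], fill: int = 40) -> list[int]:
    """Return a smoothed copy of quals for the read seq."""
    out = list(quals)
    for i in range(len(seq) - K + 1):
        if seq[i:i + K] in TRUSTED:          # k-mer looks error-free
            for j in range(i, i + K):
                out[j] = fill                 # collapse to one symbol
    return out
```

    Runs of identical quality values compress far better than the original diverse ones, while bases whose context is untrusted keep their original scores for the variant caller.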

    DUDE-Seq: Fast, Flexible, and Robust Denoising for Targeted Amplicon Sequencing

    Full text link
    We consider the correction of errors in nucleotide sequences produced by next-generation targeted amplicon sequencing. Next-generation sequencing (NGS) platforms can provide a great deal of sequencing data thanks to their high throughput, but the associated error rates often tend to be high. Denoising in high-throughput sequencing has thus become a crucial process for boosting the reliability of downstream analyses. Our methodology, named DUDE-Seq, is derived from a general setting of reconstructing finite-valued source data corrupted by a discrete memoryless channel, and effectively corrects substitution and homopolymer indel errors, the two major types of sequencing errors in most high-throughput targeted amplicon sequencing platforms. Our experimental studies with real and simulated datasets suggest that the proposed DUDE-Seq not only outperforms existing alternatives in terms of error-correction capability and time efficiency, but also boosts the reliability of downstream analyses. Further, the flexibility of DUDE-Seq enables its robust application to different sequencing platforms and analysis pipelines by simple updates of the noise model. DUDE-Seq is available at http://data.snu.ac.kr/pub/dude-seq.
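    The discrete-denoiser family DUDE-Seq builds on works in two passes: first count how often each symbol occurs in each two-sided context, then revisit each position and correct symbols that disagree with their context's statistics. The sketch below is a toy majority-vote variant of that context-counting idea only; the actual DUDE rule additionally weighs the channel (error) matrix and a loss function, which are omitted here.

```python
from collections import Counter, defaultdict

def toy_context_denoise(seq: str, k: int = 1, threshold: float = 0.8) -> str:
    """Toy two-pass context denoiser. Pass 1 counts symbols per
    (left, right) context of width k; pass 2 replaces a symbol with its
    context's dominant symbol when that symbol's empirical frequency
    reaches `threshold`. Real DUDE also uses the channel matrix."""
    counts = defaultdict(Counter)
    for i in range(k, len(seq) - k):
        ctx = (seq[i - k:i], seq[i + 1:i + k + 1])
        counts[ctx][seq[i]] += 1
    out = list(seq)
    for i in range(k, len(seq) - k):
        ctx = (seq[i - k:i], seq[i + 1:i + k + 1])
        total = sum(counts[ctx].values())
        sym, n = counts[ctx].most_common(1)[0]
        if n / total >= threshold:
            out[i] = sym
    return "".join(out)
```

    On a repetitive read, a lone substitution surrounded by a context that overwhelmingly predicts another base gets flipped back, while positions with ambiguous contexts are left untouched.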

    Lossy compression of quality scores in differential gene expression: A first assessment and impact analysis

    High-throughput sequencing of RNA molecules has enabled the quantitative analysis of gene expression at the expense of storage space and processing power. To alleviate these problems, lossy compression methods for the quality scores associated with RNA sequencing data have recently been proposed, and the evaluation of their impact on downstream analyses is gaining attention. In this context, this work presents a first assessment of the impact of lossily compressed quality scores in RNA sequencing data on the performance of some of the most recent tools used for differential gene expression.

    Compression algorithms for biomedical signals and nanopore sequencing data

    The massive generation of biological digital information creates various computing challenges, such as its storage and transmission. For example, biomedical signals, such as electroencephalograms (EEG), are recorded by multiple sensors over long periods of time, resulting in large volumes of data. Another example is genomic DNA sequencing data, where the amount of data generated globally is seeing explosive growth, leading to increasing needs for processing, storage, and transmission resources. In this thesis we investigate the use of data compression techniques for this problem, in two different scenarios where computational efficiency is crucial. First we study the compression of multi-channel biomedical signals. We present a new lossless data compressor for multi-channel signals, GSC, which achieves compression performance similar to the state of the art while being more computationally efficient than other available alternatives. The compressor uses two novel integer-based implementations of the predictive coding and expert advice schemes for multi-channel signals. We also develop a version of GSC optimized for EEG data. This version manages to significantly lower compression times while attaining similar compression performance for that specific type of signal. In a second scenario we study the compression of DNA sequencing data produced by nanopore sequencing technologies. We present two novel lossless compression algorithms specifically tailored to nanopore FASTQ files. ENANO is a reference-free compressor, which mainly focuses on the compression of quality scores. It achieves state-of-the-art compression performance while being fast and having low memory consumption compared to other popular FASTQ compression tools. On the other hand, RENANO is a reference-based compressor, which improves on ENANO by providing a more efficient base call sequence compression component.
For RENANO, two algorithms are introduced, corresponding to the following scenarios: a reference genome is available at no cost to both the compressor and the decompressor; and the reference genome is available only on the compressor side, with a compacted version of the reference included in the compressed file. Both algorithms of RENANO significantly improve the compression performance of ENANO, with similar compression times and higher memory requirements.

    Compression and interoperable representation of genomic information
