Compression algorithms for biomedical signals and nanopore sequencing data
The massive generation of biological digital information creates various computing
challenges such as its storage and transmission. For example, biomedical
signals, such as electroencephalograms (EEG), are recorded by multiple sensors over
long periods of time, resulting in large volumes of data. Another example is genome
DNA sequencing data, where the amount of data generated globally is seeing explosive
growth, leading to increasing needs for processing, storage, and transmission
resources. In this thesis we investigate the use of data compression techniques for
this problem, in two different scenarios where computational efficiency is crucial.
First we study the compression of multi-channel biomedical signals. We present
a new lossless data compressor for multi-channel signals, GSC, which achieves compression
performance similar to the state of the art, while being more computationally
efficient than other available alternatives. The compressor uses two novel
integer-based implementations of the predictive coding and expert advice schemes
for multi-channel signals. We also develop a version of GSC optimized for EEG
data. This version manages to significantly lower compression times while attaining
similar compression performance for that specific type of signal.
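As a rough illustration of integer-based predictive coding for multi-channel signals, the sketch below replaces each sample with its residual against a prediction and then inverts the process exactly. The predictor here (the previous sample in the same channel) is a deliberately simple assumption chosen for illustration; it is not GSC's actual predictor, and the function names `encode_residuals` and `decode_residuals` are hypothetical.

```python
from typing import List


def encode_residuals(channels: List[List[int]]) -> List[List[int]]:
    """Replace each sample by (sample - prediction).

    Assumption for illustration: the prediction is simply the previous
    sample of the same channel (0 for the first sample). Only integer
    arithmetic is used, so the transform is exactly invertible.
    """
    residuals = []
    for samples in channels:
        prev = 0
        row = []
        for x in samples:
            row.append(x - prev)  # residual to be entropy-coded later
            prev = x
        residuals.append(row)
    return residuals


def decode_residuals(residuals: List[List[int]]) -> List[List[int]]:
    """Invert encode_residuals by accumulating residuals per channel."""
    channels = []
    for row in residuals:
        prev = 0
        samples = []
        for r in row:
            prev += r  # prediction plus residual recovers the sample
            samples.append(prev)
        channels.append(samples)
    return channels


signal = [[100, 102, 101, 105], [-3, -3, -2, 0]]
encoded = encode_residuals(signal)
restored = decode_residuals(encoded)
```

Slowly varying signals yield small residuals, which a subsequent entropy coder can represent in fewer bits than the raw samples. A practical compressor would use a better predictor, possibly drawing on other channels as well.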
In a second scenario we study the compression of DNA sequencing data produced
by nanopore sequencing technologies. We present two novel lossless compression algorithms
specifically tailored to nanopore FASTQ files. ENANO is a reference-free
compressor, which mainly focuses on the compression of quality scores. It achieves
state of the art compression performance, while being fast and with low memory
consumption when compared to other popular FASTQ compression tools. On the
other hand, RENANO is a reference-based compressor, which improves on ENANO,
by providing a more efficient base call sequence compression component. For RENANO
two algorithms are introduced, corresponding to the following scenarios: a
reference genome is available without cost to both the compressor and the decompressor;
and the reference genome is available only on the compressor side, and a
compacted version of the reference is included in the compressed file. Both algorithms
of RENANO significantly improve the compression performance of ENANO,
with similar compression times, and higher memory requirements.
The massive generation of biological digital information gives rise to multiple
computing challenges, such as its storage and transmission. For example, biomedical
signals, such as electroencephalograms (EEG), are generated by multiple sensors
recording measurements simultaneously over long periods of time, producing large
volumes of data. Another example is DNA sequencing data, where the amount of data
worldwide is growing explosively, creating a great need for processing, storage, and
transmission resources. In this thesis we investigate how to apply data compression
techniques to address this problem, in two different scenarios where computational
efficiency plays an important role.
First we study the compression of multi-channel biomedical signals. We begin by
presenting a new lossless data compressor for multi-channel signals, GSC, which
achieves state-of-the-art compression levels while being more computationally
efficient than other available alternatives. The compressor uses two new
implementations of the predictive coding and expert advice schemes for multi-channel
signals, based on integer arithmetic. We also present a version of GSC optimized for
EEG data. This version significantly reduces compression times without significantly
degrading compression levels for EEG data.
In a second scenario we study the compression of DNA sequencing data generated by
nanopore sequencing technologies. Here we present two new lossless compression
algorithms, specifically designed for FASTQ files generated by nanopore technology.
ENANO is a reference-free compressor, focused mainly on the compression of base
quality scores. ENANO achieves state-of-the-art compression levels while being more
computationally efficient than other popular FASTQ compression tools. RENANO, on the
other hand, is a reference-based compressor that improves on the performance of
ENANO by means of a new compression scheme for the base sequences. We present two
variants of RENANO, corresponding to the following scenarios: (i) a reference genome
is available on both the compressor and the decompressor side, and (ii) a reference
genome is available only on the compressor side, and a compact version of the
reference is included in the compressed file. Both variants of RENANO significantly
improve the compression levels of ENANO, with similar compression times and higher
memory consumption.
Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform
Motivation
The Burrows-Wheeler transform (BWT) is the foundation of many algorithms for
compression and indexing of text data, but the cost of computing the BWT of
very large string collections has prevented these techniques from being widely
applied to the large sets of sequences often encountered as the outcome of DNA
sequencing experiments. In previous work, we presented a novel algorithm that
allows the BWT of human genome scale data to be computed on very moderate
hardware, thus enabling us to investigate the BWT as a tool for the compression
of such datasets.
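For readers unfamiliar with the transform, the following is a minimal sketch of the BWT and its inverse on a single short string. It uses the naive sorted-rotations construction, which is only practical for tiny inputs and is not the large-scale algorithm referred to above. The function names and the `$` sentinel are assumptions made for illustration.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive BWT: sort all rotations of text+sentinel, take the last column.

    Assumption: the sentinel does not occur in the text and sorts before
    every other character (true for '$' versus letters in ASCII).
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed: str, sentinel: str = "$") -> str:
    """Naive inverse: repeatedly prepend the BWT column and re-sort."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(c + row for c, row in zip(transformed, table))
    # The original string is the row that ends with the sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


example = bwt("banana")
roundtrip = inverse_bwt(bwt("GATTACA"))
```

The transform tends to group identical characters into runs (here the two `n` characters and two of the `a` characters end up adjacent), which is what makes the output amenable to a second-stage compressor.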
Results
We first used simulated reads to explore the relationship between the level
of compression and the error rate, the length of the reads and the level of
sampling of the underlying genome and compare choices of second-stage
compression algorithm.
We demonstrate that compression may be greatly improved by a particular
reordering of the sequences in the collection and give a novel `implicit
sorting' strategy that enables these benefits to be realised without the
overhead of sorting the reads. With these techniques, a 45x coverage of real
human genome sequence data compresses losslessly to under 0.5 bits per base,
allowing the 135.3Gbp of sequence to fit into only 8.2Gbytes of space (trimming
a small proportion of low-quality bases from the reads improves the compression
still further).
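The bits-per-base figure can be checked directly from the two sizes quoted above. The short calculation below assumes decimal units (1 Gbyte = 10^9 bytes and 1 Gbp = 10^9 bases); the abstract does not state which convention is used, and the helper name `bits_per_base` is hypothetical.

```python
def bits_per_base(compressed_bytes: float, num_bases: float) -> float:
    """Average number of compressed bits spent on each base."""
    return compressed_bytes * 8 / num_bases


GBYTE = 10**9  # assumption: decimal gigabytes
GBP = 10**9    # gigabase pairs

bpb = bits_per_base(8.2 * GBYTE, 135.3 * GBP)
print(f"{bpb:.3f} bits per base")
```

Under this assumption the result is roughly 0.485 bits per base, consistent with the stated figure of under 0.5. With binary gigabytes (2^30 bytes) the same sizes would give about 0.52, so the decimal reading appears to be the intended one.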
This is more than 4 times smaller than the size achieved by a standard
BWT-based compressor (bzip2) on the untrimmed reads, but an important further
advantage of our approach is that it facilitates the building of compressed
full text indexes such as the FM-index on large-scale DNA sequence collections.
Comment: Version here is as submitted to Bioinformatics and is the same as the
previously archived version. This submission registers the fact that the
advance access version is now available at
http://bioinformatics.oxfordjournals.org/content/early/2012/05/02/bioinformatics.bts173.abstract.
Bioinformatics should be considered the original place of publication of
this article; please cite accordingly.