Compression algorithms for biomedical signals and nanopore sequencing data
The massive generation of biological digital information creates various computing
challenges such as its storage and transmission. For example, biomedical
signals, such as electroencephalograms (EEG), are recorded by multiple sensors over
long periods of time, resulting in large volumes of data. Another example is genomic
DNA sequencing data, whose global volume is growing explosively,
leading to increasing demands on processing, storage, and transmission
resources. In this thesis we investigate the use of data compression techniques to address
this problem, in two different scenarios where computational efficiency is crucial.
First we study the compression of multi-channel biomedical signals. We present
a new lossless data compressor for multi-channel signals, GSC, which achieves compression
performance similar to the state of the art, while being more computationally
efficient than other available alternatives. The compressor uses two novel
integer-based implementations of the predictive coding and expert advice schemes
for multi-channel signals. We also develop a version of GSC optimized for EEG
data. This version manages to significantly lower compression times while attaining
similar compression performance for that specific type of signal.
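The combination of predictive coding and expert advice under integer-only arithmetic can be illustrated with a minimal sketch. The actual predictors and weight-update rule of GSC are not described in this abstract; the three "experts" and the reward scheme below are hypothetical choices chosen only to show the principle of a losslessly invertible, float-free predictor.

```python
# Illustrative sketch of lossless predictive coding with expert advice,
# using integer arithmetic only. The experts and the weight-update rule
# are hypothetical; GSC's actual scheme may differ.

def expert_predictions(history):
    """Three simple integer predictors ('experts') for the next sample."""
    return [
        history[-1],                        # repeat last sample
        2 * history[-1] - history[-2],      # linear extrapolation
        (history[-1] + history[-2]) // 2,   # two-sample average
    ]

def encode_channel(samples):
    """Emit prediction residuals; an entropy coder would compress these."""
    weights = [1, 1, 1]            # integer expert weights
    residuals = list(samples[:2])  # first two samples sent verbatim
    for i in range(2, len(samples)):
        preds = expert_predictions(samples[:i])
        # Integer weighted average of the experts' predictions.
        pred = sum(w * p for w, p in zip(weights, preds)) // sum(weights)
        residuals.append(samples[i] - pred)
        # Reward the expert closest to the actual sample (integer update).
        best = min(range(3), key=lambda k: abs(preds[k] - samples[i]))
        weights[best] += 1
    return residuals

def decode_channel(residuals):
    """Invert encode_channel exactly (lossless)."""
    weights = [1, 1, 1]
    samples = list(residuals[:2])
    for i in range(2, len(residuals)):
        preds = expert_predictions(samples)
        pred = sum(w * p for w, p in zip(weights, preds)) // sum(weights)
        samples.append(residuals[i] + pred)
        best = min(range(3), key=lambda k: abs(preds[k] - samples[i]))
        weights[best] += 1
    return samples
```

Because both sides update the weights from reconstructed samples only, encoder and decoder stay in lockstep, which is what makes the scheme lossless without transmitting the weights.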
In a second scenario we study the compression of DNA sequencing data produced
by nanopore sequencing technologies. We present two novel lossless compression algorithms
specifically tailored to nanopore FASTQ files. ENANO is a reference-free
compressor, which mainly focuses on the compression of quality scores. It achieves
state-of-the-art compression performance, while being fast and having low memory
consumption compared to other popular FASTQ compression tools. On the
other hand, RENANO is a reference-based compressor, which improves on ENANO,
by providing a more efficient base call sequence compression component. For RENANO
two algorithms are introduced, corresponding to the following scenarios: (i) a
reference genome is available without cost to both the compressor and the decompressor;
and (ii) the reference genome is available only on the compressor side, and a
compacted version of the reference is included in the compressed file. Both algorithms
of RENANO significantly improve the compression performance of ENANO,
with similar compression times, and higher memory requirements.
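The core idea behind compressing FASTQ quality scores, adaptive context modeling feeding an arithmetic coder, can be sketched briefly. ENANO's actual context model is not detailed in this abstract; the order-1 context (previous score only), the add-one smoothing, and the alphabet size below are illustrative assumptions.

```python
import math
from collections import defaultdict

# Illustrative sketch of adaptive context modeling for FASTQ quality
# scores: each score is modeled conditioned on the previous score, and
# the probability estimates would drive an arithmetic coder. ENANO's
# real context and alphabet handling are more elaborate.

def estimated_bits(quality_scores, alphabet_size=64):
    """Ideal arithmetic-code length (in bits) under an order-1 model."""
    counts = defaultdict(lambda: [1] * alphabet_size)  # add-one smoothing
    bits = 0.0
    prev = 0  # fixed initial context
    for q in quality_scores:
        ctx = counts[prev]
        p = ctx[q] / sum(ctx)
        bits += -math.log2(p)   # ideal code length for this symbol
        ctx[q] += 1             # adaptive update after coding
        prev = q
    return bits
```

On repetitive score streams the model quickly concentrates probability mass, so the estimated size falls well below the log2(alphabet) bits per symbol that a memoryless uniform code would need.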
A unified approach to the analysis and design of digital line codes
In most areas of research the variety of possible approaches to analysis and design problems is very large. This is particularly true in the case of digital signal transmission, where various conflicting requirements exist (e.g. minimum bandwidth for maximum information capacity and reliability). The lack of universally adopted analysis and evaluation methods is not due to any uncertainties or deficiencies in theoretical fundamentals; rather, it is a problem of the diversity of criteria, and therefore of the modes of specification, that apply.
The work presented in the thesis is concerned with the creation and evaluation of a universal algorithm suitable for the assessment of digital codes together with a systematic approach to the comparative evaluation of essential structural and spectral features of coding schemes.
The thesis begins with an overview of the basic theoretical principles of line coding as an essential part of channel coding for reliable and efficient digital signal transmission. A general spectral analysis procedure is derived from the finite-state sequential machine model of fixed-length block coders, and is implemented in the form of a computer program. A technique is developed for converting coder rules, given in descriptive form, into the table and matrix form required by the universal specification format of the general spectral analysis procedure.
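The finite-state sequential machine view of a line coder amounts to specifying an output table and a next-state table, which is the tabular form the general procedure consumes. As a small concrete instance (chosen here for illustration; it is not an example taken from the thesis), AMI (alternate mark inversion) can be written this way, and its bounded running digital sum hints at the DC spectral null that a full spectral analysis would confirm.

```python
# Minimal sketch of a line coder as a finite-state sequential machine,
# using AMI as an illustrative example (not a code from the thesis).
# States: 0 = last mark was -1, 1 = last mark was +1.
# Inputs: bits 0/1. Outputs: ternary line symbols in {-1, 0, +1}.
OUTPUT = {(0, 0): 0, (0, 1): +1, (1, 0): 0, (1, 1): -1}
NEXT   = {(0, 0): 0, (0, 1): 1,  (1, 0): 1, (1, 1): 0}

def encode(bits, state=0):
    """Drive the state machine over the input bits, emitting line symbols."""
    symbols = []
    for b in bits:
        symbols.append(OUTPUT[(state, b)])
        state = NEXT[(state, b)]
    return symbols

def running_digital_sum(symbols):
    """A bounded RDS indicates DC balance (a spectral null at f = 0)."""
    rds, trace = 0, []
    for s in symbols:
        rds += s
        trace.append(rds)
    return trace
```

The same two-table format extends directly to fixed-length block codes (n input bits mapped to m output symbols per transition), which is what makes a single universal specification and analysis routine possible.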
A new method for the general classification of codes into categories, according to their complexity levels, is proposed. A modification of the spectral analysis routine into a universal block-code generating scheme is then introduced, and its virtually unlimited capabilities for the design and analysis of new code structures are demonstrated. Following from this, a new method for evaluating the performance of block codes is suggested. It is based on the introduction of an integral parameter, the Information Capacity, which determines the degree of possible spectrum modification for a particular coder specification. Using this method, it is demonstrated how an optimal combination of code structure, spectral features and information capacity can be achieved.
The thesis concludes with a practical example of the application of the generalised analysis procedure, demonstrating the possibility of combining code multiplexing with modification of the spectrum of the line signal. A novel technique, based on the principles of spread spectrum for multichannel transmission, is proposed. It involves a Binary-Multiplexed Coding (BMC) scheme, implemented in a generalised circuit whose performance is investigated and evaluated.