49 research outputs found

    High-throughput DNA sequence data compression

    Compression of DNA sequencing data

    With the release of the latest generations of sequencing machines, the cost of sequencing a whole human genome has dropped to less than US$1,000. Its potential applications in several fields have led to forecasts that the volume of DNA sequencing data will soon surpass that of other data types, such as video. In this dissertation, we present novel data compression technologies aimed at enhancing the storage, transmission, and processing of DNA sequencing data. The first contribution is a method for the compression of aligned reads, i.e., read-out sequence fragments that have been aligned to a reference sequence. The method improves compression by implicitly assembling local parts of the underlying sequences. Compared to the state of the art, our method achieves the best trade-off between memory usage and compressed size. Our second contribution is a method for the quantization and compression of quality scores, i.e., values that quantify the error probability of each read-out base. Specifically, we propose two Bayesian models that are used to precisely control the quantization. With our method it is possible to compress the data down to 0.15 bit per quality score. Notably, we can recommend a particular parametrization for one of our models which, by removing noise from the data as a side effect, does not lead to any degradation in the distortion metric; this parametrization achieves an average rate of 0.45 bit per quality score. The third contribution is the first implementation of an entropy codec compliant with MPEG-G. We show that, compared to the state of the art, our method achieves the best compression ranks on average, and that adding our method to CRAM would be beneficial in terms of both achievable compression and speed. Finally, we provide an overview of the standardization landscape, and in particular of MPEG-G, in which our contributions have been integrated.
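
    As a rough illustration of the rate figures quoted above, the following sketch (not the dissertation's Bayesian models; the bin edges and representative values are illustrative, loosely following Illumina-style 8-level binning) quantizes Phred quality scores into a handful of bins and estimates the resulting bits per quality score as the empirical entropy of the quantized stream, a lower bound on what an ideal entropy coder would achieve.

    from collections import Counter
    from math import log2

    # Illustrative Phred bin edges and one representative value per bin
    # (loosely modelled on Illumina 8-level binning; not the thesis's models).
    BIN_EDGES = [2, 10, 20, 25, 30, 35, 40]
    REPRESENTATIVES = [0, 6, 15, 22, 27, 33, 37, 40]

    def quantize(q):
        """Map a Phred score to its bin's representative value."""
        for i, edge in enumerate(BIN_EDGES):
            if q < edge:
                return REPRESENTATIVES[i]
        return REPRESENTATIVES[-1]

    def rate_bits_per_score(scores):
        """Empirical entropy (bits/symbol) of the quantized stream: a lower
        bound on the rate an ideal entropy coder would achieve."""
        counts = Counter(quantize(q) for q in scores)
        n = len(scores)
        return -sum(c / n * log2(c / n) for c in counts.values())

    # Example: a mostly high-quality stream needs well under 8 bits/score.
    example = [38, 37, 40, 12, 2, 35, 36, 38, 40, 33] * 100
    print(f"{rate_bits_per_score(example):.3f} bits per quality score")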

    3D Medical Image Lossless Compressor Using Deep Learning Approaches

    The ever-increasing importance of accelerated information processing, communication, and storage is a major requirement of the big-data era. With the extensive rise in data availability, easy information acquisition, and growing data rates, efficient handling emerges as a critical challenge, one that persists even with advanced hardware and the availability of multiple Graphics Processing Units (GPUs). Healthcare systems are one of the domains yielding explosive data growth, especially considering modern scanners, which annually produce higher-resolution and more densely sampled medical images with ever-larger storage requirements. An effective compression method would essentially remove the bottleneck in data transmission and storage. Since medical information is critical and plays an influential role in diagnostic accuracy, exact reconstruction with no loss in quality must be guaranteed, which is the main objective of any lossless compression algorithm. Given the revolutionary impact of Deep Learning (DL) methods in solving many tasks with state-of-the-art results, including data compression, tremendous opportunities for contributions open up. While considerable efforts have been made to address lossy compression using learning-based approaches, less attention has been paid to lossless compression. This PhD thesis investigates and proposes novel learning-based approaches for compressing 3D medical images losslessly.

    Firstly, we formulate the lossless compression task as a supervised sequential prediction problem, whereby a model learns a projection function to predict a target voxel given a sequence of samples from its spatially surrounding voxels. Such 3D local sampling efficiently exploits spatial similarities and redundancies in a volumetric medical context. The proposed NN-based predictor is trained to minimise the differences from the original data values, while the residual errors are encoded using arithmetic coding to allow lossless reconstruction.

    Following this, we explore the effectiveness of Recurrent Neural Networks (RNNs) as 3D predictors for learning the mapping function from the spatial medical domain (16-bit depth). We analyse the generalisability and robustness of Long Short-Term Memory (LSTM) models in capturing the 3D spatial dependencies of a voxel's neighbourhood, utilising samples taken from various scanning settings. We evaluate our proposed MedZip models in losslessly compressing unseen Computerized Tomography (CT) and Magnetic Resonance Imaging (MRI) modalities, compared to other state-of-the-art lossless compression standards.

    This work further investigates input configurations and sampling schemes for a many-to-one sequence prediction model, specifically for compressing 3D medical images (16-bit depth) losslessly. The main objective is to determine the optimal practice for enabling the proposed LSTM model to achieve a high compression ratio and fast encoding-decoding performance. We also propose a solution to the problem of non-deterministic environments, allowing models to run in parallel without a significant drop in compression performance. Experimental evaluations against well-known lossless codecs were carried out on datasets acquired by different hospitals, representing different body segments and distinct scanning modalities (i.e. CT and MRI).

    To conclude, we present a novel data-driven sampling scheme utilising weighted gradient scores for training LSTM prediction-based models. The objective is to determine whether some training samples are significantly more informative than others, specifically in medical domains where samples are available on a scale of billions. The effectiveness of models trained with the presented importance sampling scheme was evaluated against alternative strategies such as uniform, Gaussian, and slice-based sampling.
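
    A minimal sketch of the predict-then-encode paradigm described above, with a simple causal neighbour average standing in for the LSTM predictor; the arithmetic-coding stage is replaced by reporting the residuals' empirical entropy, and the synthetic volume is illustrative.

    import numpy as np

    def predict_causal(vol):
        """Predict every voxel as the average of its already-decoded
        neighbours along z-1, y-1 and x-1 (a stand-in for the LSTM)."""
        v = vol.astype(np.int64)
        pred = np.zeros_like(v)
        count = np.zeros_like(v)
        pred[1:, :, :] += v[:-1, :, :]; count[1:, :, :] += 1
        pred[:, 1:, :] += v[:, :-1, :]; count[:, 1:, :] += 1
        pred[:, :, 1:] += v[:, :, :-1]; count[:, :, 1:] += 1
        return pred // np.maximum(count, 1)

    def residual_entropy_bits(vol):
        """Empirical entropy of the prediction residuals in bits per voxel:
        the rate an arithmetic coder would approach on this residual stream."""
        res = vol.astype(np.int64) - predict_causal(vol)
        _, counts = np.unique(res, return_counts=True)
        p = counts / counts.sum()
        return float(-(p * np.log2(p)).sum())

    # Synthetic smooth 16-bit volume: a gradient plus small noise, so the
    # residuals are tiny even though the raw values span ~14 bits.
    rng = np.random.default_rng(0)
    z, y, x = np.meshgrid(np.arange(32), np.arange(32), np.arange(32), indexing="ij")
    vol = (200 * z + 150 * y + 100 * x
           + rng.integers(0, 8, (32, 32, 32))).astype(np.uint16)
    print(f"{residual_entropy_bits(vol):.2f} bits per voxel after prediction")

    Because the predictor only looks at voxels that precede the target in decoding order, the decoder can regenerate every prediction and add back the decoded residual, which is what makes the scheme lossless.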

    Fatigue and drowsiness detection using inertial sensors and electrocardiogram

    Dissertation carried out as a final master's project for the degree of Master in Electronic and Telecommunications Engineering. Interest in monitoring a driver's performance has increased in recent years in order to make roads safer for both drivers and pedestrians. With this in mind, the idea arose of developing a system that monitors a driver's fatigue and drowsiness and alerts him, if needed, about his psychological and physical state. Such monitoring belongs to the domain of Advanced Driver-Assistance Systems (ADAS), which monitor the performance and behaviour of the vehicle as well as the physical and psychological condition of the driver. An ADAS can behave passively, alerting the driver to imminent danger: Lane Departure Warning (LDW) signals an involuntary lane departure, and Forward Collision Warning (FCW) warns of an imminent collision with the vehicle ahead. It can also act directly to ensure the safety of passengers and pedestrians: Autonomous Emergency Braking (AEB) detects an imminent collision and brakes without the driver's intervention, and Lane Keeping Assist (LKA) steers the vehicle to keep it within its lane.

    This dissertation is based on the CardioWheel system, developed by CardioID, and consists of monitoring the driver's ECG signal and recording the motion of the steering wheel during the journey. The ECG signal is acquired with dry electrodes embedded in a conductive leather cover on the steering wheel, which sense the electrical signal caused by the driver's heartbeat while the hands are on the wheel. The steering wheel angle (SWA) is monitored with a three-axis accelerometer placed at the centre of the steering wheel, which records the variations in proper acceleration as the wheel moves; from these accelerations the steering wheel rotation angle can be calculated throughout the journey.

    The data acquired by this system undergo a compression stage before transmission in order to reduce the required bandwidth. Time-domain techniques such as AZTEC, TP, and CORTES are well documented for ECG compression when the main goal is recovering the heart rate, but they proved unsuitable here because they do not preserve the signal features needed to detect fatigue and drowsiness patterns. Of the lossless and lossy techniques evaluated for both ECG and SWA, the hybrid method combining Linear Predictive Coding with Lempel-Ziv-Welch achieved the highest compression ratio among lossless techniques, while the hybrid method using amplitude scaling and the DWT achieved the highest compression ratio among lossy techniques, with a low RMSE.

    The compressed data are transmitted via Bluetooth® Low Energy, available in the CardioWheel system, through a profile developed exclusively for this dissertation that can stream the ECG and accelerometer data in real time; other technologies such as ZigBee or ANT were found to be equally suitable for this purpose. To detect whether the driver is becoming drowsy, machine learning algorithms were evaluated for recognising fatigue and drowsiness patterns in the ECG and accelerometer data received from the steering wheel, with the Karolinska Sleepiness Scale (KSS) serving as the subjective reference for the driver's drowsiness level. Numerous features were extracted to describe the main characteristics of both signals, such as heart rate and R-wave amplitude for the ECG, and time with the wheel static and mean acceleration for the SWA. Regression methods were also tested on this classification problem but did not prove to be the best approach; of all the classification techniques tested, the Support Vector Machine achieved the highest classification accuracy. With these results it would be possible to implement a warning system that alerts the driver about his psychological and physical state, thereby increasing road safety.
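
    A minimal sketch of how the steering wheel rotation angle can be recovered from the in-plane accelerometer axes, under simplifying assumptions not stated in the dissertation: a near-vertical wheel plane, quasi-static motion so that gravity dominates the measurement, and a calibrated zero-angle reference.

    import math

    def steering_angle_deg(ax, ay, zero_offset_deg=0.0):
        """Estimate the wheel angle from the two in-plane acceleration
        components (in g); valid while gravity dominates the measurement."""
        return math.degrees(math.atan2(ax, ay)) - zero_offset_deg

    def unwrap_deg(angles):
        """Remove the +/-180 degree jumps of atan2 so that multi-turn
        steering is tracked continuously across samples."""
        out = [angles[0]]
        for a in angles[1:]:
            d = a - out[-1]
            d -= 360.0 * round(d / 360.0)
            out.append(out[-1] + d)
        return out

    # Wheel turned through 0, 45 and 90 degrees: gravity's projection
    # swings from the ay axis towards the ax axis.
    samples = [(0.0, 1.0), (0.707, 0.707), (1.0, 0.0)]
    print([round(steering_angle_deg(ax, ay), 1) for ax, ay in samples])
    # -> [0.0, 45.0, 90.0]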

    The 1995 Science Information Management and Data Compression Workshop

    This document is the proceedings from the 'Science Information Management and Data Compression Workshop,' which was held on October 26-27, 1995, at the NASA Goddard Space Flight Center, Greenbelt, Maryland. The Workshop explored promising computational approaches for handling the collection, ingestion, archival, and retrieval of large quantities of data in future Earth and space science missions. It consisted of fourteen presentations covering a range of information management and data compression approaches that are being or have been integrated into actual or prototypical Earth or space science data information systems, or that hold promise for such an application. The Workshop was organized by James C. Tilton and Robert F. Cromp of the NASA Goddard Space Flight Center

    Compression algorithms for biomedical signals and nanopore sequencing data

    The massive generation of biological digital information creates various computing challenges, such as its storage and transmission. For example, biomedical signals, such as electroencephalograms (EEG), are recorded by multiple sensors over long periods of time, resulting in large volumes of data. Another example is genome DNA sequencing data, where the amount of data generated globally is seeing explosive growth, leading to increasing needs for processing, storage, and transmission resources. In this thesis we investigate the use of data compression techniques for this problem, in two different scenarios where computational efficiency is crucial.

    First, we study the compression of multi-channel biomedical signals. We present a new lossless data compressor for multi-channel signals, GSC, which achieves compression performance similar to the state of the art while being more computationally efficient than other available alternatives. The compressor uses two novel integer-based implementations of the predictive coding and expert advice schemes for multi-channel signals. We also develop a version of GSC optimized for EEG data, which significantly lowers compression times while attaining similar compression performance for that specific type of signal.

    In a second scenario, we study the compression of DNA sequencing data produced by nanopore sequencing technologies. We present two novel lossless compression algorithms specifically tailored to nanopore FASTQ files. ENANO is a reference-free compressor, which mainly focuses on the compression of quality scores. It achieves state-of-the-art compression performance while being fast and memory-efficient compared to other popular FASTQ compression tools. RENANO, in turn, is a reference-based compressor that improves on ENANO by providing a more efficient base call sequence compression component. For RENANO, two algorithms are introduced, corresponding to the following scenarios: a reference genome is available at no cost to both the compressor and the decompressor; and the reference genome is available only on the compressor side, and a compacted version of the reference is included in the compressed file. Both algorithms of RENANO significantly improve on the compression performance of ENANO, with similar compression times and higher memory requirements.
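
    A minimal sketch of predictive coding with expert advice for one channel, in the spirit of (but not identical to) the schemes GSC builds on: a few integer predictors compete, their forecasts are mixed with exponentially decayed weights based on past absolute error, and the residuals are what an entropy coder would receive. The predictor set and the learning rate are illustrative.

    import math

    # Three simple integer "experts"; GSC's actual predictor set differs.
    EXPERTS = [
        lambda p1, p2: p1,                 # repeat last sample
        lambda p1, p2: 2 * p1 - p2,        # linear extrapolation
        lambda p1, p2: (p1 + p2) // 2,     # short moving average
    ]

    def residuals_with_expert_mixing(samples, eta=0.05):
        """Mix the experts' integer predictions with exponentially decayed
        weights; the returned residuals are what the entropy coder sees.
        A decoder can repeat the same updates from the decoded samples."""
        weights = [1.0] * len(EXPERTS)
        p1 = p2 = 0
        residuals = []
        for x in samples:
            preds = [e(p1, p2) for e in EXPERTS]
            mixed = round(sum(w * p for w, p in zip(weights, preds)) / sum(weights))
            residuals.append(x - mixed)
            # Down-weight each expert by its current absolute error, then
            # renormalize to keep the weights numerically stable.
            weights = [w * math.exp(-eta * abs(x - p)) for w, p in zip(weights, preds)]
            s = sum(weights)
            weights = [w / s for w in weights]
            p2, p1 = p1, x
        return residuals

    # On a smooth signal the extrapolation expert dominates and the
    # residuals become far cheaper to encode than the raw samples.
    sig = [round(100 * math.sin(i / 10)) for i in range(500)]
    res = residuals_with_expert_mixing(sig)
    print("max |residual|:", max(abs(r) for r in res))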

    Compression and interoperable representation of genomic information

    35th Symposium on Theoretical Aspects of Computer Science: STACS 2018, February 28-March 3, 2018, Caen, France

    Lossy Time-Series Transformation Techniques in the Context of the Smart Grid

    Efficient Storage of Genomic Sequences in High Performance Computing Systems

    In this dissertation, we address the challenges of genomic data storage in high performance computing systems. In particular, we focus on developing a referential compression approach for Next Generation Sequencing data stored in FASTQ format files. The amount of genomic data available for researchers to process has increased exponentially, bringing enormous challenges for its efficient storage and transmission. General-purpose compressors offer only limited performance on genomic data, hence the need for specialized compression solutions. Two trends have emerged as alternatives that harness the particular properties of genomic data: non-referential and referential compression. Non-referential compressors offer higher compression ratios than general-purpose compressors, but still below what a referential compressor could theoretically achieve. However, the effectiveness of referential compression depends on selecting a good reference and on having enough computing resources available. This thesis presents one of the first referential compressors for FASTQ files. We first present a comprehensive analytical and experimental evaluation of the most relevant tools for compressing raw genomic data, which led us to identify the main needs and opportunities in this field. As a consequence, we propose a novel compression workflow that aims at improving the usability of referential compressors. Subsequently, we discuss the implementation and performance evaluation of the core of the proposed workflow: a referential compressor for reads in FASTQ format that combines local read-to-reference alignments with a specialized binary-encoding strategy. The compression algorithm, named UdeACompress, achieved very competitive compression ratios compared to the best compressors in the current state of the art, while showing reasonable execution times and memory use. In particular, UdeACompress outperformed all competitors when compressing long reads, typical of the newest sequencing technologies. Finally, we study the main aspects of data-level parallelism in the Intel AVX-512 architecture in order to develop a parallel version of the UdeACompress algorithms and reduce the runtime. Through SIMD programming, we managed to significantly accelerate the main bottleneck found in UdeACompress, the suffix array construction.
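
    A minimal sketch of the referential idea (not UdeACompress's actual alignment or binary format): a read that aligns closely to the reference is stored as a mapping position plus its mismatches, which is far smaller than the raw bases, with a fallback to the literal read otherwise. The naive exact scan below stands in for a real aligner backed by an index such as a suffix array.

    def encode_read(read, reference, max_mismatches=3):
        """Return ('ref', position, [(offset, base), ...]) when the read maps
        to the reference with few substitutions, else ('raw', read)."""
        best = None
        for pos in range(len(reference) - len(read) + 1):
            mismatches = [(i, b) for i, b in enumerate(read)
                          if reference[pos + i] != b]
            if len(mismatches) <= max_mismatches and (
                    best is None or len(mismatches) < len(best[1])):
                best = (pos, mismatches)
        return ("ref", best[0], best[1]) if best else ("raw", read)

    def decode_read(record, reference, read_len):
        """Reconstruct the original read exactly (lossless)."""
        if record[0] == "raw":
            return record[1]
        _, pos, mismatches = record
        read = list(reference[pos:pos + read_len])
        for offset, base in mismatches:
            read[offset] = base
        return "".join(read)

    reference = "ACGTACGTTGCAACGTTAGC"
    read = "CGTTGCTACG"  # matches position 5 with a single substitution
    record = encode_read(read, reference)
    print(record)  # ('ref', 5, [(6, 'T')])
    assert decode_read(record, reference, len(read)) == read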