Humeanism and Exceptions in the Fundamental Laws of Physics
It has been argued that the fundamental laws of physics do not face a ‘problem of provisos’ equivalent to that found in other scientific disciplines (Earman, Roberts and Smith 2002), and that there is only the appearance of exceptions to physical laws if they are confused with differential equations of evolution type (Smith 2002). In this paper I argue that even if this is true, fundamental laws in physics still pose a major challenge to standard Humean approaches to lawhood, as they are not in any obvious sense about regularities in behaviour. A Humean approach to physical laws with exceptions is possible, however, if we adopt a view of laws that takes them to be the algorithms in the algorithmic compressions of empirical data. When this is supplemented with a distinction between lossy and lossless compression, we can explain exceptions in terms of compression artefacts arising in the application of the lossy laws.
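To make the lossy/lossless distinction concrete, here is a minimal sketch in Python (using the standard zlib module and an arbitrary quantization step chosen purely for illustration, not anything from the paper): a lossless round trip reproduces the data exactly, while a lossy one recovers only an approximation, and the residuals play the role of compression artefacts.

import zlib

data = bytes(range(256)) * 4                       # toy "empirical data"

# Lossless: decompression recovers the data exactly.
lossless = zlib.compress(data)
assert zlib.decompress(lossless) == data

# Lossy (illustrative only): quantize each byte to a multiple of 16 before
# compressing; the reconstruction only approximates the original, and the
# per-byte residuals are the "compression artefacts".
quantized = bytes((b // 16) * 16 for b in data)
lossy = zlib.compress(quantized)
restored = zlib.decompress(lossy)
artefacts = [a - b for a, b in zip(data, restored)]
print(len(lossless), len(lossy), max(artefacts))   # compare sizes; lossy is inexact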
Toward smart and efficient scientific data management
Scientific research generates vast amounts of data, and the scale of data has significantly increased with advancements in scientific applications. To manage this data effectively, lossy data compression techniques are necessary to reduce storage and transmission costs. Nevertheless, the use of lossy compression introduces uncertainties related to its performance. This dissertation aims to answer key questions surrounding lossy data compression, such as how the performance changes, how much reduction can be achieved, and how to optimize these techniques for modern scientific data management workflows.
One of the major challenges in adopting lossy compression techniques is the trade-off between data accuracy and compression performance, particularly the compression ratio. This trade-off is not well understood, leading to a trial-and-error approach to selecting appropriate setups. To address this, the dissertation analyzes and estimates the compression performance of two modern lossy compressors, SZ and ZFP, on HPC datasets at various error bounds. The estimation scheme predicts compression ratios from intrinsic metrics collected under a given base error bound, and its effectiveness is confirmed through evaluations on real HPC datasets.
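The abstract does not spell out the estimation model, so the following is only a rough sketch of the general idea under an assumption of my own: collect one intrinsic metric at the base error bound (here, the entropy of the uniform-quantization indices) and extrapolate it to other error bounds by assuming that each doubling of the bound removes roughly one bit per value for smooth fields. It stands in for, and is not, the dissertation's actual predictor.

import numpy as np

def quantization_entropy(data, error_bound):
    # Shannon entropy (bits/value) of uniform quantization indices with bin
    # width 2*error_bound; this plays the role of an "intrinsic metric".
    idx = np.floor(data / (2 * error_bound)).astype(np.int64)
    _, counts = np.unique(idx, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def estimate_ratio(data, base_eb, target_eb, bits_per_value=32):
    # Collect the metric once, at the base error bound ...
    base_bits = quantization_entropy(data, base_eb)
    # ... then extrapolate: assume each doubling of the error bound removes
    # about one bit per value (a heuristic, not the dissertation's model).
    est_bits = max(base_bits - np.log2(target_eb / base_eb), 0.1)
    return bits_per_value / est_bits

field = np.cumsum(np.random.default_rng(0).normal(size=100_000)).astype(np.float32)
print(estimate_ratio(field, base_eb=1e-3, target_eb=1e-2))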
Furthermore, as scientific simulations scale up on HPC systems, the disparity between computation and input/output (I/O) becomes a significant challenge. To overcome this, error-bounded lossy compression has emerged as a solution to bridge the gap between computation and I/O. Nonetheless, the lack of understanding of compression performance hinders the wider adoption of lossy compression. The dissertation aims to address this challenge by examining the complex interaction between data, error bounds, and compression algorithms, providing insights into compression performance and its implications for scientific production.
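Production compressors such as SZ and ZFP rely on prediction and transform stages that are not reproduced here; the sketch below only illustrates the error-bounded contract itself, using plain uniform quantization followed by a generic lossless backend, so that every reconstructed value is guaranteed to lie within the requested absolute error bound.

import zlib
import numpy as np

def compress(data, abs_error_bound):
    # Toy error-bounded lossy compressor: uniform quantization with bin width
    # 2*abs_error_bound, then a generic lossless backend (zlib).
    q = np.round(data / (2 * abs_error_bound)).astype(np.int32)
    return zlib.compress(q.tobytes())

def decompress(blob, abs_error_bound):
    q = np.frombuffer(zlib.decompress(blob), dtype=np.int32)
    return q * (2 * abs_error_bound)

rng = np.random.default_rng(1)
field = np.cumsum(rng.normal(size=1_000_000))      # smooth-ish "simulation" field
eb = 1e-2
blob = compress(field, eb)
recon = decompress(blob, eb)
print("compression ratio:", field.nbytes / len(blob))
print("max error:", np.abs(field - recon).max())   # guaranteed <= eb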
Lastly, the dissertation addresses the performance limitations of progressive data retrieval frameworks for post-hoc data analytics on full-resolution scientific simulation data. Existing frameworks suffer from over-pessimistic error control theory, leading to fetching more data than necessary for recomposition, resulting in additional I/O overhead. To enhance the performance of progressive retrieval, deep neural networks are leveraged to optimize the error control mechanism, reducing unnecessary data fetching and improving overall efficiency.
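As a rough illustration of that mechanism (with a hand-rolled increment decomposition, and an oracle error estimator standing in for the dissertation's learned one), a progressive retriever fetches increasingly fine increments and stops once its error estimate drops below the analysis tolerance; an over-pessimistic estimate keeps fetching increments the analysis no longer needs.

import numpy as np

rng = np.random.default_rng(2)
field = np.cumsum(rng.normal(size=10_000))

# Decompose the field into progressively finer increments: the residuals
# between approximations at successively halved quantization bin widths.
bin_widths = [2.0 ** -k for k in range(12)]
approx = [np.round(field / w) * w for w in bin_widths]
increments = [approx[0]] + [approx[k] - approx[k - 1] for k in range(1, len(bin_widths))]

def retrieve(tolerance, error_estimate):
    # Fetch increments until the error estimate drops below the tolerance.
    partial = np.zeros_like(field)
    for k, inc in enumerate(increments):
        partial = partial + inc                    # simulated I/O of one increment
        if error_estimate(k) <= tolerance:
            return partial, k + 1
    return partial, len(increments)

pessimistic = lambda k: bin_widths[k]               # loose worst-case bound
oracle = lambda k: np.abs(field - approx[k]).max()  # stand-in for a learned estimate

tol = 1e-2
_, n_pess = retrieve(tol, pessimistic)
recon, n_tight = retrieve(tol, oracle)
print(n_pess, n_tight, np.abs(field - recon).max())   # tighter control fetches fewer increments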
By tackling these challenges and providing insights, this dissertation contributes to the advancement of scientific data management, lossy data compression techniques, and HPC progressive data retrieval frameworks. The findings and methodologies presented pave the way for more efficient and effective management of large-scale scientific data, facilitating enhanced scientific research and discovery.
In future research, this dissertation highlights the importance of investigating the impact of lossy data compression on downstream analysis. On the one hand, more data reduction can be achieved in scenarios such as image visualization, where the error tolerance is very high, leading to less I/O and communication overhead. On the other hand, post-hoc calculations of physical properties on compressed data may lead to misinterpretation, as the statistical information of such properties might be compromised during compression. Therefore, a comprehensive understanding of the impact of lossy data compression in each specific scenario is vital to ensure accurate analysis and interpretation of results.
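A small illustration of that risk (not taken from the dissertation): a derived quantity such as a finite-difference gradient amplifies point-wise compression error, so an error bound that is harmless for visualization can noticeably shift a post-hoc statistic.

import numpy as np

rng = np.random.default_rng(3)
field = np.cumsum(rng.normal(size=100_000)) / 100.0

def lossy(data, eb):
    return np.round(data / (2 * eb)) * 2 * eb      # toy error-bounded compressor

stat_true = np.mean(np.diff(field) ** 2)           # a derived "physical" statistic
for eb in (1e-4, 1e-1):                            # tight vs visualization-level bound
    recon = lossy(field, eb)
    stat_recon = np.mean(np.diff(recon) ** 2)
    print(eb, stat_true, stat_recon)               # the loose bound distorts the statistic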
Compression algorithms for biomedical signals and nanopore sequencing data
The massive generation of biological digital information creates various computing challenges such as its storage and transmission. For example, biomedical signals, such as electroencephalograms (EEG), are recorded by multiple sensors over long periods of time, resulting in large volumes of data. Another example is genome DNA sequencing data, where the amount of data generated globally is seeing explosive growth, leading to increasing needs for processing, storage, and transmission resources. In this thesis we investigate the use of data compression techniques for this problem, in two different scenarios where computational efficiency is crucial.
First we study the compression of multi-channel biomedical signals. We present a new lossless data compressor for multi-channel signals, GSC, which achieves compression performance similar to the state of the art while being more computationally efficient than other available alternatives. The compressor uses two novel integer-based implementations of the predictive coding and expert advice schemes for multi-channel signals. We also develop a version of GSC optimized for EEG data. This version significantly lowers compression times while attaining similar compression performance for that specific type of signal.
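GSC's actual predictors and weighting scheme are not described in this abstract; the sketch below only illustrates the general idea in integer arithmetic: two simple 'experts' (the previous sample and the co-located sample of a correlated channel) are combined with performance-based integer weights, and only the small residuals would then be entropy-coded.

import numpy as np

def residuals(signal, reference):
    # Integer predictive coding with a simplified expert-advice combination:
    # each expert's share of the prediction grows as its running error shrinks
    # relative to the other's; everything stays in integer arithmetic.
    pred_prev = np.concatenate(([0], signal[:-1]))    # expert 1: previous sample
    pred_ref = reference                              # expert 2: correlated channel
    res = np.empty_like(signal)
    err1 = err2 = 1                                   # running absolute errors
    for t, x in enumerate(signal):
        w1, w2 = err2, err1                           # weight ~ competitor's error
        pred = (w1 * pred_prev[t] + w2 * pred_ref[t]) // (w1 + w2)
        res[t] = x - pred
        err1 += abs(int(x) - int(pred_prev[t]))
        err2 += abs(int(x) - int(pred_ref[t]))
    return res                                        # small residuals are then entropy-coded

rng = np.random.default_rng(4)
ch1 = np.cumsum(rng.integers(-5, 6, size=5_000))
ch2 = ch1 + rng.integers(-2, 3, size=5_000)           # correlated second channel
print(np.abs(ch2).mean(), np.abs(residuals(ch2, ch1)).mean())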
In a second scenario we study the compression of DNA sequencing data produced by nanopore sequencing technologies. We present two novel lossless compression algorithms specifically tailored to nanopore FASTQ files. ENANO is a reference-free compressor which mainly focuses on the compression of quality scores. It achieves state-of-the-art compression performance while being fast and having low memory consumption compared to other popular FASTQ compression tools. RENANO, on the other hand, is a reference-based compressor that improves on ENANO by providing a more efficient base call sequence compression component. For RENANO, two algorithms are introduced, corresponding to the following scenarios: a reference genome is available without cost to both the compressor and the decompressor; and the reference genome is available only on the compressor side, and a compacted version of the reference is included in the compressed file. Both algorithms of RENANO significantly improve on the compression performance of ENANO, with similar compression times and higher memory requirements.
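As a toy illustration of the reference-based idea (not RENANO's actual format), a base call sequence that aligns to the reference can be stored as a position plus its mismatches instead of as verbatim bases:

def encode_read(read, reference, position):
    # Store the alignment position, the read length, and only the mismatches.
    window = reference[position:position + len(read)]
    edits = [(i, b) for i, (b, r) in enumerate(zip(read, window)) if b != r]
    return (position, len(read), edits)

def decode_read(encoded, reference):
    position, length, edits = encoded
    bases = list(reference[position:position + length])
    for i, b in edits:
        bases[i] = b
    return "".join(bases)

reference = "ACGT" * 250
read = reference[100:180].replace("A", "G", 1)         # one substitution
enc = encode_read(read, reference, 100)
assert decode_read(enc, reference) == read
print(enc)                                             # only the edits need to be stored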
Quantitative Evaluation of Dense Skeletons for Image Compression
Skeletons are well-known descriptors used for analysis and processing of 2D binary images. Recently, dense skeletons have been proposed as an extension of classical skeletons, providing a dual encoding for 2D grayscale and color images. Yet their encoding power, measured by the quality and size of the encoded image, and how these metrics depend on the selected encoding parameters, has not been formally evaluated. In this paper, we fill this gap with two main contributions. First, we improve the encoding power of dense skeletons by effective layer selection heuristics, a refined skeleton pixel-chain encoding, and a postprocessing compression scheme. Second, we propose a benchmark to assess the encoding power of dense skeletons for a wide set of natural and synthetic color and grayscale images. We use this benchmark to derive optimal parameters for dense skeletons. Our method, called Compressing Dense Medial Descriptors (CDMD), achieves higher compression ratios at similar quality to the well-known JPEG technique and thereby shows that skeletons can be an interesting option for lossy image encoding.
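CDMD itself encodes selected intensity layers through their medial-axis skeletons, which is not reproduced here; the sketch below only illustrates the layer-selection trade-off that such a benchmark measures: keeping fewer threshold-set layers shrinks the encoding while increasing reconstruction error.

import numpy as np

def reconstruct(selected_levels, shape, layer_masks):
    # Rebuild a grayscale image from a subset of its threshold-set layers:
    # each pixel takes the highest selected intensity whose layer contains it.
    out = np.zeros(shape, dtype=np.uint8)
    for level in sorted(selected_levels):
        out[layer_masks[level]] = level
    return out

rng = np.random.default_rng(5)
img = rng.integers(0, 256, size=(64, 64))
img = (np.cumsum(np.cumsum(img, 0), 1) % 256).astype(np.uint8)   # smooth-ish toy image
layer_masks = {v: img >= v for v in range(256)}                  # upper threshold sets

# A simple layer-selection heuristic: keep every k-th intensity level; fewer
# layers mean a smaller encoding but a larger reconstruction error.
for k in (1, 4, 16):
    selected = range(0, 256, k)
    recon = reconstruct(selected, img.shape, layer_masks)
    mse = float(np.mean((img.astype(float) - recon.astype(float)) ** 2))
    print(len(selected), round(mse, 2))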