776 research outputs found
Toward smart and efficient scientific data management
Scientific research generates vast amounts of data, and the scale of data has significantly increased with advancements in scientific applications. To manage this data effectively, lossy data compression techniques are necessary to reduce storage and transmission costs. Nevertheless, the use of lossy compression introduces uncertainties related to its performance. This dissertation aims to answer key questions surrounding lossy data compression, such as how the performance changes, how much reduction can be achieved, and how to optimize these techniques for modern scientific data management workflows.
One of the major challenges in adopting lossy compression techniques is the trade-off between data accuracy and compression performance, particularly the compression ratio. This trade-off is not well understood, leading to a trial-and-error approach to selecting appropriate setups. To address this, the dissertation analyzes and estimates the compression performance of two modern lossy compressors, SZ and ZFP, on HPC datasets at various error bounds. The proposed scheme predicts compression ratios from intrinsic metrics collected under a given base error bound, and its effectiveness is confirmed through evaluations on real HPC datasets.
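The error-bound/compression-ratio trade-off that the estimation scheme targets can be illustrated with a toy error-bounded compressor: uniform quantization followed by a generic lossless backend. This is a sketch for intuition only, not the SZ or ZFP algorithm; the function name and signal are invented for the example.

```python
import math
import struct
import zlib

def lossy_ratio(values, error_bound):
    """Toy error-bounded lossy compression: quantize, then entropy-code.

    Uniform quantization with bin width 2*error_bound guarantees that
    reconstructing v' = q * 2 * error_bound stays within +/- error_bound
    of the original value. zlib stands in for the lossless backend.
    """
    q = [round(v / (2 * error_bound)) for v in values]
    raw = struct.pack(f"{len(values)}d", *values)      # original doubles
    packed = struct.pack(f"{len(q)}i", *q)             # quantized int32 bins
    compressed = zlib.compress(packed, 9)
    return len(raw) / len(compressed)

# A smooth signal, loosely typical of simulation fields.
signal = [math.sin(i / 50.0) for i in range(10000)]
loose = lossy_ratio(signal, 1e-2)   # high error tolerance -> few bins
tight = lossy_ratio(signal, 1e-6)   # low error tolerance -> many bins
```

Running this shows the monotone behavior the dissertation's estimator exploits: the looser the error bound, the fewer distinct quantization bins, and the higher the achievable ratio.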
Furthermore, as scientific simulations scale up on HPC systems, the disparity between computation and input/output (I/O) becomes a significant challenge. To overcome this, error-bounded lossy compression has emerged as a solution to bridge the gap between computation and I/O. Nonetheless, the lack of understanding of compression performance hinders the wider adoption of lossy compression. The dissertation aims to address this challenge by examining the complex interaction between data, error bounds, and compression algorithms, providing insights into compression performance and its implications for scientific production.
Lastly, the dissertation addresses the performance limitations of progressive data retrieval frameworks for post-hoc data analytics on full-resolution scientific simulation data. Existing frameworks suffer from over-pessimistic error control theory, leading to fetching more data than necessary for recomposition, resulting in additional I/O overhead. To enhance the performance of progressive retrieval, deep neural networks are leveraged to optimize the error control mechanism, reducing unnecessary data fetching and improving overall efficiency.
By tackling these challenges and providing insights, this dissertation contributes to the advancement of scientific data management, lossy data compression techniques, and HPC progressive data retrieval frameworks. The findings and methodologies presented pave the way for more efficient and effective management of large-scale scientific data, facilitating enhanced scientific research and discovery.
In future research, this dissertation highlights the importance of investigating the impact of lossy data compression on downstream analysis. On the one hand, more data reduction can be achieved in scenarios such as image visualization, where the error tolerance is very high, leading to less I/O and communication overhead. On the other hand, post-hoc calculations of physical properties after compression may lead to misinterpretation, as the statistical information of such properties might be compromised during compression. Therefore, a comprehensive understanding of the impact of lossy data compression in each specific scenario is vital to ensure accurate analysis and interpretation of results.
LEARNING-BASED IMAGE COMPRESSION USING MULTIPLE AUTOENCODERS
Advanced video applications in smart environments (e.g., smart cities) bring different
challenges associated with increasingly intelligent systems and demanding
requirements in emerging fields such as urban surveillance, computer vision in
industry, medicine and others. As a consequence, huge amounts of visual data
are captured to be analyzed by task-driven machine algorithms. Due to the
volume of data generated, problems may arise at the data management level; to
overcome this, efficient compression methods are needed to reduce storage
requirements.
This thesis presents research on image compression methods based on deep
learning, analyzing the properties of different algorithms, since these have
recently shown good results in image compression. Convolutional neural
networks are also explained, and a state of the art of autoencoders is presented.
Two compression approaches using autoencoders were studied, implemented and
tested: an object-oriented compression scheme, and algorithms oriented to
high-resolution images (UHD and 360º images). In the first approach, a video
surveillance scenario with objects such as people, cars, faces, bicycles and
motorbikes was considered, and a compression method using autoencoders was
developed so that the decoded images could be delivered for machine vision
processing. In this approach, performance was measured by analyzing
traditional image quality metrics and the accuracy of machine-driven tasks
performed on the decoded images. In the second approach, several
high-resolution images were considered, adapting the method of the previous
approach to use properties of the image, such as variance, gradients or PCA
of the features, instead of the content that the image represents.
Regarding the first approach, in comparison with the Versatile Video Coding
(VVC) standard, the proposed approach achieves significantly better coding
efficiency, e.g., up to 46.7% BD-rate reduction. The accuracy of the machine
vision tasks is also significantly higher when they are performed on visual
objects compressed with the proposed scheme than on the same objects
compressed with VVC. These results demonstrate that the proposed
learning-based approach is a more efficient solution for compressing visual
objects than standard encoding. As for the second approach, although it can
outperform VVC on the test subsets, it only achieves significant gains on
360º images.
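The BD-rate figure quoted above is the standard Bjøntegaard metric for comparing codecs. A common formulation (assumed here, not taken from the thesis) fits a polynomial $r_i(D)$ to log-bitrate as a function of quality $D$ (e.g. PSNR) for each codec and averages the gap over the overlapping quality range:

```latex
\Delta R = \frac{1}{D_{\mathrm{high}} - D_{\mathrm{low}}}
           \int_{D_{\mathrm{low}}}^{D_{\mathrm{high}}}
           \left[\, r_{2}(D) - r_{1}(D) \,\right]\,\mathrm{d}D,
\qquad
\text{BD-rate} = \left(10^{\Delta R} - 1\right) \times 100\%
```

A negative BD-rate means codec 2 needs proportionally fewer bits than codec 1 to reach the same quality, which is the sense in which a "46.7% BD-rate reduction" is reported.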
Compression with Bayesian Implicit Neural Representations
Many common types of data can be represented as functions that map
coordinates to signal values, such as pixel locations to RGB values in the case
of an image. Based on this view, data can be compressed by overfitting a
compact neural network to its functional representation and then encoding the
network weights. However, most current solutions for this are inefficient, as
quantization to low-bit precision substantially degrades the reconstruction
quality. To address this issue, we propose overfitting variational Bayesian
neural networks to the data and compressing an approximate posterior weight
sample using relative entropy coding instead of quantizing and entropy coding
it. This strategy enables direct optimization of the rate-distortion
performance by minimizing the β-ELBO, and allows targeting different
rate-distortion trade-offs for a given network architecture by adjusting
β. Moreover, we introduce an iterative algorithm for learning prior
weight distributions and employ a progressive refinement process for the
variational posterior that significantly enhances performance. Experiments show
that our method achieves strong performance on image and audio compression
while retaining simplicity.
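The β-ELBO objective referred to above can be written out in its rate-distortion reading; the notation below is a standard variational-Bayes formulation and is assumed rather than quoted from the paper:

```latex
\mathcal{L}_{\beta}
  = \underbrace{\mathbb{E}_{q(w)}\!\left[-\log p(\mathcal{D}\mid w)\right]}_{\text{distortion}}
  \;+\; \beta\,\underbrace{D_{\mathrm{KL}}\!\left(q(w)\,\Vert\,p(w)\right)}_{\text{rate}}
```

With relative entropy coding, the expected code length of a posterior weight sample is close to the KL term, so adjusting β directly trades bits spent on the weights against reconstruction error.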
HySpecNet-11k: A Large-Scale Hyperspectral Dataset for Benchmarking Learning-Based Hyperspectral Image Compression Methods
The development of learning-based hyperspectral image compression methods has
recently attracted great attention in remote sensing. Such methods require a
high number of hyperspectral images to be used during training to optimize all
parameters and reach a high compression performance. However, existing
hyperspectral datasets are not sufficient to train and evaluate learning-based
compression methods, which hinders the research in this field. To address this
problem, in this paper we present HySpecNet-11k, a large-scale
hyperspectral benchmark dataset made up of 11,483 non-overlapping image patches.
Each patch is a portion of 128 × 128 pixels with 224 spectral bands and
a ground sample distance of 30 m. We exploit HySpecNet-11k to benchmark the
current state of the art in learning-based hyperspectral image compression by
focussing our attention on various 1D, 2D and 3D convolutional autoencoder
architectures. Beyond compression, HySpecNet-11k can be used for any unsupervised
learning task in the framework of hyperspectral image analysis. The dataset,
our code and the pre-trained weights are publicly available at
https://hyspecnet.rsim.berlin
Comment: Accepted at IEEE International Geoscience and Remote Sensing
Symposium (IGARSS) 2023.
Modality-Agnostic Variational Compression of Implicit Neural Representations
We introduce a modality-agnostic neural data compression algorithm based on a
functional view of data and parameterised as an Implicit Neural Representation
(INR). Bridging the gap between latent coding and sparsity, we obtain compact
latent representations which are non-linearly mapped to a soft gating mechanism
capable of specialising a shared INR base network to each data item through
subnetwork selection. After obtaining a dataset of such compact latent
representations, we directly optimise the rate/distortion trade-off in this
modality-agnostic space using non-linear transform coding. We term this method
Variational Compression of Implicit Neural Representation (VC-INR) and show
both improved performance given the same representational capacity prior to
quantisation and better results than previous quantisation schemes used for
other INR-based techniques. Our experiments demonstrate strong results over a
large set of diverse data modalities using the same algorithm without any
modality-specific inductive biases. We show results on images, climate data, 3D
shapes and scenes as well as audio and video, introducing VC-INR as the first
INR-based method to outperform codecs as well-known and diverse as JPEG 2000,
MP3 and AVC/HEVC on their respective modalities.
Optimization of scientific algorithms in heterogeneous systems and accelerators for high performance computing
Today, general-purpose GPU computing is one of the basic pillars of
high-performance computing. Although there are hundreds of GPU-accelerated
applications, some scientific algorithms remain little studied. The motivation
of this thesis has therefore been to investigate the possibility of
significantly accelerating a set of algorithms from this group on GPUs.
First, an optimized implementation of the CAVLC (Context-Adaptive Variable
Length Coding) video and image compression algorithm was obtained; CAVLC is
the most widely used entropy-coding method in the H.264 video coding standard.
The speedup over the best previous implementation ranges from 2.5x to 5.4x.
This solution can serve as the entropy-coding component of software H.264
encoders, and can be used in video and image compression systems for formats
other than H.264, such as medical imaging.
Second, GUD-Canny, an unsupervised and distributed Canny edge detector, was
developed. The system addresses the main limitations of Canny implementations,
namely the bottleneck caused by the hysteresis process and the use of fixed
hysteresis thresholds. A given image is divided into a set of sub-images and,
for each of them, a pair of hysteresis thresholds is computed in an
unsupervised manner using the Medina-Carnicer method. The detector meets
real-time requirements, taking 0.35 ms on average to detect the edges of a
512x512 image.
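The hysteresis step that GUD-Canny identifies as the bottleneck can be sketched as a flood fill over the gradient-magnitude map; the code below is a minimal sequential illustration (names and grid are invented), not the thesis's GPU implementation, which distributes this work across sub-images with per-tile thresholds.

```python
from collections import deque

def hysteresis(mag, low, high):
    """Classic Canny hysteresis on a 2-D gradient-magnitude grid.

    Pixels >= high are seeds; pixels >= low survive only if they are
    8-connected (transitively) to a seed. This data-dependent flood fill
    is what makes hysteresis hard to parallelize over a whole image.
    """
    h, w = len(mag), len(mag[0])
    edges = [[False] * w for _ in range(h)]
    queue = deque((y, x) for y in range(h) for x in range(w)
                  if mag[y][x] >= high)
    for y, x in queue:                       # mark the strong seeds
        edges[y][x] = True
    while queue:                             # grow along weak pixels
        y, x = queue.popleft()
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                ny, nx = y + dy, x + dx
                if (0 <= ny < h and 0 <= nx < w
                        and not edges[ny][nx] and mag[ny][nx] >= low):
                    edges[ny][nx] = True
                    queue.append((ny, nx))
    return edges
```

Splitting the image into tiles, as GUD-Canny does, bounds each flood fill to a sub-image and lets the thresholds `low` and `high` adapt per tile.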
Third, an optimized implementation of the VLE (Variable-Length Encoding) data
compression method was produced, which is on average 2.6x faster than the best
previous implementation. This solution also includes a new inter-block scan
method that can be used to accelerate the scan operation itself and other
algorithms, such as stream compaction. For the scan operation, a 1.62x speedup
is achieved when the proposed method is used instead of the one in the best
previous VLE implementation.
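The role of the scan operation in VLE is to turn per-symbol code lengths into write offsets, so that every codeword can be packed in parallel. The sequential sketch below shows the exclusive scan itself (function name invented); the thesis's contribution is an inter-block GPU version of this primitive.

```python
def exclusive_scan(lengths):
    """Exclusive prefix sum: offsets[i] = sum of lengths[0..i-1].

    In GPU VLE, scanning the per-symbol codeword bit-lengths yields the
    bit offset at which each codeword is written, which is what makes
    the otherwise sequential bit-packing step parallelizable.
    """
    offsets, total = [], 0
    for n in lengths:
        offsets.append(total)
        total += n
    return offsets, total

# Hypothetical codeword bit-lengths for four symbols.
offsets, total_bits = exclusive_scan([3, 5, 2, 7])
# offsets == [0, 3, 8, 10]; total_bits == 17
```

Once the offsets are known, each thread can write its codeword independently; the same primitive underlies the stream-compaction speedup mentioned above.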
This doctoral thesis concludes with a chapter on future lines of research that
can be pursued building on its contributions.
2022 Review of Data-Driven Plasma Science
Data-driven science and technology offer transformative tools and methods to science. This review article highlights the latest development and progress in the interdisciplinary field of data-driven plasma science (DDPS), i.e., plasma science whose progress is driven strongly by data and data analyses. Plasma is considered to be the most ubiquitous form of observable matter in the universe. Data associated with plasmas can, therefore, cover extremely large spatial and temporal scales, and often provide essential information for other scientific disciplines. Thanks to the latest technological developments, plasma experiments, observations, and computation now produce a large amount of data that can no longer be analyzed or interpreted manually. This trend now necessitates a highly sophisticated use of high-performance computers for data analyses, making artificial intelligence and machine learning vital components of DDPS. This article contains seven primary sections, in addition to the introduction and summary. Following an overview of fundamental data-driven science, five other sections cover widely studied topics of plasma science and technologies, i.e., basic plasma physics and laboratory experiments, magnetic confinement fusion, inertial confinement fusion and high-energy-density physics, space and astronomical plasmas, and plasma technologies for industrial and other applications. The final section before the summary discusses plasma-related databases that could significantly contribute to DDPS. Each primary section starts with a brief introduction to the topic, discusses the state-of-the-art developments in the use of data and/or data-scientific approaches, and presents the summary and outlook. Despite recent impressive progress, DDPS is still in its infancy. This article attempts to offer a broad perspective on the development of this field and identify where further innovations are required.
The State of Applying Artificial Intelligence to Tissue Imaging for Cancer Research and Early Detection
Artificial intelligence represents a new frontier in human medicine that
could save more lives and reduce costs, thereby increasing accessibility.
As a consequence, the rate of advancement of AI in cancer medical imaging and
more particularly tissue pathology has exploded, opening it to ethical and
technical questions that could impede its adoption into existing systems. In
order to chart the path of AI in its application to cancer tissue imaging, we
review current work and identify how it can improve cancer pathology
diagnostics and research. In this review, we identify 5 core tasks that models
are developed for, including regression, classification, segmentation,
generation, and compression tasks. We address the benefits and challenges that
such methods face, and how they can be adapted for use in cancer prevention and
treatment. The studies reviewed in this paper represent the beginning of this
field, and future experiments will build on the foundations that we highlight.