7 research outputs found

    Pakkausmenetelmät hajautetussa aikasarjatietokannassa (Compression methods in a distributed time series database)

    The rise of microservices and distributed applications in containerized deployments places an increasing burden on monitoring systems and pushes storage systems to provide suitable performance for large queries. In this paper we present the changes we made to our distributed time series database, Hawkular-Metrics, so that it stores data more effectively in Cassandra. We show that our methods provide significant space savings, ranging from a 50% to 90% reduction in storage usage, while cutting query times by over 90% compared to the nominal approach when using Cassandra. We also present our own algorithm, modified from the Gorilla compression algorithm, which achieves almost three times the compression throughput at an equal compression ratio.
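    The delta-of-delta and XOR ideas at the heart of Gorilla-style time series compression can be sketched as follows (a simplified illustration in Python; this is not the paper's modified algorithm or the Hawkular-Metrics implementation, and it omits the bit-level packing that produces the actual space savings):

```python
import struct

def delta_of_delta_encode(timestamps):
    """Gorilla-style timestamp encoding: store the first timestamp, then
    only the change between successive deltas. Regularly spaced series
    collapse to runs of zeros, which a bit-packer can store in 1 bit each."""
    if not timestamps:
        return []
    out = [timestamps[0]]
    prev, prev_delta = timestamps[0], 0
    for t in timestamps[1:]:
        delta = t - prev
        out.append(delta - prev_delta)  # the delta-of-delta
        prev, prev_delta = t, delta
    return out

def delta_of_delta_decode(encoded):
    """Inverse of the encoder: rebuild deltas, then timestamps."""
    if not encoded:
        return []
    timestamps = [encoded[0]]
    prev_delta = 0
    for dod in encoded[1:]:
        prev_delta += dod
        timestamps.append(timestamps[-1] + prev_delta)
    return timestamps

def xor_encode_values(values):
    """Gorilla value encoding core: XOR each double's bit pattern with its
    predecessor's; slowly changing values yield mostly-zero words."""
    bits = [struct.unpack('<Q', struct.pack('<d', v))[0] for v in values]
    return bits[:1] + [bits[i] ^ bits[i - 1] for i in range(1, len(bits))]
```

    A series sampled every 10 seconds, e.g. [1000, 1010, 1020, 1030], encodes to [1000, 10, 0, 0], which is why monitoring workloads with fixed scrape intervals compress so well.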

    Assessing the effects of data compression in simulations using physically motivated metrics

    Abstract not provided.

    Performance of compressors in scientific data: a comparative study

    Master's dissertation in Informatics. Computing resources have grown steadily over the last decade. This growth leads to increasing amounts of generated scientific data, creating an I/O bottleneck and a storage problem. Simply increasing storage space is not viable, and I/O throughput cannot keep pace with the growing number of execution cores in a system. The scientific community is turning to data compression, both to reduce the storage space used and to alleviate pressure on the I/O subsystem by making better use of computational resources. We present a comparative study of three distinct lossless compressors applied to scientific data. Selecting gzip and LZ4, both general-purpose compressors, and FPC, a floating-point-specific compressor, we assess the performance achieved by the compressors and their respective parallel implementations. MAFISC, an adaptive-filtering compressor for scientific data, is also briefly put to the test. We present a thorough comparison of the compressors' parallel speedup, efficiency, and compression ratios. Parallel compression with pigz can yield an average speedup of 12 with 12 threads, achieving an efficiency close to one. gzip is the most complete compression algorithm, but LZ4 can replace it for faster compression and decompression at the cost of compression ratio. FPC can achieve higher compression ratios and throughput for certain data files. MAFISC accomplishes what it proposes, higher compression ratios, but at the cost of much longer compression times.
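    The kind of comparison the study performs can be reproduced in miniature with the Python standard library (a sketch of the measurement method only: LZ4, FPC, and MAFISC are not in the stdlib, so zlib, gzip's DEFLATE codec, and lzma stand in, and the synthetic payload is illustrative):

```python
import lzma
import random
import time
import zlib

def benchmark(name, compress, decompress, data):
    """Measure the three quantities the study compares: compression
    ratio, compression throughput, and decompression throughput."""
    t0 = time.perf_counter()
    packed = compress(data)
    t1 = time.perf_counter()
    assert decompress(packed) == data  # lossless round trip
    t2 = time.perf_counter()
    mb = len(data) / 1e6
    return {'name': name,
            'ratio': len(data) / len(packed),
            'comp_mb_s': mb / (t1 - t0),
            'decomp_mb_s': mb / (t2 - t1)}

# Synthetic payload: bytes drawn from a narrow range, so the data is
# compressible but not trivially so (vaguely like noisy sensor output).
random.seed(0)
data = bytes(random.randrange(128, 192) for _ in range(500_000))

results = [
    benchmark('zlib-6', lambda d: zlib.compress(d, 6), zlib.decompress, data),
    benchmark('lzma-1', lambda d: lzma.compress(d, preset=1), lzma.decompress, data),
]
for r in results:
    print(f"{r['name']}: ratio {r['ratio']:.2f}, "
          f"{r['comp_mb_s']:.0f} MB/s in, {r['decomp_mb_s']:.0f} MB/s out")
```

    The same harness extends to the parallel case by swapping in a multi-threaded codec and dividing wall-clock time into the speedup and efficiency figures the dissertation reports.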

    Provenance Management for Collaborative Data Science Workflows

    Collaborative data science activities are becoming pervasive in a variety of communities, and are often conducted in teams, with people of different expertise performing back-and-forth modeling and analysis on time-evolving datasets. Current data science systems mainly focus on specific steps in the process, such as training machine learning models, scaling to large data volumes, or serving the data or the models, while the issues of end-to-end data science lifecycle management are largely ignored. Such issues include, for example, tracking the provenance and derivation history of models, identifying data processing pipelines and keeping track of their evolution, analyzing unexpected behaviors and monitoring project health, and providing the ability to reason about specific analysis results. We address these challenges by ingesting, managing, and analyzing rich provenance information generated during data science projects, and using it to enable users to easily publish, share, and discover data analytics projects. We first describe the design of our unified provenance and metadata management system, called ProvDB. We adopt a schema-later approach and use a flexible graph-based provenance representation model that combines the core concepts of version control and provenance management. We describe several ingestion mechanisms for this provenance model and show how heterogeneous data analysis environments can be served with natural extensions to this framework. We also describe a set of novel features of the system, including graph queries for retrospective provenance, fileviews for data transformations, introspective queries for debugging, and continuous monitoring queries for anomaly detection. We then illustrate how to support the deep learning modeling lifecycle via the extensibility mechanism in ProvDB. We describe techniques to compactly store and efficiently query the rich set of data artifacts generated during the deep learning modeling lifecycle. We also describe a high-level domain-specific language that helps raise the abstraction level during model exploration and enumeration and accelerates the modeling process. Lastly, we propose graph query operators and develop efficient evaluation techniques to address the verbose and evolving nature of such provenance graphs. First, we introduce a graph segmentation operator, which queries the provenance of a collection of user-given vertices (e.g., versioned files, author names) via flexible boundary criteria. Second, we propose a graph summarization operator to aggregate the results of multiple segmentation operations and allow multi-resolution interaction with the aggregated result, so as to understand similar and abnormal behaviors in those segments.
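    The segmentation operator described above can be approximated in a few lines over a toy provenance graph (all file and script names below are invented for illustration; ProvDB's actual data model and boundary criteria are richer than this hop-count version):

```python
from collections import deque

# Toy provenance graph: edges point from a derived artifact to the
# scripts and inputs that produced it. Names are hypothetical.
derived_from = {
    'model_v2.pkl': ['train.py', 'features_v2.csv'],
    'features_v2.csv': ['featurize.py', 'raw.csv'],
    'features_v1.csv': ['featurize.py', 'raw.csv'],
    'train.py': [], 'featurize.py': [], 'raw.csv': [],
}

def segment(seeds, max_hops):
    """A minimal segmentation-style query: collect the provenance
    neighborhood of the seed vertices, bounded by hop count
    (one simple example of a boundary criterion)."""
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # boundary reached; do not expand further
        for parent in derived_from.get(node, []):
            if parent not in seen:
                seen.add(parent)
                frontier.append((parent, depth + 1))
    return seen

# One hop from the model reaches only its immediate producers.
print(sorted(segment({'model_v2.pkl'}, 1)))
```

    A summarization operator in this style would then aggregate the sets returned by many such segment calls, e.g. grouping them by shared scripts to surface typical versus anomalous pipelines.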

    Preconditioner-based In Situ Data Reduction for End-to-End Throughput Optimization

    Efficient handling of large volumes of data is a necessity for future extreme-scale scientific applications and database systems. To address the growing storage and throughput imbalance between the data production on such systems and their I/O subsystems, reducing the handled data volume by compression is a reasonable approach. However, many scientific datasets compress poorly (referred to as hard-to-compress datasets) due to the highly entropic information represented within the data. Lossless compression efforts on such datasets typically do not yield more than a 20% reduction in size when exact reproduction of the original data is required. Moreover, modern applications of compression for hard-to-compress scientific datasets hinder end-to-end throughput performance due to the overhead costs of data analysis, compression, and reorganization. When the overhead of applying compression exceeds the end-to-end performance gains obtained by the data reduction, a compressor has no practical benefit for scientific systems. A difficult problem in lossless compression, for improving both scientific data reduction efficiency and throughput performance, is to identify the hard-to-compress information and subsequently optimize the compression techniques. To address this challenge, we introduce the In Situ Orthogonal Byte Aggregate Reduction Compression (ISOBAR-compress) methodology as a preconditioner for lossless compression that identifies hard-to-compress datasets and optimizes their compression efficiency and throughput. Out of 24 scientific datasets from both the public domain and peta-scale simulations, ISOBAR-compress accurately identified the 19 that were hard-to-compress. Additionally, ISOBAR-compress improved data reduction by an average of 19% and increased compression and decompression throughput by average speedups of 24.1x and 33.6x, respectively.
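    The core preconditioning idea, deciding per byte column whether compression will pay off, can be sketched as follows (a simplified illustration only: ISOBAR's actual analysis is more sophisticated, and the 7.0 bits/byte threshold here is an assumption, not a figure from the paper):

```python
import math
import struct
import zlib

def byte_columns(doubles):
    """Transpose an array of IEEE-754 doubles into 8 byte columns
    (column 0 = least-significant byte of every value, and so on)."""
    raw = struct.pack(f'<{len(doubles)}d', *doubles)
    return [raw[i::8] for i in range(8)]

def entropy(col):
    """Shannon entropy in bits/byte; values near 8 indicate
    incompressible, noise-like content."""
    counts = [0] * 256
    for b in col:
        counts[b] += 1
    n = len(col)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def precondition(doubles, threshold=7.0):
    """ISOBAR-like preconditioning (simplified): compress only the byte
    columns whose entropy suggests they will actually shrink, and pass
    the high-entropy mantissa columns through untouched."""
    packed = []
    for col in byte_columns(doubles):
        if entropy(col) < threshold:
            packed.append(('z', zlib.compress(col)))
        else:
            packed.append(('raw', col))
    return packed
```

    On smooth data the sign/exponent columns are nearly constant and compress drastically, while skipping the noise-like low-order mantissa columns avoids wasting compressor time on bytes that cannot shrink, which is where the throughput gains come from.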
    Additionally, dataset preconditioning for lossless compression is a promising approach for reducing disk and network I/O activity, addressing the problem of limited I/O bandwidth in current analytic frameworks. Hence, we also introduce a hybrid compression-I/O methodology for interleaving I/O activity with data compression to improve end-to-end throughput performance along with the reduced dataset size. We evaluate several interleaving strategies, present theoretical models, and evaluate the efficiency and scalability of the approach through comparative analysis. The hybrid method, when applied to 19 hard-to-compress scientific datasets, demonstrates a 12% to 46% increase in end-to-end throughput. At the reported peak bandwidth of 60 GB/s of uncompressed data for a current, leadership-class parallel I/O system, this translates into an effective gain of 7 to 28 GB/s in aggregate throughput. Lastly, it is important that scientific applications streamline their end-to-end throughput performance beyond only preconditioning datasets for compression. The concept of applying a preconditioner generalizes to other techniques that optimize performance through data analysis and reorganization. For example, in present-day scientific simulations there is a drive to optimize in situ processing performance by inspecting the layout structure of a generated dataset and then restructuring the content. Typically, these simulations interleave dataset variables in memory during their calculation phase to improve computational performance, but deinterleave the data for subsequent storage and analysis. As a result, an efficient preconditioner for data deinterleaving is critical, since common deinterleaving methods provide inefficient throughput and energy performance. To address this problem, we present a deinterleaving method that is high performance, energy efficient, and generic to any data type.
    When evaluated against conventional deinterleaving methods on 105 STREAM standard micro-benchmarks, our method always improved throughput and throughput/watt. In the best case, our deinterleaving method improved throughput by up to 26.2x and throughput/watt by up to 7.8x.
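    The deinterleaving step the abstract describes amounts to an array-of-structs to struct-of-arrays transpose. A straightforward strided version looks like this (it shows what is computed, not the paper's cache- and energy-optimized method):

```python
def deinterleave(buf, n_vars, elem_size):
    """Split an interleaved record buffer (v0, v1, ..., v0, v1, ...)
    into one contiguous byte stream per variable via strided copies."""
    record = n_vars * elem_size
    assert len(buf) % record == 0, "buffer must hold whole records"
    streams = []
    for v in range(n_vars):
        stream = bytearray()
        # walk the buffer one record at a time, picking out variable v
        for off in range(v * elem_size, len(buf), record):
            stream += buf[off:off + elem_size]
        streams.append(bytes(stream))
    return streams
```

    After deinterleaving, each stream holds one variable contiguously, which is the layout that subsequent storage, compression, and analysis stages expect.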