1,386 research outputs found
Principles of Dataset Versioning: Exploring the Recreation/Storage Tradeoff
The relative ease of collaborative data science and analysis has led to a
proliferation of many thousands or millions of of the same datasets
in many scientific and commercial domains, acquired or constructed at various
stages of data analysis across many users, and often over long periods of time.
Managing, storing, and recreating these dataset versions is a non-trivial task.
The fundamental challenge here is the : the more
storage we use, the faster it is to recreate or retrieve versions, while the
less storage we use, the slower it is to recreate or retrieve versions. Despite
the fundamental nature of this problem, there has been a surprisingly little
amount of work on it. In this paper, we study this trade-off in a principled
manner: we formulate six problems under various settings, trading off these
quantities in various ways, demonstrate that most of the problems are
intractable, and propose a suite of inexpensive heuristics drawing from
techniques in delay-constrained scheduling, and spanning tree literature, to
solve these problems. We have built a prototype version management system, that
aims to serve as a foundation to our DATAHUB system for facilitating
collaborative data science. We demonstrate, via extensive experiments, that our
proposed heuristics provide efficient solutions in practical dataset versioning
scenarios
Optimizations for Energy-Aware, High-Performance and Reliable Distributed Storage Systems
With the decreasing cost and wide-spread use of commodity hard drives, it has become possible to create very large-scale storage systems with less expense. However, as we approach exabyte-scale storage systems, maintaining important features such as energy-efficiency, performance, reliability and usability became increasingly difficult. Despite the decreasing cost of storage systems, the energy consumption of these systems still needs to be addressed in order to retain cost-effectiveness. Any improvements in a storage system can be outweighed by high energy costs. On the other hand, large-scale storage systems can benefit more from the object storage features for improved performance and usability. One area of concern is metadata performance bottleneck of applications reading large directories or creating a large number of files. Similarly, computation on big data where data needs to be transferred between compute and storage clusters adversely affects I/O performance. As the storage systems become more complex and larger, transferring data between remote compute and storage tiers becomes impractical. Furthermore, storage systems implement reliability typically at the file system or client level. This approach might not always be practical in terms of performance. Lastly, object storage features are usually tailored to specific use cases that makes it harder to use them in various contexts.
In this thesis, we are presenting several approaches to enhance energy-efficiency, performance, reliability and usability of large-scale storage systems. To begin with, we improve the energy-efficiency of storage systems by moving I/O load to a subset of the storage nodes with energy-aware node allocation methods and turn off the unused nodes, while preserving load balance on demand. To address the metadata performance issue associated with large creates and directory reads, we represent directories with object storage collections and implement lazy creation of objects. Similarly, in-situ computation on large-scale data is enabled by using object storage features to integrate a computational framework with the existing object storage layer to eliminate the need to transfer data between compute and storage silos for better performance. We then present parity-based redundancy using object storage features to achieve reliability with less performance impact. Finally, unified storage brings together the object storage features to meet the needs of distinct use cases; such as cloud storage, big data or high-performance computing to alleviate the unnecessary fragmentation of storage resources. We evaluate each proposed approach thoroughly and validate their effectiveness in terms of improving energy-efficiency, performance, reliability and usability of a large-scale storage system
A Survey on Array Storage, Query Languages, and Systems
Since scientific investigation is one of the most important providers of
massive amounts of ordered data, there is a renewed interest in array data
processing in the context of Big Data. To the best of our knowledge, a unified
resource that summarizes and analyzes array processing research over its long
existence is currently missing. In this survey, we provide a guide for past,
present, and future research in array processing. The survey is organized along
three main topics. Array storage discusses all the aspects related to array
partitioning into chunks. The identification of a reduced set of array
operators to form the foundation for an array query language is analyzed across
multiple such proposals. Lastly, we survey real systems for array processing.
The result is a thorough survey on array data storage and processing that
should be consulted by anyone interested in this research topic, independent of
experience level. The survey is not complete though. We greatly appreciate
pointers towards any work we might have forgotten to mention.Comment: 44 page
Análise colaborativa de grandes conjuntos de séries temporais
The recent expansion of metrification on a daily basis has led to the production
of massive quantities of data, and in many cases, these collected metrics
are only useful for knowledge building when seen as a full sequence of
data ordered by time, which constitutes a time series. To find and interpret
meaningful behavioral patterns in time series, a multitude of analysis software
tools have been developed. Many of the existing solutions use annotations
to enable the curation of a knowledge base that is shared between a group
of researchers over a network. However, these tools also lack appropriate
mechanisms to handle a high number of concurrent requests and to properly
store massive data sets and ontologies, as well as suitable representations
for annotated data that are visually interpretable by humans and explorable by
automated systems. The goal of the work presented in this dissertation is to
iterate on existing time series analysis software and build a platform for the
collaborative analysis of massive time series data sets, leveraging state-of-the-art technologies for querying, storing and displaying time series and annotations.
A theoretical and domain-agnostic model was proposed to enable
the implementation of a distributed, extensible, secure and high-performant
architecture that handles various annotation proposals in simultaneous and
avoids any data loss from overlapping contributions or unsanctioned changes.
Analysts can share annotation projects with peers, restricting a set of collaborators
to a smaller scope of analysis and to a limited catalog of annotation
semantics. Annotations can express meaning not only over a segment of time,
but also over a subset of the series that coexist in the same segment. A novel
visual encoding for annotations is proposed, where annotations are rendered
as arcs traced only over the affected series’ curves in order to reduce visual
clutter. Moreover, the implementation of a full-stack prototype with a reactive
web interface was described, directly following the proposed architectural and
visualization model while applied to the HVAC domain. The performance of
the prototype under different architectural approaches was benchmarked, and
the interface was tested in its usability. Overall, the work described in this dissertation
contributes with a more versatile, intuitive and scalable time series
annotation platform that streamlines the knowledge-discovery workflow.A recente expansão de metrificação diária levou à produção de quantidades
massivas de dados, e em muitos casos, estas métricas são úteis para
a construção de conhecimento apenas quando vistas como uma sequência
de dados ordenada por tempo, o que constitui uma série temporal. Para se
encontrar padrões comportamentais significativos em séries temporais, uma
grande variedade de software de análise foi desenvolvida. Muitas das soluções
existentes utilizam anotações para permitir a curadoria de uma base
de conhecimento que é compartilhada entre investigadores em rede. No entanto,
estas ferramentas carecem de mecanismos apropriados para lidar com
um elevado número de pedidos concorrentes e para armazenar conjuntos
massivos de dados e ontologias, assim como também representações apropriadas
para dados anotados que são visualmente interpretáveis por seres
humanos e exploráveis por sistemas automatizados. O objetivo do trabalho
apresentado nesta dissertação é iterar sobre o software de análise de séries
temporais existente e construir uma plataforma para a análise colaborativa
de grandes conjuntos de séries temporais, utilizando tecnologias estado-de-arte
para pesquisar, armazenar e exibir séries temporais e anotações. Um
modelo teórico e agnóstico quanto ao domínio foi proposto para permitir a
implementação de uma arquitetura distribuída, extensível, segura e de alto
desempenho que lida com várias propostas de anotação em simultâneo e
evita quaisquer perdas de dados provenientes de contribuições sobrepostas
ou alterações não-sancionadas. Os analistas podem compartilhar projetos
de anotação com colegas, restringindo um conjunto de colaboradores a uma
janela de análise mais pequena e a um catálogo limitado de semântica de
anotação. As anotações podem exprimir significado não apenas sobre um
intervalo de tempo, mas também sobre um subconjunto das séries que coexistem
no mesmo intervalo. Uma nova codificação visual para anotações é
proposta, onde as anotações são desenhadas como arcos traçados apenas
sobre as curvas de séries afetadas de modo a reduzir o ruído visual. Para
além disso, a implementação de um protótipo full-stack com uma interface
reativa web foi descrita, seguindo diretamente o modelo de arquitetura e visualização
proposto enquanto aplicado ao domínio AVAC. O desempenho do
protótipo com diferentes decisões arquiteturais foi avaliado, e a interface foi
testada quanto à sua usabilidade. Em geral, o trabalho descrito nesta dissertação
contribui com uma abordagem mais versátil, intuitiva e escalável para
uma plataforma de anotação sobre séries temporais que simplifica o fluxo de
trabalho para a descoberta de conhecimento.Mestrado em Engenharia Informátic
Microservice-based Reference Architecture for Semantics-aware Measurement Systems
Cloud technologies have become more important than ever with the rising need for scalable
and distributed software systems. A pattern that is used in many such systems is a
microservice-based architecture (MSA). MSAs have become a blueprint for many large
companies and big software systems. In many scientific fields like energy and environmental
informatics, efficient and scalable software systems with a primary focus on measurement
data are a core requirement. Nowadays, there are many ways to solve research questions
using data-driven approaches. Most of them have a need for large amounts of measurement
data and according metadata. However, many measurement systems still follow deprecated
guidelines such as monolithic architectures, classic relational database principles and are
missing semantic awareness and interpretation of data. These problems and the resulting
requirements are tackled by the introduction of a reference architecture with a focus on
measurement systems that utilizes the principles of microservices.
The thesis first presents the systematic design of the reference architecture by using the
principles of Domain-driven Design (DDD). This process ensures that the reference architecture
is defined in a modular and sustainable way in contrast to complex monolithic
software systems. An extensive scientific analysis leads to the core parts of the concept
consisting of the data management and semantics for measurement systems. Different data
services define a concept for managing measurement data, according meta data and master
data describing the business objects of the application implemented by using the reference
architecture. Further concepts allow the reference architecture to define a way for the system
to understand and interpret the data using semantic information. Lastly, the introduction of
a frontend framework for dashboard applications represents an example for visualizing the
data managed by the microservices
- …