21 research outputs found

    Clouder : a flexible large scale decentralized object store

    Doctoral Programme in Informatics (MAP-i). Large scale data stores were initially introduced to support a few concrete extreme scale applications such as social networks. Their scalability and availability requirements often outweigh the cost of sacrificing richer data and processing models, and even elementary data consistency. In strong contrast with traditional relational database management systems (RDBMS), large scale data stores present very simple data models and APIs, lacking most of the established relational data management operations, and relax consistency guarantees, providing only eventual consistency. With a number of alternatives now available and mature, there is an increasing willingness to use them in a wider and more diverse spectrum of applications, by skewing the current trade-off towards the needs of common business users and easing the migration from current RDBMS. This is particularly so when they are used in the context of a Cloud solution such as a Platform as a Service (PaaS). This thesis aims at reducing the gap between traditional RDBMS and large scale data stores, by seeking mechanisms that provide additional consistency guarantees and higher level data processing primitives in large scale data stores. The devised mechanisms should not hinder the scalability and dependability of large scale data stores. Regarding higher level data processing primitives, this thesis explores two complementary approaches: extending data stores with additional operations such as general multi-item operations, and coupling data stores with other existing processing facilities without hindering scalability. We address these challenges with a new architecture for large scale data stores, efficient multi-item access for large scale data stores, and SQL processing atop large scale data stores. The novel architecture makes it possible to find the right trade-offs among flexible usage, efficiency, and fault tolerance. To efficiently support multi-item access, we extend the data models of first generation large scale data stores with tags and a multi-tuple data placement strategy, which together allow large sets of related data to be stored and retrieved efficiently at once. For efficient SQL support atop scalable data stores, we devise design modifications to existing relational SQL query engines that allow them to be distributed. We demonstrate our approaches with running prototypes and an extensive experimental evaluation using appropriate workloads based on real applications.

    Clouder: a flexible large scale decentralized object store - architecture overview

    The current exponential growth of data calls for massive scale storage and processing capabilities. Such large volumes of data tend to preclude centralized storage and processing, making extensive and flexible data partitioning unavoidable. This is being acknowledged by several major Internet players, who are embracing the Cloud computing model and offering first generation remote storage services with simple processing capabilities. In this position paper we present preliminary ideas for the architecture of a flexible, efficient and dependable fully decentralized object store, able to manage very large sets of variable size objects and to coordinate in-place processing. Our targets are large local area computing facilities composed of tens of thousands of nodes under the same administrative domain. The system should be capable of leveraging massive replication of data to balance read scalability and fault tolerance.

    EASAHE, an algorithm for scheduling jobs in data processing tools with energy efficiency concerns

    Massive data processing tools for distributed environments, such as Spark or Dask, let programmers process massive amounts of data on large clusters. Current tools use simple algorithms for scheduling data processing jobs in distributed computing, relying on heuristics that do not take the characteristics of the workload into account. Recent work explores the efficient scheduling of data processing jobs in distributed computing. In this paper we propose a new job scheduling algorithm for massive data processing tools with energy efficiency concerns. Its implementation in a simulator and an evaluation using traces of real and synthetic Spark executions show that the algorithm reduces energy consumption by up to 11.5% and job execution time by up to 11.9%, without a large impact on the time spent on scheduling.

    d'Artagnan: a trusted NoSQL database on untrusted clouds

    Privacy-sensitive applications that store confidential information, such as personally identifiable data or medical records, have strict security concerns. These concerns hinder the adoption of the cloud. With cloud providers under the constant threat of malicious attacks, a single successful breach is sufficient to exploit any valuable information and disclose sensitive data. Existing privacy-aware databases mitigate some of these concerns, but still leak critical information that can potentially compromise the entire system's security. This paper proposes d'Artagnan, the first privacy-aware multi-cloud NoSQL database framework that renders database leaks worthless. The framework stores data as encrypted secrets in multiple clouds such that i) a single data breach cannot break the database's confidentiality and ii) queries are processed on the server side without leaking any sensitive information. d'Artagnan is evaluated with an industry-standard benchmark on market-leading cloud providers. This work is financed by National Funds through the Portuguese funding agency, FCT - Fundação para a Ciência e a Tecnologia, within project UID/EEA/50014/2019, and with the grant SFRH/BD/142704/201
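The idea of spreading a value across several clouds so that one breached cloud reveals nothing can be illustrated with a small sketch. The abstract does not name the scheme d'Artagnan uses, so additive secret sharing is an assumption made here purely for illustration; `MODULUS`, `split` and `reconstruct` are hypothetical names and not part of the framework.

```python
import secrets

# Assumed modulus for the illustration; any sufficiently large modulus works.
MODULUS = 2**61 - 1


def split(value: int, n_clouds: int) -> list[int]:
    """Split `value` into one additive share per cloud (hypothetical helper)."""
    if n_clouds < 2:
        raise ValueError("need at least two clouds")
    # All shares but the last are uniformly random.
    shares = [secrets.randbelow(MODULUS) for _ in range(n_clouds - 1)]
    # The last share is chosen so that all shares sum to the value mod MODULUS.
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def reconstruct(shares: list[int]) -> int:
    """Recombine the shares; every share is required."""
    return sum(shares) % MODULUS


# Each cloud can add its own shares of two values locally, so a sum can be
# computed on the server side without any cloud seeing a plaintext value.
a, b = split(10, 3), split(32, 3)
summed = [(x + y) % MODULUS for x, y in zip(a, b)]
total = reconstruct(summed)
```

In this sketch a single share is a uniformly random number, so a breach of one cloud discloses nothing about the stored value. Richer server-side queries would need more machinery than this.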

    AIDA-DB: a data management architecture for the edge and cloud continuum

    There is an increasing demand for stateful edge computing for both complex Virtual Network Functions (VNFs) and application services in emerging 5G networks. Managing a mutable persistent state at the edge does, however, bring new architectural, performance, and dependability challenges. Not only does it have to be integrated with existing cloud-based systems, but it must also cope with both operational and analytical workloads and be compatible with a variety of SQL and NoSQL database management systems. We address these challenges with AIDA-DB, a polyglot data management architecture for the edge and cloud continuum. It leverages recent developments in distributed transaction processing for a reliable mutable state in operational workloads, together with a flexible synchronization mechanism for efficient data collection in cloud-based analytical workloads. Partially funded by project AIDA – Adaptive, Intelligent and Distributed Assurance Platform (POCI-01-0247-FEDER-045907), co-financed by the European Regional Development Fund (ERDF) through the Operational Program for Competitiveness and Internationalisation (COMPETE 2020) and by the Portuguese Foundation for Science and Technology (FCT) under CMU Portugal.

    DATAFLASKS: epidemic store for massive scale systems

    Very large scale distributed systems provide some of the most interesting research challenges while at the same time being increasingly required by today's applications. The escalation in the number of connected devices and in the amount of data being produced and exchanged demands new data management systems. Although new data stores are continuously being proposed, they are not suitable for very large scale environments. The high levels of churn and constant dynamics found in very large scale systems demand robust, proactive and unstructured approaches to data management. In this paper we propose a novel data store based solely on epidemic (or gossip-based) protocols. It leverages the capacity of these protocols to provide data persistence guarantees even in highly dynamic, massive scale systems. We provide an open source prototype of the data store and a corresponding evaluation.

    DataFlasks : an epidemic dependable key-value substrate

    Recently, tuple-stores have become pivotal structures in many information systems. Their ability to handle large datasets makes them important in an era with unprecedented amounts of data being produced and exchanged. However, these tuple-stores typically rely on structured peer-to-peer protocols which assume moderately stable environments. Such an assumption does not always hold for very large scale systems, on the scale of thousands of machines. In this paper we present a novel approach to the design of a tuple-store. Our approach follows a stratified design based on an unstructured substrate. We focus on this substrate and on how the use of epidemic protocols allows reaching high dependability and scalability.

    pH1: a transactional middleware for NoSQL

    To achieve high levels of scalability and availability, NoSQL databases opt not to offer two important abstractions traditionally found in relational databases: transactional guarantees and strong data consistency. In this work we propose pH1, a generic middleware layer over NoSQL databases that offers transactional guarantees with Snapshot Isolation. This is achieved in a non-intrusive manner, requiring no modifications to servers and no native support for multiple versions. Instead, the transactional context is achieved by means of a multiversion distributed cache and an external transaction certifier, exposed by extending the client's interface with transaction bracketing primitives. We validate and evaluate pH1 with Apache Cassandra and Hyperdex. First, using the YCSB benchmark, we show that the cost of providing ACID guarantees to these NoSQL databases amounts to an 11% decrease in throughput. Moreover, under the transaction-intensive TPC-C workload, pH1 incurred a 22% decrease in throughput. This contrasts with OMID, a previous proposal that takes advantage of HBase's support for multiple versions, which has a throughput penalty of 76% under the same conditions.
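The role of an external certifier under Snapshot Isolation can be sketched in a few lines. This is a generic first-committer-wins check, a common way to enforce Snapshot Isolation; it is not taken from pH1 itself, and the `Certifier` class and its methods are hypothetical names chosen for the illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Certifier:
    """Hypothetical certifier: remembers when each key was last committed."""
    clock: int = 0
    last_commit: dict = field(default_factory=dict)

    def begin(self) -> int:
        # The transaction reads from the snapshot taken at this timestamp.
        return self.clock

    def certify(self, start_ts: int, write_set: set) -> bool:
        # First-committer-wins: abort if a key we wrote was committed by
        # another transaction after our snapshot was taken.
        for key in write_set:
            if self.last_commit.get(key, 0) > start_ts:
                return False
        # No write-write conflict: assign a commit timestamp and record it.
        self.clock += 1
        for key in write_set:
            self.last_commit[key] = self.clock
        return True


certifier = Certifier()
t1 = certifier.begin()
t2 = certifier.begin()                 # concurrent with t1
ok_t1 = certifier.certify(t1, {"x"})   # commits first
ok_t2 = certifier.certify(t2, {"x"})   # conflicts with t1 on "x"
```

Here the two calls to `begin` stand in for the transaction bracketing primitives mentioned in the abstract. A transaction that fails certification would be aborted and retried by the client.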

    On the cost of database clusters reconfiguration

    Database clusters based on shared-nothing replication techniques are currently widely accepted as a practical solution to the scalability and availability of the data tier. A key issue when planning such systems is the ability to meet service level agreements when load spikes occur or cluster nodes fail. This translates into the ability to provision and deploy additional nodes. Many current research efforts focus on designing autonomic controllers to perform such reconfiguration, tuned to react quickly to system changes and spawn new replicas based on resource usage and performance measurements. In contrast, we are concerned with the inherent impact of deploying an additional node to an online cluster, considering both the time required to finish such an action and the impact on resource usage and performance of the cluster as a whole. If noticeable, such impact hinders the practicability of self-management techniques, since it adds a dimension that has to be accounted for. Our approach is to systematically benchmark a number of different reconfiguration scenarios to assess the cost of bringing a new replica online. We consider factors such as workload characteristics, incremental and parallel recovery, flow control, and outdatedness of the recovering replica. As a result, we show that research should be refocused from optimizing the capture and transmission of changes to optimizing their application, which in a realistic setting dominates the cost of the recovery operation. Work supported by the Spanish Government under research grant TIN2006-14738-C02-02.

    MeT: workload aware elasticity for NoSQL

    NoSQL databases manage the bulk of data produced by modern Web applications such as social networks. This stems from their ability to partition and spread data across all available nodes, allowing NoSQL systems to scale. Unfortunately, scale-out in current solutions is oblivious to the underlying data access patterns, resulting in both highly skewed load across nodes and suboptimal node configurations. In this paper, we first show that judicious placement of HBase partitions that takes data access patterns into account can improve overall throughput by 35%. Next, we go beyond current state of the art elastic systems, which are limited to uninformed replica addition and removal, by: i) reconfiguring existing replicas according to access patterns and ii) adding replicas specifically configured for the expected access pattern. MeT is a prototype for a Cloud-enabled framework that can be used alone or in conjunction with OpenStack for the automatic and heterogeneous reconfiguration of an HBase deployment. Our evaluation, conducted using the YCSB workload generator and a TPC-C workload, shows that MeT is able to i) autonomously achieve the performance of a manually configured cluster and ii) quickly reconfigure the cluster in response to unpredicted workload changes.