
    Eventually Consistent Configuration Management in Fog Systems with CRDTs

    Current fog systems rely on centralized and strongly consistent services for configuration management that were originally designed for cloud systems. In the geo-distributed fog, such systems can exhibit high communication latency or become unavailable in case of network partitions. In this paper, we examine the drawbacks of strong consistency for fog configuration management and propose an alternative based on CRDTs. We prototypically implement our approach for the FReD fog data management platform. Early results show reductions in server response times of up to 50%.
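
    As a concrete illustration of the data structures involved, below is a minimal sketch of a state-based last-writer-wins (LWW) register, a simple CRDT that a geo-distributed configuration store could use. It is a toy under simplifying assumptions, not the paper's FReD implementation.

```python
import time
from dataclasses import dataclass

# Toy state-based last-writer-wins (LWW) register: an illustration of the
# kind of CRDT a fog configuration store might use, not FReD's implementation.
@dataclass
class LWWRegister:
    value: object = None
    timestamp: float = 0.0
    node_id: str = ""  # deterministic tie-breaker for concurrent writes

    def set(self, value, node_id):
        # Local update: record the value with the current wall-clock time.
        self.value, self.timestamp, self.node_id = value, time.time(), node_id

    def merge(self, other):
        # Join with another replica's state; the later write wins,
        # with node_id breaking exact-timestamp ties.
        if (other.timestamp, other.node_id) > (self.timestamp, self.node_id):
            self.value, self.timestamp, self.node_id = (
                other.value, other.timestamp, other.node_id)

# Two fog nodes update the same configuration key, then exchange states.
a, b = LWWRegister(), LWWRegister()
a.set({"replicas": 3}, node_id="fog-eu")
b.set({"replicas": 5}, node_id="fog-us")
a.merge(b)
b.merge(a)
assert a.value == b.value  # replicas converge without central coordination
```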

    Trade-offs in Replicated Systems

    Replicated systems provide the foundation for most of today’s large-scale services. Engineering such replicated systems is an onerous task. The first—and often foremost—step in this task is to establish an appropriate set of design goals, such as availability or performance, which should synthesize all the underlying system properties. Mixing design goals, however, is fraught with dangers, given that many properties are antagonistic and fundamental trade-offs exist among them. Navigating this harsh landscape of trade-offs is difficult because the formal results that describe them use different notations and system models, so it is hard to get an all-encompassing understanding of the state of the art in this area. In this paper, we address this difficulty by providing a systematic overview of the most relevant trade-offs involved in building replicated systems. Starting from the well-known FLP result, we follow a long line of research and investigate different trade-offs, assembling a coherent perspective of these results. Among others, we consider trade-offs which examine the complex interactions between properties such as consistency, availability, low latency, partition tolerance, churn, scalability, and visibility latency.
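
    One of the classic trade-offs in this space, between read/write latency and consistency, can be made concrete with quorum replication. The sketch below is illustrative and not taken from the paper: with N replicas, read quorum R, and write quorum W, the condition R + W > N guarantees that every read overlaps the latest write.

```python
# Illustrative quorum arithmetic (not from the paper): with N replicas,
# read quorum R, and write quorum W, R + W > N forces every read set to
# intersect the latest write set; smaller quorums answer faster but may
# return stale data, trading consistency for latency and availability.
def quorum_properties(n: int, r: int, w: int) -> dict:
    return {
        "reads_see_latest_write": r + w > n,  # read/write quorums intersect
        "replicas_that_may_fail_on_read": n - r,
        "replicas_that_may_fail_on_write": n - w,
    }

print(quorum_properties(n=5, r=1, w=5))  # fast reads; writes need all nodes
print(quorum_properties(n=5, r=3, w=3))  # balanced majority quorums
print(quorum_properties(n=5, r=1, w=1))  # fastest, but reads can be stale
```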

    What is a Blockchain? A Definition to Clarify the Role of the Blockchain in the Internet of Things

    The use of the term blockchain is documented for disparate projects, from cryptocurrencies to applications for the Internet of Things (IoT), and many more. The concept of blockchain therefore appears blurred, as it is hard to believe that the same technology can empower applications that have extremely different requirements and exhibit dissimilar performance and security. This position paper builds on the theory of distributed systems to advance a clear definition of blockchain that allows us to clarify its role in the IoT. This definition inextricably binds together three elements that, as a whole, provide the blockchain with those unique features that distinguish it from other distributed ledger technologies: immutability, transparency, and anonymity. We note, however, that immutability comes at the expense of remarkable resource consumption, transparency precludes confidentiality, and anonymity prevents user identification and registration. This is in stark contrast to the requirements of most IoT applications, which are made up of resource-constrained devices whose data need to be kept confidential and whose users must be clearly known. Building on the proposed definition, we derive new guidelines for selecting the proper distributed ledger technology depending on application requirements and trust models, identifying common pitfalls that lead to improper applications of the blockchain. We finally indicate a feasible role of the blockchain for the IoT: myriads of local IoT transactions can be aggregated off-chain and then successfully recorded on an external blockchain as a means of public accountability when required.
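
    The off-chain aggregation pattern suggested in the final sentence can be sketched as follows: batch local IoT transactions, fold them into a single digest (here a Merkle root), and record only that digest on an external blockchain. The code is a hypothetical illustration, not a protocol from the paper.

```python
import hashlib

# Hypothetical sketch of off-chain aggregation: IoT transactions are
# batched locally, folded into one Merkle root, and only that 32-byte
# digest would be recorded on an external blockchain for accountability.
def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    # Hash each transaction, then pairwise-combine until one root remains.
    level = [sha256(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:            # duplicate the last node if odd
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

batch = [b"sensor-42:temp=21.3", b"sensor-17:door=open", b"sensor-42:temp=21.4"]
print(merkle_root(batch).hex())  # the only datum that goes on-chain
```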

    Whitelisting System State In Windows Forensic Memory Visualizations

    Examiners in the field of digital forensics regularly encounter enormous amounts of data and must identify the few artifacts of evidentiary value. The most pressing challenge these examiners face is the manual reconstruction of complex datasets with both hierarchical and associative relationships. The complexity of this data requires significant knowledge, training, and experience to examine correctly and efficiently. Current methods provide primarily text-based representations or low-level visualizations, but leave the task of maintaining global context of system state to the examiner. This research presents a visualization tool that improves analysis methods through simultaneous representation of the hierarchical and associative relationships and local detailed data within a single-page application. A novel whitelisting feature further improves analysis by eliminating items of little interest from view, allowing examiners to identify artifacts more quickly and accurately. Results from two pilot studies demonstrate that the visualization tool can help examiners identify artifacts of interest more accurately and quickly.
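
    The whitelisting idea can be sketched as filtering extracted memory artifacts against a known-good baseline of system state; the field names and baseline entries below are hypothetical, not the tool's actual schema.

```python
# Hypothetical whitelist filter: artifacts whose (name, path) matches a
# known-good baseline are dropped from view, leaving only anomalies.
# The schema and baseline entries are illustrative, not the tool's own.
KNOWN_GOOD = {
    ("svchost.exe", r"C:\Windows\System32\svchost.exe"),
    ("explorer.exe", r"C:\Windows\explorer.exe"),
}

def whitelist_filter(processes: list[dict]) -> list[dict]:
    return [p for p in processes
            if (p["name"], p["path"]) not in KNOWN_GOOD]

artifacts = [
    {"name": "svchost.exe", "path": r"C:\Windows\System32\svchost.exe"},
    {"name": "svchost.exe", "path": r"C:\Users\bob\AppData\svchost.exe"},
]
print(whitelist_filter(artifacts))  # only the suspicious copy remains
```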

    Consensus algorithms in distributed SDN controllers

    Software-defined networks (SDN) promise greater flexibility, cost efficiency, and easier management of network infrastructure by logically centralizing control and abstracting network resources, but their wider use is inhibited by various challenges. In this master’s thesis, we explore the problem of ensuring high availability of SDN controllers, a largely unexplored problem that becomes crucial when software-defined networks are deployed in production environments. First, we present the concept of software-defined networking, explore where its use has the greatest potential, and identify various challenges that make its implementation difficult. We emphasize the problems of high availability and network scalability that arise from the centralized control plane, and propose as a solution implementing the controller as a fault-tolerant distributed system. We then study the limitations in the design and implementation of such systems, focusing on consensus algorithms as a key component in ensuring high availability. Next, we present strategies for developing distributed controllers and examine which open-source implementations support high availability. We choose the ONOS controller as potentially the most suitable for production use and analyze its architecture. With the chosen controller and the software-defined network simulator Mininet, we establish a pilot environment. We develop a test framework for simulating scenarios that include various controller node and communication channel failures, and analyze system behavior under them. Based on the results of the analysis, we evaluate the chosen controller implementation with respect to high availability and give suggestions for improving the availability of the solution.
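
    The availability argument for a distributed control plane can be illustrated with the majority-quorum rule of Raft-style consensus, which ONOS relies on; the numbers below are a generic illustration, not measurements from the thesis.

```python
# Generic illustration of consensus availability: a Raft-style cluster
# stays available as long as a majority (floor(n/2) + 1) of controller
# nodes can communicate, so an n-node cluster tolerates (n - 1) // 2
# failed or partitioned-away nodes.
def max_tolerated_failures(cluster_size: int) -> int:
    return (cluster_size - 1) // 2

for n in (1, 3, 5, 7):
    print(f"{n} controllers -> tolerates {max_tolerated_failures(n)} failure(s)")

# A partition isolating a minority (e.g., 2 of 5 nodes) leaves the majority
# side able to elect a leader and keep managing the network; the minority
# side stops accepting updates rather than risk a split brain.
```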

    Efficient Synchronization of State-based CRDTs

    To ensure high availability in large-scale distributed systems, Conflict-free Replicated Data Types (CRDTs) relax consistency by allowing immediate query and update operations at the local replica, with no need for remote synchronization. State-based CRDTs synchronize replicas by periodically sending their full state to other replicas, which can become extremely costly as the CRDT state grows. Delta-based CRDTs address this problem by producing small incremental states (deltas) to be used in synchronization instead of the full state. However, current synchronization algorithms for delta-based CRDTs induce redundant, wasteful delta propagation, performing worse than expected and, surprisingly, no better than state-based synchronization. In this paper we: 1) identify two sources of inefficiency in current synchronization algorithms for delta-based CRDTs; 2) bring the concept of join decomposition to state-based CRDTs; 3) exploit join decompositions to obtain optimal deltas; 4) improve the efficiency of synchronization algorithms; and finally, 5) experimentally evaluate the improved algorithms.

    We would like to thank Ricardo Macedo, Georges Younes, Marc Shapiro, and the anonymous reviewers for their valuable feedback on earlier drafts of this work. Vitor Enes was supported by the EU H2020 LightKone project (732505) and by an FCT - Fundação para a Ciência e a Tecnologia - PhD Fellowship (PD/BD/142927/2018). Carlos Baquero was partially supported by SMILES within the TEC4Growth project (NORTE-01-0145-FEDER-000020). João Leitão was partially supported by project NG-STORAGE through FCT grant PTDC/CCI-INF/32038/2017, and by NOVA LINCS through FCT grant UID/CEC/04516/2013.
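
    The difference between full-state and delta-based synchronization can be illustrated with a grow-only set (G-Set), the simplest state-based CRDT. The sketch below is a toy under simplifying assumptions; the paper's contribution, optimal deltas via join decompositions, goes beyond it.

```python
# Toy grow-only set (G-Set) contrasting full-state and delta-based sync.
# The paper's optimal deltas via join decomposition go beyond this sketch.
class GSet:
    def __init__(self):
        self.items = set()

    def add(self, item):
        # Delta mutator: apply locally and return only the new information.
        delta = {item} - self.items
        self.items |= delta
        return delta

    def join(self, other_state):
        # Join (set union) merges either a full state or a small delta.
        self.items |= other_state

a, b = GSet(), GSet()
a.add("x"); a.add("y")
delta = b.add("z")

a.join(delta)    # delta-based: ship one element, not b's whole state
b.join(a.items)  # state-based: ship the full state
print(a.items == b.items)  # True: both replicas converge to {'x','y','z'}
```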

    A framework for multidimensional indexes on distributed and highly-available data stores

    Spatial Big Data is considered an essential trend in future scientific and business applications. Indeed, research instruments, medical devices, and social networks generate hundreds of petabytes of spatial data per year. However, as many authors have pointed out, the lack of specialized frameworks for dealing with this kind of data is limiting possible applications and probably precluding many scientific breakthroughs. In this thesis, we describe three HPC scientific applications, spanning molecular dynamics, neuroscience analysis, and physics simulations, in which we experienced first-hand the limits of the existing technologies. Building on this experience, we define the desirable missing functionalities, and we focus on two features that, when combined, significantly improve the way scientific data is analyzed.

    On one side, scientific simulations generate complex datasets where multiple correlated characteristics describe each item. For instance, a particle might have a spatial position (x, y, z) at a given time (t). If we want to find all elements within the same area and period, we either have to scan the whole dataset, or we must organize the data so that all items in the same space and time are stored together. The second approach is called Multidimensional Indexing (MI), and it uses different techniques to cluster and organize similar data together. On the other side, approximate analytics has often been indicated as a smart and flexible way to explore large datasets in a short time. Approximate analytics includes a broad family of algorithms which aim to speed up analytical workloads by relaxing the precision of the results within a specific confidence interval. For instance, if we want to know the average age in a group with one-year precision, we can consider just a random fraction of all the people, thus reducing the amount of calculation. But if we also want fewer I/O operations, we need efficient data sampling, which means organizing data in such a way that we do not need to scan the whole dataset to generate a random sample of it.

    According to our analysis, combining Multidimensional Indexing with efficient data Sampling (MIS) is a vital missing feature, not available in current distributed data management solutions. This thesis aims to address this shortcoming and provides novel scalable solutions. First, we describe the existing data management alternatives and motivate our preference for NoSQL key-value databases. Second, we propose an analytical model to study the influence of data models on the scalability and performance of this kind of distributed database. Third, we use the analytical model to design two novel multidimensional indexes with efficient data sampling: the D8tree and the AOTree. Our first solution, the D8tree, improves on the state of the art for approximate spatial queries on static, mostly-read datasets. We then enhanced the data ingestion capability of our approach by introducing the AOTree, an algorithm that preserves the query performance of the D8tree even for write-intensive HPC applications. We compared our solution with PostgreSQL and plain storage, demonstrating that our proposal has better performance and scalability.
    Finally, we describe Qbeast, a novel distributed system that implements the D8tree and the AOTree using NoSQL technologies, and we illustrate how Qbeast simplifies the workflow of scientists in various HPC applications, providing a scalable and integrated solution for data analysis and management.
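
    The two ideas the thesis combines, multidimensional indexing and efficient sampling, can be illustrated with a space-filling curve: interleaving coordinate bits clusters nearby points under nearby keys, and pre-shuffled buckets allow a uniform sample to be read without a full scan. This sketch is illustrative only and is not the D8tree or AOTree.

```python
import random

# Illustrative sketch (not the D8tree/AOTree): a Morton/Z-order key
# interleaves coordinate bits so spatially close (x, y, z) points sort
# near each other, and pre-shuffled buckets make uniform sampling cheap.
def morton_key(x: int, y: int, z: int, bits: int = 10) -> int:
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (3 * i)
        key |= ((y >> i) & 1) << (3 * i + 1)
        key |= ((z >> i) & 1) << (3 * i + 2)
    return key

points = [tuple(random.randrange(1024) for _ in range(3)) for _ in range(1000)]
indexed = sorted(points, key=lambda p: morton_key(*p))  # neighbors co-located

# Efficient sampling: if each bucket is stored pre-shuffled, reading any
# prefix yields a uniform random sample without scanning the whole bucket.
bucket = random.sample(indexed, len(indexed))  # stand-in for a shuffled bucket
approx_sample = bucket[:100]                   # touch only 10% of the data
print(len(approx_sample))
```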