19 research outputs found

    Data locality in Hadoop

    Current market trends show the need to store and process rapidly growing amounts of data, which in turn creates demand for distributed storage and data processing systems. Apache Hadoop is an open-source framework for managing such computing clusters in an efficient, fault-tolerant way. When dealing with large volumes of data, Hadoop and its storage system HDFS (Hadoop Distributed File System) face the challenge of keeping efficiency high while completing computations in reasonable time. A typical Hadoop deployment moves computation to the data rather than shipping data across the cluster, since moving large quantities of data through the network could significantly delay processing tasks. While a task is running, Hadoop therefore favours local data access and chooses blocks from the nearest nodes; any remaining blocks are transferred only when the task actually needs them. To support Hadoop's data locality preferences, this thesis proposes adding new functionality to HDFS that enables moving data blocks on request. Shipping data in advance makes it possible to deliberately redistribute blocks between nodes and adapt their placement to upcoming processing tasks. The new functionality enables instructed movement of data blocks within the cluster: data can be shifted either by a user running the corresponding HDFS shell command or programmatically by another module, such as a scheduler. To develop this functionality, a detailed analysis of the Apache Hadoop source code and its components (specifically HDFS) was conducted. The research resulted in a deep understanding of the internal architecture, which made it possible to compare candidate approaches to the desired solution and to develop the chosen one.
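    To make the proposed mechanism concrete, here is a minimal sketch of how a scheduler might ask for such an in-advance block transfer, assuming the modified HDFS exposes a simple relocation endpoint; the endpoint, host, port, and request fields below are illustrative placeholders, not part of stock Hadoop or of the thesis text.

        # Illustrative sketch only: stock HDFS has no "move block on request" API;
        # the thesis proposes adding one. The endpoint and field names below are
        # hypothetical placeholders for such an extension.
        import json
        import socket

        def request_block_move(block_id: str, source_node: str, target_node: str,
                               host: str = "namenode.example", port: int = 9999) -> dict:
            """Ask the (hypothetical) block-relocation service to move one block."""
            request = {
                "op": "MOVE_BLOCK",      # instructed movement of a block
                "blockId": block_id,     # HDFS block to relocate
                "source": source_node,   # DataNode currently holding a replica
                "target": target_node,   # DataNode that should receive it
            }
            with socket.create_connection((host, port), timeout=10) as conn:
                conn.sendall(json.dumps(request).encode() + b"\n")
                return json.loads(conn.makefile().readline())

        # A scheduler could call this before launching a task so that the task's
        # input blocks are already local to the nodes it plans to use.
        if __name__ == "__main__":
            print(request_block_move("blk_1073741825", "datanode-3", "datanode-7"))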

    Concurrent software architectures for exploratory data analysis

    Decades ago, increased volume of data made manual analysis obsolete and prompted the use of computational tools with interactive user interfaces and a rich palette of data visualizations. Yet their classic, desktop-based architectures can no longer cope with the ever-growing size and complexity of data. Next-generation systems for exploratory data analysis will be developed on client–server architectures, which already run concurrent software for data analytics but are not tailored for an engaged, interactive analysis of data and models. In exploratory data analysis, the key is the responsiveness of the system and the prompt construction of interactive visualizations that can guide users to uncover interesting data patterns. In this study, we review current software architectures for distributed data analysis and propose a list of features to be included in next-generation frameworks for exploratory data analysis. The new generation of tools will need to address integrated data storage and processing, fast prototyping of data analysis pipelines supported by machine-proposed analysis workflows, preemptive analysis of data, interactivity, and user interfaces for intelligent data visualizations. These systems will rely on a mixture of concurrent software architectures to meet the challenge of seamlessly integrating exploratory data interfaces on the client side with the management of concurrent data mining procedures on the servers.
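    As an illustration of the architectural pattern the study discusses (concurrent server-side analytics streaming partial results so an interactive client stays responsive), a minimal Python sketch follows; the port, message format, and the stand-in analysis task are assumptions made for the example, not details taken from the paper.

        # Minimal sketch of a concurrent analysis server that streams partial results,
        # so an interactive client can update its visualizations while work continues.
        import asyncio
        import json

        async def run_analysis(name: str, steps: int = 5):
            """Stand-in for a long-running data-mining procedure yielding partial results."""
            for i in range(1, steps + 1):
                await asyncio.sleep(0.5)                 # simulate one chunk of computation
                yield {"task": name, "progress": i / steps}

        async def handle_client(reader, writer):
            request = json.loads(await reader.readline())
            async for partial in run_analysis(request["task"]):
                writer.write((json.dumps(partial) + "\n").encode())  # stream each partial result
                await writer.drain()
            writer.close()
            await writer.wait_closed()

        async def main():
            server = await asyncio.start_server(handle_client, "127.0.0.1", 8888)
            async with server:
                await server.serve_forever()

        if __name__ == "__main__":
            asyncio.run(main())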

    MooseGuard: secure file sharing at scale in untrusted environments

    Shared storage systems provide cheap, scalable, and reliable storage, but secure sharing in these systems requires users either to encrypt their data, which limits efficient sharing, or to trust a service provider to faithfully keep their data private. Current research has explored the use of trusted execution environments (TEEs) to operate on sensitive data and sharing policies in isolated execution. That work enables the use of untrusted shared resources to store and share sensitive data while maintaining stronger security guarantees. However, current solutions scale poorly because they bottleneck both metadata and data operations within the same physical TEE, whereas a scaled file system distributes metadata and data operations to separate devices. This paper explores the use of two TEEs specialized for metadata and data operations to provide file sharing at scale with lower overhead while keeping strong security guarantees. The approach scales metadata handling and concurrent use by utilizing a server-side TEE for isolated execution on a master server, and it provides data privacy and efficient access revocation through a client-side TEE. MooseGuard is the prototype implementation of this design, using Intel SGX as the TEE and extending the MooseFS distributed file system. MooseGuard's implementation details the modifications needed to provide security and shows how the approach can be applied to a typical distributed file system. An evaluation of MooseGuard demonstrates that TEEs specialized for metadata and data operations allow a secured distributed file system to maintain its scale with only constant overheads. As TEEs and secure hardware become more widely available in public clouds, enterprises, and personal devices, MooseGuard presents a way for users to get the best of both worlds in data privacy and efficient sharing when using scaled, shared storage systems.
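    To make the division of labour concrete, the following minimal Python sketch shows the general pattern the paper describes: metadata and access-control checks are handled by a trusted metadata service, while file contents are encrypted on the client before reaching untrusted storage. The class names and the use of Fernet encryption are assumptions made for this illustration; MooseGuard itself runs the corresponding logic inside Intel SGX enclaves within MooseFS.

        # Illustrative sketch of the metadata/data split, not MooseGuard's actual code.
        # MooseGuard runs the metadata logic in a server-side SGX enclave and the
        # encryption/revocation logic in a client-side enclave.
        from cryptography.fernet import Fernet

        class MetadataService:
            """Stands in for the server-side TEE: tracks which users may read which files."""
            def __init__(self):
                self.acl = {}                      # path -> set of authorized users

            def grant(self, path, user):
                self.acl.setdefault(path, set()).add(user)

            def revoke(self, path, user):
                self.acl.get(path, set()).discard(user)

            def authorized(self, path, user):
                return user in self.acl.get(path, set())

        class Client:
            """Stands in for the client-side TEE: encrypts data before it reaches storage."""
            def __init__(self, user, metadata):
                self.user, self.metadata = user, metadata
                self.key = Fernet.generate_key()   # a real system would use per-file keys
                self.storage = {}                  # untrusted shared storage holds ciphertext only

            def write(self, path, data: bytes):
                self.metadata.grant(path, self.user)
                self.storage[path] = Fernet(self.key).encrypt(data)

            def read(self, path) -> bytes:
                if not self.metadata.authorized(path, self.user):
                    raise PermissionError(path)
                return Fernet(self.key).decrypt(self.storage[path])

        if __name__ == "__main__":
            meta = MetadataService()
            alice = Client("alice", meta)
            alice.write("/docs/report.txt", b"sensitive contents")
            print(alice.read("/docs/report.txt"))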

    daGui: A DataFlow Graphical User Interface

    Big Data is a growing trend that focuses on storing and processing vast amounts of data in a distributed environment. Many of the frameworks and tools used to work with such data utilise Directed Acyclic Graphs (DAGs) in some way. A DAG is often used to express the dataflow of a computation because it captures an overview of the whole computation and thus offers opportunities to optimise its execution. This thesis aims to create IDE-like (Integrated Development Environment) software that is user-friendly, interactive, and easily extendable. The software lets users draw a DAG representing the dataflow of a program; the DAG is then transformed into launchable source code. Moreover, the software offers a simple way to execute the generated code: it compiles the code (if necessary) and launches it according to the user's configuration, either on localhost or on a cluster. The software primarily aims to help beginners learn these technologies, but experts can also use it to visualise their workflows or as a prototyping tool. It has been implemented using Electron and Web technologies, which ensure platform independence. Its main features are code generation (i.e. translation of a DAG into source code) and code execution, and it is designed with extensibility in mind so that support for more frameworks and tools can be plugged in later.
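    As a minimal sketch of the core idea behind code generation from a dataflow DAG (written in Python for brevity rather than daGui's JavaScript), nodes can be visited in topological order and each node's template emitted as one statement; the node types and templates below are invented for the example and are not daGui's actual node set.

        # Toy DAG-to-code generator: topologically order the nodes, then emit one
        # statement per node. Node types and templates are invented examples.
        from graphlib import TopologicalSorter

        # Each node: id -> (code template with {id}/{inputs} placeholders, list of input node ids)
        dag = {
            "load":   ("df_{id} = read_csv('data.csv')", []),
            "filter": ("df_{id} = df_{inputs}[df_{inputs}.value > 0]", ["load"]),
            "count":  ("print(len(df_{inputs}))", ["filter"]),
        }

        def generate_code(nodes: dict) -> str:
            deps = {node_id: set(inputs) for node_id, (_, inputs) in nodes.items()}
            lines = []
            for node_id in TopologicalSorter(deps).static_order():
                template, inputs = nodes[node_id]
                lines.append(template.format(id=node_id, inputs=inputs[0] if inputs else ""))
            return "\n".join(lines)

        if __name__ == "__main__":
            print(generate_code(dag))   # prints the generated script in dependency order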

    Causality-Guided Adaptive Interventional Debugging

    Runtime nondeterminism is a fact of life in modern database applications. Previous research has shown that nondeterminism can cause applications to intermittently crash, become unresponsive, or experience data corruption. We propose Adaptive Interventional Debugging (AID) for debugging such intermittent failures. AID combines existing statistical debugging, causal analysis, fault injection, and group testing techniques in a novel way to (1) pinpoint the root cause of an application's intermittent failure and (2) generate an explanation of how the root cause triggers the failure. AID works by first identifying a set of runtime behaviors (called predicates) that are strongly correlated with the failure. It then utilizes temporal properties of the predicates to (over-)approximate their causal relationships. Finally, it uses fault injection to execute a sequence of interventions on the predicates and discover their true causal relationships. This enables AID to identify the true root cause and its causal relationship to the failure. We theoretically analyze how quickly AID converges to this identification. We evaluate AID with six real-world applications that intermittently fail under specific inputs. In each case, AID was able to identify the root cause and explain how the root cause triggered the failure, much faster than group testing and more precisely than statistical debugging. We also evaluate AID with many synthetically generated applications with known root causes and confirm that the benefits also hold for them. Comment: Technical report of AID (SIGMOD 2020).
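    A minimal sketch of the intervention step at the heart of this approach: re-run the program while forcibly suppressing one failure-correlated predicate at a time, and keep the predicates whose suppression prevents the failure. The predicate names and the run_with_intervention harness are placeholders, and this simplified loop omits AID's causal ordering and group-testing optimizations.

        # Simplified illustration of interventional root-cause search, not AID's full algorithm.
        import random

        def run_with_intervention(blocked_predicate=None) -> bool:
            """Placeholder harness: returns True if the run fails.

            A real harness would execute the application with fault injection that
            prevents `blocked_predicate` from occurring (e.g., by delaying a thread).
            """
            # Toy model: the failure only happens when the race-related predicate fires.
            race_fires = blocked_predicate != "lock_released_early" and random.random() < 0.9
            return race_fires

        def find_causal_predicates(predicates, trials=20):
            causal = []
            for pred in predicates:
                failures = sum(run_with_intervention(blocked_predicate=pred) for _ in range(trials))
                if failures == 0:          # blocking this predicate prevents the failure
                    causal.append(pred)
            return causal

        if __name__ == "__main__":
            correlated = ["cache_miss", "lock_released_early", "slow_disk_write"]
            print(find_causal_predicates(correlated))   # expected: ['lock_released_early']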