209 research outputs found

    A Design Framework for Efficient Distributed Analytics on Structured Big Data

    Distributed analytics architectures are often composed of two elements: a compute engine and a storage system. Conventional distributed storage systems usually store data as files or key-value pairs. This abstraction simplifies how an application developer accesses and reasons about the data. However, the separation of compute and storage systems makes it difficult to optimize costly disk and network operations. By design, the storage system is isolated from the workload and its performance requirements, such as block co-location and replication. Furthermore, optimizing fine-grained data access requests becomes difficult because the storage layer is hidden behind such abstractions. Using a clean-slate approach, this thesis proposes a modular distributed analytics system design centered around a unified interface for distributed data objects, named the DDO. The interface couples key mechanisms that utilize storage, memory, and compute resources. This coupling makes it well suited to optimizing data access requests across all levels of the memory hierarchy with respect to the workload and its performance requirements. In addition to the DDO, a complementary DDO controller implementation controls the logical view of DDOs, their replication, and their distribution across the cluster. A proof-of-concept implementation improves mean query time by 3-6x on the TPC-H and TPC-DS benchmarks, and by more than an order of magnitude in many cases.

    Methods to Improve Applicability and Efficiency of Distributed Data-Centric Compute Frameworks

    The success of modern applications depends on the insights they collect from their data repositories. Data repositories for such applications currently exceed exabytes and are rapidly increasing in size as they collect data from varied sources: web applications, mobile phones, sensors, and other connected devices. Distributed storage and data-centric compute frameworks have been invented to store and analyze these large datasets. This dissertation focuses on extending the applicability and improving the efficiency of distributed data-centric compute frameworks.

    Dynamic re-optimization techniques for stream processing engines and object stores

    Large-scale data storage and processing systems are strongly motivated by the need to store and analyze massive datasets. The complexity of a large class of these systems is rooted in their distributed nature, extreme scale, need for real-time response, and streaming nature. The use of these systems in multi-tenant cloud environments with potential resource interference necessitates fine-grained monitoring and control. In this dissertation, we present efficient, dynamic techniques for re-optimizing stream-processing systems and transactional object-storage systems. In the context of stream-processing systems, we present VAYU, a per-topology controller. VAYU uses novel methods and protocols for dynamic, network-aware tuple routing in the dataflow. We show that the feedback-driven controller in VAYU helps achieve high pipeline throughput over long execution periods, as it dynamically detects and diagnoses pipeline bottlenecks. We present novel heuristics to optimize overlays for group communication operations in the streaming model. In the context of object-storage systems, we present M-Lock, a novel lock-localization service for distributed transaction protocols on scale-out object stores that increases transaction throughput. Lock localization refers to the dynamic migration and partitioning of locks across nodes in the scale-out store to reduce cross-partition acquisition of locks. The service leverages observed object-access patterns to achieve lock clustering and deliver high performance. We also present TransMR, a framework that uses distributed, transactional object stores to orchestrate and execute asynchronous components in amorphous data-parallel applications on scale-out architectures.
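
The lock-localization idea described in this abstract (migrating locks toward the nodes that acquire them most often) can be illustrated with a minimal sketch. This is not M-Lock's actual protocol: the `LockPlacer` class, its method names, and the "move to the most frequent accessor" rule are assumptions made purely for illustration.

```python
from collections import Counter, defaultdict


class LockPlacer:
    """Hypothetical helper: tracks lock accesses and relocates locks."""

    def __init__(self):
        self.home = {}                        # lock id -> node currently hosting it
        self.accesses = defaultdict(Counter)  # lock id -> per-node access counts

    def record(self, lock_id, node):
        # The first node to touch a lock becomes its initial home (assumption).
        self.home.setdefault(lock_id, node)
        self.accesses[lock_id][node] += 1

    def rebalance(self):
        """Move each lock to its most frequent accessor; return the moves made."""
        moved = {}
        for lock_id, counts in self.accesses.items():
            best, best_count = counts.most_common(1)[0]
            current = self.home[lock_id]
            # Migrate only on a strict improvement, so ties do not cause churn.
            if best != current and best_count > counts[current]:
                self.home[lock_id] = best
                moved[lock_id] = best
        return moved


placer = LockPlacer()
placer.record("x", "n1")
for _ in range(3):
    placer.record("x", "n2")
placer.record("y", "n1")
moves = placer.rebalance()
```

After `rebalance()`, lock `x` lives on `n2`, so later acquisitions from `n2` no longer cross partitions, while `y` stays where it was.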

    Nomadic fog storage

    Mobile services increasingly demand more processing and storage. However, mobile devices are known for their constraints in terms of processing, storage, and energy. Early proposals addressed these constraints by having mobile devices access remote clouds, but these proposals suffer from long latencies and backhaul bandwidth limitations when retrieving data. To mitigate these issues, edge clouds have been proposed. In this paradigm, intermediate nodes are placed between the mobile devices and the remote cloud. These intermediate nodes should fulfill the end users' resource requests, namely data and processing capability, and reduce the energy drawn from the mobile devices' batteries. At the same time, mobile traffic demand is increasing exponentially and the resources available on mobile devices are evolving faster than ever. This motivates using the spare capabilities of mobile nodes to meet the requirements imposed by new mobile applications. In this new scenario, mobile devices become both consumers and providers of the emerging services. The current work investigates this possibility by designing, implementing, and testing a novel nomadic fog storage system that uses fog and mobile nodes to support upcoming applications. In addition, a novel resource allocation algorithm has been developed that considers the available energy on mobile devices and the network topology. It also includes a replica management module based on data popularity. The comprehensive evaluation of the fog proposal shows that it is responsive, offloads traffic from the backhaul links, and enables fair energy depletion among mobile nodes by storing content in neighbor nodes with higher battery autonomy.
    Mobile services require ever more processing power and storage. However, mobile devices are known to be limited in storage, processing, and energy. As a solution, mobile devices began to access these resources through distant clouds. These, however, suffer from long latencies and network bandwidth limitations when accessing resources. To address these issues, edge computing solutions were proposed. These place intermediate nodes between the mobile devices and the remote cloud, and the nodes are responsible for answering end users' resource requests. Given the advances in mobile device technology and the growth in its use, it becomes increasingly pertinent to use these devices themselves to provide cloud services. In this way, the mobile device becomes both a consumer and a provider of the cloud service. The current work investigates this direction, implementing and testing a system that uses mobile devices and nodes in the "fog" to support emerging mobile services. A resource allocation algorithm that considers energy levels and the network topology was also implemented, along with a module that manages data replication in the system according to data popularity. The results show that the system is responsive, relieves traffic on the core links, and achieves a fair distribution of energy consumption in the system through effective dissemination of content to the network-edge nodes closest to the consumer nodes.
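
The abstract above mentions a resource allocation algorithm that weighs battery energy and network topology, plus replica management driven by data popularity. The sketch below shows one way such a policy could look. It is not the thesis's algorithm: the `Node` fields, the linear score with a `hop_penalty`, and the popularity-share rule for the replica count are all illustrative assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    battery: float  # remaining battery as a fraction in [0, 1]
    hops: int       # network distance from the requesting device


def rank_nodes(nodes, hop_penalty=0.1):
    """Order candidates by a hypothetical score: more battery, fewer hops."""
    return sorted(nodes, key=lambda n: n.battery - hop_penalty * n.hops, reverse=True)


def replica_count(requests, total_requests, max_replicas=3):
    """Give popular content more replicas, with at least one copy (assumed rule)."""
    if total_requests <= 0:
        return 1
    share = requests / total_requests
    return max(1, min(max_replicas, math.ceil(share * max_replicas)))


def place(nodes, requests, total_requests):
    """Return the names of the nodes chosen to hold replicas."""
    k = replica_count(requests, total_requests)
    return [n.name for n in rank_nodes(nodes)[:k]]


candidates = [Node("a", 0.9, 3), Node("b", 0.5, 1), Node("c", 0.8, 1)]
chosen = place(candidates, requests=50, total_requests=100)
```

Here node `c` is preferred over `a` despite its lower battery because it is two hops closer, and the content's 50% request share earns it two replicas.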

    VELOXDFS: ELASTIC BLOCKS IN DISTRIBUTED FILE SYSTEMS FOR BIG DATA FRAMEWORKS

    Department of Computer Science and Engineering
    Big data processing and storage has grown into one of the most important aspects of distributed computing in recent years. Much of the effort in this area goes into sophisticated algorithms and architectures, each of which provides a small step toward a more efficient big data system. This work explores a novel idea: modifying a simple component found in most distributed systems leads to a significant improvement in the overall performance of the underlying system, which is often unaware of the modification. This component is file partitioning, which plays a crucial role in dividing the workload of a distributed job into small working units. This work proposes a different view of file partitioning, moving from conventional simple blocks to a more sophisticated scheme in which blocks can change their size at run time and thereby adjust the amount of input given to each working unit of the distributed job. The implications of this technique are far-reaching, since it can be plugged into virtually any distributed system to improve its utilization and performance. In this research we plug the proposed file partitioning system into one of the most widely used data processing systems of our time, Apache Hadoop. This thesis also presents a novel distributed file system named VeloxDFS, which implements elastic blocks among other features and can be used as a substitute for the Apache Hadoop Distributed File System.
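
To make the elastic-block idea concrete, the sketch below sizes the blocks of a file in proportion to a per-worker weight instead of using one fixed block size. It is only an illustration under stated assumptions: the function name `elastic_boundaries` and the integer "capacity weight" per worker are hypothetical and are not taken from VeloxDFS.

```python
def elastic_boundaries(file_size, weights):
    """Split [0, file_size) into contiguous blocks proportional to `weights`.

    `weights` holds one positive integer per working unit (an assumed
    stand-in for how much input each unit can currently absorb).
    Returns a list of (start, end) byte offsets, one pair per unit.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    boundaries = []
    start = 0
    cumulative = 0
    for i, weight in enumerate(weights):
        cumulative += weight
        if i == len(weights) - 1:
            end = file_size  # last block absorbs any rounding remainder
        else:
            end = (file_size * cumulative) // total
        boundaries.append((start, end))
        start = end
    return boundaries


fixed = elastic_boundaries(100, [1, 1, 1, 1])  # equal weights: conventional blocks
elastic = elastic_boundaries(100, [1, 1, 2])   # third unit receives twice the input
```

Calling the function again with updated weights yields new boundaries, which is how a block's size could change while a job is running.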