43,145 research outputs found

    STEP: A Distributed Multi-threading Framework Towards Efficient Data Analytics

    Various general-purpose distributed systems have been proposed to cope with high-diversity applications in the pipeline of Big Data analytics. Most of them provide simple yet effective primitives to simplify distributed programming. While the rigid primitives offer great ease of use to savvy programmers, they may compromise efficiency in performance and flexibility in data representation and programming specifications, which are critical properties in real systems. In this paper, we discuss the limitations of coarse-grained primitives and aim to provide an alternative for users to have flexible control over distributed programs and operate globally shared data more efficiently. We develop STEP, a novel distributed framework based on an in-memory key-value store. The key idea of STEP is to adapt multi-threading in a single machine to a distributed environment. STEP enables users to take fine-grained control over distributed threads and apply task-specific optimizations in a flexible manner. The underlying key-value store serves as distributed shared memory to keep globally shared data. To ensure ease of use, STEP offers plentiful effective interfaces for distributed shared data manipulation, cluster management, distributed thread management and synchronization. We conduct extensive experimental studies to evaluate the performance of STEP using real data sets. The results show that STEP outperforms the state-of-the-art general-purpose distributed systems as well as a specialized ML platform in many real applications.

    HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases

    Recent improvements in both the performance and scalability of shared-nothing, transactional, in-memory NewSQL databases have reopened the research question of whether distributed metadata for hierarchical file systems can be managed using commodity databases. In this paper, we introduce HopsFS, a next generation distribution of the Hadoop Distributed File System (HDFS) that replaces HDFS' single-node in-memory metadata service with a distributed metadata service built on a NewSQL database. By removing the metadata bottleneck, HopsFS enables an order of magnitude larger and higher throughput clusters compared to HDFS. Metadata capacity has been increased to at least 37 times HDFS' capacity, and in experiments based on a workload trace from Spotify, we show that HopsFS supports 16 to 37 times the throughput of Apache HDFS. HopsFS also has lower latency for many concurrent clients, and no downtime during failover. Finally, as metadata is now stored in a commodity database, it can be safely extended and easily exported to external systems for online analysis and free-text search.

    Improving the Performance and Endurance of Persistent Memory with Loose-Ordering Consistency

    Persistent memory provides high-performance data persistence at main memory. Memory writes need to be performed in strict order to satisfy storage consistency requirements and enable correct recovery from system crashes. Unfortunately, adhering to such a strict order significantly degrades system performance and persistent memory endurance. This paper introduces a new mechanism, Loose-Ordering Consistency (LOC), that satisfies the ordering requirements at significantly lower performance and endurance loss. LOC consists of two key techniques. First, Eager Commit eliminates the need to perform a persistent commit record write within a transaction. We do so by ensuring that we can determine the status of all committed transactions during recovery by storing necessary metadata information statically with blocks of data written to memory. Second, Speculative Persistence relaxes the write ordering between transactions by allowing writes to be speculatively written to persistent memory. A speculative write is made visible to software only after its associated transaction commits. To enable this, our mechanism supports the tracking of committed transaction IDs and multi-versioning in the CPU cache. Our evaluations show that LOC reduces the average performance overhead of memory persistence from 66.9% to 34.9% and the memory write traffic overhead from 17.1% to 3.4% on a variety of workloads. Comment: This paper has been accepted by IEEE Transactions on Parallel and Distributed Systems.

    A Fast Diagnosis Scheme for Distributed Small Embedded SRAMs

    This paper proposes a diagnosis scheme aimed at reducing the diagnosis time of distributed small embedded SRAMs (e-SRAMs). This scheme improves the one proposed in [A parallel built-in self-diagnostic method for embedded memory buffers, A parallel built-in self-diagnostic method for embedded memory arrays]. The improvements are mainly two-fold. On one hand, the diagnosis of time-consuming Data Retention Faults (DRFs), which is neglected by the diagnosis architecture in [A parallel built-in self-diagnostic method for embedded memory buffers, A parallel built-in self-diagnostic method for embedded memory arrays], is now considered and performed via a DFT technique referred to as the "No Write Recovery Test Mode (NWRTM)". On the other hand, a pair comprising a Serial to Parallel Converter (SPC) and a Parallel to Serial Converter (PSC) is utilized to replace the bi-directional serial interface, to avoid the problems of serial fault masking and defect-rate-dependent diagnosis. Results from our evaluations show that the proposed diagnosis scheme achieves increased diagnosis coverage and reduced diagnosis time compared to those obtained in [A parallel built-in self-diagnostic method for embedded memory buffers, A parallel built-in self-diagnostic method for embedded memory arrays], with negligible extra area cost. Comment: Submitted on behalf of EDAA (http://www.edaa.com/)

    Flexible-Latency DRAM: Understanding and Exploiting Latency Variation in Modern DRAM Chips

    This article summarizes key results of our work on experimental characterization and analysis of latency variation and latency-reliability trade-offs in modern DRAM chips, which was published in SIGMETRICS 2016, and examines the work's significance and future potential. The goal of this work is to (i) experimentally characterize and understand the latency variation across cells within a DRAM chip for the three fundamental DRAM operations, and (ii) develop new mechanisms that exploit our understanding of the latency variation to reliably improve performance. To this end, we comprehensively characterize 240 DRAM chips from three major vendors, and make six major new observations about latency variation within DRAM. Notably, we find that (i) there is large latency variation across the cells for each of the three operations; (ii) variation characteristics exhibit significant spatial locality: slower cells are clustered in certain regions of a DRAM chip; and (iii) the three fundamental operations exhibit different reliability characteristics when the latency of each operation is reduced. Based on our observations, we propose Flexible-LatencY DRAM (FLY-DRAM), a mechanism that exploits latency variation across DRAM cells within a DRAM chip to improve system performance. The key idea of FLY-DRAM is to exploit the spatial locality of slower cells within DRAM, and access the faster DRAM regions with reduced latencies for the fundamental operations. Our evaluations show that FLY-DRAM improves the performance of a wide range of applications by 13.3%, 17.6%, and 19.5%, on average, for each of the three different vendors' real DRAM chips, in a simulated 8-core system.

    Adaptive Logging for Distributed In-memory Databases

    A new type of log, the command log, is being employed to replace the traditional data log (e.g., the ARIES log) in in-memory databases. Instead of recording how the tuples are updated, a command log only tracks the transactions being executed, thereby effectively reducing the size of the log and improving performance. Command logging, on the other hand, increases the cost of recovery, because all the transactions in the log after the last checkpoint must be completely redone in case of a failure. In this paper, we first extend the command logging technique to a distributed environment, where all the nodes can perform recovery in parallel. We then propose an adaptive logging approach by combining data logging and command logging. The percentage of data logging versus command logging becomes an optimization between the performance of transaction processing and recovery to suit different OLTP applications. Our experimental study compares the performance of our proposed adaptive logging, ARIES-style data logging and command logging on top of H-Store. The results show that adaptive logging can achieve a 10x boost for recovery and a transaction throughput that is comparable to that of command logging. Comment: 13 pages
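The contrast between the two log types in the abstract above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the stored procedures `deposit` and `transfer`, the dictionary-backed store, and the log layouts are hypothetical and are not taken from the paper or from H-Store. A command log keeps one entry per transaction and recovers by re-executing it, while a data log keeps one entry per updated value and recovers by reapplying the values.

```python
# Toy illustration only: procedure names and log layouts are hypothetical.

def deposit(db, account, amount):
    db[account] = db.get(account, 0) + amount

def transfer(db, src, dst, amount):
    db[src] = db.get(src, 0) - amount
    db[dst] = db.get(dst, 0) + amount

PROCEDURES = {"deposit": deposit, "transfer": transfer}

def run_with_logs(transactions):
    """Execute transactions, writing both a command log and a data log."""
    db, command_log, data_log = {}, [], []
    for name, args in transactions:
        before = dict(db)
        PROCEDURES[name](db, *args)
        # Command log: one entry per transaction (what was executed).
        command_log.append((name, args))
        # Data log: one entry per changed key (how the data was updated).
        for key, value in db.items():
            if before.get(key) != value:
                data_log.append((key, value))
    return db, command_log, data_log

def recover_from_command_log(command_log):
    """Recovery must re-execute every logged transaction."""
    db = {}
    for name, args in command_log:
        PROCEDURES[name](db, *args)
    return db

def recover_from_data_log(data_log):
    """Recovery only reapplies the logged values, with no re-execution."""
    db = {}
    for key, value in data_log:
        db[key] = value
    return db

transactions = [
    ("deposit", ("a", 100)),
    ("deposit", ("b", 50)),
    ("transfer", ("a", "b", 30)),
]
final_db, command_log, data_log = run_with_logs(transactions)
```

In this sketch both recovery paths rebuild the same state; the command log is shorter, but its recovery repeats all the transaction logic, which is the trade-off an adaptive mix of the two log types would balance.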

    Reducing DRAM Refresh Overheads with Refresh-Access Parallelism

    This article summarizes the idea of "refresh-access parallelism," which was published in HPCA 2014, and examines the work's significance and future potential. The overarching objective of our HPCA 2014 paper is to reduce the significant negative performance impact of DRAM refresh with intelligent memory controller mechanisms. To mitigate the negative performance impact of DRAM refresh, our HPCA 2014 paper proposes two complementary mechanisms, DARP (Dynamic Access Refresh Parallelization) and SARP (Subarray Access Refresh Parallelization). The goal is to address the drawbacks of the state-of-the-art per-bank refresh mechanism by building more efficient techniques to parallelize refreshes and accesses within DRAM. First, instead of issuing per-bank refreshes in a round-robin order, as is done today, DARP issues per-bank refreshes to idle banks in an out-of-order manner. Furthermore, DARP proactively schedules refreshes during intervals when a batch of writes is draining to DRAM. Second, SARP exploits the existence of mostly-independent subarrays within a bank. With minor modifications to DRAM organization, it allows a bank to serve memory accesses to an idle subarray while another subarray is being refreshed. Our extensive evaluations on a wide variety of workloads and systems show that our mechanisms improve system performance (and energy efficiency) compared to three state-of-the-art refresh policies, and their performance benefits increase as DRAM density increases. Comment: 9 pages. arXiv admin note: text overlap with arXiv:1712.07754, arXiv:1601.0635
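The difference between round-robin and out-of-order per-bank refresh described above can be sketched as two selection functions. This is a simplified illustration and not the DARP mechanism itself: the function names, the `queued_requests` representation, and the fallback of picking the least-loaded bank when no bank is idle are all assumptions made for the example.

```python
# Simplified sketch; names and the fallback policy are hypothetical.

def pick_round_robin(num_banks, last_refreshed):
    """Baseline: refresh the next bank in fixed order, busy or not."""
    return (last_refreshed + 1) % num_banks

def pick_out_of_order(pending_refresh, queued_requests):
    """Prefer a bank that still needs a refresh and has no queued accesses.

    pending_refresh: set of bank indices still owed a refresh.
    queued_requests: list where queued_requests[b] is bank b's queue depth.
    """
    idle = [b for b in pending_refresh if queued_requests[b] == 0]
    if idle:
        return min(idle)
    # Assumed fallback: no bank is idle, so pick the least-loaded one.
    return min(pending_refresh, key=lambda b: (queued_requests[b], b))

queued_requests = [5, 0, 3, 0]       # banks 1 and 3 are idle
pending_refresh = {0, 1, 2, 3}       # every bank still needs a refresh

baseline_choice = pick_round_robin(num_banks=4, last_refreshed=3)
out_of_order_choice = pick_out_of_order(pending_refresh, queued_requests)
```

With these inputs the round-robin baseline selects bank 0, which has five queued requests that must wait behind the refresh, while the out-of-order choice selects idle bank 1.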

    Parallel and Distributed Collaborative Filtering: A Survey

    Collaborative filtering is amongst the most preferred techniques when implementing recommender systems. Recently, great interest has turned towards parallel and distributed implementations of collaborative filtering algorithms. This work is a survey of the parallel and distributed collaborative filtering implementations, aiming not only to provide a comprehensive presentation of the field's development, but also to offer future research orientation by highlighting the issues that need to be further developed. Comment: 46 pages

    Data protection by means of fragmentation in various different distributed storage systems - a survey

    This paper analyzes various distributed storage systems that use data fragmentation and dispersal as a way of protection. Existing solutions have been organized into two categories: bitwise and structurewise. Systems from the bitwise category operate on unstructured data and in a uniform environment. Those having structured input data with a predefined confidentiality level and operating in an environment that is heterogeneous in terms of machine trustworthiness were classified as structurewise. Furthermore, we outline high-level requirements and desirable architecture traits of an efficient data fragmentation system, which will address performance (including latency), availability, resilience and scalability. Comment: arXiv admin note: text overlap with arXiv:1512.0295

    Execution Templates: Caching Control Plane Decisions for Strong Scaling of Data Analytics

    Control planes of cloud frameworks trade off between scheduling granularity and performance. Centralized systems schedule at task granularity, but only schedule a few thousand tasks per second. Distributed systems schedule hundreds of thousands of tasks per second, but changing the schedule is costly. We present execution templates, a control plane abstraction that can schedule hundreds of thousands of tasks per second while supporting fine-grained, per-task scheduling decisions. Execution templates leverage a program's repetitive control flow to cache blocks of frequently-executed tasks. Executing a task in a template requires sending a single message. Large-scale scheduling changes install new templates, while small changes apply edits to existing templates. Evaluations of execution templates in Nimbus, a data analytics framework, find that they provide the fine-grained scheduling flexibility of centralized control planes while matching the strong scaling of distributed ones. Execution templates support complex, real-world applications, such as a fluid simulation with a triply nested loop and data-dependent branches. Comment: To appear at USENIX ATC 201
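The caching idea in the abstract above can be illustrated with a small message-counting sketch. It is not the Nimbus API: the `Worker` class, its method names, and the assumption that one message triggers a whole cached block of tasks are hypothetical choices made for illustration, and represent only one reading of the abstract.

```python
# Hypothetical sketch: a worker that caches task blocks as templates.

class Worker:
    def __init__(self):
        self.templates = {}          # template id -> cached list of tasks
        self.messages_received = 0   # control-plane messages seen so far
        self.executed = []           # (iteration, task) pairs that ran

    def run_task(self, iteration, task):
        """Per-task scheduling: one control message for every task."""
        self.messages_received += 1
        self.executed.append((iteration, task))

    def install(self, template_id, tasks):
        """Large scheduling change: cache a new block of tasks."""
        self.messages_received += 1
        self.templates[template_id] = list(tasks)

    def edit(self, template_id, index, task):
        """Small scheduling change: patch one task in a cached block."""
        self.messages_received += 1
        self.templates[template_id][index] = task

    def instantiate(self, template_id, iteration):
        """Run the whole cached block in response to a single message."""
        self.messages_received += 1
        for task in self.templates[template_id]:
            self.executed.append((iteration, task))

loop_body = ["load", "map", "reduce", "store"]
iterations = 10

per_task = Worker()
for i in range(iterations):
    for task in loop_body:
        per_task.run_task(i, task)

templated = Worker()
templated.install("loop", loop_body)
for i in range(iterations):
    templated.instantiate("loop", i)
```

Both workers execute the same 40 tasks, but the per-task worker receives 40 control messages while the templated worker receives 11 (one install plus ten instantiations); a later `edit` call would cost one more message instead of a full reinstall.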