
    Optimal and Error-Free Multi-Valued Byzantine Consensus Through Parallel Execution

    Multi-valued Byzantine Consensus (BC), in which n processes must reach agreement on a single L-bit value, is an essential primitive in the design of distributed cryptographic protocols and fault-tolerant distributed systems. One of the most desirable traits for a multi-valued BC protocol is to be error-free, i.e., to have zero probability of producing incorrect results. The most efficient error-free multi-valued BC protocols are built as extension protocols, which reduce agreement on large values to agreement on small sequences of bits whose lengths are independent of L. The best extension protocols achieve O(Ln) communication complexity, which is optimal when L is large relative to n. Unfortunately, all known error-free and communication-optimal BC extension protocols require each process to broadcast at least n bits with a binary Byzantine Broadcast (BB) protocol. This design limits the scalability of these protocols to many processes: when n is large, the binary broadcasts significantly inflate the overall number of bits communicated by the extension protocol. In this paper, we present Byzantine Consensus with Parallel Execution (BCPE), the first error-free and communication-optimal BC extension protocol in which each process only broadcasts a single bit with a binary BB protocol. BCPE is a synchronous and deterministic protocol, and tolerates f < n/3 faulty processes (the best resilience possible). Our evaluation shows that BCPE's design makes it significantly more scalable than the best existing protocol by Ganesh and Patra. For 1,000 processes to agree on 2 MB of data, BCPE communicates 10.92× fewer bits; for agreement on 10 MB of data, it communicates 6.97× fewer bits. BCPE also matches the best existing protocol in all other standard efficiency metrics.
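
    To make the scalability argument concrete, the following Python sketch compares total communication under a deliberately simplified cost model: the optimal O(Ln) dissemination term plus the overhead of the binary BB instances, assuming a quadratic-cost binary BB. The cost formula, the assumed BB cost, and the function names are illustrative assumptions, not the paper's analysis, and the sketch does not reproduce the reported 10.92× figure.

```python
# Simplified, illustrative cost model (not from the paper): compares the
# binary Byzantine Broadcast (BB) overhead of an extension protocol in which
# each of the n processes broadcasts n bits versus a single bit, on top of
# the optimal O(L*n) term for disseminating the L-bit value itself.

def extension_protocol_cost(L, n, bits_broadcast_per_process, bb_cost_per_bit):
    """Total bits communicated under this toy model.

    L                          -- value length in bits
    n                          -- number of processes
    bits_broadcast_per_process -- bits each process sends through binary BB
    bb_cost_per_bit            -- assumed total-bit cost of one binary BB
                                  instance (classic protocols are O(n^2))
    """
    dissemination = L * n                                 # communication-optimal term
    bb_overhead = n * bits_broadcast_per_process * bb_cost_per_bit
    return dissemination + bb_overhead

if __name__ == "__main__":
    n = 1000
    L = 2 * 8 * 10**6              # 2 MB expressed in bits
    bb_cost = n * n                # assume a quadratic-cost binary BB

    prior = extension_protocol_cost(L, n, bits_broadcast_per_process=n,
                                    bb_cost_per_bit=bb_cost)
    bcpe_like = extension_protocol_cost(L, n, bits_broadcast_per_process=1,
                                        bb_cost_per_bit=bb_cost)
    print(f"n bits per process : {prior:.3e} bits total")
    print(f"1 bit per process  : {bcpe_like:.3e} bits total")
```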

    Open Information Extraction: A Review of Baseline Techniques, Approaches, and Applications

    With the abundance of text data available online and offline, there is a crucial need to extract the relations between phrases and to summarize the main content of each document in a few words. For this purpose, there have been many recent studies in Open Information Extraction (OIE). OIE improves upon relation extraction techniques by analyzing relations across different domains and avoiding the need to hand-label pre-specified relations in sentences. This paper surveys recent OIE approaches and their applications to Knowledge Graphs (KG), text summarization, and Question Answering (QA). Moreover, the paper describes the baseline OIE methods used in relation extraction and briefly discusses the main approaches along with the pros and cons of each. Finally, it gives an overview of challenges, open issues, and future work opportunities for OIE, relation extraction, and OIE applications.
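
    As a concrete illustration of the OIE task itself (not of any surveyed system), the toy extractor below turns a sentence into (subject, relation, object) triples using a single hand-written pattern. Real OIE systems rely on syntactic parses or neural models; the pattern and example sentence here are assumptions for demonstration only.

```python
import re

# Toy illustration of the Open Information Extraction task: turn a sentence
# into (subject, relation, object) triples without pre-specified relation
# labels. The single regex pattern below only demonstrates the output format.

PATTERN = re.compile(
    r"^(?P<subj>[A-Z][\w ]+?)\s+"
    r"(?P<rel>was born in|is located in|works at|founded)\s+"
    r"(?P<obj>[\w ]+?)\.?$"
)

def extract_triples(sentence):
    match = PATTERN.match(sentence.strip())
    if not match:
        return []
    return [(match["subj"], match["rel"], match["obj"])]

print(extract_triples("Marie Curie was born in Warsaw."))
# [('Marie Curie', 'was born in', 'Warsaw')]
```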

    Hi-Rise: A high-radix switch for 3D integration with single-cycle arbitration

    This paper proposes a novel 3D switch, called 'Hi-Rise', that employs high-radix switches to efficiently route data across multiple stacked layers of dies. The proposed interconnect is hierarchical, composed of two switches per silicon layer and a set of dedicated layer-to-layer channels. However, a hierarchical 3D switch can lead to unfair arbitration across different layers. To address this, the paper proposes a unique class-based arbitration scheme that is fully integrated into the switching fabric and easy to implement. It makes the 3D hierarchical switch's fairness comparable to that of a flat 2D switch with least-recently-granted arbitration. The 3D switch is evaluated for different radices, numbers of stacked layers, and different 3D integration technologies. A 64-radix, 128-bit-wide, 4-layer Hi-Rise evaluated in a 32nm technology has a throughput of 10.65 Tbps for uniform random traffic. Compared to a 2D design, this corresponds to a 15% improvement in throughput, a 33% area reduction, a 20% latency reduction, and a 38% reduction in energy per transaction.
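
    The fairness baseline mentioned above, least-recently-granted arbitration over request classes (e.g., one class per source layer), can be sketched behaviorally as follows. This is an illustrative Python model of the arbitration idea only, not the Hi-Rise hardware logic; the class names and API are assumptions.

```python
from collections import deque

# Behavioral sketch of a least-recently-granted (LRG) arbiter over request
# classes, e.g. one class per source layer in a stacked 3D switch.

class ClassBasedArbiter:
    def __init__(self, classes):
        # Priority order: front of the deque = least recently granted.
        self.priority = deque(classes)

    def grant(self, requests):
        """Pick one requesting class; requests is a set of class ids."""
        for cls in list(self.priority):
            if cls in requests:
                # Move the winner to the back: it becomes most recently granted.
                self.priority.remove(cls)
                self.priority.append(cls)
                return cls
        return None

arb = ClassBasedArbiter(classes=["layer0", "layer1", "layer2", "layer3"])
print(arb.grant({"layer2", "layer3"}))  # layer2 (earlier in priority order)
print(arb.grant({"layer2", "layer3"}))  # layer3 (layer2 was just granted)
```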

    MTrainS: Improving DLRM training efficiency using heterogeneous memories

    Recommendation models are very large, requiring terabytes (TB) of memory during training. In pursuit of better quality, model size and complexity grow over time, which requires additional training data to avoid overfitting. This growth demands a large number of resources in data centers; hence, training efficiency is becoming considerably more important to keep data center power demand manageable. In Deep Learning Recommendation Models (DLRM), sparse features that capture categorical inputs through embedding tables are the major contributors to model size and require high memory bandwidth. In this paper, we study the bandwidth requirements and locality of embedding tables in real-world deployed models. We observe that the bandwidth requirement is not uniform across different tables and that embedding tables show high temporal locality. We then design MTrainS, which leverages heterogeneous memory, including byte- and block-addressable Storage Class Memory, hierarchically for DLRM. MTrainS allows for higher memory capacity per node and increases training efficiency by lowering the need to scale out to multiple hosts in memory-capacity-bound use cases. By optimizing the platform memory hierarchy, we reduce the number of nodes for training by 4-8×, saving the power and cost of training while meeting our target training performance.
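
    A minimal sketch of the kind of tiered placement such a design implies: rank embedding tables by access intensity and keep the hottest ones in fast memory until a capacity budget is exhausted, spilling the rest to Storage Class Memory. The heuristic, table names, and capacity figures below are illustrative assumptions, not MTrainS's actual placement policy.

```python
# Toy tiered-placement heuristic: put the most bandwidth-hungry embedding
# tables in fast memory (e.g. DRAM) until its capacity budget is spent,
# and spill the remaining, colder tables to Storage Class Memory (SCM).
# Table sizes, access counts, and capacities below are illustrative only.

def place_tables(tables, fast_capacity_gb):
    """tables: list of (name, size_gb, accesses_per_step)."""
    placement, used = {}, 0.0
    # Hottest-first: rank by accesses per GB so small, hot tables win ties.
    for name, size_gb, accesses in sorted(
            tables, key=lambda t: t[2] / t[1], reverse=True):
        if used + size_gb <= fast_capacity_gb:
            placement[name] = "DRAM"
            used += size_gb
        else:
            placement[name] = "SCM"
    return placement

tables = [
    ("user_id",   120.0, 9.0e6),   # large, hot
    ("item_id",    80.0, 7.5e6),
    ("zip_code",    2.0, 4.0e4),   # small, lukewarm
    ("device_id",  30.0, 1.0e5),   # colder
]
print(place_tables(tables, fast_capacity_gb=128.0))
```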

    Hardware Acceleration for Similarity Measurement in Natural Language Processing

    The continuation of Moore's law scaling, but in the absence of Dennard scaling, motivates an emphasis on energy-efficient, accelerator-based designs for future applications. In natural language processing, the conventional approach to automatically analyzing vast text collections, using scale-out processing, incurs high energy and hardware costs, since the central compute-intensive step of similarity measurement often entails pair-wise, all-to-all comparisons. We propose a custom hardware accelerator for similarity measures that leverages data streaming, memory latency hiding, and parallel computation across variable-length threads. We evaluate our design through a combination of architectural simulation and RTL synthesis. When executing the dominant kernel in a semantic indexing application for documents, we demonstrate throughput gains of up to 42× and 58× lower energy per similarity computation compared to an optimized software implementation, while requiring less than 1.3% of the area of a conventional core.
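
    For reference, the pair-wise, all-to-all similarity kernel described above can be expressed in a few lines of NumPy as an all-pairs cosine similarity over document vectors. This software baseline only shows the O(n^2) comparison structure the accelerator targets; the random vectors and function name are assumptions.

```python
import numpy as np

# Reference implementation of a pair-wise, all-to-all similarity kernel
# (here: cosine similarity over document vectors). This is only a software
# baseline illustrating the quadratic comparison structure that motivates
# hardware acceleration; the document vectors are random placeholders.

def all_pairs_cosine(docs):
    """docs: (n_docs, dim) array; returns an (n_docs, n_docs) similarity matrix."""
    norms = np.linalg.norm(docs, axis=1, keepdims=True)
    unit = docs / np.clip(norms, 1e-12, None)   # avoid division by zero
    return unit @ unit.T                        # every document vs. every other

rng = np.random.default_rng(0)
docs = rng.random((4, 8))        # 4 toy documents, 8-dimensional features
print(np.round(all_pairs_cosine(docs), 3))
```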