Open Information Extraction: A Review of Baseline Techniques, Approaches, and Applications
With the abundant amount of available online and offline text data, there
arises a crucial need to extract the relation between phrases and summarize the
main content of each document in a few words. For this purpose, there have been
many studies recently in Open Information Extraction (OIE). OIE improves upon
relation extraction techniques by analyzing relations across different domains
and avoids requiring hand-labeling pre-specified relations in sentences. This
paper surveys recent approaches to OIE and its applications to Knowledge Graph
(KG), text summarization, and Question Answering (QA). Moreover, the paper
describes the baseline OIE methods for relation extraction. It briefly discusses the
main approaches and the pros and cons of each method. Finally, it gives an
overview of challenges, open issues, and future work opportunities for OIE,
relation extraction, and OIE applications. Comment: 15 pages, 9 figures
Optimal and Error-Free Multi-Valued Byzantine Consensus Through Parallel Execution
Multi-valued Byzantine Consensus (BC), in which processes must reach agreement on a single -bit value, is an essential primitive in the design of distributed cryptographic protocols and fault-tolerant distributed systems.
One of the most desirable traits for a multi-valued BC protocol is to be error-free, that is, to have zero probability of producing incorrect results.
The most efficient error-free multi-valued BC protocols are built as extension protocols, which reduce agreement on large values to agreement on small sequences of bits whose lengths are independent of .
The best extension protocols achieve communication complexity, which is optimal, when is large relative to .
Unfortunately, all known error-free and communication-optimal BC extension protocols require each process to broadcast at least bits with a binary Byzantine Broadcast (BB) protocol.
This design limits the scalability of these protocols to many processes, since when is large, the binary broadcasts significantly inflate the overall number of bits communicated by the extension protocol.
In this paper, we present Byzantine Consensus with Parallel Execution (BCPE), the first error-free and communication-optimal BC extension protocol in which each process only broadcasts a single bit with a binary BB protocol.
BCPE is a synchronous and deterministic protocol, and tolerates faulty processes (the best resilience possible).
Our evaluation shows that BCPE's design makes it significantly more scalable than the best existing protocol by Ganesh and Patra.
For 1,000 processes to agree on 2 MB of data, BCPE communicates fewer bits.
For agreement on 10 MB of data, BCPE communicates fewer bits.
BCPE also matches the best existing protocol in all other standard efficiency metrics
MTrainS: Improving DLRM training efficiency using heterogeneous memories
Recommendation models are very large, requiring terabytes (TB) of memory
during training. In pursuit of better quality, the model size and complexity
grow over time, which requires additional training data to avoid overfitting.
This model growth demands a large number of resources in data centers. Hence,
training efficiency is becoming considerably more important to keep the data
center power demand manageable. In Deep Learning Recommendation Models (DLRM),
sparse features capturing categorical inputs through embedding tables are the
major contributors to model size and require high memory bandwidth. In this
paper, we study the bandwidth requirement and locality of embedding tables in
real-world deployed models. We observe that the bandwidth requirement is not
uniform across different tables and that embedding tables show high temporal
locality. We then design MTrainS, which hierarchically leverages heterogeneous
memory, including byte- and block-addressable Storage Class Memory, for DLRM.
MTrainS allows for higher memory capacity per node and
increases training efficiency by lowering the need to scale out to multiple
hosts in memory-capacity-bound use cases. By optimizing the platform memory
hierarchy, we reduce the number of nodes for training by 4-8X, saving power and
cost of training while meeting our target training performance
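The temporal-locality observation in the abstract above can be illustrated with a small sketch. This is not MTrainS's placement policy, which the abstract does not describe; it assumes a plain LRU cache as a stand-in for a fast memory tier, and the function name `lru_hit_rate` and the access trace are hypothetical. It only shows how one might measure what fraction of embedding-row lookups a small fast tier would serve.

```python
from collections import OrderedDict


def lru_hit_rate(accesses, capacity):
    """Fraction of accesses served by an LRU cache holding `capacity` rows.

    `accesses` is a sequence of embedding-row ids (hypothetical trace).
    LRU is an illustrative assumption, not the policy used by MTrainS.
    """
    cache = OrderedDict()  # row id -> placeholder, ordered oldest to newest
    hits = 0
    for row in accesses:
        if row in cache:
            hits += 1
            cache.move_to_end(row)  # mark as most recently used
        else:
            cache[row] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used row
    return hits / len(accesses) if accesses else 0.0


if __name__ == "__main__":
    trace = [1, 2, 1, 2, 3, 1, 2, 1]  # made-up access trace
    for capacity in (2, 3):
        print(capacity, lru_hit_rate(trace, capacity))
```

A trace with high temporal locality reaches a high hit rate at a small capacity, which is the property that makes a small fast tier in front of larger, slower memory worthwhile.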
Hi-Rise: A high-radix switch for 3D integration with single-cycle arbitration
This paper proposes a novel 3D switch, called 'Hi-Rise', that employs high-radix switches to efficiently route data across multiple stacked layers of dies. The proposed interconnect is hierarchical and composed of two switches per silicon layer and a set of dedicated layer-to-layer channels. However, a hierarchical 3D switch can lead to unfair arbitration across different layers. To address this, the paper proposes a unique class-based arbitration scheme that is fully integrated into the switching fabric and is easy to implement. It makes the 3D hierarchical switch's fairness comparable to that of a flat 2D switch with least-recently-granted arbitration. The 3D switch is evaluated for different radices, numbers of stacked layers, and 3D integration technologies. A 64-radix, 128-bit width, 4-layer Hi-Rise evaluated in a 32nm technology has a throughput of 10.65 Tbps for uniform random traffic. Compared to a 2D design, this corresponds to a 15% improvement in throughput, a 33% area reduction, a 20% latency reduction, and a 38% reduction in energy per transaction
Hardware Acceleration for Similarity Measurement in Natural Language Processing
The continuation of Moore's law scaling, but in the absence of Dennard scaling, motivates an emphasis on energy-efficient accelerator-based designs for future applications. In natural language processing, the conventional approach to automatically analyze vast text collections, using scale-out processing, incurs high energy and hardware costs since the central compute-intensive step of similarity measurement often entails pair-wise, all-to-all comparisons. We propose a custom hardware accelerator for similarity measures that leverages data streaming, memory latency hiding, and parallel computation across variable-length threads. We evaluate our design through a combination of architectural simulation and RTL synthesis. When executing the dominant kernel in a semantic indexing application for documents, we demonstrate throughput gains of up to 42× and 58× lower energy per similarity computation compared to an optimized software implementation, while requiring less than 1.3% of the area of a conventional core
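The pair-wise, all-to-all comparison step mentioned in the abstract above can be sketched in software as follows. The abstract does not name a specific similarity measure, so cosine similarity is an assumption here, and the function names and toy vectors are hypothetical. The sketch only shows why the work grows quadratically with the number of documents: n documents require n(n-1)/2 comparisons.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def all_pairs_similarity(vectors):
    """Compare every unordered pair of vectors: n * (n - 1) / 2 comparisons."""
    return {
        (i, j): cosine(vectors[i], vectors[j])
        for i, j in combinations(range(len(vectors)), 2)
    }


if __name__ == "__main__":
    docs = [[1, 0, 1], [1, 0, 1], [0, 1, 0]]  # toy term-count vectors
    for pair, score in all_pairs_similarity(docs).items():
        print(pair, round(score, 3))
```

Each pair is independent of the others, which is what makes this kernel a natural fit for streaming and parallel execution.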