The future of computing beyond Moore's Law.
Moore's Law is a techno-economic model that has enabled the information technology industry to double the performance and functionality of digital electronics roughly every 2 years within a fixed cost, power and area. Advances in silicon lithography have enabled this exponential miniaturization of electronics, but, as transistors reach atomic scale and fabrication costs continue to rise, the classical technological driver that has underpinned Moore's Law for 50 years is failing and is anticipated to flatten by 2025. This article provides an updated view of what a post-exascale system will look like and the challenges ahead, based on our most recent understanding of technology roadmaps. It also discusses the tapering of historical improvements, and how it affects options available to continue scaling of successors to the first exascale machine. Lastly, this article covers the many different opportunities and strategies available to continue computing performance improvements in the absence of historical technology drivers. This article is part of a discussion meeting issue 'Numerical algorithms for high-performance computational science'
tieval: An Evaluation Framework for Temporal Information Extraction Systems
Temporal information extraction (TIE) has attracted a great deal of interest
over the last two decades, leading to the development of a significant number
of datasets. Despite its benefits, having access to a large volume of corpora
makes it difficult to benchmark TIE systems. On the one hand,
different datasets have different annotation schemes, thus hindering the
comparison between competitors across different corpora. On the other hand, the
fact that each corpus is commonly disseminated in a different format requires a
considerable engineering effort for a researcher/practitioner to develop
parsers for all of them. This constraint forces researchers to select a limited
number of datasets to evaluate their systems, which consequently limits the
comparability of the systems. Yet another obstacle that hinders the
comparability of the TIE systems is the evaluation metric employed. While most
research works adopt traditional metrics such as precision, recall, and F1,
a few others prefer temporal awareness -- a metric tailored to be more
comprehensive on the evaluation of temporal systems. Although the reason for
the absence of temporal awareness in the evaluation of most systems is not
clear, one of the factors that certainly weighs on this decision is the necessity
to implement the temporal closure algorithm in order to compute temporal
awareness, which is neither straightforward to implement nor currently
easily available. All in all, these problems have limited the fair comparison
between approaches and consequently, the development of temporal extraction
systems. To mitigate these problems, we have developed tieval, a Python library
that provides a concise interface for importing different corpora and
facilitates system evaluation. In this paper, we present the first public
release of tieval and highlight its most relevant features.
Comment: 10 page
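The abstract above notes that temporal awareness depends on a temporal closure step. The following is a minimal sketch of that idea, not tieval's API: it assumes a single relation type ("a before b") stored as pairs, computes the transitive closure by brute force, and scores a system by checking its relations against the reference closure and vice versa. The function names are hypothetical, and the graph-reduction step used in the full metric is omitted for brevity.

```python
from itertools import product


def temporal_closure(relations):
    """Transitive closure of (a, b) pairs, each meaning 'a before b'."""
    closure = set(relations)
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot so new pairs can be added safely.
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure


def temporal_awareness(system, reference):
    """Simplified temporal-awareness precision, recall, and F1.

    Assumption: only 'before' relations, and no minimal-graph reduction.
    """
    sys_closure = temporal_closure(system)
    ref_closure = temporal_closure(reference)
    # A system relation counts as correct if the reference implies it.
    precision = len(system & ref_closure) / len(system) if system else 0.0
    # A reference relation counts as recovered if the system implies it.
    recall = len(reference & sys_closure) / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


reference = {("e1", "e2"), ("e2", "e3")}
system = {("e1", "e2"), ("e1", "e3")}
print(temporal_awareness(system, reference))
```

In this example the system relation ("e1", "e3") is never annotated explicitly in the reference but is implied by it, so it still counts toward precision. Plain set-overlap precision would miss that credit.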
Loom: Exploiting Weight and Activation Precisions to Accelerate Convolutional Neural Networks
Loom (LM), a hardware inference accelerator for Convolutional Neural Networks
(CNNs) is presented. In LM every bit of data precision that can be saved
translates to proportional performance gains. Specifically, for convolutional
layers LM's execution time scales inversely proportionally with the precisions
of both weights and activations. For fully-connected layers LM's performance
scales inversely proportionally with the precision of the weights. LM targets
area- and bandwidth-constrained System-on-a-Chip designs such as those found on
mobile devices that cannot afford the multi-megabyte buffers that would be
needed to store each layer on-chip. Accordingly, given a data bandwidth budget,
LM boosts energy efficiency and performance over an equivalent bit-parallel
accelerator. For both weights and activations LM can exploit profile-derived
per-layer precisions. However, at runtime LM further trims activation precisions
at a granularity much smaller than a layer. Moreover, it can naturally exploit
weight precision variability at a smaller granularity than a layer. On average,
across several image classification CNNs and for a configuration that can
perform the equivalent of 128 16b x 16b multiply-accumulate operations per
cycle LM outperforms a state-of-the-art bit-parallel accelerator [1] by 4.38x
without any loss in accuracy while being 3.54x more energy efficient. LM can
trade off accuracy for additional improvements in execution performance and
energy efficiency and compares favorably to an accelerator that targeted only
activation precisions. We also study 2- and 4-bit LM variants and find that the
2-bit per cycle variant is the most energy efficient
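The scaling rule stated in the Loom abstract can be illustrated with an idealized model. This sketch assumes a 16-bit baseline (taken from the "16b x 16b" configuration mentioned above) and ignores bandwidth limits, utilization, and every other real-hardware effect, so it gives an upper bound rather than the measured 4.38x average. The function names are hypothetical.

```python
BASELINE_BITS = 16  # assumption: baseline precision from the 16b x 16b configuration


def conv_speedup(weight_bits, activation_bits, baseline_bits=BASELINE_BITS):
    """Ideal speedup for a convolutional layer.

    Execution time is modeled as proportional to weight_bits * activation_bits.
    """
    if weight_bits <= 0 or activation_bits <= 0:
        raise ValueError("precisions must be positive")
    return (baseline_bits * baseline_bits) / (weight_bits * activation_bits)


def fc_speedup(weight_bits, baseline_bits=BASELINE_BITS):
    """Ideal speedup for a fully-connected layer.

    Execution time is modeled as proportional to weight_bits only.
    """
    if weight_bits <= 0:
        raise ValueError("precision must be positive")
    return baseline_bits / weight_bits


# Hypothetical per-layer precisions, chosen only for illustration.
print(conv_speedup(weight_bits=8, activation_bits=8))
print(fc_speedup(weight_bits=8))
```

Under this model, halving both precisions in a convolutional layer quadruples the ideal throughput, while halving weight precision in a fully-connected layer only doubles it.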
Synchronous OEIC integrating receiver for optically reconfigurable gate arrays
A monolithically integrated optoelectronic receiver with a low-capacitance on-chip pin photodiode is presented. The receiver is fabricated in a 0.35 µm opto-CMOS process supplied at 3.3 V and, owing to the highly effective integrated pin photodiode, it operates at µW. In addition, a regenerative latch acting as a sense amplifier leads to low electrical power consumption. At 400 Mbit/s, sensitivities of -26.0 dBm and -25.5 dBm are achieved for λ = 635 nm and λ = 675 nm, respectively (BER = 10⁻⁹), with an energy efficiency of 2 pJ/bit
Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices
A recent trend in DNN development is to extend the reach of deep learning
applications to platforms that are more resource and energy constrained, e.g.,
mobile devices. These endeavors aim to reduce the DNN model size and improve
the hardware processing efficiency, and have resulted in DNNs that are much
more compact in their structures and/or have high data sparsity. These compact
or sparse models are different from the traditional large ones in that there is
much more variation in their layer shapes and sizes, and often require
specialized hardware to exploit sparsity for performance improvement. Thus,
many DNN accelerators designed for large DNNs do not perform well on these
models. In this work, we present Eyeriss v2, a DNN accelerator architecture
designed for running compact and sparse DNNs. To deal with the widely varying
layer shapes and sizes, it introduces a highly flexible on-chip network, called
hierarchical mesh, that can adapt to the different amounts of data reuse and
bandwidth requirements of different data types, which improves the utilization
of the computation resources. Furthermore, Eyeriss v2 can process sparse data
directly in the compressed domain for both weights and activations, and
therefore is able to improve both processing speed and energy efficiency with
sparse models. Overall, with sparse MobileNet, Eyeriss v2 in a 65nm CMOS
process achieves a throughput of 1470.6 inferences/sec and 2560.3 inferences/J
at a batch size of 1, which is 12.6x faster and 2.5x more energy efficient than
the original Eyeriss running MobileNet. We also present an analysis methodology
called Eyexam that provides a systematic way of understanding the performance
limits for DNN processors as a function of specific characteristics of the DNN
model and accelerator design; it applies these characteristics as sequential
steps to increasingly tighten the bound on the performance limits.
Comment: accepted for publication in IEEE Journal on Emerging and Selected
Topics in Circuits and Systems. This extended version on arXiv also includes
Eyexam in the appendi
Manticore: Hardware-Accelerated RTL Simulation with Static Bulk-Synchronous Parallelism
The demise of Moore's Law and Dennard Scaling has revived interest in
specialized computer architectures and accelerators. Verification and testing
of this hardware heavily uses cycle-accurate simulation of
register-transfer-level (RTL) designs. The best software RTL simulators can
simulate designs at 1 to 1000 kHz, i.e., more than three orders of magnitude
slower than hardware. Faster simulation can increase productivity by speeding
design iterations and permitting more exhaustive exploration.
One possibility is to use parallelism as RTL exposes considerable fine-grain
concurrency. However, state-of-the-art RTL simulators generally perform best
when single-threaded since modern processors cannot effectively exploit
fine-grain parallelism.
This work presents Manticore: a parallel computer designed to accelerate RTL
simulation. Manticore uses a static bulk-synchronous parallel (BSP) execution
model to eliminate runtime synchronization barriers among many simple
processors. Manticore relies entirely on its compiler to schedule resources and
communication. Because RTL code is practically free of long divergent execution
paths, static scheduling is feasible. Communication and synchronization no
longer incur runtime overhead, enabling efficient fine-grain parallelism.
Moreover, static scheduling dramatically simplifies the physical
implementation, significantly increasing the potential parallelism on a chip.
Our 225-core FPGA prototype running at 475 MHz outperforms a state-of-the-art
RTL simulator on an Intel Xeon processor running at 3.3 GHz by up to
27.9x (geomean 5.3x) in nine Verilog benchmarks
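The static bulk-synchronous model described in the Manticore abstract can be sketched in software. This is a toy illustration under stated assumptions, not Manticore's compiler or hardware: the "schedule" is a hand-written list assigning register updates to cores, each core computes from a snapshot of the previous cycle's state, and all results are committed together at the end of the cycle. All names are hypothetical.

```python
def run_static_bsp(schedule, initial_state, cycles):
    """Simulate a design under a fixed (static) bulk-synchronous schedule.

    schedule: one list per core of (register_name, update_fn) pairs.
    Each update_fn reads only the previous cycle's state.
    """
    state = dict(initial_state)
    for _ in range(cycles):
        # Compute phase: every core works from the same read-only snapshot,
        # so no locks or runtime synchronization are needed between cores.
        snapshot = dict(state)
        pending = {}
        for core in schedule:
            for register, update_fn in core:
                pending[register] = update_fn(snapshot)
        # Communication phase: all new register values become visible at once.
        state.update(pending)
    return state


# A 2-bit counter split across two cores (illustrative design).
schedule = [
    [("b0", lambda s: s["b0"] ^ 1)],        # core 0: low bit toggles every cycle
    [("b1", lambda s: s["b1"] ^ s["b0"])],  # core 1: high bit flips when low bit is 1
]
final = run_static_bsp(schedule, {"b0": 0, "b1": 0}, cycles=3)
print(final)
```

Because the assignment of updates to cores is fixed before execution, the order in which cores run within a cycle does not affect the result. That property is what lets a static schedule replace runtime barriers.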
SABRes: Atomic Object Reads for In-Memory Rack-Scale Computing
Modern in-memory services rely on large distributed object stores to achieve the high scalability essential to service thousands of requests concurrently. The independent and unpredictable nature of incoming requests results in random accesses to the object store, triggering frequent remote memory accesses. State-of-the-art distributed memory frameworks leverage the one-sided operations offered by RDMA technology to mitigate the traditionally high cost of remote memory access. Unfortunately, the limited semantics of RDMA one-sided operations bound remote memory access atomicity to a single cache block; therefore, atomic remote object access relies on software mechanisms. Emerging highly integrated rack-scale systems that reduce the latency of one-sided operations to a small multiple of DRAM latency expose the overhead of these software mechanisms as a major latency contributor. This technology-triggered paradigm shift calls for new one-sided operations with stronger semantics. We take a step in that direction by proposing SABRes, a new one-sided operation that provides atomic remote object reads in hardware. We then present LightSABRes, a lightweight hardware accelerator for SABRes that removes all atomicity-associated software overheads. Compared to a state-of-the-art software atomicity mechanism, LightSABRes improves the throughput of a microbenchmark atomically accessing 128B-8KB objects from remote memory by 15-97%, and the throughput of a modern in-memory distributed object store by 30-60%
FPSA: A Full System Stack Solution for Reconfigurable ReRAM-based NN Accelerator Architecture
Neural Network (NN) accelerators with emerging ReRAM (resistive random access
memory) technologies have been investigated as one of the promising solutions
to address the "memory wall" challenge, due to the unique capability of
processing-in-memory within ReRAM-crossbar-based processing elements
(PEs). However, the high efficiency and high density advantages of ReRAM have
not been fully utilized due to the huge communication demands among PEs and the
overhead of peripheral circuits.
In this paper, we propose a full system stack solution, composed of a
reconfigurable architecture design, Field Programmable Synapse Array (FPSA) and
its software system including neural synthesizer, temporal-to-spatial mapper,
and placement & routing. We rely heavily on the software system to make the
hardware design compact and efficient. To satisfy the high-performance
communication demand, we optimize it with a reconfigurable routing architecture
and the placement & routing tool. To improve the computational density, we
greatly simplify the PE circuit with the spiking schema and then adopt neural
synthesizer to enable the high-density computation resources to support
different kinds of NN operations. In addition, we provide spiking memory blocks
(SMBs) and configurable logic blocks (CLBs) in hardware and leverage the
temporal-to-spatial mapper to utilize them to balance the storage and
computation requirements of NN. Owing to the end-to-end software system, we can
efficiently deploy existing deep neural networks to FPSA. Evaluations show
that, compared to one of the state-of-the-art ReRAM-based NN accelerators, PRIME,
the computational density of FPSA improves by 31x; for representative NNs, its
inference performance can achieve up to 1000x speedup.
Comment: Accepted by ASPLOS 201