94 research outputs found
The Mondrian Data Engine
The increasing demand for extracting value out of ever-growing data poses an ongoing challenge to system designers, a task only made trickier by the end of Dennard scaling. As the performance density of traditional CPU-centric architectures stagnates, advancing compute capabilities necessitates novel architectural approaches. Near-memory processing (NMP) architectures are reemerging as promising candidates to improve computing efficiency through tight coupling of logic and memory. NMP architectures are especially fitting for data analytics, as they provide immense bandwidth to memory-resident data and dramatically reduce data movement, the main source of energy consumption. Modern data analytics operators are optimized for CPU execution and hence rely on large caches and employ random memory accesses. In the context of NMP, such random accesses result in wasteful DRAM row buffer activations that account for a significant fraction of the total memory access energy. In addition, utilizing NMP’s ample bandwidth with fine-grained random accesses requires complex hardware that cannot be accommodated under NMP’s tight area and power constraints. Our thesis is that efficient NMP calls for an algorithm-hardware co-design that favors algorithms with sequential accesses to enable simple hardware that accesses memory in streams. We introduce an instance of such a co-designed NMP architecture for data analytics, the Mondrian Data Engine. Compared to a CPU-centric and a baseline NMP system, the Mondrian Data Engine improves the performance of basic data analytics operators by up to 49× and 5×, and efficiency by up to 28× and 5×, respectively
FPSA: A Full System Stack Solution for Reconfigurable ReRAM-based NN Accelerator Architecture
Neural Network (NN) accelerators with emerging ReRAM (resistive random access
memory) technologies have been investigated as one of the promising solutions
to address the \textit{memory wall} challenge, due to the unique capability of
\textit{processing-in-memory} within ReRAM-crossbar-based processing elements
(PEs). However, the high efficiency and high density advantages of ReRAM have
not been fully utilized due to the huge communication demands among PEs and the
overhead of peripheral circuits.
In this paper, we propose a full system stack solution, composed of a
reconfigurable architecture design, Field Programmable Synapse Array (FPSA) and
its software system including neural synthesizer, temporal-to-spatial mapper,
and placement & routing. We highly leverage the software system to make the
hardware design compact and efficient. To satisfy the high-performance
communication demand, we optimize it with a reconfigurable routing architecture
and the placement & routing tool. To improve the computational density, we
greatly simplify the PE circuit with the spiking schema and then adopt neural
synthesizer to enable the high density computation-resources to support
different kinds of NN operations. In addition, we provide spiking memory blocks
(SMBs) and configurable logic blocks (CLBs) in hardware and leverage the
temporal-to-spatial mapper to utilize them to balance the storage and
computation requirements of NN. Owing to the end-to-end software system, we can
efficiently deploy existing deep neural networks to FPSA. Evaluations show
that, compared to one of state-of-the-art ReRAM-based NN accelerators, PRIME,
the computational density of FPSA improves by 31x; for representative NNs, its
inference performance can achieve up to 1000x speedup.Comment: Accepted by ASPLOS 201
PIM-Enclave: Bringing Confidential Computation Inside Memory
Demand for data-intensive workloads and confidential computing are the
prominent research directions shaping the future of cloud computing. Computer
architectures are evolving to accommodate the computing of large data better.
Protecting the computation of sensitive data is also an imperative yet
challenging objective; processor-supported secure enclaves serve as the key
element in confidential computing in the cloud. However, side-channel attacks
are threatening their security boundaries. The current processor architectures
consume a considerable portion of its cycles in moving data. Near data
computation is a promising approach that minimizes redundant data movement by
placing computation inside storage. In this paper, we present a novel design
for Processing-In-Memory (PIM) as a data-intensive workload accelerator for
confidential computing. Based on our observation that moving computation closer
to memory can achieve efficiency of computation and confidentiality of the
processed information simultaneously, we study the advantages of confidential
computing \emph{inside} memory. We then explain our security model and
programming model developed for PIM-based computation offloading. We construct
our findings into a software-hardware co-design, which we call PIM-Enclave. Our
design illustrates the advantages of PIM-based confidential computing
acceleration. Our evaluation shows PIM-Enclave can provide a side-channel
resistant secure computation offloading and run data-intensive applications
with negligible performance overhead compared to baseline PIM model
TensorDash: Exploiting Sparsity to Accelerate Deep Neural Network Training and Inference
TensorDash is a hardware level technique for enabling data-parallel MAC units
to take advantage of sparsity in their input operand streams. When used to
compose a hardware accelerator for deep learning, TensorDash can speedup the
training process while also increasing energy efficiency. TensorDash combines a
low-cost, sparse input operand interconnect comprising an 8-input multiplexer
per multiplier input, with an area-efficient hardware scheduler. While the
interconnect allows a very limited set of movements per operand, the scheduler
can effectively extract sparsity when it is present in the activations, weights
or gradients of neural networks. Over a wide set of models covering various
applications, TensorDash accelerates the training process by
while being more energy-efficient, more energy
efficient when taking on-chip and off-chip memory accesses into account. While
TensorDash works with any datatype, we demonstrate it with both
single-precision floating-point units and bfloat16
- …