136 research outputs found
FPSA: A Full System Stack Solution for Reconfigurable ReRAM-based NN Accelerator Architecture
Neural Network (NN) accelerators with emerging ReRAM (resistive random access
memory) technologies have been investigated as one of the promising solutions
to address the \textit{memory wall} challenge, due to the unique capability of
\textit{processing-in-memory} within ReRAM-crossbar-based processing elements
(PEs). However, the high efficiency and high density advantages of ReRAM have
not been fully utilized due to the huge communication demands among PEs and the
overhead of peripheral circuits.
In this paper, we propose a full system stack solution, composed of a
reconfigurable architecture design, Field Programmable Synapse Array (FPSA) and
its software system including neural synthesizer, temporal-to-spatial mapper,
and placement & routing. We highly leverage the software system to make the
hardware design compact and efficient. To satisfy the high-performance
communication demand, we optimize it with a reconfigurable routing architecture
and the placement & routing tool. To improve the computational density, we
greatly simplify the PE circuit with the spiking schema and then adopt neural
synthesizer to enable the high density computation-resources to support
different kinds of NN operations. In addition, we provide spiking memory blocks
(SMBs) and configurable logic blocks (CLBs) in hardware and leverage the
temporal-to-spatial mapper to utilize them to balance the storage and
computation requirements of NN. Owing to the end-to-end software system, we can
efficiently deploy existing deep neural networks to FPSA. Evaluations show
that, compared to one of state-of-the-art ReRAM-based NN accelerators, PRIME,
the computational density of FPSA improves by 31x; for representative NNs, its
inference performance can achieve up to 1000x speedup.Comment: Accepted by ASPLOS 201
Deterministic, Stash-Free Write-Only ORAM
Write-Only Oblivious RAM (WoORAM) protocols provide privacy by encrypting the
contents of data and also hiding the pattern of write operations over that
data. WoORAMs provide better privacy than plain encryption and better
performance than more general ORAM schemes (which hide both writing and reading
access patterns), and the write-oblivious setting has been applied to important
applications of cloud storage synchronization and encrypted hidden volumes. In
this paper, we introduce an entirely new technique for Write-Only ORAM, called
DetWoORAM. Unlike previous solutions, DetWoORAM uses a deterministic,
sequential writing pattern without the need for any "stashing" of blocks in
local state when writes fail. Our protocol, while conceptually simple, provides
substantial improvement over prior solutions, both asymptotically and
experimentally. In particular, under typical settings the DetWoORAM writes only
2 blocks (sequentially) to backend memory for each block written to the device,
which is optimal. We have implemented our solution using the BUSE (block device
in user-space) module and tested DetWoORAM against both an encryption only
baseline of dm-crypt and prior, randomized WoORAM solutions, measuring only a
3x-14x slowdown compared to an encryption-only baseline and around 6x-19x
speedup compared to prior work
Strongly Secure and Efficient Data Shuffle On Hardware Enclaves
Mitigating memory-access attacks on the Intel SGX architecture is an
important and open research problem. A natural notion of the mitigation is
cache-miss obliviousness which requires the cache-misses emitted during an
enclave execution are oblivious to sensitive data. This work realizes the
cache-miss obliviousness for the computation of data shuffling. The proposed
approach is to software-engineer the oblivious algorithm of Melbourne shuffle
on the Intel SGX/TSX architecture, where the Transaction Synchronization
eXtension (TSX) is (ab)used to detect the occurrence of cache misses. In the
system building, we propose software techniques to prefetch memory data prior
to the TSX transaction to defend the physical bus-tapping attacks. Our
evaluation based on real implementation shows that our system achieves superior
performance and lower transaction abort rate than the related work in the
existing literature.Comment: Systex'1
NUMASK: High Performance Scalable Skip List for NUMA
This paper presents NUMASK, a skip list data structure specifically designed to exploit the characteristics of Non-Uniform Memory Access (NUMA) architectures to improve performance. NUMASK deploys an architecture around a concurrent skip list so that all metadata accesses (e.g., traversals of the skip list index levels) read and write memory blocks allocated in the NUMA zone where the thread is executing. To the best of our knowledge, NUMASK is the first NUMA-aware skip list design that goes beyond merely limiting the performance penalties introduced by NUMA, and leverages the NUMA architecture to outperform state-of-the-art concurrent high-performance implementations. We tested NUMASK on a four-socket server. Its performance scales for both read-intensive and write-intensive workloads (tested up to 160 threads). In write-intensive workload, NUMASK shows speedups over competitors in the range of 2x to 16x
- …