Search CORE

136 research outputs found

FPSA: A Full System Stack Solution for Reconfigurable ReRAM-based NN Accelerator Architecture

Author: Hu Xing
Ji Yu
Li Shuangchen
Wang Peiqi
Xie Xinfeng
Xie Yuan
Zhang Youhui
Zhang Youyang
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 27/01/2019
Field of study

Neural Network (NN) accelerators with emerging ReRAM (resistive random access memory) technologies have been investigated as one of the promising solutions to address the \textit{memory wall} challenge, due to the unique capability of \textit{processing-in-memory} within ReRAM-crossbar-based processing elements (PEs). However, the high efficiency and high density advantages of ReRAM have not been fully utilized due to the huge communication demands among PEs and the overhead of peripheral circuits. In this paper, we propose a full system stack solution, composed of a reconfigurable architecture design, Field Programmable Synapse Array (FPSA) and its software system including neural synthesizer, temporal-to-spatial mapper, and placement & routing. We highly leverage the software system to make the hardware design compact and efficient. To satisfy the high-performance communication demand, we optimize it with a reconfigurable routing architecture and the placement & routing tool. To improve the computational density, we greatly simplify the PE circuit with the spiking schema and then adopt neural synthesizer to enable the high density computation-resources to support different kinds of NN operations. In addition, we provide spiking memory blocks (SMBs) and configurable logic blocks (CLBs) in hardware and leverage the temporal-to-spatial mapper to utilize them to balance the storage and computation requirements of NN. Owing to the end-to-end software system, we can efficiently deploy existing deep neural networks to FPSA. Evaluations show that, compared to one of state-of-the-art ReRAM-based NN accelerators, PRIME, the computational density of FPSA improves by 31x; for representative NNs, its inference performance can achieve up to 1000x speedup.Comment: Accepted by ASPLOS 201

arXiv.org e-Print Archive

Crossref

Deterministic, Stash-Free Write-Only ORAM

Author: Anderson Ross J.
Andrew
Aviv Adam J
Jia Yaoqi
Peters Timothy
Ren Ling
Stefanov Emil
Stefanov Emil
Toft Tomas
Zahur Samee
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 07/09/2017
Field of study

Write-Only Oblivious RAM (WoORAM) protocols provide privacy by encrypting the contents of data and also hiding the pattern of write operations over that data. WoORAMs provide better privacy than plain encryption and better performance than more general ORAM schemes (which hide both writing and reading access patterns), and the write-oblivious setting has been applied to important applications of cloud storage synchronization and encrypted hidden volumes. In this paper, we introduce an entirely new technique for Write-Only ORAM, called DetWoORAM. Unlike previous solutions, DetWoORAM uses a deterministic, sequential writing pattern without the need for any "stashing" of blocks in local state when writes fail. Our protocol, while conceptually simple, provides substantial improvement over prior solutions, both asymptotically and experimentally. In particular, under typical settings the DetWoORAM writes only 2 blocks (sequentially) to backend memory for each block written to the device, which is optimal. We have implemented our solution using the BUSE (block device in user-space) module and tested DetWoORAM against both an encryption only baseline of dm-crypt and prior, randomized WoORAM solutions, measuring only a 3x-14x slowdown compared to an encryption-only baseline and around 6x-19x speedup compared to prior work

arXiv.org e-Print Archive

Crossref

Cryptology ePrint Archive

Strongly Secure and Efficient Data Shuffle On Hardware Enclaves

Author: Almeida J. B.
Bulck J. V.
Costan V
Holz T.
Ohrimenko O.
Ohrimenko O.
Zheng W.
Publication venue
Publication date: 12/11/2017
Field of study

Mitigating memory-access attacks on the Intel SGX architecture is an important and open research problem. A natural notion of the mitigation is cache-miss obliviousness which requires the cache-misses emitted during an enclave execution are oblivious to sensitive data. This work realizes the cache-miss obliviousness for the computation of data shuffling. The proposed approach is to software-engineer the oblivious algorithm of Melbourne shuffle on the Intel SGX/TSX architecture, where the Transaction Synchronization eXtension (TSX) is (ab)used to detect the occurrence of cache misses. In the system building, we propose software techniques to prefetch memory data prior to the TSX transaction to defend the physical bus-tapping attacks. Our evaluation based on real implementation shows that our system achieves superior performance and lower transaction abort rate than the related work in the existing literature.Comment: Systex'1

arXiv.org e-Print Archive

Crossref

NUMASK: High Performance Scalable Skip List for NUMA

Author: Daly Henry
Hassan Ahmed
Palmieri Roberto
Spear Michael F.
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 32nd International Symposium on Distributed Computing (DISC 2018)
Publication date: 01/01/2018
Field of study

This paper presents NUMASK, a skip list data structure specifically designed to exploit the characteristics of Non-Uniform Memory Access (NUMA) architectures to improve performance. NUMASK deploys an architecture around a concurrent skip list so that all metadata accesses (e.g., traversals of the skip list index levels) read and write memory blocks allocated in the NUMA zone where the thread is executing. To the best of our knowledge, NUMASK is the first NUMA-aware skip list design that goes beyond merely limiting the performance penalties introduced by NUMA, and leverages the NUMA architecture to outperform state-of-the-art concurrent high-performance implementations. We tested NUMASK on a four-socket server. Its performance scales for both read-intensive and write-intensive workloads (tested up to 160 threads). In write-intensive workload, NUMASK shows speedups over competitors in the range of 2x to 16x

Dagstuhl Research Online Publication Server