19 research outputs found
FPSA: A Full System Stack Solution for Reconfigurable ReRAM-based NN Accelerator Architecture
Neural Network (NN) accelerators with emerging ReRAM (resistive random access
memory) technologies have been investigated as one of the promising solutions
to address the "memory wall" challenge, due to the unique capability of
processing-in-memory within ReRAM-crossbar-based processing elements
(PEs). However, the high efficiency and high density advantages of ReRAM have
not been fully utilized due to the huge communication demands among PEs and the
overhead of peripheral circuits.
In this paper, we propose a full system stack solution, composed of a
reconfigurable architecture design, Field Programmable Synapse Array (FPSA) and
its software system including neural synthesizer, temporal-to-spatial mapper,
and placement & routing. We rely heavily on the software system to keep the
hardware design compact and efficient. To satisfy the high-performance
communication demand, we optimize the architecture with reconfigurable routing
and the placement & routing tool. To improve computational density, we greatly
simplify the PE circuit with a spiking scheme and then use the neural
synthesizer so that the resulting high-density compute resources can support
different kinds of NN operations. In addition, we provide spiking memory blocks
(SMBs) and configurable logic blocks (CLBs) in hardware and use the
temporal-to-spatial mapper to balance the storage and computation requirements
of NNs. Owing to the end-to-end software system, we can
efficiently deploy existing deep neural networks to FPSA. Evaluations show
that, compared to PRIME, a state-of-the-art ReRAM-based NN accelerator, the
computational density of FPSA improves by 31x; for representative NNs, its
inference performance achieves up to 1000x speedup.
Comment: Accepted by ASPLOS 201
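The processing-in-memory idea behind ReRAM crossbars can be illustrated with a small model: weights sit in the array as conductances, and applying input voltages yields the columns of a matrix-vector product in one analog step. The sketch below is a hedged pure-Python model with illustrative values, not FPSA's actual circuit.

```python
# Toy model of one ReRAM-crossbar PE. Weights are stored as conductances G;
# applying input voltages V produces per-column currents
# I[j] = sum_i V[i] * G[i][j] (Ohm's law plus Kirchhoff's current law).
def crossbar_mvm(voltages, conductances):
    """Column currents of a crossbar: an analog matrix-vector product."""
    cols = len(conductances[0])
    return [sum(v * row[j] for v, row in zip(voltages, conductances))
            for j in range(cols)]

V = [1.0, 0.5, 0.0]        # input activations applied as row voltages
G = [[0.2, 0.4],           # weight matrix stored as cell conductances
     [0.6, 0.1],
     [0.3, 0.9]]
I = crossbar_mvm(V, G)     # approximately [0.5, 0.45]
```

Because every cell multiplies and every column sums in parallel, one crossbar read performs the whole matrix-vector product, which is the density advantage the abstract refers to.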
A System-Level Dynamic Binary Translator using Automatically-Learned Translation Rules
System-level emulators have been used extensively for system design,
debugging and evaluation. They work by providing a system-level virtual machine
to support a guest operating system (OS) on a host platform whose native OS
and instruction-set architecture may be the same as or differ from the
guest's. For such system-level emulation, dynamic binary translation (DBT)
is one of the core technologies. A recently proposed learning-based DBT
approach has shown significantly improved performance and higher-quality
translated code by using automatically learned translation rules. However, it
has only been applied to user-level emulation, and not yet to system-level
emulation. In this paper, we explore the feasibility of applying this approach
to improve system-level emulation, and use QEMU to build a prototype. ... To
achieve better performance, we apply several optimizations: coordination
overhead reduction to cut the cost of each coordination, and coordination
elimination and code scheduling to reduce the coordination frequency.
Experimental results show an average 1.36x speedup over QEMU 6.1 with
negligible coordination overhead in system-emulation mode on the SPEC CINT2006
benchmarks, and 1.15x on real-world applications.
Comment: 10 pages, 19 figures, to be published in the International Symposium
on Code Generation and Optimization (CGO) 202
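The core of rule-based DBT is pairing a guest instruction pattern with an equivalent host instruction template. The sketch below is a hedged illustration of that idea only; the rule table, patterns, and mnemonics are invented for the example and are not QEMU's or the paper's actual representation.

```python
# Hypothetical learned-rule table: a guest pattern maps to a list of host
# instruction templates. Real learned rules are mined automatically from
# matched guest/host code; this table is hand-written for illustration.
RULES = {
    ("add", "reg", "imm"): ["mov {d}, {s}", "add {d}, #{imm}"],
}

def translate(guest_op, dst, src, imm):
    """Translate one guest instruction via a matching learned rule."""
    tmpl = RULES.get((guest_op, "reg", "imm"))
    if tmpl is None:
        return None  # no rule matched: fall back to the default translator
    return [t.format(d=dst, s=src, imm=imm) for t in tmpl]

host_code = translate("add", "x1", "x2", 4)
```

When a rule matches, the emitted host code avoids the generic intermediate representation, which is where the quality and speed gains come from; unmatched instructions still go through the emulator's normal path.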
Manticore: Hardware-Accelerated RTL Simulation with Static Bulk-Synchronous Parallelism
The demise of Moore's Law and Dennard Scaling has revived interest in
specialized computer architectures and accelerators. Verification and testing
of this hardware heavily uses cycle-accurate simulation of
register-transfer-level (RTL) designs. The best software RTL simulators can
simulate designs at 1-1000 kHz, i.e., more than three orders of magnitude
slower than hardware. Faster simulation can increase productivity by speeding
design iterations and permitting more exhaustive exploration.
One possibility is to use parallelism as RTL exposes considerable fine-grain
concurrency. However, state-of-the-art RTL simulators generally perform best
when single-threaded since modern processors cannot effectively exploit
fine-grain parallelism.
This work presents Manticore: a parallel computer designed to accelerate RTL
simulation. Manticore uses a static bulk-synchronous parallel (BSP) execution
model to eliminate runtime synchronization barriers among many simple
processors. Manticore relies entirely on its compiler to schedule resources and
communication. Because RTL code is practically free of long divergent execution
paths, static scheduling is feasible. Communication and synchronization no
longer incur runtime overhead, enabling efficient fine-grain parallelism.
Moreover, static scheduling dramatically simplifies the physical
implementation, significantly increasing the potential parallelism on a chip.
Our 225-core FPGA prototype running at 475 MHz outperforms a state-of-the-art
RTL simulator on an Intel Xeon processor running at 3.3 GHz by up to
27.9x (geomean 5.3x) on nine Verilog benchmarks.
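Static BSP scheduling is possible because, within a clock cycle, RTL is a directed acyclic dependency graph: the compiler can assign every node a superstep equal to its depth, so all communication crosses superstep boundaries at compile-time-known points. The levelization sketch below is a hedged toy, assuming a simplified netlist with one node per operation; it is not Manticore's actual compiler.

```python
from collections import defaultdict

# Assign each node a BSP superstep equal to its depth in the combinational
# dependency DAG. Nodes in the same superstep can run in parallel with no
# runtime synchronization; barriers fall only between supersteps.
def schedule_supersteps(deps):
    """deps maps every node -> list of nodes it depends on (a DAG)."""
    memo = {}
    def level(n):
        if n not in memo:
            memo[n] = 1 + max((level(p) for p in deps.get(n, [])), default=-1)
        return memo[n]
    steps = defaultdict(list)
    for n in deps:
        steps[level(n)].append(n)
    return dict(steps)

plan = schedule_supersteps({"a": [], "b": ["a"], "c": ["a", "b"]})
```

Because the schedule is fixed at compile time, the hardware needs no runtime arbitration or barriers, which is what lets many simple cores exploit the fine-grain parallelism the abstract describes.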
Cross-Inlining Binary Function Similarity Detection
Binary function similarity detection plays an important role in a wide range
of security applications. Existing works usually assume that the query function
and target function share equal semantics and compare their full semantics to
obtain the similarity. However, we find that the function mapping is more
complex, especially when function inlining happens.
In this paper, we will systematically investigate cross-inlining binary
function similarity detection. We first construct a cross-inlining dataset by
compiling 51 projects using 9 compilers, with 4 optimizations, to 6
architectures, with 2 inlining flags, which results in two datasets both with
216 combinations. Then we construct the cross-inlining function mappings by
linking the common source functions in these two datasets. Through analysis of
this dataset, we find that three cross-inlining patterns widely exist while
existing work suffers when detecting cross-inlining binary function similarity.
Next, we propose a pattern-based model named CI-Detector for cross-inlining
matching. CI-Detector uses attributed CFGs to represent the semantics of
binary functions and a GNN to embed them into vectors. CI-Detector trains a
separate model for each of the three cross-inlining patterns. Finally, test
pairs are fed to all three models, and the resulting similarities are
aggregated into the final similarity. We conduct several
experiments to evaluate CI-Detector. Results show that CI-Detector can detect
cross-inlining pairs with a precision of 81% and a recall of 97%, which exceeds
all state-of-the-art works.
Comment: Accepted at ICSE 2024 (Second Cycle). Camera-ready version
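The final aggregation step can be pictured as combining the three pattern-specific scores so that a pair counts as similar if any inlining pattern explains it. Taking the maximum, as sketched below, is an assumption made for illustration; the paper's exact aggregation rule may differ.

```python
# Hedged sketch: each pattern-specific model produced one similarity score
# for the same query/target pair; max-aggregation accepts the pair if any
# of the three cross-inlining patterns matches well.
def aggregate(similarities):
    """Combine per-pattern similarities into a final similarity."""
    return max(similarities)

score = aggregate([0.31, 0.87, 0.12])  # second pattern's model fits best
```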
Recovering Container Class Types in C++ Binaries
We present TIARA, a novel approach to recovering container classes in C++ binaries. Given a variable address in a C++ binary, TIARA first applies TSLICE, a new type-relevant slicing algorithm incorporating a decay function, to obtain an inter-procedural forward slice of instructions, expressed as a CFG, that summarizes how the variable is used in the binary (our primary contribution). TIARA then uses a GCN (graph convolutional network) to learn and predict the container type of the variable (our secondary contribution). In our evaluation, TIARA advances the state of the art in inferring commonly used container types across a set of eight large real-world COTS C++ binaries, both efficiently (in overall analysis time) and effectively (in precision, recall, and F1 score).
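A forward slice with a decay function can be sketched as a weighted graph traversal: follow CFG successors from the instruction that uses the variable, shrink a relevance weight by a decay factor per hop, and stop once it falls below a threshold. The decay factor, threshold, and graph shape below are illustrative assumptions, not TIARA's actual parameters or TSLICE's full algorithm.

```python
from collections import deque

# Hedged sketch of a decayed forward slice over a CFG of instructions.
def forward_slice(cfg, start, decay=0.8, threshold=0.3):
    """cfg maps instruction -> list of successor instructions.
    Returns the sliced instructions with their relevance weights."""
    kept, queue = {start: 1.0}, deque([(start, 1.0)])
    while queue:
        node, w = queue.popleft()
        nw = w * decay               # relevance fades with distance
        if nw < threshold:
            continue                 # too far from the use site: stop here
        for succ in cfg.get(node, []):
            if kept.get(succ, 0.0) < nw:
                kept[succ] = nw
                queue.append((succ, nw))
    return kept

weights = forward_slice({"i0": ["i1"], "i1": ["i2"], "i2": []}, "i0")
```

The decay keeps the slice focused on instructions near the variable's uses, which bounds the CFG handed to the GCN without discarding the type-revealing operations close to the use site.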
iDriving: Toward Safe and Efficient Infrastructure-directed Autonomous Driving
Autonomous driving will become pervasive in the coming decades. iDriving
improves the safety of autonomous driving at intersections and increases
efficiency by improving traffic throughput. In iDriving,
roadside infrastructure remotely drives an autonomous vehicle at an
intersection by offloading perception and planning from the vehicle to roadside
infrastructure. To achieve this, iDriving must be able to process voluminous
sensor data at full frame rate with a tail latency of less than 100 ms, without
sacrificing accuracy. We describe algorithms and optimizations that enable it
to achieve this goal using an accurate and lightweight perception component
that reasons on composite views derived from overlapping sensors, and a planner
that jointly plans trajectories for multiple vehicles. In our evaluations,
iDriving always ensures safe passage of vehicles, while autonomous driving can
only do so 27% of the time. iDriving also yields 5x lower wait times than
other approaches because it enables traffic-light-free intersections.
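A traffic-light-free intersection can be modeled as the infrastructure handing out non-overlapping time slots for the conflict zone. The greedy reservation sketch below is an illustrative simplification, not iDriving's actual joint trajectory planner, and its single shared conflict zone is an assumption for the example.

```python
# Hedged sketch: the roadside planner reserves the intersection's conflict
# zone for one vehicle at a time, in order of earliest arrival, so no
# vehicle ever needs a traffic light.
def reserve_slots(arrivals, crossing_time):
    """arrivals: list of (vehicle, earliest_arrival_s).
    Returns each vehicle's assigned entry time into the conflict zone."""
    schedule, free_at = {}, 0.0
    for vehicle, eta in sorted(arrivals, key=lambda a: a[1]):
        entry = max(eta, free_at)        # wait only if the zone is busy
        schedule[vehicle] = entry
        free_at = entry + crossing_time  # zone occupied while crossing
    return schedule

plan = reserve_slots([("a", 0.0), ("b", 1.0)], crossing_time=2.0)
```

Vehicles only slow when their slot collides with another reservation, rather than stopping for a fixed light phase, which is where the wait-time reduction comes from.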
QuanShield: Protecting against Side-Channel Attacks using Self-Destructing Enclaves
Trusted Execution Environments (TEEs) allow user processes to create enclaves
that protect security-sensitive computation against access from the OS kernel
and the hypervisor. Recent work has shown that TEEs are vulnerable to
side-channel attacks that allow an adversary to learn secrets shielded in
enclaves. The majority of such attacks trigger exceptions or interrupts to
trace the control or data flow of enclave execution.
We propose QuanShield, a system that protects enclaves from side-channel
attacks that interrupt enclave execution. The main idea behind QuanShield is to
strengthen resource isolation by running enclaves in an interrupt-free
environment on a dedicated CPU core, where an enclave terminates if an
interrupt occurs. QuanShield avoids interrupts by exploiting the tickless
scheduling mode supported by recent OS kernels. QuanShield then uses the save
area (SA) of the enclave, which is used by the hardware to support interrupt
handling, as a second stack. Through an LLVM-based compiler pass, QuanShield
modifies enclave instructions to store/load memory references, such as function
frame base addresses, to/from the SA. When an interrupt occurs, the hardware
overwrites the data in the SA with CPU state, thus ensuring that enclave
execution fails. Our evaluation shows that QuanShield significantly raises the
bar for interrupt-based attacks with practical overhead.
Comment: 15 pages, 5 figures, 5 tables
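The fail-stop mechanism can be pictured as follows: the enclave stashes live values (such as frame base addresses) in its save area, and because the hardware overwrites that area with CPU state on any interrupt, a later load returns corrupted data and the enclave fails. The Python model below is a hedged illustration of this behavior only; the class and values are invented for the sketch and do not reflect SGX's actual save-area layout.

```python
# Toy model of QuanShield's second-stack idea: the save area (SA) doubles
# as storage for enclave values, and an interrupt clobbers it.
class SaveArea:
    def __init__(self):
        self.slots = {}

    def store(self, key, value):
        self.slots[key] = value

    def load(self, key):
        return self.slots[key]

    def interrupt(self):
        # On an interrupt, the hardware dumps CPU state into the SA,
        # destroying whatever the enclave had stored there.
        self.slots = {k: "CLOBBERED" for k in self.slots}

def run_enclave(sa, interrupted):
    sa.store("frame_base", 0x7F00)   # spill a live value into the SA
    if interrupted:
        sa.interrupt()               # side-channel attempt trips this
    value = sa.load("frame_base")
    if value == "CLOBBERED":
        raise RuntimeError("enclave terminated: SA corrupted by interrupt")
    return value
```

The security argument is that any interrupt-driven probe destroys the enclave's own working state, so execution cannot continue long enough to leak secrets through repeated interruption.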