GuardNN: Secure DNN Accelerator for Privacy-Preserving Deep Learning
This paper proposes GuardNN, a secure deep neural network (DNN) accelerator,
which provides strong hardware-based protection for user data and model
parameters even in an untrusted environment. GuardNN shows that the
architecture and protection can be customized for a specific application to
provide strong confidentiality and integrity protection with negligible
overhead. The design of the GuardNN instruction set reduces the trusted
computing base (TCB) to just the accelerator and enables confidentiality
protection without the overhead of
integrity protection. GuardNN also introduces a new application-specific memory
protection scheme to minimize the overhead of memory encryption and integrity
verification. The scheme shows that most of the off-chip metadata in today's
state-of-the-art memory protection can be removed by exploiting the known
memory access patterns of a DNN accelerator. GuardNN is implemented as an FPGA
prototype, which demonstrates effective protection with less than 2%
performance overhead for inference over a variety of modern DNN models.
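One way to picture how known access patterns can remove off-chip metadata is a toy counter-style encryption scheme in which the version number for each block is recomputed from a small on-chip write count instead of being stored next to the data. The sketch below is an illustration under stated assumptions, not GuardNN's actual scheme: the class `ToyAccelerator`, the buffer names, the 16-byte block size, and the use of SHA-256 as a pad generator (a stand-in for a hardware AES counter mode) are all hypothetical, and the sketch covers confidentiality only, with no integrity check.

```python
import hashlib

BLOCK_BYTES = 16  # illustrative block size (assumption, not from the paper)


def pad(key, addr, version, length):
    """Derive a pad from (key, address, version).

    SHA-256 is a standard-library stand-in for a block cipher in counter mode.
    """
    seed = key + addr.to_bytes(8, "little") + version.to_bytes(8, "little")
    return hashlib.sha256(seed).digest()[:length]


def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


class ToyAccelerator:
    """Hypothetical accelerator that writes whole buffers in a known order."""

    def __init__(self, key):
        self.key = key
        self.dram = {}      # untrusted off-chip memory: address -> ciphertext only
        self.versions = {}  # trusted on-chip state: one write count per buffer

    def write_buffer(self, name, base, data):
        # Every block of a buffer is rewritten together, so one on-chip
        # counter per buffer replaces a per-block counter stored in DRAM.
        version = self.versions.get(name, 0) + 1
        self.versions[name] = version
        for off in range(0, len(data), BLOCK_BYTES):
            chunk = data[off:off + BLOCK_BYTES]
            addr = base + off
            self.dram[addr] = xor(chunk, pad(self.key, addr, version, len(chunk)))

    def read_buffer(self, name, base, length):
        # The version is recomputed from on-chip state, never fetched from DRAM.
        version = self.versions[name]
        out = bytearray()
        for off in range(0, length, BLOCK_BYTES):
            addr = base + off
            ct = self.dram[addr]
            out += xor(ct, pad(self.key, addr, version, len(ct)))
        return bytes(out)
```

In this toy model the off-chip memory holds only ciphertext; the per-block counters that a general-purpose memory encryption engine would keep in DRAM collapse into one small on-chip integer per buffer, which is possible only because the write pattern is known in advance.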
Domain-Specialized Cache Management for Graph Analytics
Graph analytics power a range of applications in areas as diverse as finance,
networking and business logistics. A common property of graphs used in the
domain of graph analytics is a power-law distribution of vertex connectivity,
wherein a small number of vertices are responsible for a high fraction of all
connections in the graph. These richly connected, hot vertices inherently
exhibit high reuse. However, this work finds that state-of-the-art hardware
cache management schemes struggle to capitalize on that reuse because of the
highly irregular access patterns of graph analytics.
In response, we propose GRASP, domain-specialized cache management at the
last-level cache for graph analytics. GRASP augments existing cache policies to
maximize reuse of hot vertices by protecting them against cache thrashing,
while maintaining sufficient flexibility to capture the reuse of other vertices
as needed. GRASP keeps hardware cost negligible by leveraging lightweight
software support to pinpoint hot vertices, thus eliding the need for
storage-intensive prediction mechanisms employed by state-of-the-art cache
management schemes. On a set of diverse graph-analytic applications with large
high-skew graph datasets, GRASP outperforms prior domain-agnostic schemes on
all datapoints, yielding an average speed-up of 4.2% (max 9.4%) over the
best-performing prior scheme. GRASP remains robust on low-/no-skew datasets,
whereas prior schemes consistently cause a slowdown.
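The idea of protecting software-identified hot vertices against thrashing can be illustrated with a small cache simulation. This is a simplified sketch and not the GRASP policy itself: the function names `hottest` and `simulate`, the fully associative list-based cache, and the specific insertion and promotion rules are assumptions chosen for illustration. Hot vertices are inserted at the most-protected position, while other vertices enter at the eviction end and move up one step on each hit, so they can still earn protection through reuse.

```python
def hottest(degree, k):
    """Return the k highest-degree vertex IDs (stand-in for the software hint)."""
    return set(sorted(degree, key=degree.get, reverse=True)[:k])


def simulate(trace, capacity, hot=None):
    """Count hits for a vertex access trace on a tiny fully associative cache.

    cache[0] is the most protected entry; cache[-1] is the next victim.
    With hot=None every vertex is treated as hot, which reduces to plain LRU.
    """
    cache = []
    hits = 0
    for v in trace:
        is_hot = hot is None or v in hot
        if v in cache:
            hits += 1
            pos = cache.index(v)
            cache.pop(pos)
            # Hot vertices jump to the protected end; others climb one step.
            cache.insert(0 if is_hot else max(pos - 1, 0), v)
        else:
            if len(cache) == capacity:
                cache.pop()  # evict the entry at the victim end
            if is_hot:
                cache.insert(0, v)
            else:
                cache.append(v)  # cold data enters at the victim end
    return hits
```

On a trace in which two hot vertices are each revisited after a burst of never-reused vertices that exceeds the cache capacity, plain LRU evicts the hot vertices before they are reused, whereas the hot-aware variant keeps them resident.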
Exploiting Locality in Graph Analytics through Hardware-Accelerated Traversal Scheduling
Graph processing is increasingly bottlenecked by main memory accesses. On-chip caches are of little help because the irregular structure of graphs causes seemingly random memory references. However, most real-world graphs offer significant potential locality; it is just hard to predict ahead of time. In practice, graphs have well-connected regions where relatively few vertices share edges with many common neighbors. If these vertices were processed together, graph processing would enjoy significant data reuse. Hence, a graph's traversal schedule largely determines its locality. This paper explores online traversal scheduling strategies that exploit the community structure of real-world graphs to improve locality.
Software graph processing frameworks use simple, locality-oblivious scheduling because, on general-purpose cores, the benefits of locality-aware scheduling are outweighed by its overheads. Software frameworks instead rely on offline preprocessing to improve locality. Unfortunately, preprocessing is so expensive that its costs often negate any benefits from improved locality. Recent graph processing accelerators have inherited this design. Our insight is that this misses an opportunity: hardware acceleration allows for more sophisticated, online locality-aware scheduling than can be realized in software, letting systems significantly improve locality without any preprocessing.
To exploit this insight, we present bounded depth-first scheduling (BDFS), a simple online locality-aware scheduling strategy. BDFS restricts each core to explore one small, connected region of the graph at a time, improving locality on graphs with good community structure. We then present HATS, a hardware-accelerated traversal scheduler that adds just 0.4% area and 0.2% power over general-purpose cores.
We evaluate BDFS and HATS on several algorithms using large real-world graphs. On a simulated 16-core system, BDFS reduces main memory accesses by up to 2.4x and by 30% on average. However, BDFS is too expensive in software and degrades performance by 21% on average. HATS eliminates these overheads, allowing BDFS to improve performance by 83% on average (up to 3.1x) over a locality-oblivious software implementation and by 31% on average (up to 2.1x) over specialized prefetchers.
National Science Foundation (Grant CAREER-1452994
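The bounded depth-first idea can be sketched as a function that produces a vertex processing order. This is a minimal single-core illustration under assumptions, not the paper's implementation: the name `bdfs_order`, the adjacency-list input, and the `max_depth` parameter are hypothetical, and the sketch orders vertices only, leaving out edge processing and multi-core work distribution.

```python
def bdfs_order(adj, max_depth):
    """Return a vertex schedule built from depth-bounded DFS explorations.

    adj is a list of neighbor lists indexed by vertex ID. Each exploration
    starts at the lowest-numbered unvisited vertex and follows edges at most
    max_depth hops deep, so it stays inside one small connected region.
    """
    visited = [False] * len(adj)
    order = []

    def explore(v, depth):
        visited[v] = True
        order.append(v)
        if depth == max_depth:
            return  # bound reached: do not descend further from here
        for u in adj[v]:
            if not visited[u]:
                explore(u, depth + 1)

    for root in range(len(adj)):
        if not visited[root]:
            explore(root, 0)
    return order
```

With `max_depth` set to 0 the function degenerates to the plain vertex-ID order, corresponding to a locality-oblivious schedule; a small positive bound groups neighboring vertices together while keeping the recursion stack, and hence the scheduler state, small.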