Mapping the Intel Last-Level Cache
Modern Intel processors use an undisclosed hash function to map memory lines into last-level cache slices. In this work we develop a technique for reverse-engineering the hash function. We apply the technique to a 6-core Intel processor and demonstrate that knowledge of this hash function can facilitate cache-based side channel attacks, reducing the amount of work required for profiling the cache by three orders of magnitude. We also show how, using the hash function, we can double the number of colours used for page-colouring techniques.
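To make the idea of a slice-selection hash concrete, the following is a minimal sketch that assumes, purely for illustration, that each bit of the slice index is the parity (XOR) of a masked subset of physical-address bits. The masks, the 64-byte line size, and the names `parity`, `slice_index`, and `EXAMPLE_MASKS` are hypothetical and are not taken from the paper or from any real processor. A parity-based scheme like this only yields a power-of-two number of slices, so it would not directly describe a 6-slice part.

```python
# Hypothetical masks, invented for illustration only. They are NOT the
# masks of any real Intel processor. Bits 0-5 (the offset inside an
# assumed 64-byte line) are deliberately left out of both masks.
EXAMPLE_MASKS = [0x15540, 0x2AA80]


def parity(x: int) -> int:
    """Return the XOR of all bits of x (0 or 1)."""
    return bin(x).count("1") & 1


def slice_index(phys_addr: int, masks=EXAMPLE_MASKS) -> int:
    """Map a physical address to a slice number.

    Bit i of the result is the parity of (phys_addr & masks[i]),
    so len(masks) masks select one of 2**len(masks) slices.
    """
    index = 0
    for i, mask in enumerate(masks):
        index |= parity(phys_addr & mask) << i
    return index


if __name__ == "__main__":
    for addr in (0x0, 0x40, 0x80, 0xC0, 0x140):
        print(hex(addr), "-> slice", slice_index(addr))
```

Because the masks ignore the line-offset bits, every byte of a line maps to the same slice in this sketch.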
PolyDL: Polyhedral Optimizations for Creation of High Performance DL primitives
Deep Neural Networks (DNNs) have revolutionized many aspects of our lives.
The use of DNNs is becoming ubiquitous, including in software for image
recognition, speech recognition, speech synthesis, and language translation, to
name a few. The training of DNN architectures, however, is computationally
expensive. Once the model is created, its use in the intended application (the
inference task) is computationally heavy too, and the inference needs to be fast
for real time use. To obtain high performance today, the norm is to use Deep
Learning (DL) primitives that expert programmers have optimized for specific
architectures and exposed via libraries. However, given the constant
emergence of new DNN architectures, creating hand optimized code is expensive,
slow, and not scalable.
To address this performance-productivity challenge, in this paper we present
compiler algorithms to automatically generate high performance implementations
of DL primitives that closely match the performance of hand optimized
libraries. We develop novel data reuse analysis algorithms using the polyhedral
model to derive efficient execution schedules automatically. In addition,
because most DL primitives use some variant of matrix multiplication at their
core, we develop a flexible framework where it is possible to plug in library
implementations of the same in lieu of a subset of the loops. We show that such
a hybrid compiler plus a minimal library-use approach results in
state-of-the-art performance. We develop compiler algorithms to also perform
operator fusions that reduce data movement through the memory hierarchy of the
computer system.
Comment: arXiv admin note: substantial text overlap with arXiv:2002.0214
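As a rough illustration of the kind of data reuse such schedules target, the sketch below shows a hand-written, tiled matrix multiplication. It is not output of the PolyDL compiler and does not use the polyhedral model; the function names `matmul_naive` and `matmul_tiled` and the tile size are assumptions made for this example. Tiling reorders the loops so that a small block of each matrix is reused several times before the computation moves on.

```python
def matmul_naive(A, B, n):
    """Reference n x n matrix product with the textbook i-j-k loop order."""
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C


def matmul_tiled(A, B, n, tile=2):
    """Same product, with all three loops split into tiles.

    The three outer loops walk over tile origins; the three inner loops
    stay inside one tile, so the touched blocks of A, B, and C are
    reused while they are still likely to be cached.
    """
    C = [[0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        a = A[i][k]  # reused across the whole j loop
                        for j in range(jj, min(jj + tile, n)):
                            C[i][j] += a * B[k][j]
    return C
```

Both functions compute the same result; only the order of memory accesses differs. In pure Python the cache effect is not observable, so this is a sketch of the loop structure rather than a benchmark.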
Warping Cache Simulation of Polyhedral Programs
Techniques to evaluate a program’s cache performance fall
into two camps: 1. Traditional trace-based cache simulators
precisely account for sophisticated real-world cache models
and support arbitrary workloads, but their runtime is proportional to the number of memory accesses performed by
the program under analysis. 2. Relying on implicit workload
characterizations such as the polyhedral model, analytical approaches often achieve problem-size-independent runtimes,
but so far have been limited to idealized cache models.
We introduce a hybrid approach, warping cache simulation, that aims to achieve applicability to real-world cache
models and problem-size-independent runtimes. Like prior
analytical approaches, we focus on programs in the polyhedral model, which allows us to reason about the sequence
of memory accesses analytically. Combining this analytical
reasoning with information about the cache behavior obtained from explicit cache simulation allows us to soundly
fast-forward the simulation. By this process of warping, we
accelerate the simulation so that its cost is often independent
of the number of memory accesses.
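For contrast with the warping approach, here is a minimal sketch of the traditional trace-based camp: a set-associative cache with LRU replacement that processes one access at a time, so its runtime grows linearly with the trace length. The cache parameters and the names `simulate_lru` and `row_major_trace` are assumptions chosen for illustration; this is not the simulator from the paper and it performs no warping.

```python
from collections import OrderedDict


def simulate_lru(trace, num_sets=4, ways=2, line_size=64):
    """Count (hits, misses) for a list of byte addresses.

    Each set is an OrderedDict of tags ordered from least to most
    recently used. Every access is handled individually, so the cost
    is proportional to len(trace).
    """
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = misses = 0
    for addr in trace:
        line = addr // line_size
        cache_set = sets[line % num_sets]
        tag = line // num_sets
        if tag in cache_set:
            cache_set.move_to_end(tag)  # mark as most recently used
            hits += 1
        else:
            misses += 1
            if len(cache_set) >= ways:
                cache_set.popitem(last=False)  # evict least recently used
            cache_set[tag] = True
    return hits, misses


def row_major_trace(n, elem_size=8, base=0):
    """Addresses touched by a row-major sweep over an n x n array."""
    return [base + (i * n + j) * elem_size
            for i in range(n) for j in range(n)]


if __name__ == "__main__":
    print(simulate_lru(row_major_trace(8)))
```

A regular loop nest such as the sweep in `row_major_trace` produces a highly repetitive access pattern, which is the kind of structure an analytical or hybrid method can exploit instead of replaying every access.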
Packet Chasing: Spying on Network Packets over a Cache Side-Channel
This paper presents Packet Chasing, an attack on the network that does not
require access to the network, and works regardless of the privilege level of
the process receiving the packets. A spy process can easily probe and discover
the exact cache location of each buffer used by the network driver. Even more
useful, it can discover the exact sequence in which those buffers are used to
receive packets. This then enables packet frequency and packet sizes to be
monitored through cache side channels. This allows both covert channels between
a sender and a remote spy with no access to the network, as well as direct
attacks that can identify, among other things, the web page access patterns of
a victim on the network. In addition to identifying the potential attack, this
work proposes a software-based short-term mitigation as well as a light-weight,
adaptive, cache partitioning mitigation that blocks the interference of I/O and
CPU requests in the last-level cache.
Cache Attacks Enable Bulk Key Recovery on the Cloud
Cloud services keep gaining popularity despite the security concerns. While non-sensitive data is easily entrusted to the cloud, security-critical data and applications are not. The main concern with the cloud is that shared resources such as the CPU, memory, and even the network adapter provide subtle side channels to malicious parties. We argue that these side channels indeed leak fine-grained, sensitive information and enable key recovery attacks on the cloud. Even further, as a quick scan in one of the Amazon EC2 regions shows, a high percentage (55%) of users run outdated, leakage-prone libraries, leaving them vulnerable to mass surveillance.
The most commonly exploited leakage in shared-resource systems stems from the cache and the memory. The high resolution and stability of these channels allow the attacker to extract fine-grained information. In this work, we employ the Prime and Probe attack to retrieve an RSA secret key from a co-located instance. To speed up the attack, we reverse engineer the cache slice selection algorithm for the Intel Xeon E5-2670 v2 that is used in our cloud instances. Finally, we employ noise reduction to deduce the RSA private key from the monitored traces. By processing the noisy data we obtain the complete 2048-bit RSA key used during the decryption.
Cache-Based Application Detection in the Cloud Using Machine Learning
Cross-VM attacks have emerged as a major threat on commercial clouds. These attacks commonly exploit hardware-level leakages on shared physical servers. A co-located machine can readily feel the presence of a co-located instance with a heavy computational load through performance degradation due to contention on shared resources. Shared cache architectures such as the last level cache (LLC) have become a popular leakage source for mounting cross-VM attacks. By exploiting LLC leakages, researchers have already shown that it is possible to recover fine-grained information such as cryptographic keys from popular software libraries. This makes it essential to verify implementations that handle sensitive data across the many versions and numerous target platforms, a task too complicated, error-prone, and costly to be handled by human beings.
Here we propose a machine-learning-based technique to classify applications according to their cache access profiles. We show that, with minimal and simple manual processing steps, feature vectors can be used to train models using support vector machines to classify the applications with a high degree of success. The profiling and training steps are completely automated and do not require any inspection or study of the code to be classified. In native execution, we achieve a successful classification rate as high as 98% (L1 cache) and 78% (LLC) over 40 benchmark applications in the Phoronix suite with mild training. In the cross-VM setting on the noisy Amazon EC2 the success rate drops to 60% for a suite of 25 applications. With this initial study we demonstrate that it is possible to train meaningful models to successfully predict applications running in co-located instances.
nanoBench: A Low-Overhead Tool for Running Microbenchmarks on x86 Systems
We present nanoBench, a tool for evaluating small microbenchmarks using
hardware performance counters on Intel and AMD x86 systems. Most existing tools
and libraries are intended to either benchmark entire programs, or program
segments in the context of their execution within a larger program. In
contrast, nanoBench is specifically designed to evaluate small, isolated pieces
of code. Such code is common in microbenchmark-based hardware analysis
techniques.
Unlike previous tools, nanoBench can execute microbenchmarks directly in
kernel space. This allows benchmarking privileged instructions, and it enables
more accurate measurements. The reading of the performance counters is
implemented with minimal overhead, avoiding function calls and branches. As a
consequence, nanoBench is precise enough to measure individual memory accesses.
We illustrate the utility of nanoBench with two case studies.
First, we briefly discuss how nanoBench has been used to determine the latency,
throughput, and port usage of more than 13,000 instruction variants on recent
x86 processors. Second, we show how to generate microbenchmarks to precisely
characterize the cache architectures of eleven Intel Core microarchitectures.
This includes the most comprehensive analysis of the employed cache replacement
policies to date.