DRAMA: Exploiting DRAM Addressing for Cross-CPU Attacks
In cloud computing environments, multiple tenants are often co-located on the
same multi-processor system. Thus, preventing information leakage between
tenants is crucial. While the hypervisor enforces software isolation, shared
hardware, such as the CPU cache or memory bus, can leak sensitive information.
For security reasons, shared memory between tenants is typically disabled.
Furthermore, tenants often do not share a physical CPU. In this setting, cache
attacks do not work and only a slow cross-CPU covert channel over the memory
bus is known. In contrast, we demonstrate a high-speed covert channel as well
as the first side-channel attack working across processors and without any
shared memory. To build these attacks, we use the undocumented DRAM address
mappings.
We present two methods to reverse engineer the mapping of memory addresses to
DRAM channels, ranks, and banks. One uses physical probing of the memory bus,
the other runs entirely in software and is fully automated. Using this mapping,
we introduce DRAMA attacks, a novel class of attacks that exploit the DRAM row
buffer that is shared, even in multi-processor systems. Thus, our attacks work
in the most restrictive environments. First, we build a covert channel with a
capacity of up to 2 Mbps, which is three to four orders of magnitude faster
than memory-bus-based channels. Second, we build a side-channel template attack
that can automatically locate and monitor memory accesses. Third, we show how
using the DRAM mappings improves existing attacks and in particular enables
practical Rowhammer attacks on DDR4.
Comment: Original publication in the Proceedings of the 25th Annual USENIX
Security Symposium (USENIX Security 2016).
https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/pess
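The software-based reverse-engineering step can be illustrated with a toy model: assuming the bank-address functions are XORs of physical-address bits (as the paper finds for Intel CPUs), address pairs observed to share a bank via row-conflict timing constrain which XOR masks belong to the mapping. A minimal sketch with invented masks and simulated conflict observations (the bit positions are illustrative, not real hardware values):

```python
from random import Random

def parity(x):
    return bin(x).count("1") & 1

# Hypothetical bank-address functions: XORs of physical-address bits.
# The bit positions are invented; DRAMA recovers the real ones per system.
TRUE_MASKS = [(1 << 13) | (1 << 17),
              (1 << 14) | (1 << 18),
              (1 << 15) | (1 << 19)]

def bank(addr):
    return tuple(parity(addr & m) for m in TRUE_MASKS)

# Simulate the measurement phase: addresses whose alternating accesses show
# row-buffer conflicts are grouped into the same bank.
rng = Random(0)
by_bank = {}
for _ in range(4000):
    a = rng.randrange(1 << 21)
    by_bank.setdefault(bank(a), []).append(a)
pairs = [(g[i], g[i + 1]) for g in by_bank.values() for i in range(len(g) - 1)]

# Recovery phase: keep every 2-bit XOR mask on which all same-bank pairs agree.
candidates = [(1 << i) | (1 << j) for i in range(12, 21) for j in range(i + 1, 21)]
recovered = [m for m in candidates
             if all(parity(a & m) == parity(b & m) for a, b in pairs)]
```

With enough observed pairs, only masks in the linear span of the true functions survive, which is how the fully automated software method pins down the mapping without physical probing.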
The 1 Teraflops QCDSP computer
The QCDSP computer (Quantum Chromodynamics on Digital Signal Processors) is
an inexpensive, massively parallel computer intended primarily for simulations
in lattice gauge theory. Currently, two large QCDSP machines are in full-time
use: an 8,192 processor, 0.4 Teraflops machine at Columbia University and a
12,288 processor, 0.6 Teraflops machine at the RIKEN-BNL Research Center at
Brookhaven National Laboratory. We describe the design process, architecture,
software and current physics projects of these computers.
Comment: 19 pages, 3 figures
Memory-based Combination PUFs for Device Authentication in Embedded Systems
Embedded systems play a crucial role in fueling the growth of the
Internet-of-Things (IoT) in application domains such as healthcare, home
automation, transportation, etc. However, their increasingly network-connected
nature, coupled with their ability to access potentially sensitive/confidential
information, has given rise to many security and privacy concerns. An
additional challenge is the growing number of counterfeit components in these
devices, resulting in serious reliability and financial implications.
Physically Unclonable Functions (PUFs) are a promising security primitive to
help address these concerns. Memory-based PUFs are particularly attractive as
they require minimal or no additional hardware for their operation. However,
current memory-based PUFs utilize only a single memory technology for
constructing the PUF, which has several disadvantages including making them
vulnerable to security attacks. In this paper, we propose the design of a new
memory-based combination PUF that intelligently combines two memory
technologies, SRAM and DRAM, to overcome these shortcomings. The proposed
combination PUF exhibits high entropy, supports a large number of
challenge-response pairs, and is intrinsically reconfigurable. We have
implemented the proposed combination PUF using a Terasic TR4-230 FPGA board and
several off-the-shelf SRAMs and DRAMs. Experimental results demonstrate
substantial improvements over current memory-based PUFs including the ability
to resist various attacks. Extensive authentication tests across a wide
temperature range (20 - 60 deg. Celsius) and accelerated aging (12 months)
demonstrate the robustness of the proposed design, which achieves a 100%
true-positive rate and 0% false-positive rate for authentication across these
parameter ranges.
Comment: 7 pages, 10 figures
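The authentication flow common to memory-based PUFs can be sketched independently of the specific SRAM/DRAM combination: the verifier enrolls a reference response and later accepts a device whose re-measured response lies within a Hamming-distance threshold that tolerates noisy bits. A toy model (the threshold, response length, and noise rate are assumptions, not the paper's measured figures):

```python
from random import Random

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def noisy_readout(ref, flip_prob, rng):
    # re-measured PUF response: each bit flips with probability flip_prob
    return [b ^ (rng.random() < flip_prob) for b in ref]

def authenticate(enrolled, response, threshold=0.10):
    # accept when the fractional Hamming distance is below the threshold
    return hamming(enrolled, response) / len(enrolled) <= threshold

rng = Random(1)
device = [rng.randrange(2) for _ in range(256)]    # enrolled reference
imposter = [rng.randrange(2) for _ in range(256)]  # a different device

genuine_ok = authenticate(device, noisy_readout(device, 0.02, rng))
imposter_ok = authenticate(device, imposter)
```

The gap this scheme relies on is statistical: a genuine re-measurement differs in a few percent of bits, while responses from distinct devices differ in roughly half, so a threshold between the two regimes yields the reported 100% true-positive and 0% false-positive rates.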
Towards Implementation of Robust and Low-Cost Security Primitives for Resource-Constrained IoT Devices
In recent years, due to the trend in globalization, system integrators have
had to deal with integrated circuit (IC)/intellectual property (IP)
counterfeiting more than ever. These counterfeit hardware issues have driven
the need for more secure chip authentication. High-entropy random numbers
from physical sources are a critical component in authentication and
encryption processes within secure systems [6]. Secure encryption depends on
sources of truly random numbers for generating keys, and there is a need for
an on-chip random number generator to achieve adequate security. Furthermore,
the Internet of Things (IoT) adopts a large number of these hardware-based
security and prevention solutions in order to securely exchange data in a
resource-efficient manner. In this work, we have developed
several methodologies of hardware-based random functions in order to address
the issues and enhance the security and trust of ICs: a novel DRAM-based
intrinsic Physical Unclonable Function (PUF) [13] for system level security and
authentication along with analysis of the impact of various environmental
conditions, particularly silicon aging; a DRAM remanence based True Random
Number Generation (TRNG) to produce random sequences with a very low overhead;
a DRAM TRNG model using its startup value behavior for creating random bit
streams; an efficient power supply noise based TRNG model for generating an
infinite number of random bits which has been evaluated as a cost effective
technique; architectures and hardware security solutions for the Internet of
Things (IoT) environment. Since IoT devices are heavily resource constrained,
our proposed designs can alleviate the concerns of establishing
trustworthiness and security in an efficient and low-cost manner.
Comment: 7 pages, 6 figures, 1 table
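Raw DRAM remanence or startup bits are typically biased, so TRNG designs post-process them before use. One classic, easily verified post-processing stage is von Neumann debiasing, which maps the bit pairs 01 to 0 and 10 to 1 and discards 00 and 11; the paper's exact conditioning may differ, so this is only an illustrative step:

```python
def von_neumann(bits):
    """Map pairs 01 -> 0 and 10 -> 1; discard 00 and 11.
    Output is unbiased whenever bits are independent, even if biased."""
    out = []
    for i in range(0, len(bits) - 1, 2):
        a, b = bits[i], bits[i + 1]
        if a != b:
            out.append(a)
    return out

# a biased raw stream still yields unbiased output bits (at reduced rate)
debiased = von_neumann([0, 1, 1, 0, 0, 0, 1, 1])
```

The cost is throughput: on average only one output bit is produced per four input bits, which is one reason low-overhead raw sources such as DRAM remanence are attractive.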
Amber: Enabling Precise Full-System Simulation with Detailed Modeling of All SSD Resources
SSDs have become a major storage component in modern memory hierarchies, and
SSD research demands exploring future simulation-based studies by integrating
SSD subsystems into a full-system environment. However, several challenges
exist in modeling SSDs under full-system simulation: SSDs comprise their own
complete system and architecture, which employs all necessary hardware, such
as CPUs, DRAM and an interconnect network. On top of these hardware
components, SSDs also need multiple device controllers, internal caches and
software modules that respect a wide spectrum of storage interfaces and
protocols. All of this SSD hardware and software is necessary to incarnate
storage subsystems in a full-system environment that can operate in parallel
with the host system. In this work, we introduce a new SSD simulation
framework, SimpleSSD 2.0, namely Amber, that models embedded CPU cores, DRAMs,
and various flash technologies (within an SSD), and operates under a
full-system simulation environment by enabling data transfer emulation. Amber
also includes a full firmware stack, including DRAM cache logic and flash
firmware such as the FTL and HIL, and obeys diverse standard protocols by
revising the host DMA engines and system buses of all functional and timing
CPU models of a popular full-system simulator (gem5). The proposed simulator
can capture the details of the dynamic performance and power of embedded
cores, DRAMs, firmware and flash under the execution of various operating
systems and hardware platforms. Using Amber, we characterize several
system-level challenges by simulating different types of full systems, such
as mobile devices and general-purpose computers, and offer comprehensive
analyses by comparing passive storage and active storage architectures.
Comment: This paper has been accepted at the 51st Annual IEEE/ACM
International Symposium on Microarchitecture (MICRO '51), 2018. This material
is presented to ensure timely dissemination of scholarly and technical work.
Theory and Practice of Finding Eviction Sets
Many micro-architectural attacks rely on the capability of an attacker to
efficiently find small eviction sets: groups of virtual addresses that map to
the same cache set. This capability has become a decisive primitive for cache
side-channel, rowhammer, and speculative execution attacks. Despite their
importance, algorithms for finding small eviction sets have not been
systematically studied in the literature.
In this paper, we perform such a systematic study. We begin by formalizing
the problem and analyzing the probability that a set of random virtual
addresses is an eviction set. We then present novel algorithms, based on ideas
from threshold group testing, that reduce random eviction sets to their minimal
core in linear time, improving over the quadratic state-of-the-art.
We complement the theoretical analysis of our algorithms with a rigorous
empirical evaluation in which we identify and isolate factors that affect their
reliability in practice, such as adaptive cache replacement strategies and TLB
thrashing. Our results indicate that our algorithms enable finding small
eviction sets much faster than before, and under conditions where this was
previously deemed impractical.
Comment: To appear at IEEE Symposium on Security and Privacy, 2019
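The linear-time reduction derived from threshold group testing can be sketched against a simulated cache: split the candidate set into assoc+1 groups; by pigeonhole, at least one group contains none of the assoc congruent "witness" addresses, so that whole group can be discarded at once. A minimal model (the set-index function, cache parameters, and tester are invented for the simulation; a real attack tests eviction by timing):

```python
from random import Random

def reduce_eviction_set(S, is_eviction_set, assoc):
    # Threshold group testing: split into assoc+1 groups and discard a whole
    # group per test, giving linear total work versus the quadratic
    # one-element-at-a-time baseline.
    S = list(dict.fromkeys(S))                     # drop accidental duplicates
    while len(S) > assoc:
        groups = [S[i::assoc + 1] for i in range(assoc + 1)]
        for g in groups:
            rest = [x for x in S if x not in g]
            if is_eviction_set(rest):
                S = rest
                break
        else:
            break          # would indicate a noisy tester in a real attack
    return S

# Simulated cache: 64 sets, 8 ways, 64-byte lines; the tester just counts
# addresses congruent with the target.
ASSOC, NSETS, LINE = 8, 64, 64
target = 0

def same_set(a):
    return ((a // LINE) % NSETS) == ((target // LINE) % NSETS)

def is_ev(S):
    return sum(map(same_set, S)) >= ASSOC

rng = Random(3)
candidates = [rng.randrange(1 << 24) for _ in range(500)]
candidates += [i * LINE * NSETS for i in range(1, 2 * ASSOC)]  # witnesses
minimal = reduce_eviction_set(candidates, is_ev, ASSOC)
```

Each pass removes a constant fraction of the candidates, which is where the linear-time bound over the quadratic state of the art comes from.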
Performance Evaluation and Modeling of HPC I/O on Non-Volatile Memory
HPC applications pose high demands on I/O performance and storage capability.
The emerging non-volatile memory (NVM) techniques offer low-latency, high
bandwidth, and persistence for HPC applications. However, the existing I/O
stack is designed and optimized based on the assumption of disk-based storage.
To effectively use NVM, we must re-examine the existing high performance
computing (HPC) I/O sub-system to properly integrate NVM into it. With NVM as
fast storage, the previous assumption of inferior storage performance (e.g.,
of hard drives) is no longer valid. The performance problem caused by slow
storage may be mitigated; the existing mechanisms to narrow the performance
gap between storage and CPU may be unnecessary and may incur large overhead.
Thus, fully understanding the impact of introducing NVM into the HPC software
stack demands a thorough performance study.
In this paper, we analyze and model the performance of I/O intensive HPC
applications with NVM as a block device. We study the performance from three
perspectives: (1) the impact of NVM on the performance of traditional page
cache; (2) a performance comparison between MPI individual I/O and POSIX I/O;
and (3) the impact of NVM on the performance of collective I/O. We reveal the
diminishing effects of page cache, minor performance difference between MPI
individual I/O and POSIX I/O, and performance disadvantage of collective I/O on
NVM due to unnecessary data shuffling. We also model the performance of MPI
collective I/O and study the complex interaction between data shuffling,
storage performance, and I/O access patterns.
Comment: 10 pages
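The collective-I/O finding can be illustrated with a back-of-the-envelope two-phase cost model: collective I/O pays a data-shuffle cost to turn many noncontiguous requests into one contiguous request, which pays off when per-request latency is high (disk) but not when it is low (NVM). All numbers below are assumptions chosen for illustration, not the paper's measurements:

```python
# Illustrative per-process two-phase I/O cost model.
def independent_io(chunks, total_bytes, bw, lat):
    # each process issues `chunks` noncontiguous storage requests
    return chunks * lat + total_bytes / bw

def collective_io(total_bytes, bw, lat, net_bw):
    # shuffle data over the network, then issue one contiguous request
    return total_bytes / net_bw + lat + total_bytes / bw

S = 64 * 2**20                       # 64 MiB written per process
K = 64                               # noncontiguous chunks per process
NET = 10e9                           # 10 GB/s interconnect

disk = dict(bw=200e6, lat=5e-3)      # HDD-like: 200 MB/s, 5 ms per request
nvm = dict(bw=5e9, lat=10e-6)        # NVM-like: 5 GB/s, 10 us per request

t_ind_disk = independent_io(K, S, **disk)
t_col_disk = collective_io(S, **disk, net_bw=NET)
t_ind_nvm = independent_io(K, S, **nvm)
t_col_nvm = collective_io(S, **nvm, net_bw=NET)
```

In this model, collective I/O wins on disk because avoiding 63 extra 5 ms requests dwarfs the shuffle cost, while on NVM the shuffle cost exceeds the tiny per-request latencies it saves, matching the abstract's observation that collective I/O loses on NVM due to unnecessary data shuffling.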
A Microbenchmark Characterization of the Emu Chick
The Emu Chick is a prototype system designed around the concept of migratory
memory-side processing. Rather than transferring large amounts of data across
power-hungry, high-latency interconnects, the Emu Chick moves lightweight
thread contexts to near-memory cores before the beginning of each memory read.
The current prototype hardware uses FPGAs to implement cache-less "Gossamer"
cores for doing computational work and a stationary core to run basic
operating system functions and migrate threads between nodes. In this
multi-node characterization of the Emu Chick, we extend an earlier
single-node investigation (Hein, et al. AsHES 2018) of the memory bandwidth
characteristics of the system through benchmarks like STREAM, pointer chasing,
and sparse matrix-vector multiplication. We compare the Emu Chick hardware to
architectural simulation and an Intel Xeon-based platform. Our results
demonstrate that for many basic operations the Emu Chick can use available
memory bandwidth more efficiently than a more traditional, cache-based
architecture although bandwidth usage suffers for computationally intensive
workloads like SpMV. Moreover, the Emu Chick provides stable, predictable
performance with up to 65% of the peak bandwidth utilization on a random-access
pointer chasing benchmark with weak locality.
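A pointer-chasing benchmark of the kind used here builds a random single-cycle permutation so that every load depends on the previous one, defeating prefetching and exposing raw memory latency. A host-side sketch using Sattolo's algorithm (the real benchmark runs on Emu hardware; sizes and the timing harness are illustrative):

```python
import random
import time

def make_chain(n, rng):
    # Sattolo's algorithm: a uniformly random single-cycle permutation,
    # so a chase visits every element exactly once before returning.
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)              # j < i guarantees one big cycle
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt

def chase(nxt, steps, start=0):
    p = start
    for _ in range(steps):
        p = nxt[p]                        # each load depends on the last
    return p

N = 1 << 12
nxt = make_chain(N, random.Random(7))
t0 = time.perf_counter()
end = chase(nxt, N)
elapsed = time.perf_counter() - t0        # per-step time ~ memory latency
```

Because the next address is unknown until the current load completes, throughput on such a chain is bounded by latency rather than bandwidth, which is exactly the regime where migrating thread contexts to memory-side cores can pay off.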
Fault-tolerant linear solvers via selective reliability
Energy increasingly constrains modern computer hardware, yet protecting
computations and data against errors costs energy. This holds at all scales,
but especially for the largest parallel computers being built and planned
today. As processor counts continue to grow, the cost of ensuring reliability
consistently throughout an application will become unbearable. However, many
algorithms only need reliability for certain data and phases of computation.
This suggests an algorithm and system codesign approach. We show that if the
system lets applications apply reliability selectively, we can develop
algorithms that compute the right answer despite faults. These "fault-tolerant"
iterative methods either converge eventually, at a rate that degrades
gracefully with increased fault rate, or return a clear failure indication in
the rare case that they cannot converge. Furthermore, they store most of their
data unreliably, and spend most of their time in unreliable mode.
We demonstrate this for the specific case of detected but uncorrectable
memory faults, which we argue are representative of all kinds of faults. We
developed a cross-layer application / operating system framework that
intercepts and reports uncorrectable memory faults to the application, rather
than killing the application, as current operating systems do. The application
in turn can mark memory allocations as subject to such faults. Using this
framework, we wrote a fault-tolerant iterative linear solver using components
from the Trilinos solvers library. Our solver exploits hybrid parallelism (MPI
and threads). It performs just as well as other solvers if no faults occur, and
converges where other solvers do not in the presence of faults. We show
convergence results for representative test problems. Near-term future work
will include performance tests.
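The selective-reliability idea can be sketched with iterative refinement: the cheap inner solve runs in "unreliable" memory and may return corrupted data, while the outer loop, executing reliably, computes residuals and discards bad corrections. Here a non-finite value stands in for a detected-but-uncorrectable memory fault (the paper's framework reports such faults through the OS; the 3x3 system, Jacobi inner solver, and fault schedule are illustrative):

```python
import math

# Small diagonally dominant system with known solution x* = [1, 2, 3].
A = [[4.0, 1.0, 0.0],
     [1.0, 4.0, 1.0],
     [0.0, 1.0, 4.0]]
b = [6.0, 12.0, 14.0]

def matvec(A, x):
    return [sum(row[j] * x[j] for j in range(3)) for row in A]

def jacobi(A, r, sweeps=10):
    # inexact inner solve for A d = r
    d = [0.0, 0.0, 0.0]
    for _ in range(3 * 0 + sweeps):
        d = [(r[i] - sum(A[i][j] * d[j] for j in range(3) if j != i)) / A[i][i]
             for i in range(3)]
    return d

def unreliable_jacobi(A, r, it, faulty=(2, 5)):
    # inner solve in unreliable memory: some iterations return corrupted data
    d = jacobi(A, r)
    if it in faulty:
        d[0] = float("inf")               # stands in for a detected fault
    return d

x = [0.0, 0.0, 0.0]
for it in range(30):
    Ax = matvec(A, x)
    r = [b[i] - Ax[i] for i in range(3)]            # reliable phase
    d = unreliable_jacobi(A, r, it)                 # unreliable phase
    if not all(math.isfinite(v) for v in d):
        continue                                    # discard faulty correction
x = x
    # (no update on faulty iterations; otherwise apply the correction)
```

Because residuals are always recomputed from reliably stored A, b, and x, a discarded inner solve only delays convergence rather than corrupting the answer, which is the graceful degradation the abstract describes.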
Memory DoS Attacks in Multi-tenant Clouds: Severity and Mitigation
In cloud computing, network Denial of Service (DoS) attacks are well studied
and defenses have been implemented, but severe DoS attacks on a victim's
working memory by a single hostile VM are not well understood. Memory DoS
attacks are Denial of Service (or Degradation of Service) attacks caused by
contention for hardware memory resources on a cloud server. Despite the strong
memory isolation techniques for virtual machines (VMs) enforced by the software
virtualization layer in cloud servers, the underlying hardware memory layers
are still shared by the VMs and can be exploited by a clever attacker in a
hostile VM co-located on the same server as the victim VM, denying the victim
the working memory he needs. We first show quantitatively the severity of
contention on different memory resources. We then show that a malicious cloud
customer can mount low-cost attacks to cause severe performance degradation for
a Hadoop distributed application, and 38X delay in response time for an
E-commerce website in the Amazon EC2 cloud.
Then, we design an effective, new defense against these memory DoS attacks,
using a statistical metric to detect their existence and execution throttling
to mitigate the attack damage. We achieve this by a novel re-purposing of
existing hardware performance counters and duty cycle modulation for security,
rather than for improving performance or power consumption. We implement a full
prototype on the OpenStack cloud system. Our evaluations show that this defense
system can effectively defeat memory DoS attacks with negligible performance
overhead.
Comment: 18 pages
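The detection step is a statistical test on hardware performance-counter samples. A toy version: learn a baseline distribution of a memory-traffic counter, then flag a measurement window whose mean deviates by more than k standard deviations, after which the hypervisor would throttle the suspect VM via duty-cycle modulation. The counter values, the threshold k, and the duty-cycle policy are invented for illustration:

```python
from statistics import mean, stdev

def is_attack(baseline, window, k=3.0):
    # flag when the window mean deviates from the baseline by more than
    # k sample standard deviations
    mu, sigma = mean(baseline), stdev(baseline)
    return abs(mean(window) - mu) > k * sigma

def throttle(duty_cycle, flagged, step=0.25, floor=0.25):
    # crude mitigation: cut the suspect VM's CPU duty cycle while flagged,
    # restore it gradually once the counter returns to baseline
    return max(floor, duty_cycle - step) if flagged else min(1.0, duty_cycle + step)

baseline = [100, 102, 98, 101, 99, 100, 103, 97]  # e.g. LLC misses/ms, benign
normal = [101, 99, 100]
attack = [310, 295, 305]                          # contention spike

flag_normal = is_attack(baseline, normal)
flag_attack = is_attack(baseline, attack)
```

Re-purposing counters this way costs almost nothing at runtime, which is consistent with the negligible overhead the abstract reports; the real defense additionally has to pick counters and thresholds that separate attacks from bursty-but-benign workloads.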