    DRAMA: Exploiting DRAM Addressing for Cross-CPU Attacks

    In cloud computing environments, multiple tenants are often co-located on the same multi-processor system. Thus, preventing information leakage between tenants is crucial. While the hypervisor enforces software isolation, shared hardware, such as the CPU cache or memory bus, can leak sensitive information. For security reasons, shared memory between tenants is typically disabled. Furthermore, tenants often do not share a physical CPU. In this setting, cache attacks do not work and only a slow cross-CPU covert channel over the memory bus is known. In contrast, we demonstrate a high-speed covert channel as well as the first side-channel attack working across processors and without any shared memory. To build these attacks, we use the undocumented DRAM address mappings. We present two methods to reverse engineer the mapping of memory addresses to DRAM channels, ranks, and banks. One uses physical probing of the memory bus, the other runs entirely in software and is fully automated. Using this mapping, we introduce DRAMA attacks, a novel class of attacks that exploit the DRAM row buffer that is shared, even in multi-processor systems. Thus, our attacks work in the most restrictive environments. First, we build a covert channel with a capacity of up to 2 Mbps, which is three to four orders of magnitude faster than memory-bus-based channels. Second, we build a side-channel template attack that can automatically locate and monitor memory accesses. Third, we show how using the DRAM mappings improves existing attacks and in particular enables practical Rowhammer attacks on DDR4. Comment: Original publication in the Proceedings of the 25th Annual USENIX Security Symposium (USENIX Security 2016). https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/pess
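    The key primitive behind the software-only reverse engineering is timing row-buffer conflicts. The fragment below is a minimal illustrative sketch (not the authors' tool): it times alternating uncached accesses to two addresses, and pairs that map to the same bank but different rows show a higher average latency. The round count, buffer size, and offsets are assumptions that would need per-machine calibration.

```c
/* Illustrative sketch only: time alternating uncached accesses to two
 * addresses. Same-bank/different-row pairs cause row-buffer conflicts and
 * show a measurably higher average latency. */
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>          /* _mm_clflush, _mm_mfence, __rdtscp */

#define ROUNDS 10000            /* illustrative; calibrate per machine */

static uint64_t time_pair(volatile char *a, volatile char *b)
{
    uint64_t total = 0;
    unsigned aux;
    for (int i = 0; i < ROUNDS; i++) {
        _mm_clflush((void *)a);              /* force both accesses to DRAM */
        _mm_clflush((void *)b);
        _mm_mfence();
        uint64_t t0 = __rdtscp(&aux);
        (void)*a;                            /* opens a row in some bank */
        (void)*b;                            /* conflicts if same bank, other row */
        uint64_t t1 = __rdtscp(&aux);
        total += t1 - t0;
    }
    return total / ROUNDS;
}

int main(void)
{
    static char buf[1 << 22];                /* 4 MiB scratch buffer */
    /* Pairs whose latency exceeds a calibrated threshold are grouped into
     * the same bank; many such groups reveal the address-mapping bits. */
    printf("pair A: %llu cycles\n", (unsigned long long)time_pair(buf, buf + (1 << 12)));
    printf("pair B: %llu cycles\n", (unsigned long long)time_pair(buf, buf + (1 << 21)));
    return 0;
}
```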

    The 1 Teraflops QCDSP computer

    The QCDSP computer (Quantum Chromodynamics on Digital Signal Processors) is an inexpensive, massively parallel computer intended primarily for simulations in lattice gauge theory. Currently, two large QCDSP machines are in full-time use: an 8,192-processor, 0.4 Teraflops machine at Columbia University and a 12,288-processor, 0.6 Teraflops machine at the RIKEN-BNL Research Center at Brookhaven National Laboratory. We describe the design process, architecture, software and current physics projects of these computers. Comment: 19 pages, 3 figures

    Memory-based Combination PUFs for Device Authentication in Embedded Systems

    Embedded systems play a crucial role in fueling the growth of the Internet-of-Things (IoT) in application domains such as healthcare, home automation, transportation, etc. However, their increasingly network-connected nature, coupled with their ability to access potentially sensitive/confidential information, has given rise to many security and privacy concerns. An additional challenge is the growing number of counterfeit components in these devices, resulting in serious reliability and financial implications. Physically Unclonable Functions (PUFs) are a promising security primitive to help address these concerns. Memory-based PUFs are particularly attractive as they require minimal or no additional hardware for their operation. However, current memory-based PUFs utilize only a single memory technology for constructing the PUF, which has several disadvantages including making them vulnerable to security attacks. In this paper, we propose the design of a new memory-based combination PUF that intelligently combines two memory technologies, SRAM and DRAM, to overcome these shortcomings. The proposed combination PUF exhibits high entropy, supports a large number of challenge-response pairs, and is intrinsically reconfigurable. We have implemented the proposed combination PUF using a Terasic TR4-230 FPGA board and several off-the-shelf SRAMs and DRAMs. Experimental results demonstrate substantial improvements over current memory-based PUFs including the ability to resist various attacks. Extensive authentication tests across a wide temperature range (20 - 60 deg. Celsius) and accelerated aging (12 months) demonstrate the robustness of the proposed design, which achieves a 100% true-positive rate and 0% false-positive rate for authentication across these parameter ranges. Comment: 7 pages, 10 figures
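    As an illustration of the general idea (not the paper's exact construction), a challenge can be expanded into index pairs that pull one bit from an SRAM power-up fingerprint and one bit from a DRAM decay fingerprint and XOR them into the response. The index generator and bit layout below are arbitrary assumptions.

```c
/* Illustrative combination only -- not the paper's exact scheme. A challenge
 * seeds an index sequence; each response bit XORs one SRAM fingerprint bit
 * with one DRAM fingerprint bit, so both technologies contribute entropy. */
#include <stdint.h>
#include <stddef.h>

static int get_bit(const uint8_t *buf, size_t bit)
{
    return (buf[bit / 8] >> (bit % 8)) & 1;
}

void combination_puf_response(const uint8_t *sram_fp, size_t sram_bits,
                              const uint8_t *dram_fp, size_t dram_bits,
                              uint32_t challenge,
                              uint8_t *response, size_t resp_bits)
{
    uint32_t s = challenge ? challenge : 1;          /* simple LCG indexer */
    for (size_t i = 0; i < resp_bits; i++) {
        s = s * 1664525u + 1013904223u;
        int b = get_bit(sram_fp, s % sram_bits);
        s = s * 1664525u + 1013904223u;
        b ^= get_bit(dram_fp, s % dram_bits);        /* combine SRAM and DRAM */
        if (b)
            response[i / 8] |= (uint8_t)(1u << (i % 8));
        else
            response[i / 8] &= (uint8_t)~(1u << (i % 8));
    }
}
```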

    Towards Implementation of Robust and Low-Cost Security Primitives for Resource-Constrained IoT Devices

    In recent years, due to the trend in globalization, system integrators have had to deal with integrated circuit (IC)/intellectual property (IP) counterfeiting more than ever. These counterfeit hardware issues have driven the need for more secure chip authentication. High entropy random numbers from physical sources are a critical component in authentication and encryption processes within secure systems [6]. Secure encryption depends on sources of truly random numbers for generating keys, and there is a need for an on-chip random number generator to achieve adequate security. Furthermore, the Internet of Things (IoT) adopts a large number of these hardware-based security and prevention solutions in order to securely exchange data in a resource-efficient manner. In this work, we have developed several hardware-based random-function methodologies to address these issues and enhance the security and trust of ICs: a novel DRAM-based intrinsic Physical Unclonable Function (PUF) [13] for system-level security and authentication, along with an analysis of the impact of various environmental conditions, particularly silicon aging; a DRAM-remanence-based True Random Number Generator (TRNG) to produce random sequences with very low overhead; a DRAM TRNG model using startup value behavior for creating random bit streams; an efficient power-supply-noise-based TRNG model for generating an infinite number of random bits, which has been evaluated as a cost-effective technique; and architectures and hardware security solutions for the Internet of Things (IoT) environment. Since IoT devices are heavily resource-constrained, our proposed designs can help establish trust and security in an efficient and low-cost manner. Comment: 7 pages, 6 figures, 1 table
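    A generic post-processing step that memory-based TRNGs of this kind typically need (not the thesis's specific design) is debiasing the raw power-up or remanence bits. The sketch below applies a standard von Neumann extractor to bits assumed to have already been captured from the device.

```c
/* Generic von Neumann extractor (not the thesis's specific TRNG design):
 * given raw power-up/remanence bits captured from a memory device, keep
 * 01 -> 0 and 10 -> 1, and discard 00/11, removing per-cell bias.
 * Returns the number of output bits produced. */
#include <stdint.h>
#include <stddef.h>

size_t von_neumann_extract(const uint8_t *raw, size_t raw_bits,
                           uint8_t *out, size_t out_capacity_bits)
{
    size_t n = 0;
    for (size_t i = 0; i + 1 < raw_bits && n < out_capacity_bits; i += 2) {
        int b0 = (raw[i / 8]       >> (i % 8))       & 1;
        int b1 = (raw[(i + 1) / 8] >> ((i + 1) % 8)) & 1;
        if (b0 == b1)
            continue;                 /* biased pairs (00, 11) are dropped */
        if (b0)
            out[n / 8] |= (uint8_t)(1u << (n % 8));    /* 10 -> 1 */
        else
            out[n / 8] &= (uint8_t)~(1u << (n % 8));   /* 01 -> 0 */
        n++;
    }
    return n;
}
```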

    Amber: Enabling Precise Full-System Simulation with Detailed Modeling of All SSD Resources

    SSDs have become a major storage component in modern memory hierarchies, and SSD research demands simulation-based studies that integrate SSD subsystems into a full-system environment. However, several challenges exist in modeling SSDs under full-system simulation: SSDs comprise their own complete system and architecture, employing all necessary hardware, such as CPUs, DRAM, and an interconnect network. On top of these hardware components, SSDs also require multiple device controllers, internal caches, and software modules that respect a wide spectrum of storage interfaces and protocols. All of this SSD hardware and software is necessary to realize a storage subsystem in a full-system environment that can operate in parallel with the host system. In this work, we introduce a new SSD simulation framework, SimpleSSD 2.0, namely Amber, that models embedded CPU cores, DRAMs, and various flash technologies (within an SSD), and operates under a full-system simulation environment by enabling data transfer emulation. Amber also includes a full firmware stack, including DRAM cache logic and flash firmware such as the FTL and HIL, and obeys diverse standard protocols by revising the host DMA engines and system buses of all the functional and timing CPU models of a popular full-system simulator (gem5). The proposed simulator can capture the details of dynamic performance and power of embedded cores, DRAMs, firmware, and flash under the execution of various operating systems and hardware platforms. Using Amber, we characterize several system-level challenges by simulating different types of full systems, such as mobile devices and general-purpose computers, and offer comprehensive analyses by comparing passive storage and active storage architectures. Comment: This paper has been accepted at the 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO '51), 2018. This material is presented to ensure timely dissemination of scholarly and technical work

    Theory and Practice of Finding Eviction Sets

    Many micro-architectural attacks rely on the capability of an attacker to efficiently find small eviction sets: groups of virtual addresses that map to the same cache set. This capability has become a decisive primitive for cache side-channel, rowhammer, and speculative execution attacks. Despite their importance, algorithms for finding small eviction sets have not been systematically studied in the literature. In this paper, we perform such a systematic study. We begin by formalizing the problem and analyzing the probability that a set of random virtual addresses is an eviction set. We then present novel algorithms, based on ideas from threshold group testing, that reduce random eviction sets to their minimal core in linear time, improving over the quadratic state-of-the-art. We complement the theoretical analysis of our algorithms with a rigorous empirical evaluation in which we identify and isolate factors that affect their reliability in practice, such as adaptive cache replacement strategies and TLB thrashing. Our results indicate that our algorithms enable finding small eviction sets much faster than before, and under conditions where this was previously deemed impractical. Comment: To appear at IEEE Symposium on Security and Privacy, 2019
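    The group-testing reduction can be sketched as follows (a simplification of the paper's algorithm, with the hardware-specific eviction test left as a stub): split the candidate set into assoc+1 groups, drop any group whose removal still leaves an eviction set, and repeat until only assoc addresses remain.

```c
/* Sketch of a group-testing style reduction: repeatedly partition the
 * candidate set into (assoc + 1) groups and discard one group whose removal
 * still leaves an eviction set. evicts() is hardware specific (e.g., time a
 * probe access after touching the candidates) and must be supplied. */
#include <stdlib.h>
#include <string.h>

int evicts(void **cand, size_t n, void *probe);   /* stub: 1 if cand evicts probe */

/* Reduces cand[0..n-1] in place; returns the new size (ideally == assoc). */
size_t reduce_eviction_set(void **cand, size_t n, void *probe, size_t assoc)
{
    void **tmp = malloc(n * sizeof(*tmp));
    if (!tmp)
        return n;
    while (n > assoc) {
        size_t groups = assoc + 1;
        size_t glen = (n + groups - 1) / groups;       /* ceil(n / groups) */
        int shrunk = 0;
        for (size_t g = 0; g < groups; g++) {
            size_t lo = g * glen;
            size_t hi = lo + glen < n ? lo + glen : n;
            if (lo >= n)
                break;
            size_t m = 0;                              /* candidate set minus group g */
            for (size_t i = 0; i < n; i++)
                if (i < lo || i >= hi)
                    tmp[m++] = cand[i];
            if (evicts(tmp, m, probe)) {               /* group g was redundant */
                memcpy(cand, tmp, m * sizeof(*tmp));
                n = m;
                shrunk = 1;
                break;
            }
        }
        if (!shrunk)                                   /* noisy test: give up */
            break;
    }
    free(tmp);
    return n;
}
```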

    Performance Evaluation and Modeling of HPC I/O on Non-Volatile Memory

    HPC applications pose high demands on I/O performance and storage capability. The emerging non-volatile memory (NVM) techniques offer low latency, high bandwidth, and persistence for HPC applications. However, the existing I/O stack is designed and optimized based on the assumption of disk-based storage. To effectively use NVM, we must re-examine the existing high performance computing (HPC) I/O sub-system to properly integrate NVM into it. With NVM as fast storage, the previous assumption about the inferior performance of storage (e.g., hard drives) is no longer valid. The performance problems caused by slow storage may be mitigated; the existing mechanisms to narrow the performance gap between storage and CPU may be unnecessary and result in large overhead. Thus, fully understanding the impact of introducing NVM into the HPC software stack demands a thorough performance study. In this paper, we analyze and model the performance of I/O-intensive HPC applications with NVM as a block device. We study the performance from three perspectives: (1) the impact of NVM on the performance of the traditional page cache; (2) a performance comparison between MPI individual I/O and POSIX I/O; and (3) the impact of NVM on the performance of collective I/O. We reveal the diminishing effects of the page cache, the minor performance difference between MPI individual I/O and POSIX I/O, and the performance disadvantage of collective I/O on NVM due to unnecessary data shuffling. We also model the performance of MPI collective I/O and study the complex interaction between data shuffling, storage performance, and I/O access patterns. Comment: 10 pages
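    The two MPI-IO paths being compared can be shown in a minimal fragment (file name, block size, and offsets are placeholders): each rank writes its own block either independently or through the collective call, where the latter adds the coordination and data-shuffling phase whose benefit the study questions on fast NVM.

```c
/* Minimal MPI-IO fragment contrasting independent writes (MPI_File_write_at)
 * with collective writes (MPI_File_write_at_all); both calls are shown back
 * to back only for illustration, a real benchmark would time each separately. */
#include <mpi.h>

#define BLOCK (1 << 20)   /* 1 MiB per rank, illustrative */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    static char buf[BLOCK];
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_Offset off = (MPI_Offset)rank * BLOCK;

    /* Independent (individual) I/O: no coordination among ranks. */
    MPI_File_write_at(fh, off, buf, BLOCK, MPI_BYTE, MPI_STATUS_IGNORE);

    /* Collective I/O: ranks coordinate and may shuffle data before writing. */
    MPI_File_write_at_all(fh, off, buf, BLOCK, MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```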

    A Microbenchmark Characterization of the Emu Chick

    The Emu Chick is a prototype system designed around the concept of migratory memory-side processing. Rather than transferring large amounts of data across power-hungry, high-latency interconnects, the Emu Chick moves lightweight thread contexts to near-memory cores before the beginning of each memory read. The current prototype hardware uses FPGAs to implement cache-less "Gossamer" cores for doing computational work and a stationary core to run basic operating system functions and migrate threads between nodes. In this multi-node characterization of the Emu Chick, we extend an earlier single-node investigation (Hein et al., AsHES 2018) of the memory bandwidth characteristics of the system through benchmarks like STREAM, pointer chasing, and sparse matrix-vector multiplication. We compare the Emu Chick hardware to architectural simulation and an Intel Xeon-based platform. Our results demonstrate that for many basic operations the Emu Chick can use available memory bandwidth more efficiently than a more traditional, cache-based architecture, although bandwidth usage suffers for computationally intensive workloads like SpMV. Moreover, the Emu Chick provides stable, predictable performance with up to 65% of the peak bandwidth utilization on a random-access pointer chasing benchmark with weak locality.
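    A generic pointer-chasing microbenchmark of the kind referenced above (not the Emu-specific code) walks a single random cycle so that every load depends on the previous one, exposing memory latency rather than bandwidth; the array and step counts below are arbitrary.

```c
/* Generic pointer-chasing latency benchmark: build one long random cycle
 * with Sattolo's algorithm, then time dependent loads walking the cycle. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N     (1 << 24)    /* number of pointer-sized elements, illustrative */
#define STEPS (1 << 26)

int main(void)
{
    size_t *next = malloc(N * sizeof(*next));
    if (!next)
        return 1;
    for (size_t i = 0; i < N; i++)
        next[i] = i;
    srand(12345);
    for (size_t i = N - 1; i > 0; i--) {        /* Sattolo: single random cycle */
        size_t j = (size_t)rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (long s = 0; s < STEPS; s++)
        p = next[p];                             /* each load depends on the last */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("avg latency: %.2f ns/load (p=%zu)\n", ns / STEPS, p);
    free(next);
    return 0;
}
```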

    Fault-tolerant linear solvers via selective reliability

    Energy increasingly constrains modern computer hardware, yet protecting computations and data against errors costs energy. This holds at all scales, but especially for the largest parallel computers being built and planned today. As processor counts continue to grow, the cost of ensuring reliability consistently throughout an application will become unbearable. However, many algorithms only need reliability for certain data and phases of computation. This suggests an algorithm and system codesign approach. We show that if the system lets applications apply reliability selectively, we can develop algorithms that compute the right answer despite faults. These "fault-tolerant" iterative methods either converge eventually, at a rate that degrades gracefully with increased fault rate, or return a clear failure indication in the rare case that they cannot converge. Furthermore, they store most of their data unreliably, and spend most of their time in unreliable mode. We demonstrate this for the specific case of detected but uncorrectable memory faults, which we argue are representative of all kinds of faults. We developed a cross-layer application / operating system framework that intercepts and reports uncorrectable memory faults to the application, rather than killing the application, as current operating systems do. The application in turn can mark memory allocations as subject to such faults. Using this framework, we wrote a fault-tolerant iterative linear solver using components from the Trilinos solvers library. Our solver exploits hybrid parallelism (MPI and threads). It performs just as well as other solvers if no faults occur, and converges where other solvers do not in the presence of faults. We show convergence results for representative test problems. Near-term future work will include performance tests
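    The selective-reliability pattern can be illustrated with a deliberately tiny inner-outer iteration (not the paper's Trilinos FT-GMRES solver): the outer loop keeps the solution and residual check in reliable storage, while the inner correction is treated as unreliable and its result is discarded whenever a detected fault is reported (simulated here at random).

```c
/* Toy sketch of selective reliability, not the paper's solver: the outer
 * loop (reliable) computes the residual and decides convergence; the inner
 * Jacobi-style correction (unreliable) may report a detected fault, in which
 * case its result is skipped rather than crashing the run. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define N 4

/* Unreliable inner step: z = D^{-1} r. Returns -1 on a (simulated) fault. */
static int inner_correction(const double A[N][N], const double r[N], double z[N])
{
    if (rand() % 10 == 0)
        return -1;                          /* simulated detected memory fault */
    for (int i = 0; i < N; i++)
        z[i] = r[i] / A[i][i];
    return 0;
}

int main(void)
{
    double A[N][N] = {{4,1,0,0},{1,4,1,0},{0,1,4,1},{0,0,1,4}};
    double b[N] = {1, 2, 3, 4}, x[N] = {0};

    for (int k = 0; k < 1000; k++) {
        double r[N], nrm = 0.0;
        for (int i = 0; i < N; i++) {        /* reliable residual r = b - A x */
            r[i] = b[i];
            for (int j = 0; j < N; j++)
                r[i] -= A[i][j] * x[j];
            nrm += r[i] * r[i];
        }
        if (sqrt(nrm) < 1e-10) {
            printf("converged after %d outer iterations\n", k);
            break;
        }
        double z[N];
        if (inner_correction(A, r, z) != 0)
            continue;                        /* fault detected: skip this step */
        for (int i = 0; i < N; i++)          /* reliable update x += z */
            x[i] += z[i];
    }
    for (int i = 0; i < N; i++)
        printf("x[%d] = %f\n", i, x[i]);
    return 0;
}
```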

    Memory DoS Attacks in Multi-tenant Clouds: Severity and Mitigation

    In cloud computing, network Denial of Service (DoS) attacks are well studied and defenses have been implemented, but severe DoS attacks on a victim's working memory by a single hostile VM are not well understood. Memory DoS attacks are Denial of Service (or Degradation of Service) attacks caused by contention for hardware memory resources on a cloud server. Despite the strong memory isolation techniques for virtual machines (VMs) enforced by the software virtualization layer in cloud servers, the underlying hardware memory layers are still shared by the VMs and can be exploited by a clever attacker in a hostile VM co-located on the same server as the victim VM, denying the victim the working memory he needs. We first show quantitatively the severity of contention on different memory resources. We then show that a malicious cloud customer can mount low-cost attacks to cause severe performance degradation for a Hadoop distributed application, and 38X delay in response time for an E-commerce website in the Amazon EC2 cloud. Then, we design an effective, new defense against these memory DoS attacks, using a statistical metric to detect their existence and execution throttling to mitigate the attack damage. We achieve this by a novel re-purposing of existing hardware performance counters and duty cycle modulation for security, rather than for improving performance or power consumption. We implement a full prototype on the OpenStack cloud system. Our evaluations show that this defense system can effectively defeat memory DoS attacks with negligible performance overhead. Comment: 18 pages
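    On Linux, the detection half could be prototyped with perf_event_open as below (the paper's defense runs at the hypervisor level, observes the suspected VMs, and additionally throttles via duty-cycle modulation, none of which is shown); the counter choice, sampling interval, and threshold are assumptions that would need calibration.

```c
/* Sketch of contention detection only: sample LLC misses once per second and
 * flag sustained jumps over a calibrated baseline. Self-monitoring is shown
 * for simplicity; a real monitor would watch the suspected VM or whole socket. */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>

static int open_counter(uint64_t config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.exclude_kernel = 1;
    /* pid = 0 (this process), cpu = -1 (any CPU) */
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
    int fd = open_counter(PERF_COUNT_HW_CACHE_MISSES);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    const double threshold = 5.0e6;   /* misses/sec; illustrative, calibrate per host */
    uint64_t prev = 0;
    for (;;) {
        sleep(1);
        uint64_t cur;
        if (read(fd, &cur, sizeof(cur)) != sizeof(cur))
            break;
        double rate = (double)(cur - prev);
        prev = cur;
        if (rate > threshold)
            printf("possible memory contention: %.0f LLC misses/sec\n", rate);
    }
    return 0;
}
```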