8 research outputs found

    Energy-Efficient Acceleration of Asynchronous Programs.

    Full text link
    Asynchronous or event-driven programming has become the dominant programming model in the last few years. In this model, computations are posted as events to an event queue from where they get processed asynchronously by the application. A huge fraction of computing systems built today use asynchronous programming. All the Web 2.0 JavaScript applications (e.g., Gmail, Facebook) use asynchronous programming. There are now more than two million mobile applications available between the Apple App Store and Google Play, which are all written using asynchronous programming. Distributed servers (e.g., Twitter, LinkedIn, PayPal) built using actor-based languages (e.g., Scala) and platforms such as node.js rely on asynchronous events for scalable communication. Internet-of-Things (IoT), embedded systems, sensor networks, desktop GUI applications, etc., all rely on the asynchronous programming model. Despite the ubiquity of asynchronous programs, their unique execution characteristics have been largely ignored by conventional processor architectures, which have remained heavily optimized for synchronous programs. Asynchronous programs are characterized by short events executing varied tasks. This results in a large instruction footprint with little cache locality, severely degrading cache performance. Also, event execution has few repeatable patterns causing poor branch prediction. This thesis proposes novel processor optimizations exploiting the unique execution characteristics of asynchronous programs for performance optimization and energy-efficiency. These optimizations are designed to make the underlying hardware aware of discrete events and thereafter, exploit the latent Event-Level Parallelism present in these applications. Through speculative pre-execution of future events, cache addresses and branch outcomes are recorded and later used for improving cache and branch predictor performance. A hardware instruction prefetcher specialized for asynchronous programs is also proposed as a comparative design direction.PhDComputer Science and EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/120780/1/gauravc_1.pd

    Random and Safe Cache Architecture to Defeat Cache Timing Attacks

    Full text link
    Caches have been exploited to leak secret information due to the different times they take to handle memory accesses. Cache timing attacks include non-speculative cache side and covert channel attacks and cache-based speculative execution attacks. We first present a systematic view of the attack and defense space and show that no existing defense has addressed both speculative and non-speculative cache timing attack families, which we do in this paper. We propose Random and Safe (RaS) cache architectures to decorrelate the cache state changes from memory requests. RaS fills the cache with ``safe'' cache lines that are likely to be used in the future, rather than with demand-fetched, security-sensitive lines. RaS captures a group of safe addresses during runtime and fetches addresses randomly displaced from these addresses. Our proposed RaS architecture is flexible to allow security-performance trade-offs. We show different designs of RaS architectures that can defeat cache side-channel attacks and cache-based speculative execution attacks. The RaS variant against cache-based speculative execution attacks has 4.2% average performance overhead and other RaS variants against both attack families have 7.9% to 45.2% average overhead. For some benchmarks, RaS defenses improve the performance while providing security

    Risky Translations: Securing TLBs against Timing Side Channels

    Get PDF
    Microarchitectural side-channel vulnerabilities in modern processors are known to be a powerful attack vector that can be utilized to bypass common security boundaries like memory isolation. As shown by recent variants of transient execution attacks related to Spectre and Meltdown, those side channels allow to leak data from the microarchitecture to the observable architectural state. The vast majority of attacks currently build on the cache-timing side channel, since it is easy to exploit and provides a reliable, fine-grained communication channel. Therefore, many proposals for side-channel secure cache architectures have been made. However, caches are not the only source of side-channel leakage in modern processors and mitigating the cache side channel will inevitably lead to attacks exploiting other side channels. In this work, we focus on defeating side-channel attacks based on page translations. It has been shown that the Translation Lookaside Buffer (TLB) can be exploited in a very similar fashion to caches. Since the main caches and the TLB share many features in their architectural design, the question arises whether existing countermeasures against cache-timing attacks can be used to secure the TLB. We analyze state-ofthe-art proposals for side-channel secure cache architectures and investigate their applicability to TLB side channels. We find that those cache countermeasures are not directly applicable to TLBs, and propose TLBcoat, a new side-channel secure TLB architecture. We provide evidence of TLB side-channel leakage on RISC-V-based Linux systems, and demonstrate that TLBcoat prevents this leakage. We implement TLBcoat using the gem5 simulator and evaluate its performance using the PARSEC benchmark suite

    Adaptive Microarchitectural Optimizations to Improve Performance and Security of Multi-Core Architectures

    Get PDF
    With the current technological barriers, microarchitectural optimizations are increasingly important to ensure performance scalability of computing systems. The shift to multi-core architectures increases the demands on the memory system, and amplifies the role of microarchitectural optimizations in performance improvement. In a multi-core system, microarchitectural resources are usually shared, such as the cache, to maximize utilization but sharing can also lead to contention and lower performance. This can be mitigated through partitioning of shared caches.However, microarchitectural optimizations which were assumed to be fundamentally secure for a long time, can be used in side-channel attacks to exploit secrets, as cryptographic keys. Timing-based side-channels exploit predictable timing variations due to the interaction with microarchitectural optimizations during program execution. Going forward, there is a strong need to be able to leverage microarchitectural optimizations for performance without compromising security. This thesis contributes with three adaptive microarchitectural resource management optimizations to improve security and/or\ua0performance\ua0of multi-core architectures\ua0and a systematization-of-knowledge of timing-based side-channel attacks.\ua0We observe that to achieve high-performance cache partitioning in a multi-core system\ua0three requirements need to be met: i) fine-granularity of partitions, ii) locality-aware placement and iii) frequent changes. These requirements lead to\ua0high overheads for current centralized partitioning solutions, especially as the number of cores in the\ua0system increases. To address this problem, we present an adaptive and scalable cache partitioning solution (DELTA) using a distributed and asynchronous allocation algorithm. The\ua0allocations occur through core-to-core challenges, where applications with larger performance benefit will gain cache capacity. The\ua0solution is implementable in hardware, due to low computational complexity, and can scale to large core counts.According to our analysis, better performance can be achieved by coordination of multiple optimizations for different resources, e.g., off-chip bandwidth and cache, but is challenging due to the increased number of possible allocations which need to be evaluated.\ua0Based on these observations, we present a solution (CBP) for coordinated management of the optimizations: cache partitioning, bandwidth partitioning and prefetching.\ua0Efficient allocations, considering the inter-resource interactions and trade-offs, are achieved using local resource managers to limit the solution space.The continuously growing number of\ua0side-channel attacks leveraging\ua0microarchitectural optimizations prompts us to review attacks and defenses to understand the vulnerabilities of different microarchitectural optimizations. We identify the four root causes of timing-based side-channel attacks: determinism, sharing, access violation\ua0and information flow.\ua0Our key insight is that eliminating any of the exploited root causes, in any of the attack steps, is enough to provide protection.\ua0Based on our framework, we present a systematization of the attacks and defenses on a wide range of microarchitectural optimizations, which highlights their key similarities.\ua0Shared caches are an attractive attack surface for side-channel attacks, while defenses need to be efficient since the cache is crucial for performance.\ua0To address this issue, we present an adaptive and scalable cache partitioning solution (SCALE) for protection against cache side-channel attacks. The solution leverages randomness,\ua0and provides quantifiable and information theoretic security guarantees using differential privacy. The solution closes the performance gap to a state-of-the-art non-secure allocation policy for a mix of secure and non-secure applications

    Deterministic Object Management in Large Distributed Systems

    Get PDF
    Caching is a widely used technique to improve the scalability of distributed systems. A central issue with caching is maintaining object replicas consistent with their master copies. Large distributed systems, such as the Web, typically deploy heuristic-based consistency mechanisms, which increase delay and place extra load on the servers, while not providing guarantees that cached copies served to clients are up-to-date. Server-driven invalidation has been proposed as an approach to strong cache consistency, but it requires servers to keep track of which objects are cached by which clients. We propose an alternative approach to strong cache consistency, called MONARCH, which does not require servers to maintain per-client state. Our approach builds on a few key observations. Large and popular sites, which attract the majority of the traffic, construct their pages from distinct components with various characteristics. Components may have different content types, change characteristics, and semantics. These components are merged together to produce a monolithic page, and the information about their uniqueness is lost. In our view, pages should serve as containers holding distinct objects with heterogeneous type and change characteristics while preserving the boundaries between these objects. Servers compile object characteristics and information about relationships between containers and embedded objects into explicit object management commands. Servers piggyback these commands onto existing request/response traffic so that client caches can use these commands to make object management decisions. The use of explicit content control commands is a deterministic, rather than heuristic, object management mechanism that gives content providers more control over their content. The deterministic object management with strong cache consistency offered by MONARCH allows content providers to make more of their content cacheable. Furthermore, MONARCH enables content providers to expose internal structure of their pages to clients. We evaluated MONARCH using simulations with content collected from real Web sites. The results show that MONARCH provides strong cache consistency for all objects, even for unpredictably changing ones, and incurs smaller byte and message overhead than heuristic policies. The results also show that as the request arrival rate or the number of clients increases, the amount of server state maintained by MONARCH remains the same while the amount of server state incurred by server invalidation mechanisms grows

    Feasibility of Accelerator Generation to Alleviate Dark Silicon in a Novel Architecture

    Get PDF
    This thesis presents a novel approach to alleviating Dark Silicon problem by reducing power density. Decreasing the size of transistor has generated an increasing on power consumption. To attempt to manage the power issue, processor design has shifted from one single core to many cores. Switching on fewer cores while the others are off helps the chip to cool down and spread power more evenly over the chip. This means that some transistors are always idle while others are working. Therefore, scaling down the size of the chip, and increasing the amount of power to be dissipated, increases the number of inactive transistors. As a result it generates Dark Silicon, which doubles every chip generation [63] One of the most effective techniques to deal with Dark Silicon is to implement accelerators that execute the most energy consumer software functions. In this way the CPU is able to dissipate more energy and reduce the dark silicon issue. This work explores a novel accelerator design model which could be interfaced to a Stack CPU and so could optimise the transistor logic area and improve energy efficiency to tackle the dark silicon problem based on heterogeneous multi-accelerators (co-processor) in stack structure. The contribution of this thesis is to develop a tool to generate coprocessors from software stack machine code. But it also employs up-to-date code optimisation strategies to enhance the code at the input stage. Analysis of the cores using key metrics, based on 65nm synthesis experiments and industry standard tool-sets. It further introduces a novel architecture to decrease the power density of the accelerator. In order to test these expectations, a large-scale synthesis translation experiment was conducted, covering widely recognised benchmarks, and generating a large number of cores (in the thousands). These were evaluated for a range of key metrics: silicon di-area, timing, power, instructions-per-clock, and power density, both with and without code optimisation applied. The results obtained demonstrate that one of two competing core models, ‘Wavecore’ (which is proposed in this thesis), delivers superior power density to the standard approach (which it refers to as Composite core), and that this is achieved without significant cost in terms of critical metrics of overall power consumption and critical path delays. Finally, to understand the benefit of these accelerators, these auto-generated cores are analysed in comparison to a standard stack-machine CPU executing the same code sequences. Both the cores generation work and the benchmark CPU assume a 65nm CMOS process node, and are evaluated with industry standard design tools. It is demonstrated that the generated cores achieve better power efficiency improvements over the relatively CPU core
    corecore