    An Investigation of trace cache organizations for superscalar processors

    Techniques such as out-of-order issue and speculative execution aggressively exploit instruction-level parallelism in modern superscalar processor architectures. The front end of such pipelined machines is concerned with providing a stream of schedulable instructions at a bandwidth that meets or exceeds the rate at which instructions are issued and executed. As superscalar machines become increasingly wide, it is inevitable that the large set of instructions to be fetched every cycle will span multiple noncontiguous basic blocks. The mechanism that fetches, aligns, and passes this set of instructions down the pipeline must do so as efficiently as possible, occupying a minimal number of pipeline cycles. The trace cache has emerged as the most promising technique to meet this high-bandwidth, low-latency fetch requirement. This thesis presents the design, simulation, and analysis of a microarchitecture simulator extension that incorporates a trace cache. A new fill unit scheme, the Sliding Window Fill Mechanism, is proposed. This method exploits trace continuity and identifies probable start regions to improve trace cache hit rate. A 7% hit rate increase was observed over the Rotenberg fill mechanism. Combined with branch promotion, trace cache hit rates experienced a 19% average increase along with a 17% average rise in fetch bandwidth.
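    To make the fill-unit idea concrete, the following minimal Python sketch models a trace cache whose fill logic builds one candidate trace at every position of a sliding window of retired basic blocks, so that traces beginning inside an earlier trace are also captured. The trace-length limit, cache capacity, replacement policy, and class names are assumptions made for the example; they do not reproduce the thesis's actual Sliding Window Fill Mechanism.

from collections import OrderedDict, deque

MAX_BLOCKS_PER_TRACE = 3    # assumed trace length limit, in basic blocks
TRACE_CACHE_ENTRIES = 64    # assumed trace cache capacity

class TraceCache:
    """Trace cache indexed by the starting PC of each trace (LRU replacement)."""
    def __init__(self):
        self.entries = OrderedDict()    # start PC -> tuple of block start PCs
        self.hits = 0
        self.lookups = 0

    def lookup(self, pc):
        self.lookups += 1
        trace = self.entries.get(pc)
        if trace is not None:
            self.hits += 1
            self.entries.move_to_end(pc)        # refresh LRU position
        return trace

    def fill(self, trace):
        self.entries[trace[0]] = trace
        if len(self.entries) > TRACE_CACHE_ENTRIES:
            self.entries.popitem(last=False)    # evict the LRU trace

def sliding_window_fill(retired_block_pcs, cache):
    """Build one candidate trace per position of a sliding window of retired
    basic blocks, so probable start regions inside earlier traces get their
    own trace cache entries."""
    window = deque(maxlen=MAX_BLOCKS_PER_TRACE)
    for pc in retired_block_pcs:
        window.append(pc)
        if len(window) == MAX_BLOCKS_PER_TRACE:
            cache.fill(tuple(window))

if __name__ == "__main__":
    # Made-up stream of retired basic-block start addresses (a 4-block loop).
    stream = [0x400, 0x440, 0x480, 0x4c0] * 50
    tc = TraceCache()
    sliding_window_fill(stream, tc)
    for pc in stream:
        tc.lookup(pc)
    print(f"trace cache hit rate: {tc.hits / tc.lookups:.0%}")

    For comparison, a conventional fill unit would, roughly, start each new trace only where the previous one ended; in this toy model the sliding window simply fills at every window position instead.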

    Cross-core Microarchitectural Attacks and Countermeasures

    In the last decade, multi-threaded systems and resource sharing have brought a number of technologies that facilitate our daily tasks in a way we never imagined. Among others, cloud computing has emerged to offer us powerful computational resources without having to physically acquire and install them, while smartphones have almost acquired the same importance desktop computers had a decade ago. This has only been possible thanks to the ever-evolving performance optimizations made to modern microarchitectures that efficiently manage concurrent usage of hardware resources. One of the aforementioned optimizations is the usage of shared Last Level Caches (LLCs) to balance different CPU core loads and to maintain coherency between shared memory blocks utilized by different cores. The latter, for instance, has enabled concurrent execution of several processes on low-RAM devices such as smartphones. Although efficient hardware resource sharing has become the de facto model for several modern technologies, it also poses a major concern with respect to security. Some of the concurrently executed co-resident processes might in fact be malicious and try to take advantage of hardware proximity. New technologies usually claim to be secure by implementing sandboxing techniques and executing processes in isolated software environments, called Virtual Machines (VMs). However, the design of these isolated environments aims at preventing pure software-based attacks and usually does not consider hardware leakages. In fact, the malicious utilization of hardware resources as covert channels might have severe consequences for the privacy of the customers. Our work demonstrates that malicious customers of such technologies can utilize the LLC as a covert channel to obtain sensitive information from a co-resident victim. We show that the LLC is an attractive resource to be targeted by attackers, as it offers high resolution and, unlike previous microarchitectural attacks, does not require core co-location. Particularly concerning are the cases in which cryptography is compromised, as it is the main component of every security solution. In this sense, the presented work not only introduces three attack variants that are applicable in different scenarios, but also demonstrates the ability to recover cryptographic keys (e.g. AES and RSA) and TLS session messages across VMs, bypassing sandboxing techniques. Finally, two countermeasures to prevent microarchitectural attacks in general, and LLC attacks in particular, from retrieving fine-grained information are presented. Unlike previously proposed countermeasures, ours do not add permanent overheads to the system but can be utilized as preemptive defenses. The first identifies leakages in cryptographic software that can potentially lead to key extraction, and thus can be utilized by cryptographic code designers to ensure the sanity of their libraries before deployment. The second detects microarchitectural attacks embedded into innocent-looking binaries, preventing them from being posted in official application repositories that usually have the full trust of the customer.
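    As an illustration of why a shared LLC makes a usable covert channel, the toy simulation below sketches a PRIME+PROBE-style exchange over a single cache set: the receiver primes the set with its own lines, the sender either evicts them or stays idle to encode a bit, and the receiver's probe latency recovers that bit. The associativity, latencies, and LRU set model are assumptions of the sketch; a real cross-VM attack measures access times on hardware and must first locate the LLC sets it shares with the victim.

import random

WAYS = 16            # assumed LLC set associativity
HIT_LATENCY = 40     # assumed cycles for a cache hit
MISS_LATENCY = 200   # assumed cycles for a memory access

class CacheSet:
    """One LLC set with LRU replacement, shared by sender and receiver."""
    def __init__(self):
        self.lines = []                  # LRU order: oldest line first

    def access(self, tag):
        if tag in self.lines:
            self.lines.remove(tag)
            self.lines.append(tag)
            return HIT_LATENCY
        self.lines.append(tag)
        if len(self.lines) > WAYS:
            self.lines.pop(0)            # evict the least recently used line
        return MISS_LATENCY

def prime(cache):
    """Receiver fills the set with its own lines."""
    for i in range(WAYS):
        cache.access(("receiver", i))

def transmit(cache, bit):
    """Sender evicts the receiver's lines to signal a 1, stays idle for a 0."""
    if bit:
        for i in range(WAYS):
            cache.access(("sender", i))

def probe(cache):
    """Receiver re-accesses its lines; a slow probe means they were evicted."""
    return sum(cache.access(("receiver", i)) for i in range(WAYS))

if __name__ == "__main__":
    message = [random.randint(0, 1) for _ in range(16)]
    cache = CacheSet()
    received = []
    for bit in message:
        prime(cache)
        transmit(cache, bit)
        received.append(int(probe(cache) > WAYS * HIT_LATENCY))
    print("sent:    ", message)
    print("received:", received)

    The same prime-and-probe structure also underlies the attack setting, where the role of the sender is played by an unwitting victim whose cache footprint depends on secret data.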

    Last-touch correlated data streaming

    Recent research advocates address-correlating predictors to identify cache block addresses for prefetch. Unfortunately, address-correlating predictors require correlation data storage proportional in size to a program's active memory footprint. As a result, current proposals for this class of predictor are either limited in coverage due to constrained on-chip storage requirements or limited in prediction lookahead due to long off-chip correlation data lookups. In this paper, we propose Last-Touch Correlated Data Streaming (LT-cords), a practical address-correlating predictor. The key idea of LT-cords is to record correlation data off chip in the order they will be used and stream them into a practically sized on-chip table shortly before they are needed, thereby obviating the need for scalable on-chip tables and enabling low-latency lookup. We use cycle-accurate simulation of an 8-way out-of-order superscalar processor to show that: (1) LT-cords with 214KB of on-chip storage can achieve the same coverage as a last-touch predictor with unlimited storage, without sacrificing predictor lookahead, and (2) LT-cords improves performance by 60% on average and 385% at best in the benchmarks studied. © 2007 IEEE
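    The streaming idea can be sketched as follows: correlation records stay off chip in the order they were recorded, and only a small window of them is held on chip at any time, advanced whenever a miss matches the stream. The window size, log format, and the "hidden misses" metric below are assumptions made for illustration, not the paper's actual LT-cords structures.

from collections import deque

ONCHIP_WINDOW = 8    # assumed size of the small on-chip table, in entries

class StreamedPredictor:
    """Correlation records live off chip in recorded order; only a small
    window of them is streamed on chip at any time."""
    def __init__(self, offchip_log):
        self.log = offchip_log           # ordered off-chip correlation data
        self.cursor = 0                  # next log entry to stream on chip
        self.window = deque()
        self._stream()

    def _stream(self):
        while len(self.window) < ONCHIP_WINDOW and self.cursor < len(self.log):
            self.window.append(self.log[self.cursor])
            self.cursor += 1

    def predict(self, miss_addr):
        """On a miss, look only in the on-chip window; a match advances the
        stream and returns the addresses expected to be needed next."""
        if miss_addr in self.window:
            while self.window.popleft() != miss_addr:
                pass                     # drop stale entries that were skipped
            self._stream()
            return list(self.window)     # prefetch candidates
        return []

if __name__ == "__main__":
    # Off-chip log recorded on a previous run of the same access pattern.
    recorded = [0x1000 + 0x40 * i for i in range(64)]
    pred = StreamedPredictor(recorded)
    prefetched, hidden = set(), 0
    for addr in recorded:                # replay the same miss stream
        if addr in prefetched:
            hidden += 1                  # this miss was prefetched in time
        prefetched.update(pred.predict(addr))
    print(f"misses hidden by streamed prefetches: {hidden}/{len(recorded)}")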

    Improving cache locality for thread-level speculation

    Improving trace cache hit rates using the sliding window fill mechanism and fill select table

    Scalable service for flexible access to personal content

    Cache Attacks and Defenses

    In the digital age, as our daily lives depend heavily on interconnected computing devices, information security has become a crucial concern. The continuous exchange of data between devices over the Internet leaves our information vulnerable to potential security breaches. Yet, even with measures in place to protect devices, computing equipment inadvertently leaks information through side channels, which emerge as byproducts of computational activities. One particular source of such side channels is the cache, a vital component of modern processors that enhances computational speed by storing frequently accessed data from random access memory (RAM). Due to their limited capacity, caches often need to be shared among concurrently running applications, resulting in vulnerabilities. Cache side-channel attacks, which exploit such vulnerabilities, have received significant attention due to their ability to stealthily compromise information confidentiality and the challenge in detecting and countering them. Consequently, numerous defense strategies have been proposed to mitigate these attacks. This thesis explores these defense strategies against cache side-channels, assesses their effectiveness, and identifies potential vulnerabilities that could be used to undermine them. The first contribution of this thesis is a software framework to assess the security of secure cache designs. We show that while most secure caches are protected from eviction-set-based attacks, they are vulnerable to occupancy-based attacks, which work just as well as eviction-set-based attacks and should therefore be taken into account when designing and evaluating secure caches. Our second contribution presents a method that utilizes speculative execution to enable high-resolution attacks on low-resolution timers, a common cache attack countermeasure adopted by web browsers. We demonstrate that our technique not only allows high-resolution attacks to be performed with low-resolution timers, but is also Turing-complete and capable of performing robust calculations on cache states. Through this research, we uncover a new attack vector on low-resolution timers. By exposing this vulnerability, we hope to prompt the necessary measures to address the issue and enhance the security of systems in the future. Our third contribution is a survey, paired with an experimental assessment, of cache side-channel attack detection techniques that use hardware performance counters. We show that, despite numerous claims regarding their efficacy, most detection techniques fail to properly evaluate their performance, leaving them vulnerable to more advanced attacks. We identify and outline these shortcomings, and furnish experimental evidence to corroborate our findings. Furthermore, we demonstrate a new attack that is capable of compromising these detection methods. Our aim is to bring attention to these shortcomings and provide insights that can aid in the development of more robust cache side-channel attack detection techniques. This thesis contributes to a deeper comprehension of cache side-channel attacks and their potential effects on information security. Furthermore, it offers valuable insights into the efficacy of existing mitigation approaches and detection methods, while identifying areas for future research and development to better safeguard our computing devices and data from these insidious attacks.
    Thesis (MPhil) -- University of Adelaide, School of Computer and Mathematical Sciences, 202
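    The occupancy-based attacks mentioned above can be illustrated with a toy model: instead of building eviction sets for particular cache sets, the attacker fills the whole cache with its own buffer and infers the victim's activity from how much of that buffer is later found evicted. The cache size, random-replacement model, and latencies below are assumptions made for the sketch, not the framework or attacks evaluated in the thesis.

import random

CACHE_LINES = 1024   # assumed total number of LLC lines
HIT_LATENCY = 40     # assumed cycles for a cache hit
MISS_LATENCY = 200   # assumed cycles for a memory access

def touch(cache, line):
    """Access one line; on a miss, evict a random resident line if full."""
    if line in cache:
        return HIT_LATENCY
    if len(cache) >= CACHE_LINES:
        cache.remove(random.choice(tuple(cache)))
    cache.add(line)
    return MISS_LATENCY

def probe_occupancy(cache):
    """Attacker re-touches its whole buffer; the total latency reveals how
    much of it the victim pushed out of the cache."""
    return sum(touch(cache, ("attacker", i)) for i in range(CACHE_LINES))

if __name__ == "__main__":
    for secret in (64, 256, 512):        # victim working-set sizes to distinguish
        # The attacker first fills the entire cache with its own buffer.
        cache = {("attacker", i) for i in range(CACHE_LINES)}
        # Victim activity depends on its secret; no eviction sets are needed.
        for i in range(secret):
            touch(cache, ("victim", i))
        print(f"victim touched {secret:4d} lines -> probe latency {probe_occupancy(cache)}")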

    Data Resource Management in Throughput Processors

    Graphics Processing Units (GPUs) are becoming common in data centers for tasks like neural network training and image processing due to their high performance and efficiency. GPUs maintain high throughput by running thousands of threads simultaneously, issuing instructions from ready threads to hide latency in others that are stalled. While this is effective for keeping the arithmetic units busy, the challenge in GPU design is moving the data for computation at the same high rate. Any inefficiency in data movement and storage will compromise the throughput and energy efficiency of the system. Since energy consumption and cooling make up a large part of the cost of provisioning and running a data center, making GPUs more suitable for this environment requires removing the bottlenecks and overheads that limit their efficiency. The performance of GPU workloads is often limited by the throughput of the memory resources inside each GPU core, and though many of the power-hungry structures in CPUs are not found in GPU designs, there is overhead for storing each thread's state. When sharing a GPU between workloads, contention for resources also causes interference and slowdown. This thesis develops techniques to manage and streamline the data movement and storage resources in GPUs in each of these places. The first part of this thesis resolves data movement restrictions inside each GPU core. The GPU memory system is optimized for sequential accesses, but many workloads load data in irregular or transposed patterns that cause a throughput bottleneck even when all loads are cache hits. This work identifies and leverages opportunities to merge requests across threads before sending them to the cache. While requests are waiting for merges, they can be reordered to achieve a higher cache hit rate. These methods yielded a 38% speedup for memory-throughput-limited workloads. Another opportunity for optimization is found in the register file. Since it must store the registers for thousands of active threads, it is the largest on-chip data storage structure on a GPU. The second work in this thesis replaces the register file with a smaller, more energy-efficient register buffer. Compiler directives allow the GPU to know ahead of time which registers will be accessed, allowing the hardware to store only the registers that will be imminently accessed in the buffer, with the rest moved to main memory. This technique reduced total GPU energy by 11%. Finally, in a data center, many different applications will be launching GPU jobs, and just as multiple processes can share the same CPU to increase its utilization, running multiple workloads on the same GPU can increase its overall throughput. However, co-runners interfere with each other in unpredictable ways, especially when sharing memory resources. The final part of this thesis controls this interference, allowing a GPU to be shared between two tiers of workloads: one tier with a high performance target and another suitable for batch jobs without deadlines. At a 90% performance target, this technique increased GPU throughput by 9.3%. GPUs' high efficiency and performance make them a valuable accelerator in the data center. The contributions in this thesis further increase their efficiency by removing data movement and storage overheads and unlock additional performance by enabling resources to be shared between workloads while controlling interference.
    PHD
    Computer Science & Engineering
    University of Michigan, Horace H. Rackham School of Graduate Studies
    https://deepblue.lib.umich.edu/bitstream/2027.42/146122/1/jklooste_1.pd
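    The request-merging opportunity described above comes down to how many distinct cache lines a group of concurrent thread loads actually touches. The sketch below shows how strongly that depends on the access pattern; the warp size and line size are assumed values, and the thesis's mechanism additionally merges across requests waiting in the memory pipeline and reorders them, which this toy does not model.

LINE_SIZE = 128   # assumed cache line size in bytes
WARP_SIZE = 32    # threads issuing one 4-byte load each

def merge_requests(byte_addresses):
    """Collapse per-thread byte addresses onto unique cache-line requests,
    preserving the order in which lines are first needed."""
    lines, seen = [], set()
    for addr in byte_addresses:
        line = addr // LINE_SIZE
        if line not in seen:
            seen.add(line)
            lines.append(line)
    return lines

if __name__ == "__main__":
    # Sequential pattern: consecutive threads load adjacent 4-byte words.
    sequential = [tid * 4 for tid in range(WARP_SIZE)]
    # Transposed (column-major) pattern: consecutive threads load a full row apart.
    row_bytes = 256 * 4
    transposed = [tid * row_bytes for tid in range(WARP_SIZE)]
    for name, pattern in (("sequential", sequential), ("transposed", transposed)):
        txns = merge_requests(pattern)
        print(f"{name:10s}: {WARP_SIZE} thread loads -> {len(txns)} cache transactions")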

    Improving Trace Cache Hit Rates using the Sliding Window Fill Mechanism and Fill Select Table

    As superscalar processors become increasingly wide, it is inevitable that the large set of instructions to be fetched every cycle will span multiple noncontiguous basic blocks. The mechanism to fetch, align, and pass this set of instructions down the pipeline must do so as efficiently as possible. The trace cache has emerged as the most promising technique to meet this high-bandwidth, low-latency fetch requirement. A new fill unit scheme, the Sliding Window Fill Mechanism, is proposed as a method to efficiently populate the trace cache. This method exploits trace continuity and identifies probable start regions to improve trace cache hit rate. Simulation yields a 7% average hit rate increase over the Rotenberg fill mechanism. When combined with branch promotion, trace cache hit rates experienced a 19% average increase along with a 17% average improvement in fetch bandwidth.
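    Branch promotion, mentioned above, can be sketched as a simple fill-unit heuristic: a branch that has resolved the same way for many consecutive executions is promoted and embedded in a trace as statically predicted, freeing dynamic-prediction slots and allowing longer traces to be built. The threshold and bookkeeping below are illustrative assumptions, not the scheme evaluated in the paper.

import random

PROMOTION_THRESHOLD = 32   # assumed consecutive same-direction outcomes

class BranchPromoter:
    """Track how many times in a row each branch went the same way; past a
    threshold the branch is treated as statically predicted inside a trace."""
    def __init__(self):
        self.history = {}   # branch PC -> (last outcome, consecutive count)

    def update(self, pc, taken):
        last, count = self.history.get(pc, (None, 0))
        self.history[pc] = (taken, count + 1 if taken == last else 1)

    def is_promoted(self, pc):
        return self.history.get(pc, (None, 0))[1] >= PROMOTION_THRESHOLD

if __name__ == "__main__":
    promoter = BranchPromoter()
    loop_branch, data_branch = 0x400, 0x480
    for _ in range(1000):
        promoter.update(loop_branch, taken=True)                   # heavily biased
        promoter.update(data_branch, taken=random.random() < 0.5)  # unbiased
    print("loop branch promoted:", promoter.is_promoted(loop_branch))
    print("data-dependent branch promoted:", promoter.is_promoted(data_branch))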