390 research outputs found

    Simulation of the UKQCD computer

    Get PDF

    BandwidthBreach: Unleashing Covert and Side Channels through Cache Bandwidth Exploitation

    Full text link
    In the modern CPU architecture, enhancements such as the Line Fill Buffer (LFB) and Super Queue (SQ), which are designed to track pending cache requests, have significantly boosted performance. To exploit this structures, we deliberately engineered blockages in the L2 to L1d route by controlling LFB conflict and triggering prefetch prediction failures, while consciously dismissing other plausible influencing factors. This approach was subsequently extended to the L3 to L2 and L2 to L1i pathways, resulting in three potent covert channels, termed L2CC, L3CC, and LiCC, with capacities of 10.02 Mbps, 10.37 Mbps, and 1.83 Mbps, respectively. Strikingly, the capacities of L2CC and L3CC surpass those of earlier non-shared-memory-based covert channels, reaching a level comparable to their shared memory-dependent equivalents. Leveraging this congestion further facilitated the extraction of key bits from RSA and EdDSA implementations. Coupled with SpectreV1 and V2, our covert channels effectively evade the majority of traditional Spectre defenses. Their confluence with Branch Prediction (BP) Timing assaults additionally undercuts balanced branch protections, hence broadening their capability to infiltrate a wide range of cryptography libraries

    Prefetching techniques for client server object-oriented database systems

    Get PDF
    The performance of many object-oriented database applications suffers from the page fetch latency which is determined by the expense of disk access. In this work we suggest several prefetching techniques to avoid, or at least to reduce, page fetch latency. In practice no prediction technique is perfect and no prefetching technique can entirely eliminate delay due to page fetch latency. Therefore we are interested in the trade-off between the level of accuracy required for obtaining good results in terms of elapsed time reduction and the processing overhead needed to achieve this level of accuracy. If prefetching accuracy is high then the total elapsed time of an application can be reduced significantly otherwise if the prefetching accuracy is low, many incorrect pages are prefetched and the extra load on the client, network, server and disks decreases the whole system performance. Access pattern of object-oriented databases are often complex and usually hard to predict accurately. The ..

    Irregular accesses reorder unit: improving GPGPU memory coalescing for graph-based workloads

    Get PDF
    GPGPU architectures have become the dominant platform for massively parallel workloads, delivering high performance and energy efficiency for popular applications such as machine learning, computer vision or self-driving cars. However, irregular applications, such as graph processing, fail to fully exploit GPGPU resources due to their divergent memory accesses that saturate the memory hierarchy. To reduce the pressure on the memory subsystem for divergent memory-intensive applications, programmers must take into account SIMT execution model and memory coalescing in GPGPUs, devoting significant efforts in complex optimization techniques. Despite these efforts, we show that irregular graph processing still suffers from low GPGPU performance. We observe that in many irregular applications the mapping of data to threads can be safely changed. In other words, it is possible to relax the strict relationship between thread and data processed to reduce memory divergence. Based on this observation, we propose the Irregular accesses Reorder Unit (IRU), a novel hardware extension tightly integrated in the GPGPU pipeline. The IRU reorders data processed by the threads on irregular accesses to improve memory coalescing, i.e., it tries to assign data elements to threads as to produce coalesced accesses in SIMT groups. Furthermore, the IRU is capable of filtering and merging duplicated accesses, significantly reducing the workload. Programmers can easily utilize the IRU with a simple API, or let the compiler issue instructions from our extended ISA. We evaluate our proposal for state-of-the-art graph-based algorithms and a wide selection of applications. Results show that the IRU achieves a memory coalescing improvement of 1.32x and a 46% reduction in the overall traffic in the memory hierarchy, which results in 1.33x speedup and 13% energy savings on average, while incurring in a small 5.6% area overhead.This work has been supported by the CoCoUnit ERC Advanced Grant of the EU’s Horizon 2020 program (grant No 833057), the Spanish State Research Agency (MCIN/AEI) under grant PID2020-113172RB-I00 and the ICREA Academia program.Peer ReviewedPostprint (published version

    High performance and energy-efficient instruction cache design and optimisation for ultra-low-power multi-core clusters

    Get PDF
    High Energy efficiency and high performance are the key regiments for Internet of Things (IoT) end-nodes. Exploiting cluster of multiple programmable processors has recently emerged as a suitable solution to address this challenge. However, one of the main bottlenecks for multi-core architectures is the instruction cache. While private caches fall into data replication and wasting area, fully shared caches lack scalability and form a bottleneck for the operating frequency. Hence we propose a hybrid solution where a larger shared cache (L1.5) is shared by multiple cores connected through a low-latency interconnect to small private caches (L1). However, it is still limited by large capacity miss with a small L1. Thus, we propose a sequential prefetch from L1 to L1.5 to improve the performance with little area overhead. Moreover, to cut the critical path for better timing, we optimized the core instruction fetch stage with non-blocking transfer by adopting a 4 x 32-bit ring buffer FIFO and adding a pipeline for the conditional branch. We present a detailed comparison of different instruction cache architectures' performance and energy efficiency recently proposed for Parallel Ultra-Low-Power clusters. On average, when executing a set of real-life IoT applications, our two-level cache improves the performance by up to 20% and loses 7% energy efficiency with respect to the private cache. Compared to a shared cache system, it improves performance by up to 17% and keeps the same energy efficiency. In the end, up to 20% timing (maximum frequency) improvement and software control enable the two-level instruction cache with prefetch adapt to various battery-powered usage cases to balance high performance and energy efficiency

    Architectural Enhancements for Data Transport in Datacenter Systems

    Full text link
    Datacenter systems run myriad applications, which frequently communicate with each other and/or Input/Output (I/O) devices—including network adapters, storage devices, and accelerators. Due to the growing speed of I/O devices and the emergence of microservice-based programming models, the I/O software stacks have become a critical factor in end-to-end communication performance. As such, I/O software stacks have been evolving rapidly in recent years. Datacenters rely on fast, efficient “Software Data Planes”, which orchestrate data transfer between applications and I/O devices. The goal of this dissertation is to enhance the performance, efficiency, and scalability of software data planes by diagnosing their existing issues and addressing them through hardware-software solutions. In the first step, I characterize challenges of modern software data planes, which bypass the operating system kernel to avoid associated overheads. Since traditional interrupts and system calls cannot be delivered to user code without kernel assistance, kernel-bypass data planes use spinning cores on I/O queues to identify work/data arrival. Spin-polling obviously wastes CPU cycles on checking empty queues; however, I show that it entails even more drawbacks: (1) Full-tilt spinning cores perform more (useless) polling work when there is less work pending in the queues. (2) Spin-polling scales poorly with the number of polled queues due to processor cache capacity constraints, especially when traffic is unbalanced. (3) Spin-polling also scales poorly with the number of cores due to the overhead of polling and operation rate limits. (4) Whereas shared queues can mitigate load imbalance and head-of-line blocking, synchronization overheads of spinning on them limit their potential benefits. Next, I propose a notification accelerator, dubbed HyperPlane, which replaces spin-polling in software data planes. Design principles of HyperPlane are: (1) not iterating on empty I/O queues to find work/data in ready ones, (2) blocking/halting when all queues are empty rather than spinning fruitlessly, and (3) allowing multiple cores to efficiently monitor a shared set of queues. These principles lead to queue scalability, work proportionality, and enjoying theoretical merits of shared queues. HyperPlane is realized with a programming model front-end and a hardware microarchitecture back-end. Evaluation of HyperPlane shows its significant advantage in terms of throughput, average/tail latency, and energy efficiency over a state-of-the-art spin-polling-based software data plane, with very small power and area overheads. Finally, I focus on the data transfer aspect in software data planes. Cache misses incurred by accessing I/O data are a major bottleneck in software data planes. Despite considerable efforts put into delivering I/O data directly to the last-level cache, some access latency is still exposed. Cores cannot prefetch such data to nearer caches in today's systems because of the complex access pattern of data buffers and the lack of an appropriate notification mechanism that can trigger the prefetch operations. As such, I propose HyperData, a data transfer accelerator based on targeted prefetching. HyperData prefetches exact (rather than predicted) data buffers (or a required subset to avoid cache pollution) to the L1 cache of the consumer core at the right time. Prefetching can be done for both core-peripheral and core-core communications. HyperData's prefetcher is programmable and supports various queue formats—namely, direct (regular), indirect (Virtio), and multi-consumer queues. I show that with a minor overhead, HyperData effectively hides data access latency in software data planes, thereby improving both application- and system-level performance and efficiency.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/169826/1/hosseing_1.pd
    • 

    corecore