    Three-dimensional memory vectorization for high bandwidth media memory systems

    Vector processors offer good performance, cost and adaptability when targeting multimedia applications. However, for a significant number of media programs, conventional memory configurations fail to deliver enough memory references per cycle to feed the SIMD functional units. This paper addresses the problem of memory bandwidth. We propose a novel mechanism suitable for 2-dimensional vector architectures and targeted at providing high effective bandwidth for SIMD memory instructions. The basis of this mechanism is the extension of the scope of vectorization to the memory level, so that 3-dimensional memory patterns can be fetched into a second-level register file. By fetching long blocks of data and by reusing 2-dimensional memory streams at this second-level register file, we obtain a significant increase in effective memory bandwidth. As side benefits, the new 3-dimensional load instructions provide high robustness to memory latency and a significant reduction in cache activity, thus reducing power and energy requirements. At the cost of 50% more area than a regular SIMD register file, we measured an average speed-up of 13% and potential L2 cache power savings of 30%.
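
    For reference, the access pattern such a 3-dimensional load would cover can be written out in plain C. This is a hedged sketch only: the function name load_3d and its stride/length parameters are hypothetical, since the paper defines the actual instruction format. The inner two loops correspond to a conventional 2-dimensional vector access; the outer loop is the added dimension that lets a single instruction fetch a reusable block of 2-dimensional streams.

        /* Hypothetical sketch of the memory pattern one 3-D vector load spans. */
        void load_3d(const float *base, float *dst,
                     int nx, int ny, int nz,
                     long stride_x, long stride_y, long stride_z) {
            for (int z = 0; z < nz; z++)              /* added third dimension */
                for (int y = 0; y < ny; y++)          /* rows of one 2-D stream */
                    for (int x = 0; x < nx; x++)      /* elements within a row */
                        *dst++ = base[z * stride_z + y * stride_y + x * stride_x];
        }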

    Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load/Store Optimization

    A high-bandwidth, low-latency load-store unit is a critical component of a dynamically scheduled processor. Unfortunately, it is also one of the most complex and non-scalable components. Recently, several researchers have proposed techniques that simplify the core load-store unit and improve its scalability in exchange for the in-order pre-retirement re-execution of some subset of the loads in the program. We call such techniques load/store optimizations. One recent optimization attacks load queue (LQ) scalability by replacing the expensive associative search used to enforce intra- and inter-thread ordering with load re-execution. A second attacks store queue (SQ) scalability by speculatively filtering some load accesses and some store entries from it. The speculatively accessed, speculatively populated SQ can be made smaller and faster, but load re-execution is required to verify the speculation. A third uses a hardware table to identify redundant loads and skip their execution altogether. Redundant load elimination is highly accurate but not 100%, so re-execution is needed to flag false eliminations. Unfortunately, the inherent benefits of load/store optimizations are mitigated by re-execution itself. Re-execution contends for cache bandwidth with store retirement and serializes load re-execution with subsequent store retirement. If a technique requires enough load re-executions, their cost will outweigh its benefits entirely and may even produce drastic slowdowns. This is the case for the SQ technique. Store Vulnerability Window (SVW) is a new mechanism that reduces the re-execution requirements of a given load/store optimization significantly, by an average of 85% across the three load/store optimizations we study. This reduction relieves cache port contention and removes many of the dynamic serialization events that contribute the bulk of re-execution’s cost, allowing these techniques to perform up to their full potential. For the scalable SQ optimization, this means the chance to perform at all: without SVW, the technique posts significant slowdowns. SVW is a simple scheme based on monotonic store sequence numbering and a novel application of Bloom filtering. The cost of an effective SVW implementation is a 1KB buffer and a 2B field per LQ entry.
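
    The filtering test itself is compact enough to sketch in C. The sketch below assumes a direct-mapped table of store sequence numbers hashed by address, which is the spirit of the paper's Bloom-filter application; the names svw_table, svw_store_commit, and svw_must_reexecute are illustrative, not taken from the paper. A store commit records the store's sequence number under its address hash; a load re-executes only if the table shows a store younger than the one the load is already known to be consistent with.

        #include <stdint.h>
        #include <stdbool.h>

        #define SVW_SIZE 1024  /* hashed table entries (illustrative size) */

        static uint32_t svw_table[SVW_SIZE]; /* SSN of last committed store per hash */
        static uint32_t store_ssn;           /* monotonic store sequence number */

        static unsigned svw_hash(uint64_t addr) {
            return (unsigned)((addr >> 2) % SVW_SIZE); /* word-granularity hash */
        }

        /* On store commit: record this store's sequence number for its address. */
        void svw_store_commit(uint64_t addr) {
            svw_table[svw_hash(addr)] = ++store_ssn;
        }

        /* At pre-retirement: a load already consistent with all stores up to
           SSN verified_ssn must re-execute only if a younger store may have
           written its address. Hash aliasing can force a spurious re-execution
           but can never filter out a needed one. */
        bool svw_must_reexecute(uint64_t addr, uint32_t verified_ssn) {
            return svw_table[svw_hash(addr)] > verified_ssn;
        }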

    First steps towards the certification of an ARM simulator using Compcert

    The simulation of Systems-on-Chip (SoC) is nowadays a hot topic because, beyond providing many debugging facilities, it allows the development of dedicated software before the hardware is available. Low-power CPUs such as ARM play a central role in SoCs. However, the effectiveness of simulation depends on the faithfulness of the simulator. To this end, we propose to prove significant parts of such a simulator, SimSoC. Basically, on one hand, we develop a Coq formal model of the ARM architecture, while on the other hand, we consider a version of the simulator including components written in Compcert-C. We then prove that the simulation of ARM operations, according to the Compcert-C formal semantics, conforms to the expected formal model of ARM. Size issues are partly dealt with by automatically generating significant parts of the Coq model and of SimSoC from the official textual definition of ARM. However, this is still a long-term project; we report the current stage of our efforts and discuss in particular the use of Compcert-C in this framework. (Comment: First International Conference on Certified Programs and Proofs, LNCS 7086, 2011.)

    Overview of Hydra: a concurrent language for synchronous digital circuit design

    Hydra is a computer hardware description language that integrates several kinds of software tools (simulation, netlist generation and timing analysis) within a single circuit specification. The design language is inherently concurrent, and it offers black-box abstraction and general design patterns that simplify the design of circuits with regular structure. Hydra specifications are concise, allowing the complete design of a computer system as a digital circuit within a few pages. This paper discusses the motivations behind Hydra and illustrates the system with a significant portion of the design of a basic RISC processor.

    Design and Verification of a Reusable Self-Reconfigurable Gate Array Architecture

    This thesis presents the design and verification of a Self-Reconfigurable Gate Array architecture (SRGA-UT) created for reuse, and available with a step-by-step tutorial and comprehensive documentation. The original SRGA [1], created at the University of Southern California, is an innovative reconfigurable-device architecture that allows single-cycle context switching and single-cycle random access to a unified on-chip configuration/data memory. The key architectural feature enabling these two capabilities is a mesh-of-trees interconnect with logic cells and memory blocks at the leaf nodes and identical switches at the parent nodes. The SRGA-UT adapts the original design with the modifications necessary to implement it using the electronic design automation tools available at the University of Tennessee. An 8x8 array of PEs (Processing Elements) was synthesized and routed targeting a standard-cell library for a 0.18 μm process. The synthesized design stores eight configuration contexts in each PE (this number can be modified by editing the Verilog files). The place and route produced a core size of 5,413,300 μm² containing 354,053 gates. The step-by-step tutorial demonstrates that the SRGA-UT design can switch contexts and perform memory access operations in a single clock cycle. ModelSim was used for verification and simulation at all levels, Design Compiler performed synthesis and produced the netlist, and First Encounter SoC performed place and route and generated the delay constraints.
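
    The single-cycle context switch is easy to picture with a small C model. This is a hypothetical software analogue, not the thesis's Verilog: each PE holds an array of configuration contexts plus an index selecting the active one, so switching contexts amounts to updating one register, which hardware can do in every PE on the same clock edge. The names pe_t and switch_context are illustrative.

        #define NUM_CONTEXTS 8   /* matches the eight contexts stored per PE */

        typedef struct {
            unsigned contexts[NUM_CONTEXTS]; /* configuration word per context */
            unsigned active;                 /* index of the live configuration */
        } pe_t;

        /* Array-wide context switch: one index update per PE and no data
           movement, modelling what the hardware does in a single clock cycle. */
        static void switch_context(pe_t pes[], int n, unsigned ctx) {
            for (int i = 0; i < n; i++)
                pes[i].active = ctx % NUM_CONTEXTS;
        }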

    A case for merging the ILP and DLP paradigms

    The goal of this paper is to show that instruction-level parallelism (ILP) and data-level parallelism (DLP) can be merged in a single architecture to execute vectorizable code at a performance level that cannot be achieved using either paradigm on its own. We show that the combination of the two techniques yields very high performance at low cost and low complexity, and that this architecture can reach performance equivalent to a superscalar processor sustaining 10 instructions per cycle. The machine exploiting both types of parallelism improves upon the ILP-only machine by factors of 1.5-1.8. We also present a study of the scalability of both paradigms and show that, when resources are increased to reach a 16-issue machine, the advantage of the ILP+DLP machine over the ILP-only machine grows to factors of 2.0-3.45. While the peak IPC achieved by the ILP machine is 4, the ILP+DLP machine exceeds 10 instructions per cycle.