451 research outputs found

    Empowering a helper cluster through data-width aware instruction selection policies

    Get PDF
    Narrow values that can be represented by less number of bits than the full machine width occur very frequently in programs. On the other hand, clustering mechanisms enable cost- and performance-effective scaling of processor back-end features. Those attributes can be combined synergistically to design special clusters operating on narrow values (a.k.a. helper cluster), potentially providing performance benefits. We complement a 32-bit monolithic processor with a low-complexity 8-bit helper cluster. Then, in our main focus, we propose various ideas to select suitable instructions to execute in the data-width based clusters. We add data-width information as another instruction steering decision metric and introduce new data-width based selection algorithms which also consider dependency, inter-cluster communication and load imbalance. Utilizing those techniques, the performance of a wide range of workloads are substantially increased; helper cluster achieves an average speedup of 11% for a wide range of 412 apps. When focusing on integer applications, the speedup can be as high as 22% on averagePeer ReviewedPostprint (published version

    Dynamic Dependency Collapsing

    Get PDF
    In this dissertation, we explore the concept of dynamic dependency collapsing. Performance increases in computer architecture are always introduced by exploiting additional parallelism when the clock speed is fixed. We show that further improvements are possible even when the available parallelism in programs are exhausted. This performance improvement is possible due to executing instructions in parallel that would ordinarily have been serialized. We call this concept dependency collapsing. We explore existing techniques that exploit parallelism and show which of them fall under the umbrella of dependency collapsing. We then introduce two dependency collapsing techniques of our own. The first technique collapses data dependencies by executing two normally dependent instructions together by fusing them. We show that exploiting the additional parallelism generated by collapsing these dependencies results in a performance increase. Our second technique collapses resource dependencies to execute instructions that would normally have been serialized due to resource constraints in the processor. We show that it is possible to take advantage of larger in-processor structures while avoiding the power and area penalty this often implies

    Mechanistic modeling of architectural vulnerability factor

    Get PDF
    Reliability to soft errors is a significant design challenge in modern microprocessors owing to an exponential increase in the number of transistors on chip and the reduction in operating voltages with each process generation. Architectural Vulnerability Factor (AVF) modeling using microarchitectural simulators enables architects to make informed performance, power, and reliability tradeoffs. However, such simulators are time-consuming and do not reveal the microarchitectural mechanisms that influence AVF. In this article, we present an accurate first-order mechanistic analytical model to compute AVF, developed using the first principles of an out-of-order superscalar execution. This model provides insight into the fundamental interactions between the workload and microarchitecture that together influence AVF. We use the model to perform design space exploration, parametric sweeps, and workload characterization for AVF

    RingScalar: A Complexity-Effective Out-of-Order Superscalar Microarchitecture

    Get PDF
    RingScalar is a complexity-effective microarchitecture for out-of-order superscalar processors, that reduces the area, latency, and power of all major structures in the instruction flow. The design divides an N-way superscalar into N columns connected in a unidirectional ring, where each column contains a portion of the instruction window, a bank of the register file, and an ALU. The design exploits the fact that most decoded instructions are waiting on just one operand to use only a single tag per issue window entry, and to restrict instruction wakeup and value bypass to only communicate with the neighboring column. Detailed simulations of four-issue single-threaded machines running SPECint2000 show that RingScalar has IPC only 13% lower than an idealized superscalar, while providing large reductions in area, power, and circuit latency

    Instruction history management for high-performance microprocessors

    Get PDF
    textHistory-driven dynamic optimization is an important factor in improving instruction throughput in future high-performance microprocessors. Historybased techniques have the ability to improve instruction-level parallelism by breaking program dependencies, eliminating long-latency microarchitecture operations, and improving prioritization within the microarchitecture. However, a combination of factors, such as wider issue widths, smaller transistors, larger die area, and increasing clock frequency, has led to microprocessors that are sensitive to both wire delays and energy consumption. In this environment, the global structures and long-distance communications that characterize current history data management are limiting instruction throughput. This dissertation proposes the ScatterFlow Framework for Instruction History Management. Execution history management tasks, such as history data storage, access, distribution, collection, and modification, are partitioned and dispersed throughout the instruction execution pipeline. History data packets are then associated with active instructions and flow with the instructions as they execute, encountering the history management tasks along the way. Between dynamic instances of the instructions, the history data packets reside in trace-based history storage that is synchronized with the instruction trace cache. Compared to traditional history data management, this ScatterFlow method improves instruction coverage, increases history data access bandwidth, shortens communication distances, improves history data accuracy in many cases, and decreases the effective history data access time. A comparison of general history management effectiveness between the ScatterFlow Framework and traditional hardware tables shows that the ScatterFlow Framework provides superior history maturity and instruction coverage. The unique properties that arise due to trace-based history storage and partitioned history management are analyzed, and novel design enhancements are presented to increase the usefulness of instruction history data within the ScatterFlow Framework. To demonstrate the potential of the proposed framework, specific dynamic optimization techniques are implemented using the ScatterFlow Framework. These illustrative examples combine the history capture advantages with the access latency improvements while exhibiting desirable dynamic energy consumption properties. Compared to a traditional table-based predictor, performing ScatterFlow value prediction improves execution time and reduces dynamic energy consumption. In other detailed examples, ScatterFlowenabled cluster assignment demonstrates improved execution time over previous cluster assignment schemes, and ScatterFlow instruction-level profiling detects more useful execution traits than traditional fixed-size and infinite-size hardware tables.Electrical and Computer Engineerin

    Coarse-grained reconfigurable array architectures

    Get PDF
    Coarse-Grained Reconfigurable Array (CGRA) architectures accelerate the same inner loops that benefit from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops to execute them more efficiently. This chapter discusses the basic principles of CGRAs, and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on flexibility, performance, and power-efficiency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support and for the manual fine-tuning of source code

    Banked microarchitectures for complexity-effective superscalar microprocessors

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2006.Includes bibliographical references (p. 95-99).High performance superscalar microarchitectures exploit instruction-level parallelism (ILP) to improve processor performance by executing instructions out of program order and by speculating on branch instructions. Monolithic centralized structures with global communications, including issue windows and register files, are used to buffer in-flight instructions and to maintain machine state. These structures scale poorly to greater issue widths and deeper pipelines, as they must support simultaneous global accesses from all active instructions. The lack of scalability is exacerbated in future technologies, which have increasing global interconnect delay and a much greater emphasis on reducing both switching and leakage power. However, these fully orthogonal structures are over-engineered for typical use. Banked microarchitectures that consist of multiple interleaved banks of fewer ported cells can significantly reduce power, area, and latency of these structures.(cont.) Although banked structures exhibit a minor performance penalty, significant reductions in delay and power can potentially be used to increase clock rate and lead to more complexity-effective designs. There are two main contributions in this thesis. First, a speculative control scheme is proposed to simplify the complicated control logic that is involved in managing a less-ported banked register file for high-frequency superscalar processors. Second, the RingScalar architecture, a complexity-effective out-of-order superscalar microarchitecture, based on a ring topology of banked structures, is introduced and evaluated.by Jessica Hui-Chun Tseng.Ph.D
    corecore