1,233 research outputs found

    Hierarchical clustered register file organization for VLIW processors

    Get PDF
    Technology projections indicate that wire delays will become one of the biggest constraints in future microprocessor designs. To avoid long wire delays and therefore long cycle times, processor cores must be partitioned into components so that most of the communication is done locally. In this paper, we propose a novel register file organization for VLIW cores that combines clustering with a hierarchical register file organization. Functional units are organized in clusters, each one with a local first level register file. The local register files are connected to a global second level register file, which provides access to memory. All intercluster communications are done through the second level register file. This paper also proposes MIRS-HC, a novel modulo scheduling technique that simultaneously performs instruction scheduling, cluster selection, inserts communication operations, performs register allocation and spill insertion for the proposed organization. The results show that although more cycles are required to execute applications, the execution time is reduced due to a shorter cycle time. In addition, the combination of clustering and hierarchy provides a larger design exploration space that trades-off performance and technology requirements.Peer ReviewedPostprint (published version

    A unified modulo scheduling and register allocation technique for clustered processors

    Get PDF
    This work presents a modulo scheduling framework for clustered ILP processors that integrates the cluster assignment, instruction scheduling and register allocation steps in a single phase. This unified approach is more effective than traditional approaches based on sequentially performing some (or all) of the three steps, since it allows optimizing the global code generation problem instead of searching for optimal solutions to each individual step. Besides, it avoids the iterative nature of traditional approaches, which require repeated applications of the three steps until a valid solution is found. The proposed framework includes a mechanism to insert spill code on-the-fly and heuristics to evaluate the quality of partial schedules considering simultaneously inter-cluster communications, memory pressure and register pressure. Transformations that allow trading pressure on a type of resource for another resource are also included. We show that the proposed technique outperforms previously proposed techniques. For instance, the average speed-up for the SPECfp95 is 36% for a 4-cluster configuration.Peer ReviewedPostprint (published version

    Survey on Combinatorial Register Allocation and Instruction Scheduling

    Full text link
    Register allocation (mapping variables to processor registers or memory) and instruction scheduling (reordering instructions to increase instruction-level parallelism) are essential tasks for generating efficient assembly code in a compiler. In the last three decades, combinatorial optimization has emerged as an alternative to traditional, heuristic algorithms for these two tasks. Combinatorial optimization approaches can deliver optimal solutions according to a model, can precisely capture trade-offs between conflicting decisions, and are more flexible at the expense of increased compilation time. This paper provides an exhaustive literature review and a classification of combinatorial optimization approaches to register allocation and instruction scheduling, with a focus on the techniques that are most applied in this context: integer programming, constraint programming, partitioned Boolean quadratic programming, and enumeration. Researchers in compilers and combinatorial optimization can benefit from identifying developments, trends, and challenges in the area; compiler practitioners may discern opportunities and grasp the potential benefit of applying combinatorial optimization

    AGAMOS: A graph-based approach to modulo scheduling for clustered microarchitectures

    Get PDF
    This paper presents AGAMOS, a technique to modulo schedule loops on clustered microarchitectures. The proposed scheme uses a multilevel graph partitioning strategy to distribute the workload among clusters and reduces the number of intercluster communications at the same time. Partitioning is guided by approximate schedules (i.e., pseudoschedules), which take into account all of the constraints that influence the final schedule. To further reduce the number of intercluster communications, heuristics for instruction replication are included. The proposed scheme is evaluated using the SPECfp95 programs. The described scheme outperforms a state-of-the-art scheduler for all programs and different cluster configurations. For some configurations, the speedup obtained when using this new scheme is greater than 40 percent, and for selected programs, performance can be more than doubled.Peer ReviewedPostprint (published version

    Loop transformations for clustered VLIW architectures

    Get PDF
    With increasing demands for performance by embedded systems, especially by digital signal processing (DSP) applications, embedded processors must increase available instructionlevel parallelism (ILP) within significant constraints on power consumption and chip cost. Unfortunately, supporting a large amount of ILP on a processor while maintaining a single register file increases chip cost and potentially decreases overall performance due to increased cycle time. To address this problem, some modern embedded processors partition the register file into multiple low-ported register files, each directly connected with one or more functional units. These functional unit/register file groups are called clusters. Clustered VLIW (very long instruction word) architectures need extra copy operations or delays to transfer values among clusters. To take advantage of clustered architectures, the compiler must expose parallelism for maximal functional-unit utilization, and schedule instructions to reduce intercluster communication overhead. High-level loop transformations offer an excellent opportunity to enhance the abilities of low-level optimizers to generate code for clustered architectures. This dissertation investigates the effects of three loop transformations, i.e., loop fusion, loop unrolling, and unroll-and-jam, on clustered VLIW architectures. The objective is to achieve high performance with low communication overhead. This dissertation discusses the following techniques: Loop Fusion This research examines the impact of loop fusion on clustered architectures. A metric based upon communication costs for guiding loop fusion is developed and tested on DSP benchmarks. Unroll-and-jam and Loop Unrolling A new method that integrates a communication cost model with an integer-optimization problem is developed to determine unroll amounts for loop unrolling and unroll-and-jam automatically for a specific loop on a specific architecture. These techniques have been implemented and tested using DSP benchmarks on simulated, clustered VLIW architectures and a real clustered, embedded processor, the TI TMS320C64X. The results show that the new techniques achieve an average speedup of 1.72-1.89 on five different clustered architectures. These techniques have been implemented and tested using DSP benchmarks on simulated, clustered VLIW architectures and a real clustered, embedded processor, the TI TMS320C64X. The results show that the new techniques achieve an average speedup of 1.72-1.89 on five different clustered architectures

    IMPROVING MULTIBANK MEMORY ACCESS PARALLELISM ON SIMT ARCHITECTURES

    Get PDF
    Memory mapping has traditionally been an important optimization problem for high-performance parallel systems. Today, these issues are increasingly affecting a much wider range of platforms. Several techniques have been presented to solve bank conflicts and reduce memory access latency but none of them turns out to be generally applicable to different application contexts. One of the ambitious goals of this Thesis is to contribute to modelling the problem of the memory mapping in order to find an approach that generalizes on existing conflict-avoiding techniques, supporting a systematic exploration of feasible mapping schemes
    • …
    corecore