535 research outputs found

    Hierarchical clustered register file organization for VLIW processors

    Technology projections indicate that wire delays will become one of the biggest constraints in future microprocessor designs. To avoid long wire delays and therefore long cycle times, processor cores must be partitioned into components so that most of the communication is done locally. In this paper, we propose a novel register file organization for VLIW cores that combines clustering with a hierarchical register file organization. Functional units are organized in clusters, each one with a local first-level register file. The local register files are connected to a global second-level register file, which provides access to memory. All intercluster communications are done through the second-level register file. This paper also proposes MIRS-HC, a novel modulo scheduling technique that simultaneously performs instruction scheduling, cluster selection, communication insertion, register allocation, and spill insertion for the proposed organization. The results show that although more cycles are required to execute applications, the execution time is reduced due to a shorter cycle time. In addition, the combination of clustering and hierarchy provides a larger design exploration space that trades off performance and technology requirements.
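
Modulo schedulers such as MIRS-HC start from a lower bound on the initiation interval (II), derived from resource usage and dependence recurrences. The sketch below is a minimal illustration of that standard lower-bound computation; the resource classes and helper names are assumptions, not taken from the paper:

```python
# Illustrative sketch: lower bound on the initiation interval (II)
# for modulo scheduling. Resource names/counts are hypothetical.
from math import ceil

def res_mii(op_counts, unit_counts):
    """Resource-constrained minimum II: for each resource class,
    ceil(#ops needing it / #units providing it); take the maximum."""
    return max(ceil(op_counts[r] / unit_counts[r]) for r in op_counts)

def rec_mii(cycles):
    """Recurrence-constrained minimum II: for each dependence cycle
    (total latency, total iteration distance), ceil(lat / dist)."""
    return max(ceil(lat / dist) for lat, dist in cycles)

def min_ii(op_counts, unit_counts, cycles):
    """Overall lower bound: the larger of the two constraints."""
    return max(res_mii(op_counts, unit_counts),
               rec_mii(cycles) if cycles else 1)
```

For example, a loop with 6 ALU ops on 2 ALUs and 3 memory ops on 1 port gives a resource bound of 3, while a recurrence of latency 4 and distance 1 raises the bound to 4.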

    Integrated Software Pipelining for Clustered VLIW Architectures

    In this thesis we address the problem of integrated software pipelining for clustered VLIW architectures. The phases that are integrated and solved as one combined problem are: cluster assignment, instruction selection, scheduling, register allocation, and spilling. As a first step we describe two methods for integrated code generation of basic blocks. The first method is optimal and based on integer linear programming. The second method is a heuristic based on genetic algorithms. We then extend the integer linear programming model to modulo scheduling. To the best of our knowledge, this is the first time the modulo scheduling problem for clustered architectures has been solved optimally with instruction selection and cluster assignment integrated. We also show that optimal spilling is closely related to optimal register allocation when the register files are clustered. In fact, optimal spilling is as simple as adding an additional virtual register file representing the memory, and having transfer instructions to and from this register file corresponding to stores and loads. Our algorithm for modulo scheduling iteratively considers schedules with an increasing number of schedule slots. A problem with such an iterative method is that, if the initiation interval is not equal to the lower bound, there is no way to determine whether the found solution is optimal or not. We have proven that for a class of architectures that we call transfer-free, we can set an upper bound on the schedule length; that is, we can prove when a found modulo schedule with an initiation interval larger than the lower bound is optimal. Experiments have been conducted to show the usefulness and limitations of our optimal methods. For the basic block case we compare the optimal method to the heuristic based on genetic algorithms. This work has been supported by the Swedish national graduate school in computer science (CUGS) and Vetenskapsrådet (VR).
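
The iterative search described above amounts to trying successive initiation intervals until a feasible schedule is found; the transfer-free upper bound is what lets the loop terminate with a proof of optimality. A hedged sketch of that outer loop, where `try_schedule` is a stand-in for the integer linear programming solver:

```python
# Sketch of the iterative II search; `try_schedule(ii)` is a hypothetical
# callback that returns a schedule if one exists for that II, else None.
def find_modulo_schedule(mii, upper_bound_ii, try_schedule):
    """Return (ii, schedule) for the smallest feasible II up to the
    proven upper bound, or None if nothing feasible is found."""
    for ii in range(mii, upper_bound_ii + 1):
        schedule = try_schedule(ii)
        if schedule is not None:
            # First feasible II is the smallest one tried, hence optimal
            # within the searched range.
            return ii, schedule
    return None
```

Because the candidate IIs are tried in increasing order, the first feasible one is minimal over the searched range, which is exactly where the upper-bound result matters.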

    Variable-based multi-module data caches for clustered VLIW processors

    Memory structures consume an important fraction of the total processor energy. One solution to reduce the energy consumed by cache memories consists of reducing their supply voltage and/or increasing their threshold voltage, at the expense of access time. We propose to divide the L1 data cache into two cache modules for a clustered VLIW processor consisting of two clusters. Such division is done on a variable basis, so that the address of a datum determines its location. Each cache module is assigned to a cluster and can be set up as a fast power-hungry module or as a slow power-aware module. We also present compiler techniques to distribute variables between the two cache modules and generate code accordingly. We have explored several cache configurations using the Mediabench suite and have observed that the best distributed cache organization outperforms traditional cache organizations by 19%-31% and by 11%-29% in energy-delay metrics. In addition, we also explore a reconfigurable distributed cache, where the cache can be reconfigured on a context switch. This reconfigurable scheme further outperforms the best previous distributed organization by 3%-4%.
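
One plausible way to picture the variable-based distribution is a greedy heuristic that ranks variables by access frequency and fills the fast module first. This is a generic sketch under that assumption, not the paper's actual compiler algorithm:

```python
# Hypothetical greedy variable-to-module assignment.
def assign_variables(variables, fast_capacity):
    """variables: dict name -> (size_bytes, access_count).
    Most-accessed variables go to the fast module until it is full;
    the rest go to the slow, power-aware module."""
    order = sorted(variables, key=lambda v: variables[v][1], reverse=True)
    fast, slow, used = set(), set(), 0
    for v in order:
        size, _accesses = variables[v]
        if used + size <= fast_capacity:
            fast.add(v)
            used += size
        else:
            slow.add(v)
    return fast, slow
```

Since the address of a datum determines its module, an assignment like this fixes at compile time which cluster's code accesses which variable.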

    Instruction replication for clustered microarchitectures

    This work presents a new compilation technique that uses instruction replication to reduce the number of communications executed on a clustered microarchitecture. For such architectures, the need to communicate values between clusters can result in a significant performance loss. Inter-cluster communications can be reduced by selectively replicating an appropriate set of instructions. However, instruction replication must be done carefully, since it may also degrade performance due to the increased contention it can place on processor resources. The proposed scheme is built on top of a previously proposed state-of-the-art modulo scheduling algorithm that effectively reduces communications. Results show that the number of communications decreases when replication is used, which results in significant speed-ups: IPC is increased by 25% on average for a 4-cluster microarchitecture, and by as much as 70% for selected programs.
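
A simple profitability check conveys the trade-off described above: replicating an instruction removes one communication per remote consuming cluster, but occupies an issue slot on each of those clusters. This is a hypothetical sketch of that reasoning, not the paper's scheduling heuristic:

```python
# Illustrative check: is replicating an instruction worthwhile?
# All names and the spare-slot model are assumptions for this sketch.
def replication_profitable(consumer_clusters, home_cluster, spare_slots):
    """consumer_clusters: clusters that consume the instruction's value.
    spare_slots: dict cluster -> free issue slots in the schedule window.
    Replicate only if there is at least one communication to remove and
    every remote cluster can absorb the extra copy without contention."""
    remote = set(consumer_clusters) - {home_cluster}
    comms_removed = len(remote)
    return comms_removed > 0 and all(spare_slots[c] >= 1 for c in remote)
```

The `all(...)` guard captures the caveat in the abstract: replication helps only when the extra copies do not create contention on processor resources.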

    Distributed data cache designs for clustered VLIW processors

    Wire delays are a major concern for current and forthcoming processors. One approach to deal with this problem is to divide the processor into semi-independent units referred to as clusters. A cluster usually consists of a local register file and a subset of the functional units, while the L1 data cache typically remains centralized, in what we call partially distributed architectures. However, as technology evolves, the relative latency of such a centralized cache will increase, leading to an important impact on performance. In this paper, we propose partitioning the L1 data cache among clusters for clustered VLIW processors. We refer to this kind of design as fully distributed processors. In particular, we propose and evaluate three different configurations: a snoop-based cache coherence scheme, a word-interleaved cache, and flexible L0-buffers managed by the compiler. For each alternative, instruction scheduling techniques targeted to cyclic code are developed. Results for the Mediabench suite show that the performance of such fully distributed architectures is always better than the performance of a partially distributed one with the same amount of resources. In addition, the key aspects of each fully distributed configuration are explored.
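
For the word-interleaved configuration, the mapping can be shown in a few lines: consecutive words of the address space alternate between the per-cluster cache modules. The word size and module count below are illustrative assumptions, not parameters taken from the paper:

```python
# Illustrative word-interleaved mapping; sizes are assumptions.
WORD_BYTES = 4    # assumed word size
NUM_MODULES = 2   # one cache module per cluster in a 2-cluster VLIW

def cache_module(addr):
    """Consecutive words round-robin across the cluster-local modules,
    so the address of a datum statically determines its home cluster."""
    return (addr // WORD_BYTES) % NUM_MODULES
```

Under this scheme the compiler knows at scheduling time which cluster owns each word, which is what makes cluster-aware scheduling of loads and stores possible.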

    Virtual cluster scheduling through the scheduling graph

    This paper presents an instruction scheduling and cluster assignment approach for clustered processors. The proposed technique makes use of a novel representation named the scheduling graph, which describes all possible schedules. A powerful deduction process is applied to this graph, reducing at each step the set of possible schedules. In contrast to traditional list scheduling techniques, the proposed scheme tries to establish relations among instructions rather than assigning each instruction to a particular cycle. The main advantage is that wrong or poor schedules can be anticipated and discarded earlier. In addition, cluster assignment of instructions is performed using another novel concept called virtual clusters, which define sets of instructions that must execute in the same cluster. These clusters are managed during the deduction process to identify incompatibilities among instructions. The mapping of virtual to physical clusters is postponed until the scheduling of the instructions has finished. The advantages of this novel approach include: (1) accurate scheduling information when assigning, and (2) accurate information on the cluster assignment constraints imposed by scheduling decisions. We have implemented and evaluated the proposed scheme with superblocks extracted from SpecInt95 and MediaBench. The results show that this approach produces better schedules than the previous state of the art. Speed-ups are up to 15%, with average speed-ups ranging from 2.5% (2 clusters) to 9.5% (4 clusters).
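
Virtual clusters behave like equivalence classes over instructions, which suggests a union-find structure: merging records that two instructions must share a cluster, while the mapping to physical clusters is deferred. The data-structure choice here is an assumption for illustration, not taken from the paper:

```python
# Hypothetical union-find realization of virtual clusters.
class VirtualClusters:
    def __init__(self, num_instructions):
        self.parent = list(range(num_instructions))

    def find(self, i):
        """Representative of instruction i's virtual cluster
        (with path halving for efficiency)."""
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]
            i = self.parent[i]
        return i

    def merge(self, i, j):
        """Deduce that i and j must execute in the same cluster."""
        self.parent[self.find(i)] = self.find(j)

    def same_cluster(self, i, j):
        return self.find(i) == self.find(j)
```

During deduction, two virtual clusters that must *not* share a physical cluster can be checked for incompatibility before any cycle assignment is committed.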

    Storage constraint satisfaction for embedded processor compilers

    Increasing interest in the high-volume, high-performance embedded processor market motivates the stand-alone processor world to consider issues like design flexibility (synthesizable processor core), energy consumption, and silicon efficiency. Implications for embedded processor architectures and compilers are the exploitation of hardware acceleration, instruction-level parallelism (ILP), and distributed storage files. In that scope, VLIW architectures have been acclaimed for the parallelism in the architecture while orthogonality of the associated instruction sets is maintained. Code generation methods for such processors are pressured towards an efficient use of scarce resources while satisfying tight real-time constraints imposed by DSP and multimedia applications. Limited storage (e.g. register) availability poses a problem for traditional methods that perform code generation in separate stages, e.g. operation scheduling followed by register allocation, because the objectives of scheduling and register allocation conflict in several ways. Firstly, register reuse can create dependencies that did not exist in the original code, but can also save spilling values to memory. Secondly, while a particular ordering of instructions may increase the potential for ILP, the reordering due to instruction scheduling may also extend the lifetime of certain values, which can increase the register requirement. Furthermore, the instruction scheduler requires an adequate number of local registers to avoid register reuse (since reuse limits the opportunity for ILP), while the register allocator would prefer sufficient global registers in order to avoid spills. Finally, an effective scheduler can lose its achieved degree of instruction-level parallelism when spill code is inserted afterwards.
Without any communication of information and cooperation between the scheduling and storage allocation phases, the compiler writer faces the problem of determining which of these phases should run first to generate the most efficient final code. The lack of communication and cooperation between instruction scheduling and storage allocation can result in code that contains an excess of register spills and/or a lower degree of ILP than actually achievable. This problem, called phase coupling, cannot be ignored when constraints are tight and efficient solutions are desired. Traditional methods that perform code generation in separate stages are often not able to find an efficient or even a feasible solution. Therefore, those methods need an increasing amount of help from the programmer (or designer) to arrive at a feasible solution. Because this requires an excessive amount of design time and extensive knowledge of the processor architecture, there is a need for automated techniques that can cope with the different kinds of constraints during scheduling. This thesis proposes an approach for instruction scheduling and storage allocation that makes extensive use of timing, resource, and storage constraints to prune the search space for scheduling. The method supports VLIW architectures with (distributed) storage files containing random-access registers, rotating registers to exploit the available ILP in loops, and stacks or FIFOs to exploit larger storage capacities with lower addressing costs. Potential access conflicts between values are analyzed before and during scheduling, according to the type of storage they are assigned to. Using constraint analysis techniques and properties of colored conflict graphs, essential information is obtained to identify the bottlenecks for satisfying the storage file constraints. To reduce the identified bottlenecks, the method performs partial scheduling, ordering value accesses so as to allow better reuse of storage.
Without enforcing any specific storage assignment of values, the method continues until it can guarantee that any completion of the partial schedule will also result in a feasible storage allocation. Therefore, the scheduling freedom is exploited for satisfaction of storage, resource, and timing constraints in one phase.
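
A small example of the kind of storage-constraint check involved: for values with straight-line lifetimes, the conflict graph is an interval graph, so the peak number of simultaneously live values decides whether a register file of a given size suffices. The sweep below is an illustrative sketch of that property, not the thesis's algorithm:

```python
# Illustrative peak-register-pressure sweep over value lifetimes.
def max_pressure(lifetimes):
    """lifetimes: list of (def_cycle, last_use_cycle) intervals.
    Returns the maximum number of simultaneously live values; a
    schedule fits a register file of size k iff this is <= k."""
    events = []
    for start, end in lifetimes:
        events.append((start, +1))     # value becomes live
        events.append((end + 1, -1))   # value dead after its last use
    pressure = peak = 0
    for _cycle, delta in sorted(events):
        pressure += delta
        peak = max(peak, pressure)
    return peak
```

When the peak exceeds the file size, something must give: reorder accesses to shorten overlaps (as the partial scheduling above does), or spill a value to the memory "register file".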