256 research outputs found

    Hierarchical clustered register file organization for VLIW processors

    Get PDF
    Technology projections indicate that wire delays will become one of the biggest constraints in future microprocessor designs. To avoid long wire delays and therefore long cycle times, processor cores must be partitioned into components so that most of the communication is done locally. In this paper, we propose a novel register file organization for VLIW cores that combines clustering with a hierarchical register file organization. Functional units are organized in clusters, each one with a local first level register file. The local register files are connected to a global second level register file, which provides access to memory. All intercluster communications are done through the second level register file. This paper also proposes MIRS-HC, a novel modulo scheduling technique that simultaneously performs instruction scheduling, cluster selection, inserts communication operations, performs register allocation and spill insertion for the proposed organization. The results show that although more cycles are required to execute applications, the execution time is reduced due to a shorter cycle time. In addition, the combination of clustering and hierarchy provides a larger design exploration space that trades-off performance and technology requirements.Peer ReviewedPostprint (published version

    Modulo scheduling with integrated register spilling for clustered VLIW architectures

    Get PDF
    Clustering is a technique to decentralize the design of future wide issue VLIW cores and enable them to meet the technology constraints in terms of cycle time, area and power dissipation. In a clustered design, registers and functional units are grouped in clusters so that new instructions are needed to move data between them. New aggressive instruction scheduling techniques are required to minimize the negative effect of resource clustering and delays in moving data around. In this paper we present a novel software pipelining technique that performs instruction scheduling with reduced register requirements, register allocation, register spilling and inter-cluster communication in a single step. The algorithm uses limited backtracking to reconsider previously taken decisions. This backtracking provides the algorithm with additional possibilities for obtaining high throughput schedules with low spill code requirements for clustered architectures. We show that the proposed approach outperforms previously proposed techniques and that it is very scalable independently of the number of clusters, the number of communication buses and communication latency. The paper also includes an exploration of some parameters in the design of future clustered VLIW cores.Peer ReviewedPostprint (published version

    Improving multithreading performance for clustered VLIW architectures.

    Get PDF
    Very Long Instruction Word (VLIW) processors are very popular in embedded and mobile computing domain. Use of VLIW processors range from Digital Signal Processors (DSPs) found in a plethora of communication and multimedia devices to Graphics Processing Units (GPUs) used in gaming and high performance computing devices. The advantage of VLIWs is their low complexity and low power design which enable high performance at a low cost. Scalability of VLIWs is limited by the scalability of register file ports. It is not viable to have a VLIW processor with a single large register file because of area and power consumption implications of the register file. Clustered VLIW solve the register file scalability issue by partitioning the register file into multiple clusters and a set of functional units that are attached to register file of that cluster. Using a clustered approach, higher issue width can be achieved while keeping the cost of register file within reasonable limits. Several commercial VLIW processors have been designed using the clustered VLIW model. VLIW processors can be used to run a larger set of applications. Many of these applications have a good Lnstruction Level Parallelism (ILP) which can be efficiently utilized. However, several applications, specially the ones that are control code dominated do not exibit good ILP and the processor is underutilized. Cache misses is another major source of resource underutiliztion. Multithreading is a popular technique to improve processor utilization. Interleaved MultiThreading (IMT) hides cache miss latencies by scheduling a different thread each cycle but cannot hide unused instructions slots. Simultaneous MultiThread (SMT) can also remove ILP under-utilization by issuing multiple threads to fill the empty instruction slots. However, SMT has a higher implementation cost than IMT. The thesis presents Cluster-level Simultaneous MultiThreading (CSMT) that supports a limited form of SMT where VLIW instructions from different threads are merged at a cluster-level granularity. This lowers the hardware implementation cost to a level comparable to the cheap IMT technique. The more complex SMT combines VLIW instructions at the individual operation-level granularity which is quite expensive especially in for a mobile solution. We refer to SMT at operation-level as OpSMT to reduce ambiguity. While previous studies restricted OpSMT on a VLIW to 2 threads, CSMT has a better scalability and upto 8 threads can be supported at a reasonable cost. The thesis proposes several other techniques to further improve CSMT performance. In particular, Cluster renaming remaps the clusters used by instructions of different threads to reduce resource conflicts. Cluster renaming is quite effective in reducing the issue-slots under-utilization and significantly improves CSMT performance.The thesis also proposes: a hybrid between IMT and CSMT which increases the number of supported threads, heterogeneous instruction merging where some instructions are combined using SMT and CSMT rest, and finally, split-issue, a technique that allows to launch partially an instruction making it easier to be combined with others

    Clustered VLIW architecture based on queue register files

    Get PDF
    Institute for Computing Systems ArchitectureInstruction-level parallelism (ILP) is a set of hardware and software techniques that allow parallel execution of machine operations. Superscalar architectures rely most heavily upon hardware schemes to identify parallelism among operations. Although successful in terms of performance, the hardware complexity involved might limit the scalability of this model. VLIW architectures use a different approach to exploit ILP. In this case all data dependence analyses and scheduling of operations are performed at compile time, resulting in a simpler hardware organization. This allows the inclusion of a larger number of functional units (FUs) into a single chip. IN spite of this relative simplification, the scalability of VLIW architectures can be constrained by the size and number of ports of the register file. VLIW machines often use software pipelining techniques to improve the execution of loop structures, which can increase the register pressure. Furthermore, the access time of a register file can be compromised by the number of ports, causing a negative impact on the machine cycle time. For these reasons we understand that the benefits of having parallel FUs, which have motivated the investigation of alternative machine designs. This thesis presents a scalar VLIW architecture comprising clusters of FUs and private register files. Register files organised as queue structures are used as a mechanism for inter-cluster communication, allowing the enforcement of fixed latency in the process. This scheme presents better possibilities in terms of scalability as the size of the individual register files is not determined by the total number of FUs, suggesting that the silicon area may grow only linearly with respect to the total number of FUs. However, the effectiveness of such an organization depends on the efficiency of the code partitioning strategy. We have developed an algorithm for a clustered VLIW architecture integrating both software pipelining and code partitioning in a a single procedure. Experimental results show it may allow performance levels close to an unclustered machine without communication restraints. Finally, we have developed silicon area and cycle time models to quantify the scalability of performance and cost for this class of architecture

    Further Specialization of Clustered VLIW Processors: A MAP Decoder for Software Defined Radio

    Get PDF
    Turbo codes are extensively used in current communications standards and have a promising outlook for future generations. The advantages of software defined radio, especially dynamic reconfiguration, make it very attractive in this multi-standard scenario. However, the complex and power consuming implementation of the maximum a posteriori (MAP) algorithm, employed by turbo decoders, sets hurdles to this goal. This work introduces an ASIP architecture for the MAP algorithm, based on a dual-clustered VLIW processor. It displays the good performance of application specific designs along with the versatility of processors, which makes it compliant with leading edge standards. The machine deals with multi-operand instructions in an innovative way, the fetching and assertion of data is serialized and the addressing is automatized and transparent for the programmer. The performance-area trade-off of the proposed architecture achieves a throughput of 8 cycles per symbol with very low power dissipation

    Coarse-grained reconfigurable array architectures

    Get PDF
    Coarse-Grained Reconfigurable Array (CGRA) architectures accelerate the same inner loops that benefit from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops to execute them more efficiently. This chapter discusses the basic principles of CGRAs, and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on flexibility, performance, and power-efficiency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support and for the manual fine-tuning of source code

    Performance evaluation of CSMT for VLIW processors

    Get PDF
    Clustered VLIW embedded processors have become widespread due to benefits of simple hardware and low power. However, while some applications exhibit large amounts of instruction level parallelism (ILP) and benefit from very wide machines, others have little ILP, which wastes precious resources in wide processors. Simultaneous MultiThreading (SMT) is a well known technique that improves resource utilization by exploiting thread level parallelism at the instruction grain level. However, implementing SMT for VLIWs requires complex structures. CSMT (Clusterlevel Simultaneous MultiThreading) allows some degree of SMT in clustered VLIW processors. CSMT considers the set of operations that execute simultaneously in a given cluster (named bundle)as the assignment unit. All bundles belonging to a VLIW instruction from a given thread are issued simultaneously. To minimize cluster conflicts between threads, a very simple hardwarebased cluster renaming mechanism is proposed. The experimental results show that CSMT significantly improves ILP when compared with other multithreading approaches suited for VLIW. For instance, with 4 threads CSMT shows an average speedup of 113% over a single-thread VLIW architecture and 36% over Interleaved MultiThreading (IMT). In some cases, speedup can be as high as 228% over single thread architecture and 97% over IMT. Also CSMT for a 2-thread processor, achieves almost the same performance as IMT for a 4-thread processor and also outperforms it in some cases.Peer ReviewedPostprint (author’s final draft

    Code Generation for an Application-Specific VLIW Processor With Clustered, Addressable Register Files

    Get PDF
    International audienceModern compilers integrate recent advances in compiler construction, intermediate representations, algorithms and programming language front-ends. Yet code generation for appli\-cation-specific architectures benefits only marginally from this trend, as most of the effort is oriented towards popular general-purpose architectures. Historically, non-orthogonal architectures have relied on custom compiler technologies, some retargettable, but largely decoupled from the evolution of mainstream tool flows. Very Long Instruction Word (VLIW) architectures have introduced a variety of interesting problems such as clusterization, packetization or bundling, instruction scheduling for exposed pipelines, long delay slots, software pipelining, etc. These have been addressed in the literature, with a focus on the exploitation of Instruction Level Parallelism (ILP). While these are well known solutions already embedded into existing compilers, they rely on common hardware functionalities that are expected to be present in a fairly large subset of VLIW architectures. This paper presents our work on back-end compiler for Mephisto, a high performance low-power application-specific processor, based on LLVM. Mephisto is specialized enough to challenge established code generation solutions for VLIW and DSP processors, calling for an innovative compilation flow. Conversely, even though Mephisto might be seen a somewhat exotic processor, its hardware characteristics such as addressable register files benefit from existing analyses and transformations in LLVM. We describe our model of the Mephisto architecture, the difficulties we encountered, and the associated compilation methods, some of them new and specific to Mephisto

    Loop transformations for clustered VLIW architectures

    Get PDF
    With increasing demands for performance by embedded systems, especially by digital signal processing (DSP) applications, embedded processors must increase available instructionlevel parallelism (ILP) within significant constraints on power consumption and chip cost. Unfortunately, supporting a large amount of ILP on a processor while maintaining a single register file increases chip cost and potentially decreases overall performance due to increased cycle time. To address this problem, some modern embedded processors partition the register file into multiple low-ported register files, each directly connected with one or more functional units. These functional unit/register file groups are called clusters. Clustered VLIW (very long instruction word) architectures need extra copy operations or delays to transfer values among clusters. To take advantage of clustered architectures, the compiler must expose parallelism for maximal functional-unit utilization, and schedule instructions to reduce intercluster communication overhead. High-level loop transformations offer an excellent opportunity to enhance the abilities of low-level optimizers to generate code for clustered architectures. This dissertation investigates the effects of three loop transformations, i.e., loop fusion, loop unrolling, and unroll-and-jam, on clustered VLIW architectures. The objective is to achieve high performance with low communication overhead. This dissertation discusses the following techniques: Loop Fusion This research examines the impact of loop fusion on clustered architectures. A metric based upon communication costs for guiding loop fusion is developed and tested on DSP benchmarks. Unroll-and-jam and Loop Unrolling A new method that integrates a communication cost model with an integer-optimization problem is developed to determine unroll amounts for loop unrolling and unroll-and-jam automatically for a specific loop on a specific architecture. These techniques have been implemented and tested using DSP benchmarks on simulated, clustered VLIW architectures and a real clustered, embedded processor, the TI TMS320C64X. The results show that the new techniques achieve an average speedup of 1.72-1.89 on five different clustered architectures. These techniques have been implemented and tested using DSP benchmarks on simulated, clustered VLIW architectures and a real clustered, embedded processor, the TI TMS320C64X. The results show that the new techniques achieve an average speedup of 1.72-1.89 on five different clustered architectures
    corecore