450 research outputs found

    Modulo scheduling with integrated register spilling for clustered VLIW architectures

    Get PDF
    Clustering is a technique to decentralize the design of future wide issue VLIW cores and enable them to meet the technology constraints in terms of cycle time, area and power dissipation. In a clustered design, registers and functional units are grouped in clusters so that new instructions are needed to move data between them. New aggressive instruction scheduling techniques are required to minimize the negative effect of resource clustering and delays in moving data around. In this paper we present a novel software pipelining technique that performs instruction scheduling with reduced register requirements, register allocation, register spilling and inter-cluster communication in a single step. The algorithm uses limited backtracking to reconsider previously taken decisions. This backtracking provides the algorithm with additional possibilities for obtaining high throughput schedules with low spill code requirements for clustered architectures. We show that the proposed approach outperforms previously proposed techniques and that it is very scalable independently of the number of clusters, the number of communication buses and communication latency. The paper also includes an exploration of some parameters in the design of future clustered VLIW cores.Peer ReviewedPostprint (published version

    Hierarchical clustered register file organization for VLIW processors

    Get PDF
    Technology projections indicate that wire delays will become one of the biggest constraints in future microprocessor designs. To avoid long wire delays and therefore long cycle times, processor cores must be partitioned into components so that most of the communication is done locally. In this paper, we propose a novel register file organization for VLIW cores that combines clustering with a hierarchical register file organization. Functional units are organized in clusters, each one with a local first level register file. The local register files are connected to a global second level register file, which provides access to memory. All intercluster communications are done through the second level register file. This paper also proposes MIRS-HC, a novel modulo scheduling technique that simultaneously performs instruction scheduling, cluster selection, inserts communication operations, performs register allocation and spill insertion for the proposed organization. The results show that although more cycles are required to execute applications, the execution time is reduced due to a shorter cycle time. In addition, the combination of clustering and hierarchy provides a larger design exploration space that trades-off performance and technology requirements.Peer ReviewedPostprint (published version

    Flexible compiler-managed L0 buffers for clustered VLIW processors

    Get PDF
    Wire delays are a major concern for current and forthcoming processors. One approach to attack this problem is to divide the processor into semi-independent units referred to as clusters. A cluster usually consists of a local register file and a subset of the functional units, while the data cache remains centralized. However, as technology evolves, the latency of such a centralized cache increase leading to an important performance impact. In this paper, we propose to include flexible low-latency buffers in each cluster in order to reduce the performance impact of higher cache latencies. The reduced number of entries in each buffer permits the design of flexible ways to map data from L1 to these buffers. The proposed L0 buffers are managed by the compiler, which is responsible to decide which memory instructions make us of them. Effective instruction scheduling techniques are proposed to generate code that exploits these buffers. Results for the Mediabench benchmark suite show that the performance of a clustered VLIW processor with a unified L1 data cache is improved by 16% when such buffers are used. In addition, the proposed architecture also shows significant advantages over both MultiVLIW processors and clustered processors with a word-interleaved cache, two state-of-the-art designs with a distributed L1 data cache.Peer ReviewedPostprint (published version

    Exploiting Hardware Abstraction for Parallel Programming Framework: Platform and Multitasking

    Get PDF
    With the help of the parallelism provided by the fine-grained architecture, hardware accelerators on Field Programmable Gate Arrays (FPGAs) can significantly improve the performance of many applications. However, designers are required to have excellent hardware programming skills and unique optimization techniques to explore the potential of FPGA resources fully. Intermediate frameworks above hardware circuits are proposed to improve either performance or productivity by leveraging parallel programming models beyond the multi-core era. In this work, we propose the PolyPC (Polymorphic Parallel Computing) framework, which targets enhancing productivity without losing performance. It helps designers develop parallelized applications and implement them on FPGAs. The PolyPC framework implements a custom hardware platform, on which programs written in an OpenCL-like programming model can launch. Additionally, the PolyPC framework extends vendor-provided tools to provide a complete development environment including intermediate software framework, and automatic system builders. Designers\u27 programs can be either synthesized as hardware processing elements (PEs) or compiled to executable files running on software PEs. Benefiting from nontrivial features of re-loadable PEs, and independent group-level schedulers, the multitasking is enabled for both software and hardware PEs to improve the efficiency of utilizing hardware resources. The PolyPC framework is evaluated regarding performance, area efficiency, and multitasking. The results show a maximum 66 times speedup over a dual-core ARM processor and 1043 times speedup over a high-performance MicroBlaze with 125 times of area efficiency. It delivers a significant improvement in response time to high-priority tasks with the priority-aware scheduling. Overheads of multitasking are evaluated to analyze trade-offs. With the help of the design flow, the OpenCL application programs are converted into executables through the front-end source-to-source transformation and back-end synthesis/compilation to run on PEs, and the framework is generated from users\u27 specifications

    High-Level Synthesis for Embedded Systems

    Get PDF
    corecore