2,778 research outputs found
A mathematical formulation of the loop pipelining problem
This paper presents a mathematical model for the loop pipelining problem that considers several parameters for optimization and supports any combination of resource and timing constraints. The unrolling degree of the loop is one of the variables explored by the model. By using Farey’s series, an optimal exploration of the unrolling degree is performed and optimal solutions not considered by other methods are obtained. Finding an optimal schedule that minimizes resource and register requirements is solved by using an Integer linear programming (ILP) model. A novel paradigm called branch and prune is proposed to eficiently converge towards the optimal schedule and prune the search tree for integer solutions, thus drastically reducing the running time. This is the first formulation that combines the unrolling degree of the loop with timing and resource constraints in a mathematical model that guarantees optimal solutions.Peer ReviewedPostprint (author's final draft
Recommended from our members
An efficient global resource constrained technique for exploiting instruction level parallelism
A new Global Resource-constrained Percolation (GRiP) scheduling technique is presented for exploiting instruction level parallelism. Other techniques that have been proposed either have been prohibitively expensive in terms of computation or have limited parallelism. The GRiP technique has been implemented and simulation results are presented
Recommended from our members
Chippe : a system for constraint driven behavioral synthesis
This report describes the Chippe system, gives some background previous work and describes several sample design runs of the system. Also presented are the sources of the design tradeoffs used by Chippe, and overview of the internal design model, and experiences using the system
Transformations of High-Level Synthesis Codes for High-Performance Computing
Specialized hardware architectures promise a major step in performance and
energy efficiency over the traditional load/store devices currently employed in
large scale computing systems. The adoption of high-level synthesis (HLS) from
languages such as C/C++ and OpenCL has greatly increased programmer
productivity when designing for such platforms. While this has enabled a wider
audience to target specialized hardware, the optimization principles known from
traditional software design are no longer sufficient to implement
high-performance codes. Fast and efficient codes for reconfigurable platforms
are thus still challenging to design. To alleviate this, we present a set of
optimizing transformations for HLS, targeting scalable and efficient
architectures for high-performance computing (HPC) applications. Our work
provides a toolbox for developers, where we systematically identify classes of
transformations, the characteristics of their effect on the HLS code and the
resulting hardware (e.g., increases data reuse or resource consumption), and
the objectives that each transformation can target (e.g., resolve interface
contention, or increase parallelism). We show how these can be used to
efficiently exploit pipelining, on-chip distributed fast memory, and on-chip
streaming dataflow, allowing for massively parallel architectures. To quantify
the effect of our transformations, we use them to optimize a set of
throughput-oriented FPGA kernels, demonstrating that our enhancements are
sufficient to scale up parallelism within the hardware constraints. With the
transformations covered, we hope to establish a common framework for
performance engineers, compiler developers, and hardware developers, to tap
into the performance potential offered by specialized hardware architectures
using HLS
Coarse-grained reconfigurable array architectures
Coarse-Grained Reconfigurable Array (CGRA) architectures accelerate the same inner loops that benefit from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops to execute them more efficiently. This chapter discusses the basic principles of CGRAs, and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on flexibility, performance, and power-efficiency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support and for the manual fine-tuning of source code
A Comparative Study of Scheduling Techniques for Multimedia Applications on SIMD Pipelines
Parallel architectures are essential in order to take advantage of the
parallelism inherent in streaming applications. One particular branch of these
employ hardware SIMD pipelines. In this paper, we analyse several scheduling
techniques, namely ad hoc overlapped execution, modulo scheduling and modulo
scheduling with unrolling, all of which aim to efficiently utilize the special
architecture design. Our investigation focuses on improving throughput while
analysing other metrics that are important for streaming applications, such as
register pressure, buffer sizes and code size. Through experiments conducted on
several media benchmarks, we present and discuss trade-offs involved when
selecting any one of these scheduling techniques.Comment: Presented at DATE Friday Workshop on Heterogeneous Architectures and
Design Methods for Embedded Image Systems (HIS 2015) (arXiv:1502.07241
- …