Search CORE

12,519 research outputs found

Software prefetching for software pipelined loops

Author: González Colás Antonio María
Sánchez Jesús
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/1998
Field of study

The paper investigates the interaction between software pipelining and different software prefetching techniques for VLIW machines. It is shown that processor stalls due to memory dependencies have a great impact into execution time. A novel heuristic is proposed and it is show to outperform previous proposals.Peer ReviewedPostprint (published version

UPCommons. Portal del coneixement obert de la UPC

Recommended from our members

Fine grain software pipelining of non-vectorizable nested loops

Author: Kim Ki-chang
Nicolau Alexandru
Publication venue: eScholarship, University of California
Publication date: 01/01/1991
Field of study

This paper presents a new technique to parallelize nested loops at the statement level. It transforms sequential nested loops, either vectorizable or not, into parallel ones. Previously, the wavefront method was used to parallelize non-vectorizable nested loops. However, in order to reduce the complexity of parallelization, the wavefront method regards an iteration as an unbreakable scheduling unit and draws parallelism through iteration overlapping. Our technique takes a statement rather than an iteration as the scheduling unit and exploits parallelism by overlapping the statements in all dimensions. In this paper, we show how this finer grain parallelization can be achieved with reasonable computational complexity, and the effectiveness of the resulting method in exploiting parallelism

eScholarship - University of California

Recommended from our members

An efficient global resource constrained technique for exploiting instruction level parallelism

Author: Nicolau Alexandru
Novack Steven
Publication venue: eScholarship, University of California
Publication date: 21/01/1992
Field of study

A new Global Resource-constrained Percolation (GRiP) scheduling technique is presented for exploiting instruction level parallelism. Other techniques that have been proposed either have been prohibitively expensive in terms of computation or have limited parallelism. The GRiP technique has been implemented and simulation results are presented

eScholarship - University of California

Transformations of High-Level Synthesis Codes for High-Performance Computing

Author: Besta Maciej
Hoefler Torsten
Licht Johannes de Fine
Meierhans Simon
Publication venue
Publication date: 29/10/2019
Field of study

Specialized hardware architectures promise a major step in performance and energy efficiency over the traditional load/store devices currently employed in large scale computing systems. The adoption of high-level synthesis (HLS) from languages such as C/C++ and OpenCL has greatly increased programmer productivity when designing for such platforms. While this has enabled a wider audience to target specialized hardware, the optimization principles known from traditional software design are no longer sufficient to implement high-performance codes. Fast and efficient codes for reconfigurable platforms are thus still challenging to design. To alleviate this, we present a set of optimizing transformations for HLS, targeting scalable and efficient architectures for high-performance computing (HPC) applications. Our work provides a toolbox for developers, where we systematically identify classes of transformations, the characteristics of their effect on the HLS code and the resulting hardware (e.g., increases data reuse or resource consumption), and the objectives that each transformation can target (e.g., resolve interface contention, or increase parallelism). We show how these can be used to efficiently exploit pipelining, on-chip distributed fast memory, and on-chip streaming dataflow, allowing for massively parallel architectures. To quantify the effect of our transformations, we use them to optimize a set of throughput-oriented FPGA kernels, demonstrating that our enhancements are sufficient to scale up parallelism within the hardware constraints. With the transformations covered, we hope to establish a common framework for performance engineers, compiler developers, and hardware developers, to tap into the performance potential offered by specialized hardware architectures using HLS

arXiv.org e-Print Archive

Repository for Publications and Research Data

Using ACL2 to Verify Loop Pipelining in Behavioral Synthesis

Author: Hao Kecheng
Puri Disha
Ray Sandip
Xie Fei
Publication venue: 'Open Publishing Association'
Publication date: 01/01/2014
Field of study

Behavioral synthesis involves compiling an Electronic System-Level (ESL) design into its Register-Transfer Level (RTL) implementation. Loop pipelining is one of the most critical and complex transformations employed in behavioral synthesis. Certifying the loop pipelining algorithm is challenging because there is a huge semantic gap between the input sequential design and the output pipelined implementation making it infeasible to verify their equivalence with automated sequential equivalence checking techniques. We discuss our ongoing effort using ACL2 to certify loop pipelining transformation. The completion of the proof is work in progress. However, some of the insights developed so far may already be of value to the ACL2 community. In particular, we discuss the key invariant we formalized, which is very different from that used in most pipeline proofs. We discuss the needs for this invariant, its formalization in ACL2, and our envisioned proof using the invariant. We also discuss some trade-offs, challenges, and insights developed in course of the project.Comment: In Proceedings ACL2 2014, arXiv:1406.123

arXiv.org e-Print Archive

Directory of Open Access Journals

PDXScholar (Portland State University)

Loop pipelining with resource and timing constraints

Author: Sánchez Carracedo Fermín
Publication venue: Universitat Politècnica de Catalunya
Publication date: 01/01/1996
Field of study

Developing efficient programs for many of the current parallel computers is not easy due to the architectural complexity of those machines. The wide variety of machine organizations often makes it more difficult to port an existing program than to reprogram it completely. Therefore, powerful translators are necessary to generate effective code and free the programmer from concerns about the specific characteristics of the target machine. This work focuses on techniques to be used by an important class of translators, whose objective is to transform sequential programs into equivalent more parallel programs. The transformations are performed at instruction level in order to exploit low level parallelism and increase memory locality.Most of the current applications are programmed in languages which do not allow us to express parallelism between high-level sentences (as Pascal, C or Fortran). Furthermore, a lot of applications written ten or more years ago are still used today, and it is not feasible to rewrite such applications for many reasons (not only technical reasons, but also economic ones). Translators enable programmers to write the application in a familiar sequential programming language, without concerning their selves with the architecture of the target machine. Current compilers for parallel architectures not only translate a program written on a high-level language to the appropriate machine language, but also perform some transformations in the final code in order to execute the program in a more parallel way. The transformations improve the performance in the execution of the program by making use of the knowledge that the compiler has about the machine architecture. The semantics of the program remain intact after any transformation.Experiments show that limiting parallelization to basic blocks not included in loops limits maximum speedup. This is because loops often comprise a large portion of the parallelism available to be exploited in a program. For this reason, a lot of effort has been devoted in the recent years to parallelize loop execution. Several parallel computer architectures and compilation techniques have been proposed to exploit such a parallelism at different granularities. Multiprocessors exploit coarse grained parallelism by distributing entire loop iterations to different processors. Systems oriented to the high-level synthesis (HLS) of VLSI circuits, superscalar processors and very long instruction word (VLIW) processors exploit fine-grained parallelism at instruction level. This work addresses fine-grained parallelization of loops addressed to the HLS of VLSI circuits. Two algorithms are proposed for resource constraints and for timing constraints. An algorithm to reduce the number of registers required to execute a loop in a given architecture is also proposed.Postprint (published version

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Secretaría de Estado de Cultura