Dynamic common sub-expression elimination during scheduling in high-level synthesis

Author: Alex Nicolau
Mehrdad Reshadi
Nick Savoiu
Nikil Dutt
Rajesh Gupta
Sumit Gupta
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/01/2004

Coordinated parallelizing compiler optimizations and high-level synthesis

Author: Aho A.
Alexandru Nicolau
Bergamaschi R.
Chaiyakul V.
Ebcioglu K.
Fisher J.
Gupta S.
Iqbal Z.
Janssen M.
Kountouris A.
Ku D.
Li J.
Lobo D.
Nicolau A.
Nikil D. Dutt
Novack S.
Orailoglu A.
Peymandoust A.
Potkonjak M.
Rajesh Kumar Gupta
Sreedhar V.
Sumit Gupta
Wakabayashi K.
Walker R.
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/10/2004

We present a high-level synthesis methodology that applies a coordinated set of coarse-grain and fine-grain parallelizing transformations. The transformations are applied both during a presynthesis phase and during scheduling, with the objective of optimizing the results of synthesis and reducing the impact of control flow constructs on the quality of results. We first apply a set of source-level presynthesis transformations that include common sub-expression elimination (CSE), copy propagation, dead code elimination and loop-invariant code motion, along with more coarse-level code restructuring transformations such as loop unrolling. We then explore scheduling techniques that use a set of aggressive speculative code motions to maximally parallelize the design by re-ordering, speculating and sometimes even duplicating operations in the design. In particular, we present a new technique called "Dynamic CSE" that dynamically coordinates CSE and code motions such as speculation and conditional speculation during scheduling. We implemented our parallelizing high-level synthesis in the SPARK framework. This framework takes a behavioral description in ANSI-C as input and generates synthesizable register-transfer level VHDL. Our results from computationally expensive portions of three moderately complex design targets, namely, MPEG-1, MPEG-2 and the GIMP image processing tool, validate the utility of our approach to the behavioral synthesis of designs with complex control flows.
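
For readers unfamiliar with the CSE transformation named in the abstract, the following is a minimal sketch of classical local CSE over one basic block of three-address operations. It is not SPARK's Dynamic CSE, which the abstract says is coordinated with code motions during scheduling; the tuple encoding, the `copy` pseudo-operation, and the name `local_cse` are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding: (dest, op, arg1, arg2), e.g. ("t1", "+", "a", "b").
Op = Tuple[str, str, str, str]

COMMUTATIVE = {"+", "*"}


def local_cse(block: List[Op]) -> List[Op]:
    """Replace recomputed expressions in one basic block with copies."""
    # Maps (op, arg1, arg2) to a variable that currently holds that value.
    available: Dict[Tuple[str, str, str], str] = {}
    result: List[Op] = []
    for dest, op, a, b in block:
        # Normalize operand order so that a+b and b+a share one key.
        args = tuple(sorted((a, b))) if op in COMMUTATIVE else (a, b)
        key = (op, args[0], args[1])
        if op != "copy" and key in available:
            result.append((dest, "copy", available[key], ""))
        else:
            result.append((dest, op, a, b))
        # Writing dest invalidates expressions that read dest or live in dest.
        available = {
            k: v for k, v in available.items()
            if dest not in k[1:] and v != dest
        }
        # Record the expression unless dest overwrote one of its own operands.
        if op != "copy" and dest not in (a, b):
            available[key] = dest
    return result


block = [
    ("t1", "+", "a", "b"),
    ("t2", "*", "t1", "c"),
    ("t3", "+", "b", "a"),   # same value as t1, becomes a copy
    ("a", "+", "a", "c"),    # redefines a, so a+b is no longer available
    ("t4", "+", "a", "b"),   # must be recomputed
]
optimized = local_cse(block)
```

In this sketch the third operation is rewritten as a copy of `t1`, while the last one is kept because `a` was redefined in between.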

eScholarship - University of California

Feedback Driven Annotation and Refactoring of Parallel Programs

Author: Larsen Per
Publication venue: Technical University of Denmark
Publication date: 01/01/2011

Superscalar RISC-V Processor with SIMD Vector Extension

Author: He Jiongrui
Publication venue: University of Saskatchewan Library
Publication date: 22/09/2020

With the increasing number of digital products in the market, the need for robust and highly configurable processors rises. The demand is met by the stable and extensible open-source RISC-V instruction set architecture. RISC-V processors are becoming popular in many fields of application and research. This thesis presents a dual-issue superscalar RISC-V processor design with dynamic execution. The proposed design employs the global sharing scheme for branch prediction and the Tomasulo algorithm for out-of-order execution. The processor is capable of speculative execution with five checkpoints. Data flow in the instruction dispatch and commit stages is optimized to achieve higher instruction throughput. The superscalar processor is extended with a customized vector instruction set of single-instruction-multiple-data computations to specifically improve the performance on machine learning tasks. According to the definition of the proposed vector instruction set, the scratchpad memory and element-wise arithmetic units are implemented in the vector co-processor. Different test programs are evaluated on the fully tested superscalar processor. Compared to the reference work, the proposed design improves average instruction throughput by 18.9% and average prediction hit rate by 4.92%, with a 16.9% higher operating clock frequency when synthesized on the Intel Arria 10 FPGA board. The forward propagation of a convolutional neural network model is evaluated on the standalone superscalar processor and on the integration with the vector co-processor. The vector program with software-level optimizations achieves a 9.53× improvement in instruction throughput and a 10.18× improvement in real-time throughput. Moreover, the integration also provides 2.22× the energy efficiency of the superscalar processor alone.
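
As an illustration of the branch prediction scheme mentioned in the abstract, here is a minimal software sketch, assuming "global sharing scheme" refers to the well-known gshare predictor: the branch address is XORed with a global history register to index a table of 2-bit saturating counters. The table size, the 4-byte instruction alignment, and the class name `GsharePredictor` are assumptions; the thesis's actual parameters are not given in the abstract.

```python
class GsharePredictor:
    """Gshare sketch: PC XOR global history indexes 2-bit counters."""

    def __init__(self, index_bits: int = 10) -> None:
        self.mask = (1 << index_bits) - 1
        self.history = 0                          # global branch history
        self.counters = [1] * (1 << index_bits)   # 1 = weakly not taken

    def _index(self, pc: int) -> int:
        # Assumes 4-byte aligned instructions, so the low 2 PC bits are dropped.
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc: int) -> bool:
        # Counter values 2 and 3 predict taken; 0 and 1 predict not taken.
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the actual outcome into the global history register.
        self.history = ((self.history << 1) | int(taken)) & self.mask


predictor = GsharePredictor(index_bits=4)
initial_guess = predictor.predict(0x40)
for _ in range(20):
    predictor.update(0x40, taken=True)   # train on an always-taken branch
trained_guess = predictor.predict(0x40)
```

After enough always-taken outcomes the history register saturates, the index stabilizes, and the selected counter moves into the taken range.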

University of Saskatchewan Research Archive