1,783 research outputs found

    The effectiveness of loop unrolling for modulo scheduling in clustered VLIW architectures

    Clustered organizations are becoming a common trend in the design of VLIW architectures. In this work we propose a novel modulo scheduling approach for such architectures. The proposed technique performs cluster assignment and instruction scheduling in a single pass, which is shown to be more effective than performing the assignment first and the scheduling afterwards. We also show that loop unrolling significantly enhances the performance of the proposed scheduler, especially when the communication channel among clusters is the main performance bottleneck. By selectively unrolling some loops, we can obtain the best performance with the minimum increase in code size. Performance evaluation for SPECfp95 shows that the clustered architecture achieves about the same IPC (Instructions Per Cycle) as a unified architecture with the same resources. Moreover, when cycle time is taken into account, a 4-cluster configuration is 3.6 times faster than the unified architecture.
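
    The abstract ties the achievable initiation interval (II) to both the functional units and the inter-cluster communication channel. As a minimal sketch (not the paper's algorithm, and with hypothetical names and counts), the resource-constrained lower bound on the II can be extended with a bus term so that communication shows up as the bottleneck exactly as described:
```python
# Minimal sketch: resource-bounded minimum initiation interval (ResMII)
# for a clustered VLIW, treating the inter-cluster bus as one more
# resource. All counts below are hypothetical.
import math

def res_mii(op_counts, units_per_cluster, n_clusters,
            inter_cluster_moves, bus_slots_per_cycle):
    """Lower bound on II from functional units and the communication bus."""
    # Functional-unit bound: ops of each kind share the units of all clusters.
    fu_bound = max(
        math.ceil(count / (units_per_cluster[kind] * n_clusters))
        for kind, count in op_counts.items()
    )
    # Bus bound: every cross-cluster operand transfer needs a bus slot.
    bus_bound = math.ceil(inter_cluster_moves / bus_slots_per_cycle)
    return max(fu_bound, bus_bound)

# 8 adds and 4 multiplies per iteration on 2 clusters (one adder and one
# multiplier each), 6 transfers but 1 bus slot per cycle: the bus bound (6)
# dominates the functional-unit bound (4), so communication limits the II.
print(res_mii({"add": 8, "mul": 4}, {"add": 1, "mul": 1}, 2, 6, 1))
```
    Unrolling and keeping the unrolled copies' values local to a cluster lowers the number of transfers per original iteration, which is why selective unrolling helps precisely when the bus bound dominates.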

    Hypernode reduction modulo scheduling

    Software pipelining is a loop scheduling technique that extracts parallelism from loops by overlapping the execution of several consecutive iterations. Most prior scheduling research has focused on achieving minimum execution time, without regard to register requirements. Most strategies tend to stretch operand lifetimes because they schedule some operations too early or too late. The paper presents a novel strategy that simultaneously schedules some operations late and other operations early, minimizing all the stretchable dependencies and therefore reducing the registers required by the loop. The key to this strategy is a pre-ordering that selects the order in which the operations will be scheduled. The results show that the method described in this paper performs better than other heuristic methods and almost as well as a linear programming method, while requiring much less time to produce the schedules.
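
    As a small illustration of the quantity the pre-ordering targets (a sketch with hypothetical cycle numbers, not the hypernode reduction heuristic itself), register pressure grows with the total lifetime of the loop's values, i.e. the distance from each producer to its last consumer; placing an operation far from its already-scheduled neighbours stretches those lifetimes:
```python
# Minimal sketch: sum of value lifetimes for a given schedule, the quantity
# that scheduling operations "too early or too late" inflates. Edges are
# producer -> consumer dependences; cycle numbers are hypothetical.
def total_lifetime(schedule, edges):
    """Sum over producers of (last consumer cycle - producer cycle)."""
    last_use = {}
    for src, dst in edges:
        last_use[src] = max(last_use.get(src, schedule[src]), schedule[dst])
    return sum(last_use[v] - schedule[v] for v in last_use)

edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
stretched = {"a": 0, "b": 1, "c": 9, "d": 10}   # "c" scheduled far too late
balanced  = {"a": 0, "b": 1, "c": 2, "d": 3}    # neighbours kept adjacent
print(total_lifetime(stretched, edges), total_lifetime(balanced, edges))  # 19 5
```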

    A unified modulo scheduling and register allocation technique for clustered processors

    This work presents a modulo scheduling framework for clustered ILP processors that integrates the cluster assignment, instruction scheduling and register allocation steps in a single phase. This unified approach is more effective than traditional approaches based on sequentially performing some (or all) of the three steps, since it allows optimizing the global code generation problem instead of searching for optimal solutions to each individual step. In addition, it avoids the iterative nature of traditional approaches, which require repeated applications of the three steps until a valid solution is found. The proposed framework includes a mechanism to insert spill code on the fly and heuristics to evaluate the quality of partial schedules by simultaneously considering inter-cluster communications, memory pressure and register pressure. Transformations that allow trading pressure on one type of resource for pressure on another are also included. We show that the proposed technique outperforms previously proposed techniques; for instance, the average speed-up for SPECfp95 is 36% for a 4-cluster configuration.
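
    To make the single-phase idea concrete, the sketch below shows one possible figure of merit for a partial schedule that looks at inter-cluster communication, register pressure and spill traffic at the same time; the weights, parameter names and numbers are hypothetical and not taken from the paper:
```python
# Minimal sketch: a combined cost for a partial schedule, so that spill
# decisions, cluster assignment and scheduling can be judged together.
# Weights and numbers are hypothetical.
def partial_schedule_cost(comms, max_live, regs_available, spills,
                          w_comm=1.0, w_reg=2.0, w_spill=4.0):
    reg_excess = max(0, max_live - regs_available)   # values beyond the register file
    return w_comm * comms + w_reg * reg_excess + w_spill * spills

# Trading one resource for another: spilling a value lowers register excess
# at the price of extra memory traffic; here the two options cost the same.
print(partial_schedule_cost(comms=3, max_live=34, regs_available=32, spills=0))  # 7.0
print(partial_schedule_cost(comms=3, max_live=32, regs_available=32, spills=1))  # 7.0
```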

    Swing modulo scheduling: a lifetime-sensitive approach

    This paper presents a novel software pipelining approach called Swing Modulo Scheduling (SMS). It generates schedules that are near optimal in terms of initiation interval, register requirements and stage count. Swing Modulo Scheduling is a heuristic approach with a low computational cost. The paper describes the technique and evaluates it for the Perfect Club benchmark suite. SMS is compared with other heuristic methods, showing that it outperforms them in terms of the quality of the obtained schedules and compilation time. SMS is also compared with an integer linear programming approach that generates optimum schedules but at a huge computational cost, which makes it feasible only for very small loops. For a set of small loops, SMS obtained the optimum initiation interval in all cases, and its schedules required only 5% more registers and a 1% higher stage count than the optimum.
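
    Like most modulo schedulers, SMS wraps its node ordering and lifetime-sensitive placement in a driver that starts at the lower bound MII = max(ResMII, RecMII) and retries with a larger initiation interval until every operation fits. The sketch below shows only that outer driver; order_nodes and try_place are hypothetical stand-ins for SMS's priority ordering and its placement near already-scheduled neighbours:
```python
# Minimal sketch of the outer modulo-scheduling driver (not SMS's ordering
# or placement rules): try II = MII, MII+1, ... until all ops are placed.
def modulo_schedule(ops, res_mii, rec_mii, order_nodes, try_place, max_ii=64):
    mii = max(res_mii, rec_mii)                  # lower bound on the II
    for ii in range(mii, max_ii + 1):
        schedule = {}
        if all(try_place(op, ii, schedule) for op in order_nodes(ops)):
            return ii, schedule                  # first II at which everything fits
    return None

# Trivial stand-ins so the sketch runs: schedule ops in the given order into
# the first free modulo slot (a single shared unit, unit latency).
def order_nodes(ops):
    return ops

def try_place(op, ii, schedule):
    for t in range(ii):
        if all(t != s % ii for s in schedule.values()):
            schedule[op] = t
            return True
    return False

print(modulo_schedule(["a", "b", "c"], res_mii=2, rec_mii=1,
                      order_nodes=order_nodes, try_place=try_place))  # (3, {...})
```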

    Selective Vectorization for Short-Vector Instructions

    Multimedia extensions are nearly ubiquitous in today's general-purpose processors. These extensions consist primarily of a set of short-vector instructions that apply the same opcode to a vector of operands. Vector instructions introduce a data-parallel component to processors that exploit instruction-level parallelism, and present an opportunity for increased performance. In fact, ignoring a processor's vector opcodes can leave a significant portion of the available resources unused. In order for software developers to find short-vector instructions generally useful, however, the compiler must target these extensions with complete transparency and consistent performance. This paper describes selective vectorization, a technique for balancing computation across a processor's scalar and vector units. Current approaches for targeting short-vector instructions directly adopt vectorizing technology first developed for supercomputers. Traditional vectorization, however, can lead to a performance degradation since it fails to account for a processor's scalar resources. We formulate selective vectorization in the context of software pipelining. Our approach creates software pipelines with shorter initiation intervals, and therefore, higher performance. A key aspect of selective vectorization is its ability to manage the transfer of operands between vector and scalar instructions. Even when operand transfer is expensive, our technique is sufficiently sophisticated to achieve significant performance gains. We evaluate selective vectorization on a set of SPEC FP benchmarks. On a realistic VLIW processor model, the approach achieves whole-program speedups of up to 1.35x over existing approaches. For individual loops, it provides speedups of up to 1.75x.
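
    The balancing idea can be illustrated with a toy cost model (hypothetical numbers and names, not the paper's formulation): for a group of independent operations, pick the split between the scalar and vector units that minimizes the cycles of the busier unit, while charging something for moving operands between the two sides:
```python
# Minimal sketch of selective vectorization's trade-off: choose how many of
# n independent ops to run on a vw-wide vector unit versus the scalar unit,
# minimizing the busier unit's cycles plus a transfer penalty when mixing.
# Machine parameters below are hypothetical.
import math

def best_split(n, vw, scalar_slots, vector_slots, transfer_cost):
    best = None
    for k in range(0, n + 1, vw):                     # k ops go to the vector unit
        scalar_cycles = math.ceil((n - k) / scalar_slots)
        vector_cycles = math.ceil((k / vw) / vector_slots)
        moves = transfer_cost if 0 < k < n else 0     # mixing requires transfers
        ii = max(scalar_cycles, vector_cycles) + moves
        if best is None or ii < best[0]:
            best = (ii, k)
    return best

# 24 ops, 4-wide vectors, 2 scalar slots/cycle, 1 vector slot/cycle, 1 cycle
# of transfer overhead: full vectorization needs 6 cycles, but sending 16 ops
# to the vector unit and 8 to the scalar unit needs only 5, transfers included.
print(best_split(24, 4, 2, 1, 1))  # (5, 16)
```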

    Instruction Re-Selection for Iterative Modulo Scheduling on High Performance Multi-Issue DSPs

    Iterative modulo scheduling is very important for compilers targeting high-performance multi-issue digital signal processors, because these processors are often severely limited by idle functional units, and reducing that idle time can have a significant impact on performance. However, complex instructions such as mac, which are used in most recent DSPs, usually increase data dependence complexity, and the complex dependencies present in signal processing applications often restrict scheduling freedom and therefore become a limiting factor for the iterative modulo scheduler. In this work, we propose a technique that re-selects the instructions of an application's loop code according to dependence complexity, directly addressing the dependence constraint. It is specifically designed to accelerate software pipelining by minimizing the length of intrinsic cyclic dependencies. A few existing compilers relax dependencies by means of loop unrolling, but only in limited cases: loop unrolling typically incurs a large code-size overhead, and iterative modulo scheduling with relaxed dependencies in the general case is an NP-hard problem that requires complex assignments of registers and functional units. Our technique uses a heuristic to handle this problem efficiently in a pre-pass of iterative modulo scheduling, without loop unrolling.
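
    The intrinsic cyclic dependencies mentioned above bound the achievable II from below: RecMII is the maximum, over all dependence cycles, of the cycle's total latency divided by its total iteration distance. The sketch below (with hypothetical latencies) shows how re-selecting a complex instruction can shrink that bound: writing an accumulation as a single mac puts the whole mac latency on the recurrence, whereas splitting it into a multiply and an add leaves only the add on the cycle, since the multiply does not depend on the accumulator:
```python
# Minimal sketch: recurrence-constrained lower bound on the initiation
# interval. Each dependence cycle contributes ceil(latency / distance).
# Latencies below are hypothetical.
import math

def rec_mii(cycles):
    """cycles: list of (total_latency, total_distance) per dependence cycle."""
    return max(math.ceil(lat / dist) for lat, dist in cycles)

print(rec_mii([(4, 1)]))  # acc = mac(acc, a, b): 4-cycle mac on the recurrence -> 4
print(rec_mii([(1, 1)]))  # t = a * b; acc = acc + t: only the 1-cycle add recurs -> 1
```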

    A Parallel Adaptive P3M code with Hierarchical Particle Reordering

    We discuss the design and implementation of HYDRA_OMP, a parallel implementation of the Smoothed Particle Hydrodynamics-Adaptive P3M (SPH-AP3M) code HYDRA. The code is designed primarily for conducting cosmological hydrodynamic simulations and is written in Fortran77+OpenMP. A number of optimizations for RISC processors and SMP-NUMA architectures have been implemented, the most important being hierarchical reordering of particles within chaining cells, which greatly improves data locality and thereby removes the cache misses typically associated with linked lists. Parallel scaling is good, with a minimum parallel scaling of 73% achieved on 32 nodes for a variety of modern SMP architectures. We give performance data in terms of the number of particle updates per second, which is a more useful performance metric than raw MFlops. A basic version of the code will be made available to the community in the near future. (34 pages, 12 figures; accepted for publication in Computer Physics Communications.)
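
    As a minimal sketch of the reordering idea (not the Fortran77 HYDRA_OMP code, and with hypothetical cell geometry), the particle array can be permuted so that members of the same chaining cell are contiguous in memory; a cell is then traversed as an array slice rather than by chasing a linked list, which is where the locality benefit comes from:
```python
# Minimal sketch: group particles by chaining cell so that each cell's
# members are contiguous in memory. Box size, cell grid and positions
# are hypothetical.
import numpy as np

def reorder_by_cell(pos, box, ncell):
    """Return positions sorted by chaining-cell index, plus per-cell offsets."""
    cell_xyz = np.minimum((pos / box * ncell).astype(int), ncell - 1)
    cell_id = (cell_xyz[:, 0] * ncell + cell_xyz[:, 1]) * ncell + cell_xyz[:, 2]
    order = np.argsort(cell_id, kind="stable")          # group particles by cell
    counts = np.bincount(cell_id, minlength=ncell ** 3)
    starts = np.concatenate(([0], np.cumsum(counts)))   # slice bounds per cell
    return pos[order], starts

pos = np.random.rand(1000, 3)                # particles in a unit box
sorted_pos, starts = reorder_by_cell(pos, box=1.0, ncell=4)
# Particles of cell c now occupy sorted_pos[starts[c]:starts[c + 1]].
```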