17 research outputs found

Instruction replication for clustered microarchitectures

Author: Aleta Ortega Alexandre
Codina Viñas Josep M.
David Kaeli
González Colás Antonio María
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2003

This work presents a new compilation technique that uses instruction replication to reduce the number of communications executed on a clustered microarchitecture. For such architectures, the need to communicate values between clusters can cause a significant performance loss. Inter-cluster communications can be reduced by selectively replicating an appropriate set of instructions. However, replication must be applied carefully, since it can also degrade performance by increasing contention for processor resources. The proposed scheme is built on top of a previously proposed state-of-the-art modulo scheduling algorithm that effectively reduces communications. Results show that replication decreases the number of communications, which yields significant speed-ups: IPC increases by 25% on average for a 4-cluster microarchitecture and by as much as 70% for selected programs. Peer reviewed. Postprint (published version).
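The trade-off described in this abstract can be illustrated with a toy model. The sketch below is not the paper's algorithm: the function `count_communications`, the instruction names, and the cost model (one transfer per value per destination cluster) are all assumptions made for illustration.

```python
def count_communications(edges, placement):
    """Count inter-cluster value transfers in a toy dependence graph.

    edges: iterable of (producer, consumer) instruction pairs.
    placement: dict mapping each instruction to the set of clusters
               that hold a copy of it (more than one = replicated).
    A transfer is needed whenever a consumer copy sits in a cluster
    that has no copy of its producer. Each (producer, cluster) pair
    is counted once (assumed cost model).
    """
    transfers = set()
    for producer, consumer in edges:
        for cluster in placement[consumer]:
            if cluster not in placement[producer]:
                transfers.add((producer, cluster))
    return len(transfers)


# Hypothetical example: one address computation feeds loads in two clusters.
edges = [("addr", "load0"), ("addr", "load1")]
single = {"addr": {0}, "load0": {0}, "load1": {1}}
replicated = {"addr": {0, 1}, "load0": {0}, "load1": {1}}

print(count_communications(edges, single))      # 1
print(count_communications(edges, replicated))  # 0

# If "addr" itself depends on a value "x" that lives only in cluster 0,
# the replica needs "x" too, so replication moves the transfer
# instead of removing it.
edges_with_input = [("x", "addr")] + edges
single_in = dict(single, x={0})
replicated_in = dict(replicated, x={0})

print(count_communications(edges_with_input, single_in))      # 1
print(count_communications(edges_with_input, replicated_in))  # 1
```

In this model, replicating `addr` removes the transfer only when the replica's own inputs are already available in the second cluster, which is one reason replication has to be selective. The model ignores the extra functional-unit contention that the abstract also mentions.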

UPCommons. Portal del coneixement obert de la UPC

Integrated modulo scheduling and cluster assignment for TI TMS320C64x+ architecture

Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2014

Crossref

Instruction scheduling heuristic for an efficient FFT in VLIW processors with balanced resource usage

Author: Mounir Bahtat
Philippe Elleaume
Philippe Le Gall
Said Belkouch
Publication venue: Springer Nature
Publication date: 01/01/2016

Springer - Publisher Connector

Virtual cluster scheduling through the scheduling graph

Author: Codina Viñas Josep M.
González Colás Antonio María
Sánchez Navarro Jesús
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2007

This paper presents an instruction scheduling and cluster assignment approach for clustered processors. The proposed technique uses a novel representation, the scheduling graph, which describes all possible schedules. A powerful deduction process is applied to this graph, reducing the set of possible schedules at each step. In contrast to traditional list scheduling techniques, the proposed scheme tries to establish relations among instructions rather than assigning each instruction to a particular cycle. The main advantage is that wrong or poor schedules can be anticipated and discarded earlier. In addition, cluster assignment is performed using another novel concept, virtual clusters, which define sets of instructions that must execute in the same cluster. These clusters are managed during the deduction process to identify incompatibilities among instructions. The mapping of virtual to physical clusters is postponed until instruction scheduling has finished. The advantages of this approach include (1) accurate scheduling information when assigning clusters and (2) accurate information about the cluster assignment constraints imposed by scheduling decisions. We have implemented and evaluated the proposed scheme with superblocks extracted from SPECint95 and MediaBench. The results show that this approach produces better schedules than the previous state of the art. Speed-ups are up to 15%, with average speed-ups ranging from 2.5% (2 clusters) to 9.5% (4 clusters). Peer reviewed. Postprint (published version).

Crossref

UPCommons. Portal del coneixement obert de la UPC

Ordonnancement pour processeurs à parallélisme d'instructions en utilisant des techniques de recherche de motifs [Scheduling for instruction-level parallel processors using pattern-matching techniques]

Author: Yviquel Hervé
Publication venue: HAL CCSD
Publication date: 23/06/2010

To satisfy the various hardware constraints, architectural exploration can be used to determine the optimal parameters of a VLIW (Very Long Instruction Word) processor for a given application, such as the number of functional units, the number of registers, and so on. The processor parameters are tuned to the level of instruction-level parallelism in the application. Likewise, application-specific instruction sets are well suited to embedded systems, which in most cases are dedicated to a specific task. All of these specializations effectively improve the trade-off between performance, area, and power consumption. This report presents a new tool whose goal is to determine the optimal parameters of a VLIW processor for a given application in terms of sizing, organization, and specialization of its instruction set. The tool models the problems to be solved using constraint programming and exploits graph covering with computational patterns. Different processor structures can then be compared in terms of performance and hardware complexity.

HAL Descartes

Hal-Diderot

HAL-Rennes 1

AGAMOS: A graph-based approach to modulo scheduling for clustered microarchitectures

Author: Aleta Ortega Alexandre
Codina Viñas Josep M.
González Colás Antonio María
Kaeli D
Sánchez Navarro F. Jesús
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2009

This paper presents AGAMOS, a technique to modulo schedule loops on clustered microarchitectures. The proposed scheme uses a multilevel graph partitioning strategy to distribute the workload among clusters and reduces the number of intercluster communications at the same time. Partitioning is guided by approximate schedules (i.e., pseudoschedules), which take into account all of the constraints that influence the final schedule. To further reduce the number of intercluster communications, heuristics for instruction replication are included. The proposed scheme is evaluated using the SPECfp95 programs. The described scheme outperforms a state-of-the-art scheduler for all programs and different cluster configurations. For some configurations, the speedup obtained when using this new scheme is greater than 40 percent, and for selected programs, performance can be more than doubled. Peer reviewed. Postprint (published version).

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Loop transformations for clustered VLIW architectures

Author: Qian Yi
Publication venue: Digital Commons @ Michigan Tech
Publication date: 01/01/2002

With increasing demands for performance by embedded systems, especially by digital signal processing (DSP) applications, embedded processors must increase available instruction-level parallelism (ILP) within significant constraints on power consumption and chip cost. Unfortunately, supporting a large amount of ILP on a processor while maintaining a single register file increases chip cost and potentially decreases overall performance due to increased cycle time. To address this problem, some modern embedded processors partition the register file into multiple low-ported register files, each directly connected with one or more functional units. These functional unit/register file groups are called clusters. Clustered VLIW (very long instruction word) architectures need extra copy operations or delays to transfer values among clusters. To take advantage of clustered architectures, the compiler must expose parallelism for maximal functional-unit utilization, and schedule instructions to reduce intercluster communication overhead. High-level loop transformations offer an excellent opportunity to enhance the abilities of low-level optimizers to generate code for clustered architectures. This dissertation investigates the effects of three loop transformations, i.e., loop fusion, loop unrolling, and unroll-and-jam, on clustered VLIW architectures. The objective is to achieve high performance with low communication overhead. This dissertation discusses the following techniques. Loop fusion: this research examines the impact of loop fusion on clustered architectures; a metric based on communication costs for guiding loop fusion is developed and tested on DSP benchmarks. Unroll-and-jam and loop unrolling: a new method that integrates a communication cost model with an integer-optimization problem is developed to determine unroll amounts for loop unrolling and unroll-and-jam automatically for a specific loop on a specific architecture. These techniques have been implemented and tested using DSP benchmarks on simulated, clustered VLIW architectures and a real clustered, embedded processor, the TI TMS320C64X. The results show that the new techniques achieve an average speedup of 1.72-1.89 on five different clustered architectures.
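As a brief illustration of one of the transformations named in this abstract, the sketch below applies unroll-and-jam by hand to a matrix-vector product. It is a generic textbook example, not code from the dissertation: the function names are hypothetical, and the unroll factor is fixed at 2 by assumption, whereas the dissertation selects unroll amounts with an integer-optimization model.

```python
def matvec(a, x):
    """Reference loop nest: y[i] = sum over j of a[i][j] * x[j]."""
    n, m = len(a), len(x)
    y = [0] * n
    for i in range(n):
        for j in range(m):
            y[i] += a[i][j] * x[j]
    return y


def matvec_unroll_and_jam(a, x):
    """Same computation with the outer loop unrolled by 2 and the
    two resulting inner loops fused (jammed) into one."""
    n, m = len(a), len(x)
    y = [0] * n
    main = n - n % 2  # largest even row count handled by the jammed loop
    for i in range(0, main, 2):
        for j in range(m):
            xj = x[j]  # read once, reused by both jammed statements
            y[i] += a[i][j] * xj
            y[i + 1] += a[i + 1][j] * xj
    for i in range(main, n):  # remainder iteration when n is odd
        for j in range(m):
            y[i] += a[i][j] * x[j]
    return y


a = [[1, 2], [3, 4], [5, 6]]
x = [1, 1]
print(matvec(a, x))                 # [3, 7, 11]
print(matvec_unroll_and_jam(a, x))  # [3, 7, 11]
```

The two statements in the jammed inner loop are independent of each other and share only `x[j]`, so a scheduler for a clustered machine could in principle place them in different clusters at the cost of making that one value available in both.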

Michigan Technological University

トクテイ　オウヨウ　ブンヤ　ムケ　メイレイ　セット　ヲ　モツ　クミコミ　プロセッサ　ノ　タメノ　コード　カイカツ　シュホウ

Author: Tanaka Hiroaki
タナカヒロアキ
田中浩明

Osaka University Knowledge Archive

A VLIW DSP data path with multiple controllers

Author: Vrijnsen J.H.G.M.
Publication date: 01/01/2003

Repository TU/e

Pure OAI Repository

Compilation Techniques for High-Performance Embedded Systems with Multiple Processors

Author: Franke Bjorn
Publication venue: University of Edinburgh. College of Science and Engineering. School of Informatics.
Publication date: 01/01/2004

Institute for Computing Systems Architecture

Despite the progress made in developing more advanced compilers for embedded systems, programming of embedded high-performance computing systems based on Digital Signal Processors (DSPs) is still a highly skilled manual task. This is true for single-processor systems, and even more for embedded systems based on multiple DSPs. Compilers often fail to optimise existing DSP codes written in C due to the employed programming style. Parallelisation is hampered by the complex multiple address space memory architecture, which can be found in most commercial multi-DSP configurations. This thesis develops an integrated optimisation and parallelisation strategy that can deal with low-level C codes and produces optimised parallel code for a homogeneous multi-DSP architecture with distributed physical memory and multiple logical address spaces. In a first step, low-level programming idioms are identified and recovered. This enables the application of high-level code and data transformations well-known in the field of scientific computing. Iterative feedback-driven search for “good” transformation sequences is being investigated. A novel approach to parallelisation based on a unified data and loop transformation framework is presented and evaluated. Performance optimisation is achieved through exploitation of data locality on the one hand, and utilisation of DSP-specific architectural features such as Direct Memory Access (DMA) transfers on the other hand. The proposed methodology is evaluated against two benchmark suites (DSPstone & UTDSP) and four different high-performance DSPs, one of which is part of a commercial four processor multi-DSP board also used for evaluation. Experiments confirm the effectiveness of the program recovery techniques as enablers of high-level transformations and automatic parallelisation. Source-to-source transformations of DSP codes yield an average speedup of 2.21 across four different DSP architectures. The parallelisation scheme is, in conjunction with a set of locality optimisations, able to produce linear and even super-linear speedups on a number of relevant DSP kernels and applications.

CiteSeerX

Edinburgh Research Archive