231 research outputs found

Advances in Parallel-Stage Decoupled Software Pipelining Leveraging Loop Distribution, Stream-Computing and the SSA Form

Author: Cohen Albert
Li Feng
Pop Antoniu
Publication venue: Florent Bouchez and Sebastian Hack and Eelco Visser
Publication date: 02/04/2011

8 pages. Categories and Subject Descriptors: D.3.4 [Programming Languages]: Processors - Compilers, Optimization. International audience. Decoupled Software Pipelining (DSWP) is a program partitioning method that enables compilers to extract pipeline parallelism from sequential programs. Parallel-Stage DSWP (PS-DSWP) is an extension that also exploits the data parallelism within pipeline filters. This paper presents the preliminary design of a new PS-DSWP method that handles arbitrary structured control flow, has slightly better algorithmic complexity, naturally exploits nested parallelism with communication across arbitrary levels, and integrates seamlessly with data-flow parallel programming environments. It is inspired by loop distribution and supports nested/structured partitioning along the hierarchy of control dependences. The method relies on a data-flow streaming extension of OpenMP. These advances are made possible by progress in compiler intermediate representations. We describe our use of the Static Single Assignment (SSA) form and how we extend it to the context of concurrent streaming tasks, and we discuss the benefits and challenges for PS-DSWP.
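The abstract above names loop distribution as the inspiration for the partitioning. As a rough illustration only, and not the paper's algorithm, the following C sketch splits one loop into a sequential stage (a loop-carried recurrence) and a data-parallel stage. The function names, the recurrence, and the `stream` array standing in for an inter-stage communication channel are all hypothetical; the two stages run one after the other here, whereas a DSWP-style pipeline would run them concurrently.

```c
#include <assert.h>
#include <stddef.h>

/* Fused form: one loop body holds both the recurrence and the
 * independent per-iteration work. */
void fused_loop(const unsigned *in, unsigned *out, size_t n)
{
    assert(in && out);
    unsigned state = 1u;
    for (size_t i = 0; i < n; i++) {
        state = state * 31u + in[i];     /* loop-carried recurrence */
        out[i] = (state % 1000u) * 2u;   /* independent work */
    }
}

/* Distributed form: the recurrence becomes a sequential first stage that
 * writes its results to `stream` (a hypothetical stand-in for a channel);
 * the independent work becomes a second stage with no loop-carried
 * dependence, so its iterations may run in parallel. */
void distributed_loops(const unsigned *in, unsigned *stream,
                       unsigned *out, size_t n)
{
    assert(in && stream && out);
    unsigned state = 1u;
    for (size_t i = 0; i < n; i++) {     /* stage 1: sequential */
        state = state * 31u + in[i];
        stream[i] = state;
    }
    /* Stage 2: data-parallel. The pragma is ignored when the file is
     * compiled without OpenMP support, and the loop then runs serially. */
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++) {
        out[i] = (stream[i] % 1000u) * 2u;
    }
}
```

Both functions produce the same `out` array for the same input; the distributed form simply makes explicit which part must stay ordered and which part can be replicated across threads.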

INRIA a CCSD electronic archive server

HAL-MINES ParisTech

A domain-specific high-level programming model

Author: Balarin
Bhattacharya
Blumofe
Board
Boulos
Cameron
Chandra
Grotker
Hiram
Houzet
Kirk
Lee
Munshi
Pacheco
Parhi
Polukhin
Reinders
Sanders
Skillicorn
Valiant
Publication venue: Wiley
Publication date: 22/09/2015

International audience. Computing hardware continues to move toward more parallelism and more heterogeneity to obtain more computing power. From personal computers to supercomputers, we find several levels of parallelism expressed by the interconnection of multi-core and many-core accelerators. Computing software needs to adapt to this trend, and programmers can use parallel programming models (PPMs) to accomplish this difficult task. Different PPMs are available, based on tasks, directives, or low-level languages or libraries. Each offers a higher or lower level of abstraction from the architecture through its own syntax. However, to offer an efficient PPM with an additional, higher level of abstraction while preserving performance, one idea is to restrict it to a specific domain and adapt it to a family of applications. In the present study, we propose a high-level PPM specific to digital signal processing applications. It is based on data-flow graph models of computation and a dynamic runtime model of execution (StarPU). We show how the user can easily express a digital signal processing application and take advantage of task, data and graph parallelism in the implementation to improve the performance of targeted heterogeneous clusters composed of CPUs and different accelerators (e.g., GPU, Xeon Phi).

Crossref

Hal - Université Grenoble Alpes

MetaFork: A Compilation Framework for Concurrency Models Targeting Hardware Accelerators

Author: Chen Xiaohui
Publication venue: Scholarship@Western
Publication date: 24/03/2017

Parallel programming is gaining ground in various domains because of the tremendous computational power it brings; however, it also requires substantial code-crafting effort to achieve performance improvements. Unfortunately, in most cases, performance tuning has to be done manually by programmers. We argue that automated tuning is necessary because of the following factors. First, code optimization is machine-dependent: an optimization preferred on one machine may not be suitable for another. Second, as the optimization search space grows, manually finding an optimized configuration becomes hard. Therefore, developing new compiler techniques for optimizing applications is of considerable interest. This thesis aims to develop new techniques that help programmers write efficient algorithms and code targeting hardware acceleration technologies more effectively. Our work is organized around a compilation framework, called MetaFork, for concurrency platforms and its application to automatic parallelization. MetaFork is a high-level programming language extending C/C++ that combines several models of concurrency, including fork-join, SIMD and pipelining parallelism. MetaFork is also a compilation framework that aims to facilitate the design and implementation of concurrent programs through four key features, which make MetaFork unique and novel: (1) Perform automatic code translation between concurrency platforms targeting multi-core architectures. (2) Provide a high-level language for expressing concurrency as in the fork-join model, the SIMD paradigm and pipelining parallelism. (3) Generate parallel code from serial code, with an emphasis on code that depends on machine or program parameters (e.g. cache size, number of processors, number of threads per thread block). (4) Optimize code depending on parameters that are unknown at compile time.

Scholarship@Western

Paralelização de laços doacross usando anotações de componentes e probabilidade de Loop-Carried

Author: Mattos Luís Felipe Souza de, 1990-
Publication venue: [s.n.]
Publication date: 18/02/2019

Advisors: Guido Costa Souza de Araújo, Márcio Machado Pereira. Dissertation (master's), Universidade Estadual de Campinas, Instituto de Computação.
Abstract: Loop parallelization can improve the performance of compute-intensive algorithms, but not all loops can be easily parallelized. So-called DOACROSS loops have dependences between iterations, i.e. one iteration computes a value that is used by a later iteration. This kind of dependence is called loop-carried, and such loops cannot be parallelized trivially because the execution order of the iterations must be respected. Some techniques can parallelize this kind of loop, but the programmer must understand how the algorithm works and choose which instructions can run in parallel and which must be serialized. These sequential and parallel components must be separated manually by the programmer, and communication must be added between components, to respect dependences inside the loop body, and between threads, to respect loop-carried dependences. Implementing these techniques is laborious and requires some expertise from the programmer to separate the loop components and find the dependences needed to implement communication between components/threads. This communication can be done through queues or buffers, depending on the parallelization algorithm chosen. One of these techniques is the traditional DOACROSS algorithm, implemented in OpenMP 4.5 through the depend clause of the ordered directive. This construct is used inside an OpenMP parallel loop region to delimit the components that must be sequential; communication and synchronization are then implemented automatically by the OpenMP runtime library. This method removes most of the programming work from the programmer, but the sequential components must still be delimited explicitly. Another technique studied is Batched DoAcross (BDX). This algorithm can reduce the communication overhead between components, but it must be implemented manually: the programmer has to separate the sequential and parallel components, create synchronization barriers for the sequential components, create buffers for communication between components, and create shared variables for communication between threads (loop-carried dependences). In our experiments, we observed that the choice of parallelization technique depends on several factors, namely the structure of the algorithm, the loop-carried ratio, the number of loop iterations and the loop size. We therefore created a new OpenMP clause that, used together with the ordered directive, separates the sequential and parallel components and implements these techniques automatically. This clause, called use, takes a parameter that specifies which technique the programmer wants to use to parallelize the loop.
Degree: Master's (Mestrado) in Computer Science (Ciência da Computação).
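The abstract above refers to the OpenMP 4.5 depend clause of the ordered directive. A minimal sketch of that standard construct follows, assuming a hypothetical running-sum kernel (the function name and the arithmetic are illustrative and not taken from the dissertation; the proposed use clause is not shown because it is not part of standard OpenMP). The squaring and the final scaling are the parallel components; the accumulation between the sink and source directives is the sequential component.

```c
#include <assert.h>

/* DOACROSS loop: sum[i] depends on sum[i - 1] (loop-carried), while the
 * work before the sink and after the source is independent per iteration. */
void doacross_running_sum(const double *in, double *sum,
                          double *scaled, int n)
{
    assert(in && sum && scaled);
    if (n <= 0) {
        return;
    }
    /* Iteration 0 has no predecessor, so it is handled before the loop. */
    sum[0] = in[0] * in[0];
    scaled[0] = 0.5 * sum[0];

    /* ordered(1) declares a one-dimensional doacross loop nest. If the
     * file is compiled without OpenMP support the pragmas are ignored
     * and the loop runs serially with the same result. */
    #pragma omp parallel for ordered(1) schedule(static, 1)
    for (int i = 1; i < n; i++) {
        double t = in[i] * in[i];            /* parallel component */

        /* Wait until iteration i - 1 has signalled its source point. */
        #pragma omp ordered depend(sink: i - 1)
        sum[i] = sum[i - 1] + t;             /* sequential component */
        /* Signal that this iteration's sum[i] is ready. */
        #pragma omp ordered depend(source)

        scaled[i] = 0.5 * sum[i];            /* parallel component */
    }
}
```

With GCC or Clang the OpenMP behaviour is enabled by the `-fopenmp` flag. The `schedule(static, 1)` choice is an assumption made here so that consecutive iterations land on different threads; other schedules are also valid.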

Repositorio da Producao Cientifica e Intelectual da Unicamp

ACOTES project: Advanced compiler technologies for embedded streaming

Author: Albert Cohen
Alex Ramírez
Andrea Ornstein
Antoniu Pop
Ayal Zaks
Cupertino Miranda
Cédric Bastoul
David Ródenas
Dorit Nuzman
E. Blossom
E.A. Lee
Eduard Ayguadé
Erven Rohou
Harm Munk
Ira Rosen
J. Hoogerbrugge
Konrad Trifunović
Louis-Noël Pouchet
M. Gschwind
M. Wolfe
Marc Duranton
Marco Cornero
Menno Lindwer
Mohammed Fellahi
Paul Carpenter
Philippe Dumont
R. Allen
R.G. Scarborough
Razya Ladelsky
Roger Ferrer
S. Campanoni
Sebastian Pop
Uzi Shvadron
Xavier Martorell
Zbigniew Chamski
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2011

Streaming applications are built of data-driven computational components that consume and produce unbounded data streams. Streaming-oriented systems have become dominant in a wide range of domains, including embedded applications and DSPs. However, programming efficiently for streaming architectures is challenging: the computation must be carefully partitioned and mapped to processes in a way that best matches the underlying streaming architecture, taking into account the distributed resources (memory, processing, real-time requirements) and communication overheads (processing and delay). These challenges have led to a number of proposed solutions whose goal is to improve the programmer’s productivity in developing applications that process massive streams of data on programmable, parallel embedded architectures. StreamIt is one such example. Another, more recent approach is that developed by the ACOTES project (Advanced Compiler Technologies for Embedded Streaming). The ACOTES approach for streaming applications consists of compiler-assisted mapping of streaming tasks to highly parallel systems in order to maximize cost-effectiveness, both in terms of energy and in terms of design effort. The analysis and transformation techniques automate large parts of the partitioning and mapping process, based on the properties of the application domain, on quantitative information about the target systems, and on programmer directives. This paper presents the outcomes of the ACOTES project, a 3-year collaborative work of industrial (NXP, ST, IBM, Silicon Hive, NOKIA) and academic (UPC, INRIA, MINES ParisTech) partners, and advocates the use of the Advanced Compiler Technologies that we developed to support Embedded Streaming. Peer reviewed. Postprint (published version).

HAL-CentraleSupelec

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Crossref

UPCommons. Portal del coneixement obert de la UPC

INRIA a CCSD electronic archive server

HAL-MINES ParisTech

The University of Manchester - Institutional Repository

HAL-Rennes 1

Porting a Lattice Boltzmann Simulation to FPGAs Using OmpSs

Author: Calore Enrico
Schifano Sebastiano Fabio
Publication venue: IOS Press
Publication date: 01/01/2020

Reconfigurable computing, exploiting Field Programmable Gate Arrays (FPGAs), has become of great interest to both academic and industrial research thanks to the possibility of greatly accelerating a variety of applications. The interest has been further boosted by recent developments in FPGA programming frameworks, which allow applications to be designed at a higher level of abstraction, for example using directive-based approaches. In this work we describe our first experiences in porting an HPC application to FPGAs; the application simulates the Rayleigh-Taylor instability of fluids with different density and temperature using Lattice Boltzmann Methods. This activity is carried out in the context of the FET HPC H2020 EuroEXA project, which is developing an energy-efficient HPC system, at exa-scale level, based on Arm processors and FPGAs. We use the OmpSs directive-based programming model, one of the models available within the EuroEXA project. OmpSs is developed by the Barcelona Supercomputing Center (BSC) and allows targeting FPGA devices as accelerators, as well as commodity CPUs and GPUs, enabling code portability across different architectures. In particular, we describe the initial porting of this application, evaluate the programming effort required, and assess the preliminary performance on a Trenz development board hosting a Xilinx Zynq UltraScale+ MPSoC that embeds 16nm FinFET+ programmable logic and a multi-core Arm CPU.

Archivio istituzionale della ricerca - Università di Ferrara

Towards dynamic threading support for OpenMP

Author: Stadler Jacques
Publication venue: ETH, Swiss Federal Institute of Technology, Laboratory for Software Technology
Publication date: 01/01/2009

Repository for Publications and Research Data