Modular termination of C programs
In this paper, we describe a general method to prove termination of C programs in a scalable and modular way. The program to analyse is first reduced to the smallest relevant subset through a termination-specific slicing technique. The program is then divided into pieces of code that are analysed separately by an external termination engine. The result is implemented in the prototype \stoptool on top of our previous toolsuite WTC [compsys-termination-sas10], and preliminary results show the feasibility of the method.
High-Level Synthesis of Pipelined FSM from Loop Nests
Embedded systems raise many challenges in power, space and speed efficiency. The current trend is to build heterogeneous systems-on-chip with specialized processors and hardware accelerators. Generating a hardware accelerator from a computational kernel requires a deep reorganization of the code and the data. Typically, parallelism and memory bandwidth requirements are met thanks to fine-grain loop transformations. Unfortunately, the resulting control automaton is often very complex and eventually bounds the circuit frequency, which limits the benefits of the optimization. This is a major obstacle, which strongly limits the power of the code optimizations applicable by high-level synthesis tools. In this report, we propose an architecture of control automaton and a high-level synthesis algorithm that efficiently translate the control required by fine-grain loop optimizations. Unlike previous approaches, our control automaton can be pipelined at will, without any restriction; hence its frequency can be as high as required. Experimental results on FPGA confirm that our control circuit can reach a high frequency with a reasonable resource consumption.
Program Analysis and Source-Level Communication Optimizations for High-Level Synthesis
The use of hardware accelerators, e.g., GPGPUs or customized circuits on FPGAs, is particularly interesting for accelerating data- and compute-intensive applications. However, to get high performance, it is mandatory to restructure the application code, to generate adequate communication mechanisms, and to compile the different communicating processes so that the resulting application is highly optimized, with full usage of the memory bandwidth. In the context of the high-level synthesis (HLS) of hardware accelerators, we show how to automatically generate such an optimized organization for an accelerator communicating with an external DDR memory. Our technique relies on loop tiling, the generation of pipelined processes (overlapping communications and computations), and the automatic design (synchronizations and sizes) of local buffers. Our first contribution is a program analysis that specifies the data to be read from and written to the external memory so as to reduce communications and reuse data as much as possible in the accelerator. This specification, which can be used in different contexts, handles the cases where data can be redefined in the accelerator and/or approximations are needed because of non-analyzable data accesses. Our second contribution is an optimized code generation scheme, entirely at source level, that allows us to compile all the necessary glue (the communication processes) with the same HLS tool as the computation kernel. Both contributions use advanced polyhedral techniques for program analysis and transformation. Experiments with Altera HLS tools show the correctness and efficiency of our technique.
fkcc: the Farkas Calculator
In this paper, we present fkcc, a scripting tool to prototype program analyses and transformations exploiting the affine form of Farkas' lemma. Our language is general enough to prototype sophisticated termination and scheduling algorithms in a few lines. The tool is freely available and may be tried online via a web interface. We believe that fkcc is the missing link to accelerate the development of program analyses and transformations exploiting the affine form of Farkas' lemma.
Contributions to Program Optimization and High-Level Circuit Synthesis
Since the end of Dennard scaling, power efficiency has been the limiting factor for large-scale computing. Hardware accelerators such as reconfigurable circuits (FPGAs, CGRAs) or graphics processing units (GPUs) were introduced to improve performance under a limited energy budget, resulting in complex heterogeneous platforms. This document presents a synthetic description of my research activities over the last decade on compilers for high-performance computing and high-level synthesis (HLS) of circuits for FPGA accelerators. Specifically, my contributions cover both theoretical and practical aspects of automatic parallelization and HLS in a general theoretical framework called the polyhedral model.

A first chapter describes our contributions to loop tiling, a key program transformation for automatic parallelization which splits the computation into atomic blocks called tiles. We rephrase loop tiling in the polyhedral model to enable any polyhedral tile shape whose size depends on a single parameter (monoparametric tiling), and we present a tiling transformation for programs with reductions, i.e., accumulations w.r.t. an associative and commutative operator. Our results open the way for semantic program transformations: program transformations which do not preserve the computation but still lead to an equivalent program.

A second chapter describes our contributions to algorithm recognition. A compiler optimization will never replace a good algorithm, hence the idea to recognize algorithm instances in a program and to substitute them with a call to a performance library. In our PhD thesis, we addressed the recognition of templates, i.e., functions with first-order variables, in programs and its application to program optimization. We propose a complementary algorithm recognition framework which leverages our monoparametric tiling and our reduction tiling transformations. This automates semantic tiling, a new semantic program transformation which increases the grain of operators (scalar → matrix).

A third chapter presents our contributions to the synthesis of communications with an off-chip memory in the context of high-level circuit synthesis (HLS). We propose an execution model based on loop tiling, a pipelined architecture, and a source-level compilation algorithm which, connected to the C2H HLS tool from Altera, produces an FPGA configuration with minimized data transfers. Our compilation algorithm is optimal: data are loaded as late as possible and stored as soon as possible, with maximal reuse.

A fourth chapter presents our contributions to the design of a unified polyhedral compilation model for high-level circuit synthesis. We present Data-aware Process Networks (DPN), a dataflow intermediate representation which leverages the ideas developed in chapter 3 to make explicit the data transfers with an off-chip memory. We propose an algorithm to compile a DPN from a sequential program, and we present our contributions to the synthesis of a DPN into a circuit. In particular, we present our algorithms to compile the control, the channels and the synchronizations of a DPN. These results are used in the production compiler of the XtremLogic start-up.
Estimation of Parallel Complexity with Rewriting Techniques
We show how monotone interpretations, a termination analysis technique for term rewriting systems, can be used to assess the inherent parallelism of recursive programs manipulating inductive data structures. As a side effect, we show how monotone interpretations specify a parallel execution order, and how our approach naturally extends affine scheduling, a powerful analysis used in parallelising compilers, to recursive programs. This work opens new perspectives in automatic parallelisation.
On Channel Restructuring for Complete FIFO Recovery
Dataflow models of computation are a natural intermediate representation for high-level synthesis. Many criteria must be fulfilled to end up with an efficient circuit implementation, the first one being channel implementation. After scheduling the processes, it is very likely that producer/consumer communication patterns can no longer be implemented as FIFOs, causing major inefficiency in the final circuit: non-FIFO channels require additional synchronization circuitry and may dramatically slow down the whole implementation. In this poster, we focus on a popular scheduling technique, loop tiling, widely used in automatic parallelization, and we study an algorithm to restructure the channels so that the FIFOs broken by the loop tiling are restored. We exhibit a class of process networks -- the data-aware process networks -- for which the recovery is complete: after a loop tiling, all the FIFOs can always be recovered. Experimental results confirm the completeness of the recovery within the DPN class -- and measure its incompleteness outside of the DPN class.
Multi-dimensional Rankings, Program Termination, and Complexity Bounds of Flowchart Programs
Proving the termination of a flowchart program can be done by exhibiting a ranking function, i.e., a function from the program states to a well-founded set, which strictly decreases at each program step. A standard method to automatically generate such a function is to compute invariants for each program point and to search for a ranking in a restricted class of functions that can be handled with linear programming techniques. Previous algorithms based on affine rankings either are applicable only to simple loops (i.e., single-node flowcharts) and rely on enumeration, or are not complete in the sense that they are not guaranteed to find a ranking in the class of functions they consider, if one exists. Our first contribution is to propose an efficient algorithm to compute ranking functions: It can handle flowcharts of arbitrary structure, the class of candidate rankings it explores is larger, and our method, although greedy, is provably complete. Our second contribution is to show how to use the ranking functions we generate to get upper bounds for the computational complexity (number of transitions) of the source program. This estimate is a polynomial, which means that we can handle programs with more than linear complexity. We applied the method to a collection of test cases from the literature. We also show the links and differences with previous techniques based on the insertion of counters.
Bounding the Computational Complexity of Flowchart Programs with Multi-dimensional Rankings
Proving the termination of a flowchart program can be done by exhibiting a ranking function, i.e., a function from the program states to a well-founded set, which strictly decreases at each program step. A standard method to automatically generate such a function is to compute invariants for each program point and to search for a ranking in a restricted class of functions that can be handled with linear programming techniques. Our first contribution is to propose an efficient algorithm to compute ranking functions: It can handle flowcharts of arbitrary structure, the class of candidate rankings it explores is larger, and our method, although greedy, is provably complete. Our second contribution is to show how to use the ranking functions we generate to get upper bounds for the computational complexity (number of transitions) of the source program, again for flowcharts of arbitrary structure. This estimate is a polynomial, which means that we can handle programs with more than linear complexity. We applied the method to a collection of test cases from the literature. We also point out important extensions, mainly to do with the scalability of the algorithm and, in particular, the integration of techniques based on cutpoints.