87 research outputs found

    N–Dimensional Orthogonal Tile Sizing Problem

    Get PDF
    AMS subject classification: 68Q22, 90C90We discuss in this paper the problem of generating highly efficient code when a n + 1-dimensional nested loop program is executed on a n-dimensional torus/grid of distributed-memory general-purpose machines. We focus on a class of uniform recurrences with non-negative components of the dependency matrix. Using tiling the iteration space strategy we show that minimizing the total running time reduces to solving a non-trivial non-linear integer optimization problem. For the later we present a mathematical framework that enables us to derive an O(n log n) algorithm for finding a good approximate solution. The theoretical evaluations and the experimental results show that the obtained solution approximates the original minimum sufficiently well in the context of the considered problem. Such algorithm is realtime usable for very large values of n and can be used as optimization techniques in parallelizing compilers as well as in performance tuning of parallel codes by hand

    Extracting Data-Level Parallelism in High-Level Synthesis for Reconfigurable Architectures

    Get PDF
    High-Level Synthesis (HLS) tools are a set of algorithms that allow programmers to obtain implementable Hardware Description Language (HDL) code from specifications written high-level, sequential languages such as C, C++, or Java. HLS has allowed programmers to code in their preferred language while still obtaining all the benefits hardware acceleration has to offer without them needing to be intimately familiar with the hardware platform of the accelerator. In this work we summarize and expand upon several of our approaches to improve the automatic memory banking capabilities of HLS tools targeting reconfigurable architectures, namely Field-Programmable Gate Arrays or FPGA\u27s. We explored several approaches to automatically find the optimal partition factor and a usable banking scheme for stencil kernels including a tessellation based approach using multiple families of hyperplanes to do the partitioning which was able to find a better banking factor than current state-of-the-art methods and a graph theory methodology that allowed us to mathematically prove the optimality of our banking solutions. For non-stencil kernels we relaxed some of the conditions in our graph-based model to propose a best-effort solution to arbitrarily reduce memory access conflicts (simultaneous accesses to the same memory bank). We also proposed a non-linear transformation using prime factorization to convert a small subset of non-stencil kernels into stencil memory accesses, allowing us to use all previous work in memory partition to them. Our approaches were able to obtain better results than commercial tools and state-of-the-art algorithms in terms of reduced resource utilization and increased frequency of operation. We were also able to obtain better partition factors for some stencil kernels and usable baking schemes for non-stencil kernels with better performance than any applicable existing algorithm

    Le tuilage mono-paramétrique est une transformation polyédrique

    Get PDF
    Tiling is a crucial program transformation with many benefits: it improves locality, exposes parallelism, allows for adjusting the ops-to-bytes balance of codes, and can be applied at multiple levels. Allowing tile sizes to be symbolic parameters at compile time has many benefits, including efficient autotuning, and run-time adaptability to system variations. For polyhedral programs, parametric tiling in its full generality is known to be non-linear, breaking the mathematical closure properties of the polyhedral model. Most compilation tools therefore either avoid it by only performing fixed size tiling, or apply it in only the final, code generation step. Both strategies have limitations. We first introduce mono-parametric partitioning, a restricted parametric, tiling-like transformation which can be used to express a tiling. We show that, despite being parametric, it is a polyhedral transformation. We first prove that applying mono-parametric partitioning (i) to a polyhedron yields a union of polyhedra, and (ii) to an affine function produces a piecewise-affine function. We then use these properties to show how to partition an entire polyhedral program, including one with reductions. Next, we generalize this transformation to tiles with arbitrary tile shapes that can tesselate the iteration space (e.g., hexagonal, trapezoidal, etc). We show how mono-parametric tiling can be applied at multiple levels, and enables a wide range of polyhedral analysis and transformations to be applied

    Contributions à l'optimisation de programmes et à la synthèse de circuits haut-niveau

    Get PDF
    Since the end of Dennard scaling, power efficiency is the limiting factor for large-scale computing. Hardware accelerators such as reconfigurable circuits (FPGA, CGRA) or Graphics Processing Units (GPUs) were introduced to improve the performance under a limited energy budget, resulting into complex heterogeneous platforms. This document presents a synthetic description of my research activities over the last decade on compilers for high-performance computing and high-level synthesis of circuits (HLS) for FPGA accelerators. Specifically, my contributions covers both theoretical and practical aspects of automatic parallelization and HLS in a general theoretical framework called the polyhedral model.A first chapter describes our contributions to loop tiling, a key program transformation for automatic parallelization which splits the computation atomic blocks called tiles.We rephrase loop tiling in the polyhedral model to enable any polyhedral tile shape whose size depends on a single parameter (monoparametric tiling), and we present a tiling transformation for programs with reductions – accumulations w.r.t. an associative/commutative operator. Our results open the way for semantic program transformations ; program transformations which does not preserve the computation but still lead to an equivalent program.A second chapter describes our contributions to algorithm recognition. A compiler optimization will never replace a good algorithm, hence the idea to recognize algorithm instances in a program and to substitute them by a call to a performance library. In our PhD thesis, we have addressed the recognition of templates – functionswith first-order variables – into programs and its application to program optimization. We propose a complementary algorithm recognition framework which leverages our monoparametric tiling and our reduction tiling transformations. This automates semantic tiling, a new semantic program transformation which increases the grain of operators (scalar → matrix).A third chapter presents our contributions to the synthesis of communications with an off-chip memory in the context of high-level circuit synthesis (HLS). We propose an execution model based on loop tiling, a pipelined architecture and a source-level compilation algorithm which, connected to the C2H HLS tool from Altera, ends up to a FPGA configuration with minimized data transfers. Our compilation algorithm is optimal – the data are loaded as late as possible and stored as soon as possible with a maximal reuse.A fourth chapter presents our contributions to design a unified polyhedral compilation model for high-level circuit synthesis.We present the Data-aware Process Networks (DPN), a dataflow intermediate representation which leverages the ideas developed in chapter 3 to explicit the data transfers with an off-chip memory. We propose an algorithm to compile a DPN from a sequential program, and we present our contribution to the synthesis of DPN to a circuit. In particular, we present our algorithms to compile the control, the channels and the synchronizations of a DPN. These results are used in the production compiler of the Xtremlogic start-up.Depuis la fin du Dennard scaling, l’efficacité énergétique est le facteur limitant pour le calcul haute performance. Les accélérateurs matériels comme les circuits reconfigurables (FPGA, CGRA) ou les accélérateurs graphiques (GPUs) ont été introduits pour améliorer les performances sous un budget énergétique limité, menant à des plateformes hétérogènes complexes.Mes travaux de recherche portent sur les compilateurs et la synthèse de circuits haut-niveau (High-Level Synthesis, HLS) pour le calcul haute-performance. Specifiquement, mes contributions couvrent les aspects théoriques etpratiques de la parallélisation automatique et la HLS dans le cadre général du modèle polyédrique.Un premier chapitre décrit mes contributions au tuilage de boucles, une transformation fondamentale pour la parallélisation automatique, qui découpe le calcul en sous-calculs atomiques appelés tuiles. Nous reformulons le tuilage de boucles dans le modèle polyédrique pour permettre n’importe tuile polytopique dont la taille dépend d’un facteur homothétique (tuilage monoparamétrique), et nous décrivons une transformation de tuilage pour des programmes avec des réductions – une accumulation selon un opérateur associative et commutatif. Nos résultats ouvrent la voie à des transformations de programme sémantiques ; qui ne préservent pas le calcul, mais produisent un programme équivalent.Un second chapitre décrit mes contributions à la reconnaissance d’algorithmes. Une optimisation de compilateur ne remplacera jamais un bon algorithme, d’où l’idée de reconnaître les instances d’un algorithme dans un programme et de les substituer par un appel vers une bibliothèque hauteperformance, chaque fois que c’est possible et utile.Dans notre thèse, nous avons traité la reconnaissance de templates – des fonctions avec des variables d’ordre 1 – dans un programme et son application à l’optimisation de programes. Nous proposons une approche complémentaire qui s’appuie sur notre tuilage monoparamétrique complété par une transformation pour tuiler les réductions. Ceci automatise le tuilage sémantique, une nouvelle transformation sémantique qui augmente le grain des opérateurs (scalaire → matrice).Un troisième chapitre présente mes contributions à la synthèse des communications avec une mémoire off-chip dans le contexte de la synthèse de circuits haut-niveau. Nous proposons un modèle d’exécution basé sur le tuilage de boucles, une architecture pipelinée et un algorithme de compilation source-à-source qui, connecté à l’outil de HLS C2H d’Altera, produit une configuration de circuit FPGA qui réalise un volume minimal de transferts de données. Notre algorithme est optimal – les données sont chargées le plus tard possible et stockées le plus tôt possible, avec une réutilisation maximale et sans redondances.Enfin, un quatrième chapitre présente mes contributions pour construire un modèle de compilation polyédrique unifié pour la synthèse de circuits haut-niveau.Nous présentons les réseaux de processus DPN (Data-aware Process Networks), une représentation intermédiaire dataflow qui s’appuie sur les idées développées au chapitre 3 pour expliciter les transferts de données entre le circuit et la mémoire off-chip. Nous proposons une suite d’algorithmes pour compiler un DPN à partir d’un programme séquentiel et nous présentons nos contributions à la synthèse d’un DPN en circuit. En particulier, nous présentons nos algorithmes pour compiler le contrôle, les canaux et les synchronisations d’un DPN. Ces résultats sont utilisés dans le compilateur de production de la start-up XtremLogic

    Motion-Planning and Control of Autonomous Vehicles to Satisfy Linear Temporal Logic Specifications

    Get PDF
    Motion-planning is an essential component of autonomous aerial and terrestrial vehicles. The canonical Motion-planning problem, which is widely studied in the literature, is of planning point-to-point motion while avoiding obstacles. However, the desired degree of vehicular autonomy has steadily risen, and has consequently led to motion-planning problems where a vehicle is required to accomplish a high-level intelligent task, rather than simply move between two points. One way of specifying such intelligent tasks is via linear temporal logic (LTL) formulae. LTL is a formal logic system that includes temporal operators such as always, eventually, and until besides the usual logical operators. For autonomous vehicles, LTL formulae can concisely express tasks such as persistent surveillance, safety requirements, and temporal orders of visits to multiple locations. Recent control theoretic literature has discussed the generation of reference trajectories and/or the synthesis of feedback control laws to enable a vehicle to move in manners that satisfy LTL specifications. A crucial step in such synthesis is the generation of a so-called discrete abstraction of a vehicle kinematic/dynamic model. Typical techniques of generating a discrete abstraction require strong assumptions on controllability and/or linearity. This dissertation discusses fast motion-planning and control techniques to satisfy LTL specifications for vehicle models with nonholonomic kinematic constraints, which do not satisfy the aforesaid assumptions. The main contributions of this dissertation are as follows. First, we present a new technique for constructing discrete abstractions of a Dubins vehicle model (namely, a vehicle that moves forward at a constant speed with a minimum turning radius). This technique relies on the so-called method of lifted graphs and precomputed reachable set calculations. Using this technique, we provide an algorithm to generate vehicle reference trajectories satisfying LTL specifications without requiring complete controllability in the presence of workspace constraints, and without requiring linearity or linearization of the vehicle model. Second, we present a technique for centralized motion-planning for a team of vehicles to collaboratively satisfy a common LTL specification. This technique is also based on the method of lifted graphs. Third, we present an incremental version of the proposed motion-planning techniques, which has an “anytime property. This property means that a feasible solution is computed quickly, and the iterative updates are made to this solution with a guarantee of convergence to an optimal solution. This version is suited for real-time implementation, where a hard bound on the computation time is imposed. Finally, we present a randomized sampling-based technique for generating reference trajectories that satisfy given LTL specifications. This technique is an alternative to the aforesaid technique based on lifted graphs. We illustrate the proposed techniques using numerical simulation examples. We demonstrate the superiority of the proposed techniques in comparison to the existing literature in terms of computational time and memory requirements

    Automatic Parallelization of Tiled Stencil Loop Nests on GPUs

    Full text link
    This thesis attempts to design and implement a compiler framework based on the polyhedral model. The compiler automatically parallelizes loop nests; especially stencil kernels, into efficient GPU code by loop tiling transformations which the polyhedral model describes. To enhance parallel performance, we introduce three practically efficient techniques to process different types of loop nests. The experimental results of our compiler framework have demonstrated that these advanced techniques can outperform previous approaches. Firstly, we aim to find efficient tiling transformations without violating data dependences. How to select a tile's shape and size is an open issue that is performance-critical and influenced by GPU's hardware constraints. We propose an approach to determine the tile shapes out of consideration for improving two-level parallelism of GPUs. The new approach finds appropriate tiling hyperplanes by embedding parallelism-enhancing constraints into the polyhedral model to maximize intra-tile, i.e., intra-SM parallelism. This improves the load balance among the streaming processors (SPs), which execute a wavefront of loop iterations within a tile. We eliminate parallelism-hindering false dependences to optimize inter-tile, i.e., inter-SM parallelism. This improves the load balance among the streaming multiprocessors (SMs), which execute a wavefront of tiles. Furthermore, to avoid combinatorial explosion of tile size's configurations, we present a model-driven approach to automating tile size selection that is performance-critical for loop tiling transformations, especially for DOACROSS loop nests. Our tile size selection model accurately estimates the execution times of tiled loop nests running on GPUs. The selected tile sizes lead to the performance results that are close to the best observed for a range of problem sizes tested. Finally, to address the difficulty and low-performance of parallelizing widely used SOR stencil loop nests, we present a new tiled parallel SOR method, called MLSOR, which admits more efficient data-parallel SIMD execution on GPUs. Unlike the previous two approaches that are dependence-preserving, the basic idea is to algorithmically restructure a stencil kernel based on a non-dependence-preserving parallelization scheme to avoid pipelining for higher parallelism. The new approach can be implemented in compilers through a pattern matching pass to optimize SOR-like DOACROSS loop nests on GPUs

    Proceedings of the 3rd International Workshop on Polyhedral Compilation Techniques

    Get PDF
    IMPACT 2013 in Berlin, Germany (in conjuction with HiPEAC 2013) is the third workshop in a series of international workshops on polyhedral compilation techniques. The previous workshops were held in Chamonix, France (2011) in conjuction with CGO 2011 and Paris, France (2012) in conjuction with HiPEAC 2012

    Multilevel tiling for non-rectangular interation spaces

    Get PDF
    La motivación principal de esta tesis es el desarrollo de nuevas técnicas de compilación dirigidas a conseguir mayor rendimiento encódigos numéricos complejos que definen es pacios de iteraciones no rectangulares. En particular, nos centramos en la trasformación de "loop tiling" (también conocida como "blocking") y nuestro propósito es mejorar la transformación de loop tiling cuando se aplica a códigos numéricos complejos. Nuestro objetivo es conseguir, a través de la transformación de loop tiling, los mismos o mejores rendimientos que las librerías numéricas proporcionadas por el fabricante que están optimizadas manualmente.En la tesis se muestra que la razón principal por la que los compiladores comerciales actuales consiguen bajos rendimiento en este tipo de aplicaciones es que no son capaces de aplicar loop tiling a nivel de registros. En su lugar, para mejorar la localidad de los datos y el ILP, los compiladores actuales usan y combinan otras transformaciones que no explotan el nivel de registros tan bien como loop tiling. Previamente no se ha considerado aplicar loop tiling a nivel de registro porque en códigos numéricos complejos no es trivial debido a la naturaleza irregular de los espacios de iteraciones. La primera contribución de esta tesis es un algoritmo general de loop tiling a nivel de registros que es aplicable a cualquier tipo de espacio de iteraciones y no sólo a los espacios rectangulares. Nuestro método incluye una heurística muy sencilla para decidir los parámetros de los cortes a nivel de registros. A primera vista parece que loop tiling a nivel de registros (a partir de ahora, register tiling) se tiene que aplicar de tal manera que el bucle que ofrece más reuso temporal de los datos no debe de ser partido. De esta manera maximizamos la reutilización de los registros y minimizamos el número total de load/stores ejecutados. Sin embargo, mostraremos que en espacios de iteraciones no rectangulares, si solamente tenemos en cuenta las direcciones del reuso y no la forma del espacio de iteraciones, los códigos pueden sufrir una degradación en rendimiento. Nuestra segunda contribución es la propuesta de una heurística muy sencilla que determina los parámetros del tiling a nivel de registros considerando no sólo el reuso temporal sino también la forma del espacio de iteraciones. Además, la heurística es suficientemente sencilla para que pueda ser implementada en un compilador comercial.Sin embargo, para conseguir rendimientos similares que códigos optimizados a mano, no es suficiente con aplicar loop tiling a nivel de registros. Con las arquitecturas de hoy en día que disponen de jerarquías de memoria complejas y múltiples procesadores, es necesario que el compilador aplique loop tiling en cuatro o más niveles (paralelismo, cache L2, cache L1 y registros) para conseguir altos rendimientos. Por lo tanto, en las arquitecturas actuales es crucial tener un algoritmo eficiente para aplicar loop tiling en varios niveles de la jerarquía de memoria (tiling multinivel). Además, como mostramos en esta tesis, la transformación de tiling multinivel siempre tendrá que incluir el nivel de registro porque este es el nivel de la jerarquía de memoria que ofrece mayor rendimiento cuando es tratado correctamente.Cuando tiling multinivel incluye el nivel de registros, es necesario que los límites de los bucles sean exactos y que no haya límites redundantes. La razón es que la complejidad y la cantidad de código que se genera con nuestra técnica de register tiling depende polinómicamente del número de límites de los bucles.Sin embargo, hasta ahora, el problema de calcular límites exactos y eliminar límites redundantes es que todas las técnicas conocidas son muy caras en términos de tiempo de compilación y, por ello, difícil de integrar en un compilador comercial. La tercera contribución de esta tesis es una nueva implementación de tiling multinivel que calcula límites exactos y es mucho menos costosa que técnicas tradicionales. Mostraremos que la complejidad de nuestra implementación es proporcional a la complejidad de aplicar una permutación de bucles en el código original (antes de aplicar loop tiling), mientras que las técnicas tradicionales tienen complejidades más altas. Además, nuestra implementación genera menos límites redundantes y permite eliminar los límites redundantes que quedan a menor coste. En conjunto, la eficiencia de nuestra implementación hace posible que pueda ser implementada dentro de un compilador comercial sin tener que preocuparse por los tiempos de compilación.La última parte de esta tesis está dedicada al estudio del rendimiento de tiling multinivel. Se muestran los efectos de tiling en los diferentes niveles de memoria y presentamos datos que comparan los beneficios de tiling a nivel de registros, tiling a nivel de cache y tiling a los dos niveles, cache y registros, simultáneamente. Finalmente, comparamos el rendimiento de códigos optimizados automáticamente con códigos optimizados manualmente (librerías numéricas que ofrecen los fabricantes) sobre dos arquitecturas diferentes (ALPHA 21164 and MIPS R10000) para concluir que actualmente la tecnología de los compiladores hace posible que códigos numéricos complejos consigan el mismo rendimiento que códigos optimizados manualmente.The main motivation of this thesis is to develop new compilation techniques that address the lack of performance of complex numerical codes consisting of loop nests defining non-rectangular iteration spaces. Specifically, we focus on the loop tiling transformation (also known as blocking) and our purpose is the improvement of the loop tiling transformation when dealing with complex numerical codes. Our goal is to achieve via the loop tiling transformation the same or better performance as hand-optimized vendor-supplied numerical libraries. We will observe that the main reason why current commercial compilers perform poorly when dealing with this type of codes is that they do not apply tiling for the register level. Instead, to enhance locality at this level and to improve ILP, they use/combine other transformations that do not exploit the register level as well as loop tiling. Tiling for the register level has not generally been considered because, in complex numerical codes, it is far from being trivial due to the irregular nature of the iteration space. Our first contribution in this thesis will be a general compiler algorithm to perform tiling at the register level that handles arbitrary iteration space shapes and not only simple rectangular shapes.Our method includes a very simple heuristic to make the tile decisions for the register level. At first sight, register tiling should be performed so that whichever loop carries the most temporal reuse is not tiled. This way, register reuse is maximized and the number of load/store instructions executed is minimized. However, we will show that, for complex loop nests, if we only consider reuse directions and do not take into account the iteration space shape, the tiled loop nest can suffer performance degradation. Our second contribution will be a proposal of a very simple heuristic to determine the tiling parameters for the register level, that considers not only temporal reuse, but also the iteration space shape. Moreover, the heuristic is simple enough to be suitable for automatic implementation by compilers.However, to be able to achieve similar performance to hand-optimized codes, it is not enough by tiling only for the register level. With today's architectures having complex memory hierarchies and multiple processors, it is quite common that the compiler has to perform tiling at four or more levels (parallelism, L2-cache, L1-cache and registers) in order to achieve high performance. Therefore, in today's architectures it is crucial to have an efficient algorithm that can perform multilevel tiling at multiple levels of the memory hierarchy. Moreover, as we will see in this thesis, multilevel tiling should always include the register level, as this is the memory hierarchy level that yields most performance when properly tiled.When multilevel tiling includes the register level, it is critical to compute exact loop bounds and to avoid the generation of redundant bounds. The reason is that the complexity and the amount of code generated by our register tiling technique both depend polynomially on the number of loop bounds. However, to date, the drawback of generating exact loop bounds and eliminating redundant bounds has been that all techniques known were extremely expensive in terms of compilation time and, thus, difficult to integrate in a production compiler. Our third contribution in this thesis will be a new implementation of multilevel tiling that computes exact loop bounds at a much lower complexity than traditional techniques. In fact, we will show that the complexity of our implementation is proportional to the complexity of performing a loop permutation in the original loop nest (before tiling), while traditional techniques have much larger complexities. Moreover, our implementation generates less redundant bounds in the multilevel tiled code and allows removing the remaining redundant bounds at a lower cost. Overall, the efficiency of our implementation makes it possible to integrate multilevel tiling including the register level in a production compiler without having to worry about compilation time.The last part of this thesis is dedicated to studying the performance of multilevel tiling. We will discuss the effects of tiling for different memory levels and present quantitative data comparing the benefits of tiling only for the register level, tiling only for the cache level and tiling for both levels simultaneously. Finally, we will compare automatically-optimized codes against hand-optimized vendor-supplied numerical libraries, on two different architectures (ALPHA 21164 and MIPS R10000), to conclude that compiler technology can make it possible for complex numerical codes to achieve the same performance as hand-optimized codes on modern microprocessors
    • …
    corecore