
    Beyond Reuse Distance Analysis: Dynamic Analysis for Characterization of Data Locality Potential

    Emerging computer architectures will feature drastically decreased flops/byte (ratio of peak processing rate to memory bandwidth), as highlighted by recent studies on exascale architectural trends. Further, flops are getting cheaper while the energy cost of data movement is increasingly dominant. Understanding and characterizing the data locality properties of computations is critical in order to guide efforts to enhance data locality. Reuse distance analysis of memory address traces is a valuable tool for the data locality characterization of programs. A single reuse distance analysis can be used to estimate the number of cache misses in a fully associative LRU cache of any size, thereby providing estimates of the minimum bandwidth requirements at different levels of the memory hierarchy to avoid being bandwidth bound. However, such an analysis only holds for the particular execution order that produced the trace. It cannot estimate the potential improvement in data locality through dependence-preserving transformations that change the execution schedule of the operations in the computation. In this article, we develop a novel dynamic analysis approach to characterize the inherent locality properties of a computation and thereby assess the potential for data locality enhancement via dependence-preserving transformations. The execution trace of a code is analyzed to extract a computational directed acyclic graph (CDAG) of the data dependences. The CDAG is then partitioned into convex subsets, and the convex partitioning is used to reorder the operations in the execution trace to enhance data locality. The approach enables us to go beyond reuse distance analysis of a single specific execution order of the operations of a computation when characterizing its data locality properties. It can serve a valuable role in identifying promising code regions for manual transformation, as well as in assessing the effectiveness of compiler transformations for data locality enhancement. We demonstrate the effectiveness of the approach on a number of benchmarks, including case studies where the potential shown by the analysis is exploited to achieve lower data movement costs and better performance. (Comment: ACM Transactions on Architecture and Code Optimization, 2014.)
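
    For reference, the baseline this paper builds on can be stated concretely: an access's reuse distance is the number of distinct addresses touched since the previous access to the same address, and an access to a fully associative LRU cache of capacity C hits exactly when that distance is below C. Below is a minimal Python sketch of this classic single-trace analysis (not the paper's CDAG-based extension); the trace format and function names are our own illustration.

        def reuse_distances(trace):
            """Reuse distance of every access in `trace` (a list of addresses)."""
            stack = []            # LRU stack: most recently used address at the end
            dists = []
            for addr in trace:
                if addr in stack:
                    dists.append(len(stack) - 1 - stack.index(addr))
                    stack.remove(addr)
                else:
                    dists.append(float("inf"))    # first touch: cold miss
                stack.append(addr)
            return dists

        def lru_misses(trace, cache_size):
            """Estimated misses in a fully associative LRU cache of `cache_size` lines."""
            return sum(1 for d in reuse_distances(trace) if d >= cache_size)

        trace = ["a", "b", "c", "a", "b", "d", "a"]
        print(reuse_distances(trace))   # [inf, inf, inf, 2, 2, inf, 2]
        print(lru_misses(trace, 2))     # 7: every access misses
        print(lru_misses(trace, 4))     # 4: only the cold misses remain

    The explicit LRU stack makes the analysis O(N*M) for N accesses over M distinct addresses, which is fine for illustration; production tools use tree-based structures.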

    The Polyhedral Model Beyond Loops Recursion Optimization and Parallelization Through Polyhedral Modeling

    There may be a huge gap between the statements written by programmers in a program's source code and the instructions actually performed by a given processor architecture when running the executable. This gap is due to the way the input code has been interpreted, translated, and transformed by the compiler and the final processor hardware. Thus, there is an opportunity for efficient optimization strategies, dedicated to specific control structures and memory access patterns, to be applied as soon as the actual runtime behavior has been discovered, even if they could not have been applied to the original source code. In this paper, we develop this idea by identifying code extracts that behave as polyhedral-compliant loops at runtime, while not having been written as loops at all in the original source code. In particular, we are interested in recursive functions whose runtime behavior can be modeled as polyhedral loops. Therefore, the scope of this study exclusively includes recursive functions whose control flow and memory accesses exhibit an affine behavior, which means that there exists a semantically equivalent affine loop nest, a candidate for polyhedral optimizations. Accordingly, our approach is based on analyzing early executions of a recursive program using a Nested Loop Recognition (NLR) algorithm, performing the affine loop modeling of the original program's runtime behavior, which is then used to generate an equivalent iterative program, finally optimized using the polyhedral compiler Polly. We present preliminary results showing that this approach raises recursion optimization techniques to a higher level, in addition to widening the scope of the polyhedral model to include originally non-loop programs.
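
    As a concrete illustration of the targeted class (our own toy example, not one of the paper's benchmarks), the recursive function below has affine control flow and memory accesses, so it admits a semantically equivalent affine loop that a polyhedral compiler such as Polly could then tile or vectorize:

        def scale_rec(a, i=0):
            """Recursive form: accesses a[i] for i = 0, 1, ..., len(a)-1."""
            if i >= len(a):
                return
            a[i] *= 2.0          # affine access function: addr(i) = i
            scale_rec(a, i + 1)  # affine recursion: i -> i + 1

        def scale_loop(a):
            """Equivalent iterative form the rewriting would produce."""
            for i in range(len(a)):   # same iteration domain {i | 0 <= i < n}
                a[i] *= 2.0

        data = [1.0, 2.0, 3.0]
        scale_rec(data)
        print(data)   # [2.0, 4.0, 6.0]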

    Rec2Poly: Converting Recursions to Polyhedral Optimized Loops Using an Inspector-Executor Strategy

    In this paper, we propose Rec2Poly, a framework which automatically detects whether recursive programs may be transformed into affine loops that are compliant with the polyhedral model. If successful, the replacing loops can then take advantage of advanced loop optimizing and parallelizing transformations such as tiling or skewing. Rec2Poly is made of two main phases: an offline profiling phase and an inspector-executor phase. In the profiling phase, the original recursive program, which has been instrumented, is run. Whenever possible, the trace of collected information is used to build equivalent affine loops from the runtime behavior. Then, an inspector-executor program is automatically generated, where the inspector is a light version of the original recursive program whose role is reduced to generating and verifying the information essential to ensure the correctness of the equivalent affine loop program. The collected information is mainly related to the touched memory addresses and the control flow of the so-called "impacting" basic blocks of instructions. Moreover, in order to exhibit the lowest possible time overhead, the inspector is implemented as a parallel process where several memory buffers of information are verified simultaneously. Finally, the executor is made of the equivalent affine loops that have been optimized and parallelized.
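
    The inspector-executor split can be shown with a deliberately small Python sketch (our own, far simpler than Rec2Poly): here the recursion follows an explicit next-pointer array, the inspector verifies only that this control flow is affine, and the executor runs the equivalent loop when the check succeeds.

        def sum_rec(a, nxt, i):
            """Original recursion: follows an explicit 'next' pointer array."""
            if i < 0:
                return 0.0
            return a[i] + sum_rec(a, nxt, nxt[i])

        def inspector(nxt):
            """Verify the control flow is affine: next[i] is i+1, ending in -1.
            This is the information needed to justify the loop version."""
            n = len(nxt)
            return all(nxt[i] == (i + 1 if i + 1 < n else -1) for i in range(n))

        def executor(a):
            """Equivalent affine loop, valid only when the inspector succeeds."""
            return sum(a[i] for i in range(len(a)))

        a = [1.0, 2.0, 3.0]
        nxt = [1, 2, -1]
        result = executor(a) if inspector(nxt) else sum_rec(a, nxt, 0)
        print(result)   # 6.0, via the optimized loop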

    Identification of regular patterns within sparse data structures

    Sparse matrix-vector multiplication (SpMV) is an essential computation in linear algebra. There is a well-known trade-off between operating on a dense or a sparse structure when performing SpMV. In the dense version of SpMV, useless operations are performed, but the computation is amenable to SIMD vectorization. In the sparse version, only useful operations are executed. However, an indirection array must be used, thus hindering the compiler's ability to perform optimizations that exploit the vector units available on the majority of modern processors. Our process automatically builds sets of regular sub-computations from the irregular sparse data structure. We mine for regular regions in the irregular data structure, grouping together non-contiguous points from the reorderable set of coordinates representing the sparse structure. The coordinates are partitioned into groups of pre-defined shapes described by polyhedra. This partition models exactly the same points as the input set of coordinates, in a way that is specialized to the input's sparsity pattern. Once we have obtained a partition of the points into sets of polyhedra, we scan these polyhedra to synthesize code that does not store any coordinates of zero-valued elements and does not require any indirection array to access data, thus making it amenable to SIMD vectorization.
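
    A simplified sketch of the mining step (our own illustration, using 1-D strips rather than general polyhedral shapes): group the nonzero coordinates into maximal runs of consecutive columns within a row, so that each run becomes a small dense loop that needs no indirection array.

        def mine_strips(coords):
            """coords: iterable of (row, col) nonzero positions.
            Returns (row, col_start, length) strips covering exactly the input."""
            strips = []
            for r, c in sorted(coords):
                if strips and strips[-1][0] == r and strips[-1][1] + strips[-1][2] == c:
                    strips[-1][2] += 1          # extend the current run
                else:
                    strips.append([r, c, 1])    # start a new run
            return [tuple(s) for s in strips]

        def spmv_strips(strips, vals, x, n_rows):
            """y = A @ x, with A given as strips plus values in strip order."""
            y = [0.0] * n_rows
            k = 0
            for r, c0, length in strips:
                for j in range(length):         # regular inner loop: SIMD-friendly
                    y[r] += vals[k] * x[c0 + j]
                    k += 1
            return y

        coords = [(0, 1), (0, 2), (0, 3), (1, 0), (2, 2), (2, 3)]
        strips = mine_strips(coords)
        print(strips)                                     # [(0, 1, 3), (1, 0, 1), (2, 2, 2)]
        print(spmv_strips(strips, [1.0] * 6, [1.0] * 4, 3))  # [3.0, 1.0, 2.0]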

    PyCUDA and PyOpenCL: A Scripting-Based Approach to GPU Run-Time Code Generation

    High-performance computing has recently seen a surge of interest in heterogeneous systems, with an emphasis on modern Graphics Processing Units (GPUs). These devices offer tremendous potential for performance and efficiency in important large-scale applications of computational science. However, exploiting this potential can be challenging, as one must adapt to the specialized and rapidly evolving computing environment currently exhibited by GPUs. One way of addressing this challenge is to embrace better techniques and to develop tools tailored to their needs. This article presents one simple technique, GPU run-time code generation (RTCG), along with PyCUDA and PyOpenCL, two open-source toolkits that support this technique. In introducing PyCUDA and PyOpenCL, this article proposes the combination of a dynamic, high-level scripting language with the massive performance of a GPU as a compelling two-tiered computing platform, potentially offering significant performance and productivity advantages over conventional single-tier, static systems. The concept of RTCG is simple and easily implemented using existing, robust infrastructure. Nonetheless, it is powerful enough to support (and encourage) the creation of custom application-specific tools by its users. The premise of the paper is illustrated by a wide range of examples where the technique has been applied with considerable success. (Comment: Submitted to Parallel Computing, Elsevier.)
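
    The RTCG idea is easy to show with PyCUDA's actual API: the kernel source is an ordinary Python string, so application-specific constants can be baked into the CUDA code at run time, just before JIT compilation. This minimal example requires a CUDA-capable GPU and the pycuda package; the kernel name and scaling factor are our own choices.

        import numpy as np
        import pycuda.autoinit              # creates a context on the default GPU
        import pycuda.driver as drv
        from pycuda.compiler import SourceModule

        def make_scaler(factor):
            src = """
            __global__ void scale(float *a)
            {
                int i = blockIdx.x * blockDim.x + threadIdx.x;
                a[i] *= %f;
            }
            """ % factor                    # run-time code generation step
            return SourceModule(src).get_function("scale")

        scale = make_scaler(3.0)
        a = np.arange(8, dtype=np.float32)
        scale(drv.InOut(a), block=(8, 1, 1), grid=(1, 1))
        print(a)   # [ 0.  3.  6.  9. 12. 15. 18. 21.]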

    Loop-based Modeling of Parallel Communication Traces

    This paper describes an algorithm that takes a trace of a distributed program and builds a model of all communications of the program. The model is a set of nested loops representing repeated patterns. Loop bodies collect events representing communication actions performed by the various processes, such as sending or receiving messages, or participating in collective operations. The model can be used for compact visualization of full executions, for program understanding and debugging, and also for building statistical analyses of various quantitative aspects of the program's behavior. The construction of the communication model is performed in two phases. First, a local model is built for each process, capturing local regularities; this phase is incremental and fast, and can be done on-line, during the execution. The second phase is a reduction process that collects, aligns, and finally merges all local models into a global, system-wide model. This global model is a compact representation of all communications of the original program, capturing patterns across groups of processes. It can be visualized directly and, because it takes the form of a sequence of loop nests, can be used to replay the original program's communication actions. Because the model is based on communication events only, it completely ignores other quantitative aspects such as timestamps or message sizes. Including such data would in most cases break regularities, reducing the usefulness of trace-based modeling. Instead, the paper shows how one can efficiently access quantitative data kept in the original trace(s), by annotating the model and compiling data scanners automatically.
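
    A toy sketch of the loop-folding idea (our own, far simpler than the paper's incremental algorithm): greedily detect immediate repetitions of a fixed-length pattern in a flat event trace and replace them with a (count, body) loop term.

        def fold(trace, max_period=4):
            """Compress `trace` into a list of events and ('loop', n, body) terms."""
            model, i = [], 0
            while i < len(trace):
                best = None
                for p in range(1, max_period + 1):
                    body = trace[i:i + p]
                    n = 1
                    while trace[i + n * p:i + (n + 1) * p] == body:
                        n += 1
                    if n > 1 and (best is None or n * p > best[0] * best[1]):
                        best = (n, p)       # prefer the fold covering the most events
                if best:
                    n, p = best
                    model.append(("loop", n, trace[i:i + p]))
                    i += n * p
                else:
                    model.append(trace[i])
                    i += 1
            return model

        trace = ["send", "recv", "send", "recv", "send", "recv", "barrier"]
        print(fold(trace))   # [('loop', 3, ['send', 'recv']), 'barrier']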

    Detecting SIMDization Opportunities through Static/Dynamic Dependence Analysis

    Using SIMD instructions is essential in modern processor architectures for high-performance computing. Automatic vectorization by compilers shows limited efficiency in general, due to conservative dependence analysis and complex control flow or indexing. This paper presents a technique to detect SIMDization opportunities, complementing compiler optimization reports in a more detailed way. The method is based on combined static and dynamic dependence analysis and is able to analyze codes not vectorized by a compiler. It generates user hints to help vectorize applications. We show the benefits of this approach on the TSVC benchmark.
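
    The dynamic half of such an analysis can be sketched in a few lines (our own illustration): from a per-iteration trace of reads and writes, compute the loop-carried dependence distances; if every distance is at least w, vectorizing with width w is safe.

        def dependence_distances(trace):
            """trace: list of (iteration, address, kind) with kind 'R' or 'W'.
            Yields the iteration distance of every loop-carried dependence
            (RAW, WAR, and WAW) observed in the trace."""
            last = {}                       # address -> (iteration, kind)
            for it, addr, kind in trace:
                if addr in last:
                    prev_it, prev_kind = last[addr]
                    if it != prev_it and (kind == "W" or prev_kind == "W"):
                        yield it - prev_it
                last[addr] = (it, kind)

        # a[i] = a[i-2] + 1: distance-2 RAW dependence -> vectorizable for w <= 2
        trace = [(i, "a[%d]" % (i - 2), "R") for i in range(2, 8)]
        trace += [(i, "a[%d]" % i, "W") for i in range(2, 8)]
        trace.sort(key=lambda t: t[0])      # stable: reads stay before writes
        dists = set(dependence_distances(trace))
        print(dists)              # {2}
        print(min(dists) >= 2)    # True: SIMD width 2 is safe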

    Rewriting System for Profile-Guided Data Layout Transformations on Binaries

    Careful data layout design is crucial for achieving high performance. However, exploring data layouts is time-consuming and error-prone, and assessing the impact of a layout transformation on performance is difficult without performing it. We propose to guide application programmers through data layout restructuring by providing a comprehensive multidimensional description of the initial layout, built from trace analysis, and then by giving a performance evaluation of the transformations tested and an expression of each transformed layout. The programmer can limit the exploration to layouts matching certain patterns. We apply this method to two multithreaded applications. The performance prediction of multiple transformations matches the performance of hand-transformed layout code to within 5%.
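
    One classic transformation from the family such tools explore, array-of-structures to structure-of-arrays, can be sketched as follows (the field names and workload are our own illustration, not the paper's benchmarks): when a hot loop touches only one field, the SoA form makes those accesses contiguous, which is the kind of layout a trace-based evaluation would reward.

        # AoS: the fields of one element are adjacent; a loop over 'x' strides over 'y', 'm'.
        aos = [{"x": float(i), "y": 2.0 * i, "m": 1.0} for i in range(6)]

        # SoA: each field is stored contiguously.
        soa = {
            "x": [p["x"] for p in aos],
            "y": [p["y"] for p in aos],
            "m": [p["m"] for p in aos],
        }

        # The hot loop reads only 'x': in SoA form it scans one contiguous array.
        total_aos = sum(p["x"] for p in aos)
        total_soa = sum(soa["x"])
        assert total_aos == total_soa
        print(total_soa)   # 15.0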

    Lightweight Array Contraction by Trace-Based Polyhedral Analysis

    Array contraction is a compilation optimization used to reduce memory consumption by reducing the size of temporary arrays in a program while preserving its correctness. The usual approach to this problem is to perform a static analysis of the given program, creating overhead in the compilation cycle. In this work, we look at exploiting execution traces of programs of the polyhedral model, in order to infer reduced sizes for the temporary arrays used during calculations. We designed a four-step process to reduce the storage requirements of a temporary array of a given scheduled program, in which an algorithm deduces array access functions whose bounds are modulos of affine functions of the program's parameters. Our results show memory reductions of an order of magnitude on several benchmark examples from PolyBench, a collection of programs from the polyhedral community. Execution time is compared to a baseline implementation of a static algorithm, and results show speed-up factors of up to 20.
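
    The storage mapping such an analysis infers can be shown on a toy example of our own: if every read of a temporary t[i] occurs within k iterations of its write, the array can be contracted to k cells via the modular access function sigma(i) = i mod k. Below, each t[i] is last read two iterations after it is written, so k = 3 cells suffice instead of n.

        n = 10
        a = list(range(n))

        # Original: full-size temporary.
        t = [2 * a[i] for i in range(n)]
        out_full = [t[i] + t[max(i - 2, 0)] for i in range(n)]

        # Contracted: 3 cells, accesses remapped through i mod 3.
        k = 3
        tc = [0] * k
        out = []
        for i in range(n):
            tc[i % k] = 2 * a[i]                           # write t[i] -> tc[i mod 3]
            out.append(tc[i % k] + tc[max(i - 2, 0) % k])  # reads stay within distance 2
        assert out == out_full
        print(out)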

    Building a Polyhedral Representation from an Execution

    The polyhedral model has been successfully used in production compilers. Nevertheless, only a very restricted class of applications can benefit from it. Recent proposals have investigated how runtime information could be used to apply polyhedral optimization to applications that do not statically fit the model. In this work, we go one step further in that direction. We propose a dynamic analysis that builds a compact polyhedral representation from a program execution. It is able to accurately detect affine dependencies and fixed-stride memory accesses in programs. The analysis scales to real-life applications, which often include some non-affine dependencies and accesses in otherwise affine code. This is enabled by a safe fine-grain polyhedral over-approximation mechanism applied to each analyzed expression. We evaluate our analysis on the entire Rodinia benchmark suite, enabling accurate feedback about the potential for complex polyhedral transformations.
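
    The stride-detection side of such an analysis is easy to sketch (our own illustration): given the address stream of one instruction, constant deltas mean the accesses fit the affine model addr(i) = base + stride * i; otherwise a safe over-approximation of the touched region is used instead.

        def classify(addresses):
            """Return ('affine', base, stride) or ('non-affine', lo, hi)."""
            deltas = {b - a for a, b in zip(addresses, addresses[1:])}
            if len(deltas) <= 1:
                stride = deltas.pop() if deltas else 0
                return ("affine", addresses[0], stride)
            # over-approximate with the smallest interval containing all accesses
            return ("non-affine", min(addresses), max(addresses))

        print(classify([1000, 1008, 1016, 1024]))   # ('affine', 1000, 8)
        print(classify([1000, 1024, 1008, 1016]))   # ('non-affine', 1000, 1024)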