Improving X10 Program Performances by Clock Removal
X10 is a promising recent parallel language designed specifically to address the challenges of productively programming a wide variety of target platforms. The sequential core of X10 is an object-oriented language in the Java family. This core is augmented by a few parallel constructs that create activities, a generalization of the well-known fork/join model. Clocks are a generalization of the familiar barriers: synchronization on a clock is specified by the advance() method call. Activities that execute advances stall until all existing activities have done the same, and are then released at the same (logical) time. This naturally raises the following question: are clocks strictly necessary for X10 programs? Surprisingly enough, the answer is no, at least for sufficiently regular programs. One assigns a date to each operation, denoting the number of advances that the activity has executed before the operation. Operations with the same date constitute a front; fronts are executed sequentially in order of increasing dates, while operations within a front are executed in parallel if possible. Depending on the nature of the program, this may entail some overhead, which can be reduced to zero for polyhedral programs. We show by experiments that, at least for the current X10 runtime, this transformation usually improves the performance of our benchmarks. Besides its theoretical interest, this transformation may be of interest for simplifying a compiler or runtime library.
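As an illustration of the date-and-front idea described above, here is a minimal Python sketch (not the authors' implementation); operations, date_of and run_op are hypothetical placeholders for a program's clock-synchronized operations, the function assigning each operation its date, and the code executing one operation.

from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def run_by_fronts(operations, date_of, run_op, workers=4):
    """Group operations by their 'date' (the number of advance() calls
    executed before them) and run the resulting fronts one after another.
    Operations inside a front carry no clock dependence between them,
    so they may run in parallel."""
    fronts = defaultdict(list)
    for op in operations:
        fronts[date_of(op)].append(op)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for date in sorted(fronts):
            # all operations with the same date form one front;
            # forcing the map acts as the barrier between fronts
            list(pool.map(run_op, fronts[date]))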
A Space and Bandwidth Efficient Multicore Algorithm for the Particle-in-Cell Method
The Particle-in-Cell (PIC) method allows solving partial differential equations through simulation, with important applications in plasma physics. To simulate thousands of billions of particles on clusters of multicore machines, prior work has proposed hybrid algorithms that combine domain decomposition and particle decomposition, with carefully optimized algorithms for handling the particles processed on each multicore socket. Regarding the multicore processing, existing algorithms either suffer from suboptimal execution time, due to sorting operations or the use of atomic instructions, or from suboptimal space usage. In this paper, we propose a novel parallel algorithm for two-dimensional PIC simulations on multicore hardware that features asymptotically optimal memory consumption and performs no unnecessary accesses to main memory. In practice, our algorithm reaches 65% of the maximum bandwidth and shows excellent scalability on the classical Landau damping and two-stream instability test cases.
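The following toy Python sketch is an assumption for illustration, not the paper's algorithm: it shows the kind of per-cell particle layout that bandwidth-conscious PIC codes build on, where particles are stored in buckets attached to grid cells so that charge deposition only touches the four grid corners around each bucket.

import numpy as np

def deposit_charge(cells, nx, ny, q=1.0):
    """Toy 2D cloud-in-cell charge deposition: cells[i][j] holds the
    (x, y) positions of the particles located in cell (i, j), with
    i <= x < i+1 and j <= y < j+1. Because particles are chunked per
    cell, each bucket writes only to its own grid neighborhood."""
    rho = np.zeros((nx + 1, ny + 1))
    for i in range(nx):
        for j in range(ny):
            for x, y in cells[i][j]:
                dx, dy = x - i, y - j          # offsets inside the cell
                rho[i, j]         += q * (1 - dx) * (1 - dy)
                rho[i + 1, j]     += q * dx * (1 - dy)
                rho[i, j + 1]     += q * (1 - dx) * dy
                rho[i + 1, j + 1] += q * dx * dy
    return rho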
Rec2Poly: Converting Recursions to Polyhedral Optimized Loops Using an Inspector-Executor Strategy
In this paper, we propose Rec2Poly, a framework which automatically detects whether recursive programs may be transformed into affine loops that are compliant with the polyhedral model. If successful, the replacement loops can then take advantage of advanced loop optimizing and parallelizing transformations such as tiling or skewing. Rec2Poly is made of two main phases: an offline profiling phase and an inspector-executor phase. In the profiling phase, the original recursive program, which has been instrumented, is run. Whenever possible, the trace of collected information is used to build equivalent affine loops from the runtime behavior. Then, an inspector-executor program is automatically generated, where the inspector is a lightweight version of the original recursive program whose role is reduced to generating and verifying the information that is essential to ensure the correctness of the equivalent affine loop program. The collected information mainly concerns the touched memory addresses and the control flow of the so-called "impacting" basic blocks of instructions. Moreover, in order to exhibit the lowest possible time overhead, the inspector is implemented as a parallel process in which several memory buffers of information are verified simultaneously. Finally, the executor is made of the equivalent affine loops, which have been optimized and parallelized.
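A very small Python sketch of the general inspector-executor pattern follows; it is only an analogy (base, stride, f and fallback are illustrative names, and the actual framework verifies whole buffers of addresses and control-flow information in parallel rather than a single affine stream).

def inspector(addresses, base, stride):
    """Lightweight inspector: check that the recorded addresses still
    follow the affine pattern base + i*stride observed during profiling."""
    return all(a == base + i * stride for i, a in enumerate(addresses))

def executor_affine(data, base, stride, n, f):
    """Optimized 'executor': the recursion has been replaced by an affine
    loop that a polyhedral compiler could tile or parallelize."""
    for i in range(n):
        data[base + i * stride] = f(i)

def run(data, addresses, base, stride, f, fallback):
    # Run the affine version only if the inspector validates the pattern,
    # otherwise fall back to the original recursive code.
    if inspector(addresses, base, stride):
        executor_affine(data, base, stride, len(addresses), f)
    else:
        fallback(data)

data = [0] * 16
trace = [1, 3, 5, 7]                      # addresses recorded during profiling
run(data, trace, base=1, stride=2, f=lambda i: i * i, fallback=lambda d: None)
print(data)   # [0, 0, 0, 1, 0, 4, 0, 9, 0, ...]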
Transparent Parallelization of Binary Code
This paper describes a system that applies automatic parallelization techniques to binary code. The system works by raising raw executable code to an intermediate representation that exhibits all memory accesses and relevant register definitions, but outlines detailed computations that are not relevant for parallelization. It then uses an off-the-shelf polyhedral parallelizer, first applying appropriate enabling transformations when necessary. The last phase lowers the internal representation into a new executable fragment, re-injecting the low-level instructions into the transformed code. The system is shown to leverage the power of polyhedral parallelization techniques in the absence of source code, with performance approaching that of source-to-source tools.
Loop-based Modeling of Parallel Communication Traces
This paper describes an algorithm that takes a trace of a distributed program and builds a model of all communications of the program. The model is a set of nested loops representing repeated patterns. Loop bodies collect events representing communication actions performed by the various processes, like sending or receiving messages, and participating in collective operations. The model can be used for compact visualization of full executions, for program understanding and debugging, and also for building statistical analyses of various quantitative aspects of the program's behavior. The construction of the communication model is performed in two phases. First, a local model is built for each process, capturing local regularities; this phase is incremental and fast, and can be done on-line, during the execution. The second phase is a reduction process that collects, aligns, and finally merges all local models into a global, system-wide model. This global model is a compact representation of all communications of the original program, capturing patterns across groups of processes. It can be visualized directly and, because it takes the form of a sequence of loop nests, can be used to replay the original program's communication actions. Because the model is based on communication events only, it completely ignores other quantitative aspects like timestamps or message sizes. Including such data would in most cases break regularities, reducing the usefulness of trace-based modeling. Instead, the paper shows how one can efficiently access quantitative data kept in the original trace(s), by annotating the model and compiling data scanners automatically.
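To make the local-model construction concrete, here is a deliberately simplified Python sketch; it is an assumption for illustration, folding only immediately repeated, fixed-length bodies into single-level loops, whereas the algorithm described above builds arbitrarily nested loops and then aligns and merges the per-process models.

def fold_repetitions(events):
    """Collapse immediately repeated sub-sequences of communication events
    into (count, body) loops. Shortest repeating bodies are tried first."""
    model = []
    i = 0
    while i < len(events):
        folded = False
        for length in range(1, (len(events) - i) // 2 + 1):
            body = events[i:i + length]
            count = 1
            while events[i + count * length: i + (count + 1) * length] == body:
                count += 1
            if count > 1:
                model.append((count, body))
                i += count * length
                folded = True
                break
        if not folded:
            model.append((1, [events[i]]))
            i += 1
    return model

# Example: a loop of 3 (send, recv) pairs followed by a collective operation.
print(fold_repetitions(["send", "recv", "send", "recv", "send", "recv", "allreduce"]))
# -> [(3, ['send', 'recv']), (1, ['allreduce'])]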
Aspects de la classification dans un système de représentation des connaissances par objets
In this article, we study various aspects of the classification process in the context of object-based knowledge representation systems. We focus mainly on the classification of classes and of instances. We then discuss three applications that are particularly important for the design of intelligent systems, all three of which rely on the classification process: the incremental formation of classes and class hierarchies, knowledge extraction from databases, and case-based reasoning.
Prediction and trace compression of data access addresses through nested loop recognition
This paper describes an algorithm that takes a trace (i.e., a sequence of numbers or vectors of numbers) as input, and from it produces a sequence of loop nests that, when run, reproduce exactly the original sequence. The input format is suitable for any kind of program execution trace, and the output conforms to standard models of loop nests. The first, most obvious use of such an algorithm is program behavior modeling for any measured quantity (memory accesses, number of cache misses, etc.). Finding loops amounts to detecting periodic behavior and provides an explanatory model. The second application is trace compression, i.e., storing the loop nests instead of the original trace. Decompression consists of running the loops, which is easy and fast. A third application is value prediction: since the algorithm forms loops while reading its input, it is able to extrapolate the loop under construction to predict further incoming values. Throughout the paper, we provide examples that explain our algorithms. Moreover, we evaluate trace compression and value prediction on a subset of the SPEC2000 benchmarks.
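A toy Python version of the idea, illustrating all three uses on a small address-like trace; it is only a sketch that recognizes flat arithmetic progressions, not the nested loop structures the paper's algorithm produces.

def compress_trace(values):
    """Fold maximal arithmetic progressions of the input trace into
    (start, stride, count) triples, i.e. single innermost loops."""
    triples = []
    i = 0
    while i < len(values):
        start = values[i]
        count = 1
        stride = values[i + 1] - start if i + 1 < len(values) else 0
        while i + count < len(values) and values[i + count] == start + count * stride:
            count += 1
        triples.append((start, stride, count))
        i += count
    return triples

def decompress(triples):
    """Re-run the 'loops' to recover the original trace exactly."""
    return [s + k * d for s, d, c in triples for k in range(c)]

def predict_next(values):
    """Extrapolate the progression currently under construction."""
    s, d, c = compress_trace(values)[-1]
    return s + c * d

trace = [0, 8, 16, 24, 100, 108, 116, 124]
assert decompress(compress_trace(trace)) == trace
print(compress_trace(trace))   # [(0, 8, 4), (100, 8, 4)]
print(predict_next(trace))     # 132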
Efficient Memory Tracing by Program Skeletonization
Memory profiling is useful for a variety of tasks, most notably to produce traces of memory accesses for cache simulation. However, instrumenting every memory access incurs a large overhead, both in the amount of code injected into the original program and in execution time. This paper describes how static analysis of the binary code can be used to reduce the amount of instrumentation. The analysis extracts loops and memory access functions by tracking how memory addresses are computed from a small set of base registers holding, e.g., routine parameters and loop counters. Instrumenting these base registers instead of memory operands reduces the weight of instrumentation, first statically by reducing the amount of injected code, and second dynamically by reducing the amount of instrumentation code actually executed. Also, because the static analysis extracts intermediate-level program structures (loops and branches) and access functions in symbolic form, it is easy to transform the original executable into a skeleton program that consumes base register values and produces memory addresses. The first advantage of using a skeleton is the ability to overlap the execution of the instrumented program with that of the skeleton, thereby reducing the overhead of recomputing addresses. The second advantage is that the skeleton program and its shorter input trace can be saved and rerun as many times as necessary without requiring access to the original architecture, e.g., for cache design space exploration. Experiments are performed on SPEC benchmarks compiled for the x86-64 instruction set.
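The following Python sketch is only an analogy for the skeletonization idea (the real system works on x86-64 binaries, not Python): hot_loop stands for the original code, skeleton regenerates the address stream from recovered affine access functions and a short trace of base register values, and the strides 8 and 16 assume 8-byte array elements.

def hot_loop(a, b, n):
    """Original computation: does the full work and, in an instrumented
    build, would record one trace entry per memory access."""
    for i in range(n):
        a[i] = b[2 * i] + 1

def skeleton(base_a, base_b, n, emit):
    """Skeleton program: keeps only the loop structure and the symbolic
    access functions (here base_b + 16*i for the load and base_a + 8*i
    for the store). It consumes base register values and emits the full
    address stream without redoing the computation."""
    for i in range(n):
        emit(("load",  base_b + 16 * i))
        emit(("store", base_a + 8 * i))

# Usage: the instrumented run only needs to record (base_a, base_b, n);
# the skeleton can then regenerate the address trace as often as needed,
# e.g. for cache simulation on another machine.
addresses = []
skeleton(base_a=0x1000, base_b=0x2000, n=4, emit=addresses.append)
print(addresses[:2])   # [('load', 8192), ('store', 4096)]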