2,028 research outputs found

    Parallel machine architecture and compiler design facilities

    Get PDF
    The objective is to provide an integrated simulation environment for studying and evaluating various issues in designing parallel systems, including machine architectures, parallelizing compiler techniques, and parallel algorithms. The status of Delta project (which objective is to provide a facility to allow rapid prototyping of parallelized compilers that can target toward different machine architectures) is summarized. Included are the surveys of the program manipulation tools developed, the environmental software supporting Delta, and the compiler research projects in which Delta has played a role

    Costing JIT Traces

    Get PDF
    Tracing JIT compilation generates units of compilation that are easy to analyse and are known to execute frequently. The AJITPar project aims to investigate whether the information in JIT traces can be used to make better scheduling decisions or perform code transformations to adapt the code for a specific parallel architecture. To achieve this goal, a cost model must be developed to estimate the execution time of an individual trace. This paper presents the design and implementation of a system for extracting JIT trace information from the Pycket JIT compiler. We define three increasingly parametric cost models for Pycket traces. We perform a search of the cost model parameter space using genetic algorithms to identify the best weightings for those parameters. We test the accuracy of these cost models for predicting the cost of individual traces on a set of loop-based micro-benchmarks. We also compare the accuracy of the cost models for predicting whole program execution time over the Pycket benchmark suite. Our results show that the weighted cost model using the weightings found from the genetic algorithm search has the best accuracy

    Group Communication Patterns for High Performance Computing in Scala

    Full text link
    We developed a Functional object-oriented Parallel framework (FooPar) for high-level high-performance computing in Scala. Central to this framework are Distributed Memory Parallel Data structures (DPDs), i.e., collections of data distributed in a shared nothing system together with parallel operations on these data. In this paper, we first present FooPar's architecture and the idea of DPDs and group communications. Then, we show how DPDs can be implemented elegantly and efficiently in Scala based on the Traversable/Builder pattern, unifying Functional and Object-Oriented Programming. We prove the correctness and safety of one communication algorithm and show how specification testing (via ScalaCheck) can be used to bridge the gap between proof and implementation. Furthermore, we show that the group communication operations of FooPar outperform those of the MPJ Express open source MPI-bindings for Java, both asymptotically and empirically. FooPar has already been shown to be capable of achieving close-to-optimal performance for dense matrix-matrix multiplication via JNI. In this article, we present results on a parallel implementation of the Floyd-Warshall algorithm in FooPar, achieving more than 94 % efficiency compared to the serial version on a cluster using 100 cores for matrices of dimension 38000 x 38000

    Behavioral networks as a model for intelligent agents

    Get PDF
    On-going work at NASA Langley Research Center in the development and demonstration of a paradigm called behavioral networks as an architecture for intelligent agents is described. This work focuses on the need to identify a methodology for smoothly integrating the characteristics of low-level robotic behavior, including actuation and sensing, with intelligent activities such as planning, scheduling, and learning. This work assumes that all these needs can be met within a single methodology, and attempts to formalize this methodology in a connectionist architecture called behavioral networks. Behavioral networks are networks of task processes arranged in a task decomposition hierarchy. These processes are connected by both command/feedback data flow, and by the forward and reverse propagation of weights which measure the dynamic utility of actions and beliefs

    JIT-Based cost analysis for dynamic program transformations

    Get PDF
    Tracing JIT compilation generates units of compilation that are easy to analyse and are known to execute frequently. The AJITPar project investigates whether the information in JIT traces can be used to dynamically transform programs for a specific parallel architecture. Hence a lightweight cost model is required for JIT traces. This paper presents the design and implementation of a system for extracting JIT trace information from the Pycket JIT compiler. We define three increasingly parametric cost models for Pycket traces. We determine the best weights for the cost model parameters using linear regression. We evaluate the effectiveness of the cost models for predicting the relative costs of transformed programs

    Approaches to the determination of parallelism in computer programs

    Get PDF
    Approaches to the determination of parallelism in computer program

    Using Decision Tree Voting to Select a Polyhedral Model Loop Transformation

    Get PDF
    Algorithms in fields like image manipulation, sound and signal processing, and statistics frequently employ tight loops. These loops are computationally intensive and CPU-bound, making their performance highly dependent on efficient utilization of the CPU pipeline and memory bus. Recent years have seen CPU pipelines becoming more and more complicated, with features such as branch prediction and speculative execution. At the same time, clock speeds have stopped their prior exponential growth rate due to heat dissipation issues, and multiple cores have become prevalent. These developments have made it more difficult for developers to reason about how their code executes on the CPU, which in turn makes it difficult to write performant code. An automated method to take code and optimize it for most efficient execution would, therefore, be desirable. The Polyhedral Model allows the generation of alternative transformations for a loop nest that are semantically equivalent to the original. The transformations vary the degree of loop tiling, loop fusion, loop unrolling, parallelism, and vectorization. However, selecting the transformation that would most efficiently utilize the architecture remains challenging. Previous work utilizes regression models to select a transformation, using as features hardware performance counter values collected during a sample run of the program being optimized. Due to inaccuracies in the resulting regression model, the transformation selected by the model as the best transformation often yields unsatisfactory performance. As a result, previous work resorts to using a five-shot technique, which entails running the top five transformations suggested by the model and selecting the best one based on their actual runtime. However, for long-running benchmarks, five runs may be take an excessive amount of time. I present a variation on the previous approach which does not need to resort to the five-shot selection process to achieve performance comparable to the best five-shot results reported in previous work. With the transformations in the search space ranked in reverse runtime order, the transformation selected by my classifier is, on average, in the 86th percentile. There are several key contributing factors to the performance improvements attained by my method: formulating the problem as a classification problem rather than a regression problem, using static features in addition to dynamic performance counter features, performing feature selection, and using ensemble methods to boost the performance of the classifier. Decision trees are constructed from pairs of features (performance counters and structural features than can be determined statically from the source code). The trees are then evaluated according to the number of benchmarks for which they select a transformation that performs better than two baseline variants, the original program and the expected runtime if a randomly selected transformation were applied. The top 20 trees vote to select a final transformation
    • …
    corecore