13 research outputs found

    GPU Behavior on a Large HPC Cluster

    No full text

    Online impact analysis via dynamic compilation technology

    No full text
    Dynamic impact analysis based on whole path profiling of method calls and returns has been shown to provide more useful predictions of software change impacts than method-level static slicing and to avoid the overhead of expensive dependency analysis needed for dynamic slicing-based impact analysis. This paper presents the design, implementation, and evaluation of an online approach to dynamic impact analysis as an extension to the DynamoRIO binary code modification system and to the Jikes Research Virtual Machine. Storage and postmortem analysis of program traces, even compressed, are avoided.

    Maestro: Data Orchestration and Tuning for OpenCL Devices

    No full text

    An OpenMP 3.1 Validation Testsuite

    No full text
    Parallel programming models are evolving so rapidly that it must be ensured that OpenMP can be used easily to program multicore devices. There is also an effort underway to have OpenMP accepted as a de facto standard in the embedded systems community. However, ensuring the correctness of an OpenMP implementation requires an up-to-date validation suite. In this paper, we present a portable and robust validation testsuite execution environment to validate the OpenMP implementation in several compilers. We cover all the directives and clauses of OpenMP up to the latest release, OpenMP Version 3.1. Our primary focus is to determine and evaluate the correctness of the OpenMP implementation in our research compiler, OpenUH, and in a few others such as Intel, Sun/Oracle, and GNU. We also aim to find ambiguities in the OpenMP specification and to help refine it with the validation suite. Furthermore, we include deeper tests such as cross tests and orphan tests in the testsuite.

    Improving Performance Portability in OpenCL Programs

    No full text

    An Automated Approach to Improve Communication-Computation Overlap in Clusters. Senior Thesis

    No full text

    CUDA-For-Clusters: A System for Efficient Execution of CUDA Kernels on Multi-Core Clusters

    No full text
    Rapid advancements in multi-core processor architectures along with low-cost, low-latency, high-bandwidth interconnects have made clusters of multi-core machines a common computing resource. Unfortunately, writing good parallel programs to efficiently utilize all the resources in such a cluster is still a major challenge. Programmers have to manually deal with low-level details that should ideally be the responsibility of an intelligent compiler or a run-time layer. Various programming languages have been proposed as a solution to this problem, but are yet to be adopted widely to run performance-critical code mainly due to the relatively immature software framework and the effort involved in re-writing existing code in the new language. In this paper, we motivate and describe our initial study in exploring CUDA as a programming language for a cluster of multi-cores. We develop CUDA-For-Clusters (CFC), a framework that transparently orchestrates execution of CUDA kernels on a cluster of multi-core machines. The well-structured nature of a CUDA kernel, the growing number of CUDA developers and benchmarks along with the stability of the CUDA software stack collectively make CUDA a good candidate to be considered as a programming language for a cluster. CFC uses a mixture of source-to-source compiler transformations, a work distribution runtime and a light-weight software distributed shared memory to manage parallel executions. Initial results on running several standard CUDA benchmark programs achieve impressive speedups of up to 7.5X on a cluster with 8 nodes, thereby opening up an interesting direction of research for further investigation.