155 research outputs found

    Center for Programming Models for Scalable Parallel Computing

    Rice University's achievements as part of the Center for Programming Models for Scalable Parallel Computing include: (1) design and implementation of cafc, the first multi-platform CAF compiler for distributed and shared-memory machines, (2) performance studies of the efficiency of programs written using the CAF and UPC programming models, (3) a novel technique to analyze explicitly-parallel SPMD programs that facilitates optimization, (4) design, implementation, and evaluation of new language features for CAF, including communication topologies, multi-version variables, and distributed multithreading to simplify development of high-performance codes in CAF, and (5) a synchronization strength reduction transformation for automatically replacing barrier-based synchronization with more efficient point-to-point synchronization. The prototype Co-array Fortran compiler cafc developed in this project is available as open source software from http://www.hipersoft.rice.edu/caf

    Towards Accelerating High-Order Stencils on Modern GPUs and Emerging Architectures with a Portable Framework

    PDE discretization schemes yielding stencil-like computing patterns are commonly used for seismic modeling, weather forecasting, and other scientific applications. Achieving HPC-level stencil computations on one architecture is challenging, and porting to other architectures without sacrificing performance requires significant effort, especially in this golden age of many distinctive architectures. To help developers achieve performance, portability, and productivity with stencil computations, we developed StencilPy. With StencilPy, developers write stencil computations in a high-level domain-specific language, which promotes productivity, while its backends generate efficient code for existing and emerging architectures, including NVIDIA, AMD, and Intel GPUs, A64FX, and STX. StencilPy demonstrates promising performance results on par with hand-written code, maintains cross-architectural performance portability, and enhances productivity. Its modular design enables easy configuration, customization, and extension. A 25-point star-shaped stencil written in StencilPy is one-quarter of the length of a hand-crafted CUDA code and achieves similar performance on an NVIDIA H100 GPU.
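
    For readers unfamiliar with the pattern, the sketch below shows what a 25-point star-shaped stencil computes: in 3D, a center point plus 4 neighbors in each of the 6 axis directions (1 + 6 × 4 = 25). It is plain NumPy and is not StencilPy code; the function name `star_stencil_3d` and the coefficient values are hypothetical, chosen only for illustration.

```python
import numpy as np

def star_stencil_3d(u, center, coeffs):
    """Apply a star-shaped stencil of radius len(coeffs) to the interior of u.

    u      : 3D float array
    center : weight of the center point (illustrative value)
    coeffs : weights for neighbors at distance 1..r along each axis
    Points closer than r to a boundary are left at zero.
    """
    r = len(coeffs)
    nz, ny, nx = u.shape
    out = np.zeros_like(u)
    core = (slice(r, nz - r), slice(r, ny - r), slice(r, nx - r))
    acc = center * u[core]
    for k, c in enumerate(coeffs, start=1):
        acc = acc + c * (
            u[r - k:nz - r - k, r:ny - r, r:nx - r]    # z - k
            + u[r + k:nz - r + k, r:ny - r, r:nx - r]  # z + k
            + u[r:nz - r, r - k:ny - r - k, r:nx - r]  # y - k
            + u[r:nz - r, r + k:ny - r + k, r:nx - r]  # y + k
            + u[r:nz - r, r:ny - r, r - k:nx - r - k]  # x - k
            + u[r:nz - r, r:ny - r, r + k:nx - r + k]  # x + k
        )
    out[core] = acc
    return out

# Radius 4 gives the 25-point star; the weights are made up for the example.
field = np.ones((10, 10, 10))
result = star_stencil_3d(field, center=1.0, coeffs=[0.1, 0.2, 0.3, 0.4])
```

    Each interior output point reads 25 inputs, which is why memory traffic and data reuse dominate the performance of such kernels.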

    LoopTune: Optimizing Tensor Computations with Reinforcement Learning

    Advanced compiler technology is crucial for enabling machine learning applications to run on novel hardware, but traditional compilers fail to deliver performance, popular auto-tuners have long search times, and expert-optimized libraries introduce unsustainable costs. To address this, we developed LoopTune, a deep reinforcement learning compiler that optimizes tensor computations in deep learning models for the CPU. LoopTune optimizes tensor traversal order while using the ultra-fast lightweight code generator LoopNest to perform hardware-specific optimizations. With a novel graph-based representation and action space, LoopTune speeds up LoopNest by 3.2x, generating an order of magnitude faster code than TVM, 2.8x faster than MetaSchedule, and 1.08x faster than AutoTVM, consistently performing at the level of the hand-tuned library NumPy. Moreover, LoopTune tunes code in the order of seconds.
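
    To illustrate what "tensor traversal order" means here, the sketch below runs the same matrix multiplication under every permutation of its three loops. It is not LoopTune or LoopNest code; `matmul_with_order` is a hypothetical helper. All orders produce the same result, but on real hardware they differ in memory access pattern, which is the kind of choice a tuner searches over.

```python
from itertools import permutations, product

def matmul_with_order(a, b, order):
    """Multiply a (m x k) by b (k x n) with loops nested in the given order.

    order is a permutation of "ijk"; the first axis is the outermost loop.
    """
    m, k, n = len(a), len(b), len(b[0])
    extents = {"i": m, "j": n, "k": k}
    c = [[0] * n for _ in range(m)]
    # product() varies its last argument fastest, so order[0] is outermost.
    for idx in product(*(range(extents[axis]) for axis in order)):
        pos = dict(zip(order, idx))
        i, j, kk = pos["i"], pos["j"], pos["k"]
        c[i][j] += a[i][kk] * b[kk][j]
    return c

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
results = {"".join(p): matmul_with_order(a, b, p) for p in permutations("ijk")}
```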

    Scalable Identification of Load Imbalance in Parallel Executions Using Call Path Profiles

    Applications must scale well to make efficient use of today’s class of petascale computers, which contain hundreds of thousands of processor cores. Inefficiencies that do not even appear in modest-scale executions can become major bottlenecks in large-scale executions. Because scaling problems are often difficult to diagnose, there is a critical need for scalable tools that guide scientists to the root causes of scaling problems. Load imbalance is one of the most common scaling problems. To provide actionable insight into load imbalance, we present post-mortem parallel analysis techniques for pinpointing and quantifying load imbalance in the context of call path profiles of parallel programs. We show how to identify load imbalance in its static and dynamic context by using only low-overhead asynchronous call path profiling to locate regions of code responsible for communication wait time in SPMD executions. We describe the implementation of these techniques within HPCTOOLKIT.
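
    As a rough illustration of quantifying imbalance per call path, the sketch below ranks call paths by how far the slowest process exceeds the mean. This is a common textbook-style metric and not necessarily the one used in HPCTOOLKIT; the function `imbalance_report`, the profile layout, and the sample timings are all assumptions made for the example.

```python
from statistics import mean

def imbalance_report(profile):
    """Rank call paths by load imbalance.

    profile maps a call path string to a list of per-process times (seconds).
    'excess' is max - mean: time the slowest process spends beyond the
    average, which the other processes would typically spend waiting.
    """
    rows = []
    for path, times in profile.items():
        avg = mean(times)
        worst = max(times)
        rows.append({
            "path": path,
            "mean": avg,
            "max": worst,
            "excess": worst - avg,
            "ratio": worst / avg if avg > 0 else 0.0,
        })
    rows.sort(key=lambda row: row["excess"], reverse=True)
    return rows

# Made-up timings for four processes.
sample = {
    "main > io": [2.0, 2.0, 2.0, 2.0],
    "main > solve": [1.0, 1.0, 1.0, 5.0],
}
report = imbalance_report(sample)
```

    In this made-up sample, `main > solve` ranks first because one process takes 5.0 s against a mean of 2.0 s, while `main > io` is perfectly balanced.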

    Perceived locus of causality and internalization: Examining reasons for acting in two domains.

    Theories of internalization typically suggest that self-perceptions of the "causes" of (i.e., reasons for) behavior are differentiated along a continuum of autonomy that contains identifiable gradations. A model of perceived locus of causality (PLOC) is developed, using children's self-reported reasons for acting. In Project 1, external, introjected, identified, and intrinsic types of reasons for achievement-related behaviors are shown to conform to a simplex-like (ordered correlation) structure in four samples. These reason categories are then related to existing measures of PLOC and to motivation. A second project examines 3 reason categories (external, introject, and identification) within the domain of prosocial behavior. Relations with measures of empathy, moral judgment, and positive interpersonal relatedness are presented. Finally, the proposed model and conceptualization of PLOC are discussed with regard to intrapersonal versus interpersonal perception, internalization, cause-reason distinctions, and the significance of perceived autonomy in human behavior. A central issue for theories of motivation concerns the perceived locus relative to the person of variables that cause or give impetus to behavior. Heider (1958) introduced the concept of perceived locus of causality (PLOC) primarily in reference to interpersonal perception, and more specifically with regard to the phenomenal analysis of how one infers the motives and intentions of others. He distinguished between personal causation, the critical feature of which is intention, and impersonal causation, in which environments, independent of the person's intentions, produce a given effect. DeCharms (1968) elaborated and extended Heider's phenomenal analysis, particularly with regard to the explanation of behavior (as opposed to outcomes).
    DeCharms argued that there is a further distinction within personal causation or intentional behavior between an internal PLOC, in which the actor is perceived as an "origin" of his or her behavior, and an external PLOC, in which the actor is seen as a "pawn" to heteronomous forces. The distinction between internal and external PLOC has since been crucial for studies of intrinsic versus extrinsic motivation and of perceived autonomy more generally (Deci &