Center for Programming Models for Scalable Parallel Computing
Rice University's achievements as part of the Center for Programming Models for Scalable Parallel Computing include: (1) design and implementation of cafc, the first multi-platform CAF compiler for distributed- and shared-memory machines, (2) performance studies of the efficiency of programs written using the CAF and UPC programming models, (3) a novel technique to analyze explicitly-parallel SPMD programs that facilitates optimization, (4) design, implementation, and evaluation of new language features for CAF, including communication topologies, multi-version variables, and distributed multithreading to simplify development of high-performance codes in CAF, and (5) a synchronization strength reduction transformation for automatically replacing barrier-based synchronization with more efficient point-to-point synchronization. The prototype Co-array Fortran compiler cafc developed in this project is available as open source software from http://www.hipersoft.rice.edu/caf
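The synchronization strength reduction idea in item (5) can be illustrated with a small sketch. This is our own Python-threads illustration of the pattern, not cafc's transformation: in a 1-D halo exchange, each worker reads only its left and right neighbors, so a global barrier per step can be replaced by point-to-point signals (here, one `threading.Event` per worker per step).

```python
import threading

# Point-to-point synchronization sketch (illustrative, not cafc output):
# each worker averages with its ring neighbors for STEPS iterations,
# waiting only on the producers it actually reads instead of a barrier.
N, STEPS = 4, 3
vals = [[0.0] * (STEPS + 1) for _ in range(N)]
for i in range(N):
    vals[i][0] = float(i)
# posted[i][s] is set once worker i has published its step-s value.
posted = [[threading.Event() for _ in range(STEPS + 1)] for _ in range(N)]

def worker(i):
    posted[i][0].set()
    for s in range(STEPS):
        nbrs = [nb for nb in (i - 1, i + 1) if 0 <= nb < N]
        for nb in nbrs:                 # wait only on our producers
            posted[nb][s].wait()
        vals[i][s + 1] = (vals[i][s] +
                          sum(vals[nb][s] for nb in nbrs)) / (1 + len(nbrs))
        posted[i][s + 1].set()          # signal our consumers point-to-point

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Storing one value slot per step sidesteps the anti-dependence a real compiler must also handle (not overwriting a value a neighbor has yet to read).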
Towards Accelerating High-Order Stencils on Modern GPUs and Emerging Architectures with a Portable Framework
PDE discretization schemes that yield stencil-like computation patterns are
commonly used for seismic modeling, weather forecasting, and other scientific
applications. Achieving HPC-level performance for stencil computations on one
architecture is challenging; porting them to other architectures without
sacrificing performance requires significant additional effort, especially in
this golden age of many distinctive architectures.
To help developers achieve performance, portability, and productivity with
stencil computations, we developed StencilPy. With StencilPy, developers write
stencil computations in a high-level domain-specific language, which promotes
productivity, while its backends generate efficient code for existing and
emerging architectures, including NVIDIA, AMD, and Intel GPUs, A64FX, and STX.
StencilPy demonstrates promising performance on par with hand-written code,
maintains cross-architectural performance portability, and enhances
productivity. Its modular design enables easy configuration, customization, and
extension. A 25-point star-shaped stencil written in StencilPy is one-quarter
of the length of a hand-crafted CUDA code and achieves similar performance on
an NVIDIA H100 GPU.
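To make the benchmark concrete, the computation pattern of a 25-point star-shaped stencil (3-D, radius 4: the center plus 4 points along each of the 6 axis directions, 1 + 6*4 = 25) can be sketched in plain NumPy. This is not StencilPy code, the coefficients are illustrative, and `np.roll` gives periodic boundaries for simplicity.

```python
import numpy as np

RADIUS = 4
# coeffs[0] weights the center point; coeffs[d] weights both points at
# offset d along each axis. Values here are illustrative only.
coeffs = np.array([1.0, -0.2, 0.1, -0.05, 0.025])

def star25(u):
    """One sweep of a 25-point star stencil over a 3-D array u."""
    out = coeffs[0] * u
    for axis in range(3):
        for d in range(1, RADIUS + 1):
            out = out + coeffs[d] * (np.roll(u, d, axis=axis) +
                                     np.roll(u, -d, axis=axis))
    return out
```

On a constant field the sweep simply scales by the sum of all 25 coefficients, which makes the pattern easy to sanity-check before tuning it for a GPU.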
LoopTune: Optimizing Tensor Computations with Reinforcement Learning
Advanced compiler technology is crucial for enabling machine learning
applications to run on novel hardware, but traditional compilers fail to
deliver the required performance, popular auto-tuners suffer from long search
times, and expert-optimized libraries introduce unsustainable costs. To address
this, we developed LoopTune, a deep reinforcement learning compiler that
optimizes tensor computations in deep learning models for the CPU. LoopTune
optimizes tensor traversal order while using the ultra-fast, lightweight code
generator LoopNest to perform hardware-specific optimizations. With a novel
graph-based representation and action space, LoopTune speeds up LoopNest by
3.2x, generating code an order of magnitude faster than TVM, 2.8x faster than
MetaSchedule, and 1.08x faster than AutoTVM, while consistently performing at
the level of the hand-tuned library NumPy. Moreover, LoopTune tunes code in a
matter of seconds.
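The traversal-order knob that such a tuner searches over can be illustrated with a toy matrix multiply. Both loop nests below compute the same product; the i-k-j order reads B row-by-row (unit stride), while i-j-k walks B's columns (strided), which is why reordering alone can change performance. The function names are ours, not LoopTune's API.

```python
import numpy as np

def matmul_ijk(A, B):
    """Naive i-j-k loop nest: inner loop walks a column of B (strided)."""
    n, k, m = A.shape[0], A.shape[1], B.shape[1]
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i, j] += A[i, p] * B[p, j]
    return C

def matmul_ikj(A, B):
    """Reordered i-k-j nest: inner loop walks a row of B (unit stride)."""
    n, k, m = A.shape[0], A.shape[1], B.shape[1]
    C = np.zeros((n, m))
    for i in range(n):
        for p in range(k):
            a_ip = A[i, p]
            for j in range(m):
                C[i, j] += a_ip * B[p, j]
    return C
```

A tuner's job is to pick among such semantically equivalent orderings (and tilings) the one that best fits the target memory hierarchy.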
Scalable Identification of Load Imbalance in Parallel Executions Using Call Path Profiles
Applications must scale well to make efficient use of today's class of petascale computers, which contain hundreds of thousands of processor cores. Inefficiencies that do not even appear in modest-scale executions can become major bottlenecks in large-scale executions. Because scaling problems are often difficult to diagnose, there is a critical need for scalable tools that guide scientists to the root causes of scaling problems. Load imbalance is one of the most common scaling problems. To provide actionable insight into load imbalance, we present post-mortem parallel analysis techniques for pinpointing and quantifying load imbalance in the context of call path profiles of parallel programs. We show how to identify load imbalance in its static and dynamic context by using only low-overhead asynchronous call path profiling to locate regions of code responsible for communication wait time in SPMD executions. We describe the implementation of these techniques within HPCTOOLKIT.
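One common way to quantify load imbalance for a code region is the gap between the costliest rank and the mean cost, normalized by the maximum; the result is also the fraction of the slowest rank's time that perfect balance could recover. The sketch below uses that conventional formula, which is not necessarily HPCTOOLKIT's exact metric definition.

```python
def imbalance(per_rank_costs):
    """Load-imbalance indicator for one code region.

    Returns 0.0 for perfect balance and approaches 1.0 as a single
    rank dominates the cost (e.g., inclusive time from a profile).
    """
    mx = max(per_rank_costs)
    if mx == 0:
        return 0.0
    mean = sum(per_rank_costs) / len(per_rank_costs)
    return (mx - mean) / mx
```

For four ranks spending [10, 10, 10, 20] seconds in a region, the mean is 12.5 s, so the imbalance is (20 - 12.5) / 20 = 0.375: balancing the work could recover up to 37.5% of the slowest rank's time. Applying such a metric at every node of a call path profile is what localizes imbalance to specific regions of code.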
Perceived locus of causality and internalization: Examining reasons for acting in two domains.
Theories of internalization typically suggest that self-perceptions of the "causes" of (i.e., reasons for) behavior are differentiated along a continuum of autonomy that contains identifiable gradations. A model of perceived locus of causality (PLOC) is developed, using children's self-reported reasons for acting. In Project 1, external, introjected, identified, and intrinsic types of reasons for achievement-related behaviors are shown to conform to a simplex-like (ordered correlation) structure in four samples. These reason categories are then related to existing measures of PLOC and to motivation. A second project examines three reason categories (external, introjected, and identified) within the domain of prosocial behavior. Relations with measures of empathy, moral judgment, and positive interpersonal relatedness are presented. Finally, the proposed model and conceptualization of PLOC are discussed with regard to intrapersonal versus interpersonal perception, internalization, cause-reason distinctions, and the significance of perceived autonomy in human behavior. A central issue for theories of motivation concerns the perceived locus, relative to the person, of variables that cause or give impetus to behavior. Heider (1958) introduced the concept of perceived locus of causality (PLOC) primarily in reference to interpersonal perception, and more specifically with regard to the phenomenal analysis of how one infers the motives and intentions of others. He distinguished between personal causation, the critical feature of which is intention, and impersonal causation, in which environments, independent of the person's intentions, produce a given effect. DeCharms (1968) elaborated and extended Heider's phenomenal analysis, particularly with regard to the explanation of behavior (as opposed to outcomes).
DeCharms argued that there is a further distinction within personal causation or intentional behavior between an internal PLOC, in which the actor is perceived as an "origin" of his or her behavior, and an external PLOC, in which the actor is seen as a "pawn" to heteronomous forces. The distinction between internal and external PLOC has since been crucial for studies of intrinsic versus extrinsic motivation and of perceived autonomy more generally (Deci &
Modeling the Office of Science Ten Year Facilities Plan: The PERI Architecture Tiger Team
The Performance Engineering Research Institute (PERI) originally proposed a tiger team activity as a mechanism to focus significant effort on optimizing key Office of Science applications, a model that was successfully realized with the assistance of two JOULE metric teams. However, beginning in 2008 the Office of Science requested a new focus: assistance in forming its ten-year facilities plan. To meet this request, PERI formed the Architecture Tiger Team, which is modeling the performance of key science applications on future architectures, with S3D, FLASH, and GTC chosen as the first application targets. In this activity, we have measured the performance of these applications on current systems in order to understand their baseline performance and to ensure that our modeling activity focuses on the right versions and inputs of the applications. We have applied a variety of modeling techniques to anticipate the performance of these applications on a range of anticipated systems. While our initial findings predict that Office of Science applications will continue to perform well on future machines from major hardware vendors, we have also encountered several areas in which we must extend our modeling techniques in order to fulfill our mission accurately and completely. In addition, we anticipate that models of a wider range of applications will reveal critical differences between expected future systems, thus providing guidance for future Office of Science procurement decisions and enabling DOE applications to fully exploit the machines in future facilities.