Center for Programming Models for Scalable Parallel Computing
Rice University's achievements as part of the Center for Programming Models for Scalable Parallel Computing include: (1) design and implementation of cafc, the first multi-platform CAF compiler for distributed- and shared-memory machines, (2) performance studies of the efficiency of programs written using the CAF and UPC programming models, (3) a novel technique to analyze explicitly-parallel SPMD programs that facilitates optimization, (4) design, implementation, and evaluation of new language features for CAF, including communication topologies, multi-version variables, and distributed multithreading to simplify development of high-performance codes in CAF, and (5) a synchronization strength reduction transformation for automatically replacing barrier-based synchronization with more efficient point-to-point synchronization. The prototype Co-array Fortran compiler cafc developed in this project is available as open source software from http://www.hipersoft.rice.edu/caf
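The synchronization strength reduction idea in item (5) can be illustrated with a small sketch. This is our own Python-threads illustration of the pattern, not cafc's transformation: in a 1-D halo exchange, each worker reads only its left and right neighbors, so a global barrier per step can be replaced by point-to-point signals (here, one `threading.Event` per worker per step).

```python
import threading

# Point-to-point synchronization sketch (illustrative, not cafc output):
# each worker averages with its ring neighbors for STEPS iterations,
# waiting only on the producers it actually reads instead of a barrier.
N, STEPS = 4, 3
vals = [[0.0] * (STEPS + 1) for _ in range(N)]
for i in range(N):
    vals[i][0] = float(i)
# posted[i][s] is set once worker i has published its step-s value.
posted = [[threading.Event() for _ in range(STEPS + 1)] for _ in range(N)]

def worker(i):
    posted[i][0].set()
    for s in range(STEPS):
        nbrs = [nb for nb in (i - 1, i + 1) if 0 <= nb < N]
        for nb in nbrs:                 # wait only on our producers
            posted[nb][s].wait()
        vals[i][s + 1] = (vals[i][s] +
                          sum(vals[nb][s] for nb in nbrs)) / (1 + len(nbrs))
        posted[i][s + 1].set()          # signal our consumers point-to-point

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Storing one value slot per step sidesteps the anti-dependence a real compiler must also handle (not overwriting a value a neighbor has yet to read).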
Towards Accelerating High-Order Stencils on Modern GPUs and Emerging Architectures with a Portable Framework
PDE discretization schemes that yield stencil-like computation patterns are
commonly used for seismic modeling, weather forecasting, and other scientific
applications. Achieving HPC-level performance for stencil computations on one
architecture is challenging; porting them to other architectures without
sacrificing performance requires significant additional effort, especially in
this golden age of many distinctive architectures.
To help developers achieve performance, portability, and productivity with
stencil computations, we developed StencilPy. With StencilPy, developers write
stencil computations in a high-level domain-specific language, which promotes
productivity, while its backends generate efficient code for existing and
emerging architectures, including NVIDIA, AMD, and Intel GPUs, A64FX, and STX.
StencilPy demonstrates promising performance on par with hand-written code,
maintains cross-architectural performance portability, and enhances
productivity. Its modular design enables easy configuration, customization, and
extension. A 25-point star-shaped stencil written in StencilPy is one-quarter
of the length of a hand-crafted CUDA code and achieves similar performance on
an NVIDIA H100 GPU.
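To make the benchmark concrete, the computation pattern of a 25-point star-shaped stencil (3-D, radius 4: the center plus 4 points along each of the 6 axis directions, 1 + 6*4 = 25) can be sketched in plain NumPy. This is not StencilPy code, the coefficients are illustrative, and `np.roll` gives periodic boundaries for simplicity.

```python
import numpy as np

RADIUS = 4
# coeffs[0] weights the center point; coeffs[d] weights both points at
# offset d along each axis. Values here are illustrative only.
coeffs = np.array([1.0, -0.2, 0.1, -0.05, 0.025])

def star25(u):
    """One sweep of a 25-point star stencil over a 3-D array u."""
    out = coeffs[0] * u
    for axis in range(3):
        for d in range(1, RADIUS + 1):
            out = out + coeffs[d] * (np.roll(u, d, axis=axis) +
                                     np.roll(u, -d, axis=axis))
    return out
```

On a constant field the sweep simply scales by the sum of all 25 coefficients, which makes the pattern easy to sanity-check before tuning it for a GPU.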
LoopTune: Optimizing Tensor Computations with Reinforcement Learning
Advanced compiler technology is crucial for enabling machine learning
applications to run on novel hardware, but traditional compilers fail to
deliver the required performance, popular auto-tuners suffer from long search
times, and expert-optimized libraries introduce unsustainable costs. To address
this, we developed LoopTune, a deep reinforcement learning compiler that
optimizes tensor computations in deep learning models for the CPU. LoopTune
optimizes tensor traversal order while using the ultra-fast, lightweight code
generator LoopNest to perform hardware-specific optimizations. With a novel
graph-based representation and action space, LoopTune speeds up LoopNest by
3.2x, generating code an order of magnitude faster than TVM, 2.8x faster than
MetaSchedule, and 1.08x faster than AutoTVM, while consistently performing at
the level of the hand-tuned library NumPy. Moreover, LoopTune tunes code in a
matter of seconds.
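The traversal-order knob that such a tuner searches over can be illustrated with a toy matrix multiply. Both loop nests below compute the same product; the i-k-j order reads B row-by-row (unit stride), while i-j-k walks B's columns (strided), which is why reordering alone can change performance. The function names are ours, not LoopTune's API.

```python
import numpy as np

def matmul_ijk(A, B):
    """Naive i-j-k loop nest: inner loop walks a column of B (strided)."""
    n, k, m = A.shape[0], A.shape[1], B.shape[1]
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i, j] += A[i, p] * B[p, j]
    return C

def matmul_ikj(A, B):
    """Reordered i-k-j nest: inner loop walks a row of B (unit stride)."""
    n, k, m = A.shape[0], A.shape[1], B.shape[1]
    C = np.zeros((n, m))
    for i in range(n):
        for p in range(k):
            a_ip = A[i, p]
            for j in range(m):
                C[i, j] += a_ip * B[p, j]
    return C
```

A tuner's job is to pick among such semantically equivalent orderings (and tilings) the one that best fits the target memory hierarchy.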
Scalable Identification of Load Imbalance in Parallel Executions Using Call Path Profiles
Applications must scale well to make efficient use of today's class of petascale computers, which contain hundreds of thousands of processor cores. Inefficiencies that do not even appear in modest-scale executions can become major bottlenecks in large-scale executions. Because scaling problems are often difficult to diagnose, there is a critical need for scalable tools that guide scientists to the root causes of scaling problems. Load imbalance is one of the most common scaling problems. To provide actionable insight into load imbalance, we present post-mortem parallel analysis techniques for pinpointing and quantifying load imbalance in the context of call path profiles of parallel programs. We show how to identify load imbalance in its static and dynamic context by using only low-overhead asynchronous call path profiling to locate regions of code responsible for communication wait time in SPMD executions. We describe the implementation of these techniques within HPCTOOLKIT.
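One common way to quantify load imbalance for a code region is the gap between the costliest rank and the mean cost, normalized by the maximum; the result is also the fraction of the slowest rank's time that perfect balance could recover. The sketch below uses that conventional formula, which is not necessarily HPCTOOLKIT's exact metric definition.

```python
def imbalance(per_rank_costs):
    """Load-imbalance indicator for one code region.

    Returns 0.0 for perfect balance and approaches 1.0 as a single
    rank dominates the cost (e.g., inclusive time from a profile).
    """
    mx = max(per_rank_costs)
    if mx == 0:
        return 0.0
    mean = sum(per_rank_costs) / len(per_rank_costs)
    return (mx - mean) / mx
```

For four ranks spending [10, 10, 10, 20] seconds in a region, the mean is 12.5 s, so the imbalance is (20 - 12.5) / 20 = 0.375: balancing the work could recover up to 37.5% of the slowest rank's time. Applying such a metric at every node of a call path profile is what localizes imbalance to specific regions of code.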
Perceived locus of causality and internalization: Examining reasons for acting in two domains.
Theories of internalization typically suggest that self-perceptions of the "causes" of (i.e., reasons for) behavior are differentiated along a continuum of autonomy that contains identifiable gradations. A model of perceived locus of causality (PLOC) is developed, using children's self-reported reasons for acting. In Project 1, external, introjected, identified, and intrinsic types of reasons for achievement-related behaviors are shown to conform to a simplex-like (ordered correlation) structure in four samples. These reason categories are then related to existing measures of PLOC and to motivation. A second project examines three reason categories (external, introjected, and identified) within the domain of prosocial behavior. Relations with measures of empathy, moral judgment, and positive interpersonal relatedness are presented. Finally, the proposed model and conceptualization of PLOC are discussed with regard to intrapersonal versus interpersonal perception, internalization, cause-reason distinctions, and the significance of perceived autonomy in human behavior. A central issue for theories of motivation concerns the perceived locus, relative to the person, of variables that cause or give impetus to behavior. Heider (1958) introduced the concept of perceived locus of causality (PLOC) primarily in reference to interpersonal perception, and more specifically with regard to the phenomenal analysis of how one infers the motives and intentions of others. He distinguished between personal causation, the critical feature of which is intention, and impersonal causation, in which environments, independent of the person's intentions, produce a given effect. DeCharms (1968) elaborated and extended Heider's phenomenal analysis, particularly with regard to the explanation of behavior (as opposed to outcomes).
DeCharms argued that there is a further distinction within personal causation or intentional behavior between an internal PLOC, in which the actor is perceived as an "origin" of his or her behavior, and an external PLOC, in which the actor is seen as a "pawn" to heteronomous forces. The distinction between internal and external PLOC has since been crucial for studies of intrinsic versus extrinsic motivation and of perceived autonomy more generally (Deci &
Modeling the Office of Science Ten Year Facilities Plan: The PERI Architecture Tiger Team
The Performance Engineering Research Institute (PERI) originally proposed a tiger team activity as a mechanism to focus significant effort on optimizing key Office of Science applications, a model that was successfully realized with the assistance of two JOULE metric teams. However, beginning in 2008 the Office of Science requested a new focus: assistance in forming its ten-year facilities plan. To meet this request, PERI formed the Architecture Tiger Team, which is modeling the performance of key science applications on future architectures, with S3D, FLASH, and GTC chosen as the first application targets. In this activity, we have measured the performance of these applications on current systems in order to understand their baseline performance and to ensure that our modeling activity focuses on the right versions and inputs of the applications. We have applied a variety of modeling techniques to anticipate the performance of these applications on a range of anticipated systems. While our initial findings predict that Office of Science applications will continue to perform well on future machines from major hardware vendors, we have also encountered several areas in which we must extend our modeling techniques in order to fulfill our mission accurately and completely. In addition, we anticipate that models of a wider range of applications will reveal critical differences between expected future systems, thus providing guidance for future Office of Science procurement decisions and enabling DOE applications to fully exploit the machines in future facilities.