    A lightweight tool for detecting inter-thread communication

    In a multicore environment, inter-thread communication can provide valuable insight into application performance. Prior work on detecting inter-thread communication relies either on hardware simulators or on binary instrumentation; both techniques incur space and time overhead that makes them impractical for real-life applications. We instead take a completely different approach that leverages hardware performance counters and debug registers to detect the communication volume between threads. The information generated by our tool can be used to guide optimizations, understand performance behavior, and compare architectural features. In this talk, I present the design details of our tool along with experimental results on small to very large applications. This work is nominated for both the Best Paper and Best Student Paper awards at SC19.
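    A minimal sketch of the counter-based idea is shown below, using the Linux perf_event_open(2) system call to program a hardware counter around a region of interest. The tool itself pairs such counters with hardware debug registers to attribute traffic to communicating threads; the event chosen here (cache misses) and the surrounding setup are illustrative assumptions, not the tool's actual implementation.

        /* Hedged sketch: count one hardware event around a region of interest
         * using perf_event_open(2). The tool described above pairs counters
         * like this with hardware debug registers; the chosen event and the
         * setup below are illustrative assumptions. Linux-only. */
        #define _GNU_SOURCE
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>
        #include <sys/types.h>
        #include <sys/ioctl.h>
        #include <sys/syscall.h>
        #include <linux/perf_event.h>

        static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                    int cpu, int group_fd, unsigned long flags)
        {
            return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
        }

        int main(void)
        {
            struct perf_event_attr attr;
            memset(&attr, 0, sizeof(attr));
            attr.type = PERF_TYPE_HARDWARE;
            attr.size = sizeof(attr);
            attr.config = PERF_COUNT_HW_CACHE_MISSES;    /* illustrative event */
            attr.disabled = 1;
            attr.exclude_kernel = 1;

            int fd = perf_event_open(&attr, 0, -1, -1, 0);  /* this thread, any CPU */
            if (fd < 0) { perror("perf_event_open"); return 1; }

            ioctl(fd, PERF_EVENT_IOC_RESET, 0);
            ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
            /* ... region of interest: threads touching shared data ... */
            ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

            long long count = 0;
            read(fd, &count, sizeof(count));
            printf("events in region: %lld\n", count);
            close(fd);
            return 0;
        }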

    Programming Abstractions for Data Locality

    The goal of the workshop and this report is to identify common themes and standardize concepts for locality-preserving abstractions for exascale programming models. Current software tools are built on the premise that computation is the most expensive component, but we are rapidly moving into an era in which computation is cheap and massively parallel while data movement dominates energy and performance costs. To respond to exascale systems (the next generation of high performance computing systems), the scientific computing community needs to refactor its applications to align with the emerging data-centric paradigm. Applications must evolve to express information about data locality; unfortunately, current programming environments offer few ways to do so. They ignore the cost of communication and simply rely on hardware cache coherency to virtualize data movement. With the increasing importance of task-level parallelism on future systems, task models must support constructs that express data locality and affinity. At the system level, communication libraries implicitly assume that all processing elements are equidistant from each other. To take advantage of emerging technologies, application developers need a set of programming abstractions that describe data locality for the new computing ecosystem. The new programming paradigm should be more data-centric and should allow developers to describe how to decompose data and how to lay it out in memory.
    Fortunately, many concepts are emerging for managing data locality, such as constructs for tiling, data layout, array views, task and thread affinity, and topology-aware communication libraries. There is an opportunity to identify commonalities among these strategies and to combine the best of these concepts into a comprehensive approach to expressing and managing data locality in exascale programming systems. Such programming-model abstractions can expose crucial information about data locality to the compiler and runtime system to enable performance-portable code. The research question is to identify the right level of abstraction, with candidate techniques ranging from template libraries all the way to completely new languages.
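    As a concrete illustration of one of the constructs named above, the C sketch below expresses data locality through explicit loop tiling. The tile dimensions and the stencil body are illustrative assumptions, standing in for choices a locality-aware programming model would expose to the programmer or runtime rather than anything prescribed by the report.

        /* Minimal tiling sketch: traverse a 2-D grid tile by tile so each
         * tile stays cache-resident. Tile sizes and the stencil body are
         * illustrative assumptions. */
        #include <stddef.h>

        #define N  4096
        #define TI 64                /* tile height */
        #define TJ 64                /* tile width  */

        void smooth_tiled(const double a[N][N], double b[N][N])
        {
            for (size_t ii = 1; ii + 1 < N; ii += TI)
                for (size_t jj = 1; jj + 1 < N; jj += TJ)
                    /* finish one cache-friendly tile before moving on */
                    for (size_t i = ii; i < ii + TI && i + 1 < N; i++)
                        for (size_t j = jj; j < jj + TJ && j + 1 < N; j++)
                            b[i][j] = 0.25 * (a[i-1][j] + a[i+1][j] +
                                              a[i][j-1] + a[i][j+1]);
        }

    The same computation with the tiling hidden behind a directive or a template library is exactly the kind of abstraction the report argues for.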

    Domain-specific translator and optimizer for massive on-chip parallelism

    Future supercomputers will rely on massive on-chip parallelism, which requires dramatic changes to node architecture. Node architectures will become more heterogeneous and hierarchical, with software-managed on-chip memory becoming more prevalent. To meet performance expectations, application software will undergo extensive redesign. In response, support from programming models is crucial to help scientists adopt new technologies without significant programming effort. In this dissertation, we address the programming issues of a massively parallel single-chip processor with a software-managed memory. We propose the Mint programming model and domain-specific compiler as a means of simplifying application development. Mint abstracts away the programmer's view of the hardware by providing a high-level interface to low-level architecture-specific optimizations. The Mint model requires modest recoding of the application and is based on a small number of compiler directives, which are sufficient to take advantage of massive parallelism. We have implemented the Mint model on a concrete instance of a massively parallel single-chip processor: the Nvidia GPU (Graphics Processing Unit). The Mint source-to-source translator accepts C source with Mint annotations and generates CUDA C. The translator includes a domain-specific optimizer targeting stencil methods, which arise in image processing applications and in a wide range of partial differential equation solvers. The Mint optimizer performs data locality optimizations and uses on-chip memory to reduce memory accesses, which is particularly useful for stencil methods. We have demonstrated the effectiveness of Mint on a set of widely used stencil kernels and three real-world applications: an earthquake-induced seismic wave propagation code, an interest point detection algorithm for volume datasets, and a model of signal propagation in cardiac tissue. In cases where hand-coded implementations are available, we have verified that Mint delivers competitive performance, realizing around 80% of the performance of the hand-optimized CUDA implementations of the kernels and applications on the Tesla C1060 and C2050 GPUs. By facilitating high-level management of parallelism and of the on-chip memory hierarchy, Mint enables computational scientists to reduce their software development time, and by performing domain-specific optimizations it delivers high performance for stencil methods.
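    To give a flavor of the directive-based style described above, here is a hedged sketch of a 2-D stencil annotated with Mint-like pragmas. The directive spellings are assumptions based on this description, not verbatim Mint syntax; the translator would turn such an annotated loop nest into a tiled CUDA kernel.

        /* Sketch of a stencil in the directive style described above. The
         * pragma spellings are assumptions, not verbatim Mint syntax; a
         * standard C compiler simply ignores them. */
        #define N 1024

        void relax(double u[N][N], double unew[N][N])
        {
            #pragma mint copy(u, toDevice, N, N)
            #pragma mint parallel
            {
                /* the translator would map this nest to a tiled GPU kernel */
                #pragma mint for nest(2) tile(16, 16)
                for (int i = 1; i < N - 1; i++)
                    for (int j = 1; j < N - 1; j++)
                        unew[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] +
                                             u[i][j-1] + u[i][j+1]);
            }
            #pragma mint copy(unew, fromDevice, N, N)
        }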

    Trends in Data Locality Abstractions for HPC Systems


    Load Balancing for Parallel Multiphase Flow Simulation

    This paper presents a scalable dynamic load balancing scheme for a parallel multiphase flow simulation based on the front-tracking method. In this simulation, which employs both Lagrangian and Eulerian grids, processes operating on the Lagrangian grid are susceptible to load imbalance because the Lagrangian grid points (bubbles) move and load is distributed according to the bubbles' spatial location. To balance these processes, we redistribute load taking into account both the current per-processor load distribution and bubble spatial locality, and we remap interprocess communication accordingly. The result is a uniform processor load distribution and a predictable, less expensive communication scheme. Scalability studies on the Hazel Hen supercomputer demonstrate excellent scaling, with exponential savings in execution time as the problem size grows. While moderate speedup is observed in strong scaling studies, a speedup of up to 30% over the non-load-balanced version is achieved in weak scaling studies when simulating 13824 bubbles on 4096 cores.
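    A hypothetical sketch of the rebalancing idea follows: each bubble may migrate to a less-loaded process, but only among processes that keep it spatially local. All names, the unit cost model, and the slab-based locality test are illustrative assumptions, not the paper's implementation.

        /* Hypothetical rebalancing sketch: migrate bubbles toward less-loaded
         * processes while preserving spatial locality. The cost model and the
         * slab-based locality test are illustrative assumptions. */
        #define NPROC 64

        typedef struct { double x, y, z; int owner; } Bubble;

        static double load[NPROC];   /* current per-process load estimate */

        /* assumption: 1-D slab decomposition in x, with x normalized to [0,1);
         * a process is "near" a bubble if it owns the bubble's slab or an
         * adjacent one */
        static int near(const Bubble *b, int p)
        {
            int home = (int)(b->x * NPROC);
            return p >= home - 1 && p <= home + 1;
        }

        void rebalance(Bubble bubbles[], int n)
        {
            for (int i = 0; i < n; i++) {
                int best = bubbles[i].owner;
                for (int p = 0; p < NPROC; p++)
                    if (near(&bubbles[i], p) && load[p] < load[best])
                        best = p;                  /* least-loaded nearby process */
                load[bubbles[i].owner] -= 1.0;     /* unit cost per bubble */
                load[best] += 1.0;
                bubbles[i].owner = best;
            }
        }

    After reassignment, interprocess communication would be remapped to the new owners, matching the communication remapping step the abstract describes.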