18 research outputs found

    A Graphical User Interface for Composing Parallel Computation in the STAPL Skeleton Library

    Parallel programming is a quickly growing field in computer science. It involves splitting a computation among multiple processors to decrease the run time of programs. The computations assigned to one processor can depend on the results of other computations, and this dependence introduces a partial ordering between tasks that requires coordinating the execution of the tasks assigned to each processor. OpenMP and MPI, the most heavily used approaches today, require low-level primitives to express even very simple scientific applications. Newer environments, such as STAPL, Charm++, and Chapel, among others, raise the level of abstraction, but the challenge of specifying the flow of data between computations remains. Graphical user interfaces (GUIs), however, can simplify this task. The purpose of this project is to create a GUI that allows a user to specify a parallel application written in STAPL by composing high-level components and by defining the flow of data between them. The user lays out the code using shapes and lines, which produce the composition on an underlying layer, eliminating the need to write complex composition specifications directly in the code.
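
    To make the composition layer concrete, here is a minimal C++ sketch of the kind of map/reduce composition such a GUI could generate behind the shapes and lines. The combinator names (map_skeleton, reduce_skeleton, compose) are illustrative inventions, not the actual STAPL Skeleton Library API, and the sketch runs sequentially where a real skeleton library would run in parallel.

```cpp
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Illustrative combinators only -- not the SSL API. A "shape" in the GUI
// becomes a skeleton; a "line" between shapes becomes compose().
template <typename F>
auto map_skeleton(F f) {
  return [f](const std::vector<double>& in) {
    std::vector<double> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
      out[i] = f(in[i]);                 // element-wise, parallelizable
    return out;
  };
}

template <typename Op>
auto reduce_skeleton(Op op, double id) {
  return [op, id](const std::vector<double>& in) {
    return std::accumulate(in.begin(), in.end(), id, op);
  };
}

// Dataflow edge: feed the output of s1 into s2.
template <typename S1, typename S2>
auto compose(S1 s1, S2 s2) {
  return [s1, s2](const auto& in) { return s2(s1(in)); };
}

int main() {
  auto square_then_sum =
      compose(map_skeleton([](double x) { return x * x; }),
              reduce_skeleton(std::plus<double>{}, 0.0));
  std::vector<double> v{1, 2, 3, 4};
  return square_then_sum(v) == 30.0 ? 0 : 1;   // 1 + 4 + 9 + 16 = 30
}
```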

    Nested Parallelism with Algorithmic Skeletons

    A new trend in the design of computer architectures, from the memory hierarchy to the grouping of computing units into hierarchical levels within CPUs, pushes developers toward algorithms that can exploit these hierarchical designs. This trend makes support for nested parallelism an important feature of parallel programming models, since it enables parallel programs that can be mapped onto the system hierarchy. However, supporting nested parallelism is not trivial, due to the complexity of spawning nested parallel sections, destructing them, and, more importantly, communicating between them. Structured parallel programming models have proven to be a good choice: while they hide the complexities of parallel programming from the programmer, they still allow the execution of an algorithm to be customized without radical changes to the rest of the program. This thesis presents nested algorithm composition in the STAPL Skeleton Library (SSL), which uses a nested dataflow model as its internal representation. We show how a high-level program specification using SSL allows for asynchronous computation and improved locality. We study both the specification and the performance of the STAPL implementation of Kripke, a mini-app developed by Lawrence Livermore National Laboratory. Kripke has multiple levels of parallelism and a number of data layouts, making it an excellent test bed for exercising the effectiveness of a nested parallel programming approach. Performance results are provided for six different nesting orders of the benchmark under different degrees of nested parallelism, demonstrating the flexibility and performance of nested algorithmic skeleton composition in STAPL.
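
    As a language-level analogy for the shape of computation the thesis composes (a parallel map whose body is itself a parallel reduction), here is a minimal sketch using plain OpenMP nesting. SSL expresses the same structure by composing skeletons over a dataflow graph, so this is not SSL code; it only shows what one level of nested parallelism looks like.

```cpp
#include <omp.h>
#include <cstdio>
#include <vector>

// The same *shape* as a nested skeleton composition -- a parallel map over
// rows whose body is itself a parallel reduction -- written in raw OpenMP.
int main() {
  const int rows = 8, cols = 1 << 16;
  std::vector<std::vector<double>> a(rows, std::vector<double>(cols, 1.0));
  std::vector<double> row_sum(rows, 0.0);

  omp_set_max_active_levels(2);   // enable one level of nested parallelism

  #pragma omp parallel for num_threads(4)       // outer map over rows
  for (int i = 0; i < rows; ++i) {
    double s = 0.0;
    #pragma omp parallel for reduction(+:s) num_threads(2)   // inner reduce
    for (int j = 0; j < cols; ++j)
      s += a[i][j];
    row_sum[i] = s;
  }

  std::printf("row_sum[0] = %g\n", row_sum[0]);  // prints 65536
  return 0;
}
```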

    Fast Automatic Heuristic Construction Using Active Learning

    Building effective optimization heuristics is a challenging task which often takes developers several months, if not years, to complete. Predictive modelling has recently emerged as a promising solution, automatically constructing heuristics from training data; however, obtaining this data can take months per platform. This is becoming an ever more critical problem, and if no solution is found we shall be left with out-of-date heuristics which cannot extract the best performance from modern machines. In this work, we present a low-cost predictive modelling approach for automatic heuristic construction which significantly reduces this training overhead. In supervised learning, the training instances to evaluate are typically selected at random, regardless of how much useful information they carry, which wastes effort on parts of the space that contribute little to the quality of the produced heuristic. Our approach instead uses active learning to select and focus only on the most useful training examples. We demonstrate this technique by automatically constructing a model that determines on which device to execute four parallel programs at differing problem dimensions for a representative CPU-GPU heterogeneous system. Our methodology is remarkably simple yet effective, making it a strong candidate for wide adoption. At high levels of classification accuracy, the average learning speed-up is 3x compared to the state of the art.
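
    A minimal sketch of the underlying idea, pool-based active learning with uncertainty sampling: a toy 1-D threshold "learner" stands in for the paper's real predictive model, and the candidate closest to the current decision boundary is queried next. All names and data here are illustrative, and each query corresponds to the expensive step the paper minimizes (running a benchmark to obtain its label).

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Toy model: a 1-D threshold predicting CPU (0) below and GPU (1) above a
// given problem size. Data is made up for illustration.
struct Example { double size; int label; };

// Fit the toy model: midpoint between the largest known CPU size and the
// smallest known GPU size.
double fit_threshold(const std::vector<Example>& labeled) {
  double lo = 0.0, hi = 1e9;
  for (const auto& e : labeled) {
    if (e.label) hi = std::min(hi, e.size);
    else         lo = std::max(lo, e.size);
  }
  return 0.5 * (lo + hi);
}

int main() {
  std::vector<Example> labeled = {{1e3, 0}, {1e6, 1}};   // two seed runs
  std::vector<Example> pool = {{5e3, 0}, {2e4, 0}, {6e4, 1}, {3e5, 1}};

  for (int round = 0; round < 3; ++round) {
    double t = fit_threshold(labeled);
    // Uncertainty sampling: query the pool point nearest the decision
    // boundary -- the one the current model is least sure about.
    auto q = std::min_element(pool.begin(), pool.end(),
        [t](const Example& a, const Example& b) {
          return std::fabs(a.size - t) < std::fabs(b.size - t);
        });
    std::printf("round %d: threshold ~ %.0f, query size %.0f\n",
                round, t, q->size);
    labeled.push_back(*q);   // "labeling" = running the benchmark
    pool.erase(q);
  }
  return 0;
}
```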

    A Hierarchical System View and Its Use in the Data Distribution of Composed Containers in STAPL

    In parallel programming, a concurrent container usually distributes its elements across all processing units (locations) equally to maximize processing ability. However, this strategy does not perform well when nested parallel functions are applied to a composed concurrent container, such as a concurrent vector of vectors or a concurrent map of lists. Distributing the inner concurrent containers across the whole system destroys the locality of the elements of the composed container, generating a large amount of inter-process communication when the nested parallel operations access the container's elements. As the hierarchy of modern high performance computing (HPC) systems grows larger and more complex, heavy inter-process communication, especially between remote processing units (such as two cores on different nodes), has a dramatic negative impact on the performance of parallel applications. In this thesis, we introduce a hierarchical system view that represents the topology of the processing units in an HPC system, and use it to guide the distribution of composed concurrent containers. It reduces the number of processing elements involved in storing the inner concurrent containers, which reduces memory usage and improves construction time. It also reduces the amount of inter-process communication by improving the locality of the elements when nested parallel functions are applied to a composed concurrent container. To evaluate our approach, we implement two concurrent associative multi-key containers, multimap and multiset, in the Standard Template Adaptive Parallel Library (STAPL), and use the hierarchical system view for the distribution of composed 2D and 3D containers. Finally, we show great improvement in both the construction time and the execution time of the nested parallel functions for various numbers of cores and hierarchies.
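
    A toy illustration of the distribution policy this motivates: confine each inner container of a composed container to the locations of a single node instead of striping it across the whole machine. The machine shape and the round-robin placement are illustrative assumptions, not STAPL's actual distribution specification.

```cpp
#include <cstdio>

// Toy policy: on a machine with 4 nodes x 8 cores (32 locations), place
// each inner container of a vector<vector> on the cores of one node,
// instead of striping its elements across all 32 locations.
int main() {
  const int nodes = 4, cores_per_node = 8;
  const int outer_size = 8;                    // 8 inner containers
  for (int i = 0; i < outer_size; ++i) {
    int home_node = i % nodes;                 // outer elements round-robin
    int first_loc = home_node * cores_per_node;
    std::printf("inner[%d] -> locations %d..%d (node %d)\n",
                i, first_loc, first_loc + cores_per_node - 1, home_node);
  }
  return 0;
}
```

    Keeping each inner container node-local means a nested parallel function over it touches only intra-node links, which is the locality improvement the thesis measures.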

    Interpreting and Visualizing Performance Portability Metrics


    ytopt: Autotuning Scientific Applications for Energy Efficiency at Large Scales

    As we enter the exascale computing era, efficiently utilizing power and optimizing the performance of scientific applications under power and energy constraints has become critical and challenging. We propose a low-overhead autotuning framework to autotune the performance and energy of various hybrid MPI/OpenMP scientific applications at large scales and to explore the tradeoffs between application runtime and power/energy for energy-efficient application execution. We then use this framework to autotune four ECP proxy applications: XSBench, AMG, SWFFT, and SW4lite. Our approach uses Bayesian optimization with a random forest surrogate model to effectively search parameter spaces with up to 6 million different configurations on two large-scale production systems, Theta at Argonne National Laboratory and Summit at Oak Ridge National Laboratory. The experimental results show that our autotuning framework has low overhead at large scales and achieves good scalability. Using the proposed framework to identify the best configurations, we achieve up to 91.59% performance improvement, up to 21.2% energy savings, and up to 37.84% EDP improvement on up to 4,096 nodes.
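
    ytopt's search itself uses Bayesian optimization with a random-forest surrogate; as a simplified stand-in, the sketch below shows only the outer structure of such an autotuning loop (sample a configuration, run it, track the best energy-delay product). Plain random search replaces the surrogate-guided search, and run_app is a made-up measurement model, not a real application launcher.

```cpp
#include <cstdio>
#include <random>

// Outer structure of an autotuning loop: sample a configuration, "run" the
// application, track the configuration with the best energy-delay product.
struct Result { double seconds, joules; };

Result run_app(int threads, int block) {
  double t = 100.0 / threads + 0.01 * block;   // fake runtime model
  double e = t * (50.0 + 2.0 * threads);       // fake power grows w/ threads
  return {t, e};
}

int main() {
  std::mt19937 rng(42);
  std::uniform_int_distribution<int> threads_d(1, 64);
  std::uniform_int_distribution<int> block_d(16, 512);

  double best_edp = 1e300;
  int best_threads = 0, best_block = 0;

  for (int trial = 0; trial < 50; ++trial) {   // fixed evaluation budget
    int t = threads_d(rng), b = block_d(rng);
    Result r = run_app(t, b);
    double edp = r.joules * r.seconds;         // energy x delay
    if (edp < best_edp) {
      best_edp = edp; best_threads = t; best_block = b;
    }
  }
  std::printf("best: threads=%d block=%d EDP=%.1f\n",
              best_threads, best_block, best_edp);
  return 0;
}
```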

    Exploring and Evaluating Array Layout Restructuration for SIMDization

    SIMD processor units have become ubiquitous, and using SIMD instructions is key to performance for many applications. Modern compilers have made immense progress in generating efficient SIMD code; however, they may still fail, or SIMDize poorly, due to conservativeness, source complexity, or missing capabilities. When SIMDization fails, programmers are left with few clues about the root causes and the corrective actions to take. Our proposed guided SIMDization framework builds on MAQAO, an assembly-code quality assessment toolkit, to analyze binaries for possible SIMDization hindrances. It proposes improvement strategies and readily quantifies their impact, using in vivo evaluations of the suggested transformations. Thanks to our framework, programmers get clear directions and quantified expectations on how to improve the SIMDizability of their code. We show results of our technique on the TSVC benchmark.
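
    One classic layout restructuration such a framework can evaluate is the array-of-structures to structure-of-arrays (AoS to SoA) transformation: SoA gives the unit-stride accesses that compilers can turn into vector loads. A minimal sketch of the idea (not MAQAO output):

```cpp
#include <cstddef>
#include <vector>

// AoS: fields of one particle are contiguous, so successive xs are 4 floats
// apart in memory.
struct ParticleAoS { float x, y, z, w; };

float sum_x_aos(const std::vector<ParticleAoS>& p) {
  float s = 0.f;
  for (std::size_t i = 0; i < p.size(); ++i)
    s += p[i].x;              // stride-4 access: compiler must gather
  return s;
}

// SoA: each field is its own contiguous array.
struct ParticlesSoA { std::vector<float> x, y, z, w; };

float sum_x_soa(const ParticlesSoA& p) {
  float s = 0.f;
  for (std::size_t i = 0; i < p.x.size(); ++i)
    s += p.x[i];              // unit-stride access: vectorizes cleanly
  return s;
}

int main() {
  std::vector<ParticleAoS> aos(1000, {1.f, 2.f, 3.f, 4.f});
  ParticlesSoA soa{std::vector<float>(1000, 1.f), {}, {}, {}};
  return (sum_x_aos(aos) == sum_x_soa(soa)) ? 0 : 1;   // both sum to 1000
}
```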

    Automatic translation of non-repetitive OpenMP to MPI

    Cluster platforms with distributed-memory architectures are becoming increasingly available as low-cost solutions for high performance computing, so a productive programming environment that hides the complexity of clusters and allows efficient programs to be written is urgently needed. Despite multiple efforts to provide a shared-memory abstraction, message passing (MPI) is still the state-of-the-art programming model for distributed-memory architectures, and writing efficient MPI programs is challenging. In contrast, OpenMP is a shared-memory programming model known for its programming productivity. Researchers have introduced automatic source-to-source translation schemes from OpenMP to MPI so that programmers can use OpenMP while targeting clusters, but those schemes limited their focus to OpenMP programs with repetitive communication patterns (where the analysis of communication can be simplified). This dissertation addresses that limitation and presents a novel OpenMP-to-MPI translation scheme that covers OpenMP programs with both repetitive and non-repetitive communication patterns. We target laboratory-size clusters of ten to a hundred nodes (commonly found in research laboratories and small enterprises). With our translation scheme, six non-repetitive and four repetitive OpenMP benchmarks have been efficiently scaled to a cluster of 64 cores; by contrast, the state-of-the-art translator scaled only the four repetitive benchmarks. In addition, our translation scheme was shown to outperform or perform as well as the state-of-the-art translator. We also compare the translation scheme with available hand-coded MPI and Unified Parallel C (UPC) programs.
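
    To illustrate what such a translator automates, here is one OpenMP parallel loop translated to MPI by hand under a simple block distribution; the Allgather at the end is the communication the translation scheme must infer. This is a generic sketch of the technique, not the dissertation's generated code.

```cpp
#include <mpi.h>
#include <vector>

// OpenMP source being translated:
//   #pragma omp parallel for
//   for (int i = 0; i < n; ++i) b[i] = 2.0 * a[i];
// MPI version: block-distribute the iterations, then exchange the written
// blocks so every rank sees the full b, as shared memory would provide.
int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int n = 1 << 20;                 // assume size divides n for brevity
  int chunk = n / size, lo = rank * chunk;
  std::vector<double> a(n, 1.0), b(n, 0.0);

  for (int i = lo; i < lo + chunk; ++i)  // this rank's block of iterations
    b[i] = 2.0 * a[i];

  // Communication the translator must infer: gather every rank's block.
  MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                b.data(), chunk, MPI_DOUBLE, MPI_COMM_WORLD);

  MPI_Finalize();
  return 0;
}
```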