23 research outputs found

Award ER25844: Minimizing System Noise Effects for Extreme-Scale Scientific Simulation Through Function Delegation

Author: Lumsdaine Andrew
Publication venue: 'Office of Scientific and Technical Information (OSTI)'
Publication date: 20/11/2012

In software running on distributed computing clusters, time spent on communication between nodes in the cluster can be a significant portion of the overall computation time; background operating system tasks and other computational "noise" on the nodes of the system can have a significant impact on the amount of time this communication takes, especially on large systems. The research completed in this period has improved understanding of when such noise will have a significant impact. Specifically, it was demonstrated that not just noise on the nodes, but also noise on the network between nodes can have a significant impact on computation time. It was also demonstrated that noise patterns matter more than noise intensity: very regular noise can cause less disruption than lighter (on average) but less regular noise. It was also demonstrated that the effect of noise is more prominent as the speed of the network between nodes is increased. Furthermore, a tracing tool, Netgauge, was improved via our work, and a system simulator, LogGOPSim, was developed; they can be used by application developers to improve the performance of their programs and by system designers to mitigate the effects of noise by adjusting the noise characteristics of the operating system. Both have been made freely available as open source programs. In the course of developing these tools, we demonstrated weaknesses in existing methodologies for modeling communication, and we introduced a more detailed model, LogGOPS, for simulating systems. Not only were the deleterious effects of noise explored, but we have also offered solutions. Our studies of simulations of system noise have led to specific recommendations on tuning systems to mitigate noise. We have also improved existing approaches to mitigating noise.
"Non-blocking collective communication" avoids the effects of noise by letting communication continue simultaneously with computation (thus being "non-blocking"), so that the delays in communication introduced by noise have a smaller impact on overall computation time. Potentially, noise can be reduced much further by "offloading" communication tasks to a processing element separate from the one the operating system is using. We have improved our library LibNBC, which provides an implementation of non-blocking collectives, via this work. During this research, our proposal to include non-blocking collectives (which used LibNBC as a reference implementation) in the upcoming MPI-3 standard was accepted. As MPI is a ubiquitous and important standard for communication in parallel computing, this demonstrates a certain acceptance of the practicality and desirability of non-blocking collectives. Now that non-blocking collectives are a part of the standard, we can expect to see optimized platform-specific implementations of non-blocking collectives. As part of this work, we have also developed a language, GOAL (Global Operation Assembly Language), that can be used as a starting point for defining languages to express optimized communication algorithms.
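The overlap argument in this abstract can be illustrated with a toy timing model. The sketch below is not taken from the report or from LibNBC; the function names and the numbers are hypothetical, and it assumes an idealized case in which background communication overlaps perfectly with computation.

```python
def blocking_time(compute, comm, noise_delay):
    """Blocking collective: computation waits for communication,
    so compute time, communication time and noise delay add up."""
    return compute + comm + noise_delay


def nonblocking_time(compute, comm, noise_delay):
    """Idealized non-blocking collective (assumption: perfect overlap):
    communication runs in the background, so only the longer of the
    two phases contributes to the elapsed time."""
    return max(compute, comm + noise_delay)


if __name__ == "__main__":
    compute, comm = 10.0, 4.0  # hypothetical time units
    for noise in (0.0, 2.0, 8.0):
        print(noise,
              blocking_time(compute, comm, noise),
              nonblocking_time(compute, comm, noise))
```

Under these assumptions, a noise delay is hidden entirely as long as communication plus delay stays shorter than the computation it overlaps with; only larger delays become visible.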

Crossref

UNT Digital Library

Performance evaluation and enhancement of Dendro

Author: Mukherjee Jayanta

DENDRO is a collection of tools for solving Finite Element problems in parallel. This package is written in C++ using the standard template library (STL) and uses the Message Passing Interface (MPI). Dendro uses an octree data structure to solve image-registration problems using finite element techniques. For analyzing the behavior of the package in terms of speed-up and scalability, it is important to know which part of the package consumes most of the execution time. The single-node performance and the overall performance of the package depend on the code organization and class hierarchy. We used the PETSc profiler to collect the performance statistics and instrumented the code to find out which part of it takes most of the time. Along with the function-specific execution timings, the PETSc profiler also reports how many floating point operations are performed in total and on average (FLOP/second). PETSc also provides information on memory usage and on the number of MPI messages and reductions performed to execute a particular function. We have analyzed these performance statistics to provide guidelines on how Dendro can be made more efficient by optimizing certain functions. We obtained around 12X speedup over the performance of (default) Dendro by using compiler-provided optimizations and achieved more than 65% speedup over compiler-optimized performance (20X over the naive Dendro performance) by manually tuning some blocks of code along with the compiler optimizations.
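The quoted speedups compose multiplicatively. The short check below assumes that "65% speedup over compiler-optimized performance" means a further factor of 1.65; that reading is an assumption, and the variable names are illustrative only.

```python
# Speedup from compiler optimizations over default Dendro (from the abstract).
compiler_speedup = 12.0

# Additional gain from manual tuning, read here as a factor of 1.65
# (assumption: "65% speedup" means 1 + 0.65).
manual_gain = 0.65

# Speedups relative to successive baselines multiply.
combined = compiler_speedup * (1.0 + manual_gain)

print(round(combined, 1))
```

The product is just under 20, which is consistent with the "20X over the naive Dendro performance" figure given that both inputs are stated as approximate.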

Illinois Digital Environment for Access to Learning and Scholarship Repository

Predictive analysis and optimisation of pipelined wavefront applications using reusable analytic models

Author: Mudalige Gihan R.

Pipelined wavefront computations are a ubiquitous class of high performance parallel algorithms used for the solution of many scientific and engineering applications. In order to aid the design and optimisation of these applications, and to ensure that during procurement platforms are chosen best suited to these codes, there has been considerable research in analysing and evaluating their operational performance. Wavefront codes exhibit complex computation, communication and synchronisation patterns, and as a result there exist a large variety of such codes and possible optimisations. The problem is compounded by each new generation of high performance computing system, which has often introduced a previously unexplored architectural trait, requiring previous performance models to be rewritten and reevaluated. In this thesis, we address the performance modelling and optimisation of this class of application as a whole. This differs from previous studies in which bespoke models are applied to specific applications. The analytic performance models are generalised and reusable, and we demonstrate their application to the predictive analysis and optimisation of pipelined wavefront computations running on modern high performance computing systems. The performance model is based on the LogGP parameterisation, and uses a small number of input parameters to specify the particular behaviour of most wavefront codes. The new parameters and model equations capture the key structural and behavioural differences among different wavefront application codes, providing a succinct summary of the operations for each application and insights into alternative wavefront application design. The models are applied to three industry-strength wavefront codes and are validated on several systems including a Cray XT3/XT4 and an InfiniBand commodity cluster.
Model predictions show high quantitative accuracy (less than 20% error) for all high performance configurations and excellent qualitative accuracy. The thesis presents applications, projections and insights for optimisations using the model, which show the utility of reusable analytic models for performance engineering of high performance computing codes. In particular, we demonstrate the use of the model for: (1) evaluating application configuration and resulting performance; (2) evaluating hardware platform issues including platform sizing and configuration; (3) exploring hardware platform design alternatives and system procurement; and (4) considering possible code and algorithmic optimisations.
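For readers unfamiliar with LogGP, the sketch below shows only the basic LogGP cost of a single point-to-point message, plus the kind of relative-error check implied by "less than 20% error". It is not the thesis's wavefront model, whose additional parameters and equations are not given in this abstract. The parameter values are hypothetical.

```python
def loggp_message_time(k_bytes, L, o, G):
    """End-to-end time for one k-byte message under basic LogGP:
    send overhead o, (k - 1) * G for the remaining bytes,
    network latency L, then receive overhead o."""
    return o + (k_bytes - 1) * G + L + o


def relative_error(predicted, measured):
    """Relative prediction error with respect to a measured time."""
    return abs(predicted - measured) / measured


if __name__ == "__main__":
    # Hypothetical parameters in microseconds (per byte for G).
    L, o, G = 5.0, 1.0, 0.01
    predicted = loggp_message_time(1001, L, o, G)
    measured = 18.5  # hypothetical measurement, for illustration only
    print(predicted, relative_error(predicted, measured) < 0.20)
```

A full analytic model of a wavefront code would combine many such message costs with per-tile computation times along the pipeline; that composition is beyond what this abstract specifies.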

Warwick Research Archives Portal Repository

Predictive analysis and optimisation of pipelined wavefront applications using reusable analytic models

Author: Mudalige Gihan Ravideva
Publication date: 01/01/2009

OpenGrey Repository

Performance modeling for systematic performance tuning

Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2011

Crossref

Co-design Hardware and Algorithm for Vector Search

Author: Alonso Gustavo
He Zhenhao
Hoefler Torsten
Jiang Wenqi
Li Shigang
Licht Johannes de Fine
Rekatsinas Theodoros
Renggli Cedric
Shi Runbin
Zhang Shuai
Zhu Yu
Publication date: 27/06/2023

Vector search has emerged as the foundation for large-scale information retrieval and machine learning systems, with search engines like Google and Bing processing tens of thousands of queries per second on petabyte-scale document datasets by evaluating vector similarities between encoded query texts and web documents. As performance demands for vector search systems surge, accelerated hardware offers a promising solution in the post-Moore's Law era. We introduce FANNS, an end-to-end and scalable vector search framework on FPGAs. Given a user-provided recall requirement on a dataset and a hardware resource budget, FANNS automatically co-designs hardware and algorithm, subsequently generating the corresponding accelerator. The framework also supports scale-out by incorporating a hardware TCP/IP stack in the accelerator. FANNS attains up to 23.0× and 37.2× speedup compared to FPGA and CPU baselines, respectively, and demonstrates superior scalability to GPUs, achieving 5.5× and 7.6× speedup in median and 95th percentile (P95) latency within an eight-accelerator configuration. The remarkable performance of FANNS lays a robust groundwork for future FPGA integration in data centers and AI supercomputers. Comment: 11 page
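The "recall requirement" mentioned in this abstract is commonly measured against exact nearest neighbours. The sketch below shows that measurement on a tiny hand-made dataset; it is a generic illustration and says nothing about how FANNS itself searches. All names and data are hypothetical.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def exact_top_k(query, vectors, k):
    """Brute-force ground truth: ids of the k vectors closest to the query."""
    order = sorted(range(len(vectors)),
                   key=lambda i: squared_l2(query, vectors[i]))
    return order[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k neighbours that the approximate result found."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


if __name__ == "__main__":
    vectors = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
    query = (0.1, 0.1)
    truth = exact_top_k(query, vectors, k=2)
    approx = [0, 2]  # hypothetical output of an approximate index
    print(truth, recall_at_k(approx, truth))
```

An approximate index trades some of this recall for speed, which is why a recall target has to be fixed before hardware and algorithm parameters can be chosen.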

arXiv.org e-Print Archive

Partial aggregation for collective communication in distributed memory machines

Author: Kowalewski Roger
Publication venue: Ludwig-Maximilians-Universität München
Publication date: 03/08/2021

High Performance Computing (HPC) systems interconnect a large number of Processing Elements (PEs) in high-bandwidth networks to simulate complex scientific problems. The increasing scale of HPC systems poses great challenges for algorithm designers. As the average distance between PEs increases, data movement across hierarchical memory subsystems introduces high latency. Minimizing latency is particularly challenging in collective communications, where many PEs may interact in complex communication patterns. Although collective communications can be optimized for network-level parallelism, occasional synchronization delays due to dependencies in the communication pattern degrade application performance. To reduce the performance impact of communication and synchronization costs, parallel algorithms are designed with sophisticated latency hiding techniques. The principle is to interleave computation with asynchronous communication, which increases the overall occupancy of compute cores. However, collective communication primitives abstract parallelism, which limits the integration of latency hiding techniques. Approaches to work around these limitations either modify the algorithmic structure of application codes, or replace collective primitives with verbose low-level communication calls. While these approaches give fine-grained control for latency hiding, implementing collective communication algorithms is challenging and requires expert knowledge about HPC network topologies. A collective communication pattern is commonly described as a Directed Acyclic Graph (DAG) where a set of PEs, represented as vertices, resolve data dependencies through communication along the edges. Our approach improves latency hiding in collective communication through partial aggregation. Based on mathematical rules of binary operations and homomorphism, we expose data parallelism in the respective DAG to overlap computation with communication.
The proposed concepts are implemented and evaluated with a subset of collective primitives in the Message Passing Interface (MPI), an established communication standard in scientific computing. An experimental analysis with communication-bound microbenchmarks shows considerable performance benefits for the evaluated collective primitives. A detailed case study with a large-scale distributed sort algorithm demonstrates how partial aggregation significantly improves performance in data-intensive scenarios. Besides better latency hiding capabilities with collective communication primitives, our approach enables further optimizations of their implementations within MPI libraries. The large number of asynchronous programming models that are actively studied in the HPC community can benefit from partial aggregation in collective communication patterns. Future work can utilize partial aggregation to improve the interaction of MPI collectives with accelerator architectures, and to design more efficient communication algorithms.
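One algebraic property that makes this kind of aggregation possible is associativity: partial results can be combined as chunks arrive, and the outcome matches a single reduction over all the data. The sketch below shows only that property with plain Python lists; it is not the thesis's implementation, and the function names are hypothetical.

```python
from functools import reduce
import operator


def full_reduce(values, op):
    """Reduce all values at once, as a blocking collective would after
    every contribution has arrived."""
    return reduce(op, values)


def partial_aggregate(chunks, op):
    """Fold each chunk into a running result as soon as it is available.
    For an associative op this gives the same answer as full_reduce."""
    acc = None
    for chunk in chunks:
        partial = reduce(op, chunk)  # work that could overlap with communication
        acc = partial if acc is None else op(acc, partial)
    return acc


if __name__ == "__main__":
    values = list(range(1, 13))
    # Pretend the data arrives in three messages of four elements each.
    chunks = [values[i:i + 4] for i in range(0, len(values), 4)]
    print(partial_aggregate(chunks, operator.add), full_reduce(values, operator.add))
    print(partial_aggregate(chunks, max), full_reduce(values, max))
```

Integers are used deliberately: floating-point addition is not exactly associative, so a chunked sum of floats may differ from a single-pass sum in the last bits.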

Digitale Hochschulschriften der LMU

Algorithm engineering for parallel computation

Author: Bader D. A.
Moret B. M. E.
Sanders P.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 12/12/2006

Infoscience - École polytechnique fédérale de Lausanne

PARALiA: a performance aware runtime for auto-tuning linear algebra on heterogeneous systems

Author: Anastasiadis Petros
Goumas Georgios
Hoppe Dennis
Koziris Nectarios
Papadopoulou Nikela
Zhong Li
Publication venue: ACM
Publication date: 01/12/2023

Dense linear algebra operations appear very frequently in high-performance computing (HPC) applications, rendering their performance crucial to achieve optimal scalability. As many modern HPC clusters contain multi-GPU nodes, BLAS operations are frequently offloaded to GPUs, necessitating the use of optimized libraries to ensure good performance. Unfortunately, multi-GPU systems are accompanied by two significant optimization challenges: data transfer bottlenecks, and problem splitting and scheduling across multiple workers (GPUs) with distinct memories. We demonstrate that the current multi-GPU BLAS methods for tackling these challenges target very specific problem and data characteristics, resulting in serious performance degradation for any slightly deviating workload. Additionally, an even more critical decision is omitted because it cannot be addressed using current scheduler-based approaches: the determination of which devices should be used for a certain routine invocation. To address these issues we propose a model-based approach: using performance estimation to provide problem-specific autotuning during runtime. We integrate this autotuning into an end-to-end BLAS framework named PARALiA. This framework couples autotuning with an optimized task scheduler, leading to near-optimal data distribution and performance-aware resource utilization. We evaluate PARALiA in an HPC testbed with 8 NVIDIA-V100 GPUs, improving the average performance of GEMM by 1.7× and energy efficiency by 2.5× over the state-of-the-art in a large and diverse dataset and demonstrating the adaptability of our performance-aware approach to future heterogeneous systems.
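The idea of choosing devices from a performance estimate can be sketched with a deliberately simple cost model. Everything below is hypothetical: the model (compute time shrinking with device count plus a per-device overhead), the parameter values, and the function names are illustrative assumptions and are not PARALiA's actual model or API.

```python
def predicted_time(work, n_devices, throughput, per_device_overhead):
    """Toy estimate (assumption): compute time divides evenly across devices,
    while transfer and scheduling overhead grows with each device used."""
    return work / (n_devices * throughput) + n_devices * per_device_overhead


def choose_device_count(work, max_devices, throughput, per_device_overhead):
    """Pick the device count with the lowest predicted time."""
    return min(
        range(1, max_devices + 1),
        key=lambda n: predicted_time(work, n, throughput, per_device_overhead),
    )


if __name__ == "__main__":
    # Hypothetical units: work in operations, throughput in operations per second.
    for work in (50.0, 1000.0):
        n = choose_device_count(work, max_devices=8,
                                throughput=100.0, per_device_overhead=0.4)
        print(work, n)
```

Even this crude model shows why the device-selection decision matters: for the small problem a single device is predicted to be fastest, while the larger one benefits from several devices but not all eight.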

Enlighten