6 research outputs found

Evaluating the performance of the allreduce collective operation on clusters. Approach and results

Authors: Anshus Otto J.; Bjørndalen John Markus; Bongo Lars Ailo
Publication venue: University of Tromsø
Publication date: 01/01/2004

The performance of the collective operations provided by a communication library is important for many applications run on clusters. The communication structure of collective operations can be organized as a tree. Performance can be improved by configuring and mapping the tree to the clusters in use. We describe and demonstrate an approach for evaluating the performance of different configurations and mappings of allreduce run on clusters of different sizes, consisting of single-CPU hosts and SMPs with different numbers of CPUs. A breakdown of the cost of allreduce using the best configuration on different clusters is provided. For all clusters, the broadcast part is more expensive than the reduce part. Inter-host communication contributes more to the time per allreduce than the synchronization in the allreduce components. For the small message sizes used (4 and 256 bytes), the time spent computing the partial reductions is insignificant. Reconfiguring hierarchy-aware trees improved performance by up to a factor of 1.49, by avoiding scalability problems of the components on SMPs, and by finding the right balance between available concurrency, load on 'root' hosts, and the number of network links in a tree. Extending a tree by adding more threads, or by combining two trees, does not have a negative influence on the performance of a configuration, but increasing message size does.
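
The tree organization mentioned in the abstract can be illustrated with a minimal single-process sketch: an allreduce performed as a reduce phase up a tree followed by a broadcast phase back down. This is not the authors' implementation. The function name `tree_allreduce`, the `fanout` parameter, and the index-based parent rule are assumptions made for illustration only, and no real network communication or threading is modeled.

```python
from operator import add


def tree_allreduce(values, fanout=2, op=add):
    """Simulate an allreduce over a tree with one node per input value.

    Hypothetical layout: node 0 is the root and node i (i > 0) has
    parent (i - 1) // fanout. Returns (per-node results, message count).
    """
    n = len(values)
    partial = list(values)
    messages = 0

    # Reduce phase: visit nodes from the highest index down, so every
    # child has been folded into a node before that node is sent upward.
    for i in range(n - 1, 0, -1):
        parent = (i - 1) // fanout
        partial[parent] = op(partial[parent], partial[i])
        messages += 1

    # Broadcast phase: the root's result is copied down the same tree.
    result = [None] * n
    if n:
        result[0] = partial[0]
    for i in range(1, n):
        result[i] = result[(i - 1) // fanout]
        messages += 1

    return result, messages


if __name__ == "__main__":
    res, msgs = tree_allreduce([1, 2, 3, 4, 5], fanout=2)
    print(res, msgs)
```

In this sketch, changing `fanout` alters the depth of the tree and the number of children each parent must serve, but not the total number of messages, which is one way to picture the balance between concurrency and load on 'root' hosts that the abstract refers to.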

Munin - Open Research Archive

Compiler and Runtime Optimization Techniques for Implementation Scalable Parallel Applications

Author: Khatami Zahra
Publication venue: LSU Digital Commons
Publication date: 03/08/2017

The compiler is able to detect the data dependencies in an application and to analyze specific sections of code for parallelization potential. However, these compiler techniques are usually applied at compile time, so they rely on static analysis, which is insufficient for achieving maximum parallelism and the desired application scalability. To generate a safe parallel application, these techniques should consider both the static information gathered at compile time and the dynamic information about the system captured at runtime. On the other hand, runtime information is often speculative, and relying on it alone doesn't guarantee maximal parallel performance, so collecting information at compile time could significantly improve the performance of runtime techniques. This research achieves that goal by introducing new techniques for both the compiler and the runtime system that enable them to cooperate and to use both static and dynamic analysis information to maximize parallel application performance. In the proposed framework, a compiler can implement dynamic runtime methods in its parallelization optimizations, and a runtime system can apply static information in implementing its parallelization methods. The proposed techniques are able to use high-level programming abstractions and machine learning to relieve the programmer of difficult and tedious decisions that can significantly affect program behavior and performance.

Louisiana State University

Dynamics and pragmatics for high performance concurrency

Author: Barnes Frederick R. M.
Publication date: 02/11/2021

This thesis is concerned with support at all levels for building highly concurrent and dynamic parallel processing systems. The CSP model of concurrency, as (largely) embodied in the occam programming language, is used due to its simplicity, expressiveness, architecture-independent nature, and potential for high performance. Additionally, occam provides guarantees regarding freedom from aliasing and race-hazard error. This thesis addresses one of the grand challenges of present-day computer science: providing a software technology that offers the dynamic flexibility and performance of mainstream object-oriented environments with the level of safety, formal analysis, modularity and lightweight concurrency offered by CSP/occam. Two approaches to this challenge are possible: do something to make the mainstream languages (e.g. Java, C++) safe, or make occam dynamic without compromising its existing good properties. This thesis follows the latter route. The first part of this thesis concentrates on enhancing the occam language and run-time system on a commodity platform (IBM PC) running the freely available Linux operating system. After a brief introduction to the various components of the kroc occam system, additions and extensions to the occam programming language and supporting run-time system are examined. These provide a greater degree of programming flexibility in occam (for example, by adding support for dynamic allocation, mobile semantics and dynamic network construction), without compromising the safety of programs which use them. Benchmarks are reported that demonstrate significant improvements in performance (for example, channel communication in tens of nanoseconds). The second part concentrates on improving the level of interaction between occam programs and the OS environment, for example by providing easy access to sockets and networking.
This thesis concludes with a discussion of the work presented herein, with consideration given to parallels with object-oriented languages. Also described are details of ongoing and potential future research. The modified language grammar, details of new compiler-generated code, and miscellany are provided in the appendices.

Kent Academic Repository

Dynamics and pragmatics for high performance concurrency

Author: Barnes Frederick R M
Publication date: 01/01/2003

OpenGrey Repository

A unified model for inter- and intra-processor concurrency

Author: Schweigler Mario
Publication date: 16/11/2021

Although concurrency is generally perceived to be a 'hard' subject, it can in fact be very simple, provided that the underlying model is simple. The occam-pi parallel processing language provides such a simple yet powerful concurrency model, based on CSP and the pi-calculus. This thesis presents pony, the occam-pi Network Environment. occam-pi and pony provide a new, unified concurrency model that bridges inter- and intra-processor concurrency. This enables the development of distributed applications in a transparent, dynamic and highly scalable way. The author specified the layout of the pony system as presented in this thesis, and carried out about 90% of the implementation. This thesis is structured into three main parts, as well as an introduction and an appendix. In the introduction, the need for a unified concurrency model is examined in detail, and the pony environment is then presented as a solution that provides such a model. The first part of this thesis is concerned with the usage of the pony environment for the development of distributed applications. It presents the interface between pony and the user-level code, as well as pony's configuration and a sample application. The second part presents the design and implementation of the pony environment. It explains the internal structure of pony, the implementation of pony's components and public processes, and the integration of pony in the KRoC compiler. The third part evaluates pony's performance and contains the final conclusions. It presents a number of performance tests and concludes with a discussion of the work presented in this thesis, along with an outline of possible future research.

Kent Academic Repository

A unified model for inter- and intra-processor concurrency

Author: Schweigler Mario
Publication date: 01/01/2006

OpenGrey Repository