6 research outputs found

    Evaluating the performance of the allreduce collective operation on clusters. Approach and results

    Get PDF
    The performance of the collective operations provided by a communication library is important for many applications run on clusters. The communication structure of collective operations can be organized as a tree. Performance can be improved by configuring and mapping the tree to the clusters in use. We describe and demonstrate an approach for evaluating the performance of different configurations and mappings of allreduce run on clusters of different size, consisting of single-CPU hosts, and SMPs with a different number of CPUs. A breakdown of the cost of allreduce using the best configuration on different clusters is provided. For all, the broadcast part is more expensive than the reduce part. Inter-host communication contributes more to the time per allreduce than the synchronization in the allreduce components. For the small messages sizes used (4 and 256 bytes), the time spent computing the partial reductions is insignificant. Reconfiguring hierarchy aware trees improved performance up to a factor of 1.49, by avoiding scalability problems of the components on SMPs, and by finding the right balance between available concurrency, load on 'root' hosts and the number of network links in a tree. Extending a tree by adding more threads, or by combining two trees does not have a negative influence on the performance of a configuration, but increasing message size does

    Compiler and Runtime Optimization Techniques for Implementation Scalable Parallel Applications

    Get PDF
    The compiler is able to detect the data dependencies in an application and is able to analyze the specific sections of code for parallelization potential. However, all of these techniques provided by a compiler are usually applied at compile time, so they rely on static analysis, which is insufficient for achieving maximum parallelism and desired application scalability. These compiler techniques should consider both the static information gathered at compile time and dynamic analysis captured at runtime about the system to generate a safe parallel application. On the other hand, runtime information is often speculative. Solely relying on it doesn\u27t guarantee maximal parallel performance. So collecting information at compile time could significantly improve the runtime techniques performance. The goal is achieved in this research by introducing new techniques proposed for both compiler and runtime system that enable them to contribute with each other and utilize both static and dynamic analysis information to maximize application parallel performance. In the proposed framework, a compiler can implement dynamic runtime methods in its parallelization optimizations and a runtime system can apply static information in its parallelization methods implementation. The proposed techniques are able to use high-level programming abstractions and machine learning to relieve the programmer of difficult and tedious decisions that can significantly affect program behavior and performance

    Dynamics and pragmatics for high performance concurrency

    Get PDF
    This thesis is concerned with support at all levels for building highly concurrent and dynamic parallel processing systems. The CSP model of concurrency, as (largely) embodied in the occam programming language is used due to its simplicity, expressiveness, architecture- independent nature, and potential for high performance. Additionally, occam provides guarantees regarding freedom from aliasing and race-hazard error. This thesis addresses one of the grand challenges of present day computer science: providing a software technology that offers the dynamic flexibility and performance of mainstream object oriented environments with the level of safety, formal analysis, modularity and lightweight concurrency offered by CSP/occam. Two approaches to this challenge are possible: do something to make the mainstream languages (e.g. Java, C++) safe, or make occam dynamic -- without compromising its existing good properties. This thesis follows the latter route. The first part of this thesis concentrates on enhancing the occam language and run-time system, on a commodity platform (IBM PC) running the freely available Linux operating system. After a brief introduction to the various components of the kroc occam system, additions and extensions to the occam programming language and supporting run-time system are examined. These provide a greater degree of programming flexibility in occam (for example, by adding support for dynamic allocation, mobile semantics and dynamic network construction), without compromising the safety of programs which use them. Benchmarks are reported that demonstrate significant improvements in performance (for example, channel communication in tens of nano-seconds). The second part concentrates on improving the level of interaction between occam programs and the OS environment. Providing easy access to sockets and networking, for example. This thesis concludes with a discussion of the work presented herein, with consideration given to parallels with object-oriented languages. Also described are details of ongoing and potential future research. The modified language grammar, details of new compiler generated code, and miscellany are provided in the appendices

    Dynamics and pragmatics for high performance concurrency

    Get PDF
    This thesis is concerned with support at all levels for building highly concurrent and dynamic parallel processing systems. The CSP model of concurrency, as (largely) embodied in the occam programming language is used due to its simplicity, expressiveness, architecture- independent nature, and potential for high performance. Additionally, occam provides guarantees regarding freedom from aliasing and race-hazard error. This thesis addresses one of the grand challenges of present day computer science: providing a software technology that offers the dynamic flexibility and performance of mainstream object oriented environments with the level of safety, formal analysis, modularity and lightweight concurrency offered by CSP/occam. Two approaches to this challenge are possible: do something to make the mainstream languages (e.g. Java, C++) safe, or make occam dynamic -- without compromising its existing good properties. This thesis follows the latter route. The first part of this thesis concentrates on enhancing the occam language and run-time system, on a commodity platform (IBM PC) running the freely available Linux operating system. After a brief introduction to the various components of the kroc occam system, additions and extensions to the occam programming language and supporting run-time system are examined. These provide a greater degree of programming flexibility in occam (for example, by adding support for dynamic allocation, mobile semantics and dynamic network construction), without compromising the safety of programs which use them. Benchmarks are reported that demonstrate significant improvements in performance (for example, channel communication in tens of nano-seconds). The second part concentrates on improving the level of interaction between occam programs and the OS environment. Providing easy access to sockets and networking, for example. This thesis concludes with a discussion of the work presented herein, with consideration given to parallels with object-oriented languages. Also described are details of ongoing and potential future research. The modified language grammar, details of new compiler generated code, and miscellany are provided in the appendices.EThOS - Electronic Theses Online ServiceGBUnited Kingdo

    A unified model for inter- and intra-processor concurrency

    Get PDF
    Although concurrency is generally perceived to be a `hard' subject, it can in fact be very simple --- provided that the underlying model is simple. The occam-pi parallel processing language provides such a simple yet powerful concurrency model that is based on CSP and the pi-calculus. This thesis presents pony, the occam-pi Network Environment. occam-pi and pony provide a new, unified, concurrency model that bridges inter- and intra-processor concurrency. This enables the development of distributed applications in a transparent, dynamic and highly scalable way. The author specified the layout of the pony system as presented in this thesis, and carried out about 90% of the implementation. This thesis is structured into three main parts, as well as an introduction and an appendix. In the introduction, the need for a unified concurrency model is examined in detail. Thereupon, the pony environment is presented as a solution that provides such a unified model. The first part of this thesis is concerned with the usage of the pony environment for the development of distributed applications. It presents the interface between pony and the user-level code, as well as pony's configuration and a sample application. The second part presents the design and implementation of the pony environment. It explains the internal structure of pony, the implementation of pony's components and public processes, and the integration of pony in the KRoC compiler. The third part evaluates pony's performance and contains the final conclusions. It presents a number of performance tests and concludes with a discussion of the work presented in this thesis, along with an outline of possible future research

    A unified model for inter- and intra-processor concurrency

    Get PDF
    Although concurrency is generally perceived to be a `hard' subject, it can in fact be very simple --- provided that the underlying model is simple. The occam-pi parallel processing language provides such a simple yet powerful concurrency model that is based on CSP and the pi-calculus. This thesis presents pony, the occam-pi Network Environment. occam-pi and pony provide a new, unified, concurrency model that bridges inter- and intra-processor concurrency. This enables the development of distributed applications in a transparent, dynamic and highly scalable way. The author specified the layout of the pony system as presented in this thesis, and carried out about 90% of the implementation. This thesis is structured into three main parts, as well as an introduction and an appendix. In the introduction, the need for a unified concurrency model is examined in detail. Thereupon, the pony environment is presented as a solution that provides such a unified model. The first part of this thesis is concerned with the usage of the pony environment for the development of distributed applications. It presents the interface between pony and the user-level code, as well as pony's configuration and a sample application. The second part presents the design and implementation of the pony environment. It explains the internal structure of pony, the implementation of pony's components and public processes, and the integration of pony in the KRoC compiler. The third part evaluates pony's performance and contains the final conclusions. It presents a number of performance tests and concludes with a discussion of the work presented in this thesis, along with an outline of possible future research.EThOS - Electronic Theses Online ServiceGBUnited Kingdo
    corecore