15 research outputs found

    Shared memory parallelism in Modern C++ and HPX

    Full text link
    Parallel programming remains a daunting challenge, from the struggle to express a parallel algorithm without cluttering the underlying synchronous logic, to describing which devices to employ in a calculation, to correctness. Over the years, numerous solutions have arisen, many of them requiring new programming languages, extensions to programming languages, or the addition of pragmas. Support for these various tools and extensions is available to a varying degree. In recent years, the C++ standards committee has worked to refine the language features and libraries needed to support parallel programming on a single computational node. Eventually, all major vendors and compilers will provide robust and performant implementations of these standards. Until then, the HPX library and runtime provides cutting edge implementations of the standards, as well as proposed standards and extensions. Because of these advances, it is now possible to write high performance parallel code without custom extensions to C++. We provide an overview of modern parallel programming in C++, describing the language and library features, and providing brief examples of how to use them

    Optimizing the Performance of Multi-threaded Linear Algebra Libraries Based on Task Granularity

    Get PDF
    Linear algebra libraries play a very important role in many HPC applications. As larger datasets are created everyday, it also becomes crucial for the multi-threaded linear algebra libraries to utilize the compute resources properly. Moving toward exascale computing, the current programming models would not be able to fully take advantage of the advances in memory hierarchies, computer architectures, and networks. Asynchronous Many-Task(AMT) Runtime systems would be the solution to help the developers to manage the available parallelism. In this Dissertation we propose an adaptive solution to improve the performance of a linear algebra library based on a set of compile-time and runtime characteristics including the machine architecture, the expression being evaluated, the number of cores to run the application on, the type of the operation, and also the size of the matrices, to get as close as possible to the highest performance. Our focus is on machine learning applications, where we are potentially dealing with very large matrices, which could make creating temporaries very expensive. For this purpose we selected Blaze C++ library, a high performance template-based math library that gives us this option to access the expression tree at compile time, along with HPX, a C++ standard library for concurrency and parallelism, as our runtime system. HPX, as an AMT runtime system, offers scalibility and fine-grained parallelism through creating lightweight threads for fast context switching between the threads. Finding the optimum task granularity is a challenge in AMTs. Creating too many small tasks would result in performance degradation due to scheduling overhead of the tasks, and creating too few tasks would lead to under-utilization of the resources. Our work focuses on finding the optimum task granularity for each specific problem. We tried two different approaches to model the relationship between the performance and the grain size, in order to find a range of grain sizes that could lead us to the maximum performance. First, we used polynomial functions to model how throughput changes in terms of the grain size, and number of cores. Although this method was successful in finding the range of grain sizes for maximum throughput, it was not physical. This motivated us to go deeper and try to develop an analytical model for execution time in terms of grain size for balanced parallel for loops. Based on the analytical model we propose a method to predict the range of grain size for minimum execution time. Moreover, since the parameters of the proposed model only depend on the system architecture, we suggest to use a parallel for-loop benchmark to find these parameters on a system, and use it to find the range of grain size for minimum execution time for arbitrary balanced parallel for-loop applications ran on the same machine. Having the mentioned models, we changed the current implementation of the HPX backend for Blaze by adding two parameters to represent the unit of work, and the number of units included in each task, for fine grained control of the parallelism, which is possible through HPX runtime system. Also, a complexity estimation function has been added to Blaze to estimate the number of floating point operations occurring in each unit of work. The model parameters estimated through the parallel for-loop benchmark could also be plugged into Blaze at compile time, in order to find the optimum range of grain size at runtime based on the matrix sizes and complexity of the operations. In the next step, we used the identified range of grain size to extend the previous implementation of splittable tasks, as an algorithm to control task granularity. We modified the current implementation by scheduling the tasks on idle cores directly instead of waiting for them to be stolen, and integrating the lower-bound of the analytical model as the threshold to stop splitting, in order to adapt the threshold to the system architecture and the application being executed

    Parallel Programming with Global Asynchronous Memory: Models, C++ APIs and Implementations

    Get PDF
    In the realm of High Performance Computing (HPC), message passing has been the programming paradigm of choice for over twenty years. The durable MPI (Message Passing Interface) standard, with send/receive communication, broadcast, gather/scatter, and reduction collectives is still used to construct parallel programs where each communication is orchestrated by the developer-based precise knowledge of data distribution and overheads; collective communications simplify the orchestration but might induce excessive synchronization. Early attempts to bring shared-memory programming model鈥攚ith its programming advantages鈥攖o distributed computing, referred as the Distributed Shared Memory (DSM) model, faded away; one of the main issue was to combine performance and programmability with the memory consistency model. The recently proposed Partitioned Global Address Space (PGAS) model is a modern revamp of DSM that exposes data placement to enable optimizations based on locality, but it still addresses (simple) data- parallelism only and it relies on expensive sharing protocols. We advocate an alternative programming model for distributed computing based on a Global Asynchronous Memory (GAM), aiming to avoid coherency and consistency problems rather than solving them. We materialize GAM by designing and implementing a distributed smart pointers library, inspired by C++ smart pointers. In this model, public and pri- vate pointers (resembling C++ shared and unique pointers, respectively) are moved around instead of messages (i.e., data), thus alleviating the user from the burden of minimizing transfers. On top of smart pointers, we propose a high-level C++ template library for writing applications in terms of dataflow-like networks, namely GAM nets, consisting of stateful processors exchanging pointers in fully asynchronous fashion. We demonstrate the validity of the proposed approach, from the expressiveness perspective, by showing how GAM nets can be exploited to implement both standalone applications and higher-level parallel program- ming models, such as data and task parallelism. As for the performance perspective, preliminary experiments show both close-to-ideal scalability and negligible overhead with respect to state-of-the-art benchmark implementations. For instance, the GAM implementation of a high-quality video restoration filter sustains a 100 fps throughput over 70%-noisy high-quality video streams on a 4-node cluster of Graphics Processing Units (GPUs), with minimal programming effort

    Performance Analysis and Improvement for Scalable and Distributed Applications Based on Asynchronous Many-Task Systems

    Get PDF
    As the complexity of recent and future large-scale data and exascale systems architectures grows, so do productivity, portability, software scalability, and efficient utilization of system resources challenges presented to both industry and the research community. Software solutions and applications are expected to scale in performance on such complex systems. Asynchronous many-task (AMT) systems, taking advantage of multi-core architectures with light-weight threads, asynchronous executions, and smart scheduling, are showing promise in addressing these challenges. In this research, we implement several scalable and distributed applications based on HPX, an exemplar AMT runtime system. First, a distributed HPX implementation for a parameterized benchmark Task Bench is introduced. The performance bottleneck is analyzed where the repeated HPX threads creation costs and a global barrier for all threads limit the performance. The methodologies to retain the spawning threads alive and overlap communication and computation are presented. The evaluation results prove the effectiveness of the improved approach, where HPX is comparable with the prevalent programming models and takes advantages of multi-task scenarios. Second, an algorithms and data-structures SHAD library with HPX support is introduced. The methodologies to support local and remote operations in synchronous and asynchronous manners are developed. The HPX implementation in support of the SHAD library is further provided. Performance results demonstrate that the proposed system presents the similar performance as SHAD with Intel TBB (Threading Building Blocks) support for shared-memory parallelism and is better to explore the distributed-memory parallelism than SHAD with GMT (Global Memory and Threading) support. Third, an asynchronous array processing framework Phylanx is introduced. The methodologies that support a distributed alternating least square algorithm are developed. The implementation of this algorithm along with a number of distributed primitives are provided. The performance results show that Phylanx implementation presents a good scalability. Finally, a scalable second-order method for optimization is introduced. The implementation of a Krylov-Newton second-order method via PyTorch framework is provided. Evaluation results illustrate the effectiveness of scalability, convergence, and robust to hyper-parameters of the proposed method

    Scheduling optimization of parallel linear algebra algorithms using Supervised Learning

    Full text link
    Linear algebra algorithms are used widely in a variety of domains, e.g machine learning, numerical physics and video games graphics. For all these applications, loop-level parallelism is required to achieve high performance. However, finding the optimal way to schedule the workload between threads is a non-trivial problem because it depends on the structure of the algorithm being parallelized and the hardware the executable is run on. In the realm of Asynchronous Many Task runtime systems, a key aspect of the scheduling problem is predicting the proper chunk-size, where the chunk-size is defined as the number of iterations of a for-loop assigned to a thread as one task. In this paper, we study the applications of supervised learning models to predict the chunk-size which yields maximum performance on multiple parallel linear algebra operations using the HPX backend of Blaze's linear algebra library. More precisely, we generate our training and tests sets by measuring performance of the application with different chunk-sizes for multiple linear algebra operations; vector-addition, matrix-vector-multiplication, matrix-matrix addition and matrix-matrix-multiplication. We compare the use of logistic regression, neural networks and decision trees with a newly developed decision tree based model in order to predict the optimal value for chunk-size. Our results show that classical decision trees and our custom decision tree model are able to forecast a chunk-size which results in good performance for the linear algebra operations.Comment: Accepted at HPCML1

    Relajaciones de ejecuci贸n definidas por el usuario para la mejora de la programabilidad en computaci贸n paralela de altas prestaciones

    Get PDF
    Tesis de la Universidad Complutense de Madrid, Facultad de Inform谩tica, le铆da el 22-11-2019This thesis proposes the development and implementation of a new programming model basedon execution relaxations, and focused on High-Performance Parallel Computing. Specifically,the main goals of the thesis are:1. Advocate a development methodology in which users define the basic computing units(tasks), together with a set of relaxations in, possibly, multiple dimensions. These relaxationswill be translated, at execution time, into expanded (and complex) scheduling opportunitiesdepending on the underlying architectural features, yielding improvements in termsof desired output metrics (e.g., performance or energy consumption).2. Abstract away users from the complexity of the underlying heterogeneous hardware, delegatingthe proper exploitation of expanded scheduling choices to a system software component(typically referred as a runtime). This piece of software, armed with knowledge fromstatic architectural characteristics and dynamic status of the hardware at execution time,will exploit those combinations considered optimal among those relaxations proposed bythe user for each task ready for execution.3. Extend this abstraction in order to describe both computing systems, by means of executor/ allocator hierarchies that describe the heterogeneous computing architecture, and applications,in terms of sets of interdependent tasks. In addition, the relations between executorsand tasks are categorized into a new task-executor taxonomy, which motivates ambiguityfreeHPC programming frontends based on the STSE, Single Task - Single Executor classification,distinguished from fully-automated runtime backends.4. Propose a new programming model (STEEL) based on previous ideas, that gathers featuresconsidered to be basic for future task-based programming models, namely: performance,composability, expressivity and hard-to-misuse interfaces.5. Specify an API to support the STEEL programming model, and a runtime implementationleveraging techniques and programming paradigms supported by modern C++, illustratingits flexibility, ease of use and performance impact by means of simple use cases and examples.Hence, the proposed methodology stands for a clear and strict separation of concerns betweenthe involved actors in a parallel executions: user / codes and underlying hardware. This kind ofabstractions allows a delegation of the expert knowledge from the user toward the system software(runtime) in a systematic way, and facilitates the integration of mechanisms to automate optimizations,adapting performance to the specificities of the heterogeneous parallel architecture in whichthe code is instantiated and executed.From this perspective, the thesis designs, implements and validates mechanisms to perform aso-called complexity formalization, classifying many actions that are currently done by the userand building a framework in which these complexities can be delegated to the runtime system. Thedelegation of these decisions is already a step forward to next generation of programming modelsseeking performance, expressivity, programmability and portability...La presente tesis doctoral propone el dise帽o e implementaci贸n de un nuevo modelo de programaci贸n basado en relajaciones de ejecuci贸n y enfocado al 谩mbito de la Computaci贸n Paralela de Altas Prestaciones. Concretamente, los objetivos principales de la tesis son:1. Abogar por una metodolog铆a de desarrollo en la que el usuario define las unidades b谩sicas de computo (tareas), junto con un conjunto de relajaciones en, posiblemente, m煤ltiples dimensiones. Estas relajaciones se traducir谩n, en tiempo de ejecuci贸n, en oportunidades expandidas(y complejas) de planificaci贸n en funci贸n de la arquitectura subyacente, impactando as铆 en m茅tricas como rendimiento o consumo energ茅tico.2. Abstraer al usuario de la complejidad del hardware subyacente, delegando la correcta explotaci贸n de dichas posibilidades de planificaci贸n expandidas a un componente software de sistema (t铆picamente conocido como runtime). Dicho software, dotado de conocimiento tanto de las caracter铆sticas est谩ticas de la arquitectura subyacente como del estado puntual de la misma en el momento de la ejecuci贸n, explotar谩 las combinaciones consideradas optimas de entre las relajaciones propuestas por el usuario para cada tarea lista para set ejecutada.3. Extender dicha abstracci贸n para describir tanto sistemas de c贸mputo, en forma de jerarqu铆a de ejecutores y alojadores de memoria que en 麓ultimo t茅rmino describen una arquitectura de c贸mputo heterog茅nea, como aplicaciones, en forma de un conjunto de tareas interrelacionadas. Adem谩s, las relaciones entre ejecutores y tareas son clasificadas en una nueva taxonom铆a tarea-ejecutor, la cual motiva frontends de programaci贸n HPC sin ambig眉edad basados en la clasificaci贸n STSE, Single Task - Single Executor, separada de backends runtime totalmente automatizados.4. Proponer un nuevo modelo de programaci贸n (STEEL) basado en la clasificaci贸n STSE que aglutine ciertas caracter铆sticas consideradas b谩sicas de cara al 茅xito de los futuros modelos de programaci贸n basados en tareas: rendimiento, facilidad de composici贸n, expresividad e interfaces no permisivos ante fallos.5. Especificar una API que d茅 soporte al modelo de programaci贸n, as铆 como una implementaci贸n runtime del mismo aprovechando t茅cnicas y paradigmas soportados en el lenguaje C++ de 煤ltima generaci贸n, e ilustrar su uso, flexibilidad e impacto en el rendimiento a trav茅s de ejemplos y casos de uso sencillos .La metodolog铆a que se propugna aboga por una clara y estricta separaci贸n de conceptos entre los actores b谩sicos que componen una ejecuci贸n paralela: usuario / c贸digo y hardware subyacente. Este tipo de abstracciones permite delegar el conocimiento experto desde el usuario hacia el software de sistema, proporcionando as铆 mecanismos para mecanizar y automatizar su optimizaci贸n ,y adaptar su rendimiento a la arquitectura paralela sobre la que se instanciar谩n los c贸digos. Desde este punto de vista, la tesis dise帽a, implementa y valida mecanismos para llevar a cabo una formalizaci贸n de la complejidad inherente a la programaci贸n paralela heterog茅nea, clasificando aquellas acciones que en la actualidad se llevan a cabo por parte del usuario en el proceso de desarrollo y optimizaci贸n de c贸digo, y proporcionando un marco de trabajo en el que dicha complejidad puede ser delegada, de forma eficiente y consistente, a un runtime...Fac. de Inform谩ticaTRUEunpu

    STAPL-RTS: A Runtime System for Massive Parallelism

    Get PDF
    Modern High Performance Computing (HPC) systems are complex, with deep memory hierarchies and increasing use of computational heterogeneity via accelerators. When developing applications for these platforms, programmers are faced with two bad choices. On one hand, they can explicitly manage machine resources, writing programs using low level primitives from multiple APIs (e.g., MPI+OpenMP), creating efficient but rigid, difficult to extend, and non-portable implementations. Alternatively, users can adopt higher level programming environments, often at the cost of lost performance. Our approach is to maintain the high level nature of the application without sacrificing performance by relying on the transfer of high level, application semantic knowledge between layers of the software stack at an appropriate level of abstraction and performing optimizations on a per-layer basis. In this dissertation, we present the STAPL Runtime System (STAPL-RTS), a runtime system built for portable performance, suitable for massively parallel machines. While the STAPL-RTS abstracts and virtualizes the underlying platform for portability, it uses information from the upper layers to perform the appropriate low level optimizations that restore the performance characteristics. We outline the fundamental ideas behind the design of the STAPL-RTS, such as the always distributed communication model and its asynchronous operations. Through appropriate code examples and benchmarks, we prove that high level information allows applications written on top of the STAPL-RTS to attain the performance of optimized, but ad hoc solutions. Using the STAPL library, we demonstrate how this information guides important decisions in the STAPL-RTS, such as multi-protocol communication coordination and request aggregation using established C++ programming idioms. Recognizing that nested parallelism is of increasing interest for both expressivity and performance, we present a parallel model that combines asynchronous, one-sided operations with isolated nested parallel sections. Previous approaches to nested parallelism targeted either static applications through the use of blocking, isolated sections, or dynamic applications by using asynchronous mechanisms (i.e., recursive task spawning) which come at the expense of isolation. We combine the flexibility of dynamic task creation with the isolation guarantees of the static models by allowing the creation of asynchronous, one-sided nested parallel sections that work in tandem with the more traditional, synchronous, collective nested parallelism. This allows selective, run-time customizable use of parallelism in an application, based on the input and the algorithm
    corecore