    Asynchronous Execution of Python Code on Task Based Runtime Systems

    Despite advances in parallel and distributed computing, the complexity of programming High Performance Computing (HPC) resources has deterred many domain experts, especially in machine learning and artificial intelligence (AI), from exploiting the performance benefits of such systems. Researchers and scientists favor high-productivity languages to avoid the inconvenience of programming in low-level languages and the cost of acquiring the skills required at that level. In recent years Python, supported by linear algebra libraries such as NumPy, has gained popularity, despite limitations that prevent such code from running in a distributed fashion. Here we present a solution that retains high-level programming abstractions while providing parallel and distributed efficiency. Phylanx is an asynchronous array processing toolkit that transforms Python and NumPy operations into code that can be executed in parallel on HPC resources, by mapping Python and NumPy functions and variables into a dependency tree executed by HPX, a general-purpose, parallel, task-based runtime system written in C++. Phylanx additionally provides introspection and visualization capabilities for debugging and performance analysis. We have tested the foundations of our approach by comparing our implementations of widely used machine learning algorithms against their accepted NumPy counterparts.
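    To picture the programming model described above, here is a minimal sketch of handing a NumPy-style function to Phylanx for dependency-tree execution. The @Phylanx decorator and import path follow the project's published examples, but treat the exact API, and the function body chosen here, as assumptions that may differ between versions.

```python
# Minimal sketch (assumed API): decorating a NumPy-style function so that
# Phylanx can rebuild it as an HPX dependency tree instead of executing it
# eagerly in the Python interpreter.
import numpy as np
from phylanx import Phylanx  # assumed import path


@Phylanx  # transforms the function body into a Phylanx expression tree
def lra_step(x, y, w, alpha):
    # One gradient-descent step of logistic regression; each NumPy
    # operation below becomes a node in the tree executed by HPX.
    pred = 1.0 / (1.0 + np.exp(-np.dot(x, w)))
    grad = np.dot(np.transpose(x), pred - y)
    return w - alpha * grad


w = lra_step(np.random.rand(100, 10), np.random.rand(100), np.zeros(10), 0.1)
```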

    Distributed Matrix Tiling using a Hypergraph Labeling Formulation

    Partitioning large matrices is an important problem in distributed linear algebra, with applications in machine learning among other fields. Briefly, the goal is to perform a sequence of matrix algebra operations on these large matrices in a distributed manner. However, not every partitioning scheme works well with every matrix operation and its implementation (algorithm); this is a data tiling problem. In this paper we formulate the data tiling problem using hypergraphs. We prove hardness results and give a theoretical characterization of the problem's complexity on random instances. Additionally, we develop a greedy algorithm and experimentally demonstrate its efficacy.
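    To make the hypergraph formulation concrete, the sketch below implements a generic greedy labeling heuristic: tiles are vertices, each operation is a hyperedge over the tiles it touches, and each vertex receives a processor label. The cost model (distinct labels per hyperedge as a proxy for communication), the balance constraint, and the greedy order are all illustrative assumptions, not the algorithm from the paper.

```python
# Illustrative greedy labeling for a tiling hypergraph (not the paper's
# algorithm). Vertices are matrix tiles, hyperedges are operations over
# the tiles they touch, labels are processors.
from collections import defaultdict


def greedy_label(num_tiles, hyperedges, num_procs):
    """hyperedges: list of sets of tile indices touched by one operation."""
    edges_of = defaultdict(list)            # tile -> hyperedges containing it
    for e in hyperedges:
        for v in e:
            edges_of[v].append(e)

    cap = -(-num_tiles // num_procs)        # ceil division: balance constraint
    load = defaultdict(int)
    labels = {}
    for v in range(num_tiles):              # assumed fixed greedy order
        best_lbl, best_cost = None, float("inf")
        for lbl in range(num_procs):
            if load[lbl] >= cap:            # keep processor loads balanced
                continue
            # Cost: for each incident hyperedge, count the extra labels it
            # would span if v were given lbl (unlabeled tiles are ignored).
            cost = 0
            for e in edges_of[v]:
                seen = {labels[u] for u in e if u in labels} | {lbl}
                cost += len(seen) - 1       # extra labels => extra traffic
            if cost < best_cost:
                best_lbl, best_cost = lbl, cost
        labels[v] = best_lbl
        load[best_lbl] += 1
    return labels


# Example: 4 tiles, two operations, 2 processors -> {0: 0, 1: 0, 2: 1, 3: 1}.
print(greedy_label(4, [{0, 1}, {1, 2, 3}], 2))
```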

    HeAT – a Distributed and GPU-accelerated Tensor Framework for Data Analytics

    In order to cope with the exponential growth in available data, the efficiency of data analysis and machine learning libraries has recently received increased attention. Although the corresponding array-based numerical kernels have improved significantly, most are limited by the resources available on a single computational node. Consequently, kernels must exploit distributed resources, e.g., distributed memory architectures. To this end, we introduce HeAT, an array-based numerical programming framework for large-scale parallel processing with an easy-to-use NumPy-like API. HeAT utilizes PyTorch as a node-local eager execution engine and distributes the workload via MPI on arbitrarily large high-performance computing systems. It provides low-level array-based computations as well as assorted higher-level algorithms. With HeAT, a NumPy user can take advantage of all of their available resources, significantly lowering the barrier to distributed data analysis. Compared with applications written in similar frameworks, HeAT achieves speedups of up to two orders of magnitude.
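    The NumPy-like API mentioned above can be illustrated with a short script. The split keyword, naming the axis along which an array is partitioned across MPI processes, follows HeAT's published interface, but the exact calls here are a hedged sketch and may vary by version.

```python
# Minimal HeAT sketch: distributed arrays with a NumPy-like API. Run with,
# e.g., `mpirun -np 4 python heat_demo.py`. Exact function availability
# may vary by HeAT version.
import heat as ht

# 10000x1000 matrix, row-distributed across all MPI ranks (split=0).
x = ht.random.randn(10000, 1000, split=0)

# Reductions and element-wise operations look like NumPy but execute on
# node-local PyTorch tensors, communicating via MPI only where required.
col_means = ht.mean(x, axis=0)
centered = x - col_means

# A higher-level algorithm from the library, operating on distributed data.
kmeans = ht.cluster.KMeans(n_clusters=4)
kmeans.fit(centered)
labels = kmeans.predict(centered)
```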

    Adaptive Data Migration in Load-Imbalanced HPC Applications

    Distributed parallel applications need to maximize and maintain computer resource utilization while remaining portable across different machines. Balancing the execution of some applications requires more effort than for others, because their data distribution changes over time. Redistributing data at runtime requires elaborate schemes that are expensive and may only benefit particular applications. This dissertation presents a solution by which HPX applications monitor their execution with APEX and use AGAS migration to adaptively redistribute data and load-balance at runtime, improving application performance and scaling behavior. The dissertation provides evidence for the practicality of the Active Global Address Space, as proposed by the ParalleX model and implemented in HPX: it uses migration to move objects transparently at runtime, guided by the Autonomic Performance Environment for eXascale (APEX) library, with experiments run on homogeneous and heterogeneous machines at Louisiana State University, the CSCS Swiss National Supercomputing Centre, and the National Energy Research Scientific Computing Center.
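    The adaptive scheme described above, measure load and then migrate objects away from overloaded localities, can be sketched generically. The Python below is a toy rebalancer illustrating the policy only; none of its names correspond to actual HPX, AGAS, or APEX APIs.

```python
# Toy sketch of an adaptive-migration policy (not HPX/AGAS/APEX API):
# sample per-locality load and move objects from the most loaded locality
# to the least loaded one until the imbalance drops below a threshold.
# A real system would use performance-counter data (APEX) and transparent
# object migration (AGAS) instead of this dictionary shuffle.

def rebalance(objects_on, cost_of, tolerance=0.10):
    """objects_on: locality -> list of object ids; cost_of: id -> load."""
    def load(loc):
        return sum(cost_of[o] for o in objects_on[loc])

    while True:
        hot = max(objects_on, key=load)
        cold = min(objects_on, key=load)
        imbalance = load(hot) - load(cold)
        if load(hot) == 0 or imbalance / load(hot) <= tolerance:
            return objects_on
        # Migrate the cheapest object that strictly reduces the imbalance.
        candidates = [o for o in objects_on[hot] if cost_of[o] < imbalance]
        if not candidates:
            return objects_on
        obj = min(candidates, key=cost_of.get)
        objects_on[hot].remove(obj)    # in HPX this move would be a
        objects_on[cold].append(obj)   # transparent AGAS migration


print(rebalance({0: ["a", "b", "c"], 1: ["d"]},
                {"a": 5, "b": 3, "c": 2, "d": 1}))
```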

    Optimizing the Performance of Multi-threaded Linear Algebra Libraries Based on Task Granularity

    Linear algebra libraries play a very important role in many HPC applications. As ever larger datasets are created every day, it becomes crucial for multi-threaded linear algebra libraries to utilize compute resources properly. Moving toward exascale computing, current programming models cannot fully take advantage of advances in memory hierarchies, computer architectures, and networks. Asynchronous Many-Task (AMT) runtime systems help developers manage the available parallelism. In this dissertation we propose an adaptive solution to improve the performance of a linear algebra library based on a set of compile-time and runtime characteristics, including the machine architecture, the expression being evaluated, the number of cores the application runs on, the type of the operation, and the size of the matrices, in order to get as close as possible to the highest performance. Our focus is on machine learning applications, where we potentially deal with very large matrices that make creating temporaries very expensive. For this purpose we selected the Blaze C++ library, a high-performance template-based math library that gives us access to the expression tree at compile time, along with HPX, a C++ standard library for concurrency and parallelism, as our runtime system. HPX, as an AMT runtime system, offers scalability and fine-grained parallelism by creating lightweight threads with fast context switching. Finding the optimum task granularity is a challenge in AMTs: creating too many small tasks degrades performance through scheduling overhead, while creating too few tasks under-utilizes the resources.
    Our work focuses on finding the optimum task granularity for each specific problem. We tried two different approaches to model the relationship between performance and grain size, in order to find the range of grain sizes that leads to maximum performance. First, we used polynomial functions to model how throughput changes in terms of grain size and number of cores. Although this method succeeded in finding the range of grain sizes for maximum throughput, it had no physical interpretation. This motivated us to go deeper and develop an analytical model of execution time in terms of grain size for balanced parallel for-loops. Based on the analytical model, we propose a method to predict the range of grain sizes that minimizes execution time. Moreover, since the parameters of the proposed model depend only on the system architecture, we suggest using a parallel for-loop benchmark to find these parameters on a system once, and then using them to find the grain-size range for minimum execution time for arbitrary balanced parallel for-loop applications run on the same machine. With these models in hand, we changed the current implementation of the HPX backend for Blaze by adding two parameters that represent the unit of work and the number of units included in each task, giving fine-grained control over the parallelism through the HPX runtime system. A complexity estimation function has also been added to Blaze to estimate the number of floating-point operations in each unit of work. The model parameters estimated through the parallel for-loop benchmark can also be plugged into Blaze at compile time, in order to find the optimum range of grain size at runtime based on the matrix sizes and the complexity of the operations. In the next step, we used the identified grain-size range to extend the previous implementation of splittable tasks, an algorithm for controlling task granularity. We modified that implementation by scheduling tasks on idle cores directly instead of waiting for them to be stolen, and by integrating the lower bound of the analytical model as the threshold to stop splitting, so that the threshold adapts to the system architecture and the application being executed.
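    To make the grain-size trade-off concrete, a simple back-of-the-envelope model (an illustration only, not the dissertation's analytical model) writes the execution time of a balanced loop as follows.

```latex
% Toy cost model (illustrative assumption): N iterations of work t_w each,
% tasks of g iterations, per-task overhead t_o, P cores; the ceil(N/g)
% tasks execute in ceil(ceil(N/g)/P) rounds.
T(g) \approx \left\lceil \frac{\lceil N/g \rceil}{P} \right\rceil
             \left( g\, t_w + t_o \right)
```

    Small g inflates the total overhead, roughly (N/g) t_o, while large g produces too few tasks and leaves cores idle; the minimizing range of g depends only on the overhead-to-work ratio t_o/t_w and the core count P, which is consistent with the abstract's observation that the model parameters are properties of the system and can be measured once with a benchmark.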

    Managing Overheads in Asynchronous Many-Task Runtime Systems

    Asynchronous Many-Task (AMT) runtime systems are based on the idea of dividing an algorithm into small units of work, known as tasks. The runtime system is then responsible for scheduling and executing these tasks efficiently, taking into account the resources provided to it and the data dependencies between the tasks. One of the primary challenges faced by AMTs is managing this fine-grained parallelism and the overheads associated with creating, scheduling, and executing tasks. This work develops methodologies for assessing and managing the overheads of fine-grained task execution in HPX, our exemplar Asynchronous Many-Task runtime system. Known optimization techniques, viz. active message coalescing, task inlining, and parallel loop iteration chunking, are applied to HPX. Active message coalescing, where messages bound for the same destination are aggregated into a single message, is presented as a solution for minimizing the overheads of fine-grained communication. Methodologies and metrics for analyzing fine-grained communication overheads are developed. The metrics identified and implemented in this research aid in evaluating network efficiency by giving an intrinsic view of the underlying network overhead that would be difficult to obtain with conventional methods. Task inlining, a method that lets the runtime system manage the overheads introduced by a large number of tasks by merging tasks into one thread of execution, is presented as a technique for minimizing fine-grained task overheads. A runtime policy that dynamically decides whether to inline a task is developed and evaluated on different processor architectures. A methodology is developed to derive a largely machine-independent constant that allows task granularity to be controlled. Finally, the machine-independent constant derived in the context of task inlining is applied to the chunking of parallel loop iterations, confirming its applicability to reducing overheads, in this case by finding the optimal chunk size of the combined loop iterations.
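    Active message coalescing, as described above, is easy to picture with a small sketch: buffer outgoing messages per destination and flush a destination's buffer as one combined message once a size or age threshold is hit. The class below is a generic Python illustration with invented names; HPX's actual coalescing lives inside its C++ parcel layer.

```python
# Generic sketch of active message coalescing (invented API, not HPX's
# parcel layer): messages to the same destination are buffered and sent
# as one aggregated message when the buffer is full or too old.
import time


class CoalescingSender:
    def __init__(self, transport_send, max_batch=64, max_delay=0.001):
        self.transport_send = transport_send  # callable(dest, list_of_msgs)
        self.max_batch = max_batch            # flush after this many messages
        self.max_delay = max_delay            # ...or after this many seconds
        self.buffers = {}                     # dest -> (first_enqueue_time, msgs)

    def send(self, dest, msg):
        first, msgs = self.buffers.setdefault(dest, (time.monotonic(), []))
        msgs.append(msg)
        if len(msgs) >= self.max_batch:
            self.flush(dest)

    def poll(self):
        # Call periodically: flush buffers whose oldest message is too old,
        # trading a bounded latency increase for fewer, larger messages.
        now = time.monotonic()
        for dest in list(self.buffers):
            first, _ = self.buffers[dest]
            if now - first >= self.max_delay:
                self.flush(dest)

    def flush(self, dest):
        _, msgs = self.buffers.pop(dest, (None, []))
        if msgs:
            self.transport_send(dest, msgs)  # one network message, many payloads


sender = CoalescingSender(lambda d, m: print(f"to {d}: {m}"), max_batch=3)
for i in range(7):
    sender.send("node1", i)   # prints two batches of 3
sender.poll()                 # may not flush yet (messages too recent)
sender.flush("node1")         # force out the remaining message
```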

    Machine Learning in Image Analysis and Pattern Recognition

    This book charts the progress in applying machine learning, including deep learning, to a broad range of image analysis and pattern recognition problems and applications. In this book, we have assembled original research articles that make unique contributions to the theory, methodology, and applications of machine learning in image analysis and pattern recognition.