
    Tasking in accelerators: performance evaluation

    In this work, we analyze the implications and results of implementing dynamic parallelism, concurrent kernels and CUDA Graphs to solve task-oriented problems. As a benchmark, we propose three different methods for solving the DGEMM operation on tiled matrices, arguably the most popular benchmark for performance analysis. The algorithms we study differ significantly in terms of data dependencies, synchronization and granularity. The main contribution of this work is determining which of these approaches works best for running multiple tasks concurrently on a single GPU, as well as stating the main limitations and benefits of each technique. Using dynamic parallelism and CUDA streams we achieved speedups of up to 30%, and with the CUDA Graph API up to 25x acceleration, outperforming state-of-the-art results. This project has received funding from the EPEEC project from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 801051, from the Spanish Ministry of Economy and Competitiveness under the project Computación de Altas Prestaciones VII (TIN2015-65316-P), and from the Departament d’Innovació, Universitats i Empresa de la Generalitat de Catalunya under project MPEXPAR: Models de Programació i Entorns d’Execució Paral·lels (2014-SGR-1051). Finally, this project also received funding from the Spanish Ministry of Economy and Competitiveness under Juan de la Cierva grant agreement No. IJCI-2017-33511, and from the European Union’s Horizon 2020 research and innovation programme under Marie Skłodowska-Curie grant agreement No. 749516.
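    The CUDA Graph speedups come from recording many small kernel launches once and replaying them with a single launch call. The sketch below is not the paper's code: the tile_gemm kernel, its launch geometry and the iteration counts are placeholder assumptions, and it only illustrates the stream-capture route to building and replaying a graph.

```cpp
// Minimal sketch (assumptions noted above): capture a stream of tiled-GEMM
// kernel launches into a CUDA graph, then replay the whole task set with a
// single launch per iteration.
#include <cuda_runtime.h>

__global__ void tile_gemm(const double* A, const double* B, double* C, int n) {
    // placeholder body: one tile of C += A * B would go here
}

void run_tiled_dgemm(const double* A, const double* B, double* C,
                     int n, int tiles, int iterations) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaGraph_t graph;
    cudaGraphExec_t exec;

    // Record every tile kernel into the graph instead of launching eagerly.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int t = 0; t < tiles; ++t)
        tile_gemm<<<dim3(n / 32, n / 32), dim3(32, 32), 0, stream>>>(A, B, C, n);
    cudaStreamEndCapture(stream, &graph);

    // 5-argument form (CUDA 10/11); CUDA 12 uses (exec, graph, flags).
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);

    // Replaying the instantiated graph amortizes per-kernel launch overhead,
    // which is where large speedups for fine-grained tasks come from.
    for (int i = 0; i < iterations; ++i)
        cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
}
```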

    A generic learning multi-agent-system approach for spatio-temporal-, thermal- and energy-aware scheduling

    This paper proposes an agent-based approach to the scheduling of jobs in data centers under thermal constraints. The model encompasses both temporal and spatial aspects of temperature evolution in a unified way, taking into account the dynamics of heat production and dissipation. Agents coordinate to eventually move jobs to the most suitable place and to dynamically adapt the frequency settings of the nodes to the best combination. Several agent objectives are compared under different circumstances through an extensive set of experiments.
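    The abstract does not detail the agents' decision rule, so the following is a purely hypothetical sketch of one agent's step: score each (node, frequency) candidate under a thermal constraint, trading predicted temperature against predicted energy. The predictive models, the weight and the constraint value are invented stand-ins, not the paper's.

```cpp
// Hypothetical agent decision step (all models and constants are assumed).
#include <cstdio>
#include <limits>
#include <vector>

struct Candidate { int node; double freq_ghz; };

// Toy stand-ins for the predictive models a real agent would query:
double predicted_peak_temp(const Candidate& c) {
    return 45.0 + 12.0 * c.freq_ghz;               // hotter at higher clocks
}
double predicted_energy(const Candidate& c) {
    return 100.0 / c.freq_ghz + 30.0 * c.freq_ghz; // runtime vs. power trade-off
}

const double TEMP_LIMIT = 80.0;                    // assumed thermal constraint

// Pick the best feasible (node, frequency) pair; -1 means keep current placement.
int choose_placement(const std::vector<Candidate>& cands, double w_energy) {
    int best = -1;
    double best_score = std::numeric_limits<double>::max();
    for (size_t i = 0; i < cands.size(); ++i) {
        double temp = predicted_peak_temp(cands[i]);
        if (temp > TEMP_LIMIT) continue;           // violates thermal constraint
        // Weighted objective: temperature headroom vs. energy cost.
        double score = temp + w_energy * predicted_energy(cands[i]);
        if (score < best_score) { best_score = score; best = (int)i; }
    }
    return best;
}

int main() {
    std::vector<Candidate> cands = {{0, 1.2}, {0, 2.4}, {1, 3.0}};
    std::printf("chosen candidate: %d\n", choose_placement(cands, 0.5));
}
```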

    Partial aggregation for collective communication in distributed memory machines

    High Performance Computing (HPC) systems interconnect a large number of Processing Elements (PEs) in high-bandwidth networks to simulate complex scientific problems. The increasing scale of HPC systems poses great challenges for algorithm designers. As the average distance between PEs increases, data movement across hierarchical memory subsystems introduces high latency. Minimizing latency is particularly challenging in collective communications, where many PEs may interact in complex communication patterns. Although collective communications can be optimized for network-level parallelism, occasional synchronization delays due to dependencies in the communication pattern degrade application performance. To reduce the performance impact of communication and synchronization costs, parallel algorithms are designed with sophisticated latency-hiding techniques. The principle is to interleave computation with asynchronous communication, which increases the overall occupancy of compute cores. However, collective communication primitives abstract away parallelism, which limits the integration of latency-hiding techniques. Approaches to work around these limitations either modify the algorithmic structure of application codes or replace collective primitives with verbose low-level communication calls. While these approaches give fine-grained control for latency hiding, implementing collective communication algorithms is challenging and requires expert knowledge of HPC network topologies. A collective communication pattern is commonly described as a Directed Acyclic Graph (DAG) in which a set of PEs, represented as vertices, resolve data dependencies through communication along the edges. Our approach improves latency hiding in collective communication through partial aggregation. Based on the mathematical rules of binary operations and homomorphisms, we expose data parallelism in the respective DAG to overlap computation with communication. The proposed concepts are implemented and evaluated with a subset of collective primitives in the Message Passing Interface (MPI), an established communication standard in scientific computing. An experimental analysis with communication-bound microbenchmarks shows considerable performance benefits for the evaluated collective primitives. A detailed case study with a large-scale distributed sort algorithm demonstrates how partial aggregation significantly improves performance in data-intensive scenarios. Besides better latency-hiding capabilities with collective communication primitives, our approach enables further optimizations of their implementations within MPI libraries. The many asynchronous programming models actively studied in the HPC community can benefit from partial aggregation in collective communication patterns. Future work can utilize partial aggregation to improve the interaction of MPI collectives with accelerator architectures and to design more efficient communication algorithms.
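    As a rough illustration of the underlying idea (not the thesis's implementation): because the reduction operator is associative, one large reduce can be split into chunks whose aggregation overlaps with the computation that produces the next chunk. The chunk count and placeholder workload below are arbitrary assumptions.

```cpp
// Minimal sketch: chunked non-blocking reductions overlap communication
// with the computation of later chunks; rank 0 collects the results.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int chunks = 8, chunk_len = 1 << 16;
    std::vector<double> send(chunks * chunk_len), recv(chunks * chunk_len);
    std::vector<MPI_Request> reqs(chunks);

    for (int c = 0; c < chunks; ++c) {
        // Compute this chunk's contribution (placeholder workload) ...
        for (int i = 0; i < chunk_len; ++i)
            send[c * chunk_len + i] = rank + c + i * 1e-6;
        // ... then start reducing it immediately: the network aggregates
        // chunk c while the loop computes chunk c+1.
        MPI_Ireduce(&send[c * chunk_len], &recv[c * chunk_len], chunk_len,
                    MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD, &reqs[c]);
    }
    MPI_Waitall(chunks, reqs.data(), MPI_STATUSES_IGNORE);

    MPI_Finalize();
    return 0;
}
```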

    ReStore: In-Memory REplicated STORagE for Rapid Recovery in Fault-Tolerant Algorithms

    Fault-tolerant distributed applications require mechanisms to recover data lost through a process failure. On modern cluster systems it is typically impractical to request replacement resources after such a failure, so applications have to continue working with the remaining resources. This requires redistributing the workload and having the non-failed processes reload data. We present an algorithmic framework and its C++ library implementation, ReStore, for MPI programs that enables recovery of data after process failures. By storing all required data in memory via an appropriate data distribution and replication, recovery is substantially faster than with standard checkpointing schemes that rely on a parallel file system. As the application developer can specify which data to load, we also support shrinking recovery instead of recovery using spare compute nodes. We evaluate ReStore in both controlled, isolated environments and real applications. Our experiments show loading times for lost input data in the range of milliseconds on up to 24576 processors and a substantial speedup of the recovery time for the fault-tolerant version of a widely used bioinformatics application.
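    A minimal sketch of the replication idea, not ReStore's actual API or data distribution: if each rank keeps an in-memory copy of its ring neighbour's block, any single process failure leaves a surviving replica to recover from without touching the parallel file system.

```cpp
// Sketch: each rank replicates its block to the right neighbour, so the
// data of a failed rank f survives in memory on rank (f+1) % size.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 20;
    std::vector<double> block(n, rank);   // this rank's own data
    std::vector<double> replica(n);       // copy of the left neighbour's data

    int right = (rank + 1) % size, left = (rank - 1 + size) % size;
    // Send my block right, receive my left neighbour's block as its replica.
    MPI_Sendrecv(block.data(), n, MPI_DOUBLE, right, 0,
                 replica.data(), n, MPI_DOUBLE, left, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    // After a failure of rank f, its survivor (f+1) % size serves `replica`
    // to the shrunk communicator instead of reloading from disk.

    MPI_Finalize();
    return 0;
}
```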

    Deep Learning at Scale with Nearest Neighbours Communications

    As deep learning techniques become more and more popular, there is a need to move these applications from the data scientist’s Jupyter notebook to efficient and reliable enterprise solutions. Moreover, distributed training of deep learning models will increasingly happen outside the well-known borders of cloud and HPC infrastructure and will move to edge and mobile platforms. Current techniques for distributed deep learning have drawbacks in both these scenarios, limiting their long-term applicability. After a critical review of the established techniques for data-parallel training from both a distributed computing and a deep learning perspective, a novel approach based on nearest-neighbour communications is presented to overcome some of the issues of mainstream approaches, such as global communication patterns. Moreover, to validate the proposed strategy, the Flexible Asynchronous Scalable Training (FAST) framework is introduced, which allows the nearest-neighbours communication approach to be applied to a deep learning framework of choice. Finally, a relevant use case is deployed on a medium-scale infrastructure to demonstrate both the framework and the methodology presented. Training convergence and scalability results are presented and discussed in comparison to a baseline defined using state-of-the-art distributed training tools provided by a well-known deep learning framework.
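    A hedged sketch of the nearest-neighbours idea, not the FAST framework's code: each worker exchanges and averages its parameter buffer only with its two ring neighbours per step, so no global collective or synchronization point is needed; updates diffuse through the ring over successive steps.

```cpp
// Sketch: ring-neighbour parameter averaging in place of a global all-reduce.
#include <mpi.h>
#include <vector>

void neighbour_average(std::vector<float>& params, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int left  = (rank - 1 + size) % size;
    int right = (rank + 1) % size;
    int n = (int)params.size();

    std::vector<float> from_left(n), from_right(n);
    // Exchange with both ring neighbours; no collective over all ranks.
    MPI_Sendrecv(params.data(), n, MPI_FLOAT, right, 0,
                 from_left.data(), n, MPI_FLOAT, left, 0,
                 comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(params.data(), n, MPI_FLOAT, left, 1,
                 from_right.data(), n, MPI_FLOAT, right, 1,
                 comm, MPI_STATUS_IGNORE);

    // Local diffusion step: average own parameters with the neighbours'.
    for (int i = 0; i < n; ++i)
        params[i] = (params[i] + from_left[i] + from_right[i]) / 3.0f;
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    std::vector<float> params(1 << 20, 1.0f);       // stand-in for model weights
    for (int step = 0; step < 10; ++step)
        neighbour_average(params, MPI_COMM_WORLD);  // diffusion, not all-reduce
    MPI_Finalize();
    return 0;
}
```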

    Teaching and Research Project, Original Research Work and Defence Presentation, prepared by Germán Moltó for the competition for the position of Full Professor (Catedrático de Universidad), competition 082/22, position 6708, area of Computer Science and Artificial Intelligence

    This document contains the teaching and research project of the candidate Germán Moltó Martínez, submitted as a requirement for the competitive examination for access to permanent university faculty positions. Specifically, the document concerns the competition for position 6708 of Full Professor in the area of Computer Science in the Department of Sistemas Informáticos y Computación of the Universitat Politècnica de València. The position is attached to the Escola Tècnica Superior d'Enginyeria Informàtica, and its teaching profile covers the courses "Infraestructuras de Cloud Público" (Public Cloud Infrastructures) and "Estructuras de Datos y Algoritmos" (Data Structures and Algorithms). The Academic, Teaching and Research Record is also included, as well as the slides used during the defence. Germán Moltó Martínez (2022). Proyecto Docente e Investigador, Trabajo Original de Investigación y Presentación de la Defensa, preparado por Germán Moltó para concursar a la plaza de Catedrático de Universidad, concurso 082/22, plaza 6708, área de Ciencia de la Computación e Inteligencia Artificial. http://hdl.handle.net/10251/18903

    Processing social media text for the quantamental analyses of cryptoasset time series

    Get PDF
    This thesis analyses social media text to identify which events and concerns are associated with changes between phases of rising and falling cryptoasset prices. A new cryptoasset classification system, based on token functionality, highlights Bitcoin as the largest example of a 'crypto-transaction' system and Ethereum as the largest example of a 'crypto-fuel' system. The price of ether is only weakly correlated with that of bitcoin (Spearman's rho 0.3849). Both bitcoin and ether show distinct phases of rising or falling prices, and each has a large, dedicated social media forum on Reddit. A process is developed to extract the events and concerns discussed on social media that are associated with these different phases of price movement. This innovative data-driven approach circumvents the need to pre-judge social media metrics. First, a new, non-parametric Data-Driven Phasic Word Identification methodology is developed to find words associated with the phase of declining bitcoin prices in 2017-18. This approach is then extended to find the context of these words, from which topics are inferred. Next, neural networks (word2vec) are applied to evolve the analysis from extracting words to extracting topics. Finally, this work enables the development of a framework for identifying which events and concerns are plausible causes of changes between different phases in the ether and bitcoin price series. Consistent with Bitcoin providing a form of money and Ethereum providing a platform for developing applications, the results show the one-off effect of regulatory bans on the bitcoin price and the recurring effect of rival innovations on the ether price. The results also suggest the influence of technical traders, captured through market-price discourse, on both cryptoassets. This thesis demonstrates the value of a quantamental approach to the analysis of cryptoasset prices.
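    The correlation figure cited above is Spearman's rho, i.e. the Pearson correlation of the ranks of the two price series. A self-contained sketch of that computation follows, with average-rank tie handling; the price values in main are illustrative placeholders, not the thesis's dataset.

```cpp
// Sketch: Spearman's rho as the Pearson correlation of ranks.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <numeric>
#include <vector>

// Ranks of a series, 1-based, with average ranks for ties.
std::vector<double> ranks(const std::vector<double>& x) {
    std::vector<size_t> idx(x.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::sort(idx.begin(), idx.end(),
              [&](size_t a, size_t b) { return x[a] < x[b]; });
    std::vector<double> r(x.size());
    for (size_t i = 0; i < idx.size(); ) {
        size_t j = i;
        while (j + 1 < idx.size() && x[idx[j + 1]] == x[idx[i]]) ++j;
        double avg = (i + j) / 2.0 + 1.0;          // average rank of the tie group
        for (size_t k = i; k <= j; ++k) r[idx[k]] = avg;
        i = j + 1;
    }
    return r;
}

double spearman_rho(const std::vector<double>& x, const std::vector<double>& y) {
    std::vector<double> rx = ranks(x), ry = ranks(y);
    double n = (double)rx.size();
    double mx = std::accumulate(rx.begin(), rx.end(), 0.0) / n;
    double my = std::accumulate(ry.begin(), ry.end(), 0.0) / n;
    double num = 0, dx = 0, dy = 0;
    for (size_t i = 0; i < rx.size(); ++i) {
        num += (rx[i] - mx) * (ry[i] - my);
        dx  += (rx[i] - mx) * (rx[i] - mx);
        dy  += (ry[i] - my) * (ry[i] - my);
    }
    return num / std::sqrt(dx * dy);
}

int main() {
    // Placeholder series standing in for bitcoin and ether prices.
    std::vector<double> btc = {9000, 8500, 8700, 8100, 7900};
    std::vector<double> eth = {230, 210, 225, 205, 195};
    std::printf("Spearman's rho = %.4f\n", spearman_rho(btc, eth));
}
```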