40 research outputs found

    Optimization of SpGEMM with RISC-V vector instructions

    The Sparse GEneral Matrix-Matrix multiplication (SpGEMM) C = A × B is a fundamental routine used extensively in domains such as machine learning and graph analytics. Despite its relevance, the efficient execution of SpGEMM on vector architectures is a relatively unexplored topic. The most recent algorithm to run SpGEMM on these architectures is based on the SParse Accumulator (SPA) approach; it computes the columns of C one by one and is relatively efficient for sparse matrices featuring several tens of non-zero coefficients per column. However, when dealing with matrices containing just a few non-zero coefficients per column, this state-of-the-art algorithm cannot fully exploit long vector architectures when computing the SpGEMM kernel. To overcome this issue we propose the SPA paRallel with Sorting (SPARS) algorithm, which, among other optimizations, computes several columns of C in parallel, and the HASH algorithm, which uses dynamically sized hash tables to store intermediate output values. To combine the efficiency of SPA on relatively dense matrix blocks with the high performance that SPARS and HASH deliver on very sparse matrix blocks, we propose H-SPA(t) and H-HASH(t), which dynamically switch between the different algorithms. Over a set of 40 sparse matrices from the SuiteSparse Matrix Collection, H-SPA(t) and H-HASH(t) obtain 1.24× and 1.57× average speed-ups over SPA, respectively. On the 22 sparsest matrices, H-SPA(t) and H-HASH(t) deliver 1.42× and 1.99× average speed-ups, respectively.
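    The hash-accumulator idea can be illustrated with a plain scalar sketch, here operating on rows (equivalently, columns in a column-major formulation). This shows only the generic hash-based accumulation technique, not the paper's SPARS/HASH algorithms or their vectorized RISC-V implementation:

```python
from collections import defaultdict

def spgemm_hash(A_rows, B_rows):
    """Sparse C = A * B with one hash accumulator per output row.

    A_rows / B_rows: list of sparse rows, each a list of (col, value)
    pairs. A scalar sketch of the hash-accumulator idea only.
    """
    C_rows = []
    for a_row in A_rows:
        acc = defaultdict(float)          # hash table of partial sums
        for (k, a_val) in a_row:          # each non-zero A[i][k] ...
            for (j, b_val) in B_rows[k]:  # ... scales row k of B
                acc[j] += a_val * b_val
        C_rows.append(sorted(acc.items()))
    return C_rows
```

    Each output row gets its own accumulator; the dynamically sized hash tables of the HASH algorithm play the role of `acc` here, avoiding the dense SPA buffer when rows are very sparse.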

    Approximating a Multi-Grid Solver

    Multi-grid methods are numerical algorithms used in parallel and distributed processing. The main idea of multi-grid solvers is to speed up the convergence of an iterative method by repeatedly reducing the problem to a coarser grid. Multi-grid methods are widely exploited in many application domains, so it is important to improve their performance and energy efficiency. This paper aims to reach this objective based on the following observation: given that the intermediate steps do not require full accuracy, it is possible to save time and energy by reducing precision during some steps while keeping the final result within the targeted accuracy. To achieve this goal, we first introduce a cycle shape different from the classic V-cycle used in multi-grid solvers. Then, we propose to dynamically change the floating-point precision at runtime according to the accuracy needed at each intermediate step. Our evaluation of a state-of-the-art multi-grid solver implementation demonstrates that it is possible to trade temporary precision for time to completion without hurting the quality of the final result. In particular, we reach the same accuracy as with full double precision while improving execution time by 15% to 30%. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 708566 (DURO). The European Commission is not liable for any use that might be made of the information contained therein. This work has been supported by the Spanish Government (Severo Ochoa grant SEV2015-0493).
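    The precision-trading idea can be mimicked on a toy scale with a stationary solver that iterates in float32 before refining in float64. This is a minimal sketch using plain Jacobi iteration rather than a multi-grid cycle, and `switch_tol` is an illustrative threshold, not a value from the paper:

```python
import numpy as np

def jacobi_mixed(A, b, tol=1e-12, switch_tol=1e-4, max_iter=10000):
    """Jacobi iteration in float32 until the residual drops below
    switch_tol, then in float64 until it drops below tol.

    Toy illustration of "reduced precision for intermediate steps,
    full precision for the final result" -- the paper applies this
    inside a multi-grid cycle, not to plain Jacobi.
    """
    A32, b32 = A.astype(np.float32), b.astype(np.float32)
    x = np.zeros_like(b32)
    D32 = np.diag(A32)
    R32 = A32 - np.diag(D32)
    for _ in range(max_iter):                 # cheap float32 phase
        x = (b32 - R32 @ x) / D32
        if np.linalg.norm(b32 - A32 @ x) < switch_tol:
            break
    x = x.astype(np.float64)                  # refine in float64
    D = np.diag(A)
    R = A - np.diag(D)
    for _ in range(max_iter):
        x = (b - R @ x) / D
        if np.linalg.norm(b - A @ x) < tol:
            break
    return x
```

    The float32 phase does the bulk of the iterations at half the memory traffic; the float64 phase only closes the gap between the two tolerances.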

    Revisiting credit distribution algorithms for distributed termination detection

    This paper revisits distributed termination detection algorithms in the context of High-Performance Computing (HPC) applications. We introduce an efficient variant of the Credit Distribution Algorithm (CDA) and compare it to the original algorithm (HCDA) as well as to its two primary competitors: the Four Counters algorithm (4C) and the Efficient Delay-Optimal Distributed algorithm (EDOD). We analyze the behavior of each algorithm on some simplified task-based kernels and show the superiority of CDA in terms of the number of control messages.
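    A toy model of the generic credit-distribution idea (not the paper's HCDA variant or its optimized CDA): the controller starts with credit 1, a spawning task gives its child half of its credit, a finishing task returns what it holds, and termination is detected exactly when the returned credit sums back to 1. Exact `Fraction` credits sidestep the floating-point underflow a real implementation must manage:

```python
from fractions import Fraction

class CreditController:
    """Controller side of a toy Credit Distribution Algorithm."""

    def __init__(self):
        self.returned = Fraction(0)

    def root_credit(self):
        # The initial task carries the whole credit of 1.
        return Fraction(1)

    def on_task_done(self, credit):
        # A task returns its remaining credit when it finishes;
        # termination is detected when the total reaches 1.
        self.returned += credit
        return self.returned == 1

def spawn(parent_credit):
    """Split a task's credit with a newly spawned child."""
    half = parent_credit / 2
    return half, parent_credit - half  # (child's credit, parent's rest)
```

    No counting rounds or probes are needed: the single invariant "credits in flight + credits returned = 1" makes detection exact, which is why the abstract measures the algorithms by control-message count alone.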

    Comparing Distributed Termination Detection Algorithms for Task-Based Runtime Systems on HPC platforms

    This paper revisits distributed termination detection algorithms in the context of High-Performance Computing (HPC) applications. We introduce an efficient variant of the Credit Distribution Algorithm (CDA) and compare it to the original algorithm (HCDA) as well as to its two primary competitors: the Four Counters algorithm (4C) and the Efficient Delay-Optimal Distributed algorithm (EDOD). We analyze the behavior of each algorithm on some simplified task-based kernels and show the superiority of CDA in terms of the number of control messages. We then compare the implementations of these algorithms on a task-based runtime system, PaRSEC, and show the advantages and limitations of each approach in a practical implementation.

    Ordonnancement avec tolérance aux pannes pour des tâches parallèles à nombre de processeurs programmable

    We study the resilient scheduling of moldable parallel jobs on high-performance computing (HPC) platforms. Moldable jobs allow for choosing a processor allocation before execution, and their execution time obeys various speedup models. The objective is to minimize the overall completion time of the jobs, or the makespan, when jobs can fail due to silent errors and hence may need to be re-executed after each failure until successful completion. Our work generalizes the classical scheduling framework for failure-free jobs. To cope with silent errors, we introduce two resilient scheduling algorithms, LPA-List and Batch-List, both of which use the List strategy to schedule the jobs. Without knowing a priori how many times each job will fail, LPA-List relies on a local strategy to allocate processors to the jobs, while Batch-List schedules the jobs in batches and allows only a restricted number of failures per job in each batch. We prove new approximation ratios for the two algorithms under several prominent speedup models (e.g., roofline, communication, Amdahl, power, monotonic, and a mixed model). An extensive set of simulations is conducted to evaluate different variants of the two algorithms, and the results show that they consistently outperform some baseline heuristics. Overall, our best algorithm is within a factor of 1.6 of a lower bound on average over the entire set of experiments, and within a factor of 4.2 in the worst case.
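    The speedup models named above have standard closed forms, e.g. roofline t(p) = w / min(p, p̄) and Amdahl t(p) = w(s + (1−s)/p). A minimal sketch of a local allocation choice under these models follows; the parameters `s` and `pmax` are illustrative, not values from the paper, and this is not the LPA-List allocation strategy itself:

```python
def exec_time(w, p, model="amdahl", s=0.1, pmax=8):
    """Execution time of a moldable job of work w on p processors
    under two standard speedup models."""
    if model == "roofline":
        # Perfect speedup up to a bounded degree of parallelism pmax.
        return w / min(p, pmax)
    if model == "amdahl":
        # A serial fraction s limits the achievable speedup.
        return w * (s + (1 - s) / p)
    raise ValueError(model)

def best_allocation(w, P, model="amdahl", **kw):
    """Smallest processor count (out of P) minimizing execution time."""
    return min(range(1, P + 1),
               key=lambda p: (exec_time(w, p, model, **kw), p))
```

    Under roofline, allocating more than `pmax` processors wastes resources without reducing the job's execution time, which is exactly the kind of trade-off a local allocation strategy must weigh against the makespan of the whole job set.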

    Algorithmes d’ordonnancement tolérants aux fautes pour les plates-formes à large échelle

    This thesis focuses on a major problem for the HPC community: resilience. Computing platforms grow ever larger in order to reach exascale, i.e., a computing capacity of 10^18 FLOP/s, but they consequently suffer numerous failures. Reducing the execution time and handling the errors are two linked problems: for instance, replication (computation redundancy) decreases the number of critical failures but also decreases the number of available resources. In particular, this thesis focuses on several “checkpoint/restart” mechanisms (saving the state of an application so as to restart from that save when a failure occurs): the first part investigates multi-level checkpointing, the use of additional resources to cope with system latency, and checkpointing in generic task graphs. The second part deals with optimal checkpointing strategies coupled with replication (in linear task graphs, on heterogeneous platforms, and with process duplication). The last part explores several scheduling problems linked to the increasing disruptions in large-scale platforms.

    Ordonnancement périodique d'entrées/sorties pour super-ordinateurs

    With the ever-growing need for data in HPC applications, congestion at the I/O level becomes critical in supercomputers. Architectural enhancements such as burst buffers and pre-fetching are added to machines but are not sufficient to prevent congestion. Recent online I/O scheduling strategies have been put in place, but they add an additional congestion point and overheads to the computation of applications. In this work, we show how to take advantage of the periodic nature of HPC applications in order to develop efficient periodic scheduling strategies for their I/O transfers. Our strategy computes, once during the job scheduling phase, a pattern that defines the I/O behavior of each application; the applications then run independently, transferring their I/O at the specified times. Our strategy limits the amount of I/O congestion at the I/O node level and can be easily integrated into current job schedulers. We validate this model through extensive simulations and experiments, comparing it to state-of-the-art online solutions and showing that our scheduler not only has the advantage of being decentralized, thus avoiding the overhead of online schedulers, but also performs better than these solutions, improving application dilation by up to 13% and maximum system efficiency by up to 18%.
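    The "compute the pattern once, then run independently" idea can be sketched naively: pack each application's I/O into a fixed, non-overlapping window of a common period. This is a deliberate simplification; the paper's scheduler additionally accounts for bandwidth sharing and application dilation:

```python
def periodic_io_pattern(io_times, period):
    """Assign each application a fixed, non-overlapping I/O window
    inside one period of length `period`.

    Naive sketch: windows are packed back-to-back, and the pattern is
    feasible only if the total I/O demand fits in the period.
    Returns a list of (start, end) offsets, one per application.
    """
    if sum(io_times) > period:
        raise ValueError("I/O demand exceeds the period")
    windows, t = [], 0.0
    for io in io_times:
        windows.append((t, t + io))
        t += io
    return windows
```

    Because the windows are fixed offsets within the period, each application can trigger its transfers locally at the agreed times, with no online coordinator in the critical path.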

    Optimal Checkpointing Period with Replicated Execution on Heterogeneous Platforms

    In this paper, we design and analyze strategies to replicate the execution of an application on two different platforms subject to failures, using checkpointing on a shared stable storage. We derive the optimal pattern size W for a periodic checkpointing strategy where both platforms concurrently try to execute W units of work before checkpointing. The first platform that completes its pattern takes a checkpoint, and the other platform interrupts its execution to synchronize from that checkpoint. We compare this strategy to a simpler on-failure checkpointing strategy, where a checkpoint is taken by one platform only whenever the other platform encounters a failure. We use first- or second-order approximations to compute overheads and optimal pattern sizes, and show through extensive simulations that these models are very accurate. The simulations show the usefulness of a secondary platform to reduce execution time, even when the platforms have relatively different speeds: on average, over a wide range of scenarios, the overhead is reduced by 30%. The simulations also demonstrate that the periodic checkpointing strategy is globally more efficient, unless platform speeds are quite close.
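    For intuition about the style of first-order analysis, the classic single-platform analogue of the optimal pattern size is the Young/Daly period. The sketch below shows only this textbook baseline, not the paper's two-platform derivation:

```python
import math

def young_daly_period(mtbf, checkpoint_cost):
    """First-order optimal amount of work between checkpoints on a
    single platform (Young/Daly): W* = sqrt(2 * mu * C), where mu is
    the platform MTBF and C the checkpoint cost."""
    return math.sqrt(2 * mtbf * checkpoint_cost)

def expected_overhead(W, mtbf, checkpoint_cost):
    """First-order expected overhead per unit of work: C/W for the
    checkpoints plus W/(2*mu) for expected re-execution after a
    failure; W* minimizes this sum."""
    return checkpoint_cost / W + W / (2 * mtbf)
```

    For example, with an MTBF of one hour and a 60 s checkpoint, W* is about 657 s of work per pattern; the paper's replicated setting changes the failure model (either platform may finish the pattern first) and hence the optimal W.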