123 research outputs found

    Guest editorial: Special issue on parallel matrix algorithms and applications (PMAA’16)

    This special issue of Parallel Computing contains nine peer-reviewed articles selected from invited and contributed presentations at the 8th International Workshop on Parallel Matrix Algorithms and Applications (PMAA'16), which took place at the University of Bordeaux, France, on July 6-8, 2016. The workshop attracted around 120 participants from all continents; 25% were PhD students and around 10% came from industry. The workshop was co-chaired by Emmanuel Agullo, Peter Arbenz, Luc Giraud, and Olaf Schenk. The members of the program committee were: P. D'Ambra, H ... A total of twelve high-quality submissions were received, of which nine were eventually accepted and appear in this special issue. The nine papers address diverse aspects of linear algebra and high performance computing. (1) Jack Dongarra, Mark Gates, and Stanimire Tomov address accelerating the SVD two-stage reduction and divide-and-conquer using GPUs. The increasing gap between memory bandwidth and computation speed motivates choosing algorithms that take full advantage of today's high performance computers. For dense matrices, the classic algorithm for the SVD uses a one-stage reduction to bidiagonal form, whose performance is limited by memory bandwidth. To overcome this limitation, a two-stage reduction to bidiagonal form has been gaining popularity. As accelerators, such as GPUs and co-processors, are becoming increasingly widespread in high-performance computing, the authors present an accelerated SVD employing a two-stage reduction to bidiagonal form as well as a parallelized and accelerated divide-and-conquer algorithm to solve the subsequent bidiagonal SVD. The new implementation provides a significant speedup compared to existing multi-core and GPU-based SVD implementations

    A scalable H-matrix approach for the solution of boundary integral equations on multi-GPU clusters

    In this work, we consider the solution of boundary integral equations by means of a scalable hierarchical matrix approach on clusters equipped with graphics hardware, i.e. graphics processing units (GPUs). To this end, we extend our existing single-GPU hierarchical matrix library hmglib such that it is able to scale on many GPUs and such that it can be coupled to arbitrary application codes. Using a model GPU implementation of a boundary element method (BEM) solver, we are able to achieve more than 67 percent relative parallel speed-up going from 128 to 1024 GPUs for a model geometry test case with 1.5 million unknowns and a real-world geometry test case with almost 1.2 million unknowns. On 1024 GPUs of the cluster Titan, it takes less than 6 minutes to solve the 1.5 million unknowns problem, with 5.7 minutes for the setup phase and 20 seconds for the iterative solver. To the best of the authors' knowledge, we here discuss the first fully GPU-based distributed-memory parallel hierarchical matrix Open Source library using the traditional H-matrix format and adaptive cross approximation with an application to BEM problems

    Numerically Stable Recurrence Relations for the Communication Hiding Pipelined Conjugate Gradient Method

    Pipelined Krylov subspace methods (also referred to as communication-hiding methods) have been proposed in the literature as a scalable alternative to classic Krylov subspace algorithms for iteratively computing the solution to a large linear system in parallel. For symmetric and positive definite system matrices, the pipelined Conjugate Gradient method outperforms its classic Conjugate Gradient counterpart on large-scale distributed-memory hardware by overlapping global communication with essential computations like the matrix-vector product, thus hiding global communication. A well-known drawback of the pipelining technique is the (possibly significant) loss of numerical stability. In this work, a numerically stable variant of the pipelined Conjugate Gradient algorithm is presented that avoids the propagation of local rounding errors in the finite-precision recurrence relations that construct the Krylov subspace basis. The multi-term recurrence relation for the basis vector is replaced by two-term recurrences, improving stability without increasing the overall computational cost of the algorithm. The proposed modification ensures that the pipelined Conjugate Gradient method is able to attain a highly accurate solution independently of the pipeline length. Numerical experiments demonstrate a combination of excellent parallel performance and improved maximal attainable accuracy for the new pipelined Conjugate Gradient algorithm. This work thus resolves one of the major practical restrictions on the usability of pipelined Krylov subspace methods. Comment: 15 pages, 5 figures, 1 table, 2 algorithms
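For orientation, the following is a minimal sketch of the classic (non-pipelined) Conjugate Gradient method that the abstract uses as its baseline. It is not the pipelined variant proposed in the paper. The function names, the dense list-of-lists matrix, and the 2x2 test system are assumptions made purely for illustration. The comments mark the two dot products per iteration, which on a distributed machine would each require a global reduction. These are the synchronisation points that pipelined variants aim to overlap with the matrix-vector product.

```python
def matvec(A, x):
    # Dense matrix-vector product; A is a list of rows (illustrative only).
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    # In a distributed setting this would be a global reduction.
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(A, b, tol=1e-12, max_iter=1000):
    """Classic CG for a symmetric positive definite matrix A."""
    x = [0.0] * len(b)
    r = list(b)              # residual r = b - A x, with x = 0
    p = list(r)              # first search direction
    rr = dot(r, r)
    for k in range(max_iter):
        if rr ** 0.5 <= tol:
            return x, k
        s = matvec(A, p)
        alpha = rr / dot(p, s)           # global reduction 1
        # Two-term recurrences for the iterate and the residual.
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * si for ri, si in zip(r, s)]
        rr_new = dot(r, r)               # global reduction 2
        beta = rr_new / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x, max_iter


# Hypothetical 2x2 SPD system; its exact solution is (1/11, 7/11).
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x, iters = conjugate_gradient(A, b)
```

In this baseline each iteration has to wait for both reductions before it can proceed, which is the latency cost that communication-hiding methods target.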

    Developing a distributed electronic health-record store for India

    The DIGHT project is addressing the problem of building a scalable and highly available information store for the Electronic Health Records (EHRs) of the over one billion citizens of India

    Use of A Network Enabled Server System for a Sparse Linear Algebra Grid Application

    Solving sparse systems of linear equations is one of the key operations in linear algebra. Many different algorithms are available for that purpose, and they depend on many parameters that require careful tuning to deliver robustness and to minimise runtime and memory consumption. The GRID-TLSE project provides, on the one hand, a scenario-driven expert site that helps users choose the right algorithm for their problem and tune its parameters and, on the other hand, a test-bed where experts can compare algorithms efficiently and dynamically define new scenarios for the expert site. Both features require running the available solvers a large number of times with many different values of the control parameters (and possibly on several machine architectures). Currently, only the grid can provide enough computing power for this kind of application. The DIET middleware is the grid backbone for TLSE: it manages the grid, the solver services, and their scheduling in a scalable way

    A study of various load information exchange mechanisms for a distributed application using dynamic scheduling

    We consider a distributed asynchronous system where processes can only communicate by message passing and need a coherent view of the load (e.g., workload, memory) of the others to take dynamic scheduling decisions. We present several mechanisms to obtain a distributed view of such information, based either on maintaining that view or on a demand-driven snapshot algorithm. In the first kind of approach, the view is maintained through regular message exchanges; in the second (demand-driven or snapshot mechanisms), the process that needs the information issues a request and then receives the corresponding load information. We perform an experimental study in the context of a real application that uses distributed dynamic schedulers, an asynchronous parallel solver for large sparse systems of linear equations

    Evaluation and Analysis of Distributed Graph-Parallel Processing Frameworks

    A number of graph-parallel processing frameworks have been proposed in recent years to address the needs of processing complex and large-scale graph-structured datasets. Although significant performance improvements were reported for those frameworks, the comparative advantages of each framework over the others have not been fully studied, which impedes the best utilization of those frameworks for a specific graph computing task and setting. In this work, we conducted a systematic comparison study of parallel processing systems for large-scale graph computations, aiming to reveal the characteristics of those systems when running common graph algorithms on real-world datasets under the same conditions. We selected three popular graph-parallel processing frameworks (Giraph, GPS and GraphLab) for the study and also included a representative general data-parallel computing system, Spark, in the comparison in order to understand how well a general data-parallel system can run graph problems. We applied basic performance metrics measuring speed, resource utilization, and scalability to answer the basic question of which graph-parallel processing platform is better suited for which applications and datasets. Three widely used graph algorithms (clustering coefficient, shortest path length, and PageRank score) were used for benchmarking on the targeted computing systems. We ran those algorithms against three real-world network datasets with diverse characteristics and scales on a research cluster and obtained a number of interesting observations. For instance, all evaluated systems showed poor scalability (i.e., the runtime increases with more computing nodes) with small datasets, likely due to communication overhead. Further, out of the evaluated graph-parallel computing platforms, PowerGraph consistently exhibits better performance than the others
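As a point of reference for one of the three benchmark algorithms named above, here is a minimal single-machine sketch of PageRank by power iteration. It is not the implementation used in any of the evaluated frameworks. The function name, the damping factor of 0.85, the convergence tolerance, and the four-node example graph are all assumptions chosen for illustration.

```python
def pagerank(out_links, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank on a dict mapping node -> list of successors."""
    nodes = sorted(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without out-links is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {v: base for v in nodes}
        # Each node passes an equal share of its rank to its successors.
        for v in nodes:
            for w in out_links[v]:
                new_rank[w] += damping * rank[v] / len(out_links[v])
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical four-node graph; "d" has no incoming links.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
```

Each iteration sends one value along every edge. In a distributed framework that per-edge message traffic becomes network communication, which is consistent with the communication overhead the abstract suggests as a likely cause of poor scaling on small datasets.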