Search CORE

73 research outputs found

Partial aggregation for collective communication in distributed memory machines

Author: Kowalewski Roger
Publication venue: Ludwig-Maximilians-Universität München
Publication date: 03/08/2021
Field of study

High Performance Computing (HPC) systems interconnect a large number of Processing Elements (PEs) in high-bandwidth networks to simulate complex scientific problems. The increasing scale of HPC systems poses great challenges on algorithm designers. As the average distance between PEs increases, data movement across hierarchical memory subsystems introduces high latency. Minimizing latency is particularly challenging in collective communications, where many PEs may interact in complex communication patterns. Although collective communications can be optimized for network-level parallelism, occasional synchronization delays due to dependencies in the communication pattern degrade application performance. To reduce the performance impact of communication and synchronization costs, parallel algorithms are designed with sophisticated latency hiding techniques. The principle is to interleave computation with asynchronous communication, which increases the overall occupancy of compute cores. However, collective communication primitives abstract parallelism which limits the integration of latency hiding techniques. Approaches to work around these limitations either modify the algorithmic structure of application codes, or replace collective primitives with verbose low-level communication calls. While these approaches give fine-grained control for latency hiding, implementing collective communication algorithms is challenging and requires expertise knowledge about HPC network topologies. A collective communication pattern is commonly described as a Directed Acyclic Graph (DAG) where a set of PEs, represented as vertices, resolve data dependencies through communication along the edges. Our approach improves latency hiding in collective communication through partial aggregation. Based on mathematical rules of binary operations and homomorphism, we expose data parallelism in a respective DAG to overlap computation with communication. The proposed concepts are implemented and evaluated with a subset of collective primitives in the Message Passing Interface (MPI), an established communication standard in scientific computing. An experimental analysis with communication-bound microbenchmarks shows considerable performance benefits for the evaluated collective primitives. A detailed case study with a large-scale distributed sort algorithm demonstrates, how partial aggregation significantly improves performance in data-intensive scenarios. Besides better latency hiding capabilities with collective communication primitives, our approach enables further optimizations of their implementations within MPI libraries. The vast amount of asynchronous programming models, which are actively studied in the HPC community, benefit from partial aggregation in collective communication patterns. Future work can utilize partial aggregation to improve the interaction of MPI collectives with acclerator architectures, and to design more efficient communication algorithms

Digitale Hochschulschriften der LMU

Graph Processing on GPUs:A Survey

Author: He Ligang
Hua Qiang-Sheng
Jin Hai
Liu Bo
Shi Xuanhua
Zheng Zhigao
Zhou Yongluan
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2018
Field of study

Copenhagen University Research Information System

Efficient distributed load balancing for parallel algorithms

Author: Cosenza Biagio
Publication venue: Universita degli studi di Salerno
Publication date: 29/04/2011
Field of study

2009 - 2010With the advent of massive parallel processing technology, exploiting the power offered by hundreds, or even thousands of processors is all but a trivial task. Computing by using multi-processor, multi-core or many-core adds a number of additional challenges related to the cooperation and communication of multiple processing units. The uneven distribution of data among the various processors, i.e. the load imbalance, represents one of the major problems in data parallel applications. Without good load distribution strategies, we cannot reach good speedup, thus good efficiency. Load balancing strategies can be classified in several ways, according to the methods used to balance workload. For instance, dynamic load balancing algorithms make scheduling decisions during the execution and commonly results in better performance compared to static approaches, where task assignment is done before the execution. Even more important is the difference between centralized and distributed load balancing approaches. In fact, despite that centralized algorithms have a wider vision of the computation, hence may exploit smarter balancing techniques, they expose global synchronization and communication bottlenecks involving the master node. This definitely does not assure scalability with the number of processors. This dissertation studies the impact of different load balancing strategies. In particular, one of the key observations driving our work is that distributed algorithms work better than centralized ones in the context of load balancing for multi-processors (alike for multi-cores and many-cores as well). We first show a centralized approach for load balancing, then we propose several distributed approaches for problems having different parallelization, workload distribution and communication pattern. We try to efficiently combine several approaches to improve performance, in particular using predictive metrics to obtain a per task compute-time estimation, using adaptive subdivision, improving dynamic load balancing and addressing distributed balancing schemas. The main challenge tackled on this thesis has been to combine all these approaches together in new and efficient load balancing schemas. We assess the proposed balancing techniques, starting from centralized approaches to distributed ones, in distinctive real case scenarios: Mesh-like computation, Parallel Ray Tracing, and Agent-based Simulations. Moreover, we test our algorithms with parallel hardware such has cluster of workstations, multi-core processors and exploiting SIMD vectorial instruction set. Finally, we conclude the thesis with several remarks, about the impact of distributed techniques, the effect of the communication pattern and workload distribution, the use of cost estimation for adaptive partitioning, the trade-off fast versus accuracy in prediction-based approaches, the effectiveness of work stealing combined with sorting, and a non-trivial way to exploit hybrid CPUGPU computations. [edited by author]IX n.s

EleA@UniSA - Università degli Studi di Salerno

Algorithm Libraries for Multi-Core Processors

Author: Singler Johannes
Publication venue: KIT-Bibliothek, Karlsruhe
Publication date: 01/01/2010
Field of study

By providing parallelized versions of established algorithm libraries, we ease the exploitation of the multiple cores on modern processors for the programmer. The Multi-Core STL provides basic algorithms for internal memory, while the parallelized STXXL enables multi-core acceleration for algorithms on large data sets stored on disk. Some parallelized geometric algorithms are introduced into CGAL. Further, we design and implement sorting algorithms for huge data in distributed external memory

KITopen

ATCOM: Automatically tuned collective communication system for SMP clusters.

Author: Wu Meng-Shiou
Publication venue: Iowa State University Digital Repository
Publication date: 01/01/2005
Field of study

Conventional implementations of collective communications are based on point-to-point communications, and their optimizations have been focused on efficiency of those communication algorithms. However, point-to-point communications are not the optimal choice for modern computing clusters of SMPs due to their two-level communication structure. In recent years, a few research efforts have investigated efficient collective communications for SMP clusters. This dissertation is focused on platform-independent algorithms and implementations in this area;There are two main approaches to implementing efficient collective communications for clusters of SMPs: using shared memory operations for intra-node communications, and over-lapping inter-node/intra-node communications. The former fully utilizes the hardware based shared memory of an SMP, and the latter takes advantage of the inherent hierarchy of the communications within a cluster of SMPs. Previous studies focused on clusters of SMP from certain vendors. However, the previously proposed methods are not portable to other systems. Because the performance optimization issue is very complicated and the developing process is very time consuming, it is highly desired to have self-tuning, platform-independent implementations. As proven in this dissertation, such an implementation can significantly outperform the other point-to-point based portable implementations and some platform-specific implementations;The dissertation describes in detail the architecture of the platform-independent implementation. There are four system components: shared memory-based collective communications, overlapping mechanisms for inter-node and intra-node communications, a prediction-based tuning module and a micro-benchmark based tuning module. Each component is carefully designed with the goal of automatic tuning in mind

Digital Repository @ Iowa State University (ISU)

UNT Digital Library

Volunteer computing

Author: Sarmenta Luis Francisco G. (Luis Francisco Gumaru)
Publication venue: Massachusetts Institute of Technology
Publication date: 01/01/2001
Field of study

Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2001.Includes bibliographical references (p. 205-216).This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.This thesis presents the idea of volunteer computing, which allows high-performance parallel computing networks to be formed easily, quickly, and inexpensively by enabling ordinary Internet users to share their computers' idle processing power without needing expert help. In recent years, projects such as SETI@home have demonstrated the great potential power of volunteer computing. In this thesis, we identify volunteer computing's further potentials, and show how these can be achieved. We present the Bayanihan system for web-based volunteer computing. Using Java applets, Bayanihan enables users to volunteer their computers by simply visiting a web page. This makes it possible to set up parallel computing networks in a matter of minutes compared to the hours, days, or weeks required by traditional NOW and metacomputing systems. At the same time, Bayanihan provides a flexible object-oriented software framework that makes it easy for programmers to write various applications, and for researchers to address issues such as adaptive parallelism, fault-tolerance, and scalability. Using Bayanihan, we develop a general-purpose runtime system and APIs, and show how volunteer computing's usefulness extends beyond solving esoteric mathematical problems to other, more practical, master-worker applications such as image rendering, distributed web-crawling, genetic algorithms, parametric analysis, and Monte Carlo simulations. By presenting a new API using the bulk synchronous parallel (BSP) model, we further show that contrary to popular belief and practice, volunteer computing need not be limited to master-worker applications, but can be used for coarse-grain message-passing programs as well. Finally, we address the new problem of maintaining reliability in the presence of malicious volunteers. We present and analyze traditional techniques such as voting, and new ones such as spot-checking, encrypted computation, and periodic obfuscation. Then, we show how these can be integrated in a new idea called credibility-based fault-tolerance, which uses probability estimates to limit and direct the use of redundancy. We validate this new idea with parallel Monte Carlo simulations, and show how it can achieve error rates several orders-of-magnitude smaller than traditional voting for the same slowdown.by Luis F.G. Sarmenta.Ph.D

DSpace@MIT

Book of Abstracts of the Sixth SIAM Workshop on Combinatorial Scientific Computing

Author: Uçar Bora
Publication venue: 'Society for Industrial & Applied Mathematics (SIAM)'
Publication date: 01/08/2014
Field of study

Book of Abstracts of CSC14 edited by Bora UçarInternational audienceThe Sixth SIAM Workshop on Combinatorial Scientific Computing, CSC14, was organized at the Ecole Normale Supérieure de Lyon, France on 21st to 23rd July, 2014. This two and a half day event marked the sixth in a series that started ten years ago in San Francisco, USA. The CSC14 Workshop's focus was on combinatorial mathematics and algorithms in high performance computing, broadly interpreted. The workshop featured three invited talks, 27 contributed talks and eight poster presentations. All three invited talks were focused on two interesting fields of research specifically: randomized algorithms for numerical linear algebra and network analysis. The contributed talks and the posters targeted modeling, analysis, bisection, clustering, and partitioning of graphs, applied in the context of networks, sparse matrix factorizations, iterative solvers, fast multi-pole methods, automatic differentiation, high-performance computing, and linear programming. The workshop was held at the premises of the LIP laboratory of ENS Lyon and was generously supported by the LABEX MILYON (ANR-10-LABX-0070, Université de Lyon, within the program ''Investissements d'Avenir'' ANR-11-IDEX-0007 operated by the French National Research Agency), and by SIAM

HAL-ENS-LYON

INRIA a CCSD electronic archive server

Hal-Diderot

Graphics Technology in Space Applications (GTSA 1989)

Author: Griffin Sandy
Publication venue
Publication date
Field of study

This document represents the proceedings of the Graphics Technology in Space Applications, which was held at NASA Lyndon B. Johnson Space Center on April 12 to 14, 1989 in Houston, Texas. The papers included in these proceedings were published in general as received from the authors with minimum modifications and editing. Information contained in the individual papers is not to be construed as being officially endorsed by NASA

NASA Technical Reports Server

Performance Improvement of Distributed Computing Framework and Scientific Big Data Analysis

Author: Kondikoppa Praveenkumar
Publication venue: LSU Digital Commons
Publication date: 01/01/2014
Field of study

Analysis of Big data to gain better insights has been the focus of researchers in the recent past. Traditional desktop computers or database management systems may not be suitable for efficient and timely analysis, due to the requirement of massive parallel processing. Distributed computing frameworks are being explored as a viable solution. For example, Google proposed MapReduce, which is becoming a de facto computing architecture for Big data solutions. However, scheduling in MapReduce is coarse grained and remains as a challenge for improvement. Related with MapReduce scheduler when configured over distributed clusters, we identify two issues: data locality disruption and random assignment of non-local map tasks. We propose a network aware scheduler to extend the existing rack awareness. The tasks are scheduled in the order of node, rack and any other rack within the same cluster to achieve cluster level data locality. The issue of random assignment non-local map tasks is handled by enhancing the scheduler to consider the network parameters, such as delay, bandwidth and packet loss between remote clusters. As part of Big data analysis at computational biology, we consider two major data intensive applications: indexing genome sequences and de Novo assembly. Both of these applications deal with the massive amount data generated from DNA sequencers. We developed a scalable algorithm to construct sub-trees of a suffix tree in parallel to address huge memory requirements needed for indexing the human genome. For the de Novo assembly, we propose Parallel Giraph based Assembler (PGA) to address the challenges associated with the assembly of large genomes over commodity hardware. PGA uses the de Bruijn graph to represent the data generated from sequencers. Huge memory demands and performance expectations are addressed by developing parallel algorithms based on the distributed graph-processing framework, Apache Giraph

Louisiana State University

An agent-based visualisation system.

Author: Nicolas Roard
Publication venue
Publication date: 01/01/2007
Field of study

This thesis explores the concepts of visual supercomputing, where complex distributed systems are used toward interactive visualisation of large datasets. Such complex systems inherently trigger management and optimisation problems; in recent years the concepts of autonomic computing have arisen to address those issues. Distributed visualisation systems are a very challenging area to apply autonomic computing ideas as such systems are both latency and compute sensitive, while most autonomic computing implementations usually concentrate on one or the other but not both concurrently. A major contribution of this thesis is to provide a case study demonstrating the application of autonomic computing concepts to a computation intensive, real-time distributed visualisation system. The first part of the thesis proposes the realisation of a layered multi-agent system to enable autonomic visualisation. The implementation of a generic multi-agent system providing reflective features is described. This architecture is then used to create a flexible distributed graphic pipeline, oriented toward real-time visualisation of volume datasets. Performance evaluation of the pipeline is presented. The second part of the thesis explores the reflective nature of the system and presents high level architectures based on software agents, or visualisation strategies, that take advantage of the flexibility of the system to provide generic features. Autonomic capabilities are presented, with fault recovery and automatic resource configuration. Performance evaluation, simulation and prediction of the system are presented, exploring different use cases and optimisation scenarios. A performance exploration tool, Delphe, is described, which uses real-time data of the system to let users explore its performance

Cronfa at Swansea University