Search CORE

277 research outputs found

Hierarchical Parallel Matrix Multiplication on Large-Scale Distributed Memory Platforms

Author: Hasanov Khalid
Lastovetsky Alexey
Quintin Jean-Noel
Publication venue
Publication date: 18/06/2013
Field of study

Matrix multiplication is a very important computation kernel both in its own right as a building block of many scientific applications and as a popular representative for other scientific applications. Cannon algorithm which dates back to 1969 was the first efficient algorithm for parallel matrix multiplication providing theoretically optimal communication cost. However this algorithm requires a square number of processors. In the mid 1990s, the SUMMA algorithm was introduced. SUMMA overcomes the shortcomings of Cannon algorithm as it can be used on a non-square number of processors as well. Since then the number of processors in HPC platforms has increased by two orders of magnitude making the contribution of communication in the overall execution time more significant. Therefore, the state of the art parallel matrix multiplication algorithms should be revisited to reduce the communication cost further. This paper introduces a new parallel matrix multiplication algorithm, Hierarchical SUMMA (HSUMMA), which is a redesign of SUMMA. Our algorithm reduces the communication cost of SUMMA by introducing a two-level virtual hierarchy into the two-dimensional arrangement of processors. Experiments on an IBM BlueGene-P demonstrate the reduction of communication cost up to 2.08 times on 2048 cores and up to 5.89 times on 16384 cores.Comment: 9 page

arXiv.org e-Print Archive

Crossref

On Collective Communication and Notified Read in the Global Address Space Programming Interface (GASPI)

Author: End Vanessa
Publication venue
Publication date: 14/12/2016
Field of study

Georg-August-University Göttingen

MPG.PuRe

Network on chip architecture for multi-agent systems in FPGA

Author: Belatreche A
Coleman S
Gerlein EA
McGinnity TM
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 22/11/2017
Field of study

A system of interacting agents is, by definition, very demanding in terms of computational resources. Although multi-agent systems have been used to solve complex problems in many areas, it is usually very difficult to perform large-scale simulations in their targeted serial computing platforms. Reconfigurable hardware, in particular Field Programmable Gate Arrays (FPGA) devices, have been successfully used in High Performance Computing applications due to their inherent flexibility, data parallelism and algorithm acceleration capabilities. Indeed, reconfigurable hardware seems to be the next logical step in the agency paradigm, but only a few attempts have been successful in implementing multi-agent systems in these platforms. This paper discusses the problem of inter-agent communications in Field Programmable Gate Arrays. It proposes a Network-on-Chip in a hierarchical star topology to enable agents’ transactions through message broadcasting using the Open Core Protocol, as an interface between hardware modules. A customizable router microarchitecture is described and a multi-agent system is created to simulate and analyse message exchanges in a generic heavy traffic load agent-based application. Experiments have shown a throughput of 1.6Gbps per port at 100 MHz without packet loss and seamless scalability characteristics

Northumbria Research Link

Crossref

Nottingham Trent Institutional Repository (IRep)

Ulster University's Research Portal

Performance measurement and analysis of PC based cluster server using SET of Architecture and modeling a scalable High performance cluster

Author: Mehta Mihir J.
Publication venue
Publication date: 01/04/2006
Field of study

Not availabl

Etheses - A Saurashtra University Library Service

Parallelizing Scale Invariant Feature Transform on a Distributed Memory Cluster

Author: Bobovych Stanislav
Publication venue: ScholarWorks@UARK
Publication date: 01/01/2011
Field of study

Scale Invariant Feature Transform (SIFT) is a computer vision algorithm that is widely-used to extract features from images. We explored accelerating an existing implementation of this algorithm with message passing in order to analyze large data sets. We successfully tested two approaches to data decomposition in order to parallelize SIFT on a distributed memory cluster

ScholarWorks@UARK

UARK (University of Arkansas )

Scalable Interactive Volume Rendering Using Off-the-shelf Components

Author: Breen David
Heirich Alan
Lombeyda Santiago
Moll Laurent
Shand Mark
Publication venue: 'California Institute of Technology Library'
Publication date: 01/01/2001
Field of study

This paper describes an application of a second generation implementation of the Sepia architecture (Sepia-2) to interactive volu-metric visualization of large rectilinear scalar fields. By employingpipelined associative blending operators in a sort-last configuration a demonstration system with 8 rendering computers sustains 24 to 28 frames per second while interactively rendering large data volumes (1024x256x256 voxels, and 512x512x512 voxels). We believe interactive performance at these frame rates and data sizes is unprecedented. We also believe these results can be extended to other types of structured and unstructured grids and a variety of GL rendering techniques including surface rendering and shadow map-ping. We show how to extend our single-stage crossbar demonstration system to multi-stage networks in order to support much larger data sizes and higher image resolutions. This requires solving a dynamic mapping problem for a class of blending operators that includes Porter-Duff compositing operators

CiteSeerX

Caltech Authors

Bandwidth optimal all-reduce algorithms for clusters of workstations

Author: Bar-Noy
Bar-Noy
Bruck
Bruck
Faraj
Faraj
Faraj
Gropp
Iannello
Karwande
Knodel
Lane
Patarasuk
Pitch Patarasuk
Rabenseifner
Rabenseifner
Thakur
van de Geijn
Xin Yuan
Yuan
Publication venue: 'Elsevier BV'
Publication date
Field of study

Crossref

Efficient Broadcast for Multicast-Capable Interconnection Networks

Author: Siebert Christian
Publication venue
Publication date: 30/09/2006
Field of study

The broadcast function MPI_Bcast() from the MPI-1.1 standard is one of the most heavily used collective operations for the message passing programming paradigm. This diploma thesis makes use of a feature called "Multicast", which is supported by several network technologies (like Ethernet or InfiniBand), to create an efficient MPI_Bcast() implementation, especially for large communicators and small-sized messages. A preceding analysis of existing real-world applications leads to an algorithm which does not only perform well for synthetical benchmarks but also even better for a wide class of parallel applications. The finally derived broadcast has been implemented for the open source MPI library "Open MPI" using IP multicast. The achieved results prove that the new broadcast is usually always better than existing point-to-point implementations, as soon as the number of MPI processes exceeds the 8 node boundary. The performance gain reaches a factor of 4.9 on 342 nodes, because the new algorithm scales practically independently of the number of involved processes.Die Broadcastfunktion MPI_Bcast() aus dem MPI-1.1 Standard ist eine der meistgenutzten kollektiven Kommunikationsoperationen des nachrichtenbasierten Programmierparadigmas. Diese Diplomarbeit nutzt die Multicastfähigkeit, die von mehreren Netzwerktechnologien (wie Ethernet oder InfiniBand) bereitgestellt wird, um eine effiziente MPI_Bcast() Implementation zu erschaffen, insbesondere für große Kommunikatoren und kleinere Nachrichtengrößen. Eine vorhergehende Analyse von existierenden parallelen Anwendungen führte dazu, dass der neue Algorithmus nicht nur bei synthetischen Benchmarks gut abschneidet, sondern sein Potential bei echten Anwendungen noch besser entfalten kann. Der letztendlich daraus entstandene Broadcast wurde für die Open-Source MPI Bibliothek "Open MPI" entwickelt und basiert auf IP Multicast. Die erreichten Ergebnisse belegen, dass der neue Broadcast üblicherweise immer besser als jegliche Punkt-zu-Punkt Implementierungen ist, sobald die Anzahl von MPI Prozessen die Grenze von 8 Knoten überschreitet. Der Geschwindigkeitszuwachs erreicht einen Faktor von 4,9 bei 342 Knoten, da der neue Algorithmus praktisch unabhängig von der Knotenzahl skaliert

Qucosa

HSSS - Hochschulschriftenserver der SLUB

Multimedia ONline ARchiv CHemnitz