101,840 research outputs found

Three-Level Parallel J-Jacobi Algorithms for Hermitian Matrices

Author: Aleksandar Ušćumlić
Bečka
Bojanczyk
Brent
Bunch
Bunch
Davor Davidović
Demmel
Dopico
Drmač
Eberlein
Hansen
Hari
Hari
Higham
Krešimir Bokulić
Luk
Luk
Okša
Parlett
Royo
Rutishauser
Sanja Singer
Saša Singer
Shroff
Singer
Singer
Slapničar
Slapničar
van der Sluis
Vedran Novaković
Veselić
Whiteside
Zha
Zhou
Publication venue: 'Elsevier BV'
Publication date: 24/08/2010

The paper describes several efficient parallel implementations of the one-sided hyperbolic Jacobi-type algorithm for computing eigenvalues and eigenvectors of Hermitian matrices. By appropriate blocking of the algorithms, an almost ideal load balancing between all available processors/cores is obtained. A similar blocking technique can be used to exploit the local cache memory of each processor to further speed up the process. Due to the diversity of modern computer architectures, each of the algorithms described here may be the method of choice for a particular hardware and a given matrix size. All proposed block algorithms compute the eigenvalues with relative accuracy similar to the original non-blocked Jacobi algorithm. Comment: Submitted for publication
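As a rough illustration of what "one-sided Jacobi" means, the sketch below implements the plain (trigonometric, non-hyperbolic) one-sided Jacobi iteration for a real matrix G, sequentially and without any blocking. It is a stand-in, not the paper's algorithm: the hyperbolic rotations, the Hermitian indefinite case, and the three-level parallel blocking are all omitted. The function name and parameters are hypothetical. The idea shown is that plane rotations are applied to pairs of columns of G until the columns are mutually orthogonal; the squared column norms are then the eigenvalues of G^T G.

```python
import math


def one_sided_jacobi(G, tol=1e-12, max_sweeps=30):
    """Return the eigenvalues of G^T G (descending) via one-sided Jacobi.

    G is a list of rows of a real m x n matrix. This is an illustrative,
    non-blocked, sequential sketch using ordinary plane rotations only.
    """
    m, n = len(G), len(G[0])
    # Work on a column-major copy, since rotations act on column pairs.
    cols = [[G[i][j] for i in range(m)] for j in range(n)]
    for _ in range(max_sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                app = sum(x * x for x in cols[p])
                aqq = sum(x * x for x in cols[q])
                apq = sum(x * y for x, y in zip(cols[p], cols[q]))
                # Skip pairs that are already orthogonal to relative accuracy.
                if abs(apq) <= tol * math.sqrt(app * aqq):
                    continue
                rotated = True
                # Rotation angle that zeroes the inner product of the pair.
                zeta = (aqq - app) / (2.0 * apq)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                cols[p], cols[q] = (
                    [c * x - s * y for x, y in zip(cols[p], cols[q])],
                    [s * x + c * y for x, y in zip(cols[p], cols[q])],
                )
        if not rotated:
            break  # a full sweep without rotations: columns are orthogonal
    return sorted((sum(x * x for x in col) for col in cols), reverse=True)
```

For example, `one_sided_jacobi([[1.0, 1.0], [0.0, 1.0]])` approximates the eigenvalues of [[1, 1], [1, 2]], which are (3 ± √5)/2. Blocked variants apply the same idea to groups of columns at a time so that each processor or cache level works on its own block.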

arXiv.org e-Print Archive

CiteSeerX

Crossref

FAMENA Repository

Full-text Institutional Repository of the Ruđer Bošković Institute

A Conflict-Resilient Lock-Free Calendar Queue for Scalable Share-Everything PDES Platforms

Author: Ianni Mauro
Marotta Romolo
Pellegrini Alessandro
Quaglia Francesco
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2017

Emerging share-everything Parallel Discrete Event Simulation (PDES) platforms rely on worker threads fully sharing the workload of events to be processed. These platforms require efficient event pool data structures enabling high concurrency of extraction/insertion operations. Non-blocking event pool algorithms are emerging as promising solutions for this problem. However, the classical non-blocking paradigm leads concurrent conflicting operations, acting on the same portion of the event pool data structure, to abort and then retry. In this article we present a conflict-resilient non-blocking calendar queue that enables conflicting dequeue operations, concurrently attempting to extract the minimum element, to survive, thus improving the level of scalability of accesses to the hot portion of the data structure, namely the bucket to which the current locality of the events to be processed is bound. We have integrated our solution within an open source share-everything PDES platform and report the results of an experimental analysis of the proposed concurrent data structure compared to some literature solutions.
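To make the term "calendar queue" concrete, here is a minimal single-threaded sketch of the underlying data structure. It is not lock-free and not conflict-resilient, so it shows only the bucket layout that the article builds on, not the article's contribution. The class name, the fixed bucket count, and the fixed bucket width are assumptions made for illustration; timestamps are assumed non-negative.

```python
import bisect
import itertools


class CalendarQueue:
    """Sequential calendar queue: events are hashed into buckets by timestamp."""

    def __init__(self, num_buckets=8, bucket_width=1.0):
        self.num_buckets = num_buckets
        self.width = bucket_width
        self.buckets = [[] for _ in range(num_buckets)]
        self.size = 0
        self.current = 0  # virtual bucket ("day") currently being served
        self._seq = itertools.count()  # tie-breaker for equal timestamps

    def enqueue(self, timestamp, event):
        day = int(timestamp // self.width)
        entry = (timestamp, next(self._seq), event)
        bisect.insort(self.buckets[day % self.num_buckets], entry)
        self.size += 1
        if day < self.current:
            self.current = day  # an earlier event arrived: move the cursor back

    def dequeue(self):
        """Remove and return (timestamp, event) with the smallest timestamp."""
        if self.size == 0:
            raise IndexError("dequeue from empty calendar queue")
        # Scan at most one full "year" of buckets starting at the current day.
        for _ in range(self.num_buckets):
            bucket = self.buckets[self.current % self.num_buckets]
            if bucket and bucket[0][0] < (self.current + 1) * self.width:
                return self._pop(bucket)
            self.current += 1
        # Nothing due within a year: jump directly to the global minimum.
        timestamp = min(b[0] for b in self.buckets if b)[0]
        self.current = int(timestamp // self.width)
        return self._pop(self.buckets[self.current % self.num_buckets])

    def _pop(self, bucket):
        timestamp, _, event = bucket.pop(0)
        self.size -= 1
        return timestamp, event
```

The bucket indexed by `current` is the "hot" portion mentioned in the abstract: when many threads dequeue concurrently, they all target that same bucket, which is where a non-blocking design has to handle conflicts.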

ART

Archivio della ricerca- Università di Roma La Sapienza

Recent update of the RPLUS2D/3D codes

Author: Tsai Y.-L. Peter

The development of the RPLUS2D/3D codes is summarized. These codes utilize LU algorithms to solve chemical non-equilibrium flows in a body-fitted coordinate system. The motivation behind the development of these codes is the need to numerically predict chemical non-equilibrium flows for the National AeroSpace Plane Program. Recent improvements include a vectorization method, blocking algorithms for geometric flexibility, out-of-core storage for large-size problems, and an LU-SW/UP combination for CPU-time efficiency and solution quality.

NASA Technical Reports Server

Accelerating Parallel Stochastic Gradient Descent via Non-blocking Mini-batches

Author: Dube Parijat
He Haoze
Publication date: 09/11/2022

SOTA decentralized SGD algorithms can overcome the bandwidth bottleneck at the parameter server by using communication collectives like Ring All-Reduce for synchronization. While the parameter updates in distributed SGD may happen asynchronously, there is still a synchronization barrier to make sure that the local training epoch at every learner is complete before the learners can advance to the next epoch. The delays in waiting for the slowest learners (stragglers) remain a problem in the synchronization steps of these state-of-the-art decentralized frameworks. In this paper, we propose the (de)centralized Non-blocking SGD (Non-blocking SGD), which can address the straggler problem in a heterogeneous environment. The main idea of Non-blocking SGD is to split the original batch into mini-batches, then accumulate the gradients and update the model based on finished mini-batches. The Non-blocking idea can be implemented using decentralized algorithms including Ring All-reduce, D-PSGD, and MATCHA to solve the straggler problem. Moreover, using gradient accumulation to update the model also guarantees convergence and avoids gradient staleness. Run-time analysis with random straggler delays and computational efficiency/throughput of devices is also presented to show the advantage of Non-blocking SGD. Experiments on a suite of datasets and deep learning networks validate the theoretical analyses and demonstrate that Non-blocking SGD speeds up training and hastens convergence. Compared with state-of-the-art decentralized asynchronous algorithms like D-PSGD and MATCHA, Non-blocking SGD takes up to 2x less time to reach the same training loss in a heterogeneous environment. Comment: 12 pages, 4 figures
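The core idea stated in the abstract (split each batch into mini-batches, accumulate gradients, and update from whichever mini-batches have finished) can be sketched in a single process as follows. This is a toy simulation under stated assumptions, not the paper's implementation: the model is a one-parameter least-squares fit, there is no all-reduce or real concurrency, and the `finished` argument (how many mini-batches each learner completed before the synchronization point) is a hypothetical input standing in for straggler behaviour.

```python
def minibatch_gradient(w, minibatch):
    """Summed gradient of 0.5 * (w * x - y)^2 over one mini-batch, plus its size."""
    return sum((w * x - y) * x for x, y in minibatch), len(minibatch)


def split(batch, mb_size):
    """Split a learner's batch into consecutive mini-batches."""
    return [batch[i:i + mb_size] for i in range(0, len(batch), mb_size)]


def nonblocking_step(w, learner_batches, finished, lr, mb_size):
    """One model update that uses only the mini-batches each learner finished.

    finished[i] is the (assumed) number of mini-batches learner i completed;
    a straggler simply contributes fewer mini-batches instead of delaying
    the update for everyone.
    """
    grad_sum, count = 0.0, 0
    for batch, done in zip(learner_batches, finished):
        for minibatch in split(batch, mb_size)[:done]:
            g, k = minibatch_gradient(w, minibatch)
            grad_sum += g  # gradient accumulation across finished mini-batches
            count += k
    if count == 0:
        return w  # nobody finished anything: leave the model unchanged
    return w - lr * grad_sum / count  # average over the samples actually used
```

A usage example with two learners, the second of which is a straggler that finishes only one of its two mini-batches per step:

```python
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]  # y = 3x
w = 0.0
for _ in range(100):
    w = nonblocking_step(w, [data, data], finished=[2, 1], lr=0.1, mb_size=2)
```

Normalizing by the number of samples actually processed keeps the step size comparable whether or not stragglers are present.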

arXiv.org e-Print Archive

Load Balancing in Large-Scale Systems with Multiple Dispatchers

Author: Borst Sem
van der Boor Mark
van Leeuwaarden Johan
Publication date: 01/01/2017

Load balancing algorithms play a crucial role in delivering robust application performance in data centers and cloud networks. Recently, strong interest has emerged in Join-the-Idle-Queue (JIQ) algorithms, which rely on tokens issued by idle servers in dispatching tasks and outperform power-of-d policies. Specifically, JIQ strategies involve minimal information exchange, and yet achieve zero blocking and wait in the many-server limit. The latter property prevails in a multiple-dispatcher scenario when the loads are strictly equal among dispatchers. For various reasons it is not uncommon however for skewed load patterns to occur. We leverage product-form representations and fluid limits to establish that the blocking and wait then no longer vanish, even for arbitrarily low overall load. Remarkably, it is the least-loaded dispatcher that throttles tokens and leaves idle servers stranded, thus acting as bottleneck. Motivated by the above issues, we introduce two enhancements of the ordinary JIQ scheme where tokens are either distributed non-uniformly or occasionally exchanged among the various dispatchers. We prove that these extensions can achieve zero blocking and wait in the many-server limit, for any subcritical overall load and arbitrarily skewed load profiles. Extensive simulation experiments demonstrate that the asymptotic results are highly accurate, even for moderately sized systems.
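The token mechanism of JIQ, and the way an idle server can end up stranded behind the wrong dispatcher, can be illustrated with a small deterministic sketch. The class and method names are hypothetical, and two simplifications are assumed: a task that finds no token at its dispatcher is blocked (the blocking scenario), and the caller chooses which dispatcher receives a token, whereas in the ordinary scheme that choice is made by the server, typically at random.

```python
from collections import deque


class JIQSystem:
    """Toy Join-the-Idle-Queue model with several dispatchers."""

    def __init__(self, num_servers, num_dispatchers):
        self.busy = [False] * num_servers
        self.tokens = [deque() for _ in range(num_dispatchers)]
        # All servers start idle; hand out their tokens round-robin.
        for server in range(num_servers):
            self.tokens[server % num_dispatchers].append(server)

    def arrive(self, dispatcher):
        """A task arrives at a dispatcher; return the server used, or None if blocked."""
        if not self.tokens[dispatcher]:
            return None  # no token here, even if some server is idle elsewhere
        server = self.tokens[dispatcher].popleft()
        self.busy[server] = True
        return server

    def complete(self, server, dispatcher):
        """The server finishes its task and issues a token to the given dispatcher."""
        self.busy[server] = False
        self.tokens[dispatcher].append(server)
```

A short scenario showing the stranding effect:

```python
system = JIQSystem(num_servers=2, num_dispatchers=2)
system.arrive(1)        # dispatcher 1 uses its only token (server 1)
system.complete(1, 0)   # server 1 becomes idle and reports to dispatcher 0
system.arrive(1)        # blocked: both servers are idle, but both tokens sit at dispatcher 0
```

If dispatcher 0 receives few tasks, the tokens it holds are rarely used, which is the bottleneck effect the abstract attributes to the least-loaded dispatcher.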

arXiv.org e-Print Archive

Crossref

Repository TU/e

Pure OAI Repository

Lightweight MPI Communicators with Applications to Perfectly Balanced Quicksort

Author: Axtmann Michael
Sanders Peter
Wiebigke Armin
Publication date: 01/01/2018

MPI uses the concept of communicators to connect groups of processes. It provides nonblocking collective operations on communicators to overlap communication and computation. Flexible algorithms demand flexible communicators. For example, a process can work on different subproblems within different process groups simultaneously, new process groups can be created, or the members of a process group can change. Depending on the number of communicators, the time for communicator creation can drastically increase the running time of the algorithm. Furthermore, a new communicator synchronizes all processes as communicator creation routines are blocking collective operations. We present RBC, a communication library based on MPI, that creates range-based communicators in constant time without communication. These RBC communicators support (non)blocking point-to-point communication as well as (non)blocking collective operations. Our experiments show that the library reduces the time to create a new communicator by a factor of more than 400 whereas the running time of collective operations remains about the same. We propose Janus Quicksort, a distributed sorting algorithm that avoids any load imbalances. We improved the performance of this algorithm by a factor of 15 for moderate inputs by using RBC communicators. Finally, we discuss different approaches to bring nonblocking (local) communicator creation of lightweight (range-based) communicators into MPI.
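Why a range-based communicator can be created in constant time without communication can be seen from a small sketch: if a process group is described only by a range of ranks in a parent communicator, then creating a sub-group is pure local arithmetic. The class below is a hypothetical illustration and does not reproduce the RBC library's API; the names, the inclusive `last` bound, and the `stride` field are assumptions, and no MPI calls are made.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RangeComm:
    """A process group described as ranks first, first+stride, ... up to last."""

    first: int       # parent rank of local rank 0
    last: int        # inclusive upper bound on parent ranks
    stride: int = 1

    @property
    def size(self) -> int:
        return (self.last - self.first) // self.stride + 1

    def to_global(self, local_rank: int) -> int:
        """Translate a rank in this group to a rank in the parent communicator."""
        if not 0 <= local_rank < self.size:
            raise ValueError("local rank out of range")
        return self.first + local_rank * self.stride

    def to_local(self, global_rank: int) -> Optional[int]:
        """Translate a parent rank to a local rank, or None if it is not a member."""
        offset = global_rank - self.first
        if offset < 0 or offset % self.stride != 0:
            return None
        local_rank = offset // self.stride
        return local_rank if local_rank < self.size else None

    def sub_range(self, lo: int, hi: int, stride: int = 1) -> "RangeComm":
        """Create a sub-group from local ranks lo..hi in O(1), with no messages."""
        if not (0 <= lo <= hi < self.size) or stride < 1:
            raise ValueError("invalid sub-range")
        return RangeComm(self.to_global(lo), self.to_global(hi), self.stride * stride)
```

For instance, starting from `world = RangeComm(0, 15)`, the call `world.sub_range(4, 11).sub_range(0, 7, 2)` describes parent ranks 4, 6, 8, and 10. Every process can compute this description on its own, which is what removes the synchronization that a blocking collective communicator-creation call would impose.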

arXiv.org e-Print Archive

Crossref

KITopen

NOMAD: Non-locking, stOchastic Multi-machine algorithm for Asynchronous and Decentralized matrix completion

Author: Dhillon Inderjit
Hsieh Cho-Jui
Vishwanathan S. V. N.
Yu Hsiang-Fu
Yun Hyokun
Publication date: 24/04/2014

We develop an efficient parallel distributed algorithm for matrix completion, named NOMAD (Non-locking, stOchastic Multi-machine algorithm for Asynchronous and Decentralized matrix completion). NOMAD is a decentralized algorithm with non-blocking communication between processors. One of the key features of NOMAD is that the ownership of a variable is asynchronously transferred between processors in a decentralized fashion. As a consequence it is a lock-free parallel algorithm. In spite of being an asynchronous algorithm, the variable updates of NOMAD are serializable, that is, there is an equivalent update ordering in a serial implementation. NOMAD outperforms synchronous algorithms which require explicit bulk synchronization after every iteration: our extensive empirical evaluation shows that not only does our algorithm perform well in a distributed setting on commodity hardware, but it also outperforms state-of-the-art algorithms on an HPC cluster both in multi-core and distributed memory settings.
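The ownership-transfer idea can be sketched as a sequential simulation. In the sketch below, each worker permanently owns a subset of the user (row) factors, while each item (column) factor is a token that sits in exactly one worker's queue at a time; a worker updates an item only while holding its token and then hands it on. Because only the holder touches an item factor, no locks are needed. This is an assumption-laden illustration rather than the NOMAD implementation: workers are visited in a loop instead of running concurrently, the hand-off target is chosen at random, and all names and hyperparameters are hypothetical.

```python
import random
from collections import deque


def squared_error(ratings, W, H):
    """Total squared error of the factorization on the observed ratings."""
    return sum((r - sum(a * b for a, b in zip(W[u], H[i]))) ** 2 for u, i, r in ratings)


def nomad_sketch(ratings, num_users, num_items, num_workers=2, rank=2,
                 lr=0.05, reg=0.01, rounds=200, seed=0):
    """SGD matrix completion with item factors passed between workers as tokens."""
    rng = random.Random(seed)
    W = [[rng.uniform(0.0, 0.5) for _ in range(rank)] for _ in range(num_users)]
    H = [[rng.uniform(0.0, 0.5) for _ in range(rank)] for _ in range(num_items)]
    # Static partition of users: worker q owns users with u % num_workers == q.
    local = [dict() for _ in range(num_workers)]
    for u, i, r in ratings:
        local[u % num_workers].setdefault(i, []).append((u, r))
    # Initial ownership of item tokens.
    queues = [deque() for _ in range(num_workers)]
    for i in range(num_items):
        queues[i % num_workers].append(i)
    for _ in range(rounds):
        for q in range(num_workers):  # sequential stand-in for concurrent workers
            if not queues[q]:
                continue
            i = queues[q].popleft()  # worker q now exclusively owns item i
            for u, r in local[q].get(i, []):
                err = r - sum(a * b for a, b in zip(W[u], H[i]))
                for k in range(rank):
                    wu, hi = W[u][k], H[i][k]
                    W[u][k] += lr * (err * hi - reg * wu)
                    H[i][k] += lr * (err * wu - reg * hi)
            # Transfer ownership of item i to a randomly chosen worker.
            queues[rng.randrange(num_workers)].append(i)
    return W, H
```

Since each update touches one user factor (owned by the current worker) and one item factor (held as a token by the current worker), any concurrent execution of this scheme corresponds to some serial order of updates, which is the serializability property the abstract mentions.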

arXiv.org e-Print Archive

CiteSeerX