
2,111 research outputs found

Hyper-Scalable JSQ with Sparse Feedback

Author: Borst Sem
van der Boor Mark
van Leeuwaarden Johan
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 06/03/2019

Load balancing algorithms play a vital role in enhancing performance in data centers and cloud networks. Due to the massive size of these systems, scalability challenges, and especially the communication overhead associated with load balancing mechanisms, have emerged as major concerns. Motivated by these issues, we introduce and analyze a novel class of load balancing schemes where the various servers provide occasional queue updates to guide the load assignment. We show that the proposed schemes strongly outperform JSQ($d$) strategies with comparable communication overhead per job, and can achieve a vanishing waiting time in the many-server limit with just one message per job, just like the popular JIQ scheme. The proposed schemes are particularly geared however towards the sparse feedback regime with less than one message per job, where they outperform corresponding sparsified JIQ versions. We investigate fluid limits for synchronous updates as well as asynchronous exponential update intervals. The fixed point of the fluid limit is identified in the latter case, and used to derive the queue length distribution. We also demonstrate that in the ultra-low feedback regime the mean stationary waiting time tends to a constant in the synchronous case, but grows without bound in the asynchronous case.
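For readers unfamiliar with the JSQ($d$) baseline named in this abstract, the following is a minimal sketch of that policy only, not of the sparse-feedback schemes the paper proposes. It assumes the standard textbook description (sample $d$ servers uniformly at random, join the shortest sampled queue) and ignores service completions; the function names `jsq_d` and `dispatch_jobs` are hypothetical.

```python
import random


def jsq_d(queues, d, rng):
    """Sample d distinct servers uniformly; return the index of the shortest sampled queue."""
    sampled = rng.sample(range(len(queues)), d)
    return min(sampled, key=lambda i: queues[i])


def dispatch_jobs(num_servers, num_jobs, d, seed=0):
    """Assign num_jobs arrivals with JSQ(d). Assumption: no departures are modeled."""
    rng = random.Random(seed)
    queues = [0] * num_servers
    for _ in range(num_jobs):
        queues[jsq_d(queues, d, rng)] += 1
    return queues


if __name__ == "__main__":
    # d = 1 is purely random assignment; d = num_servers is full JSQ.
    for d in (1, 2, 10):
        q = dispatch_jobs(num_servers=10, num_jobs=1000, d=d)
        print(d, max(q) - min(q))
```

Each dispatch in this sketch queries $d$ servers, which is the per-job communication overhead the abstract compares against.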

arXiv.org e-Print Archive

Crossref

Pure OAI Repository

Towards Optimal Distributed Node Scheduling in a Multihop Wireless Network through Local Voting

Author: Amelina Natalia
Granichin Oleg
Jiang Yuming
Kralevska Katina
Vergados Dimitrios J.
Publication date: 02/03/2017

In a multihop wireless network, it is crucial but challenging to schedule transmissions in an efficient and fair manner. In this paper, a novel distributed node scheduling algorithm, called Local Voting, is proposed. This algorithm tries to semi-equalize the load (defined as the ratio of the queue length over the number of allocated slots) through slot reallocation based on local information exchange. The algorithm stems from the finding that the shortest delivery time or delay is obtained when the load is semi-equalized throughout the network. In addition, we prove that, with Local Voting, the network system converges asymptotically towards the optimal scheduling. Moreover, through extensive simulations, the performance of Local Voting is further investigated in comparison with several representative scheduling algorithms from the literature. Simulation results show that the proposed algorithm achieves better performance than the other distributed algorithms in terms of average delay, maximum delay, and fairness. Despite being distributed, the performance of Local Voting is also found to be very close to a centralized algorithm that is deemed to have the optimal performance.
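The load definition in this abstract (queue length divided by allocated slots) can be illustrated with a small sketch. The reallocation step below is an assumption made purely for illustration: it moves one slot from the least-loaded node to the most-loaded node. It is not the Local Voting protocol itself, whose rules the abstract does not spell out, and the names `load` and `rebalance_once` are hypothetical.

```python
def load(queue_length, slots):
    """Load as defined in the abstract: queue length over allocated slots."""
    return queue_length / slots


def rebalance_once(queues, slots):
    """Illustrative step (assumption): move one slot from the least-loaded
    node to the most-loaded node, never leaving the donor with zero slots."""
    loads = [load(q, s) for q, s in zip(queues, slots)]
    donor = min(range(len(loads)), key=loads.__getitem__)
    taker = max(range(len(loads)), key=loads.__getitem__)
    new_slots = list(slots)
    if donor != taker and new_slots[donor] > 1:
        new_slots[donor] -= 1
        new_slots[taker] += 1
    return new_slots


if __name__ == "__main__":
    queues, slots = [8, 2], [2, 2]
    print([load(q, s) for q, s in zip(queues, slots)])
    slots = rebalance_once(queues, slots)
    print(slots, [load(q, s) for q, s in zip(queues, slots)])
```

In the two-node example, the loads start at 4 and 1; after one step the slot counts become 3 and 1, which brings the two loads closer together.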

arXiv.org e-Print Archive

NORA - Norwegian Open Research Archives

Distributed memory compiler design for sparse problems

Author: Berryman Harry
Hiranandani Seema
Saltz Joel
Wu Janet

A compiler and runtime support mechanism is described and demonstrated. The methods presented are capable of solving a wide range of sparse and unstructured problems in scientific computing. The compiler takes as input a FORTRAN 77 program enhanced with specifications for distributing data, and the compiler outputs a message passing program that runs on a distributed memory computer. The runtime support for this compiler is a library of primitives designed to efficiently support irregular patterns of distributed array accesses and irregular distributed array partitions. A variety of Intel iPSC/860 performance results obtained through the use of this compiler are presented.

NASA Technical Reports Server

Asymptotically Optimal Load Balancing Topologies

Author: Borst Sem C.
Mukherjee Debankur
van Leeuwaarden Johan S. H.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2018

We consider a system of $N$ servers inter-connected by some underlying graph topology $G_N$. Tasks arrive at the various servers as independent Poisson processes of rate $\lambda$. Each incoming task is irrevocably assigned to whichever server has the smallest number of tasks among the one where it appears and its neighbors in $G_N$. Tasks have unit-mean exponential service times and leave the system upon service completion. The above model has been extensively investigated in the case $G_N$ is a clique. Since the servers are exchangeable in that case, the queue length process is quite tractable, and it has been proved that for any $\lambda < 1$, the fraction of servers with two or more tasks vanishes in the limit as $N \to \infty$. For an arbitrary graph $G_N$, the lack of exchangeability severely complicates the analysis, and the queue length process tends to be worse than for a clique. Accordingly, a graph $G_N$ is said to be $N$-optimal or $\sqrt{N}$-optimal when the occupancy process on $G_N$ is equivalent to that on a clique on an $N$-scale or $\sqrt{N}$-scale, respectively. We prove that if $G_N$ is an Erdős-Rényi random graph with average degree $d(N)$, then it is with high probability $N$-optimal and $\sqrt{N}$-optimal if $d(N) \to \infty$ and $d(N) / (\sqrt{N} \log(N)) \to \infty$ as $N \to \infty$, respectively. This demonstrates that optimality can be maintained at $N$-scale and $\sqrt{N}$-scale while reducing the number of connections by nearly a factor $N$ and $\sqrt{N} / \log(N)$ compared to a clique, provided the topology is suitably random. It is further shown that if $G_N$ contains $\Theta(N)$ bounded-degree nodes, then it cannot be $N$-optimal. In addition, we establish that an arbitrary graph $G_N$ is $N$-optimal when its minimum degree is $N - o(N)$, and may not be $N$-optimal even when its minimum degree is $c N + o(N)$ for any $0 < c < 1/2$.

Comment: A few relevant results from arXiv:1612.00723 are included for convenience.
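The assignment rule stated in this abstract (send each task to the least-loaded server among the arrival server and its graph neighbors) can be sketched as below. The adjacency representation, the function name `assign_task`, and the tie-breaking rule (ties go to the arrival server, then to the lowest-numbered neighbor) are assumptions; the abstract does not specify how ties are broken.

```python
def assign_task(arrival_server, neighbors, queue_lengths):
    """Return the server that receives a task arriving at arrival_server.

    neighbors: dict mapping each server to the set of its neighbors in G_N.
    Assumption: ties are broken in favor of the arrival server, then by index.
    """
    candidates = [arrival_server] + sorted(neighbors[arrival_server])
    # min() returns the first minimal element, which implements the tie rule.
    return min(candidates, key=lambda s: queue_lengths[s])


if __name__ == "__main__":
    # A 3-server star centered at server 0 (hypothetical example topology).
    neighbors = {0: {1, 2}, 1: {0}, 2: {0}}
    queue_lengths = [3, 1, 2]
    target = assign_task(0, neighbors, queue_lengths)
    queue_lengths[target] += 1
    print(target, queue_lengths)
```

When every server is adjacent to every other server (the clique case discussed in the abstract), this rule reduces to ordinary join-the-shortest-queue.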

arXiv.org e-Print Archive

Crossref

Repository TU/e

Pure OAI Repository

Coarse grain parallel finite element simulations for incompressible flows

Author: M. F. Webster
P. W. Grant
X. Zhang
Publication venue: 'Wiley'
Publication date: 01/01/2002

Crossref

Task-based adaptive multiresolution for time-space multi-scale reaction-diffusion systems on multi-core architectures

Author: Descombes Stéphane
Duarte Max
Dumont Thierry
Guillet Thomas
Louvet Violaine
Massot Marc
Publication venue: 'Cellule MathDoc/CEDRAM'
Publication date: 14/10/2016

A new solver featuring time-space adaptation and error control has been recently introduced to tackle the numerical solution of stiff reaction-diffusion systems. Based on operator splitting, finite volume adaptive multiresolution and high order time integrators with specific stability properties for each operator, this strategy yields high computational efficiency for large multidimensional computations on standard architectures such as powerful workstations. However, the data structure of the original implementation, based on trees of pointers, provides limited opportunities for efficiency enhancements, while posing serious challenges in terms of parallel programming and load balancing. The present contribution proposes a new implementation of the whole set of numerical methods including Radau5 and ROCK4, relying on a fully different data structure together with the use of a specific library, TBB, for shared-memory, task-based parallelism with work-stealing. The performance of our implementation is assessed in a series of test-cases of increasing difficulty in two and three dimensions on multi-core and many-core architectures, demonstrating high scalability.

arXiv.org e-Print Archive

HAL-CentraleSupelec

HAL-UJM

The SMAI journal of computational mathematics

Crossref

Numérisation de Documents Anciens Mathématiques

Hal-Diderot

HAL-Polytechnique

HAL-Rennes 1

A Tuned and Scalable Fast Multipole Method as a Preeminent Algorithm for Exascale Systems

Author: Bergman K
Chandramowlishwaran A
Hamada T
Lorena A Barba
Rahimian A
Rio Yokota
Warren M
Yokota R
Publication venue: 'SAGE Publications'
Publication date: 16/10/2011

Among the algorithms that are likely to play a major role in future exascale computing, the fast multipole method (FMM) appears as a rising star. Our recent work showed scaling of an FMM on GPU clusters, with problem sizes on the order of billions of unknowns. That work led to an extremely parallel FMM, scaling to thousands of GPUs or tens of thousands of CPUs. This paper reports on a campaign of performance tuning and scalability studies using multi-core CPUs on the Kraken supercomputer. All kernels in the FMM were parallelized using OpenMP, and a test using 10^7 particles randomly distributed in a cube showed 78% efficiency on 8 threads. Tuning of the particle-to-particle kernel using SIMD instructions resulted in 4x speed-up of the overall algorithm on single-core tests with 10^3 - 10^7 particles. Parallel scalability was studied in both strong and weak scaling. The strong scaling test used 10^8 particles and resulted in 93% parallel efficiency on 2048 processes for the non-SIMD code and 54% for the SIMD-optimized code (which was still 2x faster). The weak scaling test used 10^6 particles per process, and resulted in 72% efficiency on 32,768 processes, with the largest calculation taking about 40 seconds to evaluate more than 32 billion unknowns. This work builds up evidence for our view that FMM is poised to play a leading role in exascale computing, and we end the paper with a discussion of the features that make it a particularly favorable algorithm for the emerging heterogeneous and massively parallel architectural landscape.
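The scaling figures quoted in this abstract can be related with a short arithmetic sketch. The problem-size check uses only numbers from the abstract (10^6 particles per process on 32,768 processes). The efficiency formulas are the conventional definitions, assumed here because the abstract does not state which definition the authors used, and the timings passed to them in the example are hypothetical.

```python
def weak_scaling_total(particles_per_process, processes):
    """Total problem size when every process holds the same number of particles."""
    return particles_per_process * processes


def strong_scaling_efficiency(t_ref, p_ref, t_p, p):
    """Conventional definition (assumed): fixed problem size, time t_ref on
    p_ref processes compared with time t_p on p processes."""
    return (t_ref * p_ref) / (t_p * p)


def weak_scaling_efficiency(t_ref, t_p):
    """Conventional definition (assumed): fixed work per process, reference
    time t_ref compared with time t_p at the larger process count."""
    return t_ref / t_p


if __name__ == "__main__":
    total = weak_scaling_total(10**6, 32_768)
    print(total, total > 32 * 10**9)
    # Hypothetical timings, for illustration only.
    print(strong_scaling_efficiency(t_ref=100.0, p_ref=1, t_p=12.5, p=8))
    print(weak_scaling_efficiency(t_ref=30.0, t_p=40.0))
```

The first call reproduces the "more than 32 billion unknowns" statement: 32,768 processes with 10^6 particles each hold 32,768,000,000 particles in total.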

arXiv.org e-Print Archive

Crossref