61,155 research outputs found
Realistic communication model for parallel computing on cluster
We present a model for parallel computation on commodity cluster. Our cluster model is targeted as a tool for performance analysis and algorithm design. We abstract the communication event by means of local and remote data movements, and explicitly expose the contention problems by capturing them in our parameters. To validate our model, we compare the prediction accuracy of our model with the Postal model for the popular tree-based broadcast algorithm. Our model provides good prediction accuracy and answers to some performance issues that are missing in existing models. We examine the gather collective operation, in which contention delay dominates its overall execution time. Based on the model, we design a communication schedule for the gather operation that is based on the upper and lower bounds, in which the congestion behavior could be under our control and achieve optimal results by avoiding data loss.published_or_final_versio
Black-Box Parallelization for Machine Learning
The landscape of machine learning applications is changing rapidly: large centralized datasets are replaced by high volume, high velocity data streams generated by a vast number of geographically distributed, loosely connected devices, such as mobile phones, smart sensors, autonomous vehicles or industrial machines. Current learning approaches centralize the data and process it in parallel in a cluster or computing center. This has three major disadvantages: (i) it does not scale well with the number of data-generating devices since their growth exceeds that of computing centers, (ii) the communication costs for centralizing the data are prohibitive in many applications, and (iii) it requires sharing potentially privacy-sensitive data. Pushing computation towards the data-generating devices alleviates these problems and allows to employ their otherwise unused computing power. However, current parallel learning approaches are designed for tightly integrated systems with low latency and high bandwidth, not for loosely connected distributed devices. Therefore, I propose a new paradigm for parallelization that treats the learning algorithm as a black box, training local models on distributed devices and aggregating them into a single strong one. Since this requires only exchanging models instead of actual data, the approach is highly scalable, communication-efficient, and privacy-preserving. Following this paradigm, this thesis develops black-box parallelizations for two broad classes of learning algorithms. One approach can be applied to incremental learning algorithms, i.e., those that improve a model in iterations. Based on the utility of aggregations it schedules communication dynamically, adapting it to the hardness of the learning problem. In practice, this leads to a reduction in communication by orders of magnitude. It is analyzed for (i) online learning, in particular in the context of in-stream learning, which allows to guarantee optimal regret and for (ii) batch learning based on empirical risk minimization where optimal convergence can be guaranteed. The other approach is applicable to non-incremental algorithms as well. It uses a novel aggregation method based on the Radon point that allows to achieve provably high model quality with only a single aggregation. This is achieved in polylogarithmic runtime on quasi-polynomially many processors. This relates parallel machine learning to Nick's class of parallel decision problems and is a step towards answering a fundamental open problem about the abilities and limitations of efficient parallel learning algorithms. An empirical study on real distributed systems confirms the potential of the approaches in realistic application scenarios
Performance analysis of direct N-body algorithms for astrophysical simulations on distributed systems
We discuss the performance of direct summation codes used in the simulation
of astrophysical stellar systems on highly distributed architectures. These
codes compute the gravitational interaction among stars in an exact way and
have an O(N^2) scaling with the number of particles. They can be applied to a
variety of astrophysical problems, like the evolution of star clusters, the
dynamics of black holes, the formation of planetary systems, and cosmological
simulations. The simulation of realistic star clusters with sufficiently high
accuracy cannot be performed on a single workstation but may be possible on
parallel computers or grids. We have implemented two parallel schemes for a
direct N-body code and we study their performance on general purpose parallel
computers and large computational grids. We present the results of timing
analyzes conducted on the different architectures and compare them with the
predictions from theoretical models. We conclude that the simulation of star
clusters with up to a million particles will be possible on large distributed
computers in the next decade. Simulating entire galaxies however will in
addition require new hybrid methods to speedup the calculation.Comment: 22 pages, 8 figures, accepted for publication in Parallel Computin
Parallel implementation of the TRANSIMS micro-simulation
This paper describes the parallel implementation of the TRANSIMS traffic
micro-simulation. The parallelization method is domain decomposition, which
means that each CPU of the parallel computer is responsible for a different
geographical area of the simulated region. We describe how information between
domains is exchanged, and how the transportation network graph is partitioned.
An adaptive scheme is used to optimize load balancing. We then demonstrate how
computing speeds of our parallel micro-simulations can be systematically
predicted once the scenario and the computer architecture are known. This makes
it possible, for example, to decide if a certain study is feasible with a
certain computing budget, and how to invest that budget. The main ingredients
of the prediction are knowledge about the parallel implementation of the
micro-simulation, knowledge about the characteristics of the partitioning of
the transportation network graph, and knowledge about the interaction of these
quantities with the computer system. In particular, we investigate the
differences between switched and non-switched topologies, and the effects of 10
Mbit, 100 Mbit, and Gbit Ethernet. keywords: Traffic simulation, parallel
computing, transportation planning, TRANSIM
Propagation and Decay of Injected One-Off Delays on Clusters: A Case Study
Analytic, first-principles performance modeling of distributed-memory
applications is difficult due to a wide spectrum of random disturbances caused
by the application and the system. These disturbances (commonly called "noise")
destroy the assumptions of regularity that one usually employs when
constructing simple analytic models. Despite numerous efforts to quantify,
categorize, and reduce such effects, a comprehensive quantitative understanding
of their performance impact is not available, especially for long delays that
have global consequences for the parallel application. In this work, we
investigate various traces collected from synthetic benchmarks that mimic real
applications on simulated and real message-passing systems in order to pinpoint
the mechanisms behind delay propagation. We analyze the dependence of the
propagation speed of idle waves emanating from injected delays with respect to
the execution and communication properties of the application, study how such
delays decay under increased noise levels, and how they interact with each
other. We also show how fine-grained noise can make a system immune against the
adverse effects of propagating idle waves. Our results contribute to a better
understanding of the collective phenomena that manifest themselves in
distributed-memory parallel applications.Comment: 10 pages, 9 figures; title change
NBODY6++GPU: Ready for the gravitational million-body problem
Accurate direct -body simulations help to obtain detailed information
about the dynamical evolution of star clusters. They also enable comparisons
with analytical models and Fokker-Planck or Monte-Carlo methods. NBODY6 is a
well-known direct -body code for star clusters, and NBODY6++ is the extended
version designed for large particle number simulations by supercomputers. We
present NBODY6++GPU, an optimized version of NBODY6++ with hybrid
parallelization methods (MPI, GPU, OpenMP, and AVX/SSE) to accelerate large
direct -body simulations, and in particular to solve the million-body
problem. We discuss the new features of the NBODY6++GPU code, benchmarks, as
well as the first results from a simulation of a realistic globular cluster
initially containing a million particles. For million-body simulations,
NBODY6++GPU is times faster than NBODY6 with 320 CPU cores and 32
NVIDIA K20X GPUs. With this computing cluster specification, the simulations of
million-body globular clusters including primordial binaries require
about an hour per half-mass crossing time.Comment: 13 pages, 9 figures, 3 table
Coverage Protocols for Wireless Sensor Networks: Review and Future Directions
The coverage problem in wireless sensor networks (WSNs) can be generally
defined as a measure of how effectively a network field is monitored by its
sensor nodes. This problem has attracted a lot of interest over the years and
as a result, many coverage protocols were proposed. In this survey, we first
propose a taxonomy for classifying coverage protocols in WSNs. Then, we
classify the coverage protocols into three categories (i.e. coverage aware
deployment protocols, sleep scheduling protocols for flat networks, and
cluster-based sleep scheduling protocols) based on the network stage where the
coverage is optimized. For each category, relevant protocols are thoroughly
reviewed and classified based on the adopted coverage techniques. Finally, we
discuss open issues (and recommend future directions to resolve them)
associated with the design of realistic coverage protocols. Issues such as
realistic sensing models, realistic energy consumption models, realistic
connectivity models and sensor localization are covered
The Simulation Model Partitioning Problem: an Adaptive Solution Based on Self-Clustering (Extended Version)
This paper is about partitioning in parallel and distributed simulation. That
means decomposing the simulation model into a numberof components and to
properly allocate them on the execution units. An adaptive solution based on
self-clustering, that considers both communication reduction and computational
load-balancing, is proposed. The implementation of the proposed mechanism is
tested using a simulation model that is challenging both in terms of structure
and dynamicity. Various configurations of the simulation model and the
execution environment have been considered. The obtained performance results
are analyzed using a reference cost model. The results demonstrate that the
proposed approach is promising and that it can reduce the simulation execution
time in both parallel and distributed architectures
- …