Modeling Scalability of Distributed Machine Learning
Present-day machine learning is computationally intensive and processes large
amounts of data. To cope with this scale, the work is implemented in a
distributed fashion and parallelized across a number of computing
nodes. It is usually hard to estimate in advance how many nodes to use for a
particular workload. We propose a simple framework for estimating the
scalability of distributed machine learning algorithms. We measure the
scalability by means of the speedup an algorithm achieves with more nodes. We
propose time complexity models for gradient descent and graphical model
inference. We validate our models with experiments on deep learning training
and belief propagation. This framework was used to study the scalability of
machine learning algorithms in Apache Spark.
Comment: 6 pages, 4 figures, appears at ICDE 201
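As a rough illustration of the kind of model this abstract describes, here is a minimal sketch in Python. It assumes a generic, hypothetical cost model (the `work` and `comm` parameters are illustrative, not the paper's actual gradient-descent or belief-propagation models) in which per-node compute shrinks with the node count while coordination cost grows:

```python
def predicted_time(p, work=1000.0, comm=1.0):
    # Assumed cost model: compute is divided across p nodes,
    # while communication/coordination overhead grows with p.
    return work / p + comm * p

def speedup(p, **kw):
    # Speedup relative to a single node under the same model.
    return predicted_time(1, **kw) / predicted_time(p, **kw)
```

Under such a model the speedup curve rises, peaks, and then falls as communication dominates, which is exactly the behavior one would want to predict before committing nodes to a workload.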
A Parallel General Purpose Multi-Objective Optimization Framework, with Application to Beam Dynamics
Particle accelerators are invaluable tools for research in the basic and
applied sciences, in fields such as materials science, chemistry, the
biosciences, particle physics, nuclear physics and medicine. The design,
commissioning, and operation of accelerator facilities are non-trivial tasks,
due to the large number of control parameters and the complex interplay of
several conflicting design goals. We propose to tackle this problem by means of
multi-objective optimization algorithms which also facilitate a parallel
deployment. In order to compute solutions in a meaningful time frame a fast and
scalable software framework is required. In this paper, we present the
implementation of such a general-purpose framework for simulation-based
multi-objective optimization methods that allows the automatic investigation of
optimal sets of machine parameters. The implementation is based on a
master/slave paradigm, employing several masters that govern a set of slaves
executing simulations and performing optimization tasks. Using evolutionary
algorithms as the optimizer and OPAL as the forward solver, validation
experiments and results of multi-objective optimization problems in the domain
of beam dynamics are presented. The high charge beam line at the Argonne
Wakefield Accelerator Facility was used as the beam dynamics model. The 3D beam
size, transverse momentum, and energy spread were optimized.
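The core bookkeeping behind any multi-objective optimizer of this kind is Pareto dominance: keeping the set of trade-off solutions among which no objective can improve without worsening another. A minimal sketch (illustrative names, assuming minimization; this is not the framework's actual code):

```python
def dominates(a, b):
    # a dominates b (minimization) if a is no worse in every
    # objective and strictly better in at least one.
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    # Keep every point that no other point dominates.
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

An evolutionary optimizer of the kind the paper uses repeatedly evaluates candidate machine settings (here, via expensive forward simulations) and retains such a non-dominated front.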
Automated, Parallel Optimization Algorithms for Stochastic Functions
Optimization algorithms for stochastic functions are needed for real-world and simulation applications where results are obtained by sampling and contain experimental error or random noise. We have developed a series of stochastic optimization algorithms based on the well-known classical downhill simplex algorithm. Our parallel implementation of these optimization algorithms, using a framework called MW, is based on a master-worker architecture in which each worker runs a massively parallel program. This parallel implementation allows the sampling to proceed independently on many processors, as demonstrated by scaling up to more than 100 vertices and 300 cores. The framework is well suited to clusters with an ever-increasing number of cores per node. The new algorithms have been successfully applied to the reparameterization of a model for liquid water, achieving thermodynamic and structural results that are better than those of a standard model used in molecular simulations, with the added advantage of a fully automated parameterization process.
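The noise-handling idea — averaging many independent samples before each simplex comparison, which is what the paper parallelizes across workers — can be sketched as follows. This is a simplified downhill-simplex variant with illustrative parameters, not the authors' MW-based implementation; `noisy_quadratic` is a stand-in objective:

```python
import random

def noisy_quadratic(x):
    # Hypothetical objective: true minimum at (1, 2), plus random noise.
    return (x[0] - 1) ** 2 + (x[1] - 2) ** 2 + random.gauss(0, 0.01)

def averaged(f, x, samples=20):
    # Average repeated samples to suppress noise; in the paper these
    # samples run independently in parallel on many workers.
    return sum(f(x) for _ in range(samples)) / samples

def downhill_simplex(f, simplex, iters=200):
    # Simplified Nelder-Mead: reflection, expansion, contraction.
    for _ in range(iters):
        simplex.sort(key=lambda v: averaged(f, v))
        best, worst = simplex[0], simplex[-1]
        centroid = [sum(v[i] for v in simplex[:-1]) / (len(simplex) - 1)
                    for i in range(len(best))]
        reflect = [2 * c - w for c, w in zip(centroid, worst)]
        if averaged(f, reflect) < averaged(f, best):
            expand = [3 * c - 2 * w for c, w in zip(centroid, worst)]
            simplex[-1] = expand if averaged(f, expand) < averaged(f, reflect) else reflect
        elif averaged(f, reflect) < averaged(f, worst):
            simplex[-1] = reflect
        else:
            simplex[-1] = [(c + w) / 2 for c, w in zip(centroid, worst)]
    return min(simplex, key=lambda v: averaged(f, v))

random.seed(0)
result = downhill_simplex(noisy_quadratic, [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
```

Because each vertex evaluation is an independent average of samples, the sampling work is embarrassingly parallel — the property the MW master-worker layout exploits.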
SRL: Scaling Distributed Reinforcement Learning to Over Ten Thousand Cores
The ever-growing complexity of reinforcement learning (RL) tasks demands a
distributed RL system to efficiently generate and process a massive amount of
data to train intelligent agents. However, existing open-source libraries
suffer from various limitations, which impede their practical use in
challenging scenarios where large-scale training is necessary. While industrial
systems from OpenAI and DeepMind have achieved successful large-scale RL
training, their system architecture and implementation details remain
undisclosed to the community. In this paper, we present a novel abstraction on
the dataflows of RL training, which unifies practical RL training across
diverse applications into a general framework and enables fine-grained
optimizations. Following this abstraction, we develop a scalable, efficient,
and extensible distributed RL system called ReaLly Scalable RL (SRL). The
system architecture of SRL separates major RL computation components and allows
massively parallelized training. Moreover, SRL offers user-friendly and
extensible interfaces for customized algorithms. Our evaluation shows that SRL
outperforms existing academic libraries in both a single machine and a
medium-sized cluster. In a large-scale cluster, the novel architecture of SRL
leads to up to 3.7x speedup compared to the design choices adopted by the
existing libraries. We also conduct a direct benchmark comparison to OpenAI's
industrial system, Rapid, in the challenging hide-and-seek environment. SRL
reproduces the same solution as reported by OpenAI with up to 5x speedup in
wall-clock time. Furthermore, we also examine the performance of SRL in a much
harder variant of the hide-and-seek environment and achieve substantial
learning speedup by scaling SRL to over 15k CPU cores and 32 A100 GPUs.
Notably, SRL is the first in the academic community to perform RL experiments
at such a large scale.
Comment: 15 pages, 12 figures, 6 tables
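The dataflow abstraction the abstract describes — decoupling sample generation from policy training so each can be parallelized separately — can be illustrated with a toy producer/consumer sketch. Python threads and a queue stand in for SRL's distributed components here; all names are illustrative:

```python
import queue
import threading

def actor(env_steps, out_q):
    # Sample worker: generates (obs, action, reward) tuples.
    # The environment is a stub; a real actor would step a simulator.
    for t in range(env_steps):
        out_q.put((t, t % 4, 1.0))
    out_q.put(None)  # sentinel: no more samples

def trainer(in_q, batch_size=8):
    # Policy worker: consumes fixed-size batches; "training" here
    # just counts gradient updates.
    batch, updates = [], 0
    while True:
        item = in_q.get()
        if item is None:
            break
        batch.append(item)
        if len(batch) == batch_size:
            updates += 1
            batch.clear()
    return updates

q = queue.Queue()
t = threading.Thread(target=actor, args=(64, q))
t.start()
updates = trainer(q)
t.join()
```

Separating the two roles behind a buffer is what lets a real system scale actors and trainers independently across machines.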
State of the Art in Parallel Computing with R
R is a mature open-source programming language for statistical computing and graphics. Many areas of statistical research are experiencing rapid growth in the size of data sets. Methodological advances drive increased use of simulations. A common approach is to use parallel computing. This paper presents an overview of techniques for parallel computing with R on computer clusters, on multi-core systems, and in grid computing. It reviews sixteen different packages, comparing them on their state of development, the parallel technology used, as well as on usability, acceptance, and performance. Two packages (snow, Rmpi) stand out as particularly suited to general use on computer clusters. Packages for grid computing are still in development, with only one package currently available to the end user. For multi-core systems five different packages exist, but a number of issues pose challenges to early adopters. The paper concludes with ideas for further developments in high performance computing with R. Example code is available in the appendix.
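The scatter-compute-gather pattern that packages like snow and Rmpi provide can be sketched with Python's standard library as an analogue (a thread pool here for portability; the R packages distribute work to separate R worker processes on a cluster):

```python
from concurrent.futures import ThreadPoolExecutor

def simulate(seed):
    # Stand-in for one independent simulation replicate:
    # a deterministic linear congruential recurrence.
    x = seed
    for _ in range(1000):
        x = (1103515245 * x + 12345) % 2**31
    return x % 100

# Scatter the replicates across workers, gather results in order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(simulate, range(16)))
```

The key property, which `pool.map` shares with snow's `clusterApply`, is that results come back in input order regardless of which worker finished first.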
A Study of Optimal 4-bit Reversible Toffoli Circuits and Their Synthesis
Optimal synthesis of reversible functions is a non-trivial problem. One of
the major limiting factors in computing such circuits is the sheer number of
reversible functions. Even restricting synthesis to 4-bit reversible functions
results in a huge search space (16! ≈ 2^44 functions). The output of
such a search alone, counting only the space required to list Toffoli gates for
every function, would require over 100 terabytes of storage. In this paper, we
present two algorithms: one, that synthesizes an optimal circuit for any 4-bit
reversible specification, and another that synthesizes all optimal
implementations. We employ several techniques to make the problem tractable. We
report results from several experiments, including synthesis of all optimal
4-bit permutations, synthesis of random 4-bit permutations, optimal synthesis
of all 4-bit linear reversible circuits, synthesis of existing benchmark
functions; we compose a list of the hardest permutations to synthesize, and
show distribution of optimal circuits. We further illustrate that our proposed
approach may be extended to accommodate physical constraints via reporting
LNN-optimal reversible circuits. Our results have important implications in the
design and optimization of reversible and quantum circuits, testing circuit
synthesis heuristics, and performing experiments in the area of quantum
information processing.
Comment: arXiv admin note: substantial text overlap with arXiv:1003.191
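Treating each gate as a permutation of the 16 four-bit states makes the search space concrete. A minimal sketch (illustrative, not the authors' synthesis code) that applies a generalized Toffoli gate and checks that a circuit implements a permutation:

```python
def toffoli(c1, c2, t):
    # Returns a function mapping a 4-bit state (int 0..15) to the
    # state after a Toffoli gate: flip bit t iff both controls are set.
    def apply(state):
        if state >> c1 & 1 and state >> c2 & 1:
            return state ^ (1 << t)
        return state
    return apply

def run_circuit(gates, state):
    # Compose the gates left to right.
    for g in gates:
        state = g(state)
    return state

# A tiny illustrative two-gate circuit.
circuit = [toffoli(0, 1, 2), toffoli(2, 3, 0)]
image = [run_circuit(circuit, s) for s in range(16)]
```

Since every gate is a bijection on the 16 states, any composition is again a permutation — which is why synthesis is a search over 16! target permutations.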
Dynamic Multigrain Parallelization on the Cell Broadband Engine
This paper addresses the problem of orchestrating and scheduling
parallelism at multiple levels of granularity on heterogeneous
multicore processors. We present policies and mechanisms for adaptive
exploitation and scheduling of multiple layers of parallelism on the
Cell Broadband Engine. Our policies combine event-driven task
scheduling with malleable loop-level parallelism, which is exposed
from the runtime system whenever task-level parallelism leaves cores
idle. We present a runtime system for scheduling applications with
layered parallelism on Cell and investigate its potential with RAxML,
a computational biology application which infers large phylogenetic
trees, using the Maximum Likelihood (ML) method. Our experiments show
that the Cell benefits significantly from dynamic parallelization
methods that selectively exploit the layers of parallelism in the
system in response to workload characteristics. Our runtime
environment outperforms naive parallelization and scheduling based on
MPI and Linux by up to a factor of 2.6. We are able to execute RAxML
on one Cell four times faster than on a dual-processor system with
Hyperthreaded Xeon processors, and 5-10% faster than on a
single-processor system with a dual-core, quad-thread IBM Power5
processor.
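The central policy here — handing loop-level chunks to cores that task-level parallelism leaves idle — can be caricatured in a few lines. Python threads stand in for Cell cores; `CORES` and the chunking rule are illustrative, not the paper's runtime system:

```python
from concurrent.futures import ThreadPoolExecutor

CORES = 4

def schedule(tasks, loop_iters):
    # Task-level parallelism first; any idle cores then absorb
    # loop-level chunks (malleable loop parallelism).
    idle = CORES - len(tasks)
    chunks = []
    if idle > 0:
        step = (loop_iters + idle - 1) // idle  # one chunk per idle core
        chunks = [range(i, min(i + step, loop_iters))
                  for i in range(0, loop_iters, step)]
    with ThreadPoolExecutor(max_workers=CORES) as pool:
        task_results = list(pool.map(lambda f: f(), tasks))
        chunk_sums = list(pool.map(lambda c: sum(c), chunks))
    return task_results, sum(chunk_sums)
```

With two tasks on four cores, the two idle cores split the loop between them — the same adaptivity that the paper's runtime applies when task-level parallelism alone cannot keep the Cell busy.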