1,236 research outputs found
Taming Resource Heterogeneity In Distributed ML Training With Dynamic Batching
Current techniques and systems for distributed model training mostly assume
that clusters are comprised of homogeneous servers with a constant resource
availability. However, cluster heterogeneity is pervasive in computing
infrastructure, and is a fundamental characteristic of low-cost transient
resources (such as EC2 spot instances). In this paper, we develop a dynamic
batching technique for distributed data-parallel training that adjusts the
mini-batch sizes on each worker based on its resource availability and
throughput. Our mini-batch controller seeks to equalize iteration times on all
workers, and facilitates training on clusters comprised of servers with
different amounts of CPU and GPU resources. This variable mini-batch technique
uses proportional control and ideas from PID controllers to find stable
mini-batch sizes. Our empirical evaluation shows that dynamic batching can
reduce model training times by more than 4x on heterogeneous clusters
TensorFlow Enabled Genetic Programming
Genetic Programming, a kind of evolutionary computation and machine learning
algorithm, is shown to benefit significantly from the application of vectorized
data and the TensorFlow numerical computation library on both CPU and GPU
architectures. The open source, Python Karoo GP is employed for a series of 190
tests across 6 platforms, with real-world datasets ranging from 18 to 5.5M data
points. This body of tests demonstrates that datasets measured in tens and
hundreds of data points see 2-15x improvement when moving from the scalar/SymPy
configuration to the vector/TensorFlow configuration, with a single core
performing on par or better than multiple CPU cores and GPUs. A dataset
composed of 90,000 data points demonstrates a single vector/TensorFlow CPU core
performing 875x better than 40 scalar/Sympy CPU cores. And a dataset containing
5.5M data points sees GPU configurations out-performing CPU configurations on
average by 1.3x.Comment: 8 pages, 5 figures; presented at GECCO 2017, Berlin, German
Petuum: A New Platform for Distributed Machine Learning on Big Data
What is a systematic way to efficiently apply a wide spectrum of advanced ML
programs to industrial scale problems, using Big Models (up to 100s of billions
of parameters) on Big Data (up to terabytes or petabytes)? Modern
parallelization strategies employ fine-grained operations and scheduling beyond
the classic bulk-synchronous processing paradigm popularized by MapReduce, or
even specialized graph-based execution that relies on graph representations of
ML programs. The variety of approaches tends to pull systems and algorithms
design in different directions, and it remains difficult to find a universal
platform applicable to a wide range of ML programs at scale. We propose a
general-purpose framework that systematically addresses data- and
model-parallel challenges in large-scale ML, by observing that many ML programs
are fundamentally optimization-centric and admit error-tolerant,
iterative-convergent algorithmic solutions. This presents unique opportunities
for an integrative system design, such as bounded-error network synchronization
and dynamic scheduling based on ML program structure. We demonstrate the
efficacy of these system designs versus well-known implementations of modern ML
algorithms, allowing ML programs to run in much less time and at considerably
larger model sizes, even on modestly-sized compute clusters.Comment: 15 pages, 10 figures, final version in KDD 2015 under the same titl
Straggler-Resilient Distributed Computing
In reference to IEEE copyrighted material which is used with permission in this thesis, the IEEE does not endorse any of University of Bergen's products or services. Internal or personal use of this material is permitted. If interested in reprinting/republishing IEEE copyrighted material for advertising or promotional purposes or for creating new collective works for resale or redistribution, please go to http://www.ieee.org/publications_standards/publications/rights/rights_link.html to learn how to obtain a License from RightsLink.Utbredelsen av distribuerte datasystemer har økt betydelig de siste årene. Dette skyldes først og fremst at behovet for beregningskraft øker raskere enn hastigheten til en enkelt datamaskin, slik at vi må bruke flere datamaskiner for å møte etterspørselen, og at det blir stadig mer vanlig at systemer er spredt over et stort geografisk område. Dette paradigmeskiftet medfører mange tekniske utfordringer. En av disse er knyttet til "straggler"-problemet, som er forårsaket av forsinkelsesvariasjoner i distribuerte systemer, der en beregning forsinkes av noen få langsomme noder slik at andre noder må vente før de kan fortsette. Straggler-problemet kan svekke effektiviteten til distribuerte systemer betydelig i situasjoner der en enkelt node som opplever en midlertidig overbelastning kan låse et helt system.
I denne avhandlingen studerer vi metoder for å gjøre beregninger av forskjellige typer motstandsdyktige mot slike problemer, og dermed gjøre det mulig for et distribuert system å fortsette til tross for at noen noder ikke svarer i tide. Metodene vi foreslår er skreddersydde for spesielle typer beregninger. Vi foreslår metoder tilpasset distribuert matrise-vektor-multiplikasjon (som er en grunnleggende operasjon i mange typer beregninger), distribuert maskinlæring og distribuert sporing av en tilfeldig prosess (for eksempel det å spore plasseringen til kjøretøy for å unngå kollisjon). De foreslåtte metodene utnytter redundans som enten blir introdusert som en del av metoden, eller som naturlig eksisterer i det underliggende problemet, til å kompensere for manglende delberegninger. For en av de foreslåtte metodene utnytter vi redundans for også å øke effektiviteten til kommunikasjonen mellom noder, og dermed redusere mengden data som må kommuniseres over nettverket. I likhet med straggler-problemet kan slik kommunikasjon begrense effektiviteten i distribuerte systemer betydelig. De foreslåtte metodene gir signifikante forbedringer i ventetid og pålitelighet sammenlignet med tidligere metoder.The number and scale of distributed computing systems being built have increased significantly in recent years. Primarily, that is because: i) our computing needs are increasing at a much higher rate than computers are becoming faster, so we need to use more of them to meet demand, and ii) systems that are fundamentally distributed, e.g., because the components that make them up are geographically distributed, are becoming increasingly prevalent. This paradigm shift is the source of many engineering challenges. Among them is the straggler problem, which is a problem caused by latency variations in distributed systems, where faster nodes are held up by slower ones. The straggler problem can significantly impair the effectiveness of distributed systems—a single node experiencing a transient outage (e.g., due to being overloaded) can lock up an entire system.
In this thesis, we consider schemes for making a range of computations resilient against such stragglers, thus allowing a distributed system to proceed in spite of some nodes failing to respond on time. The schemes we propose are tailored for particular computations. We propose schemes designed for distributed matrix-vector multiplication, which is a fundamental operation in many computing applications, distributed machine learning—in the form of a straggler-resilient first-order optimization method—and distributed tracking of a time-varying process (e.g., tracking the location of a set of vehicles for a collision avoidance system). The proposed schemes rely on exploiting redundancy that is either introduced as part of the scheme, or exists naturally in the underlying problem, to compensate for missing results, i.e., they are a form of forward error correction for computations. Further, for one of the proposed schemes we exploit redundancy to also improve the effectiveness of multicasting, thus reducing the amount of data that needs to be communicated over the network. Such inter-node communication, like the straggler problem, can significantly limit the effectiveness of distributed systems. For the schemes we propose, we are able to show significant improvements in latency and reliability compared to previous schemes.Doktorgradsavhandlin
- …