D$^2$: Decentralized Training over Decentralized Data
While training a machine learning model using multiple workers, each of which
collects data from its own sources, it is most useful when the data collected
by different workers are {\em unique} and {\em different}.
Ironically, recent analysis of decentralized parallel stochastic gradient
descent (D-PSGD) relies on the assumption that the data hosted on different
workers are {\em not too different}. In this paper, we ask the question: {\em
Can we design a decentralized parallel stochastic gradient descent algorithm
that is less sensitive to the data variance across workers?} We present
D$^2$, a novel decentralized parallel stochastic gradient descent algorithm
designed for large data variance among workers (imprecisely, "decentralized"
data). The core of D$^2$ is a variance reduction extension of
the standard D-PSGD algorithm, which improves the convergence rate from
$O\left(\sigma/\sqrt{nT} + (n\zeta^2)^{1/3}/T^{2/3}\right)$ to
$O\left(\sigma/\sqrt{nT}\right)$, where $\zeta^2$
denotes the variance among data on different workers. As a result, D$^2$ is
robust to data variance among workers. We empirically evaluate D$^2$ on image
classification tasks where each worker has access to only the data of a limited
set of labels, and find that D$^2$ significantly outperforms D-PSGD.
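The decentralized training pattern that D-PSGD follows can be sketched in a few lines. The toy below (my illustration; all names and constants are assumptions, not the paper's code) runs gossip-style SGD on a ring, where each worker's data pulls it toward a different optimum, mimicking large cross-worker data variance:

```python
# A minimal sketch (not the authors' code) of decentralized parallel SGD
# (D-PSGD): workers on a ring each take a local gradient step and then
# average their model with their two neighbours.
import numpy as np

rng = np.random.default_rng(0)
n_workers, dim, steps, lr = 4, 5, 200, 0.1

# Each worker's data pulls it toward a different optimum, i.e. large
# data variance across workers ("decentralized" data).
targets = rng.normal(size=(n_workers, dim))   # per-worker optima
models = np.zeros((n_workers, dim))           # local models

for _ in range(steps):
    # Local gradient step on f_i(x) = 0.5 * ||x - targets[i]||^2.
    models -= lr * (models - targets)
    # Gossip averaging with ring neighbours (a doubly stochastic mixing).
    models = (np.roll(models, 1, axis=0) + models
              + np.roll(models, -1, axis=0)) / 3

# The average model converges to the minimizer of the average objective
# (the mean of the per-worker optima), but with a constant step size the
# individual workers keep a residual disagreement that scales with the
# cross-worker variance -- the sensitivity the abstract refers to.
print(np.linalg.norm(models.mean(axis=0) - targets.mean(axis=0)))
print(np.abs(models - models.mean(axis=0)).max())
```

The residual per-worker disagreement in this toy is the term a variance-reduction correction aims to remove.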
Kac-Schwarz Operators of Type $B$, Quantum Spectral Curves, and Spin Hurwitz Numbers
Given a tau-function of the BKP hierarchy, we discuss the relation between
its BKP-affine coordinates on the isotropic Sato Grassmannian and its
BKP-wave function. Using this result, we formulate a type of Kac-Schwarz
operators in terms of BKP-affine coordinates. As an example, we compute the
affine coordinates of the BKP tau-function for spin single Hurwitz numbers
with completed cycles, and find a pair of Kac-Schwarz operators for it. From
these operators we obtain the quantum spectral curve for spin single Hurwitz
numbers.
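For orientation, the following is the standard definition of a Kac-Schwarz operator (general background supplied here, not a statement from this paper):

```latex
% Background (standard definition, supplied for orientation).
% Let $W \subset \mathbb{C}((z))$ be the point of the Sato Grassmannian
% attached to a tau-function.  An operator $A$, polynomial in $z$ and
% $\partial_z$, is a Kac-Schwarz operator for that tau-function if it
% stabilizes $W$:
\[
  A \cdot W \subseteq W .
\]
% In the BKP setting one works with the isotropic Sato Grassmannian, so
% $W$ is additionally isotropic with respect to the relevant bilinear
% pairing.
```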
Distributed Learning over Unreliable Networks
Most of today's distributed machine learning systems assume {\em reliable
networks}: whenever two machines exchange information (e.g., gradients or
models), the network should guarantee the delivery of the message. At the same
time, recent work exhibits the impressive tolerance of machine learning
algorithms to errors or noise arising from relaxed communication or
synchronization. In this paper, we connect these two trends, and consider the
following question: {\em Can we design machine learning systems that are
tolerant to network unreliability during training?} With this motivation, we
focus on a theoretical problem of independent interest---given a standard
distributed parameter server architecture, if every communication between the
worker and the server has a non-zero probability of being dropped, does
there exist an algorithm that still converges, and at what speed? The technical
contribution of this paper is a novel theoretical analysis proving that
distributed learning over unreliable networks can achieve a convergence rate
comparable to centralized or distributed learning over reliable networks. Further, we
prove that the influence of the packet drop rate diminishes with the growth of
the number of parameter servers. We map this theoretical
result onto a real-world scenario, training deep neural networks over an
unreliable network layer, and conduct network simulations to validate the
system-level improvement obtained by allowing the network to be unreliable.
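The dropped-message setting can be simulated in a few lines. The sketch below (my illustration; names and constants are assumptions, not the paper's system) runs parameter-server SGD where every worker-to-server gradient message is lost independently with some probability, and the server simply averages whatever arrives:

```python
# Toy simulation (a sketch, not the paper's system) of data-parallel SGD
# through a parameter server when each worker -> server gradient message
# is dropped independently with probability `drop_prob`.
import numpy as np

rng = np.random.default_rng(1)
n_workers, dim, steps, lr, drop_prob = 8, 3, 300, 0.2, 0.3

target = rng.normal(size=dim)   # minimizer of the shared objective
x = np.zeros(dim)               # model held by the parameter server

for _ in range(steps):
    # Each worker computes a noisy gradient of f(x) = 0.5*||x - target||^2.
    grads = (x - target) + 0.1 * rng.normal(size=(n_workers, dim))
    arrived = rng.random(n_workers) > drop_prob   # which messages survive
    if arrived.any():
        # Averaging over a symmetric random subset keeps the update an
        # unbiased estimate of the full average gradient.
        x -= lr * grads[arrived].mean(axis=0)

print(np.linalg.norm(x - target))   # still converges despite 30% loss
```

The averaging-over-survivors step is why the update stays unbiased; the drops only increase its variance.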
PMLR Press
Recently there has been significant interest in training machine-learning models at low precision: by reducing precision, one can reduce computation and communication by an order of magnitude. We examine training at reduced precision, both from a theoretical and practical perspective, and ask: is it possible to train models at end-to-end low precision with provable guarantees? Can this lead to consistent order-of-magnitude speedups? We mainly focus on linear models, for which the answer is yes. We develop a simple framework called ZipML based on one simple but novel strategy called double sampling. Our ZipML framework is able to execute training at low precision with no bias, guaranteeing convergence, whereas naive quantization would introduce significant bias. We validate our framework across a range of applications, and show that it enables an FPGA prototype that is up to 6.5× faster than an implementation using full 32-bit precision. We further develop a variance-optimal stochastic quantization strategy and show that it can make a significant difference in a variety of settings. When applied to linear models together with double sampling, we save up to another 1.7× in data movement compared with uniform quantization. When training deep networks with quantized models, we achieve higher accuracy than the state-of-the-art XNOR-Net.
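The double-sampling idea can be checked numerically. The sketch below (my illustration; `quantize`, `step`, and the constants are assumptions, not the paper's API) uses unbiased stochastic rounding and compares the bias of a product built from one quantized sample against one built from two independent samples, which is the kind of quadratic term a linear-model gradient a(a·x − b) contains:

```python
# Sketch of the "double sampling" idea: stochastic rounding gives an
# unbiased quantizer, E[Q(x)] = x, but the product Q(a)*Q(a) of a single
# sample is biased upward by the quantization variance.  Two independent
# quantized samples Q1(a), Q2(a) make the product unbiased again.
import numpy as np

rng = np.random.default_rng(2)

def quantize(x, step=0.25):
    """Unbiased stochastic rounding onto a grid of spacing `step`."""
    scaled = x / step
    low = np.floor(scaled)
    # Round up with probability equal to the fractional part, so E[Q(x)] = x.
    return step * (low + (rng.random(x.shape) < scaled - low))

a = rng.uniform(-1.0, 1.0, size=1000)

trials = 2000
single = np.zeros_like(a)   # accumulates Q(a) * Q(a), one shared sample
double = np.zeros_like(a)   # accumulates Q1(a) * Q2(a), independent samples
for _ in range(trials):
    q = quantize(a)
    single += q * q               # biased: E[q*q] = a^2 + Var[Q(a)]
    double += q * quantize(a)     # unbiased: E[q1*q2] = a^2
single /= trials
double /= trials

bias_single = np.abs(single - a * a).mean()
bias_double = np.abs(double - a * a).mean()
print(bias_single, bias_double)   # double sampling shows much smaller bias
```

Drawing the second sample independently is what removes the Var[Q(a)] term from the expectation, at the cost of transmitting one extra quantized copy of the data.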