Asynchronous Parallel Stochastic Gradient Descent - A Numeric Core for Scalable Distributed Machine Learning Algorithms
The implementation of a vast majority of machine learning (ML) algorithms
boils down to solving a numerical optimization problem. In this context,
Stochastic Gradient Descent (SGD) methods have long proven to provide good
results, both in terms of convergence and accuracy. Recently, several
parallelization approaches have been proposed in order to scale SGD to solve
very large ML problems. At their core, most of these approaches follow a
map-reduce scheme. This paper presents a novel parallel updating algorithm for
SGD, which utilizes the asynchronous single-sided communication paradigm.
Compared to existing methods, Asynchronous Parallel Stochastic Gradient Descent
(ASGD) provides faster (or at least equal) convergence, close-to-linear scaling,
and stable accuracy.
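The abstract describes the asynchronous update scheme only at a high level. As a rough illustration of the underlying idea, and not of the paper's actual ASGD algorithm, the following Python sketch runs several worker threads that update a shared parameter vector without any barrier between updates; the toy least-squares problem, step size, and batch size are placeholder choices.

```python
import threading
import numpy as np

# Toy least-squares problem; the shared parameter vector w is updated
# by all workers concurrently, without locks or barriers.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=1000)
w = np.zeros(10)

def worker(seed, steps=500, batch=32, lr=0.01):
    local_rng = np.random.default_rng(seed)  # per-thread RNG
    for _ in range(steps):
        idx = local_rng.integers(0, len(X), size=batch)
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch
        w[:] -= lr * grad  # unsynchronized in-place update of shared state

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("final loss:", float(np.mean((X @ w - y) ** 2)))
```

In a distributed setting, the shared vector would live in remote memory and the in-place update would become a one-sided write, which is where the single-sided communication paradigm mentioned above comes in.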
Using GPI-2 for Distributed Memory Parallelization of the Caffe Toolbox to Speed up Deep Neural Network Training
Deep Neural Networks (DNNs) are currently of great interest in research and
application. The training of these networks is a compute-intensive and
time-consuming task. To reduce training times to a bearable amount at
reasonable cost, we extend the popular Caffe toolbox for DNNs with an efficient
distributed-memory communication pattern. To achieve good scalability, we
emphasize the overlap of computation and communication and prefer fine-grained
synchronization patterns over global barriers. To implement these
communication patterns, we rely on the Global Address Space Programming
Interface version 2 (GPI-2) communication library. This interface provides a
lightweight set of asynchronous one-sided communication primitives,
supplemented by non-blocking, fine-grained data synchronization mechanisms.
Accordingly, our parallel version of Caffe is named CaffeGPI. First
benchmarks demonstrate better scaling behavior compared with other
extensions, e.g., Intel Caffe. Even within a single symmetric
multiprocessing machine with four graphics processing units, CaffeGPI
scales better than the standard Caffe toolbox. These first results
demonstrate that the use of standard High Performance Computing (HPC) hardware
is a valid cost-saving approach to train large DNNs. I/O is another
bottleneck when working with DNNs in a standard parallel HPC setting, which we
will consider in more detail in a forthcoming paper.
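GPI-2 itself is a C library, so the following Python sketch only mimics the communication pattern the abstract describes: gradient transfers are enqueued asynchronously, so that sending one layer's gradients overlaps with backpropagation through the next. All names are placeholders, not the CaffeGPI implementation.

```python
import time
from concurrent.futures import ThreadPoolExecutor

comm = ThreadPoolExecutor(max_workers=1)  # stands in for an async comm queue

def backprop_layer(layer):
    time.sleep(0.01)                      # placeholder for gradient computation
    return f"grad[{layer}]"

def send_gradients(grad):
    time.sleep(0.01)                      # placeholder for a one-sided write

pending = []
for layer in reversed(range(8)):          # backward pass, top layer first
    grad = backprop_layer(layer)
    # Enqueue the transfer and continue immediately, so communication of
    # layer k overlaps with the computation of layer k-1.
    pending.append(comm.submit(send_gradients, grad))

for p in pending:                         # fine-grained completion checks
    p.result()                            # instead of one global barrier
comm.shutdown()
```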
Balancing the Communication Load of Asynchronously Parallelized Machine Learning Algorithms
Stochastic Gradient Descent (SGD) is the standard numerical method used to
solve the core optimization problem for the vast majority of machine learning
(ML) algorithms. In the context of large-scale learning, as utilized by many
Big Data applications, efficient parallelization of SGD is the focus of
active research. Recently, we were able to show that the asynchronous
communication paradigm can be applied to achieve a fast and scalable
parallelization of SGD. Asynchronous Stochastic Gradient Descent (ASGD)
outperforms other, mostly MapReduce-based, parallel algorithms for solving
large-scale machine learning problems. In this paper, we investigate the impact
of asynchronous communication frequency and message size on the performance of
ASGD applied to large-scale ML in HTC cluster and cloud environments. We
introduce a novel algorithm for the automatic balancing of the asynchronous
communication load, which makes it possible to adapt ASGD to changing network
bandwidths
and latencies.
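The abstract does not spell out the balancing algorithm, so the sketch below only illustrates the general idea in Python: a worker adapts how often it sends updates based on measured latency, so that the communication load tracks the available network capacity. The controller rule, thresholds, and names are hypothetical.

```python
class CommScheduler:
    """Hypothetical controller that adapts communication frequency."""

    def __init__(self, interval=10, min_interval=1, max_interval=1024):
        self.interval = interval           # SGD steps between sends
        self.min_interval = min_interval
        self.max_interval = max_interval

    def observe(self, measured_latency, target_latency=0.005):
        # Back off when the network looks congested, send more often
        # when latency is low; keep the interval within sane bounds.
        if measured_latency > target_latency:
            self.interval = min(self.interval * 2, self.max_interval)
        else:
            self.interval = max(self.interval // 2, self.min_interval)

    def should_send(self, step):
        return step % self.interval == 0
```

A worker would call should_send(step) once per SGD iteration and feed the round-trip time of each completed transfer back through observe().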
Personal and Political: A Micro-history of the “Red Column” Collective Farm, 1935-36
This article investigates the confluence of personal interests and official policy on collective farms in the mid-1930s, a period that has received far less
scholarly attention than the collectivization drive. The current historiography on collective farmers’ relationship with the state is one-sided, presenting peasants either as passive victims of, or idealized resisters to, state policies. Both views minimize the complex realities that governed the everyday lives of collective farmers, for whom state policies were often secondary to local concerns. This paper, which draws upon rich archival materials in Kirov Krai, employs a micro-historical approach to study the struggle to remove the chairman of the “Red Column” collective farm in Kirov Krai in 1935-36. It demonstrates that local and personal issues (family ties, grudges, and personality traits) had more influence on how collective farmers reacted to state campaigns and investigations than did official state policy and rhetoric. The chairman’s rude and arrogant behavior, mistreatment of the collective farmers, and flaunting of material goods led to his downfall. But to strengthen their arguments, his opponents accused him of associating with kulaks and white guardists. The chairman and his supporters struck back, alleging that his detractors were themselves white guardists and kulaks, who sought revenge for having been expelled from the collective farm.
Such a micro-historical approach reveals the importance of popular opinion, attitudes, and behavior on collective farms and the level of control that collective farmers had over shaping the implementation of state policies. The paper shows that peasants knew well how to manipulate official labels, such as kulak or class enemy, as weapons to achieve goals of local and personal importance. It enriches the historiography by offering a different way to appreciate peasant attitudes and behavior, and collective farm life in the mid-1930s.
Optimization of Computationally and I/O Intense Patterns in Electronic Structure and Machine Learning Algorithms
Development of scalable High-Performance Computing (HPC) applications is already a challenging task even in the
pre-Exascale era. Utilization of the full potential of (near-)future supercomputers will most likely require the mastery
of massively parallel heterogeneous architectures with multi-tier persistence systems, ideally in fault-tolerant mode.
With the change in hardware architectures, HPC applications are also widening their scope to 'Big Data' processing and
analytics using machine learning algorithms and neural networks. In this work, in cooperation with the INTERTWinE
FET-HPC project, we demonstrate how the GASPI (Global Address Space Programming Interface) programming model
helps to address these Exascale challenges, using the examples of tensor contraction, K-means, and Terasort algorithms.
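To make the K-means kernel mentioned above concrete, here is a hedged Python sketch of its local-accumulate / global-combine structure. Each "node" is simply a data partition; in a GASPI setting the partial sums and counts would be exchanged with asynchronous one-sided writes rather than combined in a single process as done here.

```python
import numpy as np

def local_step(points, centroids):
    # Assign each local point to its nearest centroid and accumulate
    # per-centroid partial sums and counts.
    d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    sums = np.zeros_like(centroids)
    counts = np.zeros(len(centroids))
    for j in range(len(centroids)):
        sums[j] = points[labels == j].sum(axis=0)
        counts[j] = np.count_nonzero(labels == j)
    return sums, counts

def global_combine(partials, centroids):
    # Reduce the partial results from all nodes into new centroids.
    sums = sum(s for s, _ in partials)
    counts = sum(c for _, c in partials)
    nonempty = counts > 0
    centroids[nonempty] = sums[nonempty] / counts[nonempty][:, None]
    return centroids

rng = np.random.default_rng(0)
partitions = rng.normal(size=(4, 250, 2))   # four "nodes", 250 points each
centroids = rng.normal(size=(3, 2))
for _ in range(10):
    centroids = global_combine([local_step(p, centroids) for p in partitions],
                               centroids)
```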
A Theory of Partitioned Global Address Spaces
Partitioned global address space (PGAS) is a parallel programming model for
the development of applications on clusters. It provides a global address space
partitioned among the cluster nodes, and is supported in programming languages
like C, C++, and Fortran by means of APIs. In this paper we provide a formal
model for the semantics of single instruction, multiple data programs using
PGAS APIs. Our model reflects the main features of popular real-world APIs such
as SHMEM, ARMCI, GASNet, GPI, and GASPI.
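As a toy model of the partitioned global address space, the following Python class mirrors the structure shared by these APIs: each rank owns one segment, and any rank may read or write any segment directly. The class and method names are hypothetical and do not reproduce any of the listed libraries.

```python
import numpy as np

class PGAS:
    """Toy partitioned global address space: one segment per rank."""

    def __init__(self, ranks, segment_words):
        self.segment = [np.zeros(segment_words) for _ in range(ranks)]

    def put(self, target_rank, offset, values):
        # One-sided write: the target process takes no action.
        self.segment[target_rank][offset:offset + len(values)] = values

    def get(self, source_rank, offset, length):
        # One-sided read from a (possibly remote) segment.
        return self.segment[source_rank][offset:offset + length].copy()

space = PGAS(ranks=4, segment_words=16)
space.put(target_rank=2, offset=0, values=[1.0, 2.0])  # rank 0 writes into rank 2
print(space.get(source_rank=2, offset=0, length=2))    # [1. 2.]
```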
A key feature of PGAS is the support for one-sided communication: a node may
directly read and write the memory located at a remote node, without explicit
synchronization with the processes running on the remote side. One-sided
communication increases performance by decoupling process synchronization from
data transfer, but requires the programmer to reason about appropriate
synchronizations between reads and writes. As a second contribution, we propose
and investigate robustness, a criterion for correct synchronization of PGAS
programs. Robustness corresponds to acyclicity of a suitable happens-before
relation defined on PGAS computations. The requirement is finer than classical
data race freedom and rules out most false error reports.
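To see the kind of synchronization question robustness addresses, consider a classic flag-based handshake, sketched here with the toy PGAS class from above. On real hardware the two one-sided writes of process 0 may be delivered out of order, so process 1 can observe the flag before the payload; it is exactly such reorderings of remote accesses that the happens-before analysis has to account for. (The code below is sequential Python and therefore cannot exhibit the reordering itself; the comments mark where it would occur.)

```python
def process0(space):
    space.put(target_rank=1, offset=0, values=[42.0])  # payload
    # In a real PGAS API, an ordering fence or notification is needed
    # here; without it, the flag below may arrive before the payload.
    space.put(target_rank=1, offset=1, values=[1.0])   # "ready" flag

def process1(space):
    while space.get(source_rank=1, offset=1, length=1)[0] != 1.0:
        pass                                           # spin on the flag
    # Without the fence in process0, this read may return stale data.
    return space.get(source_rank=1, offset=0, length=1)
```

Calling process0(space) and then process1(space) in one Python process trivially succeeds; the bug only materializes under a runtime that actually delays remote writes.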
Our main result is an algorithm for checking robustness of PGAS programs. The
algorithm makes use of two insights. Using combinatorial arguments we first
show that, if a PGAS program is not robust, then there are computations in a
certain normal form that violate happens-before acyclicity. Intuitively,
normal-form computations delay remote accesses in an ordered way. We then
devise an algorithm that checks for cyclic normal-form computations.
Essentially, the algorithm is an emptiness check for a novel automaton model
that accepts normal-form computations in streaming fashion. Altogether, we
prove that the robustness problem is PSPACE-complete.
Stalin’s Constitution
Upon its adoption in December 1936, Soviet leaders hailed the new so-called Stalin Constitution as the most democratic in the world. Scholars have long scoffed at this claim, noting that the mass repression of 1937-1938 that followed rendered it a hollow document. This book focuses on the six-month-long popular discussion of the draft Constitution, which preceded its formal adoption in December 1936. Drawing on rich archival sources, this book uses the discussion of the draft 1936 Constitution to examine discourse between the central state leadership and citizens about the new Soviet social contract, which delineated the roles the state and citizens should play in developing socialism.