3,564 research outputs found
Dynamic Control Flow in Large-Scale Machine Learning
Many recent machine learning models rely on fine-grained dynamic control flow
for training and inference. In particular, models based on recurrent neural
networks and on reinforcement learning depend on recurrence relations,
data-dependent conditional execution, and other features that call for dynamic
control flow. These applications benefit from the ability to make rapid
control-flow decisions across a set of computing devices in a distributed
system. For performance, scalability, and expressiveness, a machine learning
system must support dynamic control flow in distributed and heterogeneous
environments.
This paper presents a programming model for distributed machine learning that
supports dynamic control flow. We describe the design of the programming model,
and its implementation in TensorFlow, a distributed machine learning system.
Our approach extends the use of dataflow graphs to represent machine learning
models, offering several distinctive features. First, the branches of
conditionals and bodies of loops can be partitioned across many machines to run
on a set of heterogeneous devices, including CPUs, GPUs, and custom ASICs.
Second, programs written in our model support automatic differentiation and
distributed gradient computations, which are necessary for training machine
learning models that use control flow. Third, our choice of non-strict
semantics enables multiple loop iterations to execute in parallel across
machines, and to overlap compute and I/O operations.
We have done our work in the context of TensorFlow, and it has been used
extensively in research and production. We evaluate it using several real-world
applications, and demonstrate its performance and scalability.Comment: Appeared in EuroSys 2018. 14 pages, 16 figure
OpenCL Actors - Adding Data Parallelism to Actor-based Programming with CAF
The actor model of computation has been designed for a seamless support of
concurrency and distribution. However, it remains unspecific about data
parallel program flows, while available processing power of modern many core
hardware such as graphics processing units (GPUs) or coprocessors increases the
relevance of data parallelism for general-purpose computation.
In this work, we introduce OpenCL-enabled actors to the C++ Actor Framework
(CAF). This offers a high level interface for accessing any OpenCL device
without leaving the actor paradigm. The new type of actor is integrated into
the runtime environment of CAF and gives rise to transparent message passing in
distributed systems on heterogeneous hardware. Following the actor logic in
CAF, OpenCL kernels can be composed while encapsulated in C++ actors, hence
operate in a multi-stage fashion on data resident at the GPU. Developers are
thus enabled to build complex data parallel programs from primitives without
leaving the actor paradigm, nor sacrificing performance. Our evaluations on
commodity GPUs, an Nvidia TESLA, and an Intel PHI reveal the expected linear
scaling behavior when offloading larger workloads. For sub-second duties, the
efficiency of offloading was found to largely differ between devices. Moreover,
our findings indicate a negligible overhead over programming with the native
OpenCL API.Comment: 28 page
Efficient and Reasonable Object-Oriented Concurrency
Making threaded programs safe and easy to reason about is one of the chief
difficulties in modern programming. This work provides an efficient execution
model for SCOOP, a concurrency approach that provides not only data race
freedom but also pre/postcondition reasoning guarantees between threads. The
extensions we propose influence both the underlying semantics to increase the
amount of concurrent execution that is possible, exclude certain classes of
deadlocks, and enable greater performance. These extensions are used as the
basis an efficient runtime and optimization pass that improve performance 15x
over a baseline implementation. This new implementation of SCOOP is also 2x
faster than other well-known safe concurrent languages. The measurements are
based on both coordination-intensive and data-manipulation-intensive benchmarks
designed to offer a mixture of workloads.Comment: Proceedings of the 10th Joint Meeting of the European Software
Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of
Software Engineering (ESEC/FSE '15). ACM, 201
STAPL-RTS: A Runtime System for Massive Parallelism
Modern High Performance Computing (HPC) systems are complex, with deep memory hierarchies and increasing use of computational heterogeneity via accelerators. When developing applications for these platforms, programmers are faced with two bad choices. On one hand, they can explicitly manage machine resources, writing programs using low level primitives from multiple APIs (e.g., MPI+OpenMP), creating efficient but rigid, difficult to extend, and non-portable implementations. Alternatively, users can adopt higher level programming environments, often at the cost of lost performance. Our approach is to maintain the high level nature of the application without sacrificing performance by relying on the transfer of high level, application semantic knowledge between layers of the software stack at an appropriate level of abstraction and performing optimizations on a per-layer basis. In this dissertation, we present the STAPL Runtime System (STAPL-RTS), a runtime system built for portable performance, suitable for massively parallel machines. While the STAPL-RTS abstracts and virtualizes the underlying platform for portability, it uses information from the upper layers to perform the appropriate low level optimizations that restore the performance characteristics.
We outline the fundamental ideas behind the design of the STAPL-RTS, such as the always distributed communication model and its asynchronous operations. Through appropriate code examples and benchmarks, we prove that high level information allows applications written on top of the STAPL-RTS to attain the performance of optimized, but ad hoc solutions. Using the STAPL library, we demonstrate how this information guides important decisions in the STAPL-RTS, such as multi-protocol communication coordination and request aggregation using established C++ programming idioms.
Recognizing that nested parallelism is of increasing interest for both expressivity and performance, we present a parallel model that combines asynchronous, one-sided operations with isolated nested parallel sections. Previous approaches to nested parallelism targeted either static applications through the use of blocking, isolated sections, or dynamic applications by using asynchronous mechanisms (i.e., recursive task spawning) which come at the expense of isolation. We combine the flexibility of dynamic task creation with the isolation guarantees of the static models by allowing the creation of asynchronous, one-sided nested parallel sections that work in tandem with the more traditional, synchronous, collective nested parallelism. This allows selective, run-time customizable use of parallelism in an application, based on the input and the algorithm
Recommended from our members
Preparing sparse solvers for exascale computing.
Sparse solvers provide essential functionality for a wide variety of scientific applications. Highly parallel sparse solvers are essential for continuing advances in high-fidelity, multi-physics and multi-scale simulations, especially as we target exascale platforms. This paper describes the challenges, strategies and progress of the US Department of Energy Exascale Computing project towards providing sparse solvers for exascale computing platforms. We address the demands of systems with thousands of high-performance node devices where exposing concurrency, hiding latency and creating alternative algorithms become essential. The efforts described here are works in progress, highlighting current success and upcoming challenges. This article is part of a discussion meeting issue 'Numerical algorithms for high-performance computational science'
- …