RLlib: Abstractions for Distributed Reinforcement Learning
Reinforcement learning (RL) algorithms involve the deep nesting of highly
irregular computation patterns, each of which typically exhibits opportunities
for distributed computation. We argue for distributing RL components in a
composable way by adapting algorithms for top-down hierarchical control,
thereby encapsulating parallelism and resource requirements within
short-running compute tasks. We demonstrate the benefits of this principle
through RLlib: a library that provides scalable software primitives for RL.
These primitives enable a broad range of algorithms to be implemented with high
performance, scalability, and substantial code reuse. RLlib is available at
https://rllib.io/.
Comment: Published in the International Conference on Machine Learning (ICML 2018), 10 pages.
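To give a flavor of the high-level API such primitives enable, here is a minimal training sketch against RLlib's pre-2.0 Python trainer interface; the class and option names are tied to that API version and should be treated as assumptions:

# Minimal sketch of training a PPO agent with RLlib's high-level API
# (pre-2.0 Ray/RLlib; names are illustrative).
import ray
from ray.rllib.agents.ppo import PPOTrainer

ray.init()

# Parallelism is expressed declaratively: the trainer encapsulates the
# rollout workers and their resource requirements.
trainer = PPOTrainer(env="CartPole-v1", config={"num_workers": 4})

for _ in range(10):
    result = trainer.train()  # one top-down controlled training iteration
    print(result["episode_reward_mean"])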
TensorFlow: A system for large-scale machine learning
TensorFlow is a machine learning system that operates at large scale and in
heterogeneous environments. TensorFlow uses dataflow graphs to represent
computation, shared state, and the operations that mutate that state. It maps
the nodes of a dataflow graph across many machines in a cluster, and within a
machine across multiple computational devices, including multicore CPUs,
general-purpose GPUs, and custom-designed ASICs known as Tensor Processing
Units (TPUs). This architecture gives flexibility to the application developer:
whereas in previous "parameter server" designs the management of shared state
is built into the system, TensorFlow enables developers to experiment with
novel optimizations and training algorithms. TensorFlow supports a variety of
applications, with particularly strong support for training and inference on
deep neural networks. Several Google services use TensorFlow in production, we
have released it as an open-source project, and it has become widely used for
machine learning research. In this paper, we describe the TensorFlow dataflow
model in contrast to existing systems, and demonstrate the compelling
performance that TensorFlow achieves for several real-world applications.
Comment: 18 pages, 9 figures; v2 has a spelling correction in the metadata.
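A minimal sketch of the dataflow model described above, written against the TensorFlow 1.x graph-and-session API that this paper describes (TF 2.x builds equivalent graphs via tf.function):

# Sketch: computation as a dataflow graph with mutable shared state.
# Uses the TensorFlow 1.x API current when the paper was written.
import tensorflow as tf

graph = tf.Graph()
with graph.as_default():
    x = tf.placeholder(tf.float32, shape=[None, 4], name="x")
    w = tf.Variable(tf.zeros([4, 1]), name="w")   # shared, mutable state
    y = tf.matmul(x, w)                           # a stateless dataflow op
    update = w.assign_add(tf.ones([4, 1]))        # an op that mutates state

# The runtime maps graph nodes onto available devices (CPUs, GPUs, TPUs).
with tf.Session(graph=graph) as sess:
    sess.run(w.initializer)
    print(sess.run(y, feed_dict={x: [[1.0, 2.0, 3.0, 4.0]]}))
    sess.run(update)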
AdaNet: A Scalable and Flexible Framework for Automatically Learning Ensembles
AdaNet is a lightweight TensorFlow-based (Abadi et al., 2015) framework for
automatically learning high-quality ensembles with minimal expert intervention.
Our framework is inspired by the AdaNet algorithm (Cortes et al., 2017) which
learns the structure of a neural network as an ensemble of subnetworks. We
designed it to: (1) integrate with the existing TensorFlow ecosystem, (2) offer
sensible default search spaces to perform well on novel datasets, (3) present a
flexible API to utilize expert information when available, and (4) efficiently
accelerate training with distributed CPU, GPU, and TPU hardware. The code is
open-source and available at: https://github.com/tensorflow/adanet
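As a rough sketch of how the framework plugs into the TensorFlow Estimator ecosystem, the example below uses adanet's AutoEnsembleEstimator; the exact constructor arguments are assumptions, so consult the repository for the current API:

# Rough sketch of learning an ensemble with adanet's estimator-style API.
# Constructor arguments here are assumptions; see the repository docs.
import adanet
import tensorflow as tf

fc = [tf.feature_column.numeric_column("x", shape=[4])]
head = tf.estimator.BinaryClassHead()

estimator = adanet.AutoEnsembleEstimator(
    head=head,
    # Candidate subnetworks the algorithm may add to the ensemble.
    candidate_pool=[
        tf.estimator.LinearEstimator(head=head, feature_columns=fc),
        tf.estimator.DNNEstimator(head=head, feature_columns=fc,
                                  hidden_units=[64, 32]),
    ],
    # Steps to train candidates before growing the ensemble by one round.
    max_iteration_steps=1000)

# estimator.train(input_fn=train_input_fn)  # train_input_fn: user-supplied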
MLlib: Machine Learning in Apache Spark
Apache Spark is a popular open-source platform for large-scale data
processing that is well-suited for iterative machine learning tasks. In this
paper we present MLlib, Spark's open-source distributed machine learning
library. MLlib provides efficient functionality for a wide range of learning
settings and includes several underlying statistical, optimization, and linear
algebra primitives. Shipped with Spark, MLlib supports several languages and
provides a high-level API that leverages Spark's rich ecosystem to simplify the
development of end-to-end machine learning pipelines. MLlib has experienced
rapid growth due to its vibrant open-source community of over 140 contributors
and includes extensive documentation to support further growth and to let users
quickly get up to speed.
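A canonical example of the high-level pipeline API mentioned above, using the DataFrame-based pyspark.ml package:

# A small end-to-end pipeline with MLlib's DataFrame-based API (pyspark.ml).
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-pipeline").getOrCreate()

training = spark.createDataFrame(
    [(0, "spark mllib pipeline", 1.0),
     (1, "unrelated text here", 0.0)],
    ["id", "text", "label"])

# Each stage transforms the DataFrame produced by the previous one.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)

model = Pipeline(stages=[tokenizer, hashing_tf, lr]).fit(training)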
Execution Templates: Caching Control Plane Decisions for Strong Scaling of Data Analytics
Control planes of cloud frameworks trade off between scheduling granularity
and performance. Centralized systems schedule at task granularity, but only
schedule a few thousand tasks per second. Distributed systems schedule hundreds
of thousands of tasks per second, but changing the schedule is costly.
We present execution templates, a control plane abstraction that can schedule
hundreds of thousands of tasks per second while supporting fine-grained,
per-task scheduling decisions. Execution templates leverage a program's
repetitive control flow to cache blocks of frequently-executed tasks. Executing
a task in a template requires sending a single message. Large-scale scheduling
changes install new templates, while small changes apply edits to existing
templates.
Evaluations of execution templates in Nimbus, a data analytics framework,
find that they provide the fine-grained scheduling flexibility of centralized
control planes while matching the strong scaling of distributed ones. Execution
templates support complex, real-world applications, such as a fluid simulation
with a triply nested loop and data-dependent branches.
Comment: To appear at USENIX ATC 2017.
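The mechanism can be pictured with a toy model (hypothetical types and names, not the Nimbus API): install a template once to cache a block of task descriptions, then execute it repeatedly with a single message carrying the template id and per-iteration parameters.

# Toy model of execution templates (hypothetical; not the Nimbus API).
from dataclasses import dataclass, field

def run_task(task, params):
    print(f"running {task} with {params}")  # placeholder task execution

@dataclass
class Template:
    template_id: int
    tasks: list  # cached task descriptions (e.g., stage names)

@dataclass
class Worker:
    templates: dict = field(default_factory=dict)

    def install(self, template: Template) -> None:
        # Paid once per control-flow block (or per scheduling change).
        self.templates[template.template_id] = template

    def execute(self, template_id: int, params: dict) -> None:
        # The fast path: a single message naming the template + parameters.
        for task in self.templates[template_id].tasks:
            run_task(task, params)

worker = Worker()
worker.install(Template(0, ["map", "reduce"]))
for step in range(3):          # repetitive control flow hits the cache
    worker.execute(0, {"iteration": step})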
FlashGraph: Processing Billion-Node Graphs on an Array of Commodity SSDs
Graph analysis performs many random reads and writes; thus, these workloads
are typically performed in memory. Traditionally, analyzing large graphs
requires a cluster of machines so the aggregate memory exceeds the graph size.
We demonstrate that a multicore server can process graphs with billions of
vertices and hundreds of billions of edges, utilizing commodity SSDs with
minimal performance loss. We do so by implementing a graph-processing engine on
top of a user-space SSD file system designed for high IOPS and extreme
parallelism. Our semi-external memory graph engine called FlashGraph stores
vertex state in memory and edge lists on SSDs. It hides latency by overlapping
computation with I/O. To save I/O bandwidth, FlashGraph only accesses edge
lists requested by applications from SSDs; to increase I/O throughput and
reduce CPU overhead for I/O, it conservatively merges I/O requests. These
designs maximize performance for applications with different I/O
characteristics. FlashGraph exposes a general and flexible vertex-centric
programming interface that can express a wide variety of graph algorithms and
their optimizations. We demonstrate that FlashGraph in semi-external memory
performs many algorithms with performance up to 80% of its in-memory
implementation and significantly outperforms PowerGraph, a popular distributed
in-memory graph engine.
Comment: Published in FAST '15.
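FlashGraph's actual programming interface is C++; the toy Python sketch below, with invented names, only illustrates the vertex-centric style, in which per-vertex code requests its edge list explicitly so an engine could fetch edge lists from SSD on demand:

# Toy vertex-centric BFS (illustrative only; FlashGraph itself is C++).
graph = {0: [1, 2], 1: [3], 2: [3], 3: []}   # adjacency lists ("on SSD")
dist = {v: None for v in graph}              # in-memory vertex state

def bfs_vertex_program(v, frontier_next):
    # In a semi-external-memory engine, reading graph[v] is where edge-list
    # I/O would be issued (and merged); here it is just a dict lookup.
    for u in graph[v]:
        if dist[u] is None:
            dist[u] = dist[v] + 1
            frontier_next.add(u)

dist[0] = 0
frontier = {0}
while frontier:                # engine loop: run program on active vertices
    nxt = set()
    for v in frontier:
        bfs_vertex_program(v, nxt)
    frontier = nxt
print(dist)  # {0: 0, 1: 1, 2: 1, 3: 2}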
ZenLDA: An Efficient and Scalable Topic Model Training System on Distributed Data-Parallel Platform
This paper presents our recent work on zenLDA, an efficient and scalable
Collapsed Gibbs Sampling (CGS) system for Latent Dirichlet Allocation training.
This task is considered challenging because both data parallelism and model
parallelism are required: the sampling data can comprise up to billions of
documents, and the model can contain up to trillions of parameters. zenLDA
combines algorithm-level improvements with system-level optimizations. It first
presents a novel CGS algorithm that balances time complexity, model accuracy,
and parallelization flexibility. The input corpus in zenLDA is represented as a
directed graph, and model parameters are annotated as the corresponding vertex
attributes. Distributed training is parallelized by partitioning the graph: in
each iteration, zenLDA first applies the CGS step to all partitions in
parallel, then synchronizes the computed model across partitions. In this way,
both data parallelism and model parallelism are achieved by converting them to
graph parallelism. We revisit the tradeoff between system efficiency and model
accuracy and present approximations such as an unsynchronized model, sparse
model initialization, and "converged" token exclusion. zenLDA is built on
GraphX in Spark, which provides a distributed data abstraction (RDD) and
expressive APIs that simplify programming while hiding system complexities;
this enabled us to implement other CGS algorithms with only a few lines of code
changed. To better fit the distributed data-parallel framework and achieve
performance comparable with contemporary systems, we also present several
system-level optimizations to push the performance limit. We evaluated zenLDA
against a web-scale corpus; the results indicate that zenLDA achieves much
better performance than the other CGS algorithms we implemented, while
simultaneously achieving better model accuracy.
Comment: 11 pages, 10 figures. arXiv admin note: text overlap with
arXiv:1412.4986 by other authors.
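To make the CGS step concrete, the sketch below is a minimal single-machine collapsed Gibbs sampler for LDA in plain NumPy; zenLDA's contribution is distributing and approximating this step over graph partitions, which the sketch does not attempt:

# Minimal single-machine collapsed Gibbs sampling for LDA (toy corpus).
import numpy as np

docs = [[0, 1, 2, 1], [2, 3, 3, 0]]   # word ids per document
V, K, alpha, beta = 4, 2, 0.1, 0.01   # vocab size, topics, priors
rng = np.random.default_rng(0)

ndk = np.zeros((len(docs), K))        # doc-topic counts
nkw = np.zeros((K, V))                # topic-word counts (the "model")
nk = np.zeros(K)                      # per-topic totals
z = [[int(rng.integers(K)) for _ in d] for d in docs]
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = z[d][i]
        ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

for _ in range(100):                  # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]               # remove the current assignment
            ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
            # full conditional p(z = k | everything else)
            p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
            k = rng.choice(K, p=p / p.sum())
            z[d][i] = k               # record the new assignment
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1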
Theano-MPI: a Theano-based Distributed Training Framework
We develop a scalable and extendable training framework that can utilize GPUs
across nodes in a cluster and accelerate the training of deep learning models
based on data parallelism. Both synchronous and asynchronous training are
implemented in our framework, where parameter exchange among GPUs is based on
CUDA-aware MPI. In this report, we analyze the convergence and capability of
the framework to reduce training time when scaling the synchronous training of
AlexNet and GoogLeNet from 2 GPUs to 8 GPUs. In addition, we explore novel ways
to reduce the communication overhead caused by exchanging parameters. Finally,
we release the framework as open source for further research on distributed
deep learning.
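The parameter-exchange step can be made concrete with a small sketch. Theano-MPI itself exchanges GPU buffers over CUDA-aware MPI; the sketch below uses mpi4py with host-side NumPy arrays, so it illustrates the synchronous averaging pattern rather than the framework's actual code:

# Sketch of a synchronous data-parallel SGD step via MPI allreduce.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def sync_sgd_step(params, local_grad, lr=0.01):
    # Sum gradients across all workers in place, then average.
    comm.Allreduce(MPI.IN_PLACE, local_grad, op=MPI.SUM)
    local_grad /= comm.Get_size()
    params -= lr * local_grad      # every worker applies the same update
    return params

# Run with e.g.: mpirun -np 8 python train.py
params = np.zeros(10)
grad = np.random.default_rng(comm.Get_rank()).normal(size=10)
params = sync_sgd_step(params, grad)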
C3LES: Codes for Coded Computation that Leverage Stragglers
In distributed computing systems, it is well recognized that worker nodes
that are slow (called stragglers) tend to dominate the overall job execution
time. Coded computation utilizes concepts from erasure coding to mitigate the
effect of stragglers by running "coded" copies of tasks comprising a job.
Stragglers are typically treated as erasures in this process. While this is
useful, there are issues with applying, e.g., MDS codes in a straightforward
manner. Specifically, several applications such as matrix-vector products deal
with sparse matrices. MDS codes typically require dense linear combinations of
submatrices of the original matrix which destroy their inherent sparsity. This
is problematic as it results in significantly higher processing times for
computing the submatrix-vector products in coded computation. Furthermore, it
also ignores partial computations at stragglers.
In this work, we propose a fine-grained model that quantifies the level of
non-trivial coding needed to obtain the benefits of coding in matrix-vector
computation. Simultaneously, it allows us to leverage partial computations
performed by the straggler nodes. For this model, we propose and evaluate
several code designs and discuss their properties.
Comment: 5 pages, 3 figures; to appear at the 2018 IEEE Information Theory
Workshop (ITW) in Guangzhou, China.
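For context, here is a minimal sketch of the straightforward MDS-coded matrix-vector multiplication that the paper takes as its baseline (a Vandermonde encoding; the paper's own fine-grained code designs are not reproduced here). Note how the coded blocks are dense linear combinations of the original blocks, which is exactly the sparsity-destroying step the abstract criticizes:

# MDS-coded matrix-vector multiplication: any k of n workers suffice.
import numpy as np

k, n = 3, 5
A = np.arange(18.0).reshape(6, 3)      # tall matrix, split into k row blocks
x = np.array([1.0, 2.0, 3.0])
blocks = np.split(A, k)                # A1, A2, A3

# Encode: n coded blocks as Vandermonde combinations (dense!) of k blocks.
G = np.vander(np.arange(1.0, n + 1), k, increasing=True)  # any k rows invert
coded = [sum(G[i, j] * blocks[j] for j in range(k)) for i in range(n)]

# Each worker i computes coded[i] @ x; suppose workers 1 and 3 straggle.
done = [0, 2, 4]
results = {i: coded[i] @ x for i in done}

# Decode: invert the k x k submatrix of G for the finished workers.
decoded = np.linalg.solve(G[done], np.stack([results[i] for i in done]))
assert np.allclose(np.concatenate(decoded), A @ x)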
Big Data Computing Using Cloud-Based Technologies, Challenges and Future Perspectives
The vast amounts of data generated by devices and Internet-based sources on
a regular basis constitute big data. This data can be processed and
analyzed to develop useful applications for specific domains. Several
mathematical and data analytics techniques have found use in this sphere. This
has given rise to the development of computing models and tools for big data
computing. However, the storage and processing requirements are overwhelming
for traditional systems and technologies. Therefore, there is a need for
infrastructures that can adjust the storage and processing capability in
accordance with the changing data dimensions. Cloud Computing serves as a
potential solution to this problem. However, big data computing in the cloud
has its own set of challenges and research issues. This chapter surveys the big
data concept, discusses the mathematical and data analytics techniques that can
be used for big data, and gives a taxonomy of the existing tools, frameworks, and
platforms available for different big data computing models. Besides this, it
also evaluates the viability of cloud-based big data computing, examines
existing challenges and opportunities, and provides future research directions
in this field.