Multi-layer Optimizations for End-to-End Data Analytics
We consider the problem of training machine learning models over
multi-relational data. The mainstream approach is to first construct the
training dataset using a feature extraction query over the input database and then
use a statistical software package of choice to train the model. In this paper
we introduce Iterative Functional Aggregate Queries (IFAQ), a framework that
realizes an alternative approach. IFAQ treats the feature extraction query and
the learning task as one program given in IFAQ's domain-specific language,
which captures a subset of Python commonly used in Jupyter notebooks for rapid
prototyping of machine learning applications. The program is subject to several
layers of IFAQ optimizations, such as algebraic transformations, loop
transformations, schema specialization, data layout optimizations, and finally
compilation into efficient low-level C++ code specialized for the given
workload and data.
We show that a Scala implementation of IFAQ can outperform mlpack, Scikit,
and TensorFlow by several orders of magnitude for linear regression and
regression tree models over several relational datasets.
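The core idea can be illustrated with a minimal Python sketch (hypothetical code, far simpler than IFAQ's actual DSL and optimizations): the feature-extraction join and the gradient-descent loop are one program, so the training matrix is never materialized; each epoch streams over the join.

```python
# Fused feature extraction + linear-regression training (toy sketch).
# R maps a join key to feature x1; S holds (key, x2, target) tuples.
def train_fused(R, S, epochs=50, lr=0.1):
    """Train y ~ w1*x1 + w2*x2 without materializing the join R ⋈ S."""
    w1 = w2 = 0.0
    n = len(S)
    for _ in range(epochs):
        g1 = g2 = 0.0
        for key, x2, y in S:        # stream over the join, tuple by tuple
            x1 = R[key]             # lookup plays the role of the join
            err = w1 * x1 + w2 * x2 - y
            g1 += err * x1
            g2 += err * x2
        w1 -= lr * g1 / n           # one gradient step per epoch
        w2 -= lr * g2 / n
    return w1, w2
```

A system like IFAQ can then rewrite such a fused program algebraically and compile it to specialized C++, which is where the reported speedups come from.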
Flare: Native Compilation for Heterogeneous Workloads in Apache Spark
The need for modern data analytics to combine relational, procedural, and
map-reduce-style functional processing is widely recognized. State-of-the-art
systems like Spark have added SQL front-ends and relational query optimization,
which promise an increase in expressiveness and performance. But how good are
these extensions at extracting high performance from modern hardware platforms?
While Spark has made impressive progress, we show that for relational
workloads, there is still a significant gap compared with best-of-breed query
engines. And when stepping outside of the relational world, query optimization
techniques are ineffective if large parts of a computation have to be treated
as user-defined functions (UDFs).
We present Flare: a new back-end for Spark that brings performance closer to
the best SQL engines, without giving up the added expressiveness of Spark. We
demonstrate order of magnitude speedups both for relational workloads such as
TPC-H, as well as for a range of machine learning kernels that combine
relational and iterative functional processing.
Flare achieves these results through (1) compilation to native code, (2)
replacing parts of the Spark runtime system, and (3) extending the scope of
optimization and code generation to large classes of UDFs.
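The compilation idea behind point (1) can be sketched in a toy form (hypothetical code; Flare actually emits native C code via staging): instead of interpreting a query plan operator-by-operator for every tuple, generate one specialized loop for the whole pipeline, with the UDF predicate inlined where the compiler can see it.

```python
# Toy query compiler: generates a fused filter+sum function as source code.
def compile_filter_sum(pred_src, col):
    """pred_src is a Python expression over `row` (stands in for a UDF
    made visible to the compiler); col is the column to aggregate."""
    src = (
        "def q(rows):\n"
        "    total = 0\n"
        "    for row in rows:\n"
        f"        if {pred_src}:\n"        # UDF inlined into the loop
        f"            total += row[{col!r}]\n"
        "    return total\n"
    )
    ns = {}
    exec(src, ns)                          # 'compile' the specialized query
    return ns["q"]

q = compile_filter_sum("row['qty'] < 24", "price")
```

The generated function has no per-tuple interpretation overhead; a real compiler like Flare applies the same principle but targets native code and covers full relational plans.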
STEP : A Distributed Multi-threading Framework Towards Efficient Data Analytics
Various general-purpose distributed systems have been proposed to cope with
high-diversity applications in the pipeline of Big Data analytics. Most of them
provide simple yet effective primitives to simplify distributed programming.
While these rigid primitives offer great ease of use, they can compromise
performance and limit flexibility in data representation and programming,
both of which are critical properties in real systems. In this paper, we
discuss the limitations of coarse-grained
primitives and aim to provide an alternative for users to have flexible control
over distributed programs and operate globally shared data more efficiently. We
develop STEP, a novel distributed framework based on in-memory key-value store.
The key idea of STEP is to adapt multi-threading in a single machine to a
distributed environment. STEP enables users to take fine-grained control over
distributed threads and apply task-specific optimizations in a flexible manner.
The underlying key-value store serves as distributed shared memory to keep
globally shared data. To ensure ease of use, STEP offers a rich set of
interfaces for distributed shared-data manipulation, cluster management,
distributed thread management, and synchronization. We conduct
extensive experimental studies to evaluate the performance of STEP using real
data sets. The results show that STEP outperforms the state-of-the-art
general-purpose distributed systems as well as a specialized ML platform in
many real applications.
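STEP's programming model can be approximated in a single-machine sketch (hypothetical API, not STEP's actual interfaces): worker threads cooperate through a shared key-value store that stands in for the distributed in-memory store, with explicit synchronization under the programmer's control.

```python
import threading

# Shared key-value store standing in for STEP's distributed shared memory.
class KVStore:
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def incr(self, key, delta):
        with self._lock:                      # explicit, fine-grained sync
            self._data[key] = self._data.get(key, 0) + delta

    def get(self, key):
        with self._lock:
            return self._data.get(key, 0)

def run_job(store, partitions):
    """Each thread processes one data partition and accumulates into the
    globally shared key 'sum' -- the multi-threading-in-one-machine model
    that STEP extends to a cluster."""
    threads = [threading.Thread(target=lambda p=p: store.incr("sum", sum(p)))
               for p in partitions]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store.get("sum")
```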
Kyrix: Interactive Visual Data Exploration at Scale
Scalable interactive visual data exploration is crucial in many domains due
to increasingly large datasets generated at rapid rates. Details-on-demand
provides a useful interaction paradigm for exploring large datasets, where
users start at an overview, find regions of interest, zoom in to see detailed
views, zoom out and then repeat. This paradigm is the primary user interaction
mode of widely-used systems such as Google Maps, Aperture Tiles and ForeCache.
These earlier systems, however, are highly customized with hardcoded visual
representations and optimizations. A more general framework is needed to
facilitate the development of visual data exploration systems at scale. In this
paper, we present Kyrix, an end-to-end system for developing scalable
details-on-demand data exploration applications. Kyrix provides developers with
a declarative model for easy specification of general visualizations. Behind
the scenes, Kyrix utilizes a suite of performance optimization techniques to
achieve a response time within 500ms for various user interactions. We also
report results from a performance study which shows that a novel dynamic
fetching scheme adopted by Kyrix outperforms tile-based fetching used in
earlier systems.
Comment: CIDR'1
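The details-on-demand pattern Kyrix generalizes can be sketched as follows (hypothetical spec format, not Kyrix's actual declarative model): each zoom level declares what to render, and the backend returns only the objects that intersect the requested viewport at that level.

```python
# Declarative zoom-level spec: overview shows only large cities, detail shows all.
LEVELS = {
    0: {"table": "cities", "min_pop": 1_000_000},   # overview
    1: {"table": "cities", "min_pop": 0},           # zoomed-in detail
}

def fetch(objects, level, viewport):
    """objects: [(x, y, pop), ...]; viewport: (x0, y0, x1, y1).
    Returns the objects visible at this level within the viewport."""
    spec = LEVELS[level]
    x0, y0, x1, y1 = viewport
    return [o for o in objects
            if x0 <= o[0] <= x1 and y0 <= o[1] <= y1
            and o[2] >= spec["min_pop"]]
```

Kyrix's contribution is making such specifications general and declarative while keeping interactions under its 500ms response-time target via optimizations such as dynamic fetching.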
Terra: Scalable Cross-Layer GDA Optimizations
Geo-distributed analytics (GDA) frameworks transfer large datasets over the
wide-area network (WAN). Yet existing frameworks often ignore the WAN topology.
This disconnect between WAN-bound applications and the WAN itself results in
missed opportunities for cross-layer optimizations. In this paper, we present
Terra to bridge this gap. Instead of decoupled WAN routing and GDA transfer
scheduling, Terra applies scalable cross-layer optimizations to minimize WAN
transfer times for GDA jobs. We present a two-pronged approach: (i) a scalable
algorithm for joint routing and scheduling to make fast decisions; and (ii) a
scalable, overlay-based enforcement mechanism that avoids expensive switch rule
updates in the WAN. Together, they enable Terra to quickly react to WAN
uncertainties such as large bandwidth fluctuations and failures in an
application-aware manner as well. Integration with the FloodLight SDN
controller and Apache YARN, and evaluation on 4 workloads and 3 WAN topologies
show that Terra improves the average completion times of GDA jobs by
1.55x-3.43x. GDA jobs running with Terra meet 2.82x-4.29x more deadlines and
can quickly react to WAN-level events in an application-aware manner.
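A greatly simplified sketch of joint routing and scheduling (hypothetical code; Terra's actual algorithm is more sophisticated): each GDA transfer is greedily assigned the feasible path with the largest bottleneck bandwidth, and that path's capacity is reserved before the next transfer is placed, so routing and scheduling decisions are made together rather than in separate layers.

```python
def assign_transfers(paths, capacity, demands):
    """paths: {name: [link, ...]}; capacity: {link: Gbps};
    demands: per-transfer bandwidth demands in Gbps.
    Returns {transfer index: chosen path name}."""
    assignment = {}
    for i, demand in enumerate(demands):
        best, best_bw = None, 0.0
        for name, links in paths.items():
            bw = min(capacity[l] for l in links)   # bottleneck bandwidth
            if bw >= demand and bw > best_bw:
                best, best_bw = name, bw
        if best is None:
            continue                               # no feasible path right now
        for l in paths[best]:
            capacity[l] -= demand                  # reserve along the path
        assignment[i] = best
    return assignment
```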
PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems
Machine Learning models are often composed of pipelines of transformations.
While this design allows single model components to be executed efficiently
at training time, prediction serving has different requirements such as low
latency, high throughput and graceful performance degradation under heavy load.
Current prediction serving systems consider models as black boxes, whereby
prediction-time-specific optimizations are ignored in favor of ease of
deployment. In this paper, we present PRETZEL, a prediction serving system
introducing a novel white box architecture enabling both end-to-end and
multi-model optimizations. Using production-like model pipelines, our
experiments show that PRETZEL is able to introduce performance improvements
over different dimensions; compared to state-of-the-art approaches PRETZEL is
on average able to reduce 99th percentile latency by 5.5x while reducing memory
footprint by 25x, and increasing throughput by 4.7x.
Comment: 16 pages, 14 figures, 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 201
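One white-box optimization this architecture enables can be sketched as follows (hypothetical code, not PRETZEL's implementation): when several deployed pipelines share a prefix of identical featurization steps, that prefix is executed once per request and its output shared, instead of once per black-box pipeline.

```python
def serve(pipelines, x):
    """pipelines: list of operator chains; each operator is a (name, fn)
    pair. For one request x, identical operator-name prefixes across
    pipelines are computed only once (memoized by prefix)."""
    cache = {}
    results = []
    for ops in pipelines:
        val, prefix = x, ()
        for name, fn in ops:
            prefix = prefix + (name,)
            if prefix not in cache:      # first pipeline to need this prefix
                cache[prefix] = fn(val)
            val = cache[prefix]          # later pipelines reuse the result
        results.append(val)
    return results
```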
Moving Processing to Data: On the Influence of Processing in Memory on Data Management
Near-Data Processing refers to an architectural hardware and software
paradigm, based on the co-location of storage and compute units. Ideally, it
allows application-defined data- or compute-intensive operations to be executed
in situ, i.e., within (or close to) the physical data storage. Thus, Near-Data
Processing seeks to minimize expensive data movement, improving performance,
scalability, and resource-efficiency. Processing-in-Memory is a sub-class of
Near-Data processing that targets data processing directly within memory (DRAM)
chips. The effective use of Near-Data Processing mandates new architectures,
algorithms, interfaces, and development toolchains.
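The data-movement argument can be made concrete with an illustrative sketch (hypothetical code and counts): pushing a selection down to the storage side means only matching rows cross the interconnect, instead of the whole table.

```python
def host_side_filter(table, pred):
    """Conventional path: every row crosses the interconnect to the host."""
    moved = len(table)
    return [r for r in table if pred(r)], moved

def near_data_filter(table, pred):
    """Near-data path: the filter runs inside the storage/memory unit,
    so only matching rows cross the interconnect."""
    result = [r for r in table if pred(r)]
    return result, len(result)
```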
TrIMS: Transparent and Isolated Model Sharing for Low Latency Deep Learning Inference in Function as a Service Environments
Deep neural networks (DNNs) have become core computation components within
low latency Function as a Service (FaaS) prediction pipelines: including image
recognition, object detection, natural language processing, speech synthesis,
and personalized recommendation pipelines. Cloud computing, as the de-facto
backbone of modern computing infrastructure for both enterprise and consumer
applications, has to be able to handle user-defined pipelines of diverse DNN
inference workloads while maintaining isolation and latency guarantees, and
minimizing resource waste. The current solution for guaranteeing isolation
within FaaS is suboptimal -- suffering from "cold start" latency. A major cause
of such inefficiency is the need to move large amounts of model data within and
across servers. We propose TrIMS as a novel solution to address these issues.
Our proposed solution consists of a persistent model store across the GPU, CPU,
local storage, and cloud storage hierarchy, an efficient resource management
layer that provides isolation, and a succinct set of application APIs and
container technologies for easy and transparent integration with FaaS, Deep
Learning (DL) frameworks, and user code. We demonstrate our solution by
interfacing TrIMS with the Apache MXNet framework and demonstrate up to 24x
speedup in latency for image classification models and up to 210x speedup for
large models. We achieve up to 8x system throughput improvement.
Comment: In Proceedings CLOUD 201
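The persistent-model-store idea can be sketched minimally (hypothetical API, ignoring isolation and the real GPU/CPU/storage tiers): a function invocation first looks for the model in a store that outlives individual invocations, and only pays the expensive load on a genuine cold start.

```python
class ModelStore:
    """Toy persistent model store shared across FaaS invocations."""

    def __init__(self, loader):
        self._loader = loader      # expensive: reads weights from storage
        self._cache = {}
        self.cold_starts = 0

    def get(self, name):
        if name not in self._cache:
            self.cold_starts += 1             # cold start: load and retain
            self._cache[name] = self._loader(name)
        return self._cache[name]              # warm hit: no data movement
```

In TrIMS the analogous store spans GPU, CPU, local storage, and cloud storage, and is shared across containers with isolation guarantees; this sketch captures only the cold-start-avoidance behavior.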
GraphX: Unifying Data-Parallel and Graph-Parallel Analytics
From social networks to language modeling, the growing scale and importance
of graph data has driven the development of numerous new graph-parallel systems
(e.g., Pregel, GraphLab). By restricting the computation that can be expressed
and introducing new techniques to partition and distribute the graph, these
systems can efficiently execute iterative graph algorithms orders of magnitude
faster than more general data-parallel systems. However, the same restrictions
that enable the performance gains also make it difficult to express many of the
important stages in a typical graph-analytics pipeline: constructing the graph,
modifying its structure, or expressing computation that spans multiple graphs.
As a consequence, existing graph analytics pipelines compose graph-parallel and
data-parallel systems using external storage systems, leading to extensive data
movement and a complicated programming model.
To address these challenges we introduce GraphX, a distributed graph
computation framework that unifies graph-parallel and data-parallel
computation. GraphX provides a small, core set of graph-parallel operators
expressive enough to implement the Pregel and PowerGraph abstractions, yet
simple enough to be cast in relational algebra. GraphX uses a collection of
query optimization techniques such as automatic join rewrites to efficiently
implement these graph-parallel operators. We evaluate GraphX on real-world
graphs and workloads and demonstrate that GraphX achieves performance
comparable to specialized graph computation systems, while outperforming them
in end-to-end graph pipelines. Moreover, GraphX achieves a balance between
expressiveness, performance, and ease of use.
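GraphX's central observation can be sketched in a few lines (hypothetical code, using plain Python in place of Spark's relational operators): one Pregel-style superstep, sending a value along each edge and aggregating per destination, is just a join of the edge table with the vertex table followed by a group-by.

```python
def superstep(vertices, edges):
    """vertices: {vid: value}; edges: [(src, dst), ...].
    Returns, per destination vertex, the sum of its in-neighbors' values."""
    # Join: attach the source vertex's value to each edge.
    messages = [(dst, vertices[src]) for src, dst in edges]
    # Group-by destination with a sum aggregate.
    acc = {}
    for dst, val in messages:
        acc[dst] = acc.get(dst, 0) + val
    return acc
```

Because both halves are relational, a query optimizer can apply rewrites (e.g., join reordering, automatic join elimination) to graph-parallel operators, which is how GraphX recovers specialized-system performance.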
Scalable Deep Learning on Distributed Infrastructures: Challenges, Techniques and Tools
Deep Learning (DL) has had an immense success in the recent past, leading to
state-of-the-art results in various domains such as image recognition and
natural language processing. One of the reasons for this success is the
increasing size of DL models and the proliferation of vast amounts of training
data being available. To keep on improving the performance of DL, increasing
the scalability of DL systems is necessary. In this survey, we perform a broad
and thorough investigation on challenges, techniques and tools for scalable DL
on distributed infrastructures. This incorporates infrastructures for DL,
methods for parallel DL training, multi-tenant resource scheduling and the
management of training and model data. Further, we analyze and compare 11
current open-source DL frameworks and tools and investigate which of the
techniques are commonly implemented in practice. Finally, we highlight future
research trends in DL systems that deserve further research.
Comment: accepted at ACM Computing Surveys, to appear
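One of the surveyed techniques, synchronous data parallelism, reduces to a short sketch (hypothetical single-process code, no real framework): each worker computes a gradient on its own data shard, and the averaged gradient, as an all-reduce would produce, updates the shared model.

```python
def data_parallel_step(w, shards, grad_fn, lr=0.1):
    """One synchronous data-parallel training step for a scalar model w.
    shards: one data partition per worker; grad_fn(w, shard) -> gradient."""
    grads = [grad_fn(w, shard) for shard in shards]   # one gradient per worker
    avg = sum(grads) / len(grads)                     # stands in for all-reduce
    return w - lr * avg                               # identical update everywhere
```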