
    Multi-layer Optimizations for End-to-End Data Analytics

    We consider the problem of training machine learning models over multi-relational data. The mainstream approach is to first construct the training dataset using a feature extraction query over the input database and then use a statistical software package of choice to train the model. In this paper we introduce Iterative Functional Aggregate Queries (IFAQ), a framework that realizes an alternative approach. IFAQ treats the feature extraction query and the learning task as one program given in IFAQ's domain-specific language, which captures a subset of Python commonly used in Jupyter notebooks for rapid prototyping of machine learning applications. The program is subject to several layers of IFAQ optimizations, such as algebraic transformations, loop transformations, schema specialization, and data layout optimizations, followed by compilation into efficient low-level C++ code specialized for the given workload and data. We show that a Scala implementation of IFAQ can outperform mlpack, Scikit, and TensorFlow by several orders of magnitude for linear regression and regression tree models over several relational datasets.
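
    A minimal sketch (with a hypothetical two-relation schema, not IFAQ's actual DSL syntax) of the kind of Jupyter-style program the abstract describes: the feature-extraction join and the learning loop are one piece of code, so an optimizer like IFAQ's can apply transformations across the boundary between them.

```python
import numpy as np

# Two input relations: sales(store, item, price) and items(item, weight).
sales = [(1, 10, 5.0), (1, 11, 3.0), (2, 10, 4.0)]
items = {10: 2.0, 11: 1.5}

# Feature extraction "query": join sales with items on item id.
X = np.array([[price, items[item]] for (_, item, price) in sales])
y = np.array([price * 1.1 for (_, _, price) in sales])  # synthetic label

# Learning task: batch gradient descent for linear regression.
theta = np.zeros(X.shape[1])
for _ in range(100):
    grad = X.T @ (X @ theta - y) / len(y)
    theta -= 0.01 * grad
print(theta)
```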

    Flare: Native Compilation for Heterogeneous Workloads in Apache Spark

    The need for modern data analytics to combine relational, procedural, and map-reduce-style functional processing is widely recognized. State-of-the-art systems like Spark have added SQL front-ends and relational query optimization, which promise an increase in expressiveness and performance. But how good are these extensions at extracting high performance from modern hardware platforms? While Spark has made impressive progress, we show that for relational workloads, there is still a significant gap compared with best-of-breed query engines. And when stepping outside of the relational world, query optimization techniques are ineffective if large parts of a computation have to be treated as user-defined functions (UDFs). We present Flare: a new back-end for Spark that brings performance closer to the best SQL engines, without giving up the added expressiveness of Spark. We demonstrate order-of-magnitude speedups both for relational workloads such as TPC-H and for a range of machine learning kernels that combine relational and iterative functional processing. Flare achieves these results through (1) compilation to native code, (2) replacing parts of the Spark runtime system, and (3) extending the scope of optimization and code generation to large classes of UDFs.
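
    A hedged illustration, in standard PySpark rather than Flare's own API, of the workload class at issue: once a UDF enters the query plan, a conventional optimizer must treat it as an opaque black box, which is exactly the gap Flare targets by compiling relational operators and UDFs together to native code.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("udf-example").getOrCreate()
df = spark.createDataFrame(
    [(1, 10.0), (2, 20.0), (3, 30.0)], ["id", "amount"])

# An opaque user-defined function: invisible to the relational optimizer.
discount = udf(lambda x: x * 0.9 if x > 15.0 else x, DoubleType())

result = (df.filter(col("amount") > 5.0)               # relational: optimizable
            .withColumn("net", discount(col("amount"))))  # UDF: a black box
result.show()
```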

    STEP: A Distributed Multi-threading Framework Towards Efficient Data Analytics

    Various general-purpose distributed systems have been proposed to cope with the high diversity of applications in the Big Data analytics pipeline. Most of them provide simple yet effective primitives to simplify distributed programming. While these rigid primitives offer great ease of use to savvy programmers, they can compromise performance and limit flexibility in data representation and programming specifications, which are critical properties in real systems. In this paper, we discuss the limitations of coarse-grained primitives and aim to provide an alternative that gives users flexible control over distributed programs and lets them operate on globally shared data more efficiently. We develop STEP, a novel distributed framework built on an in-memory key-value store. The key idea of STEP is to adapt multi-threading in a single machine to a distributed environment. STEP enables users to take fine-grained control over distributed threads and to apply task-specific optimizations in a flexible manner. The underlying key-value store serves as distributed shared memory for globally shared data. To ensure ease of use, STEP offers a rich set of interfaces for distributed shared data manipulation, cluster management, distributed thread management, and synchronization. We conduct extensive experimental studies evaluating the performance of STEP on real datasets. The results show that STEP outperforms state-of-the-art general-purpose distributed systems as well as a specialized ML platform in many real applications.
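
    A minimal sketch of the programming style described above, with ordinary Python threads standing in for STEP's distributed threads and a lock-guarded dict standing in for its key-value-store-backed shared memory. The names here are illustrative, not STEP's actual interfaces.

```python
import threading

shared = {"sum": 0}          # stand-in for globally shared KV-store data
lock = threading.Lock()      # stand-in for STEP's synchronization primitives

def worker(partition):
    local = sum(partition)   # task-specific work on one data partition
    with lock:               # fine-grained update of shared state
        shared["sum"] += local

parts = [[1, 2, 3], [4, 5], [6]]
threads = [threading.Thread(target=worker, args=(p,)) for p in parts]
for t in threads: t.start()
for t in threads: t.join()
print(shared["sum"])  # 21
```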

    Kyrix: Interactive Visual Data Exploration at Scale

    Scalable interactive visual data exploration is crucial in many domains due to increasingly large datasets generated at rapid rates. Details-on-demand provides a useful interaction paradigm for exploring large datasets: users start at an overview, find regions of interest, zoom in to see detailed views, zoom out, and repeat. This paradigm is the primary user interaction mode of widely used systems such as Google Maps, Aperture Tiles, and ForeCache. These earlier systems, however, are highly customized, with hardcoded visual representations and optimizations. A more general framework is needed to facilitate the development of visual data exploration systems at scale. In this paper, we present Kyrix, an end-to-end system for developing scalable details-on-demand data exploration applications. Kyrix provides developers with a declarative model for easy specification of general visualizations. Behind the scenes, Kyrix utilizes a suite of performance optimization techniques to achieve a response time within 500 ms for various user interactions. We also report results from a performance study which shows that a novel dynamic fetching scheme adopted by Kyrix outperforms the tile-based fetching used in earlier systems.
    Comment: CIDR'19
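
    A schematic, hypothetical rendering in Python (not Kyrix's actual specification API) of what a details-on-demand declaration expresses: a hierarchy of zoom levels, each with its own data query and visual encoding, plus jumps between levels that the system's fetching layer serves within the latency budget.

```python
spec = {
    "canvases": [
        {"id": "overview", "w": 1000, "h": 1000,
         "query": "SELECT region, AVG(value) FROM t GROUP BY region",
         "mark": "rect"},
        {"id": "detail", "w": 16000, "h": 16000,
         "query": "SELECT x, y, value FROM t",
         "mark": "circle"},
    ],
    # Zooming from the overview into a region loads the detail canvas,
    # fetching only the objects that fall in the current viewport.
    "jumps": [{"from": "overview", "to": "detail", "type": "semantic_zoom"}],
}
```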

    Terra: Scalable Cross-Layer GDA Optimizations

    Geo-distributed analytics (GDA) frameworks transfer large datasets over the wide-area network (WAN). Yet existing frameworks often ignore the WAN topology. This disconnect between WAN-bound applications and the WAN itself results in missed opportunities for cross-layer optimizations. In this paper, we present Terra to bridge this gap. Instead of decoupled WAN routing and GDA transfer scheduling, Terra applies scalable cross-layer optimizations to minimize WAN transfer times for GDA jobs. We present a two-pronged approach: (i) a scalable algorithm for joint routing and scheduling that makes fast decisions; and (ii) a scalable, overlay-based enforcement mechanism that avoids expensive switch rule updates in the WAN. Together, they enable Terra to react quickly, in an application-aware manner, to WAN uncertainties such as large bandwidth fluctuations and failures. Integration with the FloodLight SDN controller and Apache YARN, and evaluation on 4 workloads and 3 WAN topologies, show that Terra improves the average completion times of GDA jobs by 1.55x-3.43x. GDA jobs running with Terra meet 2.82x-4.29x more deadlines and can react quickly to WAN-level events.
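
    A toy greedy sketch, not Terra's actual algorithm, of what "joint routing and scheduling" means: each transfer is assigned the candidate WAN path that currently minimizes its completion time, and link loads are updated so later decisions see earlier ones. All topology numbers below are made up for illustration.

```python
links = {("A", "B"): 10.0, ("B", "C"): 5.0, ("A", "C"): 4.0}  # Gbps
load = {l: 0.0 for l in links}  # GB already queued on each link

paths = {("A", "C"): [[("A", "C")], [("A", "B"), ("B", "C")]]}

def finish_time(path, size):
    # Bottleneck estimate: queued bytes plus this transfer, over link rate.
    return max((load[l] + size) / links[l] for l in path)

def place(src, dst, size):
    best = min(paths[(src, dst)], key=lambda p: finish_time(p, size))
    for l in best:
        load[l] += size
    return best

print(place("A", "C", 8.0))   # picks the higher-bandwidth two-hop path
print(place("A", "C", 8.0))   # falls back to the direct link as B-C fills
```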

    PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems

    Machine Learning models are often composed of pipelines of transformations. While this design allows single model components to be executed efficiently at training time, prediction serving has different requirements, such as low latency, high throughput, and graceful performance degradation under heavy load. Current prediction serving systems consider models as black boxes, whereby prediction-time-specific optimizations are ignored in favor of ease of deployment. In this paper, we present PRETZEL, a prediction serving system introducing a novel white-box architecture that enables both end-to-end and multi-model optimizations. Using production-like model pipelines, our experiments show that PRETZEL delivers performance improvements along several dimensions: compared to state-of-the-art approaches, PRETZEL on average reduces 99th percentile latency by 5.5x, reduces memory footprint by 25x, and increases throughput by 4.7x.
    Comment: 16 pages, 14 figures; 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2018
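
    A hedged sketch, illustrative rather than PRETZEL's implementation, of one multi-model optimization a white-box architecture enables: pipelines that share an identical featurization stage execute it once per request instead of once per model. The function and model names are hypothetical.

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def char_ngrams(text):           # a featurizer stage shared by both pipelines
    return tuple(text[i:i + 3] for i in range(len(text) - 2))

def model_a(text):
    feats = char_ngrams(text)    # cache hit if model_b already ran
    return len(feats) % 2        # stand-in for a real predictor

def model_b(text):
    feats = char_ngrams(text)
    return len(feats) % 3

req = "prediction serving"
print(model_a(req), model_b(req))  # the shared stage runs only once
```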

    Moving Processing to Data: On the Influence of Processing in Memory on Data Management

    Near-Data Processing refers to an architectural hardware and software paradigm based on the co-location of storage and compute units. Ideally, it allows application-defined data- or compute-intensive operations to be executed in-situ, i.e., within (or close to) the physical data storage. Thus, Near-Data Processing seeks to minimize expensive data movement, improving performance, scalability, and resource efficiency. Processing-in-Memory is a sub-class of Near-Data Processing that targets data processing directly within memory (DRAM) chips. The effective use of Near-Data Processing mandates new architectures, algorithms, interfaces, and development toolchains.
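
    A purely conceptual sketch of the data-movement argument: instead of shipping every row to the CPU and filtering there, the predicate is pushed down to a simulated near-data compute unit that returns only qualifying rows. The SmartStorage class is illustrative; real Near-Data Processing requires hardware and firmware support.

```python
class SmartStorage:
    def __init__(self, rows):
        self._rows = rows                 # data living "in" the device

    def scan(self):                       # conventional path: move everything
        return list(self._rows)

    def scan_where(self, pred):           # NDP path: compute in-situ
        return [r for r in self._rows if pred(r)]

dev = SmartStorage(range(1_000_000))
# The host-side filter moves 1M rows; the pushed-down filter moves 10.
host_side = [r for r in dev.scan() if r % 100_000 == 0]
in_situ = dev.scan_where(lambda r: r % 100_000 == 0)
assert host_side == in_situ
```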

    TrIMS: Transparent and Isolated Model Sharing for Low Latency Deep Learning Inference in Function as a Service Environments

    Deep neural networks (DNNs) have become core computation components within low-latency Function as a Service (FaaS) prediction pipelines, including image recognition, object detection, natural language processing, speech synthesis, and personalized recommendation pipelines. Cloud computing, as the de-facto backbone of modern computing infrastructure for both enterprise and consumer applications, has to be able to handle user-defined pipelines of diverse DNN inference workloads while maintaining isolation and latency guarantees and minimizing resource waste. The current solution for guaranteeing isolation within FaaS is suboptimal, suffering from "cold start" latency. A major cause of this inefficiency is the need to move large amounts of model data within and across servers. We propose TrIMS as a novel solution to address these issues. Our proposed solution consists of a persistent model store across the GPU, CPU, local storage, and cloud storage hierarchy; an efficient resource management layer that provides isolation; and a succinct set of application APIs and container technologies for easy and transparent integration with FaaS, Deep Learning (DL) frameworks, and user code. We demonstrate our solution by interfacing TrIMS with the Apache MXNet framework, showing up to 24x speedup in latency for image classification models and up to 210x speedup for large models. We achieve up to 8x system throughput improvement.
    Comment: In Proceedings CLOUD 2019
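
    An illustrative sketch, hypothetical rather than TrIMS's actual API, of a hierarchical persistent model store: lookups walk GPU, then CPU, then local-disk tiers and promote models upward, so a warm model skips the "cold start" load entirely.

```python
gpu_cache, cpu_cache, disk = {}, {}, {"resnet50": b"...weights..."}

def get_model(name):
    if name in gpu_cache:                 # hottest tier: already on the GPU
        return gpu_cache[name]
    if name in cpu_cache:                 # warm: in host memory, promote it
        gpu_cache[name] = cpu_cache[name]
        return gpu_cache[name]
    weights = disk[name]                  # cold: load once, then stay warm
    cpu_cache[name] = weights
    gpu_cache[name] = weights
    return weights

get_model("resnet50")                     # cold start: reads from disk
get_model("resnet50")                     # warm: served from the GPU tier
```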

    GraphX: Unifying Data-Parallel and Graph-Parallel Analytics

    From social networks to language modeling, the growing scale and importance of graph data has driven the development of numerous new graph-parallel systems (e.g., Pregel, GraphLab). By restricting the computation that can be expressed and introducing new techniques to partition and distribute the graph, these systems can execute iterative graph algorithms orders of magnitude faster than more general data-parallel systems. However, the same restrictions that enable the performance gains also make it difficult to express many of the important stages in a typical graph-analytics pipeline: constructing the graph, modifying its structure, or expressing computation that spans multiple graphs. As a consequence, existing graph analytics pipelines compose graph-parallel and data-parallel systems through external storage systems, leading to extensive data movement and a complicated programming model. To address these challenges we introduce GraphX, a distributed graph computation framework that unifies graph-parallel and data-parallel computation. GraphX provides a small core set of graph-parallel operators expressive enough to implement the Pregel and PowerGraph abstractions, yet simple enough to be cast in relational algebra. GraphX uses a collection of query optimization techniques such as automatic join rewrites to implement these graph-parallel operators efficiently. We evaluate GraphX on real-world graphs and workloads and demonstrate that GraphX achieves performance comparable to specialized graph computation systems while outperforming them in end-to-end graph pipelines. Moreover, GraphX achieves a balance between expressiveness, performance, and ease of use.
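
    A minimal sketch of the key idea: a graph-parallel step expressed in relational terms. Vertices and edges are plain tables, and one round of "send messages along edges, then aggregate per vertex" is a join followed by a group-by, written here directly in Python (a toy PageRank step, not GraphX's Scala API).

```python
vertices = {1: 1.0, 2: 1.0, 3: 1.0}          # id -> rank
edges = [(1, 2), (1, 3), (2, 3)]             # (src, dst)

out_degree = {}
for src, _ in edges:
    out_degree[src] = out_degree.get(src, 0) + 1

# Join edges with vertices on src to produce messages, then group by dst.
messages = {}
for src, dst in edges:
    messages[dst] = messages.get(dst, 0.0) + vertices[src] / out_degree[src]

# Apply step: one damped PageRank update per vertex.
ranks = {v: 0.15 + 0.85 * messages.get(v, 0.0) for v in vertices}
print(ranks)
```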

    Scalable Deep Learning on Distributed Infrastructures: Challenges, Techniques and Tools

    Deep Learning (DL) has seen immense success in recent years, leading to state-of-the-art results in various domains such as image recognition and natural language processing. One of the reasons for this success is the increasing size of DL models and the availability of vast amounts of training data. To keep improving the performance of DL, increasing the scalability of DL systems is necessary. In this survey, we perform a broad and thorough investigation of challenges, techniques, and tools for scalable DL on distributed infrastructures. This covers infrastructures for DL, methods for parallel DL training, multi-tenant resource scheduling, and the management of training and model data. Further, we analyze and compare 11 current open-source DL frameworks and tools and investigate which of the techniques are commonly implemented in practice. Finally, we highlight future research trends in DL systems that deserve further study.
    Comment: accepted at ACM Computing Surveys, to appear
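
    A compact sketch of synchronous data-parallel training, one family of parallel DL methods such a survey covers: each simulated worker computes gradients on its shard of the batch, and the averaged gradient updates a shared model replica (the averaging line stands in for an all-reduce).

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(64, 4)), rng.normal(size=64)
w = np.zeros(4)

def local_grad(Xs, ys, w):                  # one worker's gradient shard
    return Xs.T @ (Xs @ w - ys) / len(ys)

for _ in range(50):
    shards = zip(np.array_split(X, 4), np.array_split(y, 4))
    grads = [local_grad(Xs, ys, w) for Xs, ys in shards]  # 4 "workers"
    w -= 0.1 * np.mean(grads, axis=0)       # all-reduce, then apply
print(w)
```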