
    Tupleware: Redefining Modern Analytics

    There is a fundamental discrepancy between the targeted and actual users of current analytics frameworks. Most systems are designed for the data and infrastructure of the Googles and Facebooks of the world: petabytes of data distributed across large cloud deployments consisting of thousands of cheap commodity machines. Yet, the vast majority of users operate clusters ranging from a few to a few dozen nodes, analyze relatively small datasets of up to a few terabytes, and perform primarily compute-intensive operations. Targeting these users fundamentally changes the way we should build analytics systems. This paper describes the design of Tupleware, a new system specifically aimed at the challenges faced by the typical user. Tupleware's architecture brings together ideas from the database, compiler, and programming languages communities to create a powerful end-to-end solution for data analysis. We propose novel techniques that consider the data, computations, and hardware together to achieve maximum performance on a case-by-case basis. Our experimental evaluation quantifies the impact of our novel techniques and shows orders-of-magnitude performance improvements over alternative systems.
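
    Since the abstract stresses compiling chains of user-defined functions into tight loops, the sketch below illustrates that workflow style in Python. The names and API here are hypothetical illustrations, not Tupleware's actual frontend.

        # Hypothetical sketch of the UDF-centric workflow style that
        # Tupleware targets (illustrative API, not the system's own):
        # users chain small compute-intensive functions, and the system
        # compiles the whole chain into one tight loop per partition.

        def nearest_centroid(point, centroids):
            # UDF: index of the centroid closest to `point` (the
            # compute-heavy inner loop worth compiling to machine code).
            return min(range(len(centroids)),
                       key=lambda i: sum((a - b) ** 2
                                         for a, b in zip(point, centroids[i])))

        def assign(points, centroids):
            # A compiling runtime would fuse this map with downstream
            # aggregation instead of materializing the assignment list.
            return [nearest_centroid(p, centroids) for p in points]

        print(assign([(1.0, 2.0), (8.0, 9.0)],
                     [(0.0, 0.0), (10.0, 10.0)]))  # [0, 1]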

    Area-energy aware dataflow optimisation of visual tracking systems

    This paper presents an orderly dataflow-optimisation approach suitable for area- and energy-aware computer vision applications on FPGAs. Vision systems are increasingly being deployed in power-constrained scenarios, where the dataflow model of computation has become popular for describing complex algorithms. The dataflow model allows processing datapaths composed of several independent and well-defined computations. However, compilers are often unsuccessful in identifying domain-specific optimisation opportunities, resulting in wasted resources and power consumption. We present a methodology for the optimisation of dataflow networks according to patterns often found in computer vision systems, focusing on optimisations that are not discovered automatically by an optimising compiler. Code transformation using profiling and refactoring provides opportunities to optimise the design, targeting FPGA implementations and focusing on area and power reduction. Applying our refactoring methodology to a complex visual tracking algorithm resulted in significant reductions in power consumption and resource usage.
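
    As one concrete illustration of the kind of refactoring such a methodology applies, the sketch below fuses two single-input, single-output dataflow actors so the FIFO buffer between them, and its area and power cost on an FPGA, can be removed. The `Actor` class and `fuse` helper are hypothetical and are not taken from the paper's toolchain.

        # Hypothetical sketch (not the paper's toolchain) of a common
        # dataflow refactoring: fusing two single-input, single-output
        # actors so the FIFO between them disappears from the hardware.

        class Actor:
            def __init__(self, name, fn):
                self.name, self.fn = name, fn

            def fire(self, token):
                # Consume one input token, produce one output token.
                return self.fn(token)

        def fuse(a, b):
            # Combine two actors into one pipeline stage with no
            # intermediate buffer between the computations.
            return Actor(a.name + "+" + b.name, lambda t: b.fire(a.fire(t)))

        grayscale = Actor("grayscale", lambda px: sum(px) // 3)
        threshold = Actor("threshold", lambda v: 1 if v > 127 else 0)
        fused = fuse(grayscale, threshold)
        print(fused.name, fused.fire((200, 180, 160)))  # grayscale+threshold 1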

    Evaluating end-to-end optimization for data analytics applications in Weld

    Modern analytics applications use a diverse mix of libraries and functions. Unfortunately, there is no optimization across these libraries, resulting in performance penalties as high as an order of magnitude in many applications. To address this problem, we proposed Weld, a common runtime for existing data analytics libraries that performs key physical optimizations such as pipelining under existing, imperative library APIs. In this work, we further develop the Weld vision by designing an automatic adaptive optimizer for Weld applications and evaluating its impact on realistic data science workloads. Our optimizer eliminates multiple forms of overhead that arise when composing imperative libraries like Pandas and NumPy, and uses lightweight measurements to make data-dependent decisions at run-time in ad-hoc workloads where no statistics are available, with sub-second overhead. We also evaluate which optimizations have the largest impact in practice and whether Weld can be integrated into libraries incrementally. Our results are promising: using our optimizer, Weld accelerates data science workloads by up to 23X on one thread and 80X on eight threads, and its adaptive optimizations provide up to a 3.75X speedup over rule-based optimization. Moreover, Weld provides benefits even if just 4--5 operators in a library are ported to use it. Our results show that common runtime designs like Weld may be a viable approach to accelerating analytics.
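
    To make the pipelining idea concrete, here is a toy Python sketch of how a common runtime can defer library calls into a lazy, composed pipeline and then run all stages in one pass over the data. This illustrates the concept only; the `Lazy` class is hypothetical and is not Weld's actual API.

        # Toy illustration (not Weld's actual API) of a common runtime's
        # core trick: calls build a lazy, composed pipeline instead of
        # executing eagerly, so stages fuse into a single pass with no
        # intermediate collections between them.
        import math

        class Lazy:
            def __init__(self, fn=lambda x: x):
                self.fn = fn  # elementwise computation composed so far

            def map(self, g):
                # Fuse g into the pipeline (the "pipelining" optimization)
                # rather than materializing an intermediate result.
                f = self.fn
                return Lazy(lambda x: g(f(x)))

            def evaluate(self, data):
                # A real runtime would JIT-compile the fused stages; here
                # we simply make one pass over the input.
                return [self.fn(x) for x in data]

        pipeline = Lazy().map(lambda x: x * 2).map(lambda x: x + 1).map(math.sqrt)
        print(pipeline.evaluate([0.0, 4.0, 12.0]))  # [1.0, 3.0, 5.0]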

    Personalized large-scale classification of public tenders on Hadoop

    This project was completed as part of an innovation partnership between Fujitsu Canada and Université Laval. The needs and objectives of the project were centered on a business problem defined jointly with Fujitsu. Our project aimed to classify a corpus of electronic public tenders using state-of-the-art Hadoop big data technology. The objective was to identify, with high recall, public tenders relevant to the IT services business of Fujitsu Canada. A small-scale prototype based on the BNS (Bi-Normal Separation) algorithm was empirically shown to classify the public tender corpus with high recall (93%). The prototype was then re-implemented on a full-scale Hadoop cluster using Apache Pig for the data preparation pipeline and Apache Mahout for classification. Our experiments show that the large-scale system not only maintains high recall (91%) on the classification task, but can readily take advantage of the massive scalability gains made possible by Hadoop's distributed architecture.
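
    The BNS feature-scoring metric mentioned above has a simple closed form: a term's score is |F⁻¹(tpr) − F⁻¹(fpr)|, where F⁻¹ is the inverse cumulative distribution function of the standard normal, tpr is the term's true-positive rate, and fpr its false-positive rate. A minimal sketch follows; the function name and the clipping constant are illustrative choices, not taken from the paper.

        # Minimal sketch of Bi-Normal Separation (BNS) term scoring.
        # scipy's norm.ppf is the inverse normal CDF (F^-1); the eps
        # clipping constant is an illustrative choice.
        from scipy.stats import norm

        def bns_score(tp, fp, pos, neg, eps=0.0005):
            """|F^-1(tpr) - F^-1(fpr)| for one term, where tp/fp are the
            term's occurrence counts in the pos/neg document sets."""
            # Clip rates away from 0 and 1 so the inverse CDF stays finite.
            tpr = min(max(tp / pos, eps), 1 - eps)
            fpr = min(max(fp / neg, eps), 1 - eps)
            return abs(norm.ppf(tpr) - norm.ppf(fpr))

        # A term in 80 of 100 relevant tenders but only 50 of 900
        # irrelevant ones scores highly and is kept as a feature.
        print(bns_score(tp=80, fp=50, pos=100, neg=900))  # ~2.43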