
10 research outputs found

REX: Recursive, Delta-Based Data-Centric Computation

Author: Guha Sudipto
Ives Zachary G.
Mihaylov Svilen R.
Publication date: 01/01/2012

In today's Web and social network environments, query workloads include ad hoc and OLAP queries, as well as iterative algorithms that analyze data relationships (e.g., link analysis, clustering, learning). Modern DBMSs support ad hoc and OLAP queries, but most are not robust enough to scale to large clusters. Conversely, "cloud" platforms like MapReduce execute chains of batch tasks across clusters in a fault-tolerant way, but have too much overhead to support ad hoc queries. Moreover, both classes of platform incur significant overhead in executing iterative data analysis algorithms. Most such iterative algorithms repeatedly refine portions of their answers, until some convergence criterion is reached. However, general cloud platforms typically must reprocess all data in each step. DBMSs that support recursive SQL are more efficient in that they propagate only the changes in each step, but they still accumulate each iteration's state, even if it is no longer useful. User-defined functions are also typically harder to write for DBMSs than for cloud platforms. We seek to unify the strengths of both styles of platforms, with a focus on supporting iterative computations in which changes, in the form of deltas, are propagated from iteration to iteration, and state is efficiently updated in an extensible way. We present a programming model oriented around deltas, describe how we execute and optimize such programs in our REX runtime system, and validate that our platform also handles failures gracefully. We experimentally validate our techniques, and show speedups over the competing methods ranging from 2.5 to nearly 100 times. Comment: VLDB201
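
The delta-oriented iteration described in this abstract can be illustrated with a minimal single-machine sketch. It is not the REX programming model or runtime; the function name `shortest_paths_delta` and the edge-list input format are assumptions made for illustration. Each round touches only the vertices whose distance changed in the previous round, updates the state in place, and stops when the delta is empty.

```python
from collections import defaultdict


def shortest_paths_delta(edges, source):
    """Single-source shortest paths where each round processes only the delta.

    `edges` is an iterable of (u, v, weight) triples with non-negative weights.
    Hypothetical helper for illustration; not part of REX.
    """
    graph = defaultdict(list)
    for u, v, w in edges:
        graph[u].append((v, w))

    dist = {source: 0}    # mutable state, updated in place each round
    delta = {source: 0}   # changes produced by the previous round
    rounds = 0
    while delta:          # convergence criterion: nothing changed
        next_delta = {}
        for u, d in delta.items():
            for v, w in graph[u]:
                candidate = d + w
                if candidate < dist.get(v, float("inf")):
                    dist[v] = candidate        # refine the answer
                    next_delta[v] = candidate  # propagate only this change
        delta = next_delta
        rounds += 1
    return dist, rounds


edges = [("a", "b", 1), ("b", "c", 2), ("a", "c", 5), ("c", "d", 1)]
dist, rounds = shortest_paths_delta(edges, "a")
print(dist, rounds)
```

Only `dist` and the current `delta` are kept between rounds; earlier deltas are discarded, which is the contrast the abstract draws with recursive SQL implementations that accumulate each iteration's state.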

arXiv.org e-Print Archive

CiteSeerX

Squall: Scalable Real-time Analytics

Author: Dashti Mohammad
El Seidy Mohammed
Espino Timón Daniel
Guliyev Khayyam Mubariz Oglu
Klonatos Ioannis
Koch Christoph
Vitorovic Aleksandar
Vu Minh Khue
Publication venue: EPFL
Publication date: 10/03/2016

Squall is a scalable online query engine that runs complex analytics in a cluster using skew-resilient, adaptive operators. Squall builds on state-of-the-art partitioning schemes and local algorithms, including some of our own. This paper presents an overview of Squall, including some novel join operators. The paper also presents lessons learned over the five years of working on this system, and outlines the plan for the proposed system demonstration.

Infoscience - École polytechnique fédérale de Lausanne

Experimental evaluation of big data querying tools

Author: Rodrigues Mário Miguel Lucas
Publication date: 01/01/2017

In recent years, the term Big Data has become a widely debated topic in many business areas. One of the main challenges related to this concept is how to handle the enormous volume and variety of data efficiently. Given the well-known complexity and volume of data associated with Big Data, efficient querying mechanisms are needed for data analysis. Motivated by the rapid development of tools and frameworks for Big Data, there is much discussion about querying tools and, more specifically, which are most appropriate for specific analytical needs. This dissertation describes and compares the main characteristics and architectures of the following well-known Big Data analytical tools: Drill, HAWQ, Hive, Impala, Presto, and Spark. To test the performance of these tools, we also describe the process of preparing, configuring, and administering a Hadoop cluster on which to install and use them, providing an environment in which to evaluate their performance and identify the scenarios best suited to each. For this evaluation we used the TPC-H and TPC-DS benchmarks. The results showed that in-memory processing tools such as HAWQ, Impala, and Presto deliver better results and performance on small and medium-sized datasets. However, the tools with slower execution times, especially Hive, appear to catch up with the better-performing tools as the benchmark datasets grow.

Repositório Comum

Autonomous Database Management at Scale: Automated Tuning, Performance Diagnosis, and Resource Decentralization

Author: Yoon Dong Young
Publication date: 01/01/2019

Database administration has always been a challenging task, and is becoming even more difficult with the rise of public and private clouds. Today, many enterprises outsource their database operation to cloud service providers (CSPs) in order to reduce operating costs. CSPs, now tasked with managing an extremely large number of database instances, cannot simply rely on database administrators. In fact, humans have become a bottleneck in the scalability and profitability of cloud offerings. This has created a massive demand for building autonomous databases, that is, systems that operate with little or zero human supervision. While autonomous databases have gained much attention in recent years in both academia and industry, many of the existing techniques remain limited to automating parameter tuning, backup/recovery, and monitoring. Consequently, there is much to be done before realizing a fully autonomous database. This dissertation examines and offers new automation techniques for three specific areas of modern database management.
1. Automated Tuning: We propose a new generation of physical database designers that are robust against uncertainty in future workloads. Given the rising popularity of approximate databases, we also develop an optimal, hybrid sampling strategy that enables efficient join processing on offline samples, a long-standing open problem in approximate query processing.
2. Performance Diagnosis: We design practical tools and algorithms for assisting database administrators in quickly and reliably diagnosing performance problems in their transactional databases.
3. Resource Decentralization: To achieve autonomy among database components in a shared environment, we propose a highly efficient, starvation-free, and fully decentralized distributed lock manager for distributed database clusters.
PHD. Computer Science & Engineering. University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/153349/1/dyoon_1.pd

Deep Blue Documents at the University of Michigan

Proceedings of the First International Conference on Software and Data Technologies, Volume 1

Author: Filipe Joaquim
Publication venue: INSTICC PRESS
Publication date: 11/09/2006

University of Twente Research Information


Automated Testing and Debugging for Big Data Analytics

Author: Gulzar Muhammad Ali
Publication venue: eScholarship, University of California
Publication date: 01/01/2020

The prevalence of big data analytics in almost every large-scale software system has generated a substantial push to build data-intensive scalable computing (DISC) frameworks such as Google MapReduce and Apache Spark that can fully harness the power of existing data centers. However, frameworks once used by domain experts are now being leveraged by data scientists, business analysts, and researchers. This shift in user demographics calls for immediate advancements in the development, debugging, and testing practices of big data applications, which are falling behind compared to the DISC framework design and implementation. In practice, big data applications often fail as users are unable to test all behaviors emerging from interleaving dataflow operators, user-defined functions, and the framework's code. "Testing based on a random sample" rarely guarantees reliability, and "trial and error" and "print" debugging methods are expensive and time-consuming. Thus, the current practice of developing a big data application must be improved and the tools built to enhance the developer's productivity must adapt to the distinct characteristics of data-intensive scalable computing. By synthesizing ideas from software engineering and database systems, our hypothesis is that we can design effective and scalable testing and debugging algorithms for big data analytics without compromising the performance and efficiency of the underlying DISC framework. To design such techniques, we investigate how we can build interactive and responsive debugging primitives that significantly reduce the debugging time, yet do not pose much performance overhead on big data applications. Furthermore, we investigate how we can leverage data provenance techniques from databases and fault-isolation algorithms from software engineering to pinpoint the minimal subset of failure-inducing inputs efficiently.
To improve the reliability of big data analytics, we investigate how we can abstract the semantics of dataflow operators and use them in tandem with the semantics of user-defined functions to generate a minimum set of synthetic test inputs capable of revealing more defects than the entire input dataset. To examine the first hypothesis, we introduce interactive, real-time debugging primitives for big data analytics through innovative and scalable debugging features such as simulated breakpoint, dynamic watchpoint, and crash culprit identification. Second, we design a new automated fault localization approach that combines insights from both the software engineering and database literature to bring delta debugging closer to a reality in big data applications by leveraging data provenance and by constructing systems optimizations for debugging provenance queries. Lastly, we devise a new symbolic-execution based white-box testing algorithm for big data applications that abstracts the implementation of dataflow operators using logical specifications instead of modeling their implementations and combines them with the semantics of any arbitrary user-defined function. We instantiate the idea of an interactive debugging algorithm as BigDebug, the idea of an automated debugging algorithm as BigSift, and the idea of symbolic execution-based testing as BigTest. Our investigation shows that the interactive debugging primitives can scale to terabytes: our record-level tracing incurs less than 25% overhead on average and provides up to 100% time saving compared to the baseline replay debugger. Second, we observe that by combining data provenance with delta debugging, we can identify the minimum faulty input in just under 30% of the original job execution time.
Lastly, we verify that by abstracting dataflow operators using logical specifications, we can efficiently generate the most concise test data suitable for local testing while revealing twice as many faults as prior approaches. Our investigations collectively demonstrate that developer productivity can be significantly improved through effective and scalable testing and debugging techniques for big data analytics, without impacting the DISC framework's performance. This dissertation affirms the feasibility of automated debugging and testing techniques for big data analytics, techniques that were previously considered infeasible for large-scale data processing.
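
The delta debugging idea mentioned in this abstract can be sketched in a few lines. The code below is a simplified version of the classic `ddmin` procedure, not BigSift itself: it does not use data provenance or any of the system optimizations the abstract describes. The function name `ddmin` and the `fails` predicate are assumptions for illustration.

```python
def ddmin(inputs, fails):
    """Shrink `inputs` to a sublist on which `fails` still returns True.

    The result is 1-minimal: removing any single remaining record makes the
    failure disappear. `fails` is a hypothetical user-supplied test oracle.
    """
    if not fails(inputs):
        raise ValueError("the full input must reproduce the failure")
    n = 2  # current number of chunks to split the input into
    while len(inputs) >= 2:
        chunk = max(1, len(inputs) // n)
        subsets = [inputs[i:i + chunk] for i in range(0, len(inputs), chunk)]
        reduced = False
        for i, subset in enumerate(subsets):
            complement = [x for j, s in enumerate(subsets) if j != i for x in s]
            if fails(subset):            # failure is inside one chunk
                inputs, n, reduced = subset, 2, True
                break
            if fails(complement):        # failure survives dropping one chunk
                inputs, n, reduced = complement, max(n - 1, 2), True
                break
        if not reduced:
            if n >= len(inputs):         # already at single-record granularity
                break
            n = min(len(inputs), n * 2)  # split more finely and retry
    return inputs


# Example: the job fails only when records 7 and 13 are both present.
culprits = ddmin(list(range(20)), lambda rs: 7 in rs and 13 in rs)
print(culprits)
```

Every call to `fails` corresponds to rerunning the job on a subset of the input, which is why restricting the search space (for example with provenance, as the abstract describes) matters at scale.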

eScholarship - University of California

Improving Efficiency, Expressiveness and Security of Searchable Encryption

Author: Demertzis Ioannis
Publication date: 01/01/2020

A large part of our personal data, ranging from medical and financial records to our social activity, is stored online in cloud servers. Frequent data breaches threaten to expose these data to malicious third parties, often with catastrophic consequences (estimated at several billion US dollars annually). In this thesis, we use, extend and improve Searchable Encryption (SE) in order to build the next generation of encrypted databases and systems that will prevent such undesirable situations. Our goal is to build systems that are both practical and provably secure, while allowing expressive search and computation on encrypted data. Towards this goal, we have proposed new SE schemes that achieve the following: (i) have better search/computation time, (ii) allow expressive queries such as range, join, group-by, as well as dynamic query workloads, and (iii) provide new adjustable security-efficiency trade-offs, leading to robust and efficient schemes even against very powerful adversaries.

Digital Repository at the University of Maryland

Language Support for Distributed Functional Programming

Author: Miller Heather
Publication venue: Lausanne, EPFL
Publication date: 13/10/2015

Software development has taken a fundamental turn. Software today has gone from simple, closed programs running on a single machine, to massively open programs, patching together user experiences by way of responses received via hundreds of network requests spanning multiple machines. At the same time, as data continues to stockpile, systems for big data analytics are on the rise. Yet despite this trend towards distributing computation, issues at the level of the language and runtime abound. Serialization is still a costly runtime affair, crashing running systems and confounding developers. Function closures are being added to APIs for big data processing for use by end-users without reliably being able to transmit them over the network. And many of the frameworks developed for handling multiple concurrent requests by way of asynchronous programming facilities rely on blocking threads, causing serious scalability issues. This thesis describes a number of extensions and libraries for the Scala programming language that aim to address these issues and to provide a more reliable foundation on which to build distributed systems. This thesis presents a new approach to serialization called pickling based on the idea of generating and composing functional pickler combinators statically. The approach shifts the burden of serialization to compile time as much as possible, enabling users to catch serialization errors at compile time rather than at runtime. Further, by virtue of serialization code being generated at compile time, our framework is shown to be significantly more performant than other state-of-the-art serialization frameworks. We also generalize our technique for generating serialization code to generic functions other than pickling.
Second, in light of the trend of distributed data-parallel frameworks being designed around functional patterns where closures are transmitted across cluster nodes to large-scale persistent datasets, this thesis introduces a new closure-like abstraction and type system, called spores, that can guarantee closures to be serializable, thread-safe, or even have custom user-defined properties. Crucially, our system is based on the principle of encoding type information corresponding to captured variables in the type of a spore. We prove our type system sound, implement our approach for Scala, evaluate its practicality through a small empirical study, and show the power of these guarantees through a case analysis of real-world distributed and concurrent frameworks that this safe foundation for closures facilitates. Finally, we bring together the above building blocks, pickling and spores, to form the basis of a new programming model called function-passing. Function-passing is based on the idea of a distributed persistent data structure which stores in its nodes transformations to data rather than the distributed data itself, simplifying fault recovery by design. Lazy evaluation is also central to our model; by incorporating laziness into our design only at the point of initiating network communication, our model remains easy to reason about while remaining efficient in time and memory. We formalize our programming model in the form of a small-step operational semantics which includes a precise specification of the semantics of functional fault recovery, and we provide an open-source implementation of our model in and for Scala.
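
The function-passing idea of storing transformations rather than data, and evaluating them lazily, can be illustrated with a small single-process sketch. It is written in Python rather than Scala and is not the thesis's implementation; the class name `Lineage` and its methods are hypothetical. A node records only how to load its base data and which functions to apply, so a lost result can be rebuilt by replaying that record.

```python
class Lineage:
    """A node that stores transformations instead of materialized data.

    Hypothetical illustration: `source` is a zero-argument callable that
    (re)loads the base data, and `transforms` are element-wise functions.
    """

    def __init__(self, source, transforms=()):
        self._source = source
        self._transforms = tuple(transforms)
        self._cache = None  # materialized result, filled in lazily

    def map(self, fn):
        # Persistent: returns a new node and leaves this one unchanged.
        return Lineage(self._source, self._transforms + (fn,))

    def force(self):
        # Lazy: nothing is loaded or computed until the result is requested.
        if self._cache is None:
            data = self._source()
            for fn in self._transforms:
                data = [fn(x) for x in data]
            self._cache = data
        return self._cache

    def lose_cache(self):
        # Simulates a fault that wipes the materialized result.
        self._cache = None


node = Lineage(lambda: [1, 2, 3]).map(lambda x: x + 1).map(lambda x: x * 10)
first = node.force()
node.lose_cache()          # simulated failure
recovered = node.force()   # rebuilt by replaying the stored transformations
print(first == recovered)
```

Because recovery is just re-evaluation of the stored functions, no separate checkpoint of the data is needed in this sketch; the thesis formalizes the real semantics, which this example does not attempt to reproduce.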

Infoscience - École polytechnique fédérale de Lausanne

Parallel Processing of “Group-By Join” Queries on Shared Nothing Machines

Author: A. Datta
A. Shatdal
D. Taniar
D.B. Skillicorn
D.J. DeWitt
J.L. Carter
J.L. Wolf
K.A. Hua
L.G. Valiant
M. Bamha
M. Seetha
R.H. Bisseling
W.P. Yan
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2008

Crossref