293 research outputs found
Reify Your Collection Queries for Modularity and Speed!
Modularity and efficiency are often contradicting requirements, such that
programers have to trade one for the other. We analyze this dilemma in the
context of programs operating on collections. Performance-critical code using
collections need often to be hand-optimized, leading to non-modular, brittle,
and redundant code. In principle, this dilemma could be avoided by automatic
collection-specific optimizations, such as fusion of collection traversals,
usage of indexing, or reordering of filters. Unfortunately, it is not obvious
how to encode such optimizations in terms of ordinary collection APIs, because
the program operating on the collections is not reified and hence cannot be
analyzed.
We propose SQuOpt, the Scala Query Optimizer--a deep embedding of the Scala
collections API that allows such analyses and optimizations to be defined and
executed within Scala, without relying on external tools or compiler
extensions. SQuOpt provides the same "look and feel" (syntax and static typing
guarantees) as the standard collections API. We evaluate SQuOpt by
re-implementing several code analyses of the Findbugs tool using SQuOpt, show
average speedups of 12x with a maximum of 12800x and hence demonstrate that
SQuOpt can reconcile modularity and efficiency in real-world applications.Comment: 20 page
Using Links to prototype a Database Wiki
Both relational databases and wikis have strengths that make them attractive for use in collaborative applications. In the last decade, database-backed Web applications have been used extensively to develop valuable shared biological references called curated databases. Databases offer many advantages such as scalability, query optimization and concurrency control, but are not easy to use and lack other features needed for collaboration. Wikis have become very popular for early-stage biocuration projects because they are easy to use, encourage sharing and collaboration, and provide built-in support for archiving, history-tracking and annotation. However, curation projects often outgrow the limited capabilities of wikis for structuring and efficiently querying data at scale, necessitating a painful phase transition to a database-backed Web application. We perceive a need for a new class of general-purpose system, which we call a Database Wiki, that combines flexible wiki-like support for collaboration with robust database-like capabilities for structuring and querying data. This paper presents DBWiki, a design prototype for such a system written in the Web programming language Links. We present the architecture, typical use, and wiki markup language design for DBWiki and discuss features of Links that provided unique advantages for rapid Web/database application prototyping
The Family of MapReduce and Large Scale Data Processing Systems
In the last two decades, the continuous increase of computational power has
produced an overwhelming flow of data which has called for a paradigm shift in
the computing architecture and large scale data processing mechanisms.
MapReduce is a simple and powerful programming model that enables easy
development of scalable parallel applications to process vast amounts of data
on large clusters of commodity machines. It isolates the application from the
details of running a distributed program such as issues on data distribution,
scheduling and fault tolerance. However, the original implementation of the
MapReduce framework had some limitations that have been tackled by many
research efforts in several followup works after its introduction. This article
provides a comprehensive survey for a family of approaches and mechanisms of
large scale data processing mechanisms that have been implemented based on the
original idea of the MapReduce framework and are currently gaining a lot of
momentum in both research and industrial communities. We also cover a set of
introduced systems that have been implemented to provide declarative
programming interfaces on top of the MapReduce framework. In addition, we
review several large scale data processing systems that resemble some of the
ideas of the MapReduce framework for different purposes and application
scenarios. Finally, we discuss some of the future research directions for
implementing the next generation of MapReduce-like solutions.Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other author
First-Class Functions for First-Order Database Engines
We describe Query Defunctionalization which enables off-the-shelf first-order
database engines to process queries over first-class functions. Support for
first-class functions is characterized by the ability to treat functions like
regular data items that can be constructed at query runtime, passed to or
returned from other (higher-order) functions, assigned to variables, and stored
in persistent data structures. Query defunctionalization is a non-invasive
approach that transforms such function-centric queries into the data-centric
operations implemented by common query processors. Experiments with XQuery and
PL/SQL database systems demonstrate that first-order database engines can
faithfully and efficiently support the expressive "functions as data" paradigm.Comment: Proceedings of the 14th International Symposium on Database
Programming Languages (DBPL 2013), August 30, 2013, Riva del Garda, Trento,
Ital
Yedalog: Exploring Knowledge at Scale
With huge progress on data processing frameworks, human programmers are frequently the bottleneck when analyzing large repositories of data. We introduce Yedalog, a declarative programming language that allows programmers to mix data-parallel pipelines and computation seamlessly in a single language. By contrast, most existing tools for data-parallel computation embed a sublanguage of data-parallel pipelines in a general-purpose language, or vice versa. Yedalog extends Datalog, incorporating not only computational features from logic programming, but also features for working with data structured as nested records. Yedalog programs can run both on a single machine, and distributed across a cluster in batch and interactive modes, allowing programmers to mix different modes of execution easily
Compilation and Code Optimization for Data Analytics
The trade-offs between the use of modern high-level and low-level programming languages in constructing complex software artifacts are well known. High-level languages allow for greater programmer productivity: abstraction and genericity allow for the same functionality to be implemented with significantly less code compared to low-level languages. Modularity, object-orientation, functional programming, and powerful type systems allow programmers not only to create clean abstractions and protect them from leaking, but also to define code units that are reusable and easily composable, and software architectures that are adaptable and extensible. The abstraction, succinctness, and modularity of high-level code help to avoid software bugs and facilitate debugging and maintenance.
The use of high-level languages comes at a performance cost: increased indirection due to abstraction, virtualization, and interpretation, and superfluous work, particularly in the form of tempory memory allocation and deallocation to support objects and encapsulation.
As a result of this, the cost of high-level languages for performance-critical systems may seem prohibitive.
The vision of abstraction without regret argues that it is possible to use high-level languages for building performance-critical systems that allow for both productivity and high performance, instead of trading off the former for the latter. In this thesis, we realize this vision for building different types of data analytics systems. Our means of achieving this is by employing compilation. The goal is to compile away expensive language features -- to compile high-level code down to efficient low-level code
- …