26 research outputs found
Pregelix: Big(ger) Graph Analytics on A Dataflow Engine
There is a growing need for distributed graph processing systems that are
capable of gracefully scaling to very large graph datasets. Unfortunately, this
challenge has not been easily met due to the intense memory pressure imposed by
process-centric, message passing designs that many graph processing systems
follow. Pregelix is a new open source distributed graph processing system that
is based on an iterative dataflow design that is better tuned to handle both
in-memory and out-of-core workloads. As such, Pregelix offers improved
performance characteristics and scaling properties over current open source
systems (e.g., we have seen up to 15x speedup compared to Apache Giraph and up
to 35x speedup compared to distributed GraphLab), and makes more effective use
of available machine resources to support Big(ger) Graph Analytics
Hyracks: A flexible and extensible foundation for data-intensive computing
Abstract—Hyracks is a new partitioned-parallel software plat-form designed to run data-intensive computations on large shared-nothing clusters of computers. Hyracks allows users to express a computation as a DAG of data operators and connec-tors. Operators operate on partitions of input data and produce partitions of output data, while connectors repartition operators’ outputs to make the newly produced partitions available at the consuming operators. We describe the Hyracks end user model, for authors of dataflow jobs, and the extension model for users who wish to augment Hyracks ’ built-in library with new operator and/or connector types. We also describe our initial Hyracks implementation. Since Hyracks is in roughly the same space as the open source Hadoop platform, we compare Hyracks with Hadoop experimentally for several different kinds of use cases. The initial results demonstrate that Hyracks has significant promise as a next-generation platform for data-intensive applications. I
Iterative MapReduce for Large Scale Machine Learning
Large datasets ("Big Data") are becoming ubiquitous because the potential
value in deriving insights from data, across a wide range of business and
scientific applications, is increasingly recognized. In particular, machine
learning - one of the foundational disciplines for data analysis, summarization
and inference - on Big Data has become routine at most organizations that
operate large clouds, usually based on systems such as Hadoop that support the
MapReduce programming paradigm. It is now widely recognized that while
MapReduce is highly scalable, it suffers from a critical weakness for machine
learning: it does not support iteration. Consequently, one has to program
around this limitation, leading to fragile, inefficient code. Further, reliance
on the programmer is inherently flawed in a multi-tenanted cloud environment,
since the programmer does not have visibility into the state of the system when
his or her program executes. Prior work has sought to address this problem by
either developing specialized systems aimed at stylized applications, or by
augmenting MapReduce with ad hoc support for saving state across iterations
(driven by an external loop). In this paper, we advocate support for looping as
a first-class construct, and propose an extension of the MapReduce programming
paradigm called {\em Iterative MapReduce}. We then develop an optimizer for a
class of Iterative MapReduce programs that cover most machine learning
techniques, provide theoretical justifications for the key optimization steps,
and empirically demonstrate that system-optimized programs for significant
machine learning tasks are competitive with state-of-the-art specialized
solutions
AsterixDB: A Scalable, Open Source BDMS
AsterixDB is a new, full-function BDMS (Big Data Management System) with a
feature set that distinguishes it from other platforms in today's open source
Big Data ecosystem. Its features make it well-suited to applications like web
data warehousing, social data storage and analysis, and other use cases related
to Big Data. AsterixDB has a flexible NoSQL style data model; a query language
that supports a wide range of queries; a scalable runtime; partitioned,
LSM-based data storage and indexing (including B+-tree, R-tree, and text
indexes); support for external as well as natively stored data; a rich set of
built-in types; support for fuzzy, spatial, and temporal types and queries; a
built-in notion of data feeds for ingestion of data; and transaction support
akin to that of a NoSQL store.
Development of AsterixDB began in 2009 and led to a mid-2013 initial open
source release. This paper is the first complete description of the resulting
open source AsterixDB system. Covered herein are the system's data model, its
query language, and its software architecture. Also included are a summary of
the current status of the project and a first glimpse into how AsterixDB
performs when compared to alternative technologies, including a parallel
relational DBMS, a popular NoSQL store, and a popular Hadoop-based SQL data
analytics platform, for things that both technologies can do. Also included is
a brief description of some initial trials that the system has undergone and
the lessons learned (and plans laid) based on those early "customer"
engagements
An Efficient Foundation for Big Data Processing on Modern Clusters
In recent years, the world has seen an explosion in the amount of data being generated. Google proposed the MapReduce framework to allow programmers easily process massive amounts of data in parallel using a cluster of shared-nothing commodity machines. What started out as a tool for human efficiency subsequently began to be used as an intermediate representation for queries compiled from higher-level declarative languages. In this thesis, we present an alternate software stack for building scalable Big Data systems. We specifically focus on two parts of the stack. Hyracks is a new partitioned-parallel runtime layer that provides an efficient, generalized model for executing data-processing jobs on a cluster of commodity machines. Algebricks is a compiler framework that helps to build high-level declarative language compilers for parallel processing on top of Hyracks