A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures
Scientific problems that depend on processing large amounts of data require
overcoming challenges in multiple areas: managing large-scale data
distribution, co-placement and scheduling of data with compute resources, and
storing and transferring large volumes of data. We analyze the ecosystems of
the two prominent paradigms for data-intensive applications, hereafter referred
to as the high-performance computing (HPC) paradigm and the Apache Hadoop
paradigm. We propose a common basis of shared terminology and functional
factors upon which to analyze the two paradigms. We discuss the concept of "Big Data Ogres"
and their facets as means of understanding and characterizing the most common
application workloads found across the two paradigms. We then discuss the
salient features of the two paradigms, and compare and contrast the two
approaches. Specifically, we examine common implementations of these
paradigms, shed light on the reasons for their current "architecture", and
discuss some typical workloads that utilize them. In spite of the significant
software distinctions, we believe there is architectural similarity. We discuss
the potential integration of different implementations, across the different
levels and components. Our comparison progresses from a fully qualitative
examination of the two paradigms to a semi-quantitative methodology. We use a
simple and widely used Ogre (K-means clustering) and characterize its
performance on a range of representative platforms, covering several
implementations from both paradigms. Our experiments provide insight into the relative strengths
of the two paradigms. We propose that the set of Ogres will serve as a
benchmark to evaluate the two paradigms along different dimensions.
Comment: 8 pages, 2 figures
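The Ogre used in the semi-quantitative comparison, K-means clustering, is simple enough to sketch in a few lines. The pure-Python version of Lloyd's algorithm below is an illustrative sketch only, not any of the implementations benchmarked in the paper:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm: the K-means Ogre in its simplest form."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # Update step: recompute each centroid as the mean of its cluster
        # (keep the old centroid if a cluster happens to be empty).
        centroids = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
cents = kmeans(pts, k=2)
```

The alternating assignment/update structure is what makes K-means a useful probe: the assignment step is naturally data-parallel, while the update step requires a global aggregation, so both paradigms must exercise their communication machinery.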
MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That's Not a Nail!
Hadoop is currently the large-scale data analysis "hammer" of choice, but
there exist classes of algorithms that aren't "nails", in the sense that they
are not particularly amenable to the MapReduce programming model. To address
this, researchers have proposed MapReduce extensions or alternative programming
models in which these algorithms can be elegantly expressed. This essay
espouses a very different position: that MapReduce is "good enough", and that
instead of trying to invent screwdrivers, we should simply get rid of
everything that's not a nail. To be more specific, much discussion in the
literature surrounds the fact that iterative algorithms are a poor fit for
MapReduce: the simple solution is to find alternative non-iterative algorithms
that solve the same problem. This essay captures my personal experiences as an
academic researcher as well as a software engineer in a "real-world" production
analytics environment. From this combined perspective I reflect on the current
state and future of "big data" research.
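To make the hammer-and-nail metaphor concrete, a "nail" is any computation that decomposes into one map phase and one reduce phase. The toy word count below is my own illustrative sketch of that shape in plain Python, not code from the essay:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records, mapper):
    # Apply the user's mapper to every record; collect (key, value) pairs.
    return [kv for r in records for kv in mapper(r)]

def reduce_phase(pairs, reducer):
    # Sorting by key stands in for the framework's shuffle/sort step.
    pairs.sort(key=itemgetter(0))
    return {k: reducer(k, [v for _, v in group])
            for k, group in groupby(pairs, key=itemgetter(0))}

# A "nail": word count needs exactly one map pass and one reduce pass.
docs = ["big data is big", "data is data"]
counts = reduce_phase(
    map_phase(docs, lambda doc: [(w, 1) for w in doc.split()]),
    lambda word, ones: sum(ones),
)
```

Iterative algorithms break this shape because their output feeds back into another map phase; the essay's position is to prefer algorithms that do not need that feedback loop.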
Iterative MapReduce for Large Scale Machine Learning
Large datasets ("Big Data") are becoming ubiquitous because the potential
value in deriving insights from data, across a wide range of business and
scientific applications, is increasingly recognized. In particular, machine
learning - one of the foundational disciplines for data analysis, summarization
and inference - on Big Data has become routine at most organizations that
operate large clouds, usually based on systems such as Hadoop that support the
MapReduce programming paradigm. It is now widely recognized that while
MapReduce is highly scalable, it suffers from a critical weakness for machine
learning: it does not support iteration. Consequently, one has to program
around this limitation, leading to fragile, inefficient code. Further, reliance
on the programmer is inherently flawed in a multi-tenanted cloud environment,
since the programmer does not have visibility into the state of the system when
his or her program executes. Prior work has sought to address this problem by
either developing specialized systems aimed at stylized applications, or by
augmenting MapReduce with ad hoc support for saving state across iterations
(driven by an external loop). In this paper, we advocate support for looping as
a first-class construct, and propose an extension of the MapReduce programming
paradigm called Iterative MapReduce. We then develop an optimizer for a
class of Iterative MapReduce programs that cover most machine learning
techniques, provide theoretical justifications for the key optimization steps,
and empirically demonstrate that system-optimized programs for significant
machine learning tasks are competitive with state-of-the-art specialized
solutions.
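The "external loop" pattern the paper argues against can be sketched as a driver program that re-launches a map/reduce-style pass for every iteration of gradient descent. This is an illustrative toy (one-variable least squares in plain Python), not the paper's Iterative MapReduce system:

```python
def mr_gradient(data, w):
    """One simulated MapReduce pass computing the average gradient."""
    # "Map": each record emits its gradient contribution for the
    # least-squares loss (w*x - y)^2 / 2, plus a count of 1.
    partials = [((w * x - y) * x, 1) for x, y in data]
    # "Reduce": sum contributions and counts, then average in the driver.
    g = sum(p for p, _ in partials)
    n = sum(c for _, c in partials)
    return g / n

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # samples from y = 2x
w = 0.0
for _ in range(200):                  # the loop lives in the driver,
    w -= 0.1 * mr_gradient(data, w)   # not in the MapReduce runtime
```

Each pass through `mr_gradient` corresponds to a full job launch in a real cluster, so the driver pays scheduling and state-reloading costs on every iteration; making the loop a first-class construct lets the system optimize across iterations instead.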
