168 research outputs found
i2MapReduce: Incremental MapReduce for Mining Evolving Big Data
As new data and updates are constantly arriving, the results of data mining
applications become stale and obsolete over time. Incremental processing is a
promising approach to refreshing mining results. It utilizes previously saved
states to avoid the expense of re-computation from scratch.
In this paper, we propose i2MapReduce, a novel incremental processing
extension to MapReduce, the most widely used framework for mining big data.
Compared with the state-of-the-art work on Incoop, i2MapReduce (i) performs
key-value pair level incremental processing rather than task level
re-computation, (ii) supports not only one-step computation but also more
sophisticated iterative computation, which is widely used in data mining
applications, and (iii) incorporates a set of novel techniques to reduce I/O
overhead for accessing preserved fine-grain computation states. We evaluate
i2MapReduce using a one-step algorithm and three iterative algorithms with
diverse computation characteristics. Experimental results on Amazon EC2 show
significant performance improvements of i2MapReduce compared to both plain and
iterative MapReduce performing re-computation.
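The idea of key-value pair level incremental processing can be illustrated with a minimal sketch. This is not i2MapReduce's actual implementation or API: the class `IncrementalWordCount`, its fields, and the word-count workload are all hypothetical names chosen for illustration. The sketch preserves the map output of every input record, and when records are inserted, changed, or deleted it re-runs the reduce function only for the keys that were affected.

```python
from collections import Counter, defaultdict

def map_fn(text):
    """Emit one (word, count) pair per distinct word in a record."""
    return list(Counter(text.split()).items())

def reduce_fn(values):
    """Combine all preserved values for one key."""
    return sum(values)

class IncrementalWordCount:
    """Hypothetical sketch: preserved fine-grain state at key-value level."""

    def __init__(self):
        self.map_outputs = defaultdict(dict)  # key -> {record_id: value}
        self.keys_of = {}                     # record_id -> keys it emitted
        self.results = {}                     # key -> reduced result

    def _retract(self, record_id, touched):
        # Remove the pairs this record emitted earlier, if any.
        for key in self.keys_of.pop(record_id, ()):
            del self.map_outputs[key][record_id]
            touched.add(key)

    def apply_delta(self, upserts=None, deletes=()):
        """Apply changed input only; return the set of recomputed keys."""
        touched = set()
        for record_id in deletes:
            self._retract(record_id, touched)
        for record_id, text in (upserts or {}).items():
            self._retract(record_id, touched)
            pairs = map_fn(text)
            self.keys_of[record_id] = [key for key, _ in pairs]
            for key, value in pairs:
                self.map_outputs[key][record_id] = value
                touched.add(key)
        # Re-run reduce only for keys whose preserved inputs changed.
        for key in touched:
            per_record = self.map_outputs[key]
            if per_record:
                self.results[key] = reduce_fn(per_record.values())
            else:
                self.results.pop(key, None)
                del self.map_outputs[key]
        return touched
```

In this sketch the first call to `apply_delta` plays the role of the initial run, and later calls touch only the keys reachable from the changed records. A task-level scheme would instead re-run whole map or reduce tasks.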
MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That's Not a Nail!
Hadoop is currently the large-scale data analysis "hammer" of choice, but
there exist classes of algorithms that aren't "nails", in the sense that they
are not particularly amenable to the MapReduce programming model. To address
this, researchers have proposed MapReduce extensions or alternative programming
models in which these algorithms can be elegantly expressed. This essay
espouses a very different position: that MapReduce is "good enough", and that
instead of trying to invent screwdrivers, we should simply get rid of
everything that's not a nail. To be more specific, much discussion in the
literature surrounds the fact that iterative algorithms are a poor fit for
MapReduce: the simple solution is to find alternative non-iterative algorithms
that solve the same problem. This essay captures my personal experiences as an
academic researcher as well as a software engineer in a "real-world" production
analytics environment. From this combined perspective I reflect on the current
state and future of "big data" research.
Early Accurate Results for Advanced Analytics on MapReduce
Approximate results based on samples often provide the only way in which
advanced analytical applications on very massive data sets can satisfy their
time and resource constraints. Unfortunately, methods and tools for the
computation of accurate early results are currently not supported in
MapReduce-oriented systems, although these are intended for "big data".
Therefore, we proposed and implemented a non-parametric extension of Hadoop
which allows the incremental computation of early results for arbitrary
work-flows, along with reliable on-line estimates of the degree of accuracy
achieved so far in the computation. These estimates are based on a technique
called bootstrapping that has been widely employed in statistics and can be
applied to arbitrary functions and data distributions. In this paper, we
describe our Early Accurate Result Library (EARL) for Hadoop that was designed
to minimize the changes required to the MapReduce framework. Various tests of
EARL on Hadoop are presented to characterize the frequent situations where EARL
can provide major speed-ups over the current version of Hadoop.
Comment: VLDB201
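The bootstrapping technique mentioned in this abstract can be sketched in a few lines. The following is a generic single-machine illustration, not EARL's code or API: the function names `bootstrap_error` and `early_result`, the doubling schedule, and the 5% tolerance are assumptions made for the example. It grows a random sample until the bootstrap estimate of the relative standard error falls below the tolerance, then returns the early result together with its estimated error.

```python
import random
import statistics

def bootstrap_error(sample, estimator, n_resamples=200, seed=0):
    """Standard error of `estimator`, estimated by resampling with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    estimates = []
    for _ in range(n_resamples):
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
    return statistics.stdev(estimates)

def early_result(data, estimator, start=100, tolerance=0.05, seed=0):
    """Grow a random sample until the relative bootstrap error is small enough.

    Returns (estimate, estimated_standard_error, sample_size_used).
    """
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    size = min(start, len(shuffled))
    while True:
        sample = shuffled[:size]
        estimate = estimator(sample)
        error = bootstrap_error(sample, estimator, seed=seed)
        relative = error / abs(estimate) if estimate else float("inf")
        if relative <= tolerance or size == len(shuffled):
            return estimate, error, size
        size = min(size * 2, len(shuffled))  # assumed schedule: double the sample
```

Because the bootstrap only needs to call `estimator` on resamples, the same loop works for a mean, a median, or any other function of the sample, which is the property the abstract relies on when it speaks of arbitrary functions and data distributions.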
REX: Recursive, Delta-Based Data-Centric Computation
In today's Web and social network environments, query workloads include ad
hoc and OLAP queries, as well as iterative algorithms that analyze data
relationships (e.g., link analysis, clustering, learning). Modern DBMSs support
ad hoc and OLAP queries, but most are not robust enough to scale to large
clusters. Conversely, "cloud" platforms like MapReduce execute chains of batch
tasks across clusters in a fault tolerant way, but have too much overhead to
support ad hoc queries.
Moreover, both classes of platform incur significant overhead in executing
iterative data analysis algorithms. Most such iterative algorithms repeatedly
refine portions of their answers, until some convergence criterion is reached.
However, general cloud platforms typically must reprocess all data in each
step. DBMSs that support recursive SQL are more efficient in that they
propagate only the changes in each step -- but they still accumulate each
iteration's state, even if it is no longer useful. User-defined functions are
also typically harder to write for DBMSs than for cloud platforms.
We seek to unify the strengths of both styles of platforms, with a focus on
supporting iterative computations in which changes, in the form of deltas, are
propagated from iteration to iteration, and state is efficiently updated in an
extensible way. We present a programming model oriented around deltas, describe
how we execute and optimize such programs in our REX runtime system, and
validate that our platform also handles failures gracefully. We experimentally
validate our techniques, and show speedups over the competing methods ranging
from 2.5 to nearly 100 times.
Comment: VLDB201
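Delta propagation between iterations can be shown with a small example. This is a generic sketch and not the REX programming model or runtime: the function name `shortest_paths_delta`, the adjacency-list input format, and the choice of single-source shortest paths (with non-negative edge weights assumed) are illustrative. Each iteration processes only the values that changed in the previous one, and the state is updated in place rather than accumulated per iteration.

```python
def shortest_paths_delta(edges, source):
    """Single-source shortest paths driven by deltas.

    edges: dict mapping node -> list of (neighbor, weight) pairs,
           with non-negative weights (assumed).
    Returns a dict mapping each reachable node to its distance.
    """
    dist = {source: 0}
    delta = {source: 0}  # values that changed in the last iteration
    while delta:
        new_delta = {}
        for node, d in delta.items():
            for neighbor, weight in edges.get(node, []):
                candidate = d + weight
                best_known = min(
                    dist.get(neighbor, float("inf")),
                    new_delta.get(neighbor, float("inf")),
                )
                if candidate < best_known:
                    new_delta[neighbor] = candidate
        dist.update(new_delta)  # update state in place; old iterations are dropped
        delta = new_delta       # only the changes are carried forward
    return dist
```

A platform that reprocesses all data in each step would relax every edge in every iteration. Here the work per iteration is proportional to the size of the delta, and the loop stops when the delta is empty, which serves as the convergence criterion.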
A hybrid framework of iterative MapReduce and MPI for molecular dynamics applications
Developing platforms for large-scale data processing has been of great interest to scientists. Hadoop is a widely used computational platform: it provides fault-tolerant distributed data storage through HDFS (Hadoop Distributed File System) and fault-tolerant parallel data processing through the MapReduce framework. Actual computations often require multiple MapReduce cycles, which calls for chained MapReduce jobs. However, Hadoop's design handles problems with an iterative structure poorly. In many iterative problems, some invariant data is required by every MapReduce cycle. The same data is uploaded to the Hadoop file system in every cycle, causing repeated data delivery and unnecessary transfer time. In addition, although Hadoop can process data in parallel, it does not support MPI. Within any Map or Reduce task, the computation must be serial. As a result, scientific computations wrapped in Map/Reduce tasks run inefficiently because the computation cannot be distributed over a Hadoop cluster, especially a Hadoop cluster running on a traditional high-performance computing cluster.
Computational technologies have been extensively investigated for use in many application domains. Since the introduction of Hadoop, scientists have applied the MapReduce framework to biological sciences, chemistry, medical sciences, and other areas to process huge data sets efficiently. In our research, we proposed a hybrid framework of iterative MapReduce and MPI for molecular dynamics applications, and we carried out molecular dynamics simulations with the implemented framework. We improved the capability and performance of Hadoop by adding an MPI module. The MPI module enables Hadoop to monitor and manage the resources of the Hadoop cluster so that computations in Map/Reduce tasks can be performed in parallel.
We also applied a local caching mechanism to avoid redundant data delivery and make computing more efficient. Our hybrid framework inherits the features of Hadoop and improves its computing efficiency. The target application domain of our research is molecular dynamics simulation. However, the potential use of our iterative MapReduce framework with MPI is broad: it can be used by any application that contains single or multiple MapReduce iterations and invokes serial or parallel (MPI) computations in the Map or Reduce phase of Hadoop.
Teadusarvutuse algoritmide taandamine hajusarvutuse raamistikele (Reducing Scientific Computing Algorithms to Distributed Computing Frameworks)
Scientific computing uses computers and algorithms to solve problems in various sciences such as genetics, biology and chemistry. Often the goal is to model and simulate different natural phenomena which would otherwise be very difficult to study in real environments.
For example, it is possible to create a model of a solar storm or a meteor hit and run computer simulations to assess the impact of the disaster on the environment. The more sophisticated and accurate the simulations are, the more computing power is required. It is often necessary to use a large number of computers, all working simultaneously on a single problem. Computations of this kind are called parallel or distributed computing.
However, creating distributed computing programs is complicated and requires a lot more time and resources, because it is necessary to synchronize the work of different computers running at the same time. A number of software frameworks have been created to simplify this process by automating part of distributed programming.
The goal of this research was to assess the suitability of such distributed computing frameworks for complex scientific computing algorithms. The results showed that existing frameworks are very different from each other and that none of them is suitable for all types of algorithms. Some frameworks are only suitable for simple algorithms; others are not suitable when the data does not fit into the computers' memory. Choosing the most appropriate distributed computing framework for an algorithm can be a very complex task, because it requires studying and applying the existing frameworks.
While searching for a solution to this problem, it was decided to create a Dynamic Algorithms Modelling Application (DAMA), which is able to simulate the implementation of the algorithm in different distributed computing frameworks. DAMA helps to estimate which distributed framework is the most appropriate for a given algorithm, without actually implementing it in any of the available frameworks.
The main contribution of this study is simplifying the adoption of distributed computing frameworks for researchers who are not yet familiar with using them. It should save significant time and resources, as it is not necessary to study each of the available distributed computing frameworks in detail.