The Family of MapReduce and Large Scale Data Processing Systems
In the last two decades, the continuous increase of computational power has
produced an overwhelming flow of data which has called for a paradigm shift in
the computing architecture and large scale data processing mechanisms.
MapReduce is a simple and powerful programming model that enables easy
development of scalable parallel applications to process vast amounts of data
on large clusters of commodity machines. It isolates the application from the
details of running a distributed program such as issues on data distribution,
scheduling and fault tolerance. However, the original implementation of the
MapReduce framework had some limitations that have been tackled by many
research efforts in several follow-up works after its introduction. This article
provides a comprehensive survey of a family of approaches and mechanisms for
large scale data processing that have been implemented based on the
original idea of the MapReduce framework and are currently gaining a lot of
momentum in both research and industrial communities. We also cover a set of
systems that have been implemented to provide declarative
programming interfaces on top of the MapReduce framework. In addition, we
review several large scale data processing systems that resemble some of the
ideas of the MapReduce framework for different purposes and application
scenarios. Finally, we discuss some of the future research directions for
implementing the next generation of MapReduce-like solutions.
Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other author
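The programming model described in this abstract can be illustrated with a minimal single-process sketch. It is not taken from the surveyed systems: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the word-count task are assumptions chosen for illustration, and a real framework would distribute each phase across machines and handle scheduling and fault tolerance itself.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key (done by the framework in practice)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values that share one key into a single result."""
    return key, sum(values)


def run_job(documents):
    """Run map, shuffle, and reduce in sequence inside one process."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = run_job(["map reduce map", "reduce cluster"])
print(counts)
```

The application code is limited to `map_phase` and `reduce_phase`; everything else stands in for what the framework would provide.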
QuPARA: Query-Driven Large-Scale Portfolio Aggregate Risk Analysis on MapReduce
Stochastic simulation techniques are used for portfolio risk analysis. Risk
portfolios may consist of thousands of reinsurance contracts covering millions
of insured locations. To quantify risk, each portfolio must be evaluated in up
to a million simulation trials, each capturing a different possible sequence of
catastrophic events over the course of a contractual year. In this paper, we
explore the design of a flexible framework for portfolio risk analysis that
facilitates answering a rich variety of catastrophic risk queries. Rather than
aggregating simulation data in order to produce a small set of high-level risk
metrics efficiently (as is often done in production risk management systems),
the focus here is on allowing the user to pose queries on unaggregated or
partially aggregated data. The goal is to provide a flexible framework that can
be used by analysts to answer a wide variety of unanticipated but natural ad
hoc queries. Such detailed queries can help actuaries or underwriters to better
understand the multiple dimensions (e.g., spatial correlation, seasonality,
peril features, construction features, and financial terms) that can impact
portfolio risk. We implemented a prototype system, called QuPARA (Query-Driven
Large-Scale Portfolio Aggregate Risk Analysis), using Hadoop, which is Apache's
implementation of the MapReduce paradigm. This allows the user to take
advantage of large parallel compute servers in order to answer ad hoc risk
analysis queries efficiently even on very large data sets typically encountered
in practice. We describe the design and implementation of QuPARA and present
experimental results that demonstrate its feasibility. A full portfolio risk
analysis run consisting of a 1,000,000 trial simulation, with 1,000 events per
trial, and 3,200 risk transfer contracts can be completed on a 16-node Hadoop
cluster in just over 20 minutes.
Comment: 9 pages, IEEE International Conference on Big Data (BigData), Santa
Clara, USA, 201
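As a rough illustration of the kind of computation this abstract describes, the sketch below sums losses per simulation trial in a map/reduce style. It is not QuPARA's implementation: the data layout (a dictionary of trials, a per-event loss table) and every name in it are hypothetical, and contract financial terms are ignored.

```python
from collections import defaultdict


def map_trial(trial_id, events, event_losses):
    """Map: emit (trial_id, loss) for each event in the trial that causes a loss."""
    for event in events:
        loss = event_losses.get(event, 0.0)
        if loss:
            yield trial_id, loss


def reduce_trial(losses):
    """Reduce: total loss for one trial."""
    return sum(losses)


def aggregate_losses(trials, event_losses):
    """Group mapped losses by trial and reduce each group."""
    grouped = defaultdict(list)
    for trial_id, events in trials.items():
        grouped[trial_id]  # keep trials that produce no loss at all
        for key, loss in map_trial(trial_id, events, event_losses):
            grouped[key].append(loss)
    return {trial: reduce_trial(losses) for trial, losses in grouped.items()}


# Hypothetical toy inputs: three trials and a two-event loss table.
trials = {0: ["quake", "flood"], 1: ["flood"], 2: []}
event_losses = {"quake": 100.0, "flood": 25.0}
print(aggregate_losses(trials, event_losses))
```

An ad hoc query in this style would filter events inside `map_trial` (for example by peril or region) before the per-trial sums are computed.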
Efficient Multi-way Theta-Join Processing Using MapReduce
Multi-way Theta-join queries are powerful in describing complex relations and
are therefore widely used in practice. However, existing solutions from
traditional distributed and parallel databases for multi-way Theta-join queries
cannot be easily extended to fit a shared-nothing distributed computing
paradigm, which is proven to be able to support OLAP applications over immense
data volumes. In this work, we study the problem of efficient processing of
multi-way Theta-join queries using MapReduce from a cost-effective perspective.
Although there have been some works using the (key,value) pair-based
programming model to support join operations, efficient processing of multi-way
Theta-join queries has never been fully explored. The substantial challenge
lies in, given a number of processing units (that can run Map or Reduce tasks),
mapping a multi-way Theta-join query to a number of MapReduce jobs and having
them executed in a well scheduled sequence, such that the total processing time
span is minimized. Our solution mainly includes two parts: 1) cost metrics for
both a single MapReduce job and a number of MapReduce jobs executed in a certain
order; 2) the efficient execution of a chain-typed Theta-join with only one
MapReduce job. Compared with the query evaluation strategy proposed in [23]
and the widely adopted Pig Latin and Hive SQL solutions, our method achieves
significant improvement of the join processing efficiency.
Comment: VLDB201
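To make the idea of evaluating a Theta-join in a single map/reduce round concrete, here is a small sketch of a generic grid-partitioning approach for a two-way join. This is not the method proposed in the paper, which addresses multi-way chain-typed joins and cost-based scheduling; the function name, the `rows`/`cols` parameters, and the round-robin band assignment are assumptions made for illustration.

```python
from collections import defaultdict
from operator import lt


def theta_join(R, S, theta, rows=2, cols=2):
    """Two-way theta-join over a rows x cols grid of simulated reducers."""
    cells = defaultdict(lambda: ([], []))

    # Map: each R tuple goes to one row band and is replicated to every column.
    for i, r in enumerate(R):
        for c in range(cols):
            cells[(i % rows, c)][0].append(r)

    # Map: each S tuple goes to one column band and is replicated to every row.
    for j, s in enumerate(S):
        for band in range(rows):
            cells[(band, j % cols)][1].append(s)

    # Reduce: each cell tests the predicate on its local pairs only.
    # A pair (R[i], S[j]) meets in exactly one cell: (i % rows, j % cols).
    output = []
    for r_part, s_part in cells.values():
        output.extend((r, s) for r in r_part for s in s_part if theta(r, s))
    return output


# Join condition R.value < S.value on toy single-column relations.
print(sorted(theta_join([1, 5, 9], [4, 6], lt)))
```

Because the predicate is arbitrary, tuples cannot be routed by key equality as in an equi-join, which is why each input is replicated along one grid dimension.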
BigExcel: A Web-Based Framework for Exploring Big Data in Social Sciences
This paper argues that there are three fundamental challenges that need to be
overcome in order to foster the adoption of big data technologies in
non-computer science related disciplines: making such technologies accessible
to non-computer scientists, supporting the ad hoc
exploration of large data sets with minimal effort, and providing
lightweight web-based frameworks for quick and easy analytics. In this paper,
we address the above three challenges through the development of 'BigExcel', a
three tier web-based framework for exploring big data to facilitate the
management of user interactions with large data sets, the construction of
queries to explore the data set and the management of the infrastructure. The
feasibility of BigExcel is demonstrated through two Yahoo Sandbox datasets. The
first is the Yahoo Buzz Score data set, which we use for quantitatively
predicting trending technologies, and the second is the Yahoo n-gram corpus,
which we use for qualitatively inferring the coverage of important events. A
demonstration of the BigExcel framework and source code is available at
http://bigdata.cs.st-andrews.ac.uk/projects/bigexcel-exploring-big-data-for-social-sciences/.
Comment: 8 page