Parallel Hierarchical Affinity Propagation with MapReduce
The accelerated evolution and explosion of the Internet and social media is
generating voluminous quantities of data (on zettabyte scales). Paramount
amongst the desires to manipulate and extract actionable intelligence from vast
big data volumes is the need for scalable, performance-conscious analytics
algorithms. To directly address this need, we propose a novel MapReduce
implementation of the exemplar-based clustering algorithm known as Affinity
Propagation. Our parallelization strategy extends to the multilevel
Hierarchical Affinity Propagation algorithm and enables tiered aggregation of
unstructured data with minimal free parameters, in principle requiring only a
similarity measure between data points. We detail the linear run-time
complexity of our approach, overcoming the limiting quadratic complexity of the
original algorithm. Experimental validation of our clustering methodology on a
variety of synthetic and real data sets (e.g. images and point data)
demonstrates our competitiveness against other state-of-the-art MapReduce
clustering techniques.
Comparing MapReduce and pipeline implementations for counting triangles
A common method to define a parallel solution for a computational problem consists of finding a way to apply the Divide and Conquer paradigm so that each processor acts on its own data and the processors are scheduled in parallel. MapReduce is a programming model that follows this paradigm, and allows for the definition of efficient solutions by both decomposing a problem into steps on subsets of the input data and combining the results of each step to produce final results. Although it is used to implement a wide variety of computational problems, MapReduce performance can be negatively affected whenever the replication factor grows or the size of the input is larger than the resources available at each processor. In this paper we show an alternative approach to implementing the Divide and Conquer paradigm, named dynamic pipeline. The main features of dynamic pipelines are illustrated on a parallel implementation of the well-known problem of counting triangles in a graph. This problem is especially interesting when the input graph either does not fit in memory or is dynamically generated. To evaluate the properties of the pipeline, a dynamic pipeline of processes and an ad-hoc version of MapReduce are implemented in the language Go, exploiting its ability to deal with channels and spawned processes. An empirical evaluation is conducted on graphs of different topologies, sizes, and densities. Observed results suggest that dynamic pipelines allow for an efficient implementation of the problem of counting triangles in a graph, particularly in dense and large graphs, drastically reducing the execution time with respect to the MapReduce implementation.
Peer Reviewed. Postprint (published version)
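For readers unfamiliar with the triangle-counting problem named in this abstract, the following is a minimal single-process sketch in Python. It is not the paper's Go implementation and does not model dynamic pipelines or a real MapReduce runtime; the function name `count_triangles` and the "emit candidate pairs, then check the closing edge" split are assumptions made for illustration, loosely mirroring a map step and a reduce step.

```python
from collections import defaultdict
from itertools import combinations


def count_triangles(edges):
    """Count triangles in an undirected simple graph given as (u, v) pairs.

    Hypothetical helper for illustration; vertices must be mutually comparable.
    """
    # For every vertex keep only the neighbours that sort after it,
    # so that each triangle u < v < w is discovered exactly once, at u.
    higher = defaultdict(set)
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        a, b = (u, v) if u < v else (v, u)
        higher[a].add(b)

    total = 0
    for u, nbrs in list(higher.items()):
        # "Map"-like step: emit every pair (v, w) of higher neighbours of u.
        for v, w in combinations(sorted(nbrs), 2):
            # "Reduce"-like step: the pair closes a triangle if edge (v, w) exists.
            if w in higher.get(v, ()):
                total += 1
    return total
```

The number of candidate pairs emitted per vertex grows quadratically with its degree, which gives some intuition for why replication can become costly on dense graphs.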
Hybrid Similarity Function for Big Data Entity Matching with R-Swoosh
Entity Matching (EM) is the problem of determining if two entities in a data set refer to the same real-world object. For example, it decides if two given mentions in the data, such as “Helen Hunt” and “H. M. Hunt”, refer to the same real-world entity by using different similarity functions. This problem plays a key role in information integration, natural language understanding, information processing on the World-Wide Web, and on the emerging Semantic Web. This project deals with the similarity functions, and the thresholds used with them, that determine the similarity of the entities. The work contains two major parts: implementation of a hybrid similarity function, which combines three different similarity functions to determine the similarity of entities, and an efficient method to determine the optimum threshold value for similarity functions to get accurate results.
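The idea of combining three similarity functions under a threshold can be sketched as follows. The abstract does not say which three functions the project uses, so the choices here (token Jaccard, a character-level ratio from Python's `difflib`, and a first-initial plus last-token check), the weights, and the threshold of 0.5 are all assumptions for illustration, and R-Swoosh itself is not modelled.

```python
from difflib import SequenceMatcher


def _tokens(name):
    # Lower-case tokens with surrounding dots removed ("H." -> "h").
    return [t.strip(".").lower() for t in name.split() if t.strip(".")]


def jaccard(a, b):
    # Token-set overlap (assumed component 1).
    sa, sb = set(_tokens(a)), set(_tokens(b))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def char_ratio(a, b):
    # Character-level similarity from the standard library (assumed component 2).
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def initials_match(a, b):
    # 1.0 when first initials and last tokens agree (assumed component 3).
    ta, tb = _tokens(a), _tokens(b)
    if not ta or not tb:
        return 0.0
    return 1.0 if ta[0][0] == tb[0][0] and ta[-1] == tb[-1] else 0.0


def hybrid_similarity(a, b, weights=(0.4, 0.3, 0.3)):
    # Weighted sum of the three scores; the weights are illustrative only.
    scores = (jaccard(a, b), char_ratio(a, b), initials_match(a, b))
    return sum(w * s for w, s in zip(weights, scores))


def is_match(a, b, threshold=0.5):
    # The threshold is a placeholder; the project tunes it empirically.
    return hybrid_similarity(a, b) >= threshold
```

With these assumed weights, `is_match("Helen Hunt", "H. M. Hunt")` holds while `is_match("Helen Hunt", "Tom Cruise")` does not, which illustrates how the threshold choice decides which mentions are merged.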
QuPARA: Query-Driven Large-Scale Portfolio Aggregate Risk Analysis on MapReduce
Stochastic simulation techniques are used for portfolio risk analysis. Risk
portfolios may consist of thousands of reinsurance contracts covering millions
of insured locations. To quantify risk each portfolio must be evaluated in up
to a million simulation trials, each capturing a different possible sequence of
catastrophic events over the course of a contractual year. In this paper, we
explore the design of a flexible framework for portfolio risk analysis that
facilitates answering a rich variety of catastrophic risk queries. Rather than
aggregating simulation data in order to produce a small set of high-level risk
metrics efficiently (as is often done in production risk management systems),
the focus here is on allowing the user to pose queries on unaggregated or
partially aggregated data. The goal is to provide a flexible framework that can
be used by analysts to answer a wide variety of unanticipated but natural ad
hoc queries. Such detailed queries can help actuaries or underwriters to better
understand the multiple dimensions (e.g., spatial correlation, seasonality,
peril features, construction features, and financial terms) that can impact
portfolio risk. We implemented a prototype system, called QuPARA (Query-Driven
Large-Scale Portfolio Aggregate Risk Analysis), using Hadoop, which is Apache's
implementation of the MapReduce paradigm. This allows the user to take
advantage of large parallel compute servers in order to answer ad hoc risk
analysis queries efficiently even on very large data sets typically encountered
in practice. We describe the design and implementation of QuPARA and present
experimental results that demonstrate its feasibility. A full portfolio risk
analysis run consisting of a 1,000,000 trial simulation, with 1,000 events per
trial, and 3,200 risk transfer contracts can be completed on a 16-node Hadoop
cluster in just over 20 minutes.
Comment: 9 pages, IEEE International Conference on Big Data (BigData), Santa Clara, USA, 201
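The query-driven aggregation described in the QuPARA abstract can be pictured with a small in-memory sketch. This is not QuPARA's design and uses no Hadoop API; the record layout `(trial_id, event_id, peril)`, the `event_losses` lookup table, and the function names are hypothetical, chosen only to show a filter applied in a map-like step followed by a per-trial, per-contract sum in a reduce-like step.

```python
from collections import defaultdict


def map_trial_event(record, event_losses, peril_filter=None):
    """Emit ((trial_id, contract_id), loss) pairs for one simulated event.

    record is a hypothetical (trial_id, event_id, peril) tuple;
    event_losses maps event_id -> {contract_id: loss}.
    """
    trial_id, event_id, peril = record
    # The ad hoc query is expressed as a filter on the unaggregated records.
    if peril_filter is not None and peril != peril_filter:
        return
    for contract_id, loss in event_losses.get(event_id, {}).items():
        yield (trial_id, contract_id), loss


def reduce_sum(pairs):
    # Sum all emitted losses that share the same (trial, contract) key.
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run_query(trial_events, event_losses, peril_filter=None):
    emitted = (
        pair
        for record in trial_events
        for pair in map_trial_event(record, event_losses, peril_filter)
    )
    return reduce_sum(emitted)
```

Because the filter runs before aggregation, a different query (for example another peril) only changes the map step, which is the kind of flexibility the abstract attributes to working on unaggregated data.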