MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That's Not a Nail!
Hadoop is currently the large-scale data analysis "hammer" of choice, but
there exist classes of algorithms that aren't "nails", in the sense that they
are not particularly amenable to the MapReduce programming model. To address
this, researchers have proposed MapReduce extensions or alternative programming
models in which these algorithms can be elegantly expressed. This essay
espouses a very different position: that MapReduce is "good enough", and that
instead of trying to invent screwdrivers, we should simply get rid of
everything that's not a nail. To be more specific, much discussion in the
literature surrounds the fact that iterative algorithms are a poor fit for
MapReduce: the simple solution is to find alternative non-iterative algorithms
that solve the same problem. This essay captures my personal experiences as an
academic researcher as well as a software engineer in a "real-world" production
analytics environment. From this combined perspective I reflect on the current
state and future of "big data" research.
A multi-class approach for ranking graph nodes: models and experiments with incomplete data
After the phenomenal success of the PageRank algorithm, many researchers have
extended the PageRank approach to ranking graphs with richer structures besides
the simple linkage structure. In some scenarios we have to deal with
multi-parameter data where each node has additional features and there are
relationships between such features.
This paper stems from the need for a systematic approach when dealing with
multi-parameter data. We propose models and ranking algorithms which can be
used with minor adjustments for a large variety of networks (bibliographic
data, patent data, Twitter and social data, healthcare data). In this paper we
focus on several aspects which have not been addressed in the literature: (1)
we propose different models for ranking multi-parameter data and a class of
numerical algorithms for efficiently computing the ranking score of such
models, (2) by analyzing the stability and convergence properties of the
numerical schemes we tune a fast and stable technique for the ranking problem,
(3) we consider the issue of the robustness of our models when data are
incomplete. The comparison of the rank on the incomplete data with the rank on
the full structure shows that our models compute consistent rankings whose
correlation is up to 60% when just 10% of the links of the attributes are
maintained, suggesting that our model remains suitable when the data are
incomplete.
Maiter: An Asynchronous Graph Processing Framework for Delta-based Accumulative Iterative Computation
A myriad of graph-based algorithms in machine learning and data mining require
parsing relational data iteratively. These algorithms are implemented in a
large-scale distributed environment in order to scale to massive data sets. To
accelerate these large-scale graph-based iterative computations, we propose
delta-based accumulative iterative computation (DAIC). Different from
traditional iterative computations, which iteratively update the result based
on the result from the previous iteration, DAIC updates the result by
accumulating the "changes" between iterations. With DAIC, we can process only the
"changes" and so avoid negligible updates. Furthermore, we can perform DAIC
asynchronously to bypass the high-cost synchronous barriers in heterogeneous
distributed environments. Based on the DAIC model, we design and implement an
asynchronous graph processing framework, Maiter. We evaluate Maiter on a local
cluster as well as on Amazon EC2 Cloud. The results show that Maiter achieves
as much as 60x speedup over Hadoop and outperforms other state-of-the-art
frameworks.
Comment: ScienceCloud 2012, TKDE 201
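The delta-based accumulation idea described in this abstract can be illustrated with a small single-process sketch. This is not Maiter's implementation: the function name `delta_pagerank`, its parameters, and the choice of PageRank as the example algorithm are assumptions made for illustration, and a simple work queue stands in for asynchronous scheduling. The sketch assumes every node has at least one outgoing link and uses the unnormalized PageRank form, in which each score satisfies score = (1 - damping) + damping * (sum of incoming shares).

```python
from collections import deque


def delta_pagerank(graph, damping=0.85, threshold=1e-10):
    """Illustrative delta-based accumulative PageRank (hypothetical helper).

    graph maps each node to a list of its out-neighbours; every node is
    assumed to have at least one out-neighbour.
    """
    rank = {node: 0.0 for node in graph}              # accumulated result
    delta = {node: 1.0 - damping for node in graph}   # pending "changes"
    queue = deque(graph)                              # nodes with pending changes
    queued = set(graph)

    while queue:
        node = queue.popleft()
        queued.discard(node)
        change = delta[node]
        if change < threshold:
            # Negligible update: leave it pending and do no work for it now.
            continue
        # Accumulate the change into the result instead of recomputing it.
        rank[node] += change
        delta[node] = 0.0
        # Propagate only the change, not the full rank, to the out-neighbours.
        share = damping * change / len(graph[node])
        for neighbour in graph[node]:
            delta[neighbour] += share
            if neighbour not in queued:
                queue.append(neighbour)
                queued.add(neighbour)
    return rank


if __name__ == "__main__":
    ring = {"a": ["b"], "b": ["c"], "c": ["a"]}
    print(delta_pagerank(ring))
```

Because each node is processed whenever it has a pending change, in whatever order the queue yields, no global barrier between iterations is needed in this sketch; a distributed version would have to handle message passing and termination detection, which are omitted here.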
A Web Aggregation Approach for Distributed Randomized PageRank Algorithms
The PageRank algorithm employed at Google assigns a measure of importance to
each web page for rankings in search results. In our recent papers, we have
proposed a distributed randomized approach for this algorithm, where web pages
are treated as agents computing their own PageRank by communicating with linked
pages. This paper builds upon this approach to reduce the computation and
communication loads for the algorithms. In particular, we develop a method to
systematically aggregate the web pages into groups by exploiting the sparsity
inherent in the web. For each group, an aggregated PageRank value is computed,
which can then be distributed among the group members. We provide a distributed
update scheme for the aggregated PageRank along with an analysis of its
convergence properties. The method is especially motivated by results on
singular perturbation techniques for large-scale Markov chains and multi-agent
consensus.
Comment: To appear in the IEEE Transactions on Automatic Control, 201