    MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That's Not a Nail!

    Hadoop is currently the large-scale data analysis "hammer" of choice, but there exist classes of algorithms that aren't "nails", in the sense that they are not particularly amenable to the MapReduce programming model. To address this, researchers have proposed MapReduce extensions or alternative programming models in which these algorithms can be elegantly expressed. This essay espouses a very different position: that MapReduce is "good enough", and that instead of trying to invent screwdrivers, we should simply get rid of everything that's not a nail. To be more specific, much discussion in the literature surrounds the fact that iterative algorithms are a poor fit for MapReduce: the simple solution is to find alternative non-iterative algorithms that solve the same problem. This essay captures my personal experiences as an academic researcher as well as a software engineer in a "real-world" production analytics environment. From this combined perspective I reflect on the current state and future of "big data" research.
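
    The mismatch the essay refers to is structural: a MapReduce job is a single map/shuffle/reduce pass, so an iterative algorithm has to be expressed as a chain of such jobs, one per iteration. Below is a minimal, illustrative sketch of that single-pass model in plain Python (not code from the essay; the function names and the word-count example are assumptions for illustration).

```python
# A toy, single-machine rendering of the MapReduce model: one map phase,
# a shuffle that groups by key, and one reduce phase. An iterative algorithm
# would have to re-run this entire pipeline once per iteration.
from collections import defaultdict

def map_phase(records, mapper):
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    return {key: reducer(key, values) for key, values in groups.items()}

# Example job: word count over two tiny "documents".
docs = ["map reduce is good enough", "throw away everything that is not a nail"]
mapped = map_phase(docs, lambda doc: ((word, 1) for word in doc.split()))
counts = reduce_phase(shuffle(mapped), lambda _word, ones: sum(ones))
print(counts["is"])  # -> 2
```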

    A multi-class approach for ranking graph nodes: models and experiments with incomplete data

    After the phenomenal success of the PageRank algorithm, many researchers have extended the PageRank approach to ranking graphs with richer structures besides the simple linkage structure. In some scenarios we have to deal with multi-parameter data, where each node has additional features and there are relationships between such features. This paper stems from the need for a systematic approach when dealing with multi-parameter data. We propose models and ranking algorithms which can be used with little adjustment for a large variety of networks (bibliographic data, patent data, Twitter and social data, healthcare data). In this paper we focus on several aspects which have not been addressed in the literature: (1) we propose different models for ranking multi-parameter data and a class of numerical algorithms for efficiently computing the ranking score of such models, (2) by analyzing the stability and convergence properties of the numerical schemes we tune a fast and stable technique for the ranking problem, (3) we consider the issue of the robustness of our models when data are incomplete. The comparison of the rank on the incomplete data with the rank on the full structure shows that our models compute consistent rankings, with correlation up to 60% when just 10% of the links of the attributes are maintained, suggesting that our models are also suitable when the data are incomplete.
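
    As a point of reference, the baseline these multi-parameter models generalize is the classical PageRank fixed point computed by power iteration. The sketch below shows only that baseline (the paper's multi-class models and numerical schemes are not reproduced); the damping factor and the tiny example graph are assumptions for illustration.

```python
# Classical PageRank by power iteration on a row-stochastic transition matrix.
# This is the single-parameter baseline, not the paper's multi-class model.
import numpy as np

def pagerank(adj, damping=0.85, tol=1e-10, max_iter=1000):
    """adj[i][j] = 1 if node i links to node j."""
    A = np.asarray(adj, dtype=float)
    n = A.shape[0]
    out_deg = A.sum(axis=1, keepdims=True)
    # Rows of dangling nodes (no out-links) are replaced by a uniform distribution.
    P = np.where(out_deg > 0, A / np.maximum(out_deg, 1), 1.0 / n)
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_next = damping * r @ P + (1 - damping) / n
        if np.abs(r_next - r).sum() < tol:
            break
        r = r_next
    return r

# Tiny 3-node cycle: 0 -> 1 -> 2 -> 0; the ranking is uniform by symmetry.
print(pagerank([[0, 1, 0], [0, 0, 1], [1, 0, 0]]))
```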

    Maiter: An Asynchronous Graph Processing Framework for Delta-based Accumulative Iterative Computation

    A myriad of graph-based algorithms in machine learning and data mining require parsing relational data iteratively. These algorithms are implemented in large-scale distributed environments in order to scale to massive data sets. To accelerate these large-scale graph-based iterative computations, we propose delta-based accumulative iterative computation (DAIC). Different from traditional iterative computations, which iteratively update the result based on the result from the previous iteration, DAIC updates the result by accumulating the "changes" between iterations. With DAIC, we can process only the "changes" and skip the negligible updates. Furthermore, we can perform DAIC asynchronously to bypass the high-cost synchronous barriers in heterogeneous distributed environments. Based on the DAIC model, we design and implement an asynchronous graph processing framework, Maiter. We evaluate Maiter on a local cluster as well as on the Amazon EC2 cloud. The results show that Maiter achieves as much as 60x speedup over Hadoop and outperforms other state-of-the-art frameworks.
    Comment: ScienceCloud 2012, TKDE 201
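
    The core DAIC idea is that, for accumulative updates, each vertex only needs to receive and propagate the change in its value rather than its full value. The sketch below is a simplified, single-machine illustration of that idea using PageRank (Maiter itself is distributed and asynchronous, and none of its actual API is shown; the damping factor, tolerance, and example graph are assumptions).

```python
# Delta-based accumulative PageRank: each node keeps an accumulated rank and
# a pending delta; only non-negligible deltas are propagated to out-neighbors,
# so unchanged parts of the graph do no work.

def delta_pagerank(out_links, damping=0.85, eps=1e-9):
    nodes = list(out_links)
    rank = {v: 0.0 for v in nodes}
    delta = {v: 1.0 - damping for v in nodes}   # the initial "change" at every node
    while True:
        active = [v for v in nodes if abs(delta[v]) > eps]
        if not active:
            break
        for v in active:
            d, delta[v] = delta[v], 0.0
            rank[v] += d                         # accumulate the change locally
            if out_links[v]:
                share = damping * d / len(out_links[v])
                for w in out_links[v]:
                    delta[w] += share            # push only the change downstream
    return rank

graph = {"a": ["b"], "b": ["c"], "c": ["a"]}
print(delta_pagerank(graph))  # each rank converges to ~1.0 on this 3-cycle
```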

    A Web Aggregation Approach for Distributed Randomized PageRank Algorithms

    The PageRank algorithm employed at Google assigns a measure of importance to each web page for rankings in search results. In our recent papers, we have proposed a distributed randomized approach for this algorithm, where web pages are treated as agents computing their own PageRank by communicating with linked pages. This paper builds upon this approach to reduce the computation and communication loads for the algorithms. In particular, we develop a method to systematically aggregate the web pages into groups by exploiting the sparsity inherent in the web. For each group, an aggregated PageRank value is computed, which can then be distributed among the group members. We provide a distributed update scheme for the aggregated PageRank along with an analysis on its convergence properties. The method is especially motivated by results on singular perturbation techniques for large-scale Markov chains and multi-agent consensus.
    Comment: To appear in the IEEE Transactions on Automatic Control, 201
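
    The aggregation step can be illustrated centrally: collapse pages into groups, run PageRank on the much smaller group-level graph, and then split each group's value among its members. The sketch below shows only that intuition (the grouping, the uniform within-group split, and the example graph are assumptions; the paper's distributed randomized update scheme and its convergence analysis are not reproduced).

```python
# Group-level PageRank followed by a uniform within-group split.
import numpy as np

def aggregated_pagerank(adj, groups, damping=0.85, iters=200):
    A = np.asarray(adj, dtype=float)
    n = A.shape[0]
    groups = np.asarray(groups)            # groups[i] = group index of page i
    k = groups.max() + 1
    # Aggregate the link structure: count links between groups.
    G = np.zeros((k, k))
    for i in range(n):
        for j in range(n):
            if A[i, j]:
                G[groups[i], groups[j]] += 1
    row_sums = G.sum(axis=1, keepdims=True)
    P = np.where(row_sums > 0, G / np.maximum(row_sums, 1), 1.0 / k)
    r = np.full(k, 1.0 / k)
    for _ in range(iters):                 # plain power iteration on the small graph
        r = damping * r @ P + (1 - damping) / k
    # Distribute each group's aggregated value uniformly among its members.
    sizes = np.bincount(groups, minlength=k)
    return r[groups] / sizes[groups]

# Four pages in two groups: {0, 1} and {2, 3}.
adj = [[0, 1, 1, 0], [1, 0, 0, 0], [0, 0, 0, 1], [1, 0, 1, 0]]
print(aggregated_pagerank(adj, groups=[0, 0, 1, 1]))
```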