7,916 research outputs found

    MapReduce for Experimental Search

    Get PDF
    This report presents preliminary results for the TREC 2010 ad-hoc web search task. We ran our MIREX system on 0.5 billion web documents from the ClueWeb09 crawl. On average, the system retrieves at least 3 relevant documents on the first result page containing 10 results, using a simple index consisting of anchor texts, page titles, and spam removal.\u

    MIREX: MapReduce Information Retrieval Experiments

    Get PDF
    We propose to use MapReduce to quickly test new retrieval approaches on a cluster of machines by sequentially scanning all documents. We present a small case study in which we use a cluster of 15 low cost ma- chines to search a web crawl of 0.5 billion pages showing that sequential scanning is a viable approach to running large-scale information retrieval experiments with little effort. The code is available to other researchers at: http://mirex.sourceforge.ne

    Enumerating Maximal Bicliques from a Large Graph using MapReduce

    Get PDF
    We consider the enumeration of maximal bipartite cliques (bicliques) from a large graph, a task central to many practical data mining problems in social network analysis and bioinformatics. We present novel parallel algorithms for the MapReduce platform, and an experimental evaluation using Hadoop MapReduce. Our algorithm is based on clustering the input graph into smaller sized subgraphs, followed by processing different subgraphs in parallel. Our algorithm uses two ideas that enable it to scale to large graphs: (1) the redundancy in work between different subgraph explorations is minimized through a careful pruning of the search space, and (2) the load on different reducers is balanced through the use of an appropriate total order among the vertices. Our evaluation shows that the algorithm scales to large graphs with millions of edges and tens of mil- lions of maximal bicliques. To our knowledge, this is the first work on maximal biclique enumeration for graphs of this scale.Comment: A preliminary version of the paper was accepted at the Proceedings of the 3rd IEEE International Congress on Big Data 201

    MRBench: A Benchmark for MapReduce Framework

    Full text link
    MapReduce is Google’s programming model for easy development of scalable parallel applications which pro-cess huge quantity of data on many clusters. Due to its conveniency and efficiency, MapReduce is used in various applications (e.g., web search services and on-line analytical processing.) However, there are only few good benchmarks to evaluate MapReduce implementa-tions by realistic testsets. In this paper, we present MRBench that is a bench-mark for evaluating MapReduce systems. MRBench fo-cuses on processing business oriented queries and con-current data modifications. To this end, we build MR-Bench to deal with large volumes of relational data and execute highly complex queries. By MRBench, users can evaluate the performance of MapReduce systems while varying environmental parameters such as data size and the number of (Map/Reduce) tasks. Our ex-tensive experimental results show that MRBench is a useful tool to benchmark the capability of answering critical business questions.

    MapReduce for information retrieval evaluation: "Let's quickly test this on 12 TB of data"

    Get PDF
    We propose to use MapReduce to quickly test new retrieval approaches on a cluster of machines by sequentially scanning all documents. We present a small case study in which we use a cluster of 15 low cost machines to search a web crawl of 0.5 billion pages showing that sequential scanning is a viable approach to running large-scale information retrieval experiments with little effort. The code is available to other researchers at: http://mirex.sourceforge.net

    Brute Force Information Retrieval Experiments using MapReduce

    Get PDF
    MIREX (MapReduce Information Retrieval Experiments) is a software library initially developed by the Database Group of the University of Twente for running large scale information retrieval experiments on clusters of machines. MIREX has been tested on web crawls of up to half a billion web pages, totalling about 12.5 TB of data uncompressed. MIREX shows that the execution of test queries by a brute force linear scan of pages, is a viable alternative to running the test queries on a search engine’s inverted index. MIREX is open source and available for others

    Real-Time MapReduce Scheduling

    Get PDF
    In this paper, we explore the feasibility of enabling the scheduling of mixed hard and soft real-time MapReduce applications. We first present an experimental evaluation of the popular Hadoop MapReduce middleware on the Amazon EC2 cloud. Our evaluation reveals tradeoffs between overall system throughput and execution time predictability, as well as highlights a number of factors affecting real-time scheduling, such as data placement, concurrent users, and master scheduling overhead. Based on our evaluation study, we present a formal model for capturing real-time MapReduce applications and the Hadoop platform. Using this model, we formulate the offline scheduling of real-time MapReduce jobs on a heterogeneous distributed Hadoop architecture as a constraint satisfaction problem (CSP) and introduce various search strategies for the formulation. We propose an enhancement of MapReduce’s execution model and a range of heuristic techniques for the online scheduling. We further outline some of our future directions that apply state-of-the-art techniques in the real-time scheduling literature
    • …
    corecore