7,916 research outputs found

MapReduce for Experimental Search

Author: Hauff Claudia
Hiemstra Djoerd
Publication venue: National Institute of Standards and Technology (NIST)
Publication date: 01/01/2011

This report presents preliminary results for the TREC 2010 ad-hoc web search task. We ran our MIREX system on 0.5 billion web documents from the ClueWeb09 crawl. On average, the system retrieves at least 3 relevant documents on the first result page containing 10 results, using a simple index consisting of anchor texts, page titles, and spam removal.

CiteSeerX

Radboud Repository

University of Twente Research Information

MIREX: MapReduce Information Retrieval Experiments

Author: Hauff Claudia
Hiemstra Djoerd
Publication date: 01/01/2010

We propose to use MapReduce to quickly test new retrieval approaches on a cluster of machines by sequentially scanning all documents. We present a small case study in which we use a cluster of 15 low cost machines to search a web crawl of 0.5 billion pages showing that sequential scanning is a viable approach to running large-scale information retrieval experiments with little effort. The code is available to other researchers at: http://mirex.sourceforge.ne
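
The sequential-scan idea in this abstract can be illustrated with a minimal in-process sketch. It is not the MIREX code: the function names (`map_document`, `reduce_query`, `run_scan`), the term-count scoring, and the in-memory grouping that stands in for the MapReduce shuffle are all assumptions made for illustration. The map step scores each document against every test query, and the reduce step keeps the top-k documents per query.

```python
from collections import defaultdict
import heapq


def map_document(doc_id, text, queries):
    """Map step (hypothetical): score one document against every query.

    The score is a plain count of query-term occurrences, chosen only
    to keep the sketch short; it is not the ranking used by MIREX.
    """
    tokens = text.lower().split()
    for query_id, query in queries.items():
        score = sum(tokens.count(term) for term in query.lower().split())
        if score > 0:
            yield query_id, (score, doc_id)


def reduce_query(pairs, k=10):
    """Reduce step (hypothetical): keep the k best (score, doc_id) pairs."""
    return heapq.nlargest(k, pairs)


def run_scan(documents, queries, k=10):
    """Scan every document once, group map output by query, then reduce."""
    grouped = defaultdict(list)  # stands in for the MapReduce shuffle
    for doc_id, text in documents.items():
        for query_id, pair in map_document(doc_id, text, queries):
            grouped[query_id].append(pair)
    return {qid: reduce_query(pairs, k) for qid, pairs in grouped.items()}
```

Because every query is evaluated during a single pass over the collection, no inverted index has to be built before a new scoring function can be tried; the cost is that each experiment reads all documents.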

arXiv.org e-Print Archive

CiteSeerX

University of Twente Research Information

Enumerating Maximal Bicliques from a Large Graph using MapReduce

Author: Mukherjee Arko Provo
Tirthapura Srikanta
Publication date: 01/01/2014

We consider the enumeration of maximal bipartite cliques (bicliques) from a large graph, a task central to many practical data mining problems in social network analysis and bioinformatics. We present novel parallel algorithms for the MapReduce platform, and an experimental evaluation using Hadoop MapReduce. Our algorithm is based on clustering the input graph into smaller sized subgraphs, followed by processing different subgraphs in parallel. Our algorithm uses two ideas that enable it to scale to large graphs: (1) the redundancy in work between different subgraph explorations is minimized through a careful pruning of the search space, and (2) the load on different reducers is balanced through the use of an appropriate total order among the vertices. Our evaluation shows that the algorithm scales to large graphs with millions of edges and tens of millions of maximal bicliques. To our knowledge, this is the first work on maximal biclique enumeration for graphs of this scale. Comment: A preliminary version of the paper was accepted at the Proceedings of the 3rd IEEE International Congress on Big Data 201
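
To make the notion of a maximal biclique concrete, here is a small brute-force sketch. It is not the parallel algorithm the abstract describes: it is exponential in the number of left-side vertices, assumes the input is already a bipartite graph given as a hypothetical `adjacency` dict (left vertex to set of right neighbours), and uses no pruning or load balancing. For each subset of left vertices it takes the common neighbours, then extends the left side to every vertex adjacent to all of them, which yields a biclique that cannot be enlarged on either side.

```python
from itertools import combinations


def maximal_bicliques(adjacency):
    """Enumerate maximal bicliques of a small bipartite graph by brute force.

    adjacency: dict mapping each left vertex to the set of its right
    neighbours (an assumed input format, chosen for illustration).
    Returns a set of (left_vertices, right_vertices) frozenset pairs.
    """
    left = sorted(adjacency)
    found = set()
    for size in range(1, len(left) + 1):
        for subset in combinations(left, size):
            # Right vertices adjacent to every left vertex in the subset.
            common = set.intersection(*(set(adjacency[v]) for v in subset))
            if not common:
                continue
            # Close the left side: every left vertex adjacent to all of `common`.
            closed_left = frozenset(
                v for v in left if common <= set(adjacency[v])
            )
            found.add((closed_left, frozenset(common)))
    return found
```

On a graph with left vertices `a`, `b`, `c` and neighbour sets `{1, 2}`, `{2, 3}`, `{2}`, this sketch reports three maximal bicliques. The subset enumeration is what makes it impractical beyond toy inputs, which is the gap that pruning and partitioning approaches aim to close.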

arXiv.org e-Print Archive

Digital Repository @ Iowa State University (ISU)

Crossref

MRBench: A Benchmark for MapReduce Framework

Author: Heon Y. Yeom
Hyuck Han
Hyungsoo Jung
Kiyoung Kim
Kyungho Jeon
Shin-gyu Kim
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2008

MapReduce is Google’s programming model for easy development of scalable parallel applications that process huge quantities of data on many clusters. Due to its convenience and efficiency, MapReduce is used in various applications (e.g., web search services and on-line analytical processing). However, there are only a few good benchmarks for evaluating MapReduce implementations with realistic test sets. In this paper, we present MRBench, a benchmark for evaluating MapReduce systems. MRBench focuses on processing business-oriented queries and concurrent data modifications. To this end, we build MRBench to deal with large volumes of relational data and execute highly complex queries. With MRBench, users can evaluate the performance of MapReduce systems while varying environmental parameters such as data size and the number of (Map/Reduce) tasks. Our extensive experimental results show that MRBench is a useful tool to benchmark the capability of answering critical business questions.

CiteSeerX

Crossref

MapReduce for information retrieval evaluation: "Let's quickly test this on 12 TB of data"

Author: Hauff Claudia
Hiemstra Djoerd
Publication venue: Springer
Publication date: 01/01/2010

We propose to use MapReduce to quickly test new retrieval approaches on a cluster of machines by sequentially scanning all documents. We present a small case study in which we use a cluster of 15 low cost machines to search a web crawl of 0.5 billion pages showing that sequential scanning is a viable approach to running large-scale information retrieval experiments with little effort. The code is available to other researchers at: http://mirex.sourceforge.net

CiteSeerX

Crossref

Radboud Repository

University of Twente Research Information

Brute Force Information Retrieval Experiments using MapReduce

Author: Hauff Claudia
Hiemstra Djoerd
Publication venue: European Research Consortium for Informatics and Mathematics
Publication date: 01/01/2012

MIREX (MapReduce Information Retrieval Experiments) is a software library initially developed by the Database Group of the University of Twente for running large-scale information retrieval experiments on clusters of machines. MIREX has been tested on web crawls of up to half a billion web pages, totalling about 12.5 TB of data uncompressed. MIREX shows that executing test queries by a brute-force linear scan of pages is a viable alternative to running the test queries on a search engine’s inverted index. MIREX is open source and available for others

Radboud Repository

University of Twente Research Information

Real-Time MapReduce Scheduling

Author: Lee Insup
Loo Boon Thau
Phan Linh T.X.
Zhang Zhuoyao
Publication venue: ScholarlyCommons
Publication date: 01/01/2010

In this paper, we explore the feasibility of enabling the scheduling of mixed hard and soft real-time MapReduce applications. We first present an experimental evaluation of the popular Hadoop MapReduce middleware on the Amazon EC2 cloud. Our evaluation reveals tradeoffs between overall system throughput and execution time predictability, and highlights a number of factors affecting real-time scheduling, such as data placement, concurrent users, and master scheduling overhead. Based on our evaluation study, we present a formal model for capturing real-time MapReduce applications and the Hadoop platform. Using this model, we formulate the offline scheduling of real-time MapReduce jobs on a heterogeneous distributed Hadoop architecture as a constraint satisfaction problem (CSP) and introduce various search strategies for the formulation. We propose an enhancement of MapReduce’s execution model and a range of heuristic techniques for the online scheduling. We further outline some of our future directions that apply state-of-the-art techniques in the real-time scheduling literature

ScholarlyCommons@Penn