
Efficient Analytics on Ordered Datasets using MapReduce

By Mario Baldi, Narus Inc, Jiangtao Yin, Lixin Gao, Yong Liao and Antonio Nucci

Abstract

Efficiently analyzing data on a large scale can be vital for data owners to gain useful business intelligence. One of the most common datasets used to gain business intelligence is event log files. Oftentimes, time-sorted records in event log files need to be grouped by user ID or transaction ID in order to mine user behaviors, such as click-through rate, while preserving the time order. This kind of analytical workload is here referred to as RElative Order-pReserving based Grouping (Re-Org). Using MapReduce/Hadoop, a popular big data analysis tool, in an as-is manner for executing Re-Org tasks on ordered datasets is not efficient due to its internal sort-merge mechanism. We propose a framework that adopts an efficient group-order-merge mechanism to provide faster execution of Re-Org tasks and implement it by extending Hadoop. Experimental results show a 2.2x speedup over executing Re-Org tasks in plain vanilla Hadoop.
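
To make the Re-Org workload concrete, the following is a minimal single-machine sketch in Python of grouping time-sorted records by user ID while keeping each group's records in their original order. It only illustrates the semantics of the task; it is not the paper's group-order-merge mechanism or its Hadoop extension. The function name `reorg`, the record layout `(timestamp, user_id, action)`, and the sample log are all assumptions made for illustration.

```python
def reorg(records, key):
    """Group records by key, preserving the relative input order within each group.

    Assumes `records` is already sorted (e.g. by timestamp), so a single pass
    that appends to per-key lists keeps each group ordered without re-sorting.
    """
    groups = {}  # dicts keep insertion order in Python 3.7+
    for record in records:
        groups.setdefault(key(record), []).append(record)
    return groups


# Hypothetical click log, already sorted by timestamp: (timestamp, user_id, action)
log = [
    (1, "u1", "view"),
    (2, "u2", "view"),
    (3, "u1", "click"),
    (4, "u2", "view"),
    (5, "u1", "view"),
]

grouped = reorg(log, key=lambda r: r[1])
for user, events in grouped.items():
    print(user, events)
```

Because the input is already ordered, one pass with per-key appends is enough; a general-purpose sort on the grouping key would repeat ordering work the dataset already encodes, which is the kind of overhead the abstract attributes to the sort-merge step.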

Topics: General Terms: Design, Experimentation, Performance. Keywords: MapReduce/Hadoop, distributed framework, ordered dataset
Year: 2013
OAI identifier: oai:CiteSeerX.psu:10.1.1.353.5139
Provided by: CiteSeerX