415 research outputs found

Teaching HDFS/MapReduce Systems Concepts to Undergraduates

Author: Apon Amy W.
Duffy Edward B
Ngo Linh B
Publication venue: Clemson University Libraries
Publication date: 01/05/2014

This paper presents the development of a Hadoop MapReduce module that has been taught in a distributed computing course for upper-level undergraduate computer science students at Clemson University. The paper describes our teaching experiences and the student feedback, gathered over several semesters, that helped to shape the course. We provide suggested best practices for lecture materials, the computing platform, and the teaching methods. In addition, the computing platform and teaching methods can be extended to accommodate emerging technologies and modules for related courses.

Leading Undergraduate Students to Big Data Generation

Author: Shen Ju
Yang Jianjun
Publication date: 01/03/2015

People today face a flood of data. Data are being collected at unprecedented scale in many areas, such as networking, image processing, virtualization, scientific computation, and algorithms. Such huge data sets are now called Big Data, an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process them using traditional data processing applications. In this article, the authors present a unique approach that uses a network simulator and image processing tools to train students' abilities to learn, analyze, manipulate, and apply Big Data, thereby developing students' hands-on skills with Big Data and their critical thinking abilities. The authors used a novel image-based rendering algorithm with user intervention to generate a realistic 3D virtual world. The learning outcomes are significant.

arXiv.org e-Print Archive

Modeling performance of Hadoop applications: A journey from queueing networks to stochastic well formed nets

Author: A Castiglione
D Ardagna
DJ Dubois
DR Liang
E Vianna
ED Lazowska
HV Jagadish
J Polo
JE Marynowski
K Jensen
K Kambatla
L Aguilera-Mendoza
M Bertoli
M Lin
RD Nelson
S Baarir
VW Mak
WW Chu
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2016

Nowadays, many enterprises commit to the extraction of actionable knowledge from huge datasets as part of their core business activities. Applications belong to very different domains, such as fraud detection or one-to-one marketing, and encompass business analytics and decision support in both the private and public sectors. In these scenarios, a central place is held by the MapReduce framework and in particular its open source implementation, Apache Hadoop. In such environments, new challenges arise in the area of job performance prediction, with the need to provide Service Level Agreement guarantees to the end user and to avoid wasting computational resources. In this paper we provide performance analysis models to estimate MapReduce job execution times in Hadoop clusters governed by the YARN Capacity Scheduler. We propose models of increasing complexity and accuracy, ranging from queueing networks to stochastic well formed nets, able to estimate job performance under a number of scenarios of interest, including unreliable resources. The accuracy of our models is evaluated using the TPC-DS industry benchmark, with experiments run on Amazon EC2 and at the CINECA Italian supercomputing center. The results show that the average accuracy we can achieve is in the range 9–14%.

Archivio istituzionale della ricerca - Politecnico di Milano

A Comparative Study of Hadoop MapReduce, Apache Spark & Apache Flink for Data Science

Author: Akil Bilal
Publication venue: Faculty of Engineering and Information Technologies, School of Information Technologies
Publication date: 29/03/2018

Distributed data processing platforms for cloud computing are important tools for large-scale data analytics. Apache Hadoop MapReduce has become the de facto standard in this space, though its programming interface is relatively low-level, requiring many implementation steps even for simple analysis tasks. This has led to the development of advanced dataflow-oriented platforms, most prominently Apache Spark and Apache Flink. These aim not only to improve performance but also to provide high-level data processing functionality, such as filtering and join operators, which should make data analysis tasks easier to develop. But without comparison data available, how would data scientists know which system they should choose? This research compares Apache Hadoop MapReduce, Apache Spark, and Apache Flink from the perspectives of performance, usability, and practicality for batch-oriented data analytics. We propose and apply a methodology which guides the conception of multidimensional software comparisons and the presentation of their results. The methodology was effective, providing direction and structure to the comparison, and should be helpful for future comparisons. The results confirm that Spark and Flink are superior to Hadoop MapReduce in performance and usability. Spark and Flink were similar in all three considerations; however, as per the methodology, readers have the flexibility to adjust weightings to their needs, which could differentiate them. We also report on the design, execution, and results of a large-scale usability study with a cohort of master's students, who learned and worked with all three platforms, solving different use cases in data science contexts. Our findings show that Spark and Flink are preferred over MapReduce. Among participants, there was no significant difference in perceived preference or development time between Spark and Flink. These results were included in the usability component of the multidimensional comparison.