Search CORE

361 research outputs found

Spark-DIY: A framework for interoperable Spark Operations with high performance Block-Based Data Models

Author: Caino Lores Silvina
Carretero Pérez Jesús
Nicolae Bogdan
Peterka Tom
Yildiz Orcun
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 17/12/2018
Field of study

This work was partially funded by the Spanish Ministry of Economy, Industry and Competitiveness under the grant TIN2016-79637-P ”Towards Unification of HPC and Big Data Paradigms”; the Spanish Ministry of Education under the FPU15/00422 Training Program for Academic and Teaching Staff Grant; the Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, under Contract DE-AC02-06CH11357; and by DOE with agreement No. DE-DC000122495, program manager Laura Biven

Universidad Carlos III de Madrid e-Archivo

A Comparative Study of Hadoop MapReduce, Apache Spark & Apache Flink for Data Science

Author: Akil Bilal
Publication venue: Faculty of Engineering and Information Technologies, School of Information Technologies
Publication date: 29/03/2018
Field of study

Distributed data processing platforms for cloud computing are important tools for large-scale data analytics. Apache Hadoop MapReduce has become the de facto standard in this space, though its programming interface is relatively low-level, requiring many implementation steps even for simple analysis tasks. This has led to the development of advanced dataflow oriented platforms, most prominently Apache Spark and Apache Flink. Those not only aim to improve performance, but also provide high-level data processing functionality, such as filtering and join operators, which should make data analysis tasks easier to develop. But without comparison data available, how would data scientists know which system they should choose? This research compares: Apache Hadoop MapReduce; Apache Spark; and Apache Flink, from the perspectives of performance, usability and practicality for batch-oriented data analytics. We propose and apply a methodology which guides the conception of multidimensional software comparisons and the presentation of their results. The methodology was effective, providing direction and structure to the comparison, and should serve as helpful for future comparisons. The results confirm that Spark and Flink are superior to Hadoop MapReduce in performance and usability. Spark and Flink were similar in all three considerations, however as per the methodology, readers have the flexibility to adjust weightings to their needs, which could differentiate them. We also report on the design, execution and results of a large-scale usability study with a cohort of masters students, who learn and work with all three platforms, solving different use cases in data science contexts. Our findings show that Spark and Flink are preferred platforms over MapReduce. Among participants, there was no significant difference in perceived preference or development time between both Spark and Flink. These results were included in the usability component of the multidimensional comparison

Cumulon: Cloudbased statistical analysis from users perspective.

Author: Botong Huang
Jun Yang
Nicholas W D Jarrett
Sayan Mukherjee
Shivnath Babu
Publication venue
Publication date: 01/01/2014
Field of study

Abstract Cumulon is a system aimed at simplifying the developmen

CiteSeerX