Matrix Factorization at Scale: a Comparison of Scientific Data Analytics in Spark and C+MPI Using Three Case Studies

Author: Canon Shane
Chhugani Jatin
Demmel James
Devarakonda Aditya
Gerhardt Lisa
Gittens Alex
Harrell Jim
Kottalam Jey
Krishnamurthy Venkat
Liu Jialin
Mahoney Michael W.
Maschhoff Kristyn
Prabhat
Racah Evan
Ringenburg Michael
Sharma Pramod
Yang Jiyan
Publication date: 12/05/2016

We explore the trade-offs of performing linear algebra using Apache Spark, compared to traditional C and MPI implementations on HPC platforms. Spark is designed for data analytics on cluster computing platforms with access to local disks and is optimized for data-parallel tasks. We examine three widely used and important matrix factorizations: NMF (for physical plausibility), PCA (for its ubiquity), and CX (for data interpretability). We apply these methods to TB-sized problems in particle physics, climate modeling, and bioimaging. The data matrices are tall and skinny, which enables the algorithms to map conveniently into Spark's data-parallel model. We perform scaling experiments on up to 1600 Cray XC40 nodes, describe the sources of slowdowns, and provide tuning guidance to obtain high performance.

arXiv.org e-Print Archive

eScholarship - University of California

PUMA: Purdue MapReduce Benchmarks Suite

Author: Ahmad Faraz
Lee Seyong
Thottethodi Mithuna
Vijaykumar T. N.
Publication venue: 'Purdue University (bepress)'
Publication date: 30/10/2012

The Performance Comparison of Hadoop and Spark

Author: Pan Shengti
Publication venue: The Repository at St. Cloud State
Publication date: 01/03/2016

The main focus of this paper is to compare the performance of Hadoop and Spark on applications such as iterative computation and real-time data processing. The runtime architectures of Spark and Hadoop are compared to illustrate their differences, and the components of their ecosystems are tabulated to show their respective characteristics. We highlight how the performance of Spark and Hadoop compares as data size and iteration counts grow, and show how to tune Hadoop and Spark to achieve higher performance. The paper ends with several appendixes that describe how to install and launch Hadoop and Spark, how to implement the three case studies in Java, and how to verify the correctness of the results.

St. Cloud State University

Massively Parallel Algorithms for Small Subgraph Counting

Author: Biswas Amartya Shankha
Eden Talya
Liu Quanquan C.
Mitrović Slobodan
Rubinfeld Ronitt
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2022)
Publication date: 01/01/2022

Dagstuhl Research Online Publication Server

Big Data Analysis

Author: A Acquisti
A Thusoo
B Glavic
C-C Lee
D Bollier
JG Koomey
M Li
M Strohbach
N Marz
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2016

The value of big data is predicated on the ability to detect trends and patterns and, more generally, to make sense of large volumes of data that often comprise a heterogeneous mix of formats, structures, and semantics. Big data analysis is the component of the big data value chain that focuses on transforming raw acquired data into a coherent, usable resource suitable for analysis. Using a range of interviews with key stakeholders in small and large companies and academia, this chapter outlines key insights, state of the art, emerging trends, future requirements, and sectorial case studies for data analysis.

OAPEN Library

Springer - Publisher Connector

DI-fusion