352 research outputs found
Matrix Factorization at Scale: a Comparison of Scientific Data Analytics in Spark and C+MPI Using Three Case Studies
We explore the trade-offs of performing linear algebra using Apache Spark,
compared to traditional C and MPI implementations on HPC platforms. Spark is
designed for data analytics on cluster computing platforms with access to local
disks and is optimized for data-parallel tasks. We examine three widely-used
and important matrix factorizations: NMF (for physical plausability), PCA (for
its ubiquity) and CX (for data interpretability). We apply these methods to
TB-sized problems in particle physics, climate modeling and bioimaging. The
data matrices are tall-and-skinny which enable the algorithms to map
conveniently into Spark's data-parallel model. We perform scaling experiments
on up to 1600 Cray XC40 nodes, describe the sources of slowdowns, and provide
tuning guidance to obtain high performance
The Performance Comparison of Hadoop and Spark
The main focus of this paper is to compare the performance between Hadoop and Spark on some applications, such as iterative computation and real-time data processing. The runtime architectures of both Spark and Hadoop will be compared to illustrate their differences, and the components of their ecosystems will be tabled to show their respective characteristics. In this paper, we will highlight the performance comparison between Spark and Hadoop as the growth of data size and iteration counts, and also show how to tune in Hadoop and Spark in order to achieve higher performance. At the end, there will be several appendixes which describes how to install and launch Hadoop and Spark, how to implement the three case studies using java programming, and how to verify the correctness of the running results
Big Data Analysis
The value of big data is predicated on the ability to detect trends and patterns and more generally to make sense of the large volumes of data that is often comprised of a heterogeneous mix of format, structure, and semantics. Big data analysis is the component of the big data value chain that focuses on transforming raw acquired data into a coherent usable resource suitable for analysis. Using a range of interviews with key stakeholders in small and large companies and academia, this chapter outlines key insights, state of the art, emerging trends, future requirements, and sectorial case studies for data analysis
- …