
    Matrix Factorization at Scale: a Comparison of Scientific Data Analytics in Spark and C+MPI Using Three Case Studies

    We explore the trade-offs of performing linear algebra using Apache Spark, compared to traditional C and MPI implementations on HPC platforms. Spark is designed for data analytics on cluster computing platforms with access to local disks and is optimized for data-parallel tasks. We examine three widely used and important matrix factorizations: NMF (for physical plausibility), PCA (for its ubiquity) and CX (for data interpretability). We apply these methods to TB-sized problems in particle physics, climate modeling and bioimaging. The data matrices are tall-and-skinny, which enables the algorithms to map conveniently onto Spark's data-parallel model. We perform scaling experiments on up to 1600 Cray XC40 nodes, describe the sources of slowdowns, and provide tuning guidance to obtain high performance.
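
    A minimal sketch of how one of these tall-and-skinny factorizations (PCA) maps onto Spark's data-parallel model, assuming PySpark and MLlib's RowMatrix API; the input path, dimensions and rank k are hypothetical and this is not the paper's implementation:

```python
# Sketch: PCA of a tall-and-skinny matrix with Spark MLlib's RowMatrix.
# Not the paper's code; the input path and rank k are hypothetical.
from pyspark.sql import SparkSession
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix

spark = SparkSession.builder.appName("tall-skinny-pca").getOrCreate()

# Each record is one row of the m x n matrix A (m >> n), so the rows
# partition naturally across executors in Spark's data-parallel model.
rows = (spark.sparkContext
        .textFile("hdfs:///data/matrix_rows.csv")  # hypothetical path
        .map(lambda line: Vectors.dense([float(x) for x in line.split(",")])))

mat = RowMatrix(rows)

# The principal components come from the small n x n Gram matrix,
# accumulated by a single aggregation over the distributed rows.
k = 10
pcs = mat.computePrincipalComponents(k)   # local n x k matrix
projected = mat.multiply(pcs)             # distributed m x k matrix

print(pcs.numRows, pcs.numCols, projected.numRows())
spark.stop()
```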

    Status report on the NCRIS eResearch capability summary

    Preface: The period 2006 to 2014 saw an unprecedented approach by the Australian Government to the national support of eResearch infrastructure. Not only has investment been at a significantly greater scale than previously, but the intent and approach have been highly innovative, shaped by a strategic approach to research support in which the critical element, the catchword, has been collaboration. The innovative directions shaped by this strategy, under the banner of the Australian Government’s National Collaborative Research Infrastructure Strategy (NCRIS), have led to significant and creative initiatives and activity, seminal to new research and fields of discovery. Origin: This document is a Technical Report on the Status of the NCRIS eResearch Capability. It was commissioned by the Australian Government Department of Education and Training in the second half of 2014 to examine a range of questions and issues concerning the development of this infrastructure over the period 2006-2014. The infrastructure has been built and implemented over this period following investments made by the Australian Government amounting to over $430 million, under a number of funding initiatives.

    Applying big data paradigms to a large scale scientific workflow: lessons learned and future directions

    The increasing amounts of data related to the execution of scientific workflows have raised awareness of their shift towards parallel data-intensive problems. In this paper, we deliver our experience combining the traditional high-performance computing and grid-based approaches with Big Data analytics paradigms, in the context of scientific ensemble workflows. Our goal was to assess and discuss the suitability of such data-oriented mechanisms for production-ready workflows, especially in terms of scalability. We focused on two key elements in the Big Data ecosystem: the data-centric programming model, and the underlying infrastructure that integrates storage and computation in each node. We experimented with a representative MPI-based iterative workflow from the hydrology domain, EnKF-HGS, which we re-implemented using the Spark data analysis framework. We conducted experiments on a local cluster, a private cloud running OpenNebula, and the Amazon Elastic Compute Cloud (Amazon EC2). The results we obtained were analysed to synthesize the lessons we learned from this experience, while discussing promising directions for further research. This work was supported by the Spanish Ministry of Economics and Competitiveness grant TIN-2013-41350-P, the IC1305 COST Action “Network for Sustainable Ultrascale Computing Platforms” (NESUS), and the FPU Training Program for Academic and Teaching Staff Grant FPU15/00422 by the Spanish Ministry of Education.
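
    A minimal sketch of how an iterative ensemble workflow of this kind might be expressed in Spark's data-centric model, assuming PySpark and NumPy; this is not the EnKF-HGS code, propagate_member is a hypothetical stand-in for the hydrological forward model, and the analysis step is deliberately simplified:

```python
# Sketch: an iterative ensemble (EnKF-style) workflow in Spark's model.
# Not the EnKF-HGS implementation; propagate_member is a hypothetical
# placeholder forward model and the analysis step is simplified.
import numpy as np
from pyspark.sql import SparkSession

def propagate_member(state, seed):
    """Placeholder forward model: perturb the member state."""
    rng = np.random.default_rng(seed)
    return state + rng.normal(scale=0.1, size=state.shape)

spark = SparkSession.builder.appName("enkf-sketch").getOrCreate()
sc = spark.sparkContext

n_members, state_dim, n_cycles = 64, 1000, 5
ensemble = sc.parallelize(
    [(i, np.zeros(state_dim)) for i in range(n_members)], n_members)

for cycle in range(n_cycles):
    # Forecast step: every ensemble member advances independently,
    # which maps directly onto a data-parallel map over the RDD.
    forecast = ensemble.map(
        lambda kv, c=cycle: (kv[0], propagate_member(kv[1], kv[0] + c)))

    # Analysis step (simplified): pull each member toward the ensemble mean.
    # Default args bind the current values, avoiding late-binding closures.
    mean = forecast.map(lambda kv: kv[1]).reduce(lambda a, b: a + b) / n_members
    ensemble = forecast.map(
        lambda kv, m=mean: (kv[0], kv[1] + 0.5 * (m - kv[1])))

print(ensemble.map(lambda kv: float(kv[1].mean())).take(3))
spark.stop()
```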

    Spark-DIY: A Framework for Interoperable Spark Operations with High Performance Block-Based Data Models

    This work was partially funded by the Spanish Ministry of Economy, Industry and Competitiveness under the grant TIN2016-79637-P “Towards Unification of HPC and Big Data Paradigms”; the Spanish Ministry of Education under the FPU15/00422 Training Program for Academic and Teaching Staff Grant; the Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, under Contract DE-AC02-06CH11357; and by DOE with agreement No. DE-DC000122495, program manager Laura Biven.

    Experiences with workflows for automating data-intensive bioinformatics

    High-throughput technologies, such as next-generation sequencing, have turned molecular biology into a data-intensive discipline, requiring bioinformaticians to use high-performance computing resources and carry out data management and analysis tasks at large scale. Workflow systems can be useful to simplify the construction of analysis pipelines that automate tasks, support reproducibility and provide measures for fault tolerance. However, workflow systems can incur significant development and administration overhead, so bioinformatics pipelines are often still built without them. We present experiences with workflows and workflow systems within the bioinformatics community participating in a series of hackathons and workshops of the EU COST Action SeqAhead. The participating organizations work on similar problems, but have addressed them with different strategies and solutions. This fragmentation of efforts is inefficient and leads to redundant and incompatible solutions. Based on our experiences, we define a set of recommendations for future systems to enable efficient yet simple bioinformatics workflow construction and execution.