New technologies to bridge the gap between High Performance Computing (HPC) and Big Data
The unification of HPC and Big Data has received increasing attention in recent years. It is a common belief that exascale computing and Big Data are closely associated, since HPC requires processing the large-scale data produced by scientific instruments and simulations. At the same time, however, the tools and cultures of the HPC and Big Data communities have been observed to differ significantly. One of the most important obstacles on the path to convergence is the difference in their software stacks. This thesis addresses the research challenge of bridging the gap between the Big Data and HPC worlds. With this goal in mind, a set of tools and technologies will be developed and integrated into a new unified Big Data-HPC framework that will allow the execution of scientific multi-language applications in both environments using containers.
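The abstract does not spell out the execution mechanism, but a minimal sketch of the container-based idea it describes might look like the following; the image names, the driver script, and the choice of Singularity on the HPC side versus Docker on the Big Data side are all hypothetical assumptions, not details from the thesis.

```python
import subprocess

def run_containerized_task(image, command, backend="hpc"):
    """Run the same application image on either an HPC node or a Big Data node.

    The image, command, and container runtimes used here are illustrative only.
    """
    if backend == "hpc":
        # Singularity/Apptainer is common on HPC clusters because it runs unprivileged.
        argv = ["singularity", "exec", image] + command
    else:
        # Docker is the usual choice on cloud/Big Data worker nodes.
        argv = ["docker", "run", "--rm", image] + command
    return subprocess.run(argv, check=True)

# Hypothetical usage: the same multi-language application runs unchanged on both stacks.
# run_containerized_task("simulation.sif", ["python3", "driver.py"], backend="hpc")
# run_containerized_task("simulation:latest", ["python3", "driver.py"], backend="bigdata")
```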
Toward High-Performance Computing and Big Data Analytics Convergence: The Case of Spark-DIY
Convergence between high-performance computing (HPC) and big data analytics (BDA) is currently an established research area that has spawned new opportunities for unifying the platform layer and data abstractions in these ecosystems. This work presents an architectural model that enables the interoperability of established BDA and HPC execution models, reflecting the key design features that interest both the HPC and BDA communities, and including an abstract data collection and operational model that generates a unified interface for hybrid applications. This architecture can be implemented in different ways depending on the process- and data-centric platforms of choice and the mechanisms put in place to effectively meet the requirements of the architecture. The Spark-DIY platform is introduced in the paper as a prototype implementation of the architecture proposed. It preserves the interfaces and execution environment of the popular BDA platform Apache Spark, making it compatible with any Spark-based application and tool, while providing efficient communication and kernel execution via DIY, a powerful communication pattern library built on top of MPI. Spark-DIY is then analyzed in terms of performance by building a representative use case from the hydrogeology domain, EnKF-HGS. This application is a clear example of how current HPC simulations are evolving toward hybrid HPC-BDA applications, integrating HPC simulations within a BDA environment. This work was supported in part by the Spanish Ministry of Economy, Industry and Competitiveness under Grant TIN2016-79637-P (Toward Unification of HPC and Big Data Paradigms), in part by the Spanish Ministry of Education under Grant FPU15/00422 (Training Program for Academic and Teaching Staff), in part by the Advanced Scientific Computing Research program, Office of Science, U.S. Department of Energy, under Contract DE-AC02-06CH11357, and in part by the DOE under Agreement DE-DC000122495, Program Manager Laura Biven.
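Since Spark-DIY preserves Apache Spark's interfaces, a compatible hybrid application is, from the programmer's point of view, an ordinary Spark program. A minimal PySpark sketch of the kind of pattern the abstract describes (a data-parallel ensemble of simulation runs driven from Spark, as in EnKF-style workflows) is shown below; the kernel binary `./simulate_member`, the ensemble size, and the plain `subprocess` call standing in for the DIY/MPI dispatch that Spark-DIY itself performs are all assumptions for illustration.

```python
import subprocess
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hybrid-hpc-bda-sketch").getOrCreate()
sc = spark.sparkContext

# One ensemble member per record; in EnKF-style workflows each member is an
# independent simulation whose outputs are later combined analytically.
members = sc.parallelize(range(100), numSlices=10)

def run_member(member_id):
    # Hypothetical external simulation kernel; Spark-DIY would dispatch such
    # kernels through DIY/MPI while this Spark-level code stays unchanged.
    subprocess.run(["./simulate_member", str(member_id)], check=True)
    return member_id

finished = members.map(run_member).collect()
spark.stop()
```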
A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures
Scientific problems that depend on processing large amounts of data require
overcoming challenges in multiple areas: managing large-scale data
distribution, co-placement and scheduling of data with compute resources, and
storing and transferring large volumes of data. We analyze the ecosystems of
the two prominent paradigms for data-intensive applications, hereafter referred
to as the high-performance computing and the Apache-Hadoop paradigms. We propose a basis, common terminology, and functional factors upon which to analyze both paradigms. We discuss the concept of "Big Data Ogres"
and their facets as means of understanding and characterizing the most common
application workloads found across the two paradigms. We then discuss the
salient features of the two paradigms, and compare and contrast the two
approaches. Specifically, we examine common implementation/approaches of these
paradigms, shed light upon the reasons for their current "architecture" and
discuss some typical workloads that utilize them. In spite of the significant
software distinctions, we believe there is architectural similarity. We discuss
the potential integration of different implementations, across the different
levels and components. Our comparison progresses from a fully qualitative
examination of the two paradigms to a semi-quantitative methodology. We use a simple and broadly used Ogre (K-means clustering) and characterize its performance
on a range of representative platforms, covering several implementations from
both paradigms. Our experiments provide an insight into the relative strengths
of the two paradigms. We propose that the set of Ogres will serve as a
benchmark to evaluate the two paradigms along different dimensions. Comment: 8 pages, 2 figures
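The K-means Ogre used above as a semi-quantitative benchmark follows the same computational pattern in every implementation: a map step that assigns points to the nearest centroid and a reduce step that recomputes centroids. A minimal single-node NumPy sketch of that pattern, with a made-up problem size, is given below for reference; it is not the paper's benchmark code.

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Plain NumPy K-means: assign each point to its nearest centroid (map),
    then recompute each centroid as the mean of its points (reduce)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Map step: distance from every point to every centroid, then assignment.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Reduce step: new centroid = mean of assigned points (keep the old one if a cluster is empty).
        centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return centroids, labels

# Hypothetical problem size; the paper benchmarks distributed versions of this kernel.
points = np.random.default_rng(1).random((10_000, 3))
centroids, labels = kmeans(points, k=8)
```

An MPI implementation distributes the assignment step across ranks and combines per-cluster sums with a reduction, while a Spark implementation expresses the same two steps as a map followed by a keyed aggregation, which is what makes the kernel a useful cross-paradigm benchmark.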
Research and Education in Computational Science and Engineering
Over the past two decades the field of computational science and engineering
(CSE) has penetrated both basic and applied research in academia, industry, and
laboratories to advance discovery, optimize systems, support decision-makers,
and educate the scientific and engineering workforce. Informed by centuries of
theory and experiment, CSE performs computational experiments to answer
questions that neither theory nor experiment alone is equipped to answer. CSE
provides scientists and engineers of all persuasions with algorithmic
inventions and software systems that transcend disciplines and scales. Carried
on a wave of digital technology, CSE brings the power of parallelism to bear on
troves of data. Mathematics-based advanced computing has become a prevalent
means of discovery and innovation in essentially all areas of science,
engineering, technology, and society; and the CSE community is at the core of
this transformation. However, a combination of disruptive
developments---including the architectural complexity of extreme-scale
computing, the data revolution that engulfs the planet, and the specialization
required to follow the applications to new frontiers---is redefining the scope
and reach of the CSE endeavor. This report describes the rapid expansion of CSE
and the challenges to sustaining its bold advances. The report also presents
strategies and directions for CSE research and education for the next decade. Comment: Major revision, to appear in SIAM Review
Chiminey: Reliable Computing and Data Management Platform in the Cloud
Enabling scientific experiments that are embarrassingly parallel, long running, and data-intensive to run in a cloud-based execution environment is a desirable, though complex, undertaking for many researchers. The management of
such virtual environments is cumbersome and not necessarily within the core
skill set for scientists and engineers. We present here Chiminey, a software
platform that enables researchers to (i) run applications on both traditional
high-performance computing and cloud-based computing infrastructures, (ii)
handle failure during execution, (iii) curate and visualise execution outputs,
(iv) share such data with collaborators or the public, and (v) search for
publicly available data. Comment: Preprint, ICSE 201
From Quantity to Quality: Massive Molecular Dynamics Simulation of Nanostructures under Plastic Deformation in Desktop and Service Grid Distributed Computing Infrastructure
The distributed computing infrastructure (DCI) on the basis of BOINC and
EDGeS-bridge technologies for high-performance distributed computing is used
for porting the sequential molecular dynamics (MD) application to its parallel
version for DCI with Desktop Grids (DGs) and Service Grids (SGs). The actual
metrics of the working DG-SG DCI were measured; a normal distribution of host performances and signs of log-normal distributions of other characteristics (CPUs, RAM, and HDD per host) were found. The practical
feasibility and high efficiency of the MD simulations on the basis of DG-SG DCI
were demonstrated in an experiment with massive MD simulations of a large quantity of aluminum nanocrystals. Statistical analysis (Kolmogorov-Smirnov test, moment analysis, and bootstrapping analysis) of the defect density distribution over the ensemble of nanocrystals showed that a change of plastic deformation mode is followed by a qualitative change in the type of the defect density distribution over the ensemble. Some
limitations (fluctuating performance, unpredictable availability of resources,
etc.) of the typical DG-SG DCI were outlined, and some advantages (high
efficiency, high speedup, and low cost) were demonstrated. Deploying on the DG DCI makes it possible to obtain new scientific results from the simulation of numerous configurations by harnessing sufficient computational power to undertake MD simulations over a wider range of physical parameters (configurations) in a much shorter timeframe. Comment: 13 pages, 11 figures (http://journals.agh.edu.pl/csci/article/view/106)
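The statistical treatment mentioned in the abstract (a Kolmogorov-Smirnov test of the defect density distribution plus bootstrapping over the ensemble) follows a standard recipe; a hedged SciPy sketch on synthetic data is shown below. The log-normal stand-in data and all sizes are invented for illustration and are not the paper's measurements.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic stand-in for defect densities measured over an ensemble of
# nanocrystals; the real data come from the massive MD runs on the DG-SG DCI.
defect_density = rng.lognormal(mean=0.0, sigma=0.5, size=2000)

# Kolmogorov-Smirnov test against a fitted log-normal distribution.
shape, loc, scale = stats.lognorm.fit(defect_density, floc=0)
ks_stat, p_value = stats.kstest(defect_density, "lognorm", args=(shape, loc, scale))

# Bootstrap 95% confidence interval for the mean defect density over the ensemble.
boot_means = [rng.choice(defect_density, size=len(defect_density), replace=True).mean()
              for _ in range(1000)]
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

print(f"KS statistic: {ks_stat:.3f}, p-value: {p_value:.3f}")
print(f"Bootstrap 95% CI for the mean: [{ci_low:.3f}, {ci_high:.3f}]")
```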