Experiences in porting mini-applications to OpenACC and OpenMP on heterogeneous systems
This article studies mini-applications (Minisweep, GenASiS, GPP, and FF) that use computational methods commonly encountered in HPC. We have ported these applications to develop OpenACC and OpenMP versions, and evaluated their performance on Titan (Cray XK7 with K20x GPUs), Cori (Cray XC40 with Intel KNL), Summit (IBM AC922 with Volta GPUs), and Cori-GPU (Cray CS-Storm 500NX with Intel Skylake and Volta GPUs). Our goals are for these new ports to be useful to both application and compiler developers, to document the lessons learned and the methodology for creating optimized OpenMP and OpenACC versions, and to describe possible migration paths between the two specifications. Cases where specific directives or code patterns result in improved performance for a given architecture are highlighted. We also discuss the functionality and maturity of the latest compilers available on the above platforms with respect to their OpenACC and OpenMP implementations.
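As an illustration of the kind of migration path the article describes, here is a minimal sketch of the same loop offloaded under both specifications. It is not taken from the article's ports; the saxpy kernel and the data clauses are illustrative assumptions, but the acc parallel loop / omp target teams distribute parallel for correspondence is the usual starting point when translating between the two models.

```c
/* Illustrative example: one loop, two directive dialects.
 * Not code from the article's ports. */

void saxpy_acc(int n, float a, const float *x, float *y)
{
    /* OpenACC: copy x to the device, move y both ways */
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

void saxpy_omp(int n, float a, const float *x, float *y)
{
    /* OpenMP offload equivalent of the loop above */
    #pragma omp target teams distribute parallel for \
            map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```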
Direct N-body code on low-power embedded ARM GPUs
This work arises in the context of the ExaNeSt project, which aims at the design and development of an exascale-ready supercomputer with a low energy consumption profile, able to support the most demanding scientific and technical applications. The ExaNeSt compute unit consists of densely packed low-power 64-bit ARM processors embedded within Xilinx FPGA SoCs. SoC boards are heterogeneous architectures in which computing power is supplied by both CPUs and GPUs, and they are emerging as a possible low-power and low-cost alternative to clusters based on traditional CPUs. A state-of-the-art direct N-body code suitable for astrophysical simulations has been re-engineered to exploit SoC heterogeneous platforms based on ARM CPUs and embedded GPUs. Performance tests show that embedded GPUs can be effectively used to accelerate real-life scientific calculations, and that they are also promising because of their energy efficiency, a crucial design constraint for future exascale platforms.
Comment: 16 pages, 7 figures, 1 table, accepted for publication in the Computing Conference 2019 proceedings
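For context, the computational core of any direct N-body code is the O(N^2) pairwise force summation. The sketch below shows that pattern in plain C with an illustrative offload directive; it is a generic sketch under assumed array layouts and a softening parameter eps, not the re-engineered code from the paper, whose embedded-GPU kernels are its own.

```c
#include <math.h>

/* Direct-summation gravitational acceleration: for each body i,
 * accumulate contributions from every body j. The softening eps
 * regularizes close encounters and makes the self-term vanish.
 * Array layout (separate x/y/z/m arrays of length n) is an
 * illustrative assumption. */
void nbody_accel(int n, const double *x, const double *y,
                 const double *z, const double *m,
                 double *ax, double *ay, double *az, double eps)
{
    #pragma acc parallel loop copyin(x[0:n], y[0:n], z[0:n], m[0:n]) \
                              copyout(ax[0:n], ay[0:n], az[0:n])
    for (int i = 0; i < n; ++i) {
        double axi = 0.0, ayi = 0.0, azi = 0.0;
        for (int j = 0; j < n; ++j) {
            double dx = x[j] - x[i];
            double dy = y[j] - y[i];
            double dz = z[j] - z[i];
            double r2 = dx*dx + dy*dy + dz*dz + eps*eps;
            double inv_r3 = 1.0 / (r2 * sqrt(r2));
            axi += m[j] * dx * inv_r3;
            ayi += m[j] * dy * inv_r3;
            azi += m[j] * dz * inv_r3;
        }
        ax[i] = axi; ay[i] = ayi; az[i] = azi;
    }
}
```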
What does fault tolerant Deep Learning need from MPI?
Deep Learning (DL) algorithms have become the de facto Machine Learning (ML) algorithms for large-scale data analysis. DL algorithms are computationally expensive: even distributed DL implementations that use MPI require days of training (model learning) time on commonly studied datasets. Long-running DL applications become susceptible to faults, requiring the development of a fault-tolerant system infrastructure in addition to fault-tolerant DL algorithms. This raises an important question: what is needed from MPI for designing fault-tolerant DL implementations? In this paper, we address this problem for permanent faults. We motivate the need for a fault-tolerant MPI specification by an in-depth consideration of recent innovations in DL algorithms and their properties, which drive the need for specific fault tolerance features. We present an in-depth discussion of the suitability of different parallelism types (model, data, and hybrid); the need (or lack thereof) for checkpointing of any critical data structures; and, most importantly, consideration of several fault tolerance proposals in MPI (user-level fault mitigation (ULFM), Reinit) and their applicability to fault-tolerant DL implementations. We leverage a distributed memory implementation of Caffe, currently available in the Machine Learning Toolkit for Extreme Scale (MaTEx). We implement our approach by extending MaTEx-Caffe to use a ULFM-based implementation. Our evaluation using the ImageNet dataset and the AlexNet and GoogLeNet neural network topologies demonstrates the effectiveness of the proposed fault-tolerant DL implementation using Open MPI-based ULFM.
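As a minimal sketch of the ULFM primitives the paper evaluates (assuming the Open MPI ULFM extension; the helper name allreduce_or_shrink is illustrative and this is not MaTEx-Caffe code), a data-parallel training step can react to a failed rank during the gradient allreduce by shrinking the communicator and continuing on the survivors:

```c
#include <mpi.h>
#include <mpi-ext.h>   /* MPIX_ ULFM extensions in Open MPI */

/* Attempt an allreduce of gradients; on a process failure,
 * revoke the old communicator, shrink it to the survivors,
 * and let the caller continue training with fewer ranks.
 * Assumes MPI_Comm_set_errhandler(*comm, MPI_ERRORS_RETURN)
 * was called, so errors are returned rather than aborting. */
static int allreduce_or_shrink(double *grad, int count, MPI_Comm *comm)
{
    int rc = MPI_Allreduce(MPI_IN_PLACE, grad, count,
                           MPI_DOUBLE, MPI_SUM, *comm);
    if (rc == MPI_SUCCESS)
        return 0;

    /* A peer died (or the communicator was already revoked):
     * revoke so every survivor exits pending operations... */
    MPIX_Comm_revoke(*comm);

    /* ...then build a new communicator from the survivors. */
    MPI_Comm survivors;
    MPIX_Comm_shrink(*comm, &survivors);
    MPI_Comm_free(comm);
    *comm = survivors;
    return 1;   /* caller re-partitions data and retries the step */
}
```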
Deploying Jupyter Notebooks at scale on XSEDE resources for Science Gateways and workshops
Jupyter Notebooks have become a mainstream tool for interactive computing in every field of science. They are suitable as companion applications for Science Gateways, providing more flexibility and post-processing capability to users. Moreover, they are often used in training events and workshops to provide immediate access to a pre-configured interactive computing environment. The Jupyter team released the JupyterHub web application to provide a platform where multiple users can log in and access a Jupyter Notebook environment. When the number of users and the memory requirements are low, it is easy to set up JupyterHub on a single server. However, the setup becomes more complicated when we need to serve Jupyter Notebooks at scale to tens or hundreds of users. In this paper we present three strategies for deploying JupyterHub at scale on XSEDE resources. All options share the deployment of JupyterHub on a virtual machine on XSEDE Jetstream. In the first scenario, JupyterHub connects to a supercomputer, launches a single-node job on behalf of each user, and proxies the Notebook from the computing node back to the user's browser. In the second scenario, implemented in the context of an XSEDE consultation for the IRIS consortium for Seismology, we deploy Docker in Swarm mode to coordinate many XSEDE Jetstream virtual machines and provide Notebooks with persistent storage and quotas. In the last scenario, we install the Kubernetes container orchestration framework on Jetstream to provide a fault-tolerant JupyterHub deployment with a distributed filesystem and the capability to scale to thousands of users. In the conclusion we provide a link to step-by-step tutorials complete with all the necessary commands and configuration files to replicate these deployments.
Comment: 7 pages, 3 figures, PEARC '18: Practice and Experience in Advanced Research Computing, July 22--26, 2018, Pittsburgh, PA, USA
Accelerating FIU’s science research and education towards discovery and innovation by leveraging FIU’s Science DMZ
Research faculty and their students are spending too much time on data management issues related to the transfer of data between networks. As the campus cyberinfrastructure increases data production, the transport capacity of the network must increase proportionally to deliver the data to the High-Performance Computing centers for analysis. This work presents the experience of a research project in implementing a Science DMZ at Florida International University, which added six researchers and their laboratories to the Science Network and Science DMZ. The study applied a qualitative approach to assessing the researchers' science workflows in order to create a Science DMZ implementation plan, and followed the Energy Sciences Network implementation guide.
Design and optimization of a portable LQCD Monte Carlo code using OpenACC
The present panorama of HPC architectures is extremely heterogeneous, ranging from traditional multi-core CPU processors, supporting a wide class of applications but delivering moderate computing performance, to many-core GPUs, exploiting aggressive data-parallelism and delivering higher performance for streaming computing applications. In this scenario, code portability (and performance portability) becomes necessary for easy maintainability of applications; this is very relevant in scientific computing, where code changes are very frequent and keeping different code versions aligned is tedious and error-prone. In this work we present the design and optimization of a state-of-the-art production-level LQCD Monte Carlo application using the directive-based OpenACC programming model. OpenACC abstracts parallel programming to a descriptive level, relieving programmers from specifying how codes should be mapped onto the target architecture. We describe the implementation of a code fully written in OpenACC, and show that we are able to target several different architectures, including state-of-the-art traditional CPUs and GPUs, with the same code. We also measure performance, evaluating the computing efficiency of our OpenACC code on several architectures, comparing with GPU-specific implementations and showing that a good level of performance portability can be reached.
Comment: 26 pages, 2 png figures, preprint of an article submitted for consideration in International Journal of Modern Physics
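To make the descriptive programming style concrete, here is a generic sketch of a site-wise lattice update in OpenACC C. The field_axpby kernel is an illustrative assumption, not one of the paper's actual LQCD kernels; with directive-based code of this form, the same source can typically be retargeted at GPUs or multicore CPUs purely through compiler flags (e.g., PGI-style -ta=tesla versus -ta=multicore).

```c
#include <complex.h>

/* Illustrative lattice-wide linear update psi <- a*psi + b*chi,
 * with the four space-time dimensions flattened into one site
 * index of range vol. The directive only states that the loop
 * is parallel; the compiler decides how it maps onto the target
 * (GPU gangs/vectors or CPU cores/SIMD lanes). */
void field_axpby(long vol, double complex a, double complex b,
                 double complex *psi, const double complex *chi)
{
    #pragma acc parallel loop copy(psi[0:vol]) copyin(chi[0:vol])
    for (long i = 0; i < vol; ++i)
        psi[i] = a * psi[i] + b * chi[i];
}
```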