Using Pilot Systems to Execute Many Task Workloads on Supercomputers
High performance computing systems have historically been designed to support
applications comprised of mostly monolithic, single-job workloads. Pilot
systems decouple workload specification, resource selection, and task execution
via job placeholders and late-binding. Pilot systems help to satisfy the
resource requirements of workloads comprised of multiple tasks. RADICAL-Pilot
(RP) is a modular and extensible Python-based pilot system. In this paper we
describe RP's design, architecture and implementation, and characterize its
performance. RP is capable of spawning more than 100 tasks/second and supports
the steady-state execution of up to 16K concurrent tasks. RP can be used
stand-alone, as well as integrated with other application-level tools as a
runtime system.
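The pilot pattern described above can be sketched in a few lines: a placeholder job acquires resources once, then late-binds tasks pulled from a shared queue instead of submitting one batch job per task. This is a minimal illustration of the concept, not RADICAL-Pilot's actual API; all names below are hypothetical.

```python
# Minimal sketch of the pilot pattern: workers stand in for slots inside a
# pilot's resource allocation; tasks are bound to slots only at run time.
import queue
import threading

task_queue = queue.Queue()

def pilot_worker(worker_id, results):
    # Each worker drains the queue until no tasks remain (late binding:
    # which task runs on which slot is decided at execution time).
    while True:
        try:
            task = task_queue.get_nowait()
        except queue.Empty:
            return
        results.append((worker_id, task()))
        task_queue.task_done()

# Workload: many small independent tasks, far more than there are slots.
for i in range(8):
    task_queue.put(lambda i=i: i * i)

results = []
workers = [threading.Thread(target=pilot_worker, args=(w, results)) for w in range(4)]
for t in workers:
    t.start()
for t in workers:
    t.join()
print(sorted(r for _, r in results))  # → [0, 1, 4, 9, 16, 25, 36, 49]
```

The key property is decoupling: the workload (eight tasks) is specified independently of the resources (four worker slots), which is what lets a real pilot system sustain thousands of concurrent tasks inside a single resource allocation.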
Efficient HTTP based I/O on very large datasets for high performance computing with the libdavix library
Remote data access for data analysis in high performance computing is
commonly done with specialized data access protocols and storage systems. These
protocols are highly optimized for high throughput on very large datasets,
multi-streams, high availability, low latency and efficient parallel I/O. The
purpose of this paper is to describe how we have adapted a generic protocol,
the Hypertext Transfer Protocol (HTTP), to make it a competitive alternative
for high performance I/O and data analysis applications in a global computing
grid: the Worldwide LHC Computing Grid. In this work, we first analyze the
design differences between the HTTP protocol and the most common high
performance I/O protocols, pointing out the main performance weaknesses of
HTTP. Then, we describe in detail how we solved these issues. Our solutions
have been implemented in a toolkit called davix, available through several
recent Linux distributions. Finally, we describe the results of our benchmarks
where we compare the performance of davix against a HPC specific protocol for a
data analysis use case.
Comment: Presented at Very Large Data Bases (VLDB) 2014, Hangzhou.
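One mechanism that makes HTTP usable for partial, parallel I/O on large files is the byte-range request: each stream asks the server for one slice of the file. The sketch below stands up a tiny range-aware server and issues one such request; the handler is hand-written for illustration and is not part of davix.

```python
# Demonstrate HTTP byte-range reads: the client asks for bytes 16-31 of a
# 256-byte resource and receives only that slice (status 206 Partial Content).
import http.server
import threading
import urllib.request

class RangeHandler(http.server.BaseHTTPRequestHandler):
    DATA = bytes(range(256))

    def do_GET(self):
        rng = self.headers.get("Range")
        if rng and rng.startswith("bytes="):
            # Simplified parse: handles only a single "bytes=start-end" range.
            start, end = (int(x) for x in rng[6:].split("-"))
            body = self.DATA[start:end + 1]
            self.send_response(206)  # Partial Content
            self.send_header("Content-Range", f"bytes {start}-{end}/{len(self.DATA)}")
        else:
            body = self.DATA
            self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), RangeHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

req = urllib.request.Request(
    f"http://127.0.0.1:{server.server_port}/data",
    headers={"Range": "bytes=16-31"},
)
with urllib.request.urlopen(req) as resp:
    status = resp.status
    chunk = resp.read()
server.shutdown()

print(status, len(chunk))  # → 206 16
```

Because each range request is independent, a client can issue many of them concurrently over separate connections, which is how a generic protocol approaches the multi-stream throughput of specialized HPC I/O protocols.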
PMT: Power Measurement Toolkit
Efficient use of energy is essential for today's supercomputing systems, as
energy cost is generally a major component of their operational cost. Research
into "green computing" is needed to reduce the environmental impact of running
these systems. As such, several scientific communities are evaluating the
trade-off between time-to-solution and energy-to-solution. While the runtime of
an application is typically easy to measure, power consumption is not.
Therefore, we present the Power Measurement Toolkit (PMT), a high-level
software library capable of collecting power consumption measurements on
various hardware. The library provides a standard interface to easily measure
the energy use of devices such as CPUs and GPUs in critical application
sections.
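To make concrete the kind of measurement a library like PMT wraps: on Linux, Intel RAPL exposes a cumulative energy counter in microjoules under sysfs. The sysfs paths below are real, but their availability varies by machine, and the helper is only an illustration of the counter arithmetic, not PMT's actual API.

```python
# Energy measurement from a cumulative hardware counter, as exposed by
# Linux's powercap/RAPL interface (values in microjoules).
RAPL_COUNTER = "/sys/class/powercap/intel-rapl:0/energy_uj"
RAPL_MAX = "/sys/class/powercap/intel-rapl:0/max_energy_range_uj"

def read_energy_uj(path=RAPL_COUNTER):
    # Read the cumulative counter (requires suitable permissions on the host).
    with open(path) as f:
        return int(f.read())

def energy_joules(before_uj, after_uj, max_range_uj):
    """Energy used between two counter samples, correcting for wraparound."""
    delta = after_uj - before_uj
    if delta < 0:  # the cumulative counter wrapped past max_energy_range_uj
        delta += max_range_uj
    return delta / 1e6  # microjoules -> joules

# Example with synthetic counter values, including a wrap near the maximum:
print(energy_joules(999_000_000, 1_000_000, 1_000_000_000))  # → 2.0
```

Sampling this counter before and after a critical application section gives energy-to-solution directly, which is exactly the measurement the abstract describes; a portable toolkit hides the per-device differences (CPU RAPL counters, GPU management APIs) behind one interface.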
Monitoring Cluster on Online Compiler with Ganglia
Ganglia is an open source monitoring system for high performance computing (HPC) that collects the status of both the cluster as a whole and every node, and reports it to the user. We use Ganglia to monitor spasi.informatika.lipi.go.id (SPASI), a customized Fedora 10-based cluster, for our cluster online compiler, CLAW (cluster access through web). Our experience with Ganglia shows that it can display the status of our cluster and allows us to track it.
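A monitoring frontend consumes Ganglia data by reading the XML stream that the gmond daemon publishes (by default on TCP port 8649) and extracting per-host metrics. The snippet below parses a hand-written sample in gmond's XML format rather than output from a live daemon; the host and metric values are illustrative.

```python
# Parse gmond-style XML cluster state and list each host's metrics.
import xml.etree.ElementTree as ET

sample = """<GANGLIA_XML VERSION="3.1.7" SOURCE="gmond">
 <CLUSTER NAME="SPASI" OWNER="unspecified">
  <HOST NAME="node01" REPORTED="1262304000">
   <METRIC NAME="load_one" VAL="0.25" TYPE="float" UNITS=""/>
   <METRIC NAME="mem_free" VAL="512000" TYPE="uint32" UNITS="KB"/>
  </HOST>
 </CLUSTER>
</GANGLIA_XML>"""

root = ET.fromstring(sample)
rows = [
    (host.get("NAME"), metric.get("NAME"), metric.get("VAL"))
    for host in root.iter("HOST")
    for metric in host.iter("METRIC")
]
for hostname, metric_name, value in rows:
    print(hostname, metric_name, value)
# → node01 load_one 0.25
#   node01 mem_free 512000
```

In a deployed setup the same parsing would run against a socket connection to gmond instead of a literal string, which is essentially what web frontends built on Ganglia do.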
Machine Learning with Kay
Computational power is very important when training Deep Learning (DL) models with large amounts of data (Wooldridge, 2021). Hence, High-Performance Computing (HPC) can be leveraged to reduce computational cost, and the Irish Centre for High-End Computing (ICHEC) provides significant infrastructure and services for research and development to both academia and industry. A portion of ICHEC's HPC system has been allocated for institutional access, and this paper presents a case study of how to use Kay (Ireland's national supercomputer) in the remote sensing domain. Specifically, this study uses clusters of Kay Graphics Processing Units (GPUs) for training DL models to extract buildings from satellite imagery using a large number of input data samples.
Swarming the SC’17 Student Cluster Competition
The Student Cluster Competition is a suite of challenges
where teams of undergraduates design a computer cluster and then
compete against each other through various benchmark applications.
The present study will provide a select summary of the experiences of
Team Swarm who represented the Georgia Institute of Technology
at the SC’17 Student Cluster Competition. This report will first
describe the competition and the members of Team Swarm. After this
introduction, it focuses on three major aspects of the experience: the
hardware and software architecture of the team’s computer cluster, the
team’s system administration workflow and the team’s usage of cloud
resources. Additionally, the appendix provides a brief description of
the team members and their method of preparation.
Containers for Portable, Productive, and Performant Scientific Computing
Containers are an emerging technology that holds promise for improving productivity and code portability in scientific computing. The authors examine Linux container technology for the distribution of a nontrivial scientific computing software stack and its execution on a spectrum of platforms, from laptop computers through high-performance computing systems. For Python code run on large parallel computers, the runtime is reduced inside a container due to faster library imports. The software distribution approach and data that the authors present will help developers and users decide whether container technology is appropriate for them. The article also provides guidance for vendors of HPC systems that rely on proprietary libraries for performance, on what they can do to make containers work seamlessly and without performance penalty.