High-Throughput Computing on High-Performance Platforms: A Case Study
The computing systems used by LHC experiments have historically consisted of
the federation of hundreds to thousands of distributed resources, ranging from
small to mid-size resources. In spite of the impressive scale of the existing
distributed computing solutions, the federation of small to mid-size resources
will be insufficient to meet projected future demands. This paper is a case
study of how the ATLAS experiment has embraced Titan, a DOE leadership
facility, in conjunction with traditional distributed high-throughput computing
to reach sustained production scales of approximately 52M core-hours a year.
The three main contributions of this paper are: (i) a critical evaluation of
design and operational considerations to support the sustained, scalable and
production usage of Titan; (ii) a preliminary characterization of a next
generation executor for PanDA to support new workloads and advanced execution
modes; and (iii) early lessons for how current and future experimental and
observational systems can be integrated with production supercomputers and
other platforms in a general and extensible manner.
Using Pilot Systems to Execute Many Task Workloads on Supercomputers
High performance computing systems have historically been designed to support
applications comprised of mostly monolithic, single-job workloads. Pilot
systems decouple workload specification, resource selection, and task execution
via job placeholders and late-binding. Pilot systems help to satisfy the
resource requirements of workloads comprised of multiple tasks. RADICAL-Pilot
(RP) is a modular and extensible Python-based pilot system. In this paper we
describe RP's design, architecture and implementation, and characterize its
performance. RP is capable of spawning more than 100 tasks/second and supports
the steady-state execution of up to 16K concurrent tasks. RP can be used
stand-alone, as well as integrated with other application-level tools as a
runtime system.
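The pilot pattern summarized in this abstract can be illustrated with a minimal sketch. It is not RADICAL-Pilot's API: the `Pilot` class, its methods, and `run_workload` are hypothetical names invented for this example, and threads stand in for the cores a real placeholder job would hold on a supercomputer. The point it shows is late binding: tasks are queued without being assigned to a resource, and each one is bound to a slot only when that slot becomes free.

```python
import queue
import threading


class Pilot:
    """Hypothetical placeholder job that holds `cores` worker slots."""

    def __init__(self, cores):
        self.cores = cores
        self.tasks = queue.Queue()    # tasks wait here, not yet bound to a slot
        self.results = {}
        self._lock = threading.Lock()
        self._workers = []

    def start(self):
        # Stands in for the batch system starting the placeholder job.
        for slot in range(self.cores):
            worker = threading.Thread(target=self._run_slot, args=(slot,))
            worker.start()
            self._workers.append(worker)

    def _run_slot(self, slot):
        while True:
            item = self.tasks.get()
            if item is None:          # sentinel: the pilot is shutting down
                break
            name, func, args = item
            value = func(*args)       # late binding: task meets its slot only now
            with self._lock:
                self.results[name] = (slot, value)

    def submit(self, name, func, *args):
        self.tasks.put((name, func, args))

    def drain(self):
        # One sentinel per slot, queued behind all submitted tasks (FIFO).
        for _ in self._workers:
            self.tasks.put(None)
        for worker in self._workers:
            worker.join()
        return self.results


def run_workload(cores, n_tasks):
    """Run n_tasks small tasks (squaring an integer) on one pilot."""
    pilot = Pilot(cores)
    pilot.start()
    for i in range(n_tasks):
        pilot.submit(f"task-{i}", pow, i, 2)
    return pilot.drain()


if __name__ == "__main__":
    for name, (slot, value) in sorted(run_workload(4, 8).items()):
        print(name, "ran on slot", slot, "and returned", value)
```

Which slot runs which task varies from run to run; that nondeterminism is exactly what late binding allows, since the workload description never names a slot.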
A Generic Development and Deployment Framework for Cloud Computing and Distributed Applications
Cloud computing has paved the way for the advance of IT-based demand services. This technology helps decrease operating costs, solve scalability issues, and address many other user and provider constraints. However, developing and deploying distributed applications in cloud environments is becoming an increasingly complex task. Cloud users must spend a lot of time preparing, installing and configuring their applications on clouds. In addition, after development and deployment, applications can hardly be moved from one cloud to another due to the lack of interoperability between clouds. To address these problems, we present in this paper a novel development and deployment framework for cloud distributed applications/services. Our approach is based on abstraction and object-oriented programming techniques, allowing users to easily and rapidly develop and deploy their services into cloud environments. The approach also enables service migration and interoperability among clouds.
ArrayBridge: Interweaving declarative array processing with high-performance computing
Scientists are increasingly turning to datacenter-scale computers to produce
and analyze massive arrays. Despite decades of database research that extols
the virtues of declarative query processing, scientists still write, debug and
parallelize imperative HPC kernels even for the most mundane queries. This
impedance mismatch has been partly attributed to the cumbersome data loading
process; in response, the database community has proposed in situ mechanisms to
access data in scientific file formats. Scientists, however, desire more than a
passive access method that reads arrays from files.
This paper describes ArrayBridge, a bi-directional array view mechanism for
scientific file formats, that aims to make declarative array manipulations
interoperable with imperative file-centric analyses. Our prototype
implementation of ArrayBridge uses HDF5 as the underlying array storage library
and seamlessly integrates into the SciDB open-source array database system. In
addition to fast querying over external array objects, ArrayBridge produces
arrays in the HDF5 file format just as easily as it can read from it.
ArrayBridge also supports time travel queries from imperative kernels through
the unmodified HDF5 API, and automatically deduplicates between array versions
for space efficiency. Our extensive performance evaluation at NERSC, a
large-scale scientific computing facility, shows that ArrayBridge exhibits
performance and I/O scalability that are statistically indistinguishable from
those of the native SciDB storage engine.
Comment: 12 pages, 13 figures
The Technologies Required for Fusing HPC and Real-Time Data to Support Urgent Computing
The use of High Performance Computing (HPC) to complement urgent decision
making in the event of disasters is an important future potential use of
supercomputers. However, the usage modes involved are rather different from how
HPC has been used traditionally. As such, there are many obstacles that need to
be overcome, not least the unbounded wait times in the batch system queues, to
make the use of HPC in disaster response practical. In this paper, we present
how the VESTEC project plans to overcome these issues and develop a working
prototype of an urgent computing control system. We describe the requirements
for such a system and analyse the different technologies available that can be
leveraged to successfully build such a system. We finally explore the design of
the VESTEC system and discuss ongoing challenges that need to be addressed to
realise a production-level system.
Comment: Preprint of paper in 2019 IEEE/ACM HPC for Urgent Decision Making (UrgentHPC)