Using Pilot Systems to Execute Many Task Workloads on Supercomputers
High-performance computing systems have historically been designed to support
applications composed mostly of monolithic, single-job workloads. Pilot
systems decouple workload specification, resource selection, and task execution
via job placeholders and late-binding. Pilot systems help to satisfy the
resource requirements of workloads composed of multiple tasks. RADICAL-Pilot
(RP) is a modular and extensible Python-based pilot system. In this paper we
describe RP's design, architecture and implementation, and characterize its
performance. RP is capable of spawning more than 100 tasks/second and supports
the steady-state execution of up to 16K concurrent tasks. RP can be used
stand-alone, as well as integrated with other application-level tools as a
runtime system.
SRL: Scaling Distributed Reinforcement Learning to Over Ten Thousand Cores
The ever-growing complexity of reinforcement learning (RL) tasks demands a
distributed RL system to efficiently generate and process a massive amount of
data to train intelligent agents. However, existing open-source libraries
suffer from various limitations, which impede their practical use in
challenging scenarios where large-scale training is necessary. While industrial
systems from OpenAI and DeepMind have achieved successful large-scale RL
training, their system architecture and implementation details remain
undisclosed to the community. In this paper, we present a novel abstraction on
the dataflows of RL training, which unifies practical RL training across
diverse applications into a general framework and enables fine-grained
optimizations. Following this abstraction, we develop a scalable, efficient,
and extensible distributed RL system called ReaLly Scalable RL (SRL). The
system architecture of SRL separates major RL computation components and allows
massively parallelized training. Moreover, SRL offers user-friendly and
extensible interfaces for customized algorithms. Our evaluation shows that SRL
outperforms existing academic libraries on both a single machine and a
medium-sized cluster. In a large-scale cluster, the novel architecture of SRL
leads to up to 3.7x speedup compared to the design choices adopted by the
existing libraries. We also conduct a direct benchmark comparison to OpenAI's
industrial system, Rapid, in the challenging hide-and-seek environment. SRL
reproduces the same solution as reported by OpenAI with up to 5x speedup in
wall-clock time. Furthermore, we also examine the performance of SRL in a much
harder variant of the hide-and-seek environment and achieve substantial
learning speedup by scaling SRL to over 15k CPU cores and 32 A100 GPUs.
Notably, SRL is the first in the academic community to perform RL experiments
at such a large scale.
Comment: 15 pages, 12 figures, 6 tables
Big Data and Large-scale Data Analytics: Efficiency of Sustainable Scalability and Security of Centralized Clouds and Edge Deployment Architectures
One of the significant shifts of the next-generation computing technologies will certainly be in
the development of Big Data (BD) deployment architectures. Apache Hadoop, the BD
landmark, evolved as a widely deployed BD operating system. Its new features include
federation structure and many associated frameworks, which provide Hadoop 3.x with the
maturity to serve different markets. This dissertation addresses two leading issues involved in
exploiting the BD and large-scale data analytics realm using the Hadoop platform, namely:
(i) scalability, which directly affects system performance and overall throughput, using
portable Docker containers; and (ii) security, which spreads the adoption of data protection
practices among practitioners, using access controls. The main contributions of this thesis are
an Enhanced MapReduce Environment (EME), the OPportunistic and Elastic Resource Allocation
(OPERA) scheduler, the BD Federation Access Broker (BDFAB), and a Secure Intelligent
Transportation System (SITS) with a multi-tier architecture for streaming data to the cloud.