Quark: A High-Performance Secure Container Runtime for Serverless Computing
Secure container runtimes serve as the foundational layer for creating and
running containers, which are the bedrock of emerging computing paradigms like
microservices and serverless computing. Although existing secure container
runtimes do enhance security by running containers over a guest kernel and
a Virtual Machine Monitor (VMM, or hypervisor), they incur performance penalties
in critical areas such as networking, container startup, and I/O system calls.
In our practice of operating microservices and serverless computing, we built
a high-performance secure container runtime named Quark. Unlike existing
solutions that rely on traditional VM technologies by importing Linux as the
guest kernel and QEMU as the VMM, we take a different approach and build
Quark from the ground up, paving the way for extreme customization to unlock
high performance. Our development centers on co-designing a custom guest kernel
and a VMM for secure containers. To this end, we build a lightweight guest OS
kernel named QKernel and a specialized VMM named QVisor. The QKernel-QVisor
co-design allows us to deliver three key advancements: high-performance
RDMA-based container networking, a fast container startup mode, and efficient
mechanisms for executing I/O syscalls. In our practice with real-world apps
like Redis, Quark cuts P95 latency by 79.3% and increases throughput by
2.43x compared to Kata. Moreover, Quark's container startup achieves 96.5% lower
latency than the cold-start mode while saving 81.3% memory cost relative to the
keep-warm mode. Quark is open-source with an industry-standard codebase in Rust.
Comment: arXiv admin note: text overlap with arXiv:2305.10621. The paper on
arXiv:2305.10621 presents a detailed version of the TSoR module in Quark.
TRANSOM: An Efficient Fault-Tolerant System for Training LLMs
Large language models (LLMs) with hundreds of billions or trillions of
parameters, represented by ChatGPT, have had a profound impact on various
fields. However, training LLMs with super-large-scale parameters requires large
high-performance GPU clusters and long training periods lasting for months. Due
to the inevitable hardware and software failures in large-scale clusters,
maintaining uninterrupted, long-duration training is extremely challenging.
As a result, a substantial amount of training time is devoted to checkpoint
saving and loading, task rescheduling and restarts, and manual anomaly checks,
which greatly harms overall training efficiency. To address these issues, we
propose TRANSOM, a novel fault-tolerant LLM training system. In this work, we
design three key subsystems: the training-pipeline automatic fault-tolerance
and recovery mechanism named Transom Operator and Launcher (TOL), the
multi-dimensional-metric automatic anomaly detection system for training tasks
named Transom Eagle Eye (TEE), and the asynchronous checkpoint-access automatic
fault-tolerance and recovery technology named Transom Checkpoint Engine (TCE).
Here, TOL manages the lifecycle of training tasks, while TEE is responsible for
task monitoring and anomaly reporting: TEE detects training anomalies and
reports them to TOL, which automatically invokes the fault-tolerance strategy
to eliminate abnormal nodes and restart the training task. The asynchronous
checkpoint saving and loading provided by TCE greatly shortens the
fault-tolerance overhead. The experimental results indicate that TRANSOM
significantly enhances the efficiency of large-scale LLM training on clusters.
Specifically, pre-training time for GPT3-175B has been reduced by 28%, while
checkpoint saving and loading performance have improved by a factor of 20.
Comment: 14 pages, 9 figures
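The abstract credits TCE's asynchronous checkpoint saving with shrinking the fault-tolerance overhead; the underlying idea is that the training loop pays only for a fast in-memory snapshot, while persisting the snapshot to slow storage happens in a background thread. A minimal sketch in Python (the `AsyncCheckpointer` class and the simulated slow write are illustrative assumptions, not TCE's actual implementation):

```python
import copy
import threading
import time

class AsyncCheckpointer:
    """Sketch of asynchronous checkpointing: training stalls only for an
    in-memory snapshot; persistence runs in a background thread."""

    def __init__(self):
        self._thread = None
        self.saved = []  # stand-in for checkpoint files on disk

    def save(self, step, state):
        snapshot = copy.deepcopy(state)    # brief pause: in-memory copy only
        self._wait()                       # keep at most one write in flight
        self._thread = threading.Thread(
            target=self._persist, args=(step, snapshot))
        self._thread.start()               # training resumes immediately

    def _persist(self, step, snapshot):
        time.sleep(0.05)                   # simulate slow storage I/O
        self.saved.append((step, snapshot))

    def _wait(self):
        if self._thread is not None:
            self._thread.join()

ckpt = AsyncCheckpointer()
state = {"weights": [0.0]}
for step in range(3):
    state["weights"][0] += 1.0             # "training" update
    ckpt.save(step, state)                 # does not block on the slow write
ckpt._wait()                               # flush before exiting
print(ckpt.saved[-1])                      # prints (2, {'weights': [3.0]})
```

The `deepcopy` is what makes the overlap safe: the background writer works on a frozen snapshot while the training loop keeps mutating the live state.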
High Performance Computing using Infiniband-based clusters
The abstract is in the attachment.
Implications and Limitations of Securing an InfiniBand Network
The InfiniBand Architecture is one of the leading network interconnects used in high performance computing, delivering very high bandwidth and low latency. As the popularity of InfiniBand increases, the possibility of new InfiniBand applications arises outside the domain of high performance computing, thereby creating the opportunity for new security risks. In this work, new security questions are considered and addressed. The study demonstrates that many common traffic-analyzing tools cannot monitor or capture InfiniBand traffic transmitted between two hosts. Due to the kernel-bypass nature of InfiniBand, many host-based network security systems cannot be executed on InfiniBand applications, and those that can impose a significant performance loss on the network. The research concludes that not all network security practices used for Ethernet translate to InfiniBand as previously suggested, and that an answer to meeting specific security requirements for an InfiniBand network might reside in hardware offloads.
Autonomous Database Management at Scale: Automated Tuning, Performance Diagnosis, and Resource Decentralization
Database administration has always been a challenging task, and is becoming even more difficult with the rise of public and private clouds. Today, many enterprises outsource their database operation to cloud service providers (CSPs) in order to reduce operating costs. CSPs, now tasked with managing an extremely large number of database instances, cannot simply rely on database administrators. In fact, humans have become a bottleneck in the scalability and profitability of cloud offerings. This has created a massive demand for building autonomous databases—systems that operate with little or zero human supervision.
While autonomous databases have gained much attention in recent years in both academia and industry, many of the existing techniques remain limited to automating parameter tuning, backup/recovery, and monitoring. Consequently, there is much to be done before realizing a fully autonomous database. This dissertation examines and offers new automation techniques for three specific areas of modern database management.
1. Automated Tuning – We propose a new generation of physical database designers that are robust against uncertainty in future workloads. Given the rising popularity of approximate databases, we also develop an optimal, hybrid sampling strategy that enables efficient join processing on offline samples, a long-standing open problem in approximate query processing.
2. Performance Diagnosis – We design practical tools and algorithms for assisting database administrators in quickly and reliably diagnosing performance problems in their transactional databases.
3. Resource Decentralization – To achieve autonomy among database components in a shared environment, we propose a highly efficient, starvation-free, and fully decentralized distributed lock manager for distributed database clusters.
PhD dissertation, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
https://deepblue.lib.umich.edu/bitstream/2027.42/153349/1/dyoon_1.pd
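The abstract does not spell out the hybrid sampling strategy for join processing on offline samples, but a standard building block for this problem is universe (correlated) sampling: both tables decide whether to keep a tuple by hashing its join key with a shared function, so matching keys survive or drop together and the samples can still be joined. A minimal sketch in Python (the `orders`/`users` tables, the `in_sample` helper, and the 50% rate are illustrative assumptions, not the dissertation's method):

```python
import hashlib

def in_sample(key, rate):
    """Keep a tuple iff a shared hash of its join key falls below `rate`.
    Using the same hash on both tables correlates the samples."""
    h = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
    return (h % 10**6) / 10**6 < rate

# toy tables joined on the first field (a user id from 1 to 6)
orders = [(k, f"order{i}") for i, k in enumerate([1, 2, 3, 4, 5, 6] * 3)]
users = [(k, f"user{k}") for k in range(1, 7)]

rate = 0.5
s_orders = [t for t in orders if in_sample(t[0], rate)]
s_users = [t for t in users if in_sample(t[0], rate)]

# join the samples; each surviving join key keeps ALL of its matches
joined = [(o, u) for o in s_orders for u in s_users if o[0] == u[0]]

# estimate the full join size: keys were sampled once, so scale by 1/rate
est = len(joined) / rate
```

Note the contrast with independent uniform sampling, where a key kept in one table is usually dropped in the other, so the sample join is nearly empty and the estimate must scale by 1/rate squared with much higher variance.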