SpotServe: Serving Generative Large Language Models on Preemptible Instances
The high computational and memory requirements of generative large language models (LLMs) make it challenging to serve them cheaply. This paper aims to reduce the monetary cost of serving LLMs by leveraging preemptible GPU instances on modern clouds, which offer access to spare GPUs at a much lower price than regular instances but may be preempted by the cloud at any time. Serving LLMs on preemptible instances requires addressing the challenges induced by frequent instance preemptions and the need to migrate instances to handle these preemptions.
This paper presents SpotServe, the first distributed LLM serving system on preemptible instances. Several key techniques in SpotServe realize fast and reliable serving of generative LLMs on cheap preemptible instances. First, SpotServe dynamically adapts the LLM parallelization configuration to changing instance availability and fluctuating workloads, balancing the trade-off among overall throughput, inference latency, and monetary cost. Second, to minimize the cost of migrating instances for dynamic reparallelization, SpotServe formulates instance migration as a bipartite graph matching problem and uses the Kuhn-Munkres algorithm to identify an optimal migration plan that minimizes communication. Finally, to take advantage of the grace period offered by modern clouds, we introduce stateful inference recovery, a new inference mechanism that commits inference progress at a much finer granularity and allows SpotServe to cheaply resume inference upon preemption.
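To make the matching formulation concrete, here is a minimal sketch using `scipy.optimize.linear_sum_assignment` (an implementation of the Kuhn-Munkres/Hungarian algorithm) under assumed inputs; it illustrates the idea only and is not SpotServe's actual code. Surviving instances form one side of the bipartite graph, slots in the new parallel configuration the other, and an edge's weight is the bytes of model state an instance would have to fetch to fill that slot:

```python
# Hypothetical sketch of the migration-plan formulation described above,
# not SpotServe's actual implementation. We assume each surviving instance
# already holds some model shards and each slot in the new parallel
# configuration needs a known set of shards; assigning an instance to a
# slot costs the bytes of the shards it is missing for that slot.
import numpy as np
from scipy.optimize import linear_sum_assignment  # Kuhn-Munkres / Hungarian

def migration_plan(held, needed, shard_bytes):
    """held[i]: set of shard ids on surviving instance i.
    needed[j]: set of shard ids required by slot j of the new config.
    shard_bytes[s]: size of shard s in bytes."""
    cost = np.zeros((len(held), len(needed)))
    for i, have in enumerate(held):
        for j, want in enumerate(needed):
            # Communication cost = bytes that must be fetched from peers.
            cost[i, j] = sum(shard_bytes[s] for s in want - have)
    rows, cols = linear_sum_assignment(cost)  # optimal matching
    return list(zip(rows, cols)), cost[rows, cols].sum()

# Toy example: 3 instances, 3 slots, two 1 GB shards.
held = [{0}, {1}, {0, 1}]
needed = [{0, 1}, {0}, {1}]
plan, moved = migration_plan(held, needed, {0: 1 << 30, 1: 1 << 30})
print(plan, moved)  # [(0, 1), (1, 2), (2, 0)] with 0 bytes moved
```

Minimizing total edge weight over such an assignment is exactly the problem the Kuhn-Munkres algorithm solves in polynomial time, which is why the migration plan can be recomputed quickly after each preemption.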
We evaluate SpotServe on real spot-instance preemption traces and a variety of popular LLMs, and show that it can reduce P99 tail latency by 2.4-9.1× compared with the best existing LLM serving systems. We also show that SpotServe can leverage the price advantage of preemptible instances, saving 54% of the monetary cost compared with using only on-demand instances.
Comment: ASPLOS 2024
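The stateful inference recovery mechanism described above can be pictured with a toy decoding loop (all names here are illustrative assumptions, not SpotServe's API): because autoregressive decoding emits one token per step, committing the generated prefix after each step lets a preempted request resume from its last committed token instead of restarting from scratch.

```python
# Illustrative sketch of per-token progress commits (an assumption about
# the mechanism, not SpotServe's actual code). `decode_step` stands in
# for one forward pass of a real LLM; the committed prefix (and, in a
# real system, the KV cache) is what survives a preemption.
committed = {}  # request_id -> list of generated token ids

def decode_step(prompt, prefix):
    # Placeholder for a real model call; returns the next token id.
    return (len(prompt) + len(prefix)) % 50257

def generate(request_id, prompt, max_tokens, preempted):
    prefix = committed.get(request_id, [])  # resume from last commit
    while len(prefix) < max_tokens:
        if preempted():
            # Grace period: progress is already committed, return cleanly.
            return prefix, False
        prefix = prefix + [decode_step(prompt, prefix)]
        committed[request_id] = prefix  # commit after every token
    return prefix, True

# First attempt is preempted after 3 steps; the retry resumes at token 3.
calls = iter([False, False, False, True])
out, done = generate("req-1", [1, 2, 3], 8, lambda: next(calls))
assert not done and len(out) == 3
out, done = generate("req-1", [1, 2, 3], 8, lambda: False)
assert done and len(out) == 8
```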
Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism
Transformer models have achieved state-of-the-art performance across various application domains and are gradually becoming the foundation of advanced large deep learning (DL) models. However, training these models efficiently over multiple GPUs remains challenging due to the large number of parallelism choices. Existing DL systems either rely on manual effort to craft distributed training plans or consider parallelism combinations only within a very limited search space. In this work, we propose Galvatron, a new system framework that incorporates multiple popular parallelism dimensions and automatically finds the most efficient hybrid parallelism strategy. To explore such an unusually large search space, we 1) use a decision tree to decompose and prune the space based on reasonable intuitions, and then 2) design a dynamic programming search algorithm to generate the optimal plan.
Evaluations on four representative Transformer workloads show that Galvatron can automatically perform distributed training under different GPU memory budgets. Across all evaluated scenarios, Galvatron consistently achieves higher system throughput than prior work with limited parallelism dimensions.
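As a rough illustration of the dynamic programming search the abstract alludes to (with an invented per-layer cost model; Galvatron's real search space, cost model, and decision-tree pruning are far richer), the sketch below picks one parallelism strategy per layer to minimize total time under a GPU memory budget:

```python
# Toy dynamic-programming search over per-layer parallelism strategies,
# illustrating the idea only; the strategies, costs, and memory figures
# below are invented, not Galvatron's measured values.
import math

# (name, time_per_layer_ms, memory_per_layer_GB) -- assumed cost model.
STRATEGIES = [("data", 1.0, 4.0), ("tensor", 1.4, 2.0), ("pipeline", 1.8, 1.5)]

def search(num_layers, mem_budget_gb, granularity=0.5):
    buckets = int(mem_budget_gb / granularity) + 1
    # best[m] = (total_time, plan) when exactly m*granularity GB is used.
    best = [(math.inf, [])] * buckets
    best[0] = (0.0, [])
    for _ in range(num_layers):
        nxt = [(math.inf, [])] * buckets
        for m, (t, plan) in enumerate(best):
            if math.isinf(t):
                continue
            for name, lt, lm in STRATEGIES:
                m2 = m + int(math.ceil(lm / granularity))
                if m2 < buckets and t + lt < nxt[m2][0]:
                    nxt[m2] = (t + lt, plan + [name])
        best = nxt
    return min(best)  # fastest plan that fits within the memory budget

time_ms, plan = search(num_layers=4, mem_budget_gb=10)
print(time_ms, plan)  # mixes strategies: "data" where memory allows, else
                      # the lower-memory (but slower) alternatives
```

The real system additionally prunes the per-layer strategy set with a decision tree before running the search, which keeps the dynamic program tractable as the number of parallelism dimensions grows.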
A Mini Review of S-Nitrosoglutathione Loaded Nano/Micro-Formulation Strategies
As a potential therapeutic agent, the clinical application of S-nitrosoglutathione (GSNO) is limited by its instability. Different formulations have therefore been developed to protect GSNO from degradation and to deliver and release it at physiological concentrations at the site of action. Owing to GSNO's high water solubility and small molecular size, the biggest challenges in the encapsulation step are low encapsulation efficiency and burst release. This review summarizes the different nano/micro-formulation strategies for GSNO delivery systems to provide a reference for subsequent researchers interested in GSNO encapsulation.