Training and Serving System of Foundation Models: A Comprehensive Survey
Foundation models (e.g., ChatGPT, DALL-E, PengCheng Mind, PanGu-Σ)
have demonstrated extraordinary performance in key technological areas such as
natural language processing and visual recognition, and have become a
mainstream path toward artificial general intelligence. This has led major
technology companies to dedicate significant human and financial resources to
developing their own foundation model systems, driving continuous growth in
these models' parameter counts. As a result, training and serving these models
poses significant challenges, including substantial demands on computing
power, memory, and bandwidth. Efficient training and serving strategies are
therefore crucial, and many researchers have actively explored and proposed
effective methods, so a comprehensive survey of them is essential for system
developers and researchers. This paper explores the methods employed in
training and serving foundation models from various perspectives, providing a
detailed categorization of these state-of-the-art methods along finer
dimensions such as networking, computing, and storage. It also summarizes the
open challenges and offers a perspective on the future development of
foundation model systems. Through comprehensive discussion and analysis, we
hope to provide a solid theoretical basis and practical guidance for future
research and applications, promoting continuous innovation and development in
foundation model systems.
Tackling Exascale Software Challenges in Molecular Dynamics Simulations with GROMACS
GROMACS is a widely used package for biomolecular simulation, and over the
last two decades it has evolved from small-scale efficiency to advanced
heterogeneous acceleration and multi-level parallelism targeting some of the
largest supercomputers in the world. Here, we describe some of the ways we have
been able to realize this through the use of parallelization on all levels,
combined with a constant focus on absolute performance. Release 4.6 of GROMACS
uses SIMD acceleration on a wide range of architectures, GPU offloading
acceleration, and both OpenMP and MPI parallelism within and between nodes,
respectively. The recent work on acceleration made it necessary to revisit the
fundamental algorithms of molecular simulation, including the concept of
neighbor searching, and we discuss the present and future challenges we see
for exascale simulation, in particular the need for very fine-grained task
parallelism. We also discuss the software management, code peer review, and
continuous integration testing required for a project of this complexity.
Comment: EASC 2014 conference proceedings
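The multi-level parallelism described above pairs MPI between nodes with OpenMP within them. As a rough illustration of that pattern (a generic sketch, not GROMACS source code; the per-particle "work" is a placeholder), each MPI rank runs OpenMP threads over its own slice and the ranks then combine partial results:

    // Generic sketch of MPI-between-nodes / OpenMP-within-node parallelism;
    // illustrative only, not GROMACS code.
    #include <mpi.h>
    #include <omp.h>
    #include <cstdio>

    int main(int argc, char **argv) {
        int provided;
        // Request thread support so OpenMP regions can coexist with MPI calls.
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        int rank, nranks;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        double local = 0.0;
        // Intra-node level: OpenMP threads share this rank's slice of work.
        #pragma omp parallel for reduction(+:local)
        for (int i = 0; i < 1000000; ++i)
            local += 1e-6;              // stand-in for per-particle force work

        // Inter-node level: ranks combine partial results over the network.
        double global = 0.0;
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("combined %f across %d ranks\n", global, nranks);
        MPI_Finalize();
        return 0;
    }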
ACCL+: an FPGA-Based Collective Engine for Distributed Applications
FPGAs are increasingly prevalent in cloud deployments, serving as Smart NICs
or network-attached accelerators. Despite their potential, developing
distributed FPGA-accelerated applications remains cumbersome due to the lack of
appropriate infrastructure and communication abstractions. To facilitate the
development of distributed applications with FPGAs, in this paper we propose
ACCL+, an open-source versatile FPGA-based collective communication library.
Portable across different platforms and supporting UDP, TCP, as well as RDMA,
ACCL+ empowers FPGA applications to initiate direct FPGA-to-FPGA collective
communication. Additionally, it can serve as a collective offload engine for
CPU applications, freeing the CPU from networking tasks. It is user-extensible,
allowing new collectives to be implemented and deployed without having to
re-synthesize the FPGA circuit. We evaluated ACCL+ on an FPGA cluster with 100
Gb/s networking, comparing its performance against software MPI over RDMA. The
results demonstrate ACCL+'s significant advantages for FPGA-based distributed
applications and highly competitive performance for CPU applications. We
showcase ACCL+'s dual role with two use cases: seamlessly integrating as a
collective offload engine to distribute CPU-based vector-matrix multiplication,
and serving as a crucial and efficient component in designing fully FPGA-based
distributed deep-learning recommendation inference.
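For context, the software baseline that ACCL+ is evaluated against is an MPI collective executed by the host CPU, as in the sketch below. This shows only the generic MPI pattern that a collective offload engine takes off the CPU; the ACCL+ host API itself is not reproduced here.

    // The software-MPI style of collective that ACCL+ is benchmarked against:
    // the CPU itself reduces a buffer across all ranks. Offloading moves this
    // work to the FPGA. Generic MPI only; not the ACCL+ API.
    #include <mpi.h>
    #include <vector>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        std::vector<float> shard(1 << 20, 1.0f);   // e.g., a gradient shard
        // The host spends cycles on the collective itself; this is the
        // networking work a collective offload engine frees it from.
        MPI_Allreduce(MPI_IN_PLACE, shard.data(), (int)shard.size(),
                      MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
        MPI_Finalize();
        return 0;
    }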
vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design
The most widely used machine learning frameworks require users to carefully
tune their memory usage so that the deep neural network (DNN) fits into the
DRAM capacity of a GPU. This restriction hampers a researcher's flexibility to
study different machine learning algorithms, forcing them to either use a less
desirable network architecture or parallelize the processing across multiple
GPUs. We propose a runtime memory manager that virtualizes the memory usage of
DNNs such that both GPU and CPU memory can simultaneously be utilized for
training larger DNNs. Our virtualized DNN (vDNN) reduces the average GPU memory
usage of AlexNet by up to 89%, OverFeat by 91%, and GoogLeNet by 95%, a
significant reduction in memory requirements of DNNs. Similar experiments on
VGG-16, one of the deepest and most memory-hungry DNNs to date, demonstrate the
memory-efficiency of our proposal. vDNN enables VGG-16 with batch size 256
(requiring 28 GB of memory) to be trained on a single NVIDIA Titan X GPU card
containing 12 GB of memory, with 18% performance loss compared to a
hypothetical, oracular GPU with enough memory to hold the entire DNN.
Comment: Published as a conference paper at the 49th IEEE/ACM International
Symposium on Microarchitecture (MICRO-49), 2016
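The core mechanism, offloading a layer's activations to host memory during the forward pass and prefetching them back before that layer's backward pass, can be sketched with the CUDA runtime as below. This is a minimal single-tensor illustration of the general technique, not vDNN's actual memory manager; the buffer names and sizes are made up.

    // Minimal sketch of forward-pass offload and backward-pass prefetch using
    // pinned host memory and a dedicated copy stream. Illustrative only.
    #include <cuda_runtime.h>

    int main() {
        const size_t n = 1 << 24;                // one layer's activation count
        float *d_act, *h_act;
        cudaMalloc((void**)&d_act, n * sizeof(float));
        cudaMallocHost((void**)&h_act, n * sizeof(float)); // pinned: async copy

        cudaStream_t copy_stream;
        cudaStreamCreate(&copy_stream);

        // ... forward pass writes d_act on the compute stream ...

        // Offload: stash the activations in host memory so the GPU copy can
        // be released while deeper layers run.
        cudaMemcpyAsync(h_act, d_act, n * sizeof(float),
                        cudaMemcpyDeviceToHost, copy_stream);
        cudaStreamSynchronize(copy_stream);
        cudaFree(d_act);

        // ... later, just before this layer's backward pass ...
        cudaMalloc((void**)&d_act, n * sizeof(float));
        cudaMemcpyAsync(d_act, h_act, n * sizeof(float),
                        cudaMemcpyHostToDevice, copy_stream);
        cudaStreamSynchronize(copy_stream);

        // ... backward pass reads d_act ...
        cudaFree(d_act);
        cudaFreeHost(h_act);
        cudaStreamDestroy(copy_stream);
        return 0;
    }

In the real setting the copies overlap with compute on other layers rather than being synchronized immediately, which is what hides most of the offload cost.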
Lightweight asynchronous scheduling in heterogeneous reconfigurable systems
The trend for heterogeneous embedded systems is the integration of accelerators and general-purpose CPU cores on the same die. In these integrated architectures, like the Zynq UltraScale+ board (CPU+FPGA) that we target in this work, hardware support for shared memory and low-overhead synchronization between the accelerator and the CPU cores makes the case for exploring strategies that exploit a tight collaboration between the CPUs and the accelerator. In this paper we propose a novel lightweight scheduling strategy, FastFit, targeted at FPGA accelerators, and a new scheduler based on it, named MultiFastFit, which asynchronously tackles heterogeneous systems composed of a variety of CPU cores and FPGA IPs. Our strategy significantly reduces the overhead of automatically computing near-optimal chunk sizes compared to a previous state-of-the-art auto-tuned approach, which makes it more suitable for fine-grained applications. Additionally, MultiFastFit has been designed to enable efficient co-execution of work among compute devices, keeping all devices busy while minimizing load imbalance. Our approaches have been evaluated using four benchmarks carefully tuned for the low-power UltraScale+ platform. Our experiments demonstrate that the FastFit strategy always finds a near-optimal FPGA chunk size for any device configuration at reasonable cost, even for fine-grained and irregular applications, and that heterogeneous CPU+FPGA co-executions that exploit all the compute devices are usually faster and more energy-efficient than CPU-only and FPGA-only executions. We have also compared MultiFastFit with other state-of-the-art scheduling strategies, finding that it outperforms another auto-tuned approach by up to 2x and achieves results similar to manually tuned schedulers without requiring an offline search for the ideal CPU-FPGA partition or FPGA chunk granularity.
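The co-execution idea, with every device repeatedly claiming chunks of a shared iteration space sized to its own characteristics, can be sketched as below. This is a generic dynamic-scheduling skeleton with assumed chunk sizes, not FastFit's actual chunk-size model or the UltraScale+ runtime; the "FPGA" worker here is just another host thread standing in for the accelerator.

    // Generic sketch of heterogeneous co-execution: workers pull chunks from
    // a shared atomic counter until the range is exhausted, so all devices
    // stay busy. Chunk sizes are assumptions, not FastFit's model.
    #include <algorithm>
    #include <atomic>
    #include <cstdio>
    #include <thread>
    #include <vector>

    std::atomic<long> next_iter{0};
    const long total = 1000000;

    void worker(long chunk) {
        long begin;
        // Claim a chunk, process it, repeat until no work remains.
        while ((begin = next_iter.fetch_add(chunk)) < total) {
            long end = std::min(begin + chunk, total);
            (void)end;   // placeholder: process iterations [begin, end) here
        }
    }

    int main() {
        std::vector<std::thread> pool;
        // Accelerator: large chunks amortize per-offload overhead.
        pool.emplace_back(worker, 8192L);
        // CPU cores: small chunks keep the tail of the range balanced.
        for (int c = 0; c < 4; ++c)
            pool.emplace_back(worker, 512L);
        for (auto &t : pool) t.join();
        printf("processed %ld iterations\n", total);
        return 0;
    }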
Improving efficiency for GPUs with decoupled delegate
GPUs are increasingly used for general-purpose computation. For many applications, GPUs achieve significant performance advantages over CPUs, largely due to their ability to exploit massive parallelism. However, massive parallelism can make many operations inefficient, such as those on concurrent data structures; symptoms include synchronization bottlenecks and redundant overhead. To make such operations efficient, our key strategy is to decouple them from the rest of the GPU program. We then use a few threads acting as a delegate to perform the decoupled operations on behalf of all other threads. This reduces synchronization for the decoupled operations because fewer threads are involved. In addition, the delegate amortizes overhead for the other threads, similar to the way vector execution amortizes instruction overheads. The cost of our approach is the need for communication between the delegate and the other threads, and we develop ways to reduce this communication so that the benefit strongly outweighs the cost. Based on this strategy, we propose three solutions covering both regular and irregular workloads. For regular GPU workloads, our solution reduces repetitive ALU operations and instruction execution while hiding memory latency with non-speculative prefetching; it achieves a 40.7% speedup and 20.2% energy reduction on average across 29 benchmarks. For lock-based workloads, our solution avoids destructive lock contention in global memory and thus achieves an average speedup of 3.6x, implemented entirely in software. Finally, we introduce a new GPU single-source shortest path (SSSP) algorithm with a complex worklist; the complex worklist offers many benefits over a simpler one but incurs significant overheads, which our decoupled delegate approach reduces, making the design efficient for GPUs. Our solution, implemented in software, achieves an average speedup of 2.8x over 226 graphs compared with state-of-the-art approaches.
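The delegation pattern itself can be sketched in CUDA as below: worker threads append requests to a queue instead of all contending on the shared structure, and a delegate then applies them without contention. The dissertation's delegates run concurrently with the workers; this two-kernel version, with a made-up histogram as the shared structure, is a simplification for illustration only.

    // Sketch of decoupled delegation: workers enqueue requests; a delegate
    // drains the queue and updates the shared structure with no contention.
    // Simplified two-phase version, not the dissertation's implementation.
    #include <cuda_runtime.h>
    #include <cstdio>

    __device__ int queue[1 << 20];   // request buffer (zero-initialized)
    __device__ int tail = 0;         // shared enqueue cursor
    __device__ int hist[16];         // the contended structure: a histogram

    __global__ void workers(int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        // Enqueue a request instead of updating hist[] directly.
        int slot = atomicAdd(&tail, 1);
        queue[slot] = i % 16;        // request payload: which bin to bump
    }

    __global__ void delegate() {
        // A single delegate thread touches hist[], so the structure itself
        // needs no atomics at all.
        if (threadIdx.x == 0)
            for (int s = 0; s < tail; ++s)
                hist[queue[s]]++;
    }

    int main() {
        int n = 1 << 16;
        workers<<<(n + 255) / 256, 256>>>(n);
        delegate<<<1, 32>>>();       // same stream, so it runs after workers
        int h[16];
        cudaMemcpyFromSymbol(h, hist, sizeof(h));
        printf("bin 0 = %d (expect %d)\n", h[0], n / 16);
        return 0;
    }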