Training and Serving System of Foundation Models: A Comprehensive Survey
Foundation models (e.g., ChatGPT, DALL-E, PengCheng Mind, PanGu-Σ)
have demonstrated extraordinary performance in key technological areas such as
natural language processing and visual recognition, and have become a
mainstream path toward artificial general intelligence. This has led major
technology companies to dedicate significant human and financial resources to
developing their own foundation model systems, driving continuous growth in
these models' parameter counts. As a result, training and serving these models
poses significant challenges, including substantial demands on computing
power, memory, and bandwidth. Efficient training and serving strategies are
therefore crucial, and many researchers have actively explored and proposed
effective methods, so a comprehensive survey of them is essential for system
developers and researchers. This paper explores the methods employed in
training and serving foundation models from various perspectives, providing a
detailed categorization of these state-of-the-art methods along finer
dimensions such as networking, computing, and storage. It also summarizes the
open challenges and offers a perspective on the future development of
foundation model systems. Through comprehensive discussion and analysis, we
hope to provide a solid theoretical basis and practical guidance for future
research and applications, promoting continuous innovation and development in
foundation model systems.
Tackling Exascale Software Challenges in Molecular Dynamics Simulations with GROMACS
GROMACS is a widely used package for biomolecular simulation, and over the
last two decades it has evolved from small-scale efficiency to advanced
heterogeneous acceleration and multi-level parallelism targeting some of the
largest supercomputers in the world. Here, we describe some of the ways we have
been able to realize this through the use of parallelization on all levels,
combined with a constant focus on absolute performance. Release 4.6 of GROMACS
uses SIMD acceleration on a wide range of architectures, GPU offloading
acceleration, and both OpenMP and MPI parallelism within and between nodes,
respectively. The recent work on acceleration made it necessary to revisit the
fundamental algorithms of molecular simulation, including the concept of
neighbor searching, and we discuss the present and future challenges we see
for exascale simulation, in particular the need for very fine-grained task
parallelism. We also discuss the software management, code peer review, and
continuous integration testing required for a project of this complexity.
Comment: EASC 2014 conference proceedings
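The multi-level parallelism described above pairs MPI between nodes with OpenMP within them. As a rough illustration of that pattern (a generic sketch, not GROMACS source code; the per-particle "work" is a placeholder), each MPI rank runs OpenMP threads over its own slice and the ranks then combine partial results:

    // Generic sketch of MPI-between-nodes / OpenMP-within-node parallelism;
    // illustrative only, not GROMACS code.
    #include <mpi.h>
    #include <omp.h>
    #include <cstdio>

    int main(int argc, char **argv) {
        int provided;
        // Request thread support so OpenMP regions can coexist with MPI calls.
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        int rank, nranks;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        double local = 0.0;
        // Intra-node level: OpenMP threads share this rank's slice of work.
        #pragma omp parallel for reduction(+:local)
        for (int i = 0; i < 1000000; ++i)
            local += 1e-6;              // stand-in for per-particle force work

        // Inter-node level: ranks combine partial results over the network.
        double global = 0.0;
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("combined %f across %d ranks\n", global, nranks);
        MPI_Finalize();
        return 0;
    }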
ACCL+: an FPGA-Based Collective Engine for Distributed Applications
FPGAs are increasingly prevalent in cloud deployments, serving as Smart NICs
or network-attached accelerators. Despite their potential, developing
distributed FPGA-accelerated applications remains cumbersome due to the lack of
appropriate infrastructure and communication abstractions. To facilitate the
development of distributed applications with FPGAs, in this paper we propose
ACCL+, an open-source versatile FPGA-based collective communication library.
Portable across different platforms and supporting UDP, TCP, as well as RDMA,
ACCL+ empowers FPGA applications to initiate direct FPGA-to-FPGA collective
communication. Additionally, it can serve as a collective offload engine for
CPU applications, freeing the CPU from networking tasks. It is user-extensible,
allowing new collectives to be implemented and deployed without having to
re-synthesize the FPGA circuit. We evaluated ACCL+ on an FPGA cluster with 100
Gb/s networking, comparing its performance against software MPI over RDMA. The
results demonstrate ACCL+'s significant advantages for FPGA-based distributed
applications and highly competitive performance for CPU applications. We
showcase ACCL+'s dual role with two use cases: seamlessly integrating as a
collective offload engine to distribute CPU-based vector-matrix multiplication,
and serving as a crucial and efficient component in designing fully FPGA-based
distributed deep-learning recommendation inference.
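For context, the software baseline that ACCL+ is evaluated against is an MPI collective executed by the host CPU, as in the sketch below. This shows only the generic MPI pattern that a collective offload engine takes off the CPU; the ACCL+ host API itself is not reproduced here.

    // The software-MPI style of collective that ACCL+ is benchmarked against:
    // the CPU itself reduces a buffer across all ranks. Offloading moves this
    // work to the FPGA. Generic MPI only; not the ACCL+ API.
    #include <mpi.h>
    #include <vector>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        std::vector<float> shard(1 << 20, 1.0f);   // e.g., a gradient shard
        // The host spends cycles on the collective itself; this is the
        // networking work a collective offload engine frees it from.
        MPI_Allreduce(MPI_IN_PLACE, shard.data(), (int)shard.size(),
                      MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
        MPI_Finalize();
        return 0;
    }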
vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design
The most widely used machine learning frameworks require users to carefully
tune their memory usage so that the deep neural network (DNN) fits into the
DRAM capacity of a GPU. This restriction hampers a researcher's flexibility to
study different machine learning algorithms, forcing them to either use a less
desirable network architecture or parallelize the processing across multiple
GPUs. We propose a runtime memory manager that virtualizes the memory usage of
DNNs such that both GPU and CPU memory can simultaneously be utilized for
training larger DNNs. Our virtualized DNN (vDNN) reduces the average GPU memory
usage of AlexNet by up to 89%, OverFeat by 91%, and GoogLeNet by 95%, a
significant reduction in memory requirements of DNNs. Similar experiments on
VGG-16, one of the deepest and most memory-hungry DNNs to date, demonstrate the
memory-efficiency of our proposal. vDNN enables VGG-16 with batch size 256
(requiring 28 GB of memory) to be trained on a single NVIDIA Titan X GPU card
containing 12 GB of memory, with 18% performance loss compared to a
hypothetical, oracular GPU with enough memory to hold the entire DNN.
Comment: Published as a conference paper at the 49th IEEE/ACM International
Symposium on Microarchitecture (MICRO-49), 2016
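The core mechanism, offloading a layer's activations to host memory during the forward pass and prefetching them back before that layer's backward pass, can be sketched with the CUDA runtime as below. This is a minimal single-tensor illustration of the general technique, not vDNN's actual memory manager; the buffer names and sizes are made up.

    // Minimal sketch of forward-pass offload and backward-pass prefetch using
    // pinned host memory and a dedicated copy stream. Illustrative only.
    #include <cuda_runtime.h>

    int main() {
        const size_t n = 1 << 24;                // one layer's activation count
        float *d_act, *h_act;
        cudaMalloc((void**)&d_act, n * sizeof(float));
        cudaMallocHost((void**)&h_act, n * sizeof(float)); // pinned: async copy

        cudaStream_t copy_stream;
        cudaStreamCreate(&copy_stream);

        // ... forward pass writes d_act on the compute stream ...

        // Offload: stash the activations in host memory so the GPU copy can
        // be released while deeper layers run.
        cudaMemcpyAsync(h_act, d_act, n * sizeof(float),
                        cudaMemcpyDeviceToHost, copy_stream);
        cudaStreamSynchronize(copy_stream);
        cudaFree(d_act);

        // ... later, just before this layer's backward pass ...
        cudaMalloc((void**)&d_act, n * sizeof(float));
        cudaMemcpyAsync(d_act, h_act, n * sizeof(float),
                        cudaMemcpyHostToDevice, copy_stream);
        cudaStreamSynchronize(copy_stream);

        // ... backward pass reads d_act ...
        cudaFree(d_act);
        cudaFreeHost(h_act);
        cudaStreamDestroy(copy_stream);
        return 0;
    }

In the real setting the copies overlap with compute on other layers rather than being synchronized immediately, which is what hides most of the offload cost.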
Lightweight asynchronous scheduling in heterogeneous reconfigurable systems
The trend for heterogeneous embedded systems is the integration of accelerators and general-purpose CPU cores on the same die. In these integrated architectures, like the Zynq UltraScale+ board (CPU+FPGA) that we target in this work, hardware support for shared memory and low-overhead synchronization between the accelerator and the CPU cores makes the case for exploring strategies that exploit a tight collaboration between the CPUs and the accelerator. In this paper we propose a novel lightweight scheduling strategy, FastFit, targeted at FPGA accelerators, and a new scheduler based on it, named MultiFastFit, which asynchronously tackles heterogeneous systems composed of a variety of CPU cores and FPGA IPs. Our strategy significantly reduces the overhead of automatically computing near-optimal chunk sizes compared to a previous state-of-the-art auto-tuned approach, which makes it more suitable for fine-grained applications. Additionally, MultiFastFit has been designed to enable efficient co-execution of work among compute devices, keeping all devices busy while minimizing load imbalance. Our approaches have been evaluated using four benchmarks carefully tuned for the low-power UltraScale+ platform. Our experiments demonstrate that the FastFit strategy always finds a near-optimal FPGA chunk size for any device configuration at reasonable cost, even for fine-grained and irregular applications, and that heterogeneous CPU+FPGA co-executions that exploit all the compute devices are usually faster and more energy-efficient than CPU-only and FPGA-only executions. We have also compared MultiFastFit with other state-of-the-art scheduling strategies, finding that it outperforms another auto-tuned approach by up to 2x and achieves results similar to manually tuned schedulers without requiring an offline search for the ideal CPU-FPGA partition or FPGA chunk granularity.
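The co-execution idea, with every device repeatedly claiming chunks of a shared iteration space sized to its own characteristics, can be sketched as below. This is a generic dynamic-scheduling skeleton with assumed chunk sizes, not FastFit's actual chunk-size model or the UltraScale+ runtime; the "FPGA" worker here is just another host thread standing in for the accelerator.

    // Generic sketch of heterogeneous co-execution: workers pull chunks from
    // a shared atomic counter until the range is exhausted, so all devices
    // stay busy. Chunk sizes are assumptions, not FastFit's model.
    #include <algorithm>
    #include <atomic>
    #include <cstdio>
    #include <thread>
    #include <vector>

    std::atomic<long> next_iter{0};
    const long total = 1000000;

    void worker(long chunk) {
        long begin;
        // Claim a chunk, process it, repeat until no work remains.
        while ((begin = next_iter.fetch_add(chunk)) < total) {
            long end = std::min(begin + chunk, total);
            (void)end;   // placeholder: process iterations [begin, end) here
        }
    }

    int main() {
        std::vector<std::thread> pool;
        // Accelerator: large chunks amortize per-offload overhead.
        pool.emplace_back(worker, 8192L);
        // CPU cores: small chunks keep the tail of the range balanced.
        for (int c = 0; c < 4; ++c)
            pool.emplace_back(worker, 512L);
        for (auto &t : pool) t.join();
        printf("processed %ld iterations\n", total);
        return 0;
    }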
Improving efficiency for GPUs with decoupled delegate
GPUs are increasingly used for general-purpose computation. For many applications, GPUs achieve significant performance advantages over CPUs, largely due to their ability to exploit massive parallelism. However, massive parallelism can make many operations inefficient, such as those on concurrent data structures; symptoms include synchronization bottlenecks and redundant overhead. To make such operations efficient, our key strategy is to decouple them from the rest of the GPU program. We then use a few threads acting as a delegate to perform the decoupled operations on behalf of all other threads. This reduces synchronization for the decoupled operations because fewer threads are involved. In addition, the delegate amortizes overhead for the other threads, similar to the way vector execution amortizes instruction overheads. The cost of our approach is the need for communication between the delegate and the other threads, and we develop ways to reduce this communication so that the benefit strongly outweighs the cost. Based on this strategy, we propose three solutions covering both regular and irregular workloads. For regular GPU workloads, our solution reduces repetitive ALU operations and instruction execution while hiding memory latency with non-speculative prefetching; it achieves a 40.7% speedup and 20.2% energy reduction on average across 29 benchmarks. For lock-based workloads, our solution avoids destructive lock contention in global memory and thus achieves an average speedup of 3.6x, implemented entirely in software. Finally, we introduce a new GPU single-source shortest path (SSSP) algorithm with a complex worklist; the complex worklist offers many benefits over a simpler one but incurs significant overheads, which our decoupled delegate approach reduces, making the design efficient for GPUs. Our solution, implemented in software, achieves an average speedup of 2.8x over 226 graphs compared with state-of-the-art approaches.
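The delegation pattern itself can be sketched in CUDA as below: worker threads append requests to a queue instead of all contending on the shared structure, and a delegate then applies them without contention. The dissertation's delegates run concurrently with the workers; this two-kernel version, with a made-up histogram as the shared structure, is a simplification for illustration only.

    // Sketch of decoupled delegation: workers enqueue requests; a delegate
    // drains the queue and updates the shared structure with no contention.
    // Simplified two-phase version, not the dissertation's implementation.
    #include <cuda_runtime.h>
    #include <cstdio>

    __device__ int queue[1 << 20];   // request buffer (zero-initialized)
    __device__ int tail = 0;         // shared enqueue cursor
    __device__ int hist[16];         // the contended structure: a histogram

    __global__ void workers(int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        // Enqueue a request instead of updating hist[] directly.
        int slot = atomicAdd(&tail, 1);
        queue[slot] = i % 16;        // request payload: which bin to bump
    }

    __global__ void delegate() {
        // A single delegate thread touches hist[], so the structure itself
        // needs no atomics at all.
        if (threadIdx.x == 0)
            for (int s = 0; s < tail; ++s)
                hist[queue[s]]++;
    }

    int main() {
        int n = 1 << 16;
        workers<<<(n + 255) / 256, 256>>>(n);
        delegate<<<1, 32>>>();       // same stream, so it runs after workers
        int h[16];
        cudaMemcpyFromSymbol(h, hist, sizeof(h));
        printf("bin 0 = %d (expect %d)\n", h[0], n / 16);
        return 0;
    }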