Centralized and Distributed Sparsification for Low-Complexity Message Passing Algorithm in C-RAN Architectures
Cloud radio access network (C-RAN) is a promising technology for fifth-generation (5G) cellular systems. However, the burden imposed by the huge amount of data to be collected (in the uplink) from the remote radio heads (RRHs) and processed at the baseband unit (BBU) poses serious challenges. In order to reduce the computation effort of the minimum mean square error (MMSE) receiver at the BBU, Gaussian message passing (MP) together with a suitable sparsification of the channel matrix can be used. In this paper we propose two sets of solutions, centralized and distributed. In the centralized solutions, we propose different approaches to sparsify the channel matrix in order to reduce the complexity of MP. However, these approaches still require that all signals reaching the RRHs are conveyed to the BBU, so the communication requirements on the backbone network devices are unaltered. In the distributed solutions, instead, we aim at reducing both the complexity of MP at the BBU and the requirements on the RRH-BBU communication links by pre-processing the signals at the RRHs and conveying a reduced set of signals to the BBU.
Comment: Accepted for publication in IEEE VTC 201
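The centralized idea can be illustrated with a minimal sketch (not the paper's actual algorithm): hard-threshold the channel matrix to sparsify it, then apply the standard linear MMSE receiver x_hat = (H^H H + sigma^2 I)^{-1} H^H y. All dimensions and the threshold value below are illustrative assumptions.

```python
import numpy as np

def sparsify(H, threshold):
    """Centralized sparsification sketch: zero out weak channel entries."""
    Hs = H.copy()
    Hs[np.abs(Hs) < threshold] = 0.0
    return Hs

def mmse_estimate(H, y, noise_var):
    """Linear MMSE estimate: x_hat = (H^H H + sigma^2 I)^{-1} H^H y."""
    n = H.shape[1]
    A = H.conj().T @ H + noise_var * np.eye(n)
    return np.linalg.solve(A, H.conj().T @ y)

# Hypothetical uplink: 8 RRH antennas, 4 users; most channel gains are weak/absent.
rng = np.random.default_rng(0)
H = rng.normal(size=(8, 4)) * (rng.random((8, 4)) < 0.4)
x = rng.normal(size=4)                     # transmitted symbols
y = H @ x + 0.01 * rng.normal(size=8)      # signal collected at the RRHs
x_hat = mmse_estimate(sparsify(H, 0.05), y, noise_var=1e-4)
```

A sparser H makes each MP update cheaper, at the cost of a model mismatch that the paper's sparsification strategies are designed to control.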
Message passing algorithms for cloud radio access network of cellular systems
Cloud Radio Access Network (C-RAN) is a cellular network architecture that is promising for the 5G standard. The main problem of decoding via message passing algorithms lies in the computational complexity. We therefore propose two different approaches to solving the problem: a centralized one based on channel sparsification, and a distributed one in which the complexity reduction is implemented through precoding at the cells
Linear state estimation via 5G C-RAN cellular networks using Gaussian belief propagation
Machine-type communications and large-scale information processing architectures are among the key (r)evolutionary enhancements of emerging fifth-generation (5G) mobile cellular networks. Massive data acquisition and processing will make the 5G network an ideal platform for large-scale system monitoring and control, with applications in future smart transportation, connected industry, power grids, etc. In this work, we investigate the capability of such a 5G network architecture to provide the state estimate of an underlying linear system from the input obtained via a large-scale deployment of measurement devices. Assuming that the measurements are communicated via a densely deployed cloud radio access network (C-RAN), we formulate and solve the problem of estimating the system state from the set of signals collected at the C-RAN base stations. Our solution, based on the Gaussian belief propagation (GBP) framework, allows for large-scale and distributed deployment within the emerging 5G information processing architectures. The presented numerical study demonstrates the accuracy, convergence behavior, and scalability of the proposed GBP-based solution to the large-scale state estimation problem
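As a rough illustration of the GBP framework (a generic sketch, not the paper's exact formulation), the following implements scalar Gaussian belief propagation for solving a symmetric linear system A x = b, such as the normal equations of a linear state estimation problem. It is exact on tree-structured graphs and converges on walk-summable models.

```python
import numpy as np

def gabp(A, b, iters=50):
    """Scalar Gaussian belief propagation for A x = b (A symmetric).

    Each edge (i, j) carries a Gaussian message with precision P[i, j] and
    mean mu[i, j]; the belief at a node combines its own potential with all
    incoming messages. Exact on trees."""
    n = len(b)
    P = np.zeros((n, n))
    mu = np.zeros((n, n))
    nbrs = [[j for j in range(n) if j != i and A[i, j] != 0] for i in range(n)]
    for _ in range(iters):
        P_new, mu_new = P.copy(), mu.copy()
        for i in range(n):
            for j in nbrs[i]:
                # combine the node potential with all messages except j's
                Pi = A[i, i] + sum(P[k, i] for k in nbrs[i] if k != j)
                ui = (b[i] + sum(P[k, i] * mu[k, i] for k in nbrs[i] if k != j)) / Pi
                P_new[i, j] = -A[i, j] ** 2 / Pi
                mu_new[i, j] = -A[i, j] * ui / P_new[i, j]
        P, mu = P_new, mu_new
    return np.array([
        (b[i] + sum(P[k, i] * mu[k, i] for k in nbrs[i]))
        / (A[i, i] + sum(P[k, i] for k in nbrs[i]))
        for i in range(n)
    ])

# Tiny tree-structured (chain) system: GBP recovers the exact solution.
A = np.array([[3.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 3.0]])
b = np.array([1.0, 2.0, 3.0])
x = gabp(A, b)
```

Because each update only touches a node and its graph neighbors, the computation maps naturally onto a distributed C-RAN deployment where each base station handles its local variables.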
FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs
The rapid growth of the memory and computation requirements of large language models (LLMs) has outpaced the development of hardware, hindering people who lack large-scale high-end GPUs from training or deploying LLMs. However, consumer-level GPUs, which constitute a larger market share, are typically overlooked for LLMs due to their weaker computing performance, smaller storage capacity, and lower communication bandwidth. Additionally, users may have privacy concerns when interacting with remote LLMs. In this paper, we envision a decentralized system unlocking the vast untapped potential of consumer-level GPUs in pre-training, inference, and fine-tuning of LLMs with privacy protection. However, this system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and the variability of peer and device heterogeneity. To address these challenges, our system design incorporates: 1) a broker with a backup pool to implement dynamic join and quit of computing providers; 2) task scheduling based on hardware performance to improve system efficiency; 3) abstracting ML procedures into directed acyclic graphs (DAGs) to achieve model and task universality; and 4) abstracting intermediate representation and execution planes to ensure compatibility across various devices and deep learning (DL) frameworks. Our performance analysis demonstrates that 50 RTX 3080 GPUs can achieve throughputs comparable to those of 4 H100 GPUs, which are significantly more expensive
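The DAG and hardware-aware-scheduling ideas (points 2 and 3) can be pictured with a toy scheduler; every name below is hypothetical, not FusionAI's actual API: tasks are ordered topologically, and each ready task goes to the worker where it finishes soonest given a per-worker speed score.

```python
from collections import deque

def topo_schedule(tasks, deps, worker_speed):
    """Toy DAG scheduler: process tasks in topological order and greedily
    place each ready task on the worker with the smallest finish load.
    tasks: {name: cost}; deps: {name: [prerequisites]}; worker_speed: {name: speed}."""
    indeg = {t: 0 for t in tasks}
    children = {t: [] for t in tasks}
    for t, prereqs in deps.items():
        for p in prereqs:
            indeg[t] += 1
            children[p].append(t)
    ready = deque(t for t in tasks if indeg[t] == 0)
    load = {w: 0.0 for w in worker_speed}
    assignment = {}
    while ready:
        t = ready.popleft()
        best = min(load, key=lambda w: load[w] + tasks[t] / worker_speed[w])
        assignment[t] = best
        load[best] += tasks[t] / worker_speed[best]
        for c in children[t]:
            indeg[c] -= 1
            if indeg[c] == 0:
                ready.append(c)
    return assignment

# Hypothetical training pipeline: forward, backward, optimizer step,
# scheduled across a slow and a fast worker (relative speeds are made up).
tasks = {"fwd": 4.0, "bwd": 8.0, "opt": 2.0}
deps = {"bwd": ["fwd"], "opt": ["bwd"]}
assignment = topo_schedule(tasks, deps, {"rtx3080": 1.0, "h100": 4.0})
```

A real system would also weigh memory limits, bandwidth, and peer churn, but the DAG abstraction is what lets one scheduler serve many model and task types.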
Communication-Efficient Distributed Deep Learning: A Comprehensive Survey
Distributed deep learning has become very common for reducing the overall training time by exploiting multiple computing devices (e.g., GPUs/TPUs) as the sizes of deep models and data sets increase. However, data communication between computing devices can be a bottleneck that limits system scalability. How to address the communication problem in distributed deep learning has recently become a hot research topic. In this paper, we provide a comprehensive survey of communication-efficient distributed training algorithms, covering both system-level and algorithmic-level optimizations. At the system level, we demystify the system designs and implementations that reduce the communication cost. At the algorithmic level, we compare different algorithms with theoretical convergence bounds and communication complexity. Specifically, we first propose a taxonomy of data-parallel distributed training algorithms with four main dimensions: communication synchronization, system architectures, compression techniques, and the parallelism of communication and computing. We then discuss the studies addressing the problems in these four dimensions and compare their communication costs. We further compare the convergence rates of different algorithms, which tells us how fast each algorithm can converge to the solution in terms of iterations. Based on the system-level communication cost analysis and the theoretical convergence speed comparison, we help readers understand which algorithms are more efficient under specific distributed environments, and we extrapolate potential directions for further optimizations
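Among the compression techniques in the survey's taxonomy, top-k gradient sparsification with error feedback is a representative algorithmic-level method: each worker transmits only the k largest-magnitude gradient entries and locally accumulates what it dropped. A minimal generic sketch (not tied to any specific surveyed system):

```python
import numpy as np

def topk_compress(grad, k):
    """Keep only the k largest-magnitude entries (indices + values)."""
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    return idx, grad[idx]

class ErrorFeedback:
    """Top-k sparsification with error accumulation: entries dropped in one
    round are added back to the gradient in later rounds, which is what
    preserves convergence despite aggressive compression."""
    def __init__(self, dim):
        self.residual = np.zeros(dim)

    def step(self, grad, k):
        corrected = grad + self.residual
        idx, vals = topk_compress(corrected, k)
        sparse = np.zeros_like(corrected)
        sparse[idx] = vals
        self.residual = corrected - sparse  # remember what was dropped
        return sparse

ef = ErrorFeedback(4)
s1 = ef.step(np.array([0.1, -2.0, 0.05, 3.0]), k=1)  # only the 3.0 survives
s2 = ef.step(np.zeros(4), k=1)                       # dropped -2.0 resurfaces
```

Transmitting k indices and values instead of the full dense gradient is what trades a little extra computation for a large reduction in communication volume.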
Distributed Symmetry Breaking on Power Graphs via Sparsification
In this paper, we present efficient distributed algorithms for classical symmetry breaking problems, maximal independent sets (MIS) and ruling sets, in power graphs. We work in the standard CONGEST model of distributed message passing, where the communication network is abstracted as a graph G. Typically, the problem instance in CONGEST is identical to the communication network G, that is, we perform the symmetry breaking in G. In this work, we consider a setting where the problem instance corresponds to a power graph G^k, where each node of the communication network is connected to all of its k-hop neighbors. Our main contribution is a deterministic polylogarithmic time algorithm for computing 2-ruling sets of G^k, which (for k > 1) improves exponentially on the current state-of-the-art runtimes. The main technical ingredient for this result is a deterministic sparsification procedure which may be of independent interest. On top of being a natural family of problems, ruling sets (in power graphs) are well-motivated through their applications in the powerful shattering framework [BEPS JACM'16, Ghaffari SODA'19] (and others). We present randomized algorithms for computing maximal independent sets and ruling sets of G^k in essentially the same time as they can be computed in G. We also revisit the shattering algorithm for MIS [BEPS JACM'16] and present different approaches for the post-shattering phase. Our solutions are algorithmically and analytically simpler (also in the LOCAL model) than existing solutions and obtain the same runtime as [Ghaffari SODA'16]
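To make the setting concrete, here is a small sketch (illustrative only: computed centrally and sequentially, not the paper's CONGEST algorithm) that builds the power graph G^k by BFS and runs a greedy MIS on it; an MIS of G^k is exactly a maximal distance-k independent set of G.

```python
from collections import deque

def power_graph(adj, k):
    """Connect every node to all nodes within distance k (one BFS per node)."""
    power = {}
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            if dist[u] == k:
                continue  # do not expand past distance k
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        power[s] = set(dist) - {s}
    return power

def greedy_mis(adj):
    """Sequential greedy MIS: take a node unless a neighbor was taken."""
    mis = set()
    for v in sorted(adj):
        if not (adj[v] & mis):
            mis.add(v)
    return mis

path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}  # a path on 4 nodes
g2 = power_graph(path, 2)                      # its square G^2
mis = greedy_mis(g2)                           # MIS of G^2
```

The distributed difficulty the paper addresses is that nodes of G^k are not directly connected in the communication network G, so simulating even one round on G^k naively costs k rounds and heavy congestion in CONGEST.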