Sample-based and Feature-based Federated Learning for Unconstrained and Constrained Nonconvex Optimization via Mini-batch SSCA
Federated learning (FL) has become a hot research area in enabling the
collaborative training of machine learning models among multiple clients that
hold sensitive local data. Nevertheless, unconstrained federated optimization
has been studied mainly using stochastic gradient descent (SGD), which may
converge slowly, and constrained federated optimization, which is more
challenging, has not been investigated so far. This paper investigates
sample-based and feature-based federated optimization, respectively, and
considers both unconstrained and constrained nonconvex problems for each of
them. First, we propose FL algorithms using stochastic successive convex
approximation (SSCA) and mini-batch techniques. These algorithms can adequately
exploit the structures of the objective and constraint functions and
incrementally utilize samples. We show that the proposed FL algorithms converge
to stationary points and Karush-Kuhn-Tucker (KKT) points of the
unconstrained and constrained nonconvex problems, respectively. Next, we
provide algorithm examples with appealing computational complexity and
communication load per communication round. We show that the proposed algorithm
examples for unconstrained federated optimization are identical to FL
algorithms via momentum SGD and provide an analytical connection between SSCA
and momentum SGD. Finally, numerical experiments demonstrate the inherent
advantages of the proposed algorithms in convergence speeds, communication and
computation costs, and model specifications.
Comment: 18 pages, 4 figures. This work is to appear in IEEE Trans. Signal
Process. arXiv admin note: substantial text overlap with arXiv:2103.0950
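The abstract above describes mini-batch SSCA and its link to momentum SGD without giving the update rule. The following is a minimal sketch of one common simplified form of SSCA with a quadratic surrogate, not the paper's exact algorithm. The least-squares data, the step-size exponents, the proximal weight `tau`, and all function names are assumptions chosen for illustration. With this surrogate, the recursively averaged gradient `d` plays the role of a momentum buffer, which is the kind of connection the abstract refers to.

```python
import numpy as np


def make_clients(num_clients=4, samples_per_client=100, dim=3, seed=0):
    """Create hypothetical per-client least-squares data sharing one true model."""
    rng = np.random.default_rng(seed)
    x_true = rng.normal(size=dim)
    clients = []
    for _ in range(num_clients):
        A = rng.normal(size=(samples_per_client, dim))
        clients.append((A, A @ x_true))  # noise-free labels (assumption)
    return clients, x_true


def minibatch_gradient(A, y, x, batch_size, rng):
    """Mini-batch gradient of 0.5 * mean((A x - y)^2) on one client."""
    idx = rng.choice(len(y), size=batch_size, replace=False)
    A_b, y_b = A[idx], y[idx]
    return A_b.T @ (A_b @ x - y_b) / batch_size


def federated_ssca(clients, dim, num_rounds=3000, batch_size=10, tau=2.0, seed=1):
    """Sample-based FL sketch: SSCA with the surrogate d^T (x - x_t) + tau * ||x - x_t||^2."""
    rng = np.random.default_rng(seed)
    x = np.zeros(dim)
    d = np.zeros(dim)  # recursively averaged gradient (momentum-like buffer)
    for t in range(1, num_rounds + 1):
        rho = t ** -0.6    # surrogate averaging weight (assumed schedule)
        gamma = t ** -0.7  # iterate averaging weight, decays faster than rho
        # Server averages the mini-batch gradients reported by the clients.
        g = np.mean(
            [minibatch_gradient(A, y, x, batch_size, rng) for A, y in clients],
            axis=0,
        )
        d = (1.0 - rho) * d + rho * g
        x_bar = x - d / (2.0 * tau)              # closed-form surrogate minimiser
        x = (1.0 - gamma) * x + gamma * x_bar    # same as x - gamma / (2 * tau) * d
    return x
```

Calling `make_clients()` and then `federated_ssca(clients, dim=3)` should return an estimate close to the shared `x_true` on this toy problem. The last line of the loop shows that the update is a gradient step along the averaged direction `d`, which is the form of a momentum SGD step.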
EMBA: Efficient memory bandwidth allocation to improve performance on Intel commodity processor
On multi-core processors, contention on shared resources such as the last level cache (LLC) and memory bandwidth may cause serious performance degradation, which makes efficient resource allocation a critical issue in data centers. Intel recently introduced Memory Bandwidth Allocation (MBA) technology on its Xeon scalable processors, which makes it possible to allocate memory bandwidth in a real system. However, how to make the most of MBA to improve system performance remains an open question. In this work, (1) we formulate a quantitative relationship between a program's performance and its LLC occupancy and memory request rate on commodity processors; (2) guided by the performance formula, we propose a heuristic bound-aware throttling algorithm to improve system performance; (3) we further develop a hierarchical clustering method to improve the algorithm's efficiency; and (4) we implement these algorithms in EMBA, a low-overhead dynamic memory bandwidth scheduling system to improve performance on Intel commodity processors. The results show that, when multiple programs run simultaneously on a multi-core processor whose memory bandwidth is saturated, the programs with high memory bandwidth demand usually use bandwidth inefficiently compared with programs with medium memory bandwidth demand from the perspective of CPU performance. By slightly throttling the former's bandwidth, we can significantly improve the performance of the latter. On average, we improve system performance by 36.9% at the expense of 8.6% bandwidth utilization rate.
Rate Splitting for Multi-Antenna Downlink: Precoder Design and Practical Implementation
Accepted by the IEEE JSAC special issue on Multiple Antenna Technologies for Beyond 5G (SI-MAT5G+).
Rate splitting (RS) is a potentially powerful and flexible technique for multi-antenna downlink transmission. In this paper, we address several technical challenges towards its practical implementation for beyond 5G systems. To this end, we focus on a single-cell system with a multi-antenna base station (BS) and K single-antenna receivers. We consider RS in its most general form, and joint decoding to fully exploit the potential of RS. First, we investigate the achievable rates under joint decoding and formulate the precoder design problems to maximize a general utility function, or to minimize the transmit power under pre-defined rate targets. Building upon the concave-convex procedure (CCCP), we propose precoder design algorithms for an arbitrary number of users. Our proposed algorithms approximate the intractable non-convex problems with a number of successively refined convex problems, and provably converge to stationary points of the original problems. Then, to reduce the decoding complexity, we consider the optimization of the precoder and the decoding order under successive decoding. Further, we propose a stream selection algorithm to reduce the number of precoded signals. With a reduced number of streams and successive decoding at the receivers, our proposed algorithm can be implemented even when the number of users is relatively large, a setting in which the complexity was previously considered prohibitively high. Finally, we propose a simple adaptation of our algorithms to account for the imperfection of the channel state information at the transmitter. Numerical results demonstrate that the general RS scheme provides a substantial performance gain compared to state-of-the-art linear precoding schemes, especially with a moderately large number of users.
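The abstract above builds on the concave-convex procedure (CCCP), which replaces a non-convex problem with a sequence of convex ones. The sketch below illustrates the idea on a scalar toy objective, f(x) = x^4 - 2x^2, which is an assumption made for illustration and has nothing to do with the paper's precoder design problems. The concave part, -2x^2, is linearised at the current iterate, and the resulting convex subproblem has a closed-form minimiser.

```python
import math


def objective(x):
    """Toy difference-of-convex objective: convex x^4 plus concave -2x^2."""
    return x ** 4 - 2.0 * x ** 2


def cccp_step(x_k):
    """One CCCP iteration.

    Linearising -2x^2 at x_k gives -2*x_k^2 - 4*x_k*(x - x_k), so the convex
    subproblem is: minimise x^4 - 4*x_k*x. Setting the derivative to zero,
    4x^3 = 4*x_k, yields x = cbrt(x_k).
    """
    return math.copysign(abs(x_k) ** (1.0 / 3.0), x_k)


def cccp(x0, num_iters=60):
    """Run CCCP from x0 and return the full list of iterates."""
    trajectory = [x0]
    for _ in range(num_iters):
        trajectory.append(cccp_step(trajectory[-1]))
    return trajectory
```

Starting from `x0 = 0.5`, the iterates increase towards x = 1, a stationary point of the toy objective, and the objective value does not increase from one iterate to the next. This mirrors, at toy scale, the "successively refined convex problems" and stationary-point convergence described in the abstract.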