Scalable Projection-Free Optimization
As a projection-free algorithm, the Frank-Wolfe (FW) method, also known as conditional gradient, has recently received considerable attention in the machine learning community. In this dissertation, we study several topics on FW variants for scalable projection-free optimization. We first propose 1-SFW, the first projection-free method that requires only one sample per iteration to update the optimization variable and yet achieves the best known complexity bounds for convex, non-convex, and monotone DR-submodular settings. Then we move to the distributed setting and develop Quantized Frank-Wolfe (QFW), a general communication-efficient distributed FW framework for both convex and non-convex objective functions. We study the performance of QFW in two widely recognized settings: 1) stochastic optimization and 2) finite-sum optimization. Finally, we propose Black-Box Continuous Greedy, a derivative-free and projection-free algorithm that maximizes a monotone continuous DR-submodular function over a bounded convex body in Euclidean space.
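For illustration, here is a minimal Python sketch of the kind of one-sample, projection-free step that 1-SFW builds on. The momentum weight, step size, and the l1_ball_lmo helper are assumptions made for this example, not the paper's exact schedule or oracle:

    import numpy as np

    def l1_ball_lmo(d, r=1.0):
        # Linear minimization oracle for the l1 ball of radius r:
        # argmin over ||v||_1 <= r of <d, v> puts all mass on one coordinate.
        v = np.zeros_like(d, dtype=float)
        i = np.argmax(np.abs(d))
        v[i] = -r * np.sign(d[i])
        return v

    def one_sample_fw_step(x, d_prev, grad_sample, lmo, t):
        # One projection-free update in the spirit of 1-SFW (illustrative
        # momentum weight and step size; the paper's schedule may differ).
        rho = (t + 1) ** (-2.0 / 3.0)                 # averaging weight
        d = (1.0 - rho) * d_prev + rho * grad_sample  # one-sample estimate
        v = lmo(d)                                    # oracle replaces projection
        eta = 1.0 / (t + 1)                           # step size
        return x + eta * (v - x), d                   # convex step stays feasible

Because each iterate is a convex combination of feasible points, no projection is ever needed; the linear minimization oracle is the only access to the constraint set.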
Efficient-Adam: Communication-Efficient Distributed Adam
Distributed adaptive stochastic gradient methods have been widely used for
large-scale nonconvex optimization, such as training deep learning models.
However, their communication complexity on finding $\epsilon$-stationary
points has rarely been analyzed in the nonconvex setting. In this work, we
present a novel communication-efficient distributed Adam in the
parameter-server model for stochastic nonconvex optimization, dubbed {\em
Efficient-Adam}. Specifically, we incorporate a two-way quantization scheme
into Efficient-Adam to reduce the communication cost between the workers and
server. Simultaneously, we adopt a two-way error feedback strategy to reduce
the biases caused by the two-way quantization on both the server and workers,
respectively. In addition, we establish the iteration complexity for the
proposed Efficient-Adam with a class of quantization operators, and further
characterize its communication complexity between the server and workers when
an $\epsilon$-stationary point is achieved. Finally, we apply Efficient-Adam
to solve a toy stochastic convex optimization problem and train deep learning
models on real-world vision and language tasks. Extensive experiments together
with a theoretical guarantee justify the merits of Efficient-Adam.
Comment: IEEE Transactions on Signal Processing
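As an illustration of the two-way scheme described above, here is a minimal Python sketch of quantization with error feedback on both sides. The uniform quantizer and all names are assumptions for this example; the server would feed the dequantized average into its Adam update, which is omitted here:

    import numpy as np

    def quantize(v, levels=256):
        # Toy uniform quantizer standing in for the paper's operator class.
        scale = np.max(np.abs(v)) + 1e-12
        half = levels // 2
        return np.round(v / scale * half) / half * scale

    def worker_send(grad, error):
        # Worker side: quantize the gradient plus the carried-over residual,
        # then remember what the quantizer lost (error feedback).
        corrected = grad + error
        msg = quantize(corrected)
        return msg, corrected - msg

    def server_broadcast(msgs, error):
        # Server side: average worker messages, quantize the broadcast,
        # and keep the server-side residual for the next round.
        avg = np.mean(msgs, axis=0) + error
        out = quantize(avg)
        return out, avg - out

The residuals carried on both sides are what counteract the biases introduced by compressing messages in both directions.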
Bregman Proximal Method for Efficient Communications under Similarity
We propose a novel distributed method for monotone variational inequalities
and convex-concave saddle point problems arising in various machine learning
applications such as game theory and adversarial training. By exploiting
\textit{similarity}, our algorithm overcomes the communication bottleneck, a
major issue in distributed optimization. The proposed algorithm enjoys optimal
communication complexity of $\mathcal{O}(\delta/\epsilon)$, where $\epsilon$
measures non-optimality in terms of the gap function, and $\delta$ is a
parameter of similarity. All the
existing distributed algorithms achieving this bound essentially utilize the
Euclidean setup.
In contrast to them, our algorithm is built upon Bregman proximal maps and it
is compatible with an arbitrary Bregman divergence. Thanks to this, it has more
flexibility to fit the problem geometry than algorithms with the Euclidean
setup. The proposed method thereby bridges the gap between the Euclidean and
non-Euclidean settings.
By using the restart technique, we extend our algorithm to variational
inequalities with a $\mu$-strongly monotone operator, resulting in optimal
communication complexity of $\mathcal{O}(\delta/\mu)$ (up to a logarithmic
factor). Our
theoretical results are confirmed by numerical experiments on a stochastic
matrix game.
Comment: 14 pages
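For concreteness, here is a single-node Python sketch of a Bregman proximal (mirror-prox) step under the negative-entropy divergence, the natural non-Euclidean geometry for a matrix game on the simplex. The distributed, similarity-exploiting structure of the actual method is omitted, and all names are illustrative:

    import numpy as np

    def entropy_prox(x, g, eta):
        # Bregman proximal step with the negative-entropy mirror map on the
        # simplex: argmin_z <eta*g, z> + KL(z || x) has this closed form.
        z = x * np.exp(-eta * g)
        return z / z.sum()

    def mirror_prox_step(x, y, F, eta):
        # One extragradient (mirror-prox) iteration for a saddle operator
        # F(x, y) -> (grad_x, grad_y), with y taking ascent steps.
        gx, gy = F(x, y)
        x_half = entropy_prox(x, gx, eta)     # extrapolation
        y_half = entropy_prox(y, -gy, eta)
        gx2, gy2 = F(x_half, y_half)
        x_new = entropy_prox(x, gx2, eta)     # update from the original point
        y_new = entropy_prox(y, -gy2, eta)
        return x_new, y_new

    # Usage for the matrix game min_x max_y x^T A y on the simplex:
    # A = np.array([[0.0, 1.0], [1.0, 0.0]])
    # F = lambda x, y: (A @ y, A.T @ x)

Swapping entropy_prox for a Euclidean projection recovers the standard extragradient method; the point of the Bregman construction is precisely that the divergence can be chosen to fit the problem geometry.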
Fast Composite Optimization and Statistical Recovery in Federated Learning
As a prevalent distributed learning paradigm, Federated Learning (FL) trains
a global model on a massive number of devices with infrequent communication.
This paper investigates a class of composite optimization and statistical
recovery problems in the FL setting, whose loss function consists of a
data-dependent smooth loss and a non-smooth regularizer. Examples include
sparse linear regression using Lasso, low-rank matrix recovery using nuclear
norm regularization, etc. In the existing literature, federated composite
optimization algorithms are designed only from an optimization perspective
without any statistical guarantees. In addition, they do not consider commonly
used (restricted) strong convexity in statistical recovery problems. We advance
the frontiers of this problem from both optimization and statistical
perspectives. On the optimization front, we propose a new algorithm named
\textit{Fast Federated Dual Averaging} for strongly convex and smooth loss and
establish state-of-the-art iteration and communication complexity in the
composite setting. In particular, we prove that it enjoys a fast rate, linear
speedup, and reduced communication rounds. On the statistical front, for
restricted strongly convex and smooth loss, we design another algorithm, namely
\textit{Multi-stage Federated Dual Averaging}, and prove a high probability
complexity bound with linear speedup up to optimal statistical precision.
Experiments on both synthetic and real data demonstrate that our methods
perform better than other baselines. To the best of our knowledge, this is the
first work providing fast optimization algorithms and statistical recovery
guarantees for composite problems in FL.
Comment: This is a revised version that fixes the imprecise statements about
linear speedup from the ICML proceedings. We use another averaging scheme for
the returned solutions in Theorems 2.1 and 3.1 to guarantee linear speedup
when the number of iterations is large.
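For illustration, here is a minimal Python sketch of one communication round of a dual-averaging update in the composite (Lasso) setting. It assumes exact server averaging and one local gradient per client; the acceleration, local-step schedule, and multi-stage restarts of the proposed algorithms are omitted:

    import numpy as np

    def soft_threshold(z, tau):
        # Proximal map of tau * ||.||_1, the non-smooth part in Lasso.
        return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

    def dual_averaging_round(dual_sum, client_grads, eta, lam, t):
        # The server averages client gradients into a running dual sum, then
        # recovers the primal iterate through a proximal map, which keeps the
        # iterates sparse (simplified regularized dual averaging).
        dual_sum = dual_sum + np.mean(client_grads, axis=0)
        x_next = soft_threshold(-eta * dual_sum, eta * lam * (t + 1))
        return x_next, dual_sum

Applying the proximal map to the averaged dual variable, rather than averaging proximal outputs across clients, is one reason dual-averaging-style methods preserve sparsity well in federated composite optimization.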
Towards More Scalable and Robust Machine Learning
For many data-intensive real-world applications, such as recognizing objects from images, detecting spam emails, and recommending items on retail websites, the most successful current approaches involve learning rich prediction rules from large datasets. There are many challenges in these machine learning tasks. For example, as the size of the datasets and the complexity of these prediction rules increase, there is a significant challenge in designing scalable methods that can effectively exploit the availability of distributed computing units. As another example, in many machine learning applications, there can be data corruptions, communication errors, and even adversarial attacks during training and testing. Therefore, to build reliable machine learning models, we also have to tackle the challenge of robustness in machine learning.

In this dissertation, we study several topics on scalability and robustness in large-scale learning, with a focus on establishing solid theoretical foundations for these problems, and we demonstrate recent progress towards the ambitious goal of building more scalable and robust machine learning models. We start with the speedup saturation problem in distributed stochastic gradient descent (SGD) algorithms with large mini-batches. We introduce the notion of gradient diversity, a metric of the dissimilarity between concurrent gradient updates, and show its key role in the convergence and generalization performance of mini-batch SGD. We then move forward to Byzantine distributed learning, a topic that involves both scalability and robustness in distributed learning. In the Byzantine setting that we consider, a fraction of distributed worker machines can have arbitrary or even adversarial behavior. We design statistically and computationally efficient algorithms to defend against Byzantine failures in distributed optimization with convex and non-convex objectives. Lastly, we discuss the adversarial example phenomenon. We provide a theoretical analysis of the adversarially robust generalization properties of machine learning models through the lens of Rademacher complexity.
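For concreteness, here is a short Python sketch of the gradient diversity metric described above, together with a coordinate-wise median aggregator as one standard example of a Byzantine-robust rule (illustrative sketches, not the dissertation's exact algorithms):

    import numpy as np

    def gradient_diversity(grads):
        # Ratio of the summed squared norms of concurrent gradients to the
        # squared norm of their sum; larger values mean more dissimilar
        # updates, the regime where large mini-batches keep their speedup.
        grads = np.asarray(grads)
        num = np.sum(np.linalg.norm(grads, axis=1) ** 2)
        den = np.linalg.norm(grads.sum(axis=0)) ** 2 + 1e-12
        return num / den

    def coordinate_median(worker_grads):
        # Coordinate-wise median: a robust aggregator that tolerates a
        # fraction of arbitrarily corrupted (Byzantine) worker gradients.
        return np.median(np.asarray(worker_grads), axis=0)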