    Scalable Projection-Free Optimization

    As a projection-free algorithm, the Frank-Wolfe (FW) method, also known as conditional gradient, has recently received considerable attention in the machine learning community. In this dissertation, we study several topics on FW variants for scalable projection-free optimization. We first propose 1-SFW, the first projection-free method that requires only one sample per iteration to update the optimization variable and yet achieves the best known complexity bounds for convex, non-convex, and monotone DR-submodular settings. We then move to the distributed setting and develop Quantized Frank-Wolfe (QFW), a general communication-efficient distributed FW framework for both convex and non-convex objective functions. We study the performance of QFW in two widely recognized settings: 1) stochastic optimization and 2) finite-sum optimization. Finally, we propose Black-Box Continuous Greedy, a derivative-free and projection-free algorithm that maximizes a monotone continuous DR-submodular function over a bounded convex body in Euclidean space.
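
    For readers unfamiliar with the projection-free idea, the following is a minimal sketch of the classical deterministic Frank-Wolfe iteration on the probability simplex. It is not 1-SFW, QFW, or Black-Box Continuous Greedy; the objective, the vector `b`, and the function name `frank_wolfe_simplex` are assumptions chosen for illustration.

```python
def frank_wolfe_simplex(grad, dim, num_iters=200):
    """Classical Frank-Wolfe on the probability simplex (illustrative only)."""
    x = [1.0 / dim] * dim  # feasible start: the uniform distribution
    for t in range(num_iters):
        g = grad(x)
        # Linear minimization oracle: over the simplex, <g, s> is minimized
        # at the vertex e_i with the smallest gradient coordinate.
        i = min(range(dim), key=lambda j: g[j])
        gamma = 2.0 / (t + 2.0)  # standard open-loop step size
        # A convex combination of feasible points stays feasible,
        # so no projection step is needed.
        x = [(1.0 - gamma) * xj for xj in x]
        x[i] += gamma
    return x


# Hypothetical objective: f(x) = 0.5 * ||x - b||^2 with b inside the simplex.
b = [0.2, 0.5, 0.3]
grad = lambda x: [xj - bj for xj, bj in zip(x, b)]
x_hat = frank_wolfe_simplex(grad, dim=3)
```

    Each iterate is a convex combination of simplex vertices, which is why the method never leaves the feasible set.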

    Efficient-Adam: Communication-Efficient Distributed Adam

    Distributed adaptive stochastic gradient methods have been widely used for large-scale nonconvex optimization, such as training deep learning models. However, their communication complexity for finding ε-stationary points has rarely been analyzed in the nonconvex setting. In this work, we present a novel communication-efficient distributed Adam in the parameter-server model for stochastic nonconvex optimization, dubbed Efficient-Adam. Specifically, we incorporate a two-way quantization scheme into Efficient-Adam to reduce the communication cost between the workers and the server. Simultaneously, we adopt a two-way error feedback strategy to reduce the biases caused by the two-way quantization on both the server and the workers. In addition, we establish the iteration complexity of the proposed Efficient-Adam for a class of quantization operators, and further characterize its communication complexity between the server and workers when an ε-stationary point is achieved. Finally, we apply Efficient-Adam to solve a toy stochastic convex optimization problem and to train deep learning models on real-world vision and language tasks. Extensive experiments together with a theoretical guarantee justify the merits of Efficient-Adam. Comment: IEEE Transactions on Signal Processing
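
    The interplay of quantization and error feedback described above can be sketched in a few lines. This is a generic one-directional illustration, not the Efficient-Adam algorithm: the scaled-sign quantizer, the class name `ErrorFeedback`, and the repeated gradient `g` are all assumptions. A two-way scheme would run one such compressor on each worker and one on the server.

```python
def quantize(v):
    """Scaled-sign quantizer: one sign per coordinate plus a shared scale."""
    scale = sum(abs(a) for a in v) / len(v)
    return [scale if a >= 0 else -scale for a in v]


class ErrorFeedback:
    """Stores the quantization residual and adds it back on the next call."""

    def __init__(self, dim):
        self.residual = [0.0] * dim

    def compress(self, v):
        # Correct the input with the error left over from the previous round.
        corrected = [a + r for a, r in zip(v, self.residual)]
        q = quantize(corrected)
        # Remember what the quantizer failed to transmit this round.
        self.residual = [c - qi for c, qi in zip(corrected, q)]
        return q


ef = ErrorFeedback(dim=3)
g = [0.9, -0.1, 0.2]  # hypothetical gradient, reused every round
rounds = 50
total_sent = [0.0] * 3
for _ in range(rounds):
    q = ef.compress(g)
    total_sent = [s + qi for s, qi in zip(total_sent, q)]
```

    By construction, the sum of the transmitted vectors plus the final residual equals the sum of the true inputs, so the quantization error is deferred to later rounds instead of being discarded.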

    Bregman Proximal Method for Efficient Communications under Similarity

    We propose a novel distributed method for monotone variational inequalities and convex-concave saddle point problems arising in various machine learning applications such as game theory and adversarial training. By exploiting similarity, our algorithm overcomes the communication bottleneck, which is a major issue in distributed optimization. The proposed algorithm enjoys an optimal communication complexity of δ/ϵ, where ϵ measures the non-optimality gap function and δ is a similarity parameter. All existing distributed algorithms that achieve this bound essentially use the Euclidean setup. In contrast, our algorithm is built upon Bregman proximal maps and is compatible with an arbitrary Bregman divergence. As a result, it has more flexibility to fit the problem geometry than algorithms with the Euclidean setup, and the proposed method bridges the gap between the Euclidean and non-Euclidean settings. By using the restart technique, we extend our algorithm to variational inequalities with a μ-strongly monotone operator, resulting in an optimal communication complexity of δ/μ (up to a logarithmic factor). Our theoretical results are confirmed by numerical experiments on a stochastic matrix game. Comment: 14 pages
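
    As a concrete instance of a Bregman proximal map, the sketch below takes the Kullback-Leibler divergence on the probability simplex, for which the proximal step has a closed-form multiplicative update. It only illustrates what "fitting the problem geometry" can mean; it is not the distributed method of the abstract, and the function name `entropic_prox_step`, the vector `g`, and the step size `eta` are assumptions.

```python
import math


def entropic_prox_step(x, g, eta):
    """Solve argmin_u  eta * <g, u> + KL(u || x)  over the probability simplex.

    With the KL divergence as the Bregman divergence, the minimizer is
    u_i proportional to x_i * exp(-eta * g_i).
    """
    w = [xi * math.exp(-eta * gi) for xi, gi in zip(x, g)]
    z = sum(w)  # normalizing constant keeps the result on the simplex
    return [wi / z for wi in w]


x = [1.0 / 3.0] * 3      # current point: the uniform distribution
g = [1.0, 0.0, 2.0]      # hypothetical operator or gradient value at x
y = entropic_prox_step(x, g, eta=0.5)
```

    Unlike a Euclidean projection step, this update keeps every coordinate strictly positive and needs no explicit projection onto the simplex.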

    Fast Composite Optimization and Statistical Recovery in Federated Learning

    As a prevalent distributed learning paradigm, Federated Learning (FL) trains a global model on a massive number of devices with infrequent communication. This paper investigates a class of composite optimization and statistical recovery problems in the FL setting, whose loss function consists of a data-dependent smooth loss and a non-smooth regularizer. Examples include sparse linear regression using Lasso and low-rank matrix recovery using nuclear norm regularization. In the existing literature, federated composite optimization algorithms are designed only from an optimization perspective, without any statistical guarantees. In addition, they do not consider the (restricted) strong convexity commonly used in statistical recovery problems. We advance the frontiers of this problem from both optimization and statistical perspectives. On the optimization front, we propose a new algorithm named Fast Federated Dual Averaging for strongly convex and smooth loss and establish state-of-the-art iteration and communication complexity in the composite setting. In particular, we prove that it enjoys a fast rate, linear speedup, and reduced communication rounds. On the statistical front, for restricted strongly convex and smooth loss, we design another algorithm, namely Multi-stage Federated Dual Averaging, and prove a high-probability complexity bound with linear speedup up to optimal statistical precision. Experiments on both synthetic and real data demonstrate that our methods perform better than other baselines. To the best of our knowledge, this is the first work providing fast optimization algorithms and statistical recovery guarantees for composite problems in FL. Comment: This is a revised version to fix the imprecise statements about linear speedup from the ICML proceedings. We use another averaging scheme for the returned solutions in Theorem 2.1 and 3.1 to guarantee linear speedup when the number of iterations is large
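
    To make the composite structure (smooth loss plus non-smooth regularizer) concrete, the sketch below shows the proximal operator of the L1 norm used in Lasso, followed by a single proximal gradient step. This is a generic single-machine building block and not the Fast or Multi-stage Federated Dual Averaging algorithms; the names `soft_threshold` and `prox_grad_step` and all numeric values are assumptions.

```python
def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1: shrink each coordinate toward 0."""
    out = []
    for a in v:
        if a > tau:
            out.append(a - tau)
        elif a < -tau:
            out.append(a + tau)
        else:
            out.append(0.0)  # small coordinates are set exactly to zero
    return out


def prox_grad_step(w, grad_w, step, lam):
    """One proximal gradient step for  smooth_loss(w) + lam * ||w||_1."""
    moved = [wi - step * gi for wi, gi in zip(w, grad_w)]
    return soft_threshold(moved, step * lam)


shrunk = soft_threshold([3.0, -0.5, 1.0], 1.0)
w_next = prox_grad_step([1.0, -2.0], [0.5, -0.5], step=1.0, lam=1.0)
```

    The exact zeros produced by the thresholding are what give Lasso-type estimators their sparsity.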