Primal Method for ERM with Flexible Mini-batching Schemes and Non-convex Losses
In this work we develop a new algorithm for regularized empirical risk
minimization. Our method extends recent techniques of Shalev-Shwartz (2015),
which enable a dual-free analysis of SDCA, to arbitrary mini-batching schemes.
Moreover, our method is able to better utilize the information in the data
defining the ERM problem. For convex loss functions, our complexity results
match those of QUARTZ, which is a primal-dual method also allowing for
arbitrary mini-batching schemes. The advantage of a dual-free analysis comes
from the fact that it guarantees convergence even for non-convex loss
functions, as long as the average loss is convex. We illustrate through
experiments the utility of being able to design arbitrary mini-batching
schemes.
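As an illustration of the mechanism this abstract builds on, a minimal dual-free SDCA-style update with uniform mini-batch sampling could look as follows (logistic loss; the step size, the sampling scheme, and the per-example weighting are simplifying assumptions for this sketch, not the exact method analysed in the paper):

    import numpy as np

    def dual_free_sdca_sketch(X, y, lam=0.1, eta=0.05, batch=8, epochs=30, seed=0):
        # Sketch of a dual-free SDCA-style method with uniform mini-batching.
        # Problem: min_w (1/n) sum_i phi_i(w) + (lam/2)||w||^2, with logistic
        # loss phi_i(w) = log(1 + exp(-y_i <x_i, w>)).
        rng = np.random.default_rng(seed)
        n, d = X.shape
        alpha = np.zeros((n, d))              # one pseudo-dual vector per example
        w = alpha.sum(axis=0) / (lam * n)     # invariant: w = (1/(lam*n)) sum_i alpha_i
        p = batch / n                         # inclusion probability of each example
        for _ in range(epochs * (n // batch)):
            S = rng.choice(n, size=batch, replace=False)
            grads = {i: -y[i] * X[i] / (1.0 + np.exp(y[i] * (X[i] @ w))) + alpha[i] for i in S}
            for i, g in grads.items():
                alpha[i] -= (eta * lam / p) * g   # pseudo-dual update
                w -= (eta / (n * p)) * g          # keeps the invariant on w
        return w

In expectation the step direction equals the full regularized gradient, which is what makes the dual-free analysis possible even without convexity of the individual losses.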
Data sampling strategies in stochastic algorithms for empirical risk minimization
Gradient descent methods and especially their stochastic variants have become highly popular
in the last decade due to their efficiency on big data optimization problems. In this thesis
we present the development of data sampling strategies for these methods. In the first four
chapters we focus on four views on the sampling for convex problems, developing and analyzing
new state-of-the-art methods using non-standard data sampling strategies. Finally, in the last
chapter we present a more flexible framework, which generalizes to more problems as well as
more sampling rules.
In the first chapter we propose an adaptive variant of stochastic dual coordinate ascent
(SDCA) for solving the regularized empirical risk minimization (ERM) problem. Our modification consists in allowing the method to adaptively change the probability distribution over
the dual variables throughout the iterative process. AdaSDCA achieves a provably better complexity
bound than SDCA with the best fixed probability distribution, known as importance
sampling. However, AdaSDCA is primarily of theoretical interest, as it is expensive to implement. We also
propose AdaSDCA+: a practical variant which in our experiments outperforms existing non-adaptive
methods.
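The adaptive ingredient can be illustrated with a toy rule that re-weights the dual coordinates by the magnitude of their current dual residuals; AdaSDCA's actual probabilities also involve per-coordinate constants and a different update, so this is only a sketch of the idea:

    import numpy as np

    def adaptive_probabilities(dual_residuals, floor=1e-12):
        # Toy adaptive sampling rule: sample dual coordinate i with probability
        # proportional to |kappa_i|, recomputed as the iterates change.
        r = np.abs(dual_residuals) + floor    # floor avoids zero-probability coordinates
        return r / r.sum()

    # usage sketch: i = rng.choice(n, p=adaptive_probabilities(kappa)) before each SDCA step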
In the second chapter we extend the dual-free analysis of SDCA to arbitrary mini-batching
schemes. Our method is able to better utilize the information in the data defining the ERM
problem. For convex loss functions, our complexity results match those of QUARTZ, which is
a primal-dual method also allowing for arbitrary mini-batching schemes. The advantage of a
dual-free analysis comes from the fact that it guarantees convergence even for non-convex loss
functions, as long as the average loss is convex. We illustrate through experiments the utility
of being able to design arbitrary mini-batching schemes.
In the third chapter we study importance sampling of minibatches. Minibatching is a
well-studied and highly popular technique in supervised learning, used by practitioners due
to its ability to accelerate training through better utilization of parallel processing power and
reduction of stochastic variance. Another popular technique is importance sampling: a strategy
for preferential sampling of more important examples, also capable of accelerating the training
process. However, despite considerable effort by the community in these areas, and due to the
inherent technical difficulty of the problem, there is no existing work combining the power of
importance sampling with the strength of minibatching. In this chapter we propose the first
importance sampling for minibatches and give a simple and rigorous complexity analysis of its
performance. We illustrate on synthetic problems that for training data of certain properties,
our sampling can lead to several orders of magnitude improvement in training time. We then
test the new sampling on several popular datasets, and show that the improvement can reach
an order of magnitude.
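One simple way to combine the two ingredients, shown purely as an illustration, is independent sampling in which example i enters the mini-batch with probability proportional to a per-example difficulty score (here the squared row norm, a common proxy for the smoothness constant); the sampling scheme actually analysed in this chapter differs:

    import numpy as np

    def importance_minibatch(X, tau, rng):
        # Each example i is included independently with probability p_i
        # proportional to ||x_i||^2, scaled so the expected batch size is at most tau.
        scores = np.einsum('ij,ij->i', X, X)             # per-example squared norms
        p = np.minimum(1.0, tau * scores / scores.sum())
        mask = rng.random(len(p)) < p                    # independent inclusions
        return np.flatnonzero(mask), p[mask]             # batch indices and their inclusion
                                                         # probabilities (for 1/p_i reweighting)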
In the fourth chapter we ask whether randomized coordinate descent (RCD) methods should
be applied to the ERM problem or rather to its dual. When the number of examples (n) is
much larger than the number of features (d), a common strategy is to apply RCD to the dual
problem. On the other hand, when the number of features is much larger than the number of
examples, it makes sense to apply RCD directly to the primal problem. In this chapter we provide
the first joint study of these two approaches when applied to L2-regularized ERM. First, we
show through a rigorous analysis that for dense data, the above intuition is precisely correct.
However, we find that for sparse and structured data, primal RCD can significantly outperform
dual RCD even if d ≪ n, and vice versa, dual RCD can be much faster than primal RCD even
if n ≫ d. Moreover, we show that, surprisingly, a single sampling strategy minimizes both the
(bound on the) number of iterations and the overall expected complexity of RCD. Note that
the latter complexity measure also takes into account the average cost of the iterations, which
depends on the structure and sparsity of the data, and on the sampling strategy employed. We
confirm our theoretical predictions using extensive experiments with both synthetic and real
data sets.
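The point that the overall complexity also depends on the per-iteration cost can be made concrete with a rough cost model: primal RCD touches one feature column of the data matrix per step, dual RCD one example row. The numbers below are a toy illustration, not the precise complexity measure used in the chapter:

    import numpy as np
    import scipy.sparse as sp

    def expected_iteration_cost(A, sampling="uniform"):
        # A is the n x d data matrix (examples by features).
        cols = sp.csc_matrix(A)
        rows = sp.csr_matrix(A)
        col_nnz = np.diff(cols.indptr)        # work of one primal RCD step, per feature
        row_nnz = np.diff(rows.indptr)        # work of one dual RCD step, per example
        if sampling == "uniform":
            return col_nnz.mean(), row_nnz.mean()
        # sampling proportional to nnz is one non-uniform alternative
        pc = col_nnz / col_nnz.sum()
        pr = row_nnz / row_nnz.sum()
        return (pc * col_nnz).sum(), (pr * row_nnz).sum()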
In the last chapter we introduce two novel generalizations of the theory for gradient descent
type methods in the proximal setting. Firstly, we introduce the proportion function, which
we further use to analyze all the known block-selection rules for coordinate descent methods
under a single framework. This framework includes randomized methods with uniform, non-uniform
or even adaptive sampling strategies, as well as deterministic methods with batch,
greedy or cyclic selection rules. We additionally introduce a novel block selection technique
called greedy minibatches, for which we provide competitive convergence guarantees. Secondly,
the whole theory of strongly-convex optimization was recently generalized to a specific class
of non-convex functions satisfying the so-called Polyak-Łojasiewicz condition. To mirror this
generalization in the weakly convex case, we introduce the Weak Polyak-Łojasiewicz condition,
using which we give global convergence guarantees for a class of non-convex functions previously
not considered in theory. Additionally, we give local convergence guarantees for an even larger
class of non-convex functions satisfying only a certain smoothness assumption. By combining
the two above mentioned generalizations we recover the state-of-the-art convergence guarantees
for a large class of previously known methods and setups as special cases of our framework.
Also, we provide new guarantees for many previously not considered combinations of methods
and setups, as well as a huge class of novel non-convex objectives. The flexibility of our approach
offers a lot of potential for future research, as any new block selection procedure will have a
convergence guarantee for all objectives considered in our framework, while any new objective
analyzed under our approach will have a whole fleet of block selection rules with convergence
guarantees readily available.
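For reference, a differentiable function f with minimum value f^* satisfies the standard Polyak-Łojasiewicz condition with parameter mu > 0 if

    \frac{1}{2} \|\nabla f(x)\|^2 \ge \mu \, (f(x) - f^*)   for all x,

which already suffices for linear convergence of gradient descent without convexity. The weak variant introduced in this chapter relaxes this inequality; its exact form is not reproduced here.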
Distributed optimization with arbitrary local solvers
With the growth of data and necessity for distributed optimization methods,
solvers that work well on a single machine must be re-designed to leverage
distributed computation. Recent work in this area has been limited by focusing
heavily on developing highly specific methods for the distributed environment.
These special-purpose methods are often unable to fully leverage the
competitive performance of their well-tuned and customized single machine
counterparts. Further, they are unable to easily integrate improvements that
continue to be made to single machine methods. To this end, we present a
framework for distributed optimization that both allows the flexibility of
arbitrary solvers to be used on each (single) machine locally, and yet
maintains competitive performance against other state-of-the-art
special-purpose distributed methods. We give strong primal-dual convergence
rate guarantees for our framework that hold for arbitrary local solvers. We
demonstrate the impact of local solver selection both theoretically and in an
extensive experimental comparison. Finally, we provide thorough implementation
details for our framework, highlighting areas for practical performance gains
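The shape of such a framework can be sketched in a few lines: in each round, every machine runs an arbitrary local solver for a fixed budget on its own partition of the data and returns a local update, which the server then combines. The local subproblem definition and the safe aggregation parameter of the actual framework are omitted here, so this is only a structural sketch:

    import numpy as np

    def distributed_round(w, partitions, local_solver, aggregate="average"):
        # partitions: list of (X_k, y_k) held by machine k.
        # local_solver: any callable (w, X_k, y_k) -> delta_w_k, run to any accuracy.
        deltas = [local_solver(w, Xk, yk) for Xk, yk in partitions]   # in parallel in practice
        if aggregate == "average":
            return w + np.mean(deltas, axis=0)    # conservative averaging of local updates
        return w + np.sum(deltas, axis=0)         # adding, valid with a suitably scaled local subproblem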
Federated Optimization: Distributed Machine Learning for On-Device Intelligence
We introduce a new and increasingly relevant setting for distributed
optimization in machine learning, where the data defining the optimization are
unevenly distributed over an extremely large number of nodes. The goal is to
train a high-quality centralized model. We refer to this setting as Federated
Optimization. In this setting, communication efficiency is of the utmost
importance and minimizing the number of rounds of communication is the
principal goal.
A motivating example arises when we keep the training data locally on users'
mobile devices instead of logging it to a data center for training. In
federated optimization, the devices are used as compute nodes performing
computation on their local data in order to update a global model. We suppose
that we have an extremely large number of devices in the network, as many as
the number of users of a given service, each of which has only a tiny fraction
of the total data available. In particular, we expect the number of data points
available locally to be much smaller than the number of devices. Additionally,
since different users generate data with different patterns, it is reasonable
to assume that no device has a representative sample of the overall
distribution.
We show that existing algorithms are not suitable for this setting, and
propose a new algorithm which shows encouraging experimental results for sparse
convex problems. This work also sets a path for future research needed in the
context of federated optimization.
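The communication pattern described above can be sketched generically: in each round a small fraction of devices receive the current model, run a local update on their own small, non-representative datasets, and the server combines the results weighted by local data size. This is a federated-averaging-style illustration of the setting, not the specific algorithm proposed in the paper:

    import numpy as np

    def federated_round(w, devices, local_update, participation=0.1, seed=0):
        # devices: list of (X_c, y_c) local datasets, one per device.
        # local_update: callable (w, X_c, y_c) -> updated local model,
        # e.g. a few epochs of local SGD started from w.
        rng = np.random.default_rng(seed)
        k = max(1, int(participation * len(devices)))
        chosen = rng.choice(len(devices), size=k, replace=False)
        local_models, sizes = [], []
        for c in chosen:
            Xc, yc = devices[c]
            local_models.append(local_update(w, Xc, yc))
            sizes.append(len(yc))
        weights = np.asarray(sizes, dtype=float) / sum(sizes)
        return sum(wt * m for wt, m in zip(weights, local_models))   # data-size-weighted average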