2,090 research outputs found
Optimization Methods for Large-Scale Machine Learning
This paper provides a review and commentary on the past, present, and future
of numerical optimization algorithms in the context of machine learning
applications. Through case studies on text classification and the training of
deep neural networks, we discuss how optimization problems arise in machine
learning and what makes them challenging. A major theme of our study is that
large-scale machine learning represents a distinctive setting in which the
stochastic gradient (SG) method has traditionally played a central role while
conventional gradient-based nonlinear optimization techniques typically falter.
Based on this viewpoint, we present a comprehensive theory of a
straightforward, yet versatile SG algorithm, discuss its practical behavior,
and highlight opportunities for designing algorithms with improved performance.
This leads to a discussion about the next generation of optimization methods
for large-scale machine learning, including an investigation of two main
streams of research: techniques that diminish noise in the stochastic
directions, and methods that make use of second-order derivative approximations.
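A minimal sketch of the plain SG iteration the abstract centers on, applied to a noiseless least-squares toy problem; the problem, step size, and schedule here are illustrative assumptions, not the paper's case studies:

```python
import numpy as np

def sgd(grad_fn, w0, X, y, lr=0.05, epochs=30, seed=0):
    """Plain stochastic gradient: one uniformly sampled example per step."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float).copy()
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            w -= lr * grad_fn(w, X[i], y[i])
    return w

# Toy least-squares risk: F(w) = (1/n) sum_i 0.5 * (x_i . w - y_i)^2
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                        # noiseless, so SG can interpolate
grad = lambda w, x, yi: (x @ w - yi) * x
w_hat = sgd(grad, np.zeros(3), X, y)
```

On interpolating (noiseless) problems like this one, a constant step size suffices; the noise-reduction techniques the abstract mentions matter precisely when the stochastic directions do not vanish at the optimum.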
Deep Learning for Passive Synthetic Aperture Radar
We introduce a deep learning (DL) framework for inverse problems in imaging,
and demonstrate the advantages and applicability of this approach in passive
synthetic aperture radar (SAR) image reconstruction. We interpret image
reconstruction as a machine learning task and utilize deep networks as forward and
inverse solvers for imaging. Specifically, we design a recurrent neural network
(RNN) architecture as an inverse solver based on the iterations of proximal
gradient descent optimization methods. We further adapt the RNN architecture to
image reconstruction problems by transforming the network into a recurrent
auto-encoder, thereby allowing for unsupervised training. Our DL based inverse
solver is particularly suitable for a class of image formation problems in
which the forward model is only partially known. The ability to learn forward
models and hyperparameters, combined with the unsupervised training approach,
makes our recurrent auto-encoder suitable for real-world applications. We
demonstrate the performance of our method in passive SAR image reconstruction.
In this regime a source of opportunity, with unknown location and transmitted
waveform, is used to illuminate a scene of interest. We investigate recurrent
auto-encoder architectures based on the ℓ1- and ℓ0-constrained least-squares
problems. We present a projected stochastic gradient descent based training
scheme which incorporates constraints of the unknown model parameters. We
demonstrate through extensive numerical simulations that our DL based approach
outperforms conventional sparse coding methods in terms of computation and
reconstructed image quality, specifically when no information about the
transmitter is available.
Comment: Submitted to IEEE Journal of Selected Topics in Signal Processing
An Optimal Control Approach to Deep Learning and Applications to Discrete-Weight Neural Networks
Deep learning is formulated as a discrete-time optimal control problem. This
allows one to characterize necessary conditions for optimality and develop
training algorithms that do not rely on gradients with respect to the trainable
parameters. In particular, we introduce the discrete-time method of successive
approximations (MSA), which is based on Pontryagin's maximum principle, for
training neural networks. A rigorous error estimate for the discrete MSA is
obtained, which sheds light on its dynamics and the means to stabilize the
algorithm. The developed methods are applied to train, in a rather principled
way, neural networks with weights that are constrained to take values in a
discrete set. We obtain competitive performance and, interestingly, very sparse
weights in the case of ternary networks, which may be useful for model
deployment on low-memory devices.
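One discrete-time MSA sweep can be illustrated on an invented scalar system with ternary controls: a forward pass for the states, a backward pass for the costates, then per-step maximization of the Hamiltonian over the discrete set. This toy system is an assumption for illustration only, and a naive sweep like this can oscillate, which is the instability the paper's error estimate is designed to diagnose and stabilize:

```python
# Toy system: x_{t+1} = x_t + theta_t, with theta_t in {-1, 0, 1}
# and terminal cost (x_T - target)^2.  The Hamiltonian is
# H_t(x, p, theta) = p * (x + theta).

def msa_sweep(theta, x0, target):
    T = len(theta)
    # Forward pass: propagate the state under the current controls.
    x = [x0]
    for t in range(T):
        x.append(x[t] + theta[t])
    # Backward pass: p_T = -dPhi/dx(x_T); here df/dx = 1, so the
    # costate is constant over time.
    p = -2.0 * (x[T] - target)
    # Maximize the Hamiltonian independently at each step --
    # no gradient with respect to theta is ever needed.
    new_theta = [max((-1, 0, 1), key=lambda u: p * (x[t] + u))
                 for t in range(T)]
    return new_theta, x[T]

theta, xT = msa_sweep([0] * 5, x0=0.0, target=3.0)
```

Starting from all-zero controls, the terminal state is 0, the costate is positive, and the sweep switches every control to +1; iterating naive sweeps on this toy overshoots the target, motivating the stabilized (augmented) variants the paper develops.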
Maximum Principle Based Algorithms for Deep Learning
The continuous dynamical system approach to deep learning is explored in
order to devise alternative frameworks for training algorithms. Training is
recast as a control problem and this allows us to formulate necessary
optimality conditions in continuous time using Pontryagin's maximum
principle (PMP). A modification of the method of successive approximations is
then used to solve the PMP, giving rise to an alternative training algorithm
for deep learning. This approach has the advantage that rigorous error
estimates and convergence results can be established. We also show that it may
avoid some pitfalls of gradient-based methods, such as slow convergence on flat
landscapes near saddle points. Furthermore, we demonstrate that it obtains a
favorable initial per-iteration convergence rate, provided Hamiltonian
maximization can be carried out efficiently - a step which is still in need of
improvement. Overall, the approach opens up new avenues to attack problems
associated with deep learning, such as trapping in slow manifolds and
inapplicability of gradient-based methods for discrete trainable variables.
Comment: Published version
Conditional Gradient Method for Stochastic Submodular Maximization: Closing the Gap
In this paper, we study the problem of \textit{constrained} and
\textit{stochastic} continuous submodular maximization. Even though the
objective function is not concave (nor convex) and is defined in terms of an
expectation, we develop a variant of the conditional gradient method, called
\alg, which achieves a \textit{tight} approximation guarantee. More precisely,
for a monotone and continuous DR-submodular function and subject to a
\textit{general} convex body constraint, we prove that \alg achieves a
$[(1-1/e)\,\text{OPT} - \epsilon]$ guarantee (in expectation) with
$\mathcal{O}(1/\epsilon^3)$ stochastic gradient computations. This guarantee
matches the known hardness results and closes the gap between deterministic and
stochastic continuous submodular maximization. By using stochastic continuous
optimization as an interface, we also provide the first tight
approximation guarantee for maximizing a \textit{monotone but stochastic}
submodular \textit{set} function subject to a general matroid constraint.
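A sketch in the spirit of the method described (a stochastic Frank-Wolfe / continuous-greedy loop with an averaged gradient estimate) on a small monotone DR-submodular toy objective; the objective, constraint body, and averaging schedule below are illustrative assumptions, since the abstract does not spell out \alg:

```python
import numpy as np

# Monotone DR-submodular toy objective on [0,1]^n (coverage-style):
#   F(x) = 1 - prod_i (1 - x_i),   dF/dx_i = prod_{j != i} (1 - x_j).
# Convex body: {x in [0,1]^n : sum_i x_i <= k}.

def grad_F(x):
    return np.array([np.prod(np.delete(1.0 - x, i)) for i in range(len(x))])

def lmo(d, k):
    # Linear maximization oracle: unit mass on the k largest coords of d.
    v = np.zeros_like(d)
    v[np.argsort(-d)[:k]] = 1.0
    return v

def stochastic_continuous_greedy(n=4, k=2, T=200, noise=0.1, seed=0):
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    d = np.zeros(n)  # running average of stochastic gradients
    for t in range(1, T + 1):
        g = grad_F(x) + noise * rng.normal(size=n)  # stochastic gradient
        rho = 2.0 / (t + 3) ** (2.0 / 3)            # averaging weight
        d = (1.0 - rho) * d + rho * g
        x = x + lmo(d, k) / T                       # Frank-Wolfe step, size 1/T
    return x, 1.0 - np.prod(1.0 - x)

x_hat, val = stochastic_continuous_greedy()
```

The gradient averaging is what tames the stochasticity: the raw gradients are noisy, but the Frank-Wolfe direction is computed against their decaying-weight average, so after T steps the iterate lies in the constraint body with objective value comfortably above the (1-1/e) benchmark on this toy.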
DynaNewton - Accelerating Newton's Method for Machine Learning
Newton's method is a fundamental technique in optimization with quadratic
convergence within a neighborhood around the optimum. However, reaching this
neighborhood is often slow and dominates the computational costs. We exploit
two properties specific to empirical risk minimization problems to accelerate
Newton's method, namely, subsampling training data and increasing strong
convexity through regularization. We propose a novel continuation method, where
we define a family of objectives over increasing sample sizes and with
decreasing regularization strength. Solutions on this path are tracked such
that the minimizer of the previous objective is guaranteed to be within the
quadratic convergence region of the next objective to be optimized. Thereby
every Newton iteration is guaranteed to achieve super-linear contractions with
regard to the chosen objective, which becomes a moving target. We provide a
theoretical analysis that motivates our algorithm, called DynaNewton, and
characterizes its speed of convergence. Experiments on a wide range of data
sets and problems consistently confirm the predicted computational savings.
Gaussian Robust Classification
Supervised learning is all about the ability to generalize knowledge.
Specifically, the goal of the learning is to train a classifier using training
data, in such a way that it will be capable of classifying new unseen data
correctly. In order to achieve this goal, it is important to carefully design
the learner so that it will not overfit the training data. The latter is
usually done by adding a regularization term. The statistical learning theory
explains the success of this method by claiming that it restricts the
complexity of the learned model. This explanation, however, is rather abstract
and does not have a geometric intuition. The generalization error of a
classifier may be thought of as correlated with its robustness to perturbations
of the data: a classifier that copes with disturbance is expected to generalize
well. Indeed, Xu et al. [2009] have shown that the SVM formulation is
equivalent to a robust optimization (RO) formulation, in which an adversary
displaces the training and testing points within a ball of pre-determined
radius. In this work we explore a different kind of robustness, namely changing
each data point with a Gaussian cloud centered at the sample. Loss is evaluated
as the expectation of an underlying loss function on the cloud. This setup fits
the fact that in many applications, the data is sampled along with noise. We
develop an RO framework, in which the adversary chooses the covariance of the
noise. In our algorithm named GURU, the tuning parameter is a spectral bound on
the noise, so it can be estimated from physical or application-specific
considerations. Our experiments show that this framework performs as well as
SVM and even slightly better in some cases. Generalizations for Mercer kernels
and for the multiclass case are presented as well. We also show that our
framework may be further generalized, using the technique of convex perspective
functions.
Comment: Master's dissertation of the first author, carried out under the
supervision of the second author
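For a hinge loss, isotropic clouds, and a spectral bound sigma^2 on the noise, the expectation over the cloud has a closed form (since the expected loss grows with the noise scale, the worst covariance under the bound saturates it along w). The training loop below is only an illustrative sketch under those assumptions, not the GURU implementation, and omits the bias term and kernel/multiclass extensions:

```python
import numpy as np
from math import erf, exp, pi, sqrt

def Phi(u):  # standard normal CDF
    return 0.5 * (1.0 + erf(u / sqrt(2.0)))

def norm_pdf(u):  # standard normal density
    return exp(-0.5 * u * u) / sqrt(2.0 * pi)

def smoothed_hinge(w, X, y, sigma):
    """Expected hinge loss with each x_i replaced by the cloud
    x_i + N(0, sigma^2 I).  With m_i = 1 - y_i <w, x_i> and
    s = sigma * ||w||:  E[max(0, m + s*Z)] = m*Phi(m/s) + s*pdf(m/s),
    whose derivatives are dE/dm = Phi(m/s) and dE/ds = pdf(m/s)."""
    s = sigma * np.linalg.norm(w) + 1e-12
    loss, grad = 0.0, np.zeros_like(w)
    for xi, yi in zip(X, y):
        m = 1.0 - yi * (w @ xi)
        u = m / s
        loss += m * Phi(u) + s * norm_pdf(u)
        grad += Phi(u) * (-yi * xi) + norm_pdf(u) * (sigma ** 2) * w / s
    return loss / len(y), grad / len(y)

rng = np.random.default_rng(0)
n = 100
X = np.vstack([rng.normal(size=(n, 2)) + 2.0, rng.normal(size=(n, 2)) - 2.0])
y = np.r_[np.ones(n), -np.ones(n)]
w = np.array([0.1, 0.1])
for _ in range(300):
    _, g = smoothed_hinge(w, X, y, sigma=0.5)
    w -= 0.5 * g
acc = np.mean(np.sign(X @ w) == y)
```

Note how the s-dependent term acts as a data-driven regularizer: as the margin grows, the remaining gradient points back along w, so the noise bound sigma plays the role the abstract assigns to the tuning parameter.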
Identifying global optimality for dictionary learning
Learning new representations of input observations in machine learning is
often tackled using a factorization of the data. For many such problems,
including sparse coding and matrix completion, learning these factorizations
can be difficult, both in terms of efficiency and of guaranteeing that the
solution is a global minimum. Recently, a general class of objectives has been
introduced, which we term induced dictionary learning models (DLMs), that has
an induced convex form enabling global optimization. Though attractive
theoretically, this induced form is impractical, particularly for large or
growing datasets. In this work, we investigate the use of practical alternating
minimization algorithms for induced DLMs, that ensure convergence to global
optima. We characterize the stationary points of these models, and, using these
insights, highlight practical choices for the objectives. We then provide
theoretical and empirical evidence that alternating minimization, from a random
initialization, converges to global minima for a large subclass of induced
DLMs. In particular, we take advantage of the existence of the (potentially
unknown) convex induced form, to identify when stationary points are global
minima for the dictionary learning objective. We then provide an empirical
investigation into practical optimization choices for using alternating
minimization for induced DLMs, for both batch and stochastic gradient descent.
Comment: Updates to the previous version include a small modification to
Proposition 2, to only use normed regularizers, and a modification to the
main theorem (previously Theorem 13) to focus on the overcomplete, full rank
setting and to better characterize non-differentiable induced regularizers.
The theory has been significantly modified since version
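A minimal alternating-minimization loop for the sparse-coding instance (ISTA for the codes with the dictionary fixed, then least squares plus column normalization for the dictionary); this is a generic sketch, not the induced-DLM objectives or the global-optimality certificates studied in the paper:

```python
import numpy as np

def soft(z, t):
    # Soft-thresholding: proximal operator of the l1 norm.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def alt_min_dictionary(Y, k, lam=0.05, outer=30, inner=50, seed=0):
    """Alternating minimization for 0.5*||Y - D C||_F^2 + lam*||C||_1:
    update codes C by ISTA with D fixed, then update D by a
    ridge-stabilized least squares with C fixed, renormalizing
    dictionary columns to unit norm."""
    rng = np.random.default_rng(seed)
    d, n = Y.shape
    D = rng.normal(size=(d, k))
    D /= np.linalg.norm(D, axis=0)
    C = np.zeros((k, n))
    for _ in range(outer):
        step = 1.0 / (np.linalg.norm(D, 2) ** 2 + 1e-12)
        for _ in range(inner):
            C = soft(C - step * D.T @ (D @ C - Y), step * lam)
        D = np.linalg.solve(C @ C.T + 1e-8 * np.eye(k), C @ Y.T).T
        D /= np.maximum(np.linalg.norm(D, axis=0), 1e-8)
    return D, C

rng = np.random.default_rng(1)
D_true = rng.normal(size=(10, 5))
D_true /= np.linalg.norm(D_true, axis=0)
C_true = soft(rng.normal(size=(5, 200)), 1.0)  # sparse ground-truth codes
Y = D_true @ C_true
D, C = alt_min_dictionary(Y, k=5)
rel_err = np.linalg.norm(Y - D @ C) / np.linalg.norm(Y)
```

Each alternation decreases the (nonconvex) joint objective; the point of the paper is that for induced DLMs the hidden convex form lets one certify when such stationary points are in fact global minima.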
User-Centric Joint Access-Backhaul Design for Full-Duplex Self-Backhauled Wireless Networks
Full-duplex self-backhauling is promising to provide cost-effective and
flexible backhaul connectivity for ultra-dense wireless networks, but also
poses a great challenge to resource management between the access and backhaul
links. In this paper, we propose a user-centric joint access-backhaul
transmission framework for full-duplex self-backhauled wireless networks. In
the access link, user-centric clustering is adopted so that each user is
cooperatively served by multiple small base stations (SBSs). In the backhaul
link, user-centric multicast transmission is proposed so that each user's
message is treated as a common message and multicast to its serving SBS
cluster. We first formulate an optimization problem to maximize the network
weighted sum rate through joint access-backhaul beamforming and SBS clustering
when global channel state information (CSI) is available. This problem is
efficiently solved via the successive lower-bound maximization approach with a
novel approximate objective function and the iterative link removal technique.
We then extend the study to the stochastic joint access-backhaul beamforming
optimization with partial CSI. Simulation results demonstrate the effectiveness
of the proposed algorithms for both full CSI and partial CSI scenarios. They
also show that the transmission design with partial CSI can greatly reduce the
CSI overhead with little performance degradation.
Comment: to appear in IEEE Trans. on Communications
Iteratively-Reweighted Least-Squares Fitting of Support Vector Machines: A Majorization--Minimization Algorithm Approach
Support vector machines (SVMs) are an important tool in modern data analysis.
Traditionally, support vector machines have been fitted via quadratic
programming, either using purpose-built or off-the-shelf algorithms. We present
an alternative approach to SVM fitting via the majorization--minimization (MM)
paradigm. Algorithms derived via MM constructions can be shown to
monotonically decrease their objectives at each iteration, as well as to be
globally convergent to stationary points. We demonstrate the construction of
iteratively-reweighted least-squares (IRLS) algorithms, via the MM paradigm,
for SVM risk minimization problems involving the hinge, least-squares,
squared-hinge, and logistic losses, and 1-norm, 2-norm, and elastic net
penalizations. Successful implementations of our algorithms are presented via
some numerical examples.
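The MM/IRLS construction can be made concrete for the hinge loss with a 2-norm penalty: writing hinge(u) = (|1-u| + (1-u))/2 and applying the standard quadratic majorizer of |v|, each MM step becomes a weighted ridge least-squares solve. The data and parameter choices below are a toy illustration, not the paper's experiments:

```python
import numpy as np

def irls_svm(X, y, lam=0.01, iters=50, eps_min=1e-6):
    """IRLS via MM for  min_w  sum_i hinge(y_i w.x_i) + lam*||w||^2.
    With v_i = 1 - y_i w.x_i, majorizing |v| <= v^2/(2*|v_k|) + |v_k|/2
    turns each MM step into a weighted ridge regression, so the true
    objective is (numerically) non-increasing across iterations."""
    n, d = X.shape
    Z = X * y[:, None]                # z_i = y_i x_i
    w = np.zeros(d)
    obj = []
    for _ in range(iters):
        eps = np.maximum(np.abs(1.0 - Z @ w), eps_min)
        c = 1.0 / (2.0 * eps)         # IRLS weights
        t = 1.0 + eps                 # working responses
        A = (Z.T * c) @ Z + 2.0 * lam * np.eye(d)
        w = np.linalg.solve(A, Z.T @ (c * t))
        obj.append(np.sum(np.maximum(0.0, 1.0 - Z @ w)) + lam * w @ w)
    return w, obj

rng = np.random.default_rng(0)
n = 100
X = np.vstack([rng.normal(size=(n, 2)) + 1.5, rng.normal(size=(n, 2)) - 1.5])
y = np.r_[np.ones(n), -np.ones(n)]
w_hat, obj = irls_svm(X, y)
acc = np.mean(np.sign(X @ w_hat) == y)
```

The clamp `eps_min` keeps the weights finite when a point sits exactly on the margin; it relaxes the majorizer's tightness by at most eps_min/2 per sample, which is why the descent guarantee holds only up to a tiny tolerance.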