Large Scale Constrained Linear Regression Revisited: Faster Algorithms via Preconditioning
In this paper, we revisit the large-scale constrained linear regression
problem and propose faster methods based on some recent developments in
sketching and optimization. Our algorithms combine (accelerated) mini-batch SGD
with a new method called two-step preconditioning to achieve an approximate
solution with a time complexity lower than that of the state-of-the-art
techniques for the low precision case. Our idea can also be extended to the
high precision case, which gives an alternative implementation to the Iterative
Hessian Sketch (IHS) method with significantly improved time complexity.
Experiments on benchmark and synthetic datasets suggest that our methods indeed
outperform existing ones considerably in both the low and high precision cases.
Comment: Appears in AAAI-1
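The sketch-and-precondition idea behind such two-step approaches can be illustrated in a few lines of NumPy. This is a minimal sketch under simplifying assumptions, not the paper's algorithm: it uses a dense Gaussian sketch, plain full-batch gradient descent in place of (accelerated) mini-batch SGD, and drops the constraint set entirely; all sizes and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy overdetermined least-squares instance (the constraint set is dropped
# here; all sizes are made up for illustration).
n, d = 500, 20
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

# Step 1: sketch A, and use the R factor of the sketched matrix as a right
# preconditioner, so that A @ inv(R) has condition number close to 1.
s = 5 * d                                   # sketch size: a small multiple of d
S = rng.standard_normal((s, n)) / np.sqrt(s)
_, R = np.linalg.qr(S @ A)

# Step 2: run a first-order method on the preconditioned problem
# min_z ||A R^{-1} z - b||_2 (full-batch gradient descent stands in for SGD).
Ap = A @ np.linalg.inv(R)
z = np.zeros(d)
step = 0.25                                 # safe for the preconditioned system
for _ in range(200):
    z -= step * (Ap.T @ (Ap @ z - b))
x = np.linalg.solve(R, z)                   # undo the change of variables

x_star = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.linalg.norm(x - x_star))
```

Because the preconditioned matrix is nearly orthonormal, the first-order iteration converges at a rate independent of the conditioning of the original `A`, which is the point of the preconditioning step.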
Randomized and Deterministic Attention Sparsification Algorithms for Over-parameterized Feature Dimension
Large language models (LLMs) have shown their power in different areas.
Attention computation, as an important subroutine of LLMs, has also attracted
interest in theory. Recently, the static computation and dynamic maintenance of
the attention matrix have been studied by [Alman and Song 2023] and [Brand, Song and
Zhou 2023] from both the algorithmic and the hardness perspectives. In this
work, we consider the sparsification of the attention problem. We make one
simplifying assumption: the logit matrix is symmetric. Let $n$ denote the
length of the sentence and let $d$ denote the embedding dimension. Given a matrix $X \in \mathbb{R}^{n \times d}$, suppose $d \gg n$ and $\| X X^\top \|_{\infty} \leq r$ for some $r > 0$; we then aim to find $Y \in \mathbb{R}^{n \times m}$ (where $m \ll d$) such that \begin{align*} \| D(Y)^{-1} \exp( Y Y^\top ) -
D(X)^{-1} \exp( X X^\top) \|_{\infty} \leq O(r) \end{align*} We provide two
results for this problem.
Our first result is a randomized algorithm. It runs in
$O(\mathrm{nnz}(X) + n^{\omega})$ time, succeeds with
probability $1 - \delta$, and chooses $m = O(n \log(n/\delta))$. Here $\mathrm{nnz}(X)$
denotes the number of non-zero entries in $X$. We use $\omega$ to denote the
exponent of matrix multiplication. Currently $\omega \approx 2.37$.
Our second result is a deterministic algorithm. It runs in
$O(\min\{\sum_{i \in [d]} \mathrm{nnz}(X_i)^2, \, d n^{\omega - 1}\})$ time and chooses $m = O(n)$. Here $X_i$ denotes the $i$-th column
of matrix $X$.
Our main findings have the following implication for applied LLM tasks: for
any super large feature dimension, we can reduce it down to a size nearly
linear in the length of the sentence.
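The goal stated above can be illustrated with a toy experiment. The code below is only an illustrative stand-in, not either of the paper's algorithms: it compresses the feature dimension with a plain Gaussian Johnson-Lindenstrauss projection (all names and sizes are hypothetical) and measures the entrywise error between the two attention matrices.

```python
import numpy as np

rng = np.random.default_rng(1)

def attention(M):
    """Row-normalized attention matrix D(M)^{-1} exp(M M^T)."""
    E = np.exp(M @ M.T)
    return E / E.sum(axis=1, keepdims=True)

# Over-parameterized regime: feature dimension d far exceeds sentence length n.
n, d, m = 8, 4096, 256
X = rng.standard_normal((n, d)) / np.sqrt(d)   # rows of roughly unit norm

# Compress the feature dimension with a scaled Gaussian JL projection.
G = rng.standard_normal((d, m)) / np.sqrt(m)
Y = X @ G                                      # n x m, with m << d

# Entrywise (infinity-norm) error between the two attention matrices.
err = np.abs(attention(Y) - attention(X)).max()
print(err)
```

Since the JL projection approximately preserves the inner products $X X^\top$, the two row-stochastic attention matrices stay entrywise close even though the feature dimension shrank from 4096 to 256.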
Large-scale Binary Quadratic Optimization Using Semidefinite Relaxation and Applications
In computer vision, many problems such as image segmentation, pixel
labelling, and scene parsing can be formulated as binary quadratic programs
(BQPs). For submodular problems, cut-based methods can be employed to
efficiently solve large-scale problems. However, general nonsubmodular problems
are significantly more challenging to solve. Finding a solution to a problem
large enough to be of practical interest, however, typically
requires relaxation. Two standard relaxation methods are widely used for
solving general BQPs--spectral methods and semidefinite programming (SDP), each
with their own advantages and disadvantages. Spectral relaxation is simple and
easy to implement, but its bound is loose. Semidefinite relaxation has a
tighter bound, but its computational complexity is high, especially for large
scale problems. In this work, we present a new SDP formulation for BQPs, with
two desirable properties. First, it has a similar relaxation bound to
conventional SDP formulations. Second, compared with conventional SDP methods,
the new SDP formulation leads to a significantly more efficient and scalable
dual optimization approach, which has the same degree of complexity as spectral
methods. We then propose two solvers, namely, quasi-Newton and smoothing Newton
methods, for the dual problem. Both are significantly more efficient
than standard interior-point methods. In practice, the smoothing Newton solver
is faster than the quasi-Newton solver for dense or medium-sized problems,
while the quasi-Newton solver is preferable for large sparse/structured
problems. Our experiments on a few computer vision applications including
clustering, image segmentation, co-segmentation and registration show the
potential of our SDP formulation for solving large-scale BQPs.
Comment: Fixed some typos. 18 pages. Accepted to IEEE Transactions on Pattern
Analysis and Machine Intelligence
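The spectral relaxation mentioned above (simple to implement, but with a loose bound) is easy to demonstrate on a tiny instance. This is generic textbook spectral rounding, not the paper's SDP formulation; a brute-force optimum is included for comparison since the instance is small.

```python
import itertools

import numpy as np

rng = np.random.default_rng(2)

# A tiny binary quadratic program: maximize x^T A x over x in {-1, +1}^n.
n = 12
B = rng.standard_normal((n, n))
A = (B + B.T) / 2                      # symmetrize the cost matrix

# Spectral relaxation: enlarge {-1,+1}^n to the sphere ||x||^2 = n; the
# relaxed maximizer is the leading eigenvector, rounded back via signs.
w, V = np.linalg.eigh(A)               # eigenvalues in ascending order
x = np.sign(V[:, -1])
x[x == 0] = 1.0                        # break ties (measure-zero event)
val = float(x @ A @ x)

# Brute force is feasible at this size and shows how loose the bound can be:
# rounded value <= true optimum <= n * lambda_max (the spectral bound).
best = max(float(np.array(v) @ A @ np.array(v))
           for v in itertools.product((-1.0, 1.0), repeat=n))
print(val, best, n * w[-1])
```

The gap between the true optimum and $n \lambda_{\max}$ is exactly the looseness the abstract attributes to spectral relaxation; an SDP relaxation tightens this bound at higher computational cost.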
Training (Overparametrized) Neural Networks in Near-Linear Time
The slow convergence rate and pathological curvature issues of first-order
gradient methods for training deep neural networks initiated an ongoing effort
for developing faster second-order optimization
algorithms beyond SGD, without compromising the generalization error. Despite
their remarkable convergence rate (independent of the training batch
size $n$), second-order algorithms incur a daunting slowdown in the
cost per iteration (inverting the Hessian
matrix of the loss function), which renders them impractical. Very recently,
this computational overhead was mitigated by the works of [ZMG19, CGH+19],
yielding an $O(mn^2)$-time second-order algorithm for training two-layer
overparametrized neural networks of polynomial width $m$.
We show how to speed up the algorithm of [CGH+19], achieving an
$\widetilde{O}(mn)$-time backpropagation algorithm for training (mildly
overparametrized) ReLU networks, which is near-linear in the dimension ($mn$)
of the full gradient (Jacobian) matrix. The centerpiece of our algorithm is to
reformulate the Gauss-Newton iteration as an $\ell_2$-regression problem, and
then use a Fast-JL type dimension reduction to precondition the
underlying Gram matrix in time independent of $m$, allowing us to find a
sufficiently good approximate solution via first-order
conjugate gradient. Our result provides a proof-of-concept that advanced
machinery from randomized linear algebra -- which led to recent breakthroughs
in convex optimization (ERM, LPs, Regression) -- can be
carried over to the realm of deep learning as well.
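The pipeline described here (a Gauss-Newton step as $\ell_2$-regression, dimension reduction to precondition, then conjugate gradient) can be mimicked on a synthetic ill-conditioned Jacobian. A minimal sketch: it substitutes a dense Gaussian projection for the Fast-JL transform and hand-rolled CG for the paper's solver, so all sizes and names below are assumptions for illustration, not the authors' algorithm.

```python
import numpy as np

rng = np.random.default_rng(3)

# One Gauss-Newton step reduces to the l2-regression min_g ||J g - r||_2,
# where J is a tall Jacobian (sizes here are made up).
M, n = 2000, 50
J = rng.standard_normal((M, n)) * rng.uniform(0.1, 10.0, n)  # ill-conditioned columns
r = rng.standard_normal(M)

# Dimension reduction (Gaussian here, Fast-JL in the paper) yields a
# preconditioner: the R factor of the sketched Jacobian.
s = 4 * n
S = rng.standard_normal((s, M)) / np.sqrt(s)
_, R = np.linalg.qr(S @ J)
P = J @ np.linalg.inv(R)          # well-conditioned: singular values near 1

# Conjugate gradient on the normal equations of the preconditioned system.
H, b = P.T @ P, P.T @ r
z = np.zeros(n)
res = b - H @ z
p = res.copy()
for _ in range(50):
    Hp = H @ p
    alpha = (res @ res) / (p @ Hp)
    z += alpha * p
    new = res - alpha * Hp
    if np.linalg.norm(new) < 1e-12:
        break
    p = new + ((new @ new) / (res @ res)) * p
    res = new

g = np.linalg.solve(R, z)         # undo the preconditioning change of variables
g_star = np.linalg.lstsq(J, r, rcond=None)[0]
print(np.linalg.norm(g - g_star))
```

Because the preconditioned Gram matrix has condition number $O(1)$, CG reaches machine precision in a few dozen iterations regardless of how skewed the original column scales of `J` are.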