
    Patching Colors with Tensors


    Large Scale Constrained Linear Regression Revisited: Faster Algorithms via Preconditioning

    In this paper, we revisit the large-scale constrained linear regression problem and propose faster methods based on some recent developments in sketching and optimization. Our algorithms combine (accelerated) mini-batch SGD with a new method called two-step preconditioning to achieve an approximate solution with a time complexity lower than that of the state-of-the-art techniques for the low-precision case. Our idea can also be extended to the high-precision case, yielding an alternative implementation of the Iterative Hessian Sketch (IHS) method with significantly improved time complexity. Experiments on benchmark and synthetic datasets suggest that our methods indeed outperform existing ones considerably in both the low- and high-precision cases.
    Comment: Appear in AAAI-1
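The two-step idea described above (sketch, then precondition, then run a cheap first-order method) can be illustrated with a minimal numpy sketch. This is not the paper's algorithm: a plain Gaussian sketch and full-batch gradient descent stand in for its sketching transform and (accelerated) mini-batch SGD, and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic overdetermined least-squares instance (a stand-in for the
# constrained regression problem in the abstract).
n, d = 2000, 20
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true + 0.01 * rng.standard_normal(n)

# Step 1: sketch the tall matrix A with a Gaussian sketch S (s x n).
s = 200
S = rng.standard_normal((s, n)) / np.sqrt(s)
SA = S @ A

# Step 2: QR of the sketched matrix gives a right preconditioner R,
# so that A @ inv(R) has condition number O(1).
_, R = np.linalg.qr(SA)
Ap = A @ np.linalg.inv(R)          # preconditioned design matrix

# First-order iterations on the preconditioned system now converge
# at a rate independent of cond(A).
y = np.zeros(d)
step = 1.0 / np.linalg.norm(Ap, 2) ** 2
for _ in range(200):
    y -= step * (Ap.T @ (Ap @ y - b))

x_hat = np.linalg.solve(R, y)      # map back to original coordinates
err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
```

The point of the preconditioning step is that the iteration count of the inner solver no longer depends on the conditioning of `A`, only on the sketch quality.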

    Randomized and Deterministic Attention Sparsification Algorithms for Over-parameterized Feature Dimension

    Large language models (LLMs) have shown their power in different areas. Attention computation, as an important subroutine of LLMs, has also attracted interest in theory. Recently, the static computation and dynamic maintenance of the attention matrix have been studied by [Alman and Song 2023] and [Brand, Song and Zhou 2023] from both the algorithmic and the hardness perspectives. In this work, we consider the sparsification of the attention problem. We make one simplification, namely that the logit matrix is symmetric. Let $n$ denote the length of the sentence and let $d$ denote the embedding dimension. Given a matrix $X \in \mathbb{R}^{n \times d}$, suppose $d \gg n$ and $\| X X^\top \|_{\infty} < r$ with $r \in (0, 0.1)$; then we aim to find $Y \in \mathbb{R}^{n \times m}$ (where $m \ll d$) such that \begin{align*} \| D(Y)^{-1} \exp( Y Y^\top ) - D(X)^{-1} \exp( X X^\top ) \|_{\infty} \leq O(r). \end{align*} We provide two results for this problem.
    $\bullet$ Our first result is a randomized algorithm. It runs in $\widetilde{O}(\mathrm{nnz}(X) + n^{\omega})$ time, succeeds with probability $1-\delta$, and chooses $m = O(n \log(n/\delta))$. Here $\mathrm{nnz}(X)$ denotes the number of non-zero entries in $X$, and $\omega$ denotes the exponent of matrix multiplication; currently $\omega \approx 2.373$.
    $\bullet$ Our second result is a deterministic algorithm. It runs in $\widetilde{O}(\min\{\sum_{i\in[d]}\mathrm{nnz}(X_i)^2, dn^{\omega-1}\} + n^{\omega+1})$ time and chooses $m = O(n)$. Here $X_i$ denotes the $i$-th column of the matrix $X$.
    Our main findings have the following implication for applied LLM tasks: for any super-large feature dimension, we can reduce it down to a size nearly linear in the length of the sentence
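The guarantee above (replace $X$ by a much narrower $Y$ while preserving the softmax attention matrix entrywise) can be demonstrated with a toy numpy sketch. This substitutes a plain Gaussian Johnson–Lindenstrauss projection for the paper's algorithms, and the row scaling of `X` is an illustrative way to land in the $\|XX^\top\|_\infty < r$ regime.

```python
import numpy as np

rng = np.random.default_rng(1)

# Tiny instance with d >> n; rows of X are scaled so the entries of
# X X^T are around 0.05, matching the r in (0, 0.1) regime.
n, d = 8, 4096
X = rng.standard_normal((n, d)) * np.sqrt(0.05 / d)

def attention(Z):
    """Row-normalized D(Z)^{-1} exp(Z Z^T) from the abstract."""
    E = np.exp(Z @ Z.T)
    return E / E.sum(axis=1, keepdims=True)

# A plain Gaussian JL sketch with m << d columns, standing in for the
# paper's randomized construction (which uses m = O(n log(n/delta))).
m = 64
G = rng.standard_normal((d, m)) / np.sqrt(m)
Y = X @ G                          # n x m, with m << d

# Entrywise deviation between the sparsified and exact attention.
err = np.abs(attention(Y) - attention(X)).max()
```

Because JL sketches preserve all pairwise inner products of the $n$ rows up to roughly $1/\sqrt{m}$ relative error, the two attention matrices agree entrywise up to a small additive term.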

    Large-scale Binary Quadratic Optimization Using Semidefinite Relaxation and Applications

    In computer vision, many problems such as image segmentation, pixel labelling, and scene parsing can be formulated as binary quadratic programs (BQPs). For submodular problems, cut-based methods can be employed to efficiently solve large-scale instances. However, general nonsubmodular problems are significantly more challenging to solve. Finding a solution when the problem is large enough to be of practical interest, however, typically requires relaxation. Two standard relaxation methods are widely used for solving general BQPs: spectral methods and semidefinite programming (SDP), each with their own advantages and disadvantages. Spectral relaxation is simple and easy to implement, but its bound is loose. Semidefinite relaxation has a tighter bound, but its computational complexity is high, especially for large-scale problems. In this work, we present a new SDP formulation for BQPs with two desirable properties. First, it has a similar relaxation bound to conventional SDP formulations. Second, compared with conventional SDP methods, the new SDP formulation leads to a significantly more efficient and scalable dual optimization approach, which has the same degree of complexity as spectral methods. We then propose two solvers, namely quasi-Newton and smoothing Newton methods, for the dual problem. Both of them are significantly more efficient than standard interior-point methods. In practice, the smoothing Newton solver is faster than the quasi-Newton solver for dense or medium-sized problems, while the quasi-Newton solver is preferable for large sparse/structured problems. Our experiments on a few computer vision applications including clustering, image segmentation, co-segmentation and registration show the potential of our SDP formulation for solving large-scale BQPs.
    Comment: Fixed some typos. 18 pages. Accepted to IEEE Transactions on Pattern Analysis and Machine Intelligence
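The spectral relaxation that the abstract uses as its efficiency baseline is simple enough to sketch directly. The snippet below illustrates only that baseline, not the paper's SDP formulation or its dual solvers: for the BQP $\max_{x \in \{-1,+1\}^n} x^\top A x$, dropping integrality while keeping $\|x\|^2 = n$ makes the leading eigenvector the relaxed maximizer, which is then rounded by taking signs. The random instance is illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Random symmetric cost matrix for the BQP  max_{x in {-1,+1}^n} x^T A x.
n = 40
B = rng.standard_normal((n, n))
A = (B + B.T) / 2

# Spectral relaxation: over the sphere ||x||^2 = n, the maximum of
# x^T A x is n * lambda_max, attained at the leading eigenvector.
w, V = np.linalg.eigh(A)           # eigenvalues in ascending order
v = V[:, -1]                       # eigenvector of the largest eigenvalue

# Round the relaxed solution back to {-1,+1}.
x = np.sign(v)
x[x == 0] = 1

upper_bound = n * w[-1]            # the (loose) spectral relaxation bound
value = x @ A @ x                  # objective value of the rounded solution
```

The gap between `value` and `upper_bound` is exactly the looseness the abstract attributes to spectral relaxation; SDP relaxations tighten the bound at higher cost.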

    Training (Overparametrized) Neural Networks in Near-Linear Time

    The slow convergence rate and pathological curvature issues of first-order gradient methods for training deep neural networks initiated an ongoing effort to develop faster second-order optimization algorithms beyond SGD, without compromising the generalization error. Despite their remarkable convergence rate (independent of the training batch size $n$), second-order algorithms incur a daunting slowdown in the cost per iteration (inverting the Hessian matrix of the loss function), which renders them impractical. Very recently, this computational overhead was mitigated by the works of [ZMG19, CGH+19], yielding an $O(mn^2)$-time second-order algorithm for training two-layer overparametrized neural networks of polynomial width $m$. We show how to speed up the algorithm of [CGH+19], achieving an $\tilde{O}(mn)$-time backpropagation algorithm for training (mildly overparametrized) ReLU networks, which is near-linear in the dimension ($mn$) of the full gradient (Jacobian) matrix. The centerpiece of our algorithm is to reformulate the Gauss-Newton iteration as an $\ell_2$-regression problem, and then use a Fast-JL type dimension reduction to precondition the underlying Gram matrix in time independent of $m$, allowing one to find a sufficiently good approximate solution via first-order conjugate gradient. Our result provides a proof-of-concept that advanced machinery from randomized linear algebra -- which led to recent breakthroughs in convex optimization (ERM, LPs, regression) -- can be carried over to the realm of deep learning as well.
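The pipeline named as the centerpiece above (Gauss-Newton step as $\ell_2$-regression, sketch-based preconditioning, then conjugate gradient) can be sketched in numpy. This is a toy stand-in, not the paper's algorithm: a dense Gaussian sketch replaces the Fast-JL transform, `J` is a generic ill-conditioned tall matrix playing the role of the Jacobian, and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

# A Gauss-Newton step reduces to the l2-regression  min_p ||J p - r||_2,
# with J the (tall) Jacobian. Column scaling makes it ill-conditioned.
N, d = 5000, 30
J = rng.standard_normal((N, d)) * np.logspace(0, 3, d)
r = rng.standard_normal(N)

# Precondition: sketch J (Gaussian here, standing in for Fast-JL),
# take QR of the sketch, and use R as a right preconditioner.
s = 300
S = rng.standard_normal((s, N)) / np.sqrt(s)
_, R = np.linalg.qr(S @ J)
Jp = J @ np.linalg.inv(R)          # now cond(Jp) = O(1)

# Conjugate gradient on the preconditioned normal equations
# (Jp^T Jp) y = Jp^T r; few iterations suffice once cond is O(1).
y = np.zeros(d)
res = Jp.T @ r
p = res.copy()
for _ in range(50):
    Ap = Jp.T @ (Jp @ p)
    alpha = (res @ res) / (p @ Ap)
    y += alpha * p
    new_res = res - alpha * Ap
    beta = (new_res @ new_res) / (res @ res)
    res = new_res
    p = res + beta * p

step = np.linalg.solve(R, y)       # Gauss-Newton step in original coordinates
exact = np.linalg.lstsq(J, r, rcond=None)[0]
rel_err = np.linalg.norm(step - exact) / np.linalg.norm(exact)
```

Because the preconditioned system has condition number $O(1)$, conjugate gradient reaches a high-accuracy solution in a number of iterations independent of how badly conditioned the original Jacobian was.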