20 research outputs found
Faster Robust Tensor Power Method for Arbitrary Order
Tensor decomposition is a fundamental method used in various areas to deal
with high-dimensional data. \emph{Tensor power method} (TPM) is one of the
widely-used techniques in the decomposition of tensors. This paper presents a
novel tensor power method for decomposing arbitrary order tensors, which
overcomes limitations of existing approaches that are often restricted to
lower-order tensors or require strong assumptions about the
underlying data structure. We apply a sketching method, and we are able to
achieve an improved running time for decomposing a $p$-th order tensor of
dimension $n$. We provide a detailed analysis for any $p$-th order
tensor, which has not been given in previous works.
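For orientation, the sketch below shows the classical tensor power iteration with deflation for a symmetric third-order tensor; it is a minimal baseline, not the paper's sketching-accelerated algorithm, and the function name, iteration counts, and random initialization are illustrative assumptions.

```python
import numpy as np

def tensor_power_method(T, n_components=2, n_iters=100, seed=0):
    """Classical power iteration with deflation for a symmetric 3rd-order tensor T."""
    rng = np.random.default_rng(seed)
    n = T.shape[0]
    T = T.copy()
    weights, factors = [], []
    for _ in range(n_components):
        u = rng.standard_normal(n)
        u /= np.linalg.norm(u)
        for _ in range(n_iters):
            # u <- T(I, u, u) / ||T(I, u, u)||
            v = np.einsum('ijk,j,k->i', T, u, u)
            u = v / np.linalg.norm(v)
        lam = np.einsum('ijk,i,j,k->', T, u, u, u)   # recovered eigenvalue
        weights.append(lam)
        factors.append(u)
        # deflate the recovered rank-1 component
        T -= lam * np.einsum('i,j,k->ijk', u, u, u)
    return np.array(weights), np.array(factors)
```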
The Expressibility of Polynomial based Attention Scheme
Large language models (LLMs) have significantly improved various aspects of
our daily lives. These models have impacted numerous domains, from healthcare
to education, enhancing productivity, decision-making processes, and
accessibility. As a result, they have influenced and, to some extent, reshaped
people's lifestyles. However, the quadratic complexity of attention in
transformer architectures poses a challenge when scaling up these models for
processing long textual contexts. This issue makes it impractical to train very
large models on lengthy texts or use them efficiently during inference. While a
recent study by [KMZ23] introduced a technique that replaces the softmax with a
polynomial function and polynomial sketching to speed up attention mechanisms,
the theoretical properties of this new approach are not yet well
understood.
In this paper, we offer a theoretical analysis of the expressive capabilities
of polynomial attention. Our study reveals a disparity in the ability of
high-degree and low-degree polynomial attention. Specifically, we construct two
carefully designed datasets, where one contains a feature whose value is
significantly larger than that of the corresponding feature in the other. We
demonstrate that with a sufficiently high polynomial degree, a single-layer
polynomial attention network can distinguish between the two datasets, whereas
with a low degree it cannot effectively separate them. This analysis underscores
the greater effectiveness of high-degree polynomials in amplifying large values
and distinguishing between datasets. Our analysis offers insight into the
representational capacity of polynomial attention and provides a rationale for
incorporating higher-degree polynomials in attention mechanisms to capture
intricate linguistic correlations.
Comment: arXiv admin note: substantial text overlap with arXiv:2310.1168
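As a minimal numpy illustration of the contrast discussed above, the sketch below replaces the softmax weights with an entrywise polynomial; the degree, scaling, and normalization are assumptions for illustration, and it omits both the polynomial sketching of [KMZ23] and the paper's dataset construction.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard single-head softmax attention."""
    S = Q @ K.T / np.sqrt(Q.shape[1])
    W = np.exp(S - S.max(axis=1, keepdims=True))
    W /= W.sum(axis=1, keepdims=True)
    return W @ V

def polynomial_attention(Q, K, V, degree=4):
    """Softmax replaced by an entrywise polynomial; an even degree keeps the
    attention weights nonnegative before row normalization."""
    S = (Q @ K.T) ** degree
    W = S / (S.sum(axis=1, keepdims=True) + 1e-12)
    return W @ V
```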
Revisiting Quantum Algorithms for Linear Regressions: Quadratic Speedups without Data-Dependent Parameters
Linear regression is one of the most fundamental linear algebra problems.
Given a dense matrix $A \in \mathbb{R}^{n \times d}$ and a vector $b \in \mathbb{R}^n$, the goal
is to find $x'$ such that
$\|Ax' - b\|_2 \le (1+\epsilon) \min_{x} \|Ax - b\|_2$. The best
classical algorithms run in time nearly linear in the number of nonzero entries of $A$ [Clarkson
and Woodruff STOC 2013, Nelson and Nguyen FOCS 2013]. On the other hand,
quantum linear regression algorithms can achieve exponential quantum speedups,
as shown in [Wang Phys. Rev. A 96, 012335, Kerenidis and Prakash ITCS 2017,
Chakraborty, Gily{\'e}n and Jeffery ICALP 2019]. However, the running times of
these algorithms depend on some quantum linear algebra-related parameters, such
as $\kappa(A)$, the condition number of $A$. In this work, we develop a quantum
algorithm that provides a quadratic quantum speedup
over the classical lower bound without any dependence on data-dependent
parameters. In addition, we also show that our result can be generalized to multiple
regression and ridge linear regression.
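For reference, the classical sketch-based regression baseline alluded to above can be sketched as follows; this is not the quantum algorithm, it uses a dense Gaussian sketch rather than the sparse embeddings of [Clarkson and Woodruff], and the function name and sketch size are illustrative assumptions.

```python
import numpy as np

def sketch_and_solve_lstsq(A, b, sketch_rows=None, seed=0):
    """Classical sketch-and-solve: compress the rows of A and b with a random
    sketch S, then solve the small problem min_x ||S A x - S b||_2."""
    n, d = A.shape
    m = sketch_rows or 10 * d                      # sketch size (illustrative)
    rng = np.random.default_rng(seed)
    S = rng.standard_normal((m, n)) / np.sqrt(m)   # Gaussian sketch for simplicity
    x, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)
    return x
```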
GradientCoin: A Peer-to-Peer Decentralized Large Language Models
Since the proposal of the Bitcoin electronic cash system in 2008, Bitcoin has
fundamentally changed the economic system over the last decade. Since 2022,
large language models (LLMs) such as GPT have outperformed humans in many
real-life tasks. However, these large language models have several practical
issues. For example, the model is centralized and controlled by a specific
unit. One weakness is that if that unit decides to shut down the model, it
cannot be used anymore. The second weakness is the lack of guaranteed
integrity behind this model, as certain dishonest units may design their own
models and feed them unhealthy training data.
In this work, we propose a purely theoretical design of a decentralized LLM
that operates similarly to a Bitcoin cash system. However, implementing such a
system might encounter various practical difficulties. Furthermore, this new
system is unlikely to perform better than the standard Bitcoin system in
economics. Therefore, the motivation for designing such a system is limited. It
is likely that only two types of people would be interested in setting up a
practical system for it:
Those who prefer to use decentralized ChatGPT-like software.
Those who believe that the purpose of carbon-based life is to
create silicon-based life, such as Optimus Prime in Transformers.
The reason the second type of people may be interested is that it is possible
that one day an AI system like this will awaken and become the next level of
intelligence on this planet.
Federated Empirical Risk Minimization via Second-Order Method
Many convex optimization problems with important applications in machine
learning are formulated as empirical risk minimization (ERM). There are several
examples: linear and logistic regression, LASSO, kernel regression, quantile
regression, $\ell_p$-norm regression, support vector machines (SVM), and mean-field
variational inference. To improve data privacy, federated learning is proposed
in machine learning as a framework for training deep learning models on the
network edge without sharing data between participating nodes. In this work, we
present an interior point method (IPM) to solve a general ERM problem under the
federated learning setting. We show that the communication complexity of each
iteration of our IPM depends only on the dimension $d$ (i.e., the
number of features) of the dataset.
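To make the setting concrete, the sketch below shows a generic federated second-order step for one ERM instance (regularized logistic regression), in which each node communicates only an aggregated gradient and Hessian rather than raw data; this is not the paper's interior point method, and the helper names and ridge parameter are assumptions.

```python
import numpy as np

def local_grad_hess(X, y, w):
    """Gradient and Hessian of the logistic loss on one node's local shard."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    g = X.T @ (p - y)
    H = X.T @ (X * (p * (1 - p))[:, None])
    return g, H

def federated_newton_step(shards, w, lam=1e-3):
    """Server aggregates each node's local statistics and takes one
    regularized Newton step; no raw data leaves a node."""
    d = len(w)
    G, H = lam * w, lam * np.eye(d)
    for X, y in shards:           # in a real system these arrive over the network
        g_i, H_i = local_grad_hess(X, y, w)
        G = G + g_i
        H = H + H_i
    return w - np.linalg.solve(H, G)
```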
A Unified Scheme of ResNet and Softmax
Large language models (LLMs) have brought significant changes to human
society. Softmax regression and residual neural networks (ResNet) are two
important techniques in deep learning: they not only serve as significant
theoretical components supporting the functionality of LLMs but also are
related to many other machine learning and theoretical computer science fields,
including but not limited to image classification, object detection, semantic
segmentation, and tensors.
Previous research works studied these two concepts separately. In this paper,
we provide a theoretical analysis of a regression problem whose objective combines a
normalized-exponential (softmax) term with a residual (skip-connection) term, and which is
defined by a matrix $A \in \mathbb{R}^{n \times d}$, a vector $b \in \mathbb{R}^n$, and the
$n$-dimensional all-ones vector ${\bf 1}_n$. This regression problem is a unified scheme that
combines softmax regression and ResNet, which has not been studied before. We derive the
gradient, Hessian, and Lipschitz properties of the loss function. The Hessian
is shown to be positive semidefinite, and its structure is characterized as the
sum of a low-rank matrix and a diagonal matrix. This enables an efficient
approximate Newton method.
As a result, this unified scheme helps to connect two fields previously thought to be
unrelated and provides novel insight into the loss landscape and
optimization of emerging over-parameterized neural networks, which is
meaningful for future research on deep learning models.
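The reported Hessian structure (diagonal plus low rank) is exactly what makes an approximate Newton step cheap: the linear system can be solved with the Woodbury identity instead of a full matrix inverse. The sketch below illustrates only that generic algebra, not the paper's algorithm; the function and variable names are assumptions.

```python
import numpy as np

def newton_direction_low_rank_plus_diag(diag, U, grad):
    """Solve (D + U U^T) p = grad via the Woodbury identity, where D = diag(diag)
    has positive entries and U is n x k with small k, in O(n k^2 + k^3) time."""
    Dinv_g = grad / diag
    Dinv_U = U / diag[:, None]
    k = U.shape[1]
    cap = np.eye(k) + U.T @ Dinv_U                 # small k x k capacitance matrix
    return Dinv_g - Dinv_U @ np.linalg.solve(cap, U.T @ Dinv_g)
```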
Solving Attention Kernel Regression Problem via Pre-conditioner
The attention mechanism is the key to large language models, and the
attention matrix serves as an algorithmic and computational bottleneck for such
a scheme. In this paper, we define two problems, motivated by designing fast
algorithms for proxies of the attention matrix and solving regressions against them.
Given an input matrix $A \in \mathbb{R}^{n \times d}$ with $n \gg d$ and a
response vector $b$, we first consider the matrix exponential of the matrix $A^\top A$
as a proxy, and we in turn design algorithms for two types of
regression problems: $\min_{x \in \mathbb{R}^d} \|(A^\top A)^j x - b\|_2$ and
$\min_{x \in \mathbb{R}^d} \|A (A^\top A)^j x - b\|_2$ for any positive integer $j$.
Studying algorithms for these regressions is essential, as the matrix exponential
can be approximated term-by-term via these smaller problems. The second proxy
is applying the exponential entrywise to the Gram matrix, denoted by $\exp(AA^\top)$,
and solving the regression $\min_{x \in \mathbb{R}^n} \|\exp(AA^\top) x - b\|_2$. We call this problem the attention
kernel regression problem, as the matrix $\exp(AA^\top)$ could be viewed as a
kernel matrix with respect to $A$. We design fast algorithms for these
regression problems, based on sketching and preconditioning. We hope these
efforts will provide an alternative perspective of studying efficient
approximation of attention matrices.
Comment: AISTATS 202
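As background on the "sketching and preconditioning" ingredient, the sketch below applies it to a plain least-squares problem rather than the attention kernel regressions above: a sketched copy of the matrix is QR-factored and its R factor is used as a right preconditioner for an iterative solver. The function name, Gaussian sketch, and sketch size are illustrative assumptions.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, lsqr

def sketch_precondition_lstsq(A, b, sketch_rows=None, seed=0):
    """Sketch-and-precondition: QR-factor S A and run LSQR on A R^{-1}."""
    n, d = A.shape
    m = sketch_rows or 4 * d                       # sketch size (illustrative)
    rng = np.random.default_rng(seed)
    S = rng.standard_normal((m, n)) / np.sqrt(m)   # Gaussian sketch for simplicity
    _, R = np.linalg.qr(S @ A)                     # S A = Q R
    ARinv = LinearOperator(
        (n, d),
        matvec=lambda y: A @ np.linalg.solve(R, y),
        rmatvec=lambda z: np.linalg.solve(R.T, A.T @ z),
        dtype=A.dtype,
    )
    y = lsqr(ARinv, b, atol=1e-12, btol=1e-12)[0]  # solve min ||A R^{-1} y - b||
    return np.linalg.solve(R, y)                   # recover x = R^{-1} y
```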
Local Convergence of Approximate Newton Method for Two Layer Nonlinear Regression
There have been significant advancements made by large language models (LLMs)
in various aspects of our daily lives. LLMs serve as a transformative force in
natural language processing, finding applications in text generation,
translation, sentiment analysis, and question-answering. The accomplishments of
LLMs have led to a substantial increase in research efforts in this domain. One
specific two-layer regression problem has been well-studied in prior works,
where the first layer is activated by a ReLU unit, and the second layer is
activated by a softmax unit. While previous works provide a solid analysis of
building a two-layer regression, there is still a gap in the analysis of
constructing regression problems with more than two layers.
In this paper, we take a crucial step toward addressing this problem: we
provide an analysis of a two-layer regression problem. In contrast to previous
works, our first layer is activated by a softmax unit. This sets the stage for
future analyses of creating more activation functions based on the softmax
function. Rearranging the softmax function leads to significantly different
analyses. Our main results involve analyzing the convergence properties of an
approximate Newton method used to minimize the regularized training loss. We
prove that the Hessian matrix of the loss function is positive definite and
Lipschitz continuous under certain assumptions. This enables us to establish
local convergence guarantees for the proposed training algorithm. Specifically,
with an appropriate initialization and after sufficiently many iterations,
our algorithm can find an $\epsilon$-approximate minimizer of the training loss
with high probability. The cost of each iteration is governed by the size of
the input matrix, the model size, and the matrix multiplication exponent $\omega$.
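For intuition, the sketch below shows a generic approximate Newton loop for a regularized loss with a positive-definite Hessian, where a cached Cholesky factor is reused across several iterations; it is not the paper's two-layer softmax objective, and the callables, regularization, and refresh schedule are assumptions.

```python
import numpy as np

def approximate_newton(grad, hess, w0, reg=1e-2, refresh_every=5, n_iters=20):
    """Approximate Newton for a regularized loss: reuse a cached factorization
    of the (regularized) Hessian instead of recomputing it every step."""
    w = w0.copy()
    for t in range(n_iters):
        if t % refresh_every == 0:
            H = hess(w) + reg * np.eye(len(w))     # regularized Hessian snapshot
            L = np.linalg.cholesky(H)
        g = grad(w) + reg * w
        p = np.linalg.solve(L.T, np.linalg.solve(L, g))   # solve H p = g
        w = w - p
    return w
```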
Low Rank Matrix Completion via Robust Alternating Minimization in Nearly Linear Time
Given a matrix $M \in \mathbb{R}^{m \times n}$, the low rank matrix completion
problem asks us to find a rank-$k$ approximation of $M$ as $UV^\top$ for
$U \in \mathbb{R}^{m \times k}$ and $V \in \mathbb{R}^{n \times k}$ by only observing a
few entries specified by a set $\Omega \subseteq [m] \times [n]$. In
particular, we examine an approach that is widely used in practice -- the
alternating minimization framework. Jain, Netrapalli and Sanghavi~\cite{jns13}
showed that if $M$ has incoherent rows and columns, then alternating
minimization provably recovers the matrix by observing a number of entries
nearly linear in $n$. While the sample complexity has been subsequently
improved~\cite{glz17}, alternating minimization steps are required to be
computed exactly. This hinders the development of more efficient algorithms and
fails to depict the practical implementation of alternating minimization, where
the updates are usually performed approximately in favor of efficiency.
In this paper, we take a major step towards a more efficient and error-robust
alternating minimization framework. To this end, we develop an analytical
framework for alternating minimization that can tolerate a moderate amount of
error caused by approximate updates. Moreover, our algorithm runs in time
$\widetilde{O}(|\Omega| k)$, which is nearly linear in the time to verify the
solution while preserving the sample complexity. This improves upon all prior
known alternating minimization approaches, which require strictly more time.
Comment: Improve the runtime to $\widetilde{O}(|\Omega| k)$
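For orientation, the exact-update alternating minimization baseline that the paper relaxes can be sketched as below: each step solves small least-squares problems restricted to the observed entries. The function name and defaults are illustrative, and this is the exact-update variant, not the paper's robust approximate scheme.

```python
import numpy as np

def alternating_minimization(M, mask, k, n_iters=50, seed=0):
    """Exact-update alternating least squares: fit M ~ U V^T using only the
    entries where mask is True."""
    n1, n2 = M.shape
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((n1, k))
    V = rng.standard_normal((n2, k))
    for _ in range(n_iters):
        for i in range(n1):                        # update U row by row, V fixed
            obs = mask[i]
            U[i] = np.linalg.lstsq(V[obs], M[i, obs], rcond=None)[0]
        for j in range(n2):                        # update V row by row, U fixed
            obs = mask[:, j]
            V[j] = np.linalg.lstsq(U[obs], M[obs, j], rcond=None)[0]
    return U, V
```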
A Fast Optimization View: Reformulating Single Layer Attention in LLM Based on Tensor and SVM Trick, and Solving It in Matrix Multiplication Time
Large language models (LLMs) have played a pivotal role in revolutionizing
various facets of our daily existence. Solving attention regression is a
fundamental task in optimizing LLMs. In this work, we focus on giving a
provable guarantee for the one-layer attention network objective function $L(X,Y)$.
Here $\mathsf{A} = A_1 \otimes A_2$ is the Kronecker product between the input matrices
$A_1$ and $A_2$, $A_3$ is a matrix in $\mathbb{R}^{n \times d}$, and $\mathsf{A}_{j_0}$ is the $j_0$-th block of
$\mathsf{A}$. The matrices $X, Y \in \mathbb{R}^{d \times d}$ are the variables we want to
learn. $B \in \mathbb{R}^{n \times d}$ is the target matrix, $b_{j_0,i_0}$ is the
entry at the $j_0$-th row and $i_0$-th column of $B$,
$Y_{*,i_0}$ is the $i_0$-th column vector of $Y$, and $x$ is the
vectorization of $X$.
In a multi-layer LLM network, the matrix $B$ can
be viewed as the output of a layer, and the matrices $A_1, A_2, A_3$ can be viewed as the input of a layer. The matrix version of $x$ can
be viewed as $QK^\top$ and $Y$ can be viewed as $V$. We provide an iterative
greedy algorithm to train the loss function $L(X,Y)$ up to accuracy $\epsilon$ that runs in matrix multiplication
time. Here ${\cal T}_{\mathrm{mat}}(a,b,c)$ denotes the time of multiplying an $a \times b$ matrix by
another $b \times c$ matrix, and $\omega$ denotes the exponent of
matrix multiplication.
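The "tensor trick" that underlies such Kronecker-product reformulations is the identity $\mathrm{vec}(AXB) = (B^\top \otimes A)\,\mathrm{vec}(X)$ with column-major vectorization. The snippet below only checks this identity numerically with generic matrices (not the paper's $A_1, A_2, A_3$ or its objective).

```python
import numpy as np

# Verify vec(A X B) = (B^T kron A) vec(X) using Fortran-order (column-major) vec.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
X = rng.standard_normal((4, 5))
B = rng.standard_normal((5, 2))

lhs = (A @ X @ B).flatten(order="F")
rhs = np.kron(B.T, A) @ X.flatten(order="F")
assert np.allclose(lhs, rhs)
```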