A Fast Optimization View: Reformulating Single Layer Attention in LLM Based on Tensor and SVM Trick, and Solving It in Matrix Multiplication Time
Authors
Yeqi Gao
Zhao Song
Weixin Wang
Junze Yin
Publication date
14 September 2023
Publisher
View on arXiv
Abstract
Large language models (LLMs) have played a pivotal role in revolutionizing various facets of our daily existence. Solving attention regression is a fundamental task in optimizing LLMs. In this work, we focus on giving a provable guarantee for the one-layer attention network objective function
$$L(X,Y) = \sum_{j_0 = 1}^n \sum_{i_0 = 1}^d \Big( \big\langle \langle \exp( \mathsf{A}_{j_0} x ) , {\bf 1}_n \rangle^{-1} \exp( \mathsf{A}_{j_0} x ),\; A_{3} Y_{*,i_0} \big\rangle - b_{j_0,i_0} \Big)^2 .$$
Here $\mathsf{A} \in \mathbb{R}^{n^2 \times d^2}$ is the Kronecker product of $A_1 \in \mathbb{R}^{n \times d}$ and $A_2 \in \mathbb{R}^{n \times d}$, $A_3$ is a matrix in $\mathbb{R}^{n \times d}$, and $\mathsf{A}_{j_0} \in \mathbb{R}^{n \times d^2}$ is the $j_0$-th block of $\mathsf{A}$. The matrices $X, Y \in \mathbb{R}^{d \times d}$ are the variables we want to learn; $B \in \mathbb{R}^{n \times d}$ is given, and $b_{j_0,i_0} \in \mathbb{R}$ is the entry in the $j_0$-th row and $i_0$-th column of $B$; $Y_{*,i_0} \in \mathbb{R}^d$ is the $i_0$-th column vector of $Y$; and $x \in \mathbb{R}^{d^2}$ is the vectorization of $X$. In a multi-layer LLM network, the matrix $B \in \mathbb{R}^{n \times d}$ can be viewed as the output of a layer and $A_1 = A_2 = A_3 \in \mathbb{R}^{n \times d}$ as the input of a layer; the matrix version of $x$ can be viewed as $QK^\top$ and $Y$ can be viewed as $V$. We provide an iterative greedy algorithm that trains the loss function $L(X,Y)$ up to accuracy $\epsilon$ in time
$$\widetilde{O}\Big( \big({\cal T}_{\mathrm{mat}}(n,n,d) + {\cal T}_{\mathrm{mat}}(n,d,d) + d^{2\omega}\big) \log(1/\epsilon) \Big),$$
where ${\cal T}_{\mathrm{mat}}(a,b,c)$ denotes the time to multiply an $a \times b$ matrix by a $b \times c$ matrix, and $\omega \approx 2.37$ denotes the exponent of matrix multiplication.
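To make the notation above concrete, here is a minimal NumPy sketch that only evaluates the objective $L(X,Y)$, once directly from the Kronecker/vectorized form in the abstract and once through the equivalent matrix form $\|\,\mathrm{softmax}_{\mathrm{rows}}(A_1 X A_2^\top)\, A_3 Y - B\,\|_F^2$. This is not the paper's training algorithm; the row-major convention for $x = \mathrm{vec}(X)$ and all function names are assumptions made for illustration.

```python
import numpy as np

def attention_loss_kron(A1, A2, A3, B, X, Y):
    """Evaluate L(X, Y) as written in the abstract:
    sum_{j0, i0} ( <softmax(A_{j0} x), A3 Y[:, i0]> - B[j0, i0] )^2,
    where A = kron(A1, A2) and x = vec(X) (row-major convention assumed)."""
    n, d = A1.shape
    A = np.kron(A1, A2)                      # shape (n^2, d^2)
    x = X.flatten()                          # vec(X), row-major
    loss = 0.0
    for j0 in range(n):
        Aj0 = A[j0 * n:(j0 + 1) * n, :]      # j0-th (n x d^2) block of A
        u = np.exp(Aj0 @ x)                  # exp(A_{j0} x) in R^n
        s = u / u.sum()                      # <exp(.), 1_n>^{-1} exp(.)
        for i0 in range(d):
            pred = s @ (A3 @ Y[:, i0])       # inner product with A3 Y[:, i0]
            loss += (pred - B[j0, i0]) ** 2
    return loss

def attention_loss_matrix(A1, A2, A3, B, X, Y):
    """Equivalent matrix form: || softmax_rows(A1 X A2^T) A3 Y - B ||_F^2."""
    S = np.exp(A1 @ X @ A2.T)
    S = S / S.sum(axis=1, keepdims=True)     # row-wise softmax normalization
    return np.sum((S @ A3 @ Y - B) ** 2)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 6, 3
    A1, A2, A3, B = (rng.standard_normal((n, d)) for _ in range(4))
    X, Y = rng.standard_normal((d, d)), rng.standard_normal((d, d))
    # The two formulations agree up to floating-point error.
    assert np.isclose(attention_loss_kron(A1, A2, A3, B, X, Y),
                      attention_loss_matrix(A1, A2, A3, B, X, Y))
    print(attention_loss_matrix(A1, A2, A3, B, X, Y))
```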
Full text
Available Versions
arXiv.org e-Print Archive
oai:arXiv.org:2309.07418
Last updated on 08/10/2023