An Over-parameterized Exponential Regression
Authors
Yeqi Gao
Sridhar Mahadevan
Zhao Song
Publication date
29 March 2023
Publisher
arXiv
Abstract
Over the past few years, there has been a significant amount of research focused on studying the ReLU activation function, with the aim of achieving neural network convergence through over-parametrization. However, recent developments in the field of Large Language Models (LLMs) have sparked interest in the use of exponential activation functions, specifically in the attention mechanism. Mathematically, we define the neural function
$F: \mathbb{R}^{d \times m} \times \mathbb{R}^d \rightarrow \mathbb{R}$ using an exponential activation function. Given a set of data points with labels $\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\} \subset \mathbb{R}^d \times \mathbb{R}$, where $n$ denotes the number of data points, the function $F(W(t), x)$ can be expressed as $F(W(t), x) := \sum_{r=1}^m a_r \exp(\langle w_r, x \rangle)$, where $m$ is the number of neurons and $w_r(t)$ are the weights at time $t$. It is standard in the literature that the $a_r$ are fixed weights that are never changed during training. We initialize the weights $W(0) \in \mathbb{R}^{d \times m}$ with random Gaussian distributions, such that $w_r(0) \sim \mathcal{N}(0, I_d)$, and initialize $a_r$ from a random sign distribution for each $r \in [m]$. Using the gradient descent algorithm, we can find a weight matrix $W(T)$ such that $\| F(W(T), X) - y \|_2 \leq \epsilon$ holds with probability $1 - \delta$, where $\epsilon \in (0, 0.1)$ and $m = \Omega(n^{2+o(1)} \log(n/\delta))$. To optimize the over-parameterization bound $m$, we employ several tight analysis techniques from previous studies [Song and Yang, arXiv 2019; Munteanu, Omlor, Song and Woodruff, ICML 2022].
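The abstract's setup translates directly into a short numerical sketch. The snippet below is a minimal illustration under assumed toy settings (it is not the authors' code): it builds $F(W, x) = \sum_{r=1}^m a_r \exp(\langle w_r, x \rangle)$ with Gaussian-initialized $w_r$ and fixed random-sign $a_r$, and runs plain gradient descent on $W$ until the residual norm drops below a threshold playing the role of $\epsilon$. The dimensions, step size, scaling of $a_r$, and synthetic data are placeholder choices and may need tuning.

```python
import numpy as np

# Minimal sketch, not the authors' implementation: fit the exponential neural
# function F(W, x) = sum_{r=1}^m a_r * exp(<w_r, x>) to labels y by running
# gradient descent on W only, with a_r fixed random signs and w_r(0) ~ N(0, I_d).
# The sizes n, d, m, the step size, and the synthetic data are illustrative choices.

rng = np.random.default_rng(0)

n, d = 32, 8        # number of data points and input dimension (toy values)
m = 4096            # number of neurons; the paper requires m = Omega(n^{2+o(1)} log(n/delta))

X = rng.normal(size=(n, d)) / np.sqrt(d)           # data points x_1, ..., x_n
y = rng.normal(size=n)                             # labels y_1, ..., y_n (synthetic)

W = rng.normal(size=(d, m))                        # w_r(0) ~ N(0, I_d)
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)   # fixed random-sign weights (scaled for stability)

def F(W, X):
    """F(W, x_i) = sum_r a_r exp(<w_r, x_i>), evaluated for every row x_i of X."""
    return np.exp(X @ W) @ a

eta, eps = 1e-2, 0.1
for t in range(5000):
    residual = F(W, X) - y                      # shape (n,)
    if np.linalg.norm(residual) <= eps:         # stop once ||F(W(T), X) - y||_2 <= epsilon
        break
    # dL/dw_r = sum_i residual_i * a_r * exp(<w_r, x_i>) * x_i, with L = 0.5 ||F(W, X) - y||_2^2
    grad = X.T @ (np.exp(X @ W) * (residual[:, None] * a[None, :]))
    W -= eta * grad                             # only W is trained; a stays fixed

print("final residual norm:", np.linalg.norm(F(W, X) - y))
```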
OAI identifier: oai:arXiv.org:2303.16504
Last updated on 02/04/2023