An Over-parameterized Exponential Regression

Abstract

Over the past few years, there has been a significant amount of research focused on the ReLU activation function, with the aim of proving neural network convergence through over-parametrization. However, recent developments in Large Language Models (LLMs) have sparked interest in exponential activation functions, specifically in the attention mechanism. Mathematically, we define the neural function $F: \mathbb{R}^{d \times m} \times \mathbb{R}^d \rightarrow \mathbb{R}$ using an exponential activation function. We are given a set of data points with labels $\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\} \subset \mathbb{R}^d \times \mathbb{R}$, where $n$ denotes the number of data points. Here $F(W(t), x)$ can be expressed as $F(W(t), x) := \sum_{r=1}^m a_r \exp(\langle w_r(t), x \rangle)$, where $m$ represents the number of neurons and $w_r(t)$ are the weights at time $t$. Following the standard setting in the literature, the weights $a_r$ are fixed and never updated during training. We initialize the weights $W(0) \in \mathbb{R}^{d \times m}$ from a random Gaussian distribution, such that $w_r(0) \sim \mathcal{N}(0, I_d)$, and initialize $a_r$ from the random sign distribution for each $r \in [m]$. Using the gradient descent algorithm, we can find a weight $W(T)$ such that $\| F(W(T), X) - y \|_2 \leq \epsilon$ holds with probability $1 - \delta$, where $\epsilon \in (0, 0.1)$ and $m = \Omega(n^{2+o(1)} \log(n/\delta))$. To optimize the over-parameterization bound $m$, we employ several tight analysis techniques from previous studies [Song and Yang, arXiv 2019; Munteanu, Omlor, Song and Woodruff, ICML 2022].
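
To make the setup concrete, the following is a minimal NumPy sketch of the model and training procedure described above: Gaussian initialization of the $w_r$, fixed random-sign $a_r$, and plain gradient descent on the squared loss. The function names, learning rate, step count, and neuron count are illustrative assumptions, not the authors' implementation or the parameter regime analyzed in the paper.

    import numpy as np

    # Sketch only (assumed hyperparameters): F(W, x) = sum_r a_r * exp(<w_r, x>),
    # with w_r(0) ~ N(0, I_d), a_r ~ Rademacher and held fixed during training.

    def F(W, a, X):
        # W: (m, d) weights, a: (m,) fixed signs, X: (n, d) data points.
        return np.exp(X @ W.T) @ a                  # predictions, shape (n,)

    def train(X, y, m=1024, lr=1e-3, steps=2000, seed=0):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        W = rng.standard_normal((m, d))             # w_r(0) ~ N(0, I_d)
        a = rng.choice([-1.0, 1.0], size=m)         # random signs, never updated
        for _ in range(steps):
            E = np.exp(X @ W.T)                     # (n, m), entries exp(<w_r, x_i>)
            residual = E @ a - y                    # (n,), F(W, X) - y
            # Gradient of 0.5 * ||F(W, X) - y||_2^2 with respect to w_r:
            #   sum_i residual_i * a_r * exp(<w_r, x_i>) * x_i
            grad_W = (residual[:, None] * E * a[None, :]).T @ X   # (m, d)
            W -= lr * grad_W
        return W, a

For example, calling train on a small synthetic dataset (X of shape (n, d) with rows of modest norm, so that exp(<w_r, x>) stays bounded) runs gradient descent only on W while a stays at its random-sign initialization, mirroring the training regime stated in the abstract.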
