5 research outputs found

    GradientCoin: A Peer-to-Peer Decentralized Large Language Models

    Since the proposal of the Bitcoin electronic cash system in 2008, Bitcoin has fundamentally changed the economic system over the last decade. Since 2022, large language models (LLMs) such as GPT have outperformed humans in many real-life tasks. However, these large language models have several practical issues. For example, the model is centralized and controlled by a specific unit. One weakness is that if that unit decides to shut down the model, it can no longer be used. The second weakness is the lack of guaranteed integrity behind the model, as a dishonest unit may design its own model and feed it unhealthy training data. In this work, we propose a purely theoretical design of a decentralized LLM that operates similarly to the Bitcoin cash system. However, implementing such a system may encounter various practical difficulties. Furthermore, this new system is unlikely to perform better than the standard Bitcoin system in economic terms, so the motivation for designing such a system is limited. It is likely that only two types of people would be interested in setting up a practical system for it:
    ∙ Those who prefer to use decentralized ChatGPT-like software.
    ∙ Those who believe that the purpose of carbon-based life is to create silicon-based life, such as Optimus Prime in Transformers.
    The second type may be interested because it is possible that one day an AI system like this will awaken and become the next level of intelligence on this planet.

    Solving Tensor Low Cycle Rank Approximation

    Large language models have become ubiquitous in modern life, finding applications in various domains such as natural language processing, language translation, and speech recognition. Recently, a breakthrough work [Zhao, Panigrahi, Ge, and Arora, arXiv 2023] explains the attention model via probabilistic context-free grammars (PCFG). One of the central computational tasks in computing probabilities in a PCFG can be formulated as a particular tensor low-rank approximation problem, which we call tensor cycle rank. Given an $n \times n \times n$ third-order tensor $A$, we say that $A$ has cycle rank $k$ if there exist three $n \times k^2$ matrices $U$, $V$, and $W$ such that
    \begin{align*} A_{a,b,c} = \sum_{i=1}^k \sum_{j=1}^k \sum_{l=1}^k U_{a,i+k(j-1)} \otimes V_{b, j + k(l-1)} \otimes W_{c, l + k(i-1) } \end{align*}
    for all $a \in [n], b \in [n], c \in [n]$. The tensor classical rank, Tucker rank, and train rank have been well studied in [Song, Woodruff, Zhong, SODA 2019]. In this paper, we generalize the previous ``rotation and sketch'' technique on page 186 of [Song, Woodruff, Zhong, SODA 2019] and give an input-sparsity-time algorithm for cycle rank.
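    To make the definition concrete, here is a minimal NumPy sketch (illustrative, not from the paper) that evaluates the cycle-rank-$k$ reconstruction from factor matrices $U, V, W \in \mathbb{R}^{n \times k^2}$, reading the $\otimes$ in the definition as a product of the selected factor entries; all names and sizes below are arbitrary choices.

```python
import numpy as np

def cycle_rank_reconstruct(U, V, W, k):
    """Build an n x n x n tensor of cycle rank k from factors U, V, W (each n x k^2).

    A[a, b, c] = sum_{i,j,l} U[a, i + k*(j-1)] * V[b, j + k*(l-1)] * W[c, l + k*(i-1)]
    (1-indexed as in the abstract; 0-indexed below).
    """
    n = U.shape[0]
    A = np.zeros((n, n, n))
    for i in range(k):
        for j in range(k):
            for l in range(k):
                u = U[:, i + k * j]   # column indexed by the pair (i, j)
                v = V[:, j + k * l]   # column indexed by the pair (j, l)
                w = W[:, l + k * i]   # column indexed by the pair (l, i)
                A += np.einsum('a,b,c->abc', u, v, w)  # rank-1 update u x v x w
    return A

# usage: a random cycle-rank-2 tensor of size 5 x 5 x 5
n, k = 5, 2
rng = np.random.default_rng(0)
U, V, W = (rng.standard_normal((n, k * k)) for _ in range(3))
print(cycle_rank_reconstruct(U, V, W, k).shape)  # (5, 5, 5)
```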

    An Over-parameterized Exponential Regression

    Over the past few years, there has been a significant amount of research focused on studying the ReLU activation function, with the aim of achieving neural network convergence through over-parametrization. However, recent developments in the field of Large Language Models (LLMs) have sparked interest in the use of exponential activation functions, specifically in the attention mechanism. Mathematically, we define the neural function $F: \mathbb{R}^{d \times m} \times \mathbb{R}^d \rightarrow \mathbb{R}$ using an exponential activation function. We are given a set of data points with labels $\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\} \subset \mathbb{R}^d \times \mathbb{R}$, where $n$ denotes the number of data points. Here $F(W(t),x)$ can be expressed as $F(W(t),x) := \sum_{r=1}^m a_r \exp(\langle w_r, x \rangle)$, where $m$ represents the number of neurons and $w_r(t)$ are the weights at time $t$. It is standard in the literature that the $a_r$ are fixed weights that never change during training. We initialize the weights $W(0) \in \mathbb{R}^{d \times m}$ with random Gaussians, such that $w_r(0) \sim \mathcal{N}(0, I_d)$, and initialize each $a_r$ from the random sign distribution, for $r \in [m]$. Using the gradient descent algorithm, we can find weights $W(T)$ such that $\| F(W(T), X) - y \|_2 \leq \epsilon$ holds with probability $1-\delta$, where $\epsilon \in (0,0.1)$ and $m = \Omega(n^{2+o(1)}\log(n/\delta))$. To optimize the over-parameterization bound $m$, we employ several tight analysis techniques from previous studies [Song and Yang, arXiv 2019; Munteanu, Omlor, Song and Woodruff, ICML 2022].
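    The setup can be sketched in a few lines of NumPy (an illustration of the training problem, not the paper's analysis): Gaussian initialization of $W(0)$, random-sign $a_r$ held fixed, and plain gradient descent on the squared loss of $F(W,x) = \sum_r a_r \exp(\langle w_r, x \rangle)$. The toy sizes, step size, and iteration count below are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 20, 5, 200                        # samples, input dim, neurons (toy sizes)
X = rng.standard_normal((n, d)) / np.sqrt(d)
y = rng.standard_normal(n)

W = rng.standard_normal((d, m))             # w_r(0) ~ N(0, I_d), columns are w_r
a = rng.choice([-1.0, 1.0], size=m)         # a_r ~ random sign, fixed during training

def F(W):
    # F(W, x_i) = sum_r a_r * exp(<w_r, x_i>), computed for all i at once
    return np.exp(X @ W) @ a                # shape (n,)

eta, T = 1e-3, 2000                         # step size and iteration count (arbitrary)
for _ in range(T):
    E = np.exp(X @ W)                       # E[i, r] = exp(<w_r, x_i>)
    residual = E @ a - y                    # F(W, x_i) - y_i
    grad = X.T @ (residual[:, None] * E * a[None, :])  # grad of 0.5*||F(W,X)-y||_2^2
    W -= eta * grad

print(np.linalg.norm(F(W) - y))             # training error after T steps
```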

    A Fast Optimization View: Reformulating Single Layer Attention in LLM Based on Tensor and SVM Trick, and Solving It in Matrix Multiplication Time

    Large language models (LLMs) have played a pivotal role in revolutionizing various facets of our daily existence. Solving attention regression is a fundamental task in optimizing LLMs. In this work, we focus on giving a provable guarantee for the one-layer attention network objective function $L(X,Y) = \sum_{j_0 = 1}^n \sum_{i_0 = 1}^d ( \langle \langle \exp( \mathsf{A}_{j_0} x ) , {\bf 1}_n \rangle^{-1} \exp( \mathsf{A}_{j_0} x ), A_{3} Y_{*,i_0} \rangle - b_{j_0,i_0} )^2$. Here $\mathsf{A} \in \mathbb{R}^{n^2 \times d^2}$ is the Kronecker product of $A_1 \in \mathbb{R}^{n \times d}$ and $A_2 \in \mathbb{R}^{n \times d}$, $A_3$ is a matrix in $\mathbb{R}^{n \times d}$, and $\mathsf{A}_{j_0} \in \mathbb{R}^{n \times d^2}$ is the $j_0$-th block of $\mathsf{A}$. The matrices $X, Y \in \mathbb{R}^{d \times d}$ are the variables we want to learn; $B \in \mathbb{R}^{n \times d}$, with $b_{j_0,i_0} \in \mathbb{R}$ the entry in its $j_0$-th row and $i_0$-th column; $Y_{*,i_0} \in \mathbb{R}^d$ is the $i_0$-th column of $Y$; and $x \in \mathbb{R}^{d^2}$ is the vectorization of $X$. In a multi-layer LLM network, the matrix $B \in \mathbb{R}^{n \times d}$ can be viewed as the output of a layer and $A_1 = A_2 = A_3 \in \mathbb{R}^{n \times d}$ as the input of a layer. The matrix version of $x$ can be viewed as $QK^\top$ and $Y$ can be viewed as $V$. We provide an iterative greedy algorithm that trains the loss function $L(X,Y)$ up to accuracy $\epsilon$ in time $\widetilde{O}( ({\cal T}_{\mathrm{mat}}(n,n,d) + {\cal T}_{\mathrm{mat}}(n,d,d) + d^{2\omega}) \log(1/\epsilon) )$. Here ${\cal T}_{\mathrm{mat}}(a,b,c)$ denotes the time of multiplying an $a \times b$ matrix by a $b \times c$ matrix, and $\omega \approx 2.37$ denotes the exponent of matrix multiplication.
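    Under one standard convention for the Kronecker product and row-major vectorization, the objective coincides with the softmax-attention regression $\| D^{-1} \exp(A_1 X A_2^\top) A_3 Y - B \|_F^2$ with $D = \mathrm{diag}(\exp(A_1 X A_2^\top) {\bf 1}_n)$. The NumPy sketch below (an illustration, not the paper's algorithm) evaluates $L(X,Y)$ both ways on random toy inputs to show that the two forms agree.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 3
A1, A2, A3 = (rng.standard_normal((n, d)) for _ in range(3))
B = rng.standard_normal((n, d))
X, Y = rng.standard_normal((d, d)), rng.standard_normal((d, d))

# Direct (matrix) form: || D^{-1} exp(A1 X A2^T) A3 Y - B ||_F^2
S = np.exp(A1 @ X @ A2.T)                       # n x n, entrywise exponential
S_norm = S / S.sum(axis=1, keepdims=True)       # softmax normalization of each row
loss_direct = np.sum((S_norm @ A3 @ Y - B) ** 2)

# Kronecker form from the abstract: A = A1 kron A2, x = vec(X)
A = np.kron(A1, A2)                             # n^2 x d^2
x = X.flatten()                                 # row-major vectorization of X
loss_kron = 0.0
for j0 in range(n):
    Aj0 = A[j0 * n:(j0 + 1) * n, :]             # j0-th n x d^2 block of A
    u = np.exp(Aj0 @ x)                         # exp(A_{j0} x) in R^n
    u = u / u.sum()                             # <exp(A_{j0} x), 1_n>^{-1} exp(A_{j0} x)
    for i0 in range(d):
        loss_kron += (u @ (A3 @ Y[:, i0]) - B[j0, i0]) ** 2

print(np.isclose(loss_direct, loss_kron))       # True: the two forms agree
```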

    Improved Reconstruction for Fourier-Sparse Signals

    We revisit the classical problem of Fourier-sparse signal reconstruction -- a variant of the \emph{Set Query} problem -- which asks to efficiently reconstruct (a subset of) a $d$-dimensional Fourier-sparse signal ($\|\hat{x}(t)\|_0 \leq k$) from a minimum number of \emph{noisy} samples of $x(t)$ in the time domain. We present a unified framework for this problem by developing a theory of sparse Fourier transforms (SFT) for frequencies lying on a \emph{lattice}, which can be viewed as a ``semi-continuous'' version of SFT in between discrete and continuous domains. Using this framework, we obtain the following results:
    ∙ **Dimension-free Fourier sparse recovery.** We present a sample-optimal discrete Fourier Set-Query algorithm with $O(k^{\omega+1})$ reconstruction time in one dimension, \emph{independent} of the signal's length ($n$) and $\ell_\infty$-norm. This complements the state-of-the-art algorithm of [Kapralov, STOC 2017], whose reconstruction time is $\tilde{O}(k \log^2 n \log R^*)$, where $R^* \approx \|\hat{x}\|_\infty$ is a signal-dependent parameter, and which is limited to low dimensions. By contrast, our algorithm works in arbitrary dimension $d$, mitigating the $\exp(d)$ blowup in decoding time to merely linear in $d$. A key component of our algorithm is fast spectral sparsification of the Fourier basis.
    ∙ **High-accuracy Fourier interpolation.** In one dimension, we design a polynomial-time $(3+\sqrt{2}+\epsilon)$-approximation algorithm for continuous Fourier interpolation. This bypasses a barrier of all previous algorithms [Price and Song, FOCS 2015; Chen, Kane, Price and Song, FOCS 2016], which only achieve a $c > 100$ approximation for this basic problem. Our main contribution is a new analytic tool for hierarchical frequency decomposition based on \emph{noise cancellation}.
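    To make the set-query setting concrete, the NumPy sketch below (a toy illustration under simplifying assumptions, not the paper's algorithm) assumes the candidate frequency support of size $k$ is already known and recovers the corresponding Fourier coefficients of a noisy length-$n$ signal by least squares on $O(k)$ random time samples.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 1024, 4
freqs = rng.choice(n, size=k, replace=False)        # known (queried) frequency support
coeffs_true = rng.standard_normal(k) + 1j * rng.standard_normal(k)

# x(t) = sum_f coeffs[f] * exp(2*pi*i*f*t/n), observed with additive noise
t = np.arange(n)
x = np.exp(2j * np.pi * np.outer(t, freqs) / n) @ coeffs_true
x_noisy = x + 0.01 * (rng.standard_normal(n) + 1j * rng.standard_normal(n))

# set query: take m = O(k) random time samples and solve a small least-squares problem
m = 8 * k
samples = rng.choice(n, size=m, replace=False)
F = np.exp(2j * np.pi * np.outer(samples, freqs) / n)  # m x k Fourier submatrix
coeffs_hat, *_ = np.linalg.lstsq(F, x_noisy[samples], rcond=None)

print(np.linalg.norm(coeffs_hat - coeffs_true))        # small reconstruction error
```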