GradientCoin: A Peer-to-Peer Decentralized Large Language Models
Since the proposal of the Bitcoin electronic cash system in 2008, Bitcoin
has fundamentally changed the economic system over the past decade. Since 2022,
large language models (LLMs) such as GPT have outperformed humans in many
real-life tasks. However, these large language models have several practical
issues. For example, the model is centralized and controlled by a specific
unit. One weakness is that if that unit decides to shut down the model, it
cannot be used anymore. The second weakness is the lack of guaranteed
integrity behind such a model, as a dishonest unit may design its own
model and feed it unhealthy training data.
In this work, we propose a purely theoretical design of a decentralized LLM
that operates similarly to a Bitcoin cash system. However, implementing such a
system might encounter various practical difficulties. Furthermore, this new
system is unlikely to perform better than the standard Bitcoin system
economically. Therefore, the motivation for designing such a system is limited. It
is likely that only two types of people would be interested in setting up a
practical system for it:
- Those who prefer to use decentralized ChatGPT-like software.
- Those who believe that the purpose of carbon-based life is to create
  silicon-based life, such as Optimus Prime in Transformers.
The second type of people may be interested because it is possible that one
day an AI system like this will awaken and become the next level of
intelligence on this planet.
An Over-parameterized Exponential Regression
Over the past few years, there has been a significant amount of research
focused on studying the ReLU activation function, with the aim of achieving
neural network convergence through over-parameterization. However, recent
developments in the field of Large Language Models (LLMs) have sparked interest
in the use of exponential activation functions, specifically in the attention
mechanism.
Mathematically, we define the neural function $F : \mathbb{R}^{d \times m} \times \mathbb{R}^d \rightarrow \mathbb{R}$
using an exponential activation function. We are given a set of data points with labels
$\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\} \subset \mathbb{R}^d \times \mathbb{R}$,
where $n$ denotes the number of data points. Here $F(W(t), x)$ can be expressed as
$F(W(t), x) := \sum_{r=1}^m a_r \exp(\langle w_r(t), x \rangle)$, where $m$ represents the number
of neurons and the $w_r(t) \in \mathbb{R}^d$ are the weights at time $t$. It is standard in the
literature that the $a_r$ are fixed weights that are never changed during training. We
initialize the weights $w_r(0)$ with random Gaussian distributions and initialize $a_r$
from a random sign distribution, for each $r \in [m]$.
Using the gradient descent algorithm, we can find weights $W(T)$ such that
$\|F(W(T), X) - y\|_2 \le \epsilon$ holds with probability at least $1 - \delta$, where
$F(W(T), X) \in \mathbb{R}^n$ stacks the predictions on the $n$ data points and $\epsilon$ is the
target accuracy, provided the number of neurons $m$ is sufficiently over-parameterized. To optimize
the over-parameterization bound on $m$, we employ several tight analysis
techniques from previous studies [Song and Yang, arXiv 2019; Munteanu, Omlor,
Song and Woodruff, ICML 2022].
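To make the setup concrete, here is a minimal NumPy sketch (an illustration with assumed toy sizes, step size, and iteration count, not the analysis or parameter regime of the paper) of the exponential-activation network $F(W, x) = \sum_{r=1}^m a_r \exp(\langle w_r, x\rangle)$ with Gaussian-initialized $w_r$, random-sign $a_r$ held fixed, and plain gradient descent on the squared loss:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 32, 8, 256                 # illustrative sizes: data points, input dim, neurons

X = rng.standard_normal((n, d)) / np.sqrt(d)   # data points x_1, ..., x_n
y = rng.standard_normal(n)                     # labels y_1, ..., y_n

W = rng.standard_normal((m, d))                # w_r(0) drawn from a Gaussian, trained
a = rng.choice([-1.0, 1.0], size=m)            # a_r drawn from a random sign distribution, fixed

def F(W, X):
    """Exponential-activation network: F(W, x) = sum_r a_r * exp(<w_r, x>), row-wise over X."""
    return np.exp(X @ W.T) @ a                 # shape (n,)

eta = 1e-3
for _ in range(500):
    resid = F(W, X) - y                        # F(W, x_i) - y_i
    # Gradient of 0.5 * ||F(W, X) - y||_2^2 with respect to W; a is never updated.
    grad = (np.exp(X @ W.T) * a).T @ (resid[:, None] * X)   # shape (m, d)
    W -= eta * grad

print("final residual norm:", np.linalg.norm(F(W, X) - y))
```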
Solving Tensor Low Cycle Rank Approximation
Large language models have become ubiquitous in modern life, finding
applications in various domains such as natural language processing, language
translation, and speech recognition. Recently, a breakthrough work [Zhao,
Panigrahi, Ge, and Arora Arxiv 2023] explains the attention model from
probabilistic context-free grammar (PCFG). One of the central computational tasks
in computing probabilities in a PCFG is a particular tensor low-rank
approximation problem, which we call tensor cycle rank. Given a third-order tensor
$A \in \mathbb{R}^{n \times n \times n}$, we say that $A$ has cycle rank $k$ if there
exist three $n \times k^2$ matrices $U$, $V$, and $W$ such that for each entry
\begin{align*} A_{a,b,c} = \sum_{i=1}^k \sum_{j=1}^k \sum_{l=1}^k
U_{a,\,i+k(j-1)} \otimes V_{b,\, j + k(l-1)} \otimes W_{c,\, l + k(i-1) }
\end{align*} for all $a \in [n]$, $b \in [n]$, $c \in [n]$. The analogous
approximation problems for the tensor classical rank, Tucker rank and train rank
have been well studied in [Song, Woodruff, Zhong SODA 2019]. In this paper, we
generalize the previous ``rotation and sketch'' technique on page 186 of [Song,
Woodruff, Zhong SODA 2019] and show an input sparsity time algorithm for cycle rank.
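For concreteness, the following NumPy sketch assembles a tensor of cycle rank at most $k$ from matrices $U, V, W \in \mathbb{R}^{n \times k^2}$ exactly as in the definition above (shifted to 0-based indexing); it is a didactic illustration of the definition, not the sketching algorithm of the paper.

```python
import numpy as np

def cycle_rank_tensor(U, V, W, k):
    """A[a, b, c] = sum_{i, j, l in [k]} U[a, i + k*j] * V[b, j + k*l] * W[c, l + k*i]
    (the cycle-rank-k definition above, with 0-based indices)."""
    n = U.shape[0]
    A = np.zeros((n, n, n))
    for i in range(k):
        for j in range(k):
            for l in range(k):
                A += np.einsum('a,b,c->abc',          # outer product of three columns
                               U[:, i + k * j],
                               V[:, j + k * l],
                               W[:, l + k * i])
    return A

rng = np.random.default_rng(0)
n, k = 5, 2
U, V, W = (rng.standard_normal((n, k * k)) for _ in range(3))
A = cycle_rank_tensor(U, V, W, k)   # a 5 x 5 x 5 tensor of cycle rank at most 2
print(A.shape)
```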
Binary Hypothesis Testing for Softmax Models and Leverage Score Models
Softmax distributions are widely used in machine learning, including Large
Language Models (LLMs) where the attention unit uses softmax distributions. We
abstract the attention unit as the softmax model, where given a vector input,
the model produces an output drawn from the softmax distribution (which depends
on the vector input). We consider the fundamental problem of binary hypothesis
testing in the setting of softmax models. That is, given an unknown softmax
model, which is known to be one of the two given softmax models, how many
queries are needed to determine which one is the truth? We show that the sample
complexity is governed asymptotically by a certain distance between the
parameters of the two models.
Furthermore, we draw an analogy between the softmax model and the leverage score
model, an important tool for algorithm design in linear algebra and graph
theory. The leverage score model, at a high level, is a model which, given a
vector input, produces an output drawn from a distribution that depends on the
input. We obtain similar results for the binary hypothesis testing problem for
leverage score models.
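As an illustration of the query model (with hypothetical parameters and a generic log-likelihood-ratio test, not the specific procedure or bound from the paper), the sketch below draws repeated outputs from an unknown softmax model at a fixed query input and decides between the two candidate parameter sets:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

d, n_out = 4, 6
Theta1 = rng.standard_normal((n_out, d))                   # hypothesis 1 (illustrative parameters)
Theta2 = Theta1 + 0.3 * rng.standard_normal((n_out, d))    # hypothesis 2, a nearby model

truth = Theta1                                 # unknown to the tester
x = rng.standard_normal(d)                     # query input

q = 200                                        # number of queries
p_true = softmax(truth @ x)
samples = rng.choice(n_out, size=q, p=p_true)  # outputs drawn from the softmax distribution

# Generic log-likelihood-ratio test between the two candidate softmax models.
logp1 = np.log(softmax(Theta1 @ x))
logp2 = np.log(softmax(Theta2 @ x))
llr = (logp1[samples] - logp2[samples]).sum()
print("decide: model", 1 if llr > 0 else 2)
```

The closer the two parameter sets are, the more queries `q` such a test needs before it becomes reliable; that dependence on the distance between the parameters is what the sample-complexity result quantifies.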
A Fast Optimization View: Reformulating Single Layer Attention in LLM Based on Tensor and SVM Trick, and Solving It in Matrix Multiplication Time
Large language models (LLMs) have played a pivotal role in revolutionizing
various facets of our daily existence. Solving attention regression is a
fundamental task in optimizing LLMs. In this work, we focus on giving a
provable guarantee for the one-layer attention network objective function
\begin{align*} L(X,Y) = \sum_{j_0=1}^n \sum_{i_0=1}^d \Big( \big\langle \langle \exp(\mathsf{A}_{j_0} x), \mathbf{1}_n \rangle^{-1} \exp(\mathsf{A}_{j_0} x), \; A_3 Y_{*,i_0} \big\rangle - b_{j_0,i_0} \Big)^2 . \end{align*}
Here $\mathsf{A} \in \mathbb{R}^{n^2 \times d^2}$ is the Kronecker product between $A_1 \in \mathbb{R}^{n \times d}$ and
$A_2 \in \mathbb{R}^{n \times d}$. $A_3$ is a matrix in $\mathbb{R}^{n \times d}$, and $\mathsf{A}_{j_0} \in \mathbb{R}^{n \times d^2}$ is the $j_0$-th block of
$\mathsf{A}$. The matrices $X, Y \in \mathbb{R}^{d \times d}$ are the variables we want to
learn. $B \in \mathbb{R}^{n \times d}$, and $b_{j_0,i_0} \in \mathbb{R}$ is the
entry at the $j_0$-th row and $i_0$-th column of $B$;
$Y_{*,i_0} \in \mathbb{R}^d$ is the $i_0$-th column vector of $Y$, and $x \in \mathbb{R}^{d^2}$ is the
vectorization of $X$.
In a multi-layer LLM network, the matrix $B \in \mathbb{R}^{n \times d}$ can
be viewed as the output of a layer, and $A_1 = A_2 = A_3 \in \mathbb{R}^{n \times d}$ can be viewed as the input of a layer. The matrix version of $x$ can
be viewed as the combined query-key weights, and $Y$ can be viewed as the value weights. We provide an iterative
greedy algorithm that trains the loss function $L(X,Y)$ up to accuracy $\epsilon$ and runs in
matrix multiplication time. Here $\mathcal{T}_{\mathrm{mat}}(a,b,c)$ denotes the time of multiplying an $a \times b$ matrix by
another $b \times c$ matrix, and $\omega$ denotes the exponent of
matrix multiplication.
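Under the identification above ($x = \mathrm{vec}(X)$, $\mathsf{A} = A_1 \otimes A_2$, so the $j_0$-th normalized term is the $j_0$-th row of a row-wise softmax), the objective can be evaluated in matrix form. The NumPy sketch below only evaluates this loss on random data; it is a sketch under that identification and is not the paper's iterative greedy training algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 16, 4

A1 = rng.standard_normal((n, d))     # layer input (the abstract takes A1 = A2 = A3)
A2, A3 = A1, A1
B = rng.standard_normal((n, d))      # layer output to regress onto
X = rng.standard_normal((d, d))      # variable X (matrix version of x = vec(X))
Y = rng.standard_normal((d, d))      # variable Y

def attention_loss(X, Y):
    """L(X, Y) = || D^{-1} exp(A1 X A2^T) A3 Y - B ||_F^2, where D rescales each row of
    exp(A1 X A2^T) to sum to one; row j0 of the rescaled matrix plays the role of
    <exp(A_{j0} x), 1_n>^{-1} exp(A_{j0} x) in the objective above."""
    S = np.exp(A1 @ X @ A2.T)                 # n x n exponentiated attention logits
    S = S / S.sum(axis=1, keepdims=True)      # row-wise normalization (softmax)
    resid = S @ A3 @ Y - B                    # n x d residual
    return np.sum(resid ** 2)

print(attention_loss(X, Y))
```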
Improved Reconstruction for Fourier-Sparse Signals
We revisit the classical problem of Fourier-sparse signal reconstruction -- a
variant of the \emph{Set Query} problem -- which asks to efficiently
reconstruct (a subset of) a $d$-dimensional Fourier-sparse signal $x(t)$
from a minimum number of \emph{noisy} samples of $x(t)$ in the
time domain. We present a unified framework for this problem by developing a
theory of sparse Fourier transforms (SFT) for frequencies lying on a
\emph{lattice}, which can be viewed as a ``semi-continuous'' version of SFT in
between discrete and continuous domains. Using this framework, we obtain the
following results:
**Dimension-free Fourier sparse recovery.** We present a
sample-optimal discrete Fourier Set-Query algorithm whose
reconstruction time in one dimension is \emph{independent} of the signal's length
($n$) and norm. This complements the state-of-the-art algorithm of
[Kapralov, STOC 2017], whose reconstruction time involves a signal-dependent parameter,
and which is limited to low dimensions. By contrast, our algorithm
works for arbitrary dimensions $d$, mitigating the dimension-dependent blowup in decoding
time to merely linear in $d$. A key component in our algorithm is fast spectral
sparsification of the Fourier basis.
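To illustrate the Set-Query formulation only (the toy below uses naive least squares on arbitrary sizes, with none of the sample-optimality, dimension-free runtime, or spectral-sparsification machinery of the result above), one can recover the Fourier coefficients on a queried frequency set from a few noisy time-domain samples:

```python
import numpy as np

rng = np.random.default_rng(0)
N, k, m = 1024, 8, 64        # signal length, sparsity, number of time-domain samples (toy sizes)

# A Fourier-sparse signal: its spectrum is supported on k frequencies.
freqs = rng.choice(N, size=k, replace=False)
coeffs = rng.standard_normal(k) + 1j * rng.standard_normal(k)

def signal(t):
    # x(t) = sum_f coeffs_f * exp(2*pi*i*f*t / N)
    return np.exp(2j * np.pi * np.outer(t, freqs) / N) @ coeffs

# Noisy samples at m random time points.
t = rng.choice(N, size=m, replace=False)
samples = signal(t) + 0.01 * (rng.standard_normal(m) + 1j * rng.standard_normal(m))

# Set query: given a frequency set S to query (here the true support), estimate the
# Fourier coefficients on S by least squares against the sampled Fourier matrix.
S = freqs
F = np.exp(2j * np.pi * np.outer(t, S) / N)
est, *_ = np.linalg.lstsq(F, samples, rcond=None)
print("max coefficient error:", np.max(np.abs(est - coeffs)))
```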
**High-accuracy Fourier interpolation.** In one dimension, we design
a poly-time high-accuracy approximation algorithm for continuous
Fourier interpolation. This bypasses a barrier of all previous algorithms
[Price and Song, FOCS 2015; Chen, Kane, Price and Song, FOCS 2016], which only
achieve a constant-factor approximation for this basic problem. Our main contribution is
a new analytic tool for hierarchical frequency decomposition based on
\emph{noise cancellation}.
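For intuition on what continuous Fourier interpolation asks (fit a $k$-sparse exponential sum to noisy time-domain samples on an interval), here is a toy baseline using generic nonlinear least squares from a near-true initialization; all names and sizes are illustrative, and this is unrelated to the noise-cancellation-based algorithm and its approximation guarantee.

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)
k, T, m = 3, 1.0, 200                        # sparsity, time window [0, T], number of samples

# Ground truth: x*(t) = sum_j v_j * exp(2*pi*i*f_j*t), with k terms.
f_true = np.array([3.2, 7.9, 15.4])
v_true = np.array([1.0 + 0.5j, -0.7 + 0.2j, 0.3 - 1.1j])

t = np.sort(rng.uniform(0, T, m))
clean = np.exp(2j * np.pi * np.outer(t, f_true)) @ v_true
y = clean + 0.05 * (rng.standard_normal(m) + 1j * rng.standard_normal(m))   # noisy samples

def residual(params):
    # params = [f_1..f_k, Re(v_1)..Re(v_k), Im(v_1)..Im(v_k)]
    f = params[:k]
    v = params[k:2 * k] + 1j * params[2 * k:]
    r = np.exp(2j * np.pi * np.outer(t, f)) @ v - y
    return np.concatenate([r.real, r.imag])       # real residual vector for the solver

# Generic nonlinear least-squares fit, started near the truth so the toy converges.
x0 = np.concatenate([f_true + 0.1, np.ones(k), np.zeros(k)])
fit = least_squares(residual, x0)
print("recovered frequencies:", np.round(fit.x[:k], 2))
```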