5 research outputs found

    GradientCoin: A Peer-to-Peer Decentralized Large Language Models

    Since the proposal of the Bitcoin electronic cash system in 2008, Bitcoin has fundamentally changed the economic system over the last decade. Since 2022, large language models (LLMs) such as GPT have outperformed humans in many real-life tasks. However, these large language models have several practical issues. For example, the model is centralized and controlled by a specific unit. One weakness is that if that unit decides to shut down the model, it can no longer be used. The second weakness is the lack of guaranteed integrity behind the model, as a dishonest unit may design its own model and feed it unhealthy training data. In this work, we propose a purely theoretical design of a decentralized LLM that operates similarly to the Bitcoin cash system. However, implementing such a system may encounter various practical difficulties. Furthermore, this new system is unlikely to perform better than the standard Bitcoin system in economic terms, so the motivation for designing such a system is limited. It is likely that only two types of people would be interested in setting up a practical system for it:
    ∙ Those who prefer to use decentralized ChatGPT-like software.
    ∙ Those who believe that the purpose of carbon-based life is to create silicon-based life, such as Optimus Prime in Transformers.
    The second type may be interested because it is possible that one day an AI system like this will awaken and become the next level of intelligence on this planet.

    Solving Tensor Low Cycle Rank Approximation

    Large language models have become ubiquitous in modern life, finding applications in various domains such as natural language processing, language translation, and speech recognition. Recently, a breakthrough work [Zhao, Panigrahi, Ge, and Arora, arXiv 2023] explains the attention model via probabilistic context-free grammars (PCFG). One of the central computational tasks in computing probabilities in a PCFG can be formulated as a particular tensor low-rank approximation problem, which we call tensor cycle rank. Given an $n \times n \times n$ third-order tensor $A$, we say that $A$ has cycle rank $k$ if there exist three $n \times k^2$ matrices $U$, $V$, and $W$ such that
    \begin{align*} A_{a,b,c} = \sum_{i=1}^k \sum_{j=1}^k \sum_{l=1}^k U_{a,i+k(j-1)} \otimes V_{b, j + k(l-1)} \otimes W_{c, l + k(i-1) } \end{align*}
    for all $a \in [n], b \in [n], c \in [n]$. The tensor classical rank, Tucker rank, and train rank have been well studied in [Song, Woodruff, Zhong, SODA 2019]. In this paper, we generalize the previous ``rotation and sketch'' technique on page 186 of [Song, Woodruff, Zhong, SODA 2019] and give an input-sparsity-time algorithm for cycle rank.
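    To make the definition concrete, here is a minimal NumPy sketch (illustrative, not from the paper) that evaluates the cycle-rank-$k$ reconstruction from factor matrices $U, V, W \in \mathbb{R}^{n \times k^2}$, reading the $\otimes$ in the definition as a product of the selected factor entries; all names and sizes below are arbitrary choices.

```python
import numpy as np

def cycle_rank_reconstruct(U, V, W, k):
    """Build an n x n x n tensor of cycle rank k from factors U, V, W (each n x k^2).

    A[a, b, c] = sum_{i,j,l} U[a, i + k*(j-1)] * V[b, j + k*(l-1)] * W[c, l + k*(i-1)]
    (1-indexed as in the abstract; 0-indexed below).
    """
    n = U.shape[0]
    A = np.zeros((n, n, n))
    for i in range(k):
        for j in range(k):
            for l in range(k):
                u = U[:, i + k * j]   # column indexed by the pair (i, j)
                v = V[:, j + k * l]   # column indexed by the pair (j, l)
                w = W[:, l + k * i]   # column indexed by the pair (l, i)
                A += np.einsum('a,b,c->abc', u, v, w)  # rank-1 update u x v x w
    return A

# usage: a random cycle-rank-2 tensor of size 5 x 5 x 5
n, k = 5, 2
rng = np.random.default_rng(0)
U, V, W = (rng.standard_normal((n, k * k)) for _ in range(3))
print(cycle_rank_reconstruct(U, V, W, k).shape)  # (5, 5, 5)
```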

    An Over-parameterized Exponential Regression

    Over the past few years, there has been a significant amount of research focused on studying the ReLU activation function, with the aim of achieving neural network convergence through over-parametrization. However, recent developments in the field of Large Language Models (LLMs) have sparked interest in the use of exponential activation functions, specifically in the attention mechanism. Mathematically, we define the neural function $F: \mathbb{R}^{d \times m} \times \mathbb{R}^d \rightarrow \mathbb{R}$ using an exponential activation function. We are given a set of data points with labels $\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\} \subset \mathbb{R}^d \times \mathbb{R}$, where $n$ denotes the number of data points. Here $F(W(t),x)$ can be expressed as $F(W(t),x) := \sum_{r=1}^m a_r \exp(\langle w_r, x \rangle)$, where $m$ represents the number of neurons and $w_r(t)$ are the weights at time $t$. It is standard in the literature that the $a_r$ are fixed weights that never change during training. We initialize the weights $W(0) \in \mathbb{R}^{d \times m}$ with random Gaussians, such that $w_r(0) \sim \mathcal{N}(0, I_d)$, and initialize each $a_r$ from the random sign distribution, for $r \in [m]$. Using the gradient descent algorithm, we can find weights $W(T)$ such that $\| F(W(T), X) - y \|_2 \leq \epsilon$ holds with probability $1-\delta$, where $\epsilon \in (0,0.1)$ and $m = \Omega(n^{2+o(1)}\log(n/\delta))$. To optimize the over-parameterization bound $m$, we employ several tight analysis techniques from previous studies [Song and Yang, arXiv 2019; Munteanu, Omlor, Song and Woodruff, ICML 2022].
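    The setup can be sketched in a few lines of NumPy (an illustration of the training problem, not the paper's analysis): Gaussian initialization of $W(0)$, random-sign $a_r$ held fixed, and plain gradient descent on the squared loss of $F(W,x) = \sum_r a_r \exp(\langle w_r, x \rangle)$. The toy sizes, step size, and iteration count below are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 20, 5, 200                        # samples, input dim, neurons (toy sizes)
X = rng.standard_normal((n, d)) / np.sqrt(d)
y = rng.standard_normal(n)

W = rng.standard_normal((d, m))             # w_r(0) ~ N(0, I_d), columns are w_r
a = rng.choice([-1.0, 1.0], size=m)         # a_r ~ random sign, fixed during training

def F(W):
    # F(W, x_i) = sum_r a_r * exp(<w_r, x_i>), computed for all i at once
    return np.exp(X @ W) @ a                # shape (n,)

eta, T = 1e-3, 2000                         # step size and iteration count (arbitrary)
for _ in range(T):
    E = np.exp(X @ W)                       # E[i, r] = exp(<w_r, x_i>)
    residual = E @ a - y                    # F(W, x_i) - y_i
    grad = X.T @ (residual[:, None] * E * a[None, :])  # grad of 0.5*||F(W,X)-y||_2^2
    W -= eta * grad

print(np.linalg.norm(F(W) - y))             # training error after T steps
```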

    A Fast Optimization View: Reformulating Single Layer Attention in LLM Based on Tensor and SVM Trick, and Solving It in Matrix Multiplication Time

    Large language models (LLMs) have played a pivotal role in revolutionizing various facets of our daily existence. Solving attention regression is a fundamental task in optimizing LLMs. In this work, we focus on giving a provable guarantee for the one-layer attention network objective function $L(X,Y) = \sum_{j_0 = 1}^n \sum_{i_0 = 1}^d ( \langle \langle \exp( \mathsf{A}_{j_0} x ) , {\bf 1}_n \rangle^{-1} \exp( \mathsf{A}_{j_0} x ), A_{3} Y_{*,i_0} \rangle - b_{j_0,i_0} )^2$. Here $\mathsf{A} \in \mathbb{R}^{n^2 \times d^2}$ is the Kronecker product of $A_1 \in \mathbb{R}^{n \times d}$ and $A_2 \in \mathbb{R}^{n \times d}$, $A_3$ is a matrix in $\mathbb{R}^{n \times d}$, and $\mathsf{A}_{j_0} \in \mathbb{R}^{n \times d^2}$ is the $j_0$-th block of $\mathsf{A}$. The matrices $X, Y \in \mathbb{R}^{d \times d}$ are the variables we want to learn; $B \in \mathbb{R}^{n \times d}$, with $b_{j_0,i_0} \in \mathbb{R}$ the entry in its $j_0$-th row and $i_0$-th column; $Y_{*,i_0} \in \mathbb{R}^d$ is the $i_0$-th column of $Y$; and $x \in \mathbb{R}^{d^2}$ is the vectorization of $X$. In a multi-layer LLM network, the matrix $B \in \mathbb{R}^{n \times d}$ can be viewed as the output of a layer and $A_1 = A_2 = A_3 \in \mathbb{R}^{n \times d}$ as the input of a layer. The matrix version of $x$ can be viewed as $QK^\top$ and $Y$ can be viewed as $V$. We provide an iterative greedy algorithm that trains the loss function $L(X,Y)$ up to accuracy $\epsilon$ in time $\widetilde{O}( ({\cal T}_{\mathrm{mat}}(n,n,d) + {\cal T}_{\mathrm{mat}}(n,d,d) + d^{2\omega}) \log(1/\epsilon) )$. Here ${\cal T}_{\mathrm{mat}}(a,b,c)$ denotes the time of multiplying an $a \times b$ matrix by a $b \times c$ matrix, and $\omega \approx 2.37$ denotes the exponent of matrix multiplication.
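    Under one standard convention for the Kronecker product and row-major vectorization, the objective coincides with the softmax-attention regression $\| D^{-1} \exp(A_1 X A_2^\top) A_3 Y - B \|_F^2$ with $D = \mathrm{diag}(\exp(A_1 X A_2^\top) {\bf 1}_n)$. The NumPy sketch below (an illustration, not the paper's algorithm) evaluates $L(X,Y)$ both ways on random toy inputs to show that the two forms agree.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 3
A1, A2, A3 = (rng.standard_normal((n, d)) for _ in range(3))
B = rng.standard_normal((n, d))
X, Y = rng.standard_normal((d, d)), rng.standard_normal((d, d))

# Direct (matrix) form: || D^{-1} exp(A1 X A2^T) A3 Y - B ||_F^2
S = np.exp(A1 @ X @ A2.T)                       # n x n, entrywise exponential
S_norm = S / S.sum(axis=1, keepdims=True)       # softmax normalization of each row
loss_direct = np.sum((S_norm @ A3 @ Y - B) ** 2)

# Kronecker form from the abstract: A = A1 kron A2, x = vec(X)
A = np.kron(A1, A2)                             # n^2 x d^2
x = X.flatten()                                 # row-major vectorization of X
loss_kron = 0.0
for j0 in range(n):
    Aj0 = A[j0 * n:(j0 + 1) * n, :]             # j0-th n x d^2 block of A
    u = np.exp(Aj0 @ x)                         # exp(A_{j0} x) in R^n
    u = u / u.sum()                             # <exp(A_{j0} x), 1_n>^{-1} exp(A_{j0} x)
    for i0 in range(d):
        loss_kron += (u @ (A3 @ Y[:, i0]) - B[j0, i0]) ** 2

print(np.isclose(loss_direct, loss_kron))       # True: the two forms agree
```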

    Improved Reconstruction for Fourier-Sparse Signals

    We revisit the classical problem of Fourier-sparse signal reconstruction -- a variant of the \emph{Set Query} problem -- which asks to efficiently reconstruct (a subset of) a $d$-dimensional Fourier-sparse signal ($\|\hat{x}(t)\|_0 \leq k$) from a minimum number of \emph{noisy} samples of $x(t)$ in the time domain. We present a unified framework for this problem by developing a theory of sparse Fourier transforms (SFT) for frequencies lying on a \emph{lattice}, which can be viewed as a ``semi-continuous'' version of SFT in between discrete and continuous domains. Using this framework, we obtain the following results:
    ∙ **Dimension-free Fourier sparse recovery.** We present a sample-optimal discrete Fourier Set-Query algorithm with $O(k^{\omega+1})$ reconstruction time in one dimension, \emph{independent} of the signal's length ($n$) and $\ell_\infty$-norm. This complements the state-of-the-art algorithm of [Kapralov, STOC 2017], whose reconstruction time is $\tilde{O}(k \log^2 n \log R^*)$, where $R^* \approx \|\hat{x}\|_\infty$ is a signal-dependent parameter, and which is limited to low dimensions. By contrast, our algorithm works in arbitrary dimension $d$, mitigating the $\exp(d)$ blowup in decoding time to merely linear in $d$. A key component of our algorithm is fast spectral sparsification of the Fourier basis.
    ∙ **High-accuracy Fourier interpolation.** In one dimension, we design a polynomial-time $(3+\sqrt{2}+\epsilon)$-approximation algorithm for continuous Fourier interpolation. This bypasses a barrier of all previous algorithms [Price and Song, FOCS 2015; Chen, Kane, Price and Song, FOCS 2016], which only achieve a $c > 100$ approximation for this basic problem. Our main contribution is a new analytic tool for hierarchical frequency decomposition based on \emph{noise cancellation}.
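    To make the set-query setting concrete, the NumPy sketch below (a toy illustration under simplifying assumptions, not the paper's algorithm) assumes the candidate frequency support of size $k$ is already known and recovers the corresponding Fourier coefficients of a noisy length-$n$ signal by least squares on $O(k)$ random time samples.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 1024, 4
freqs = rng.choice(n, size=k, replace=False)        # known (queried) frequency support
coeffs_true = rng.standard_normal(k) + 1j * rng.standard_normal(k)

# x(t) = sum_f coeffs[f] * exp(2*pi*i*f*t/n), observed with additive noise
t = np.arange(n)
x = np.exp(2j * np.pi * np.outer(t, freqs) / n) @ coeffs_true
x_noisy = x + 0.01 * (rng.standard_normal(n) + 1j * rng.standard_normal(n))

# set query: take m = O(k) random time samples and solve a small least-squares problem
m = 8 * k
samples = rng.choice(n, size=m, replace=False)
F = np.exp(2j * np.pi * np.outer(samples, freqs) / n)  # m x k Fourier submatrix
coeffs_hat, *_ = np.linalg.lstsq(F, x_noisy[samples], rcond=None)

print(np.linalg.norm(coeffs_hat - coeffs_true))        # small reconstruction error
```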