18 research outputs found

    Faster Robust Tensor Power Method for Arbitrary Order

    Tensor decomposition is a fundamental method used in various areas to deal with high-dimensional data. The \emph{tensor power method} (TPM) is one of the most widely used techniques for decomposing tensors. This paper presents a novel tensor power method for decomposing arbitrary-order tensors, which overcomes limitations of existing approaches that are often restricted to lower-order (less than 3) tensors or require strong assumptions about the underlying data structure. By applying a sketching method, we achieve a running time of $\widetilde{O}(n^{p-1})$ for a tensor of order $p$ and dimension $n$. We provide a detailed analysis for any $p$-th order tensor, which has not been given in previous works.
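    As a point of reference for what TPM computes, here is a minimal NumPy sketch of the classical power iteration for a symmetric order-3 tensor. It is only a baseline illustration under simplified assumptions (symmetric, third order, exact contractions) and does not implement the paper's sketching-based acceleration for arbitrary order $p$; the function names and test tensor are made up for the example.

```python
import numpy as np

def tensor_power_iteration(T, num_iters=100, seed=0):
    """Recover one eigenpair of a symmetric order-3 tensor T in R^{n x n x n}."""
    n = T.shape[0]
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(n)
    u /= np.linalg.norm(u)
    for _ in range(num_iters):
        # Contract T along two modes with u: v_i = sum_{j,k} T[i, j, k] u_j u_k
        v = np.einsum('ijk,j,k->i', T, u, u)
        u = v / np.linalg.norm(v)
    lam = np.einsum('ijk,i,j,k->', T, u, u, u)  # generalized Rayleigh quotient
    return lam, u

# Example: rank-1 symmetric tensor T = 2 * (a outer a outer a)
a = np.array([1.0, 0.0, 0.0])
T = 2.0 * np.einsum('i,j,k->ijk', a, a, a)
lam, u = tensor_power_iteration(T)
print(lam, u)  # approximately 2.0 and [1, 0, 0]
```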

    The Expressibility of Polynomial based Attention Scheme

    Large language models (LLMs) have significantly improved various aspects of our daily lives. These models have impacted numerous domains, from healthcare to education, enhancing productivity, decision-making processes, and accessibility. As a result, they have influenced and, to some extent, reshaped people's lifestyles. However, the quadratic complexity of attention in transformer architectures poses a challenge when scaling up these models for processing long textual contexts. This issue makes it impractical to train very large models on lengthy texts or use them efficiently during inference. While a recent study by [KMZ23] introduced a technique that replaces the softmax with a polynomial function and uses polynomial sketching to speed up attention mechanisms, the theoretical properties of this new approach are not yet well understood. In this paper, we offer a theoretical analysis of the expressive capabilities of polynomial attention. Our study reveals a disparity between the abilities of high-degree and low-degree polynomial attention. Specifically, we construct two carefully designed datasets, $\mathcal{D}_0$ and $\mathcal{D}_1$, where $\mathcal{D}_1$ includes a feature with a significantly larger value compared to $\mathcal{D}_0$. We demonstrate that with a sufficiently high degree $\beta$, a single-layer polynomial attention network can distinguish between $\mathcal{D}_0$ and $\mathcal{D}_1$, whereas with a low degree $\beta$ the network cannot effectively separate the two datasets. This analysis underscores the greater effectiveness of high-degree polynomials in amplifying large values and distinguishing between datasets. Our analysis offers insight into the representational capacity of polynomial attention and provides a rationale for incorporating higher-degree polynomials in attention mechanisms to capture intricate linguistic correlations. Comment: arXiv admin note: substantial text overlap with arXiv:2310.1168
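    As a toy illustration of the intuition (not the paper's construction of $\mathcal{D}_0$ and $\mathcal{D}_1$), the sketch below replaces the softmax in single-layer attention with a degree-$\beta$ polynomial and shows how a higher degree concentrates attention on a key whose score is much larger than the others; all dimensions and data here are invented for the example.

```python
import numpy as np

def poly_attention(Q, K, V, beta):
    scores = Q @ K.T                                # raw attention scores
    weights = np.abs(scores) ** beta                # degree-beta polynomial instead of exp
    weights /= weights.sum(axis=1, keepdims=True)   # row-normalize
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
K[2] *= 5.0            # one key with a much larger magnitude (the "large feature")
V = np.eye(4)          # identity values, so each output row shows the attention weights

for beta in (1, 2, 8):
    print(beta, poly_attention(Q, K, V, beta).round(2))
    # as beta grows, nearly all attention mass lands on key 2
```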

    GradientCoin: A Peer-to-Peer Decentralized Large Language Models

    Since the proposal of the Bitcoin electronic cash system in 2008, Bitcoin has fundamentally changed the economic system over the last decade. Since 2022, large language models (LLMs) such as GPT have outperformed humans in many real-life tasks. However, these large language models have several practical issues. For example, the model is centralized and controlled by a specific unit. One weakness is that if that unit decides to shut down the model, it can no longer be used. The second weakness is the lack of guarantees behind this model, as certain dishonest units may design their own models and feed them unhealthy training data. In this work, we propose a purely theoretical design of a decentralized LLM that operates similarly to the Bitcoin cash system. However, implementing such a system might encounter various practical difficulties. Furthermore, this new system is unlikely to perform better than the standard Bitcoin system in economic terms, so the motivation for designing such a system is limited. It is likely that only two types of people would be interested in setting up a practical system for it: $\bullet$ those who prefer to use decentralized ChatGPT-like software, and $\bullet$ those who believe that the purpose of carbon-based life is to create silicon-based life, such as Optimus Prime in Transformers. The second type of people may be interested because it is possible that one day an AI system like this will awaken and become the next level of intelligence on this planet.

    Revisiting Quantum Algorithms for Linear Regressions: Quadratic Speedups without Data-Dependent Parameters

    Linear regression is one of the most fundamental linear algebra problems. Given a dense matrix $A \in \mathbb{R}^{n \times d}$ and a vector $b$, the goal is to find $x'$ such that $\| Ax' - b \|_2^2 \leq (1+\epsilon) \min_{x} \| A x - b \|_2^2$. The best classical algorithm takes $O(nd) + \mathrm{poly}(d/\epsilon)$ time [Clarkson and Woodruff STOC 2013, Nelson and Nguyen FOCS 2013]. On the other hand, quantum linear regression algorithms can achieve exponential quantum speedups, as shown in [Wang Phys. Rev. A 96, 012335; Kerenidis and Prakash ITCS 2017; Chakraborty, Gilyén and Jeffery ICALP 2019]. However, the running times of these algorithms depend on some quantum linear algebra-related parameters, such as $\kappa(A)$, the condition number of $A$. In this work, we develop a quantum algorithm that runs in $\widetilde{O}(\epsilon^{-1}\sqrt{n}d^{1.5}) + \mathrm{poly}(d/\epsilon)$ time. It provides a quadratic quantum speedup in $n$ over the classical lower bound without any dependence on data-dependent parameters. In addition, we also show our result can be generalized to multiple regression and ridge linear regression.
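    The quantum algorithm cannot be reproduced in a few lines, so the sketch below only illustrates the stated objective $\| Ax' - b \|_2^2 \leq (1+\epsilon) \min_x \| Ax - b \|_2^2$ through a classical CountSketch-style sketch-and-solve baseline in the spirit of Clarkson-Woodruff; the sketch size and data are arbitrary choices for the example.

```python
import numpy as np

def countsketch_solve(A, b, m, seed=0):
    """Approximate least squares: sketch (A, b) down to m rows, then solve exactly."""
    n, d = A.shape
    rng = np.random.default_rng(seed)
    rows = rng.integers(0, m, size=n)               # hash each row into one of m buckets
    signs = rng.choice([-1.0, 1.0], size=n)         # random signs
    SA = np.zeros((m, d)); Sb = np.zeros(m)
    np.add.at(SA, rows, signs[:, None] * A)
    np.add.at(Sb, rows, signs * b)
    return np.linalg.lstsq(SA, Sb, rcond=None)[0]

rng = np.random.default_rng(1)
n, d = 10_000, 20
A = rng.standard_normal((n, d)); b = rng.standard_normal(n)
x_exact = np.linalg.lstsq(A, b, rcond=None)[0]
x_sketch = countsketch_solve(A, b, m=50 * d)
# ratio of residuals should be close to 1, i.e. a (1 + eps)-approximation
print(np.linalg.norm(A @ x_sketch - b) / np.linalg.norm(A @ x_exact - b))
```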

    Federated Empirical Risk Minimization via Second-Order Method

    Many convex optimization problems with important applications in machine learning are formulated as empirical risk minimization (ERM). There are several examples: linear and logistic regression, LASSO, kernel regression, quantile regression, $p$-norm regression, support vector machines (SVM), and mean-field variational inference. To improve data privacy, federated learning is proposed in machine learning as a framework for training deep learning models on the network edge without sharing data between participating nodes. In this work, we present an interior point method (IPM) to solve a general ERM problem under the federated learning setting. We show that the communication complexity of each iteration of our IPM is $\tilde{O}(d^{3/2})$, where $d$ is the dimension (i.e., number of features) of the dataset.
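    The following is a hedged sketch, not the paper's federated IPM: it runs plain centralized Newton steps for $\ell_2$-regularized logistic regression, one of the ERM instances listed above. It is included only to illustrate why second-order methods pair well with federation, since the gradient and Hessian decompose into per-node sums of $d$-dimensional and $d \times d$ quantities that could be aggregated instead of raw data; all names and data are illustrative.

```python
import numpy as np

def newton_logistic(X, y, lam=1e-2, iters=20):
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))                 # predicted probabilities
        grad = X.T @ (p - y) / n + lam * w               # decomposes as a sum over nodes
        D = p * (1.0 - p) / n
        H = X.T @ (D[:, None] * X) + lam * np.eye(d)     # d x d aggregate, also a sum over nodes
        w -= np.linalg.solve(H, grad)                    # Newton step
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
w_true = rng.standard_normal(5)
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)
print(newton_logistic(X, y).round(2))  # roughly recovers w_true
```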

    A Unified Scheme of ResNet and Softmax

    Large language models (LLMs) have brought significant changes to human society. Softmax regression and residual neural networks (ResNet) are two important techniques in deep learning: they not only serve as significant theoretical components supporting the functionality of LLMs but also relate to many other machine learning and theoretical computer science fields, including but not limited to image classification, object detection, semantic segmentation, and tensors. Previous research works studied these two concepts separately. In this paper, we provide a theoretical analysis of the regression problem $\| \langle \exp(Ax) + Ax , {\bf 1}_n \rangle^{-1} ( \exp(Ax) + Ax ) - b \|_2^2$, where $A$ is a matrix in $\mathbb{R}^{n \times d}$, $b$ is a vector in $\mathbb{R}^n$, and ${\bf 1}_n$ is the $n$-dimensional vector whose entries are all $1$. This regression problem is a unified scheme that combines softmax regression and ResNet, which has never been done before. We derive the gradient, Hessian, and Lipschitz properties of the loss function. The Hessian is shown to be positive semidefinite, and its structure is characterized as the sum of a low-rank matrix and a diagonal matrix. This enables an efficient approximate Newton method. As a result, this unified scheme helps to connect two fields previously thought to be unrelated and provides novel insight into the loss landscape and optimization of emerging over-parameterized neural networks, which is meaningful for future research in deep learning models.
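    For concreteness, the snippet below is a direct NumPy transcription of the unified objective $\| \langle \exp(Ax) + Ax, {\bf 1}_n \rangle^{-1} ( \exp(Ax) + Ax ) - b \|_2^2$; the paper's gradient, Hessian, and approximate Newton analysis are not reproduced, and the test values are arbitrary.

```python
import numpy as np

def unified_loss(A, x, b):
    u = np.exp(A @ x) + A @ x       # exp(Ax) + Ax: softmax-style numerator plus residual term
    alpha = u.sum()                 # <exp(Ax) + Ax, 1_n>, the normalization factor
    return np.sum((u / alpha - b) ** 2)

rng = np.random.default_rng(0)
n, d = 8, 3
A = 0.1 * rng.standard_normal((n, d))   # small scale keeps exp() well conditioned
x = rng.standard_normal(d)
b = np.full(n, 1.0 / n)                 # a probability-like target vector
print(unified_loss(A, x, b))
```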

    Local Convergence of Approximate Newton Method for Two Layer Nonlinear Regression

    There have been significant advancements made by large language models (LLMs) in various aspects of our daily lives. LLMs serve as a transformative force in natural language processing, finding applications in text generation, translation, sentiment analysis, and question answering. The accomplishments of LLMs have led to a substantial increase in research efforts in this domain. One specific two-layer regression problem has been well studied in prior works, where the first layer is activated by a ReLU unit and the second layer is activated by a softmax unit. While previous works provide a solid analysis of building a two-layer regression, there is still a gap in the analysis of constructing regression problems with more than two layers. In this paper, we take a crucial step toward addressing this problem: we provide an analysis of a two-layer regression problem in which, in contrast to previous works, the first layer is activated by a softmax unit. This sets the stage for future analyses of creating more activation functions based on the softmax function. Rearranging the softmax function leads to significantly different analyses. Our main results involve analyzing the convergence properties of an approximate Newton method used to minimize the regularized training loss. We prove that the Hessian of the loss function is positive definite and Lipschitz continuous under certain assumptions. This enables us to establish local convergence guarantees for the proposed training algorithm. Specifically, with an appropriate initialization and after $O(\log(1/\epsilon))$ iterations, our algorithm can find an $\epsilon$-approximate minimizer of the training loss with high probability. Each iteration requires approximately $O(\mathrm{nnz}(C) + d^\omega)$ time, where $d$ is the model size, $C$ is the input matrix, and $\omega < 2.374$ is the matrix multiplication exponent.
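    As a hedged illustration of the kind of second-order iteration being analyzed (not the paper's exact two-layer objective or algorithm), the sketch below applies a Gauss-Newton style approximate Newton method to the regularized softmax regression loss $\| \mathrm{softmax}(Ax) - b \|_2^2 + \frac{\lambda}{2} \|x\|_2^2$; all dimensions and data are invented for the example.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def approx_newton_softmax_reg(A, b, lam=1e-2, iters=20):
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(iters):
        p = softmax(A @ x)
        J = (np.diag(p) - np.outer(p, p)) @ A        # Jacobian of softmax(Ax) w.r.t. x
        grad = 2 * J.T @ (p - b) + lam * x
        H = 2 * J.T @ J + lam * np.eye(d)            # Gauss-Newton surrogate for the Hessian
        x -= np.linalg.solve(H, grad)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
b = softmax(A @ rng.standard_normal(4))              # realizable target
x_hat = approx_newton_softmax_reg(A, b)
print(np.linalg.norm(softmax(A @ x_hat) - b))        # small residual after a few iterations
```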

    Low Rank Matrix Completion via Robust Alternating Minimization in Nearly Linear Time

    Given a matrix $M \in \mathbb{R}^{m\times n}$, the low-rank matrix completion problem asks us to find a rank-$k$ approximation of $M$ as $UV^\top$ for $U \in \mathbb{R}^{m\times k}$ and $V \in \mathbb{R}^{n\times k}$ by observing only a few entries specified by a set $\Omega \subseteq [m]\times [n]$. In particular, we examine an approach that is widely used in practice: the alternating minimization framework. Jain, Netrapalli and Sanghavi~\cite{jns13} showed that if $M$ has incoherent rows and columns, then alternating minimization provably recovers the matrix $M$ by observing a number of entries that is nearly linear in $n$. While the sample complexity has subsequently been improved~\cite{glz17}, alternating minimization steps are required to be computed exactly. This hinders the development of more efficient algorithms and fails to depict the practical implementation of alternating minimization, where the updates are usually performed approximately in favor of efficiency. In this paper, we take a major step towards a more efficient and error-robust alternating minimization framework. To this end, we develop an analytical framework for alternating minimization that can tolerate a moderate amount of error caused by approximate updates. Moreover, our algorithm runs in time $\widetilde{O}(|\Omega| k)$, which is nearly linear in the time to verify the solution while preserving the sample complexity. This improves upon all prior known alternating minimization approaches, which require $\widetilde{O}(|\Omega| k^2)$ time. Comment: Improves the runtime from $O(mnk)$ to $O(|\Omega| k)$.
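    For reference, the snippet below is the textbook alternating minimization loop with exact least-squares updates over the observed entries; the paper's robust variant that tolerates approximate updates and runs in $\widetilde{O}(|\Omega| k)$ time is not implemented here, and the problem sizes are arbitrary.

```python
import numpy as np

def altmin_complete(M, mask, k, iters=30, lam=1e-6):
    m, n = M.shape
    rng = np.random.default_rng(0)
    U = rng.standard_normal((m, k))
    V = rng.standard_normal((n, k))
    for _ in range(iters):
        for i in range(m):                            # exact least-squares update of row i of U
            cols = np.nonzero(mask[i])[0]
            Vc = V[cols]
            U[i] = np.linalg.solve(Vc.T @ Vc + lam * np.eye(k), Vc.T @ M[i, cols])
        for j in range(n):                            # exact least-squares update of row j of V
            rows = np.nonzero(mask[:, j])[0]
            Ur = U[rows]
            V[j] = np.linalg.solve(Ur.T @ Ur + lam * np.eye(k), Ur.T @ M[rows, j])
    return U, V

rng = np.random.default_rng(1)
m, n, k = 60, 50, 3
M = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))   # ground-truth rank-k matrix
mask = rng.random((m, n)) < 0.4                                  # observed entries Omega
U, V = altmin_complete(M, mask, k)
print(np.linalg.norm(U @ V.T - M) / np.linalg.norm(M))           # small relative error
```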

    A Fast Optimization View: Reformulating Single Layer Attention in LLM Based on Tensor and SVM Trick, and Solving It in Matrix Multiplication Time

    Large language models (LLMs) have played a pivotal role in revolutionizing various facets of our daily existence. Solving attention regression is a fundamental task in optimizing LLMs. In this work, we focus on giving a provable guarantee for the one-layer attention network objective function $L(X,Y) = \sum_{j_0 = 1}^n \sum_{i_0 = 1}^d ( \langle \langle \exp( \mathsf{A}_{j_0} x ) , {\bf 1}_n \rangle^{-1} \exp( \mathsf{A}_{j_0} x ), A_{3} Y_{*,i_0} \rangle - b_{j_0,i_0} )^2$. Here $\mathsf{A} \in \mathbb{R}^{n^2 \times d^2}$ is the Kronecker product of $A_1 \in \mathbb{R}^{n \times d}$ and $A_2 \in \mathbb{R}^{n \times d}$, $A_3$ is a matrix in $\mathbb{R}^{n \times d}$, and $\mathsf{A}_{j_0} \in \mathbb{R}^{n \times d^2}$ is the $j_0$-th block of $\mathsf{A}$. The matrices $X, Y \in \mathbb{R}^{d \times d}$ are the variables we want to learn; $B \in \mathbb{R}^{n \times d}$ is given, with $b_{j_0,i_0} \in \mathbb{R}$ its entry at the $j_0$-th row and $i_0$-th column; $Y_{*,i_0} \in \mathbb{R}^d$ is the $i_0$-th column of $Y$; and $x \in \mathbb{R}^{d^2}$ is the vectorization of $X$. In a multi-layer LLM network, the matrix $B \in \mathbb{R}^{n \times d}$ can be viewed as the output of a layer, and $A_1 = A_2 = A_3 \in \mathbb{R}^{n \times d}$ can be viewed as the input of a layer. The matrix version of $x$ can be viewed as $QK^\top$ and $Y$ can be viewed as $V$. We provide an iterative greedy algorithm to train the loss function $L(X,Y)$ up to accuracy $\epsilon$ that runs in $\widetilde{O}( ({\cal T}_{\mathrm{mat}}(n,n,d) + {\cal T}_{\mathrm{mat}}(n,d,d) + d^{2\omega}) \log(1/\epsilon) )$ time. Here ${\cal T}_{\mathrm{mat}}(a,b,c)$ denotes the time to multiply an $a \times b$ matrix by a $b \times c$ matrix, and $\omega \approx 2.37$ denotes the exponent of matrix multiplication.
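    The snippet below is a NumPy transcription of this objective written in matrix form, following the abstract's interpretation of the matrix version of $x$ as $QK^\top$ and $Y$ as $V$: it compares the row-normalized $\exp(A_1 X A_2^\top)$ applied to $A_3 Y$ against the target $B$. The paper's iterative greedy training algorithm is not reproduced, and the test data are arbitrary.

```python
import numpy as np

def attention_loss(A1, A2, A3, X, Y, B):
    S = np.exp(A1 @ X @ A2.T)              # exp(A1 X A2^T), entrywise
    P = S / S.sum(axis=1, keepdims=True)   # row-wise softmax normalization
    return np.sum((P @ (A3 @ Y) - B) ** 2)

rng = np.random.default_rng(0)
n, d = 6, 3
A1 = A2 = A3 = 0.3 * rng.standard_normal((n, d))   # shared layer input, as in the abstract
X = rng.standard_normal((d, d))                    # plays the role of Q K^T
Y = rng.standard_normal((d, d))                    # plays the role of V
B = rng.standard_normal((n, d))                    # target output of the layer
print(attention_loss(A1, A2, A3, X, Y, B))
```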

    Query Complexity of Active Learning for Function Family With Nearly Orthogonal Basis

    Many machine learning algorithms require large amounts of labeled data to deliver state-of-the-art results. In applications such as medical diagnosis and fraud detection, even though there is an abundance of unlabeled data, it is costly to label the data by experts, experiments, or simulations. Active learning algorithms aim to reduce the number of required labeled data points while preserving performance. For many convex optimization problems such as linear regression and $p$-norm regression, there are theoretical bounds on the number of labels required to achieve a certain accuracy. We call this the query complexity of active learning. However, today's active learning algorithms require the underlying learned function to have an orthogonal basis. For example, when applying active learning to linear regression, the requirement is that the target function be a linear combination of a set of orthogonal linear functions, and active learning can then find the coefficients of these linear functions. We present a theoretical result showing that active learning does not need an orthogonal basis but rather only requires a nearly orthogonal basis. We provide the corresponding theoretical proofs for function families with a nearly orthogonal basis, together with their applications in an algorithmically efficient active learning framework.
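    As a generic baseline for how an active learner chooses which labels to query in linear regression (not the paper's nearly-orthogonal-basis analysis), the sketch below samples rows by their leverage scores, queries only those labels, and solves a reweighted least-squares problem; the oracle and data are made up for the example.

```python
import numpy as np

def leverage_scores(A):
    Q, _ = np.linalg.qr(A)            # thin QR; squared row norms of Q are leverage scores
    return np.sum(Q ** 2, axis=1)

def active_least_squares(A, label_oracle, num_queries, rng):
    tau = leverage_scores(A)
    probs = tau / tau.sum()
    idx = rng.choice(len(A), size=num_queries, replace=True, p=probs)
    w = 1.0 / np.sqrt(num_queries * probs[idx])        # importance-sampling reweighting
    y = np.array([label_oracle(i) for i in idx])       # only these labels are queried
    return np.linalg.lstsq(w[:, None] * A[idx], w * y, rcond=None)[0]

rng = np.random.default_rng(0)
n, d = 5000, 10
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
y_full = A @ x_true + 0.01 * rng.standard_normal(n)    # labels, revealed only on query
x_hat = active_least_squares(A, lambda i: y_full[i], num_queries=200, rng=rng)
print(np.linalg.norm(x_hat - x_true))                  # small error using only 200 labels
```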