18 research outputs found

    Faster Robust Tensor Power Method for Arbitrary Order

    Tensor decomposition is a fundamental method used in various areas to deal with high-dimensional data. The \emph{tensor power method} (TPM) is one of the most widely used techniques for decomposing tensors. This paper presents a novel tensor power method for decomposing arbitrary-order tensors, which overcomes limitations of existing approaches that are often restricted to lower-order (less than 3) tensors or require strong assumptions about the underlying data structure. By applying a sketching method, we achieve a running time of $\widetilde{O}(n^{p-1})$ for a tensor of order $p$ and dimension $n$. We provide a detailed analysis for any $p$-th order tensor, which has not been given in previous works.
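    As a point of reference for what TPM computes, here is a minimal NumPy sketch of the classical power iteration for a symmetric order-3 tensor. It is only a baseline illustration under simplified assumptions (symmetric, third order, exact contractions) and does not implement the paper's sketching-based acceleration for arbitrary order $p$; the function names and test tensor are made up for the example.

```python
import numpy as np

def tensor_power_iteration(T, num_iters=100, seed=0):
    """Recover one eigenpair of a symmetric order-3 tensor T in R^{n x n x n}."""
    n = T.shape[0]
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(n)
    u /= np.linalg.norm(u)
    for _ in range(num_iters):
        # Contract T along two modes with u: v_i = sum_{j,k} T[i, j, k] u_j u_k
        v = np.einsum('ijk,j,k->i', T, u, u)
        u = v / np.linalg.norm(v)
    lam = np.einsum('ijk,i,j,k->', T, u, u, u)  # generalized Rayleigh quotient
    return lam, u

# Example: rank-1 symmetric tensor T = 2 * (a outer a outer a)
a = np.array([1.0, 0.0, 0.0])
T = 2.0 * np.einsum('i,j,k->ijk', a, a, a)
lam, u = tensor_power_iteration(T)
print(lam, u)  # approximately 2.0 and [1, 0, 0]
```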

    The Expressibility of Polynomial based Attention Scheme

    Large language models (LLMs) have significantly improved various aspects of our daily lives. These models have impacted numerous domains, from healthcare to education, enhancing productivity, decision-making processes, and accessibility. As a result, they have influenced and, to some extent, reshaped people's lifestyles. However, the quadratic complexity of attention in transformer architectures poses a challenge when scaling up these models for processing long textual contexts. This issue makes it impractical to train very large models on lengthy texts or use them efficiently during inference. While a recent study by [KMZ23] introduced a technique that replaces the softmax with a polynomial function and uses polynomial sketching to speed up attention mechanisms, the theoretical properties of this new approach are not yet well understood. In this paper, we offer a theoretical analysis of the expressive capabilities of polynomial attention. Our study reveals a disparity between the abilities of high-degree and low-degree polynomial attention. Specifically, we construct two carefully designed datasets, $\mathcal{D}_0$ and $\mathcal{D}_1$, where $\mathcal{D}_1$ includes a feature with a significantly larger value compared to $\mathcal{D}_0$. We demonstrate that with a sufficiently high degree $\beta$, a single-layer polynomial attention network can distinguish between $\mathcal{D}_0$ and $\mathcal{D}_1$, whereas with a low degree $\beta$ the network cannot effectively separate the two datasets. This analysis underscores the greater effectiveness of high-degree polynomials in amplifying large values and distinguishing between datasets. Our analysis offers insight into the representational capacity of polynomial attention and provides a rationale for incorporating higher-degree polynomials in attention mechanisms to capture intricate linguistic correlations. Comment: arXiv admin note: substantial text overlap with arXiv:2310.1168
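    As a toy illustration of the intuition (not the paper's construction of $\mathcal{D}_0$ and $\mathcal{D}_1$), the sketch below replaces the softmax in single-layer attention with a degree-$\beta$ polynomial and shows how a higher degree concentrates attention on a key whose score is much larger than the others; all dimensions and data here are invented for the example.

```python
import numpy as np

def poly_attention(Q, K, V, beta):
    scores = Q @ K.T                                # raw attention scores
    weights = np.abs(scores) ** beta                # degree-beta polynomial instead of exp
    weights /= weights.sum(axis=1, keepdims=True)   # row-normalize
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
K[2] *= 5.0            # one key with a much larger magnitude (the "large feature")
V = np.eye(4)          # identity values, so each output row shows the attention weights

for beta in (1, 2, 8):
    print(beta, poly_attention(Q, K, V, beta).round(2))
    # as beta grows, nearly all attention mass lands on key 2
```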

    GradientCoin: A Peer-to-Peer Decentralized Large Language Models

    Since the proposal of the Bitcoin electronic cash system in 2008, Bitcoin has fundamentally changed the economic system over the last decade. Since 2022, large language models (LLMs) such as GPT have outperformed humans in many real-life tasks. However, these large language models have several practical issues. For example, the model is centralized and controlled by a specific unit. One weakness is that if that unit decides to shut down the model, it can no longer be used. The second weakness is the lack of guarantees behind this model, as certain dishonest units may design their own models and feed them unhealthy training data. In this work, we propose a purely theoretical design of a decentralized LLM that operates similarly to the Bitcoin cash system. However, implementing such a system might encounter various practical difficulties. Furthermore, this new system is unlikely to perform better than the standard Bitcoin system in economic terms, so the motivation for designing such a system is limited. It is likely that only two types of people would be interested in setting up a practical system for it: $\bullet$ those who prefer to use decentralized ChatGPT-like software, and $\bullet$ those who believe that the purpose of carbon-based life is to create silicon-based life, such as Optimus Prime in Transformers. The second type of people may be interested because it is possible that one day an AI system like this will awaken and become the next level of intelligence on this planet.

    Revisiting Quantum Algorithms for Linear Regressions: Quadratic Speedups without Data-Dependent Parameters

    Linear regression is one of the most fundamental linear algebra problems. Given a dense matrix $A \in \mathbb{R}^{n \times d}$ and a vector $b$, the goal is to find $x'$ such that $\| Ax' - b \|_2^2 \leq (1+\epsilon) \min_{x} \| A x - b \|_2^2$. The best classical algorithm takes $O(nd) + \mathrm{poly}(d/\epsilon)$ time [Clarkson and Woodruff STOC 2013, Nelson and Nguyen FOCS 2013]. On the other hand, quantum linear regression algorithms can achieve exponential quantum speedups, as shown in [Wang Phys. Rev. A 96, 012335; Kerenidis and Prakash ITCS 2017; Chakraborty, Gilyén and Jeffery ICALP 2019]. However, the running times of these algorithms depend on some quantum linear algebra-related parameters, such as $\kappa(A)$, the condition number of $A$. In this work, we develop a quantum algorithm that runs in $\widetilde{O}(\epsilon^{-1}\sqrt{n}d^{1.5}) + \mathrm{poly}(d/\epsilon)$ time. It provides a quadratic quantum speedup in $n$ over the classical lower bound without any dependence on data-dependent parameters. In addition, we also show our result can be generalized to multiple regression and ridge linear regression.
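    The quantum algorithm cannot be reproduced in a few lines, so the sketch below only illustrates the stated objective $\| Ax' - b \|_2^2 \leq (1+\epsilon) \min_x \| Ax - b \|_2^2$ through a classical CountSketch-style sketch-and-solve baseline in the spirit of Clarkson-Woodruff; the sketch size and data are arbitrary choices for the example.

```python
import numpy as np

def countsketch_solve(A, b, m, seed=0):
    """Approximate least squares: sketch (A, b) down to m rows, then solve exactly."""
    n, d = A.shape
    rng = np.random.default_rng(seed)
    rows = rng.integers(0, m, size=n)               # hash each row into one of m buckets
    signs = rng.choice([-1.0, 1.0], size=n)         # random signs
    SA = np.zeros((m, d)); Sb = np.zeros(m)
    np.add.at(SA, rows, signs[:, None] * A)
    np.add.at(Sb, rows, signs * b)
    return np.linalg.lstsq(SA, Sb, rcond=None)[0]

rng = np.random.default_rng(1)
n, d = 10_000, 20
A = rng.standard_normal((n, d)); b = rng.standard_normal(n)
x_exact = np.linalg.lstsq(A, b, rcond=None)[0]
x_sketch = countsketch_solve(A, b, m=50 * d)
# ratio of residuals should be close to 1, i.e. a (1 + eps)-approximation
print(np.linalg.norm(A @ x_sketch - b) / np.linalg.norm(A @ x_exact - b))
```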

    Federated Empirical Risk Minimization via Second-Order Method

    Many convex optimization problems with important applications in machine learning are formulated as empirical risk minimization (ERM). There are several examples: linear and logistic regression, LASSO, kernel regression, quantile regression, $p$-norm regression, support vector machines (SVM), and mean-field variational inference. To improve data privacy, federated learning is proposed in machine learning as a framework for training deep learning models on the network edge without sharing data between participating nodes. In this work, we present an interior point method (IPM) to solve a general ERM problem under the federated learning setting. We show that the communication complexity of each iteration of our IPM is $\tilde{O}(d^{3/2})$, where $d$ is the dimension (i.e., number of features) of the dataset.
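    The following is a hedged sketch, not the paper's federated IPM: it runs plain centralized Newton steps for $\ell_2$-regularized logistic regression, one of the ERM instances listed above. It is included only to illustrate why second-order methods pair well with federation, since the gradient and Hessian decompose into per-node sums of $d$-dimensional and $d \times d$ quantities that could be aggregated instead of raw data; all names and data are illustrative.

```python
import numpy as np

def newton_logistic(X, y, lam=1e-2, iters=20):
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))                 # predicted probabilities
        grad = X.T @ (p - y) / n + lam * w               # decomposes as a sum over nodes
        D = p * (1.0 - p) / n
        H = X.T @ (D[:, None] * X) + lam * np.eye(d)     # d x d aggregate, also a sum over nodes
        w -= np.linalg.solve(H, grad)                    # Newton step
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
w_true = rng.standard_normal(5)
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)
print(newton_logistic(X, y).round(2))  # roughly recovers w_true
```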

    A Unified Scheme of ResNet and Softmax

    Large language models (LLMs) have brought significant changes to human society. Softmax regression and residual neural networks (ResNet) are two important techniques in deep learning: they not only serve as significant theoretical components supporting the functionality of LLMs but also relate to many other machine learning and theoretical computer science fields, including but not limited to image classification, object detection, semantic segmentation, and tensors. Previous research works studied these two concepts separately. In this paper, we provide a theoretical analysis of the regression problem $\| \langle \exp(Ax) + Ax , {\bf 1}_n \rangle^{-1} ( \exp(Ax) + Ax ) - b \|_2^2$, where $A$ is a matrix in $\mathbb{R}^{n \times d}$, $b$ is a vector in $\mathbb{R}^n$, and ${\bf 1}_n$ is the $n$-dimensional vector whose entries are all $1$. This regression problem is a unified scheme that combines softmax regression and ResNet, which has never been done before. We derive the gradient, Hessian, and Lipschitz properties of the loss function. The Hessian is shown to be positive semidefinite, and its structure is characterized as the sum of a low-rank matrix and a diagonal matrix. This enables an efficient approximate Newton method. As a result, this unified scheme helps to connect two fields previously thought to be unrelated and provides novel insight into the loss landscape and optimization of emerging over-parameterized neural networks, which is meaningful for future research in deep learning models.
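    For concreteness, the snippet below is a direct NumPy transcription of the unified objective $\| \langle \exp(Ax) + Ax, {\bf 1}_n \rangle^{-1} ( \exp(Ax) + Ax ) - b \|_2^2$; the paper's gradient, Hessian, and approximate Newton analysis are not reproduced, and the test values are arbitrary.

```python
import numpy as np

def unified_loss(A, x, b):
    u = np.exp(A @ x) + A @ x       # exp(Ax) + Ax: softmax-style numerator plus residual term
    alpha = u.sum()                 # <exp(Ax) + Ax, 1_n>, the normalization factor
    return np.sum((u / alpha - b) ** 2)

rng = np.random.default_rng(0)
n, d = 8, 3
A = 0.1 * rng.standard_normal((n, d))   # small scale keeps exp() well conditioned
x = rng.standard_normal(d)
b = np.full(n, 1.0 / n)                 # a probability-like target vector
print(unified_loss(A, x, b))
```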

    Local Convergence of Approximate Newton Method for Two Layer Nonlinear Regression

    There have been significant advancements made by large language models (LLMs) in various aspects of our daily lives. LLMs serve as a transformative force in natural language processing, finding applications in text generation, translation, sentiment analysis, and question answering. The accomplishments of LLMs have led to a substantial increase in research efforts in this domain. One specific two-layer regression problem has been well studied in prior works, where the first layer is activated by a ReLU unit and the second layer is activated by a softmax unit. While previous works provide a solid analysis of building a two-layer regression, there is still a gap in the analysis of constructing regression problems with more than two layers. In this paper, we take a crucial step toward addressing this problem: we provide an analysis of a two-layer regression problem in which, in contrast to previous works, the first layer is activated by a softmax unit. This sets the stage for future analyses of creating more activation functions based on the softmax function. Rearranging the softmax function leads to significantly different analyses. Our main results involve analyzing the convergence properties of an approximate Newton method used to minimize the regularized training loss. We prove that the Hessian of the loss function is positive definite and Lipschitz continuous under certain assumptions. This enables us to establish local convergence guarantees for the proposed training algorithm. Specifically, with an appropriate initialization and after $O(\log(1/\epsilon))$ iterations, our algorithm can find an $\epsilon$-approximate minimizer of the training loss with high probability. Each iteration requires approximately $O(\mathrm{nnz}(C) + d^\omega)$ time, where $d$ is the model size, $C$ is the input matrix, and $\omega < 2.374$ is the matrix multiplication exponent.
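    As a hedged illustration of the kind of second-order iteration being analyzed (not the paper's exact two-layer objective or algorithm), the sketch below applies a Gauss-Newton style approximate Newton method to the regularized softmax regression loss $\| \mathrm{softmax}(Ax) - b \|_2^2 + \frac{\lambda}{2} \|x\|_2^2$; all dimensions and data are invented for the example.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def approx_newton_softmax_reg(A, b, lam=1e-2, iters=20):
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(iters):
        p = softmax(A @ x)
        J = (np.diag(p) - np.outer(p, p)) @ A        # Jacobian of softmax(Ax) w.r.t. x
        grad = 2 * J.T @ (p - b) + lam * x
        H = 2 * J.T @ J + lam * np.eye(d)            # Gauss-Newton surrogate for the Hessian
        x -= np.linalg.solve(H, grad)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
b = softmax(A @ rng.standard_normal(4))              # realizable target
x_hat = approx_newton_softmax_reg(A, b)
print(np.linalg.norm(softmax(A @ x_hat) - b))        # small residual after a few iterations
```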

    Low Rank Matrix Completion via Robust Alternating Minimization in Nearly Linear Time

    Given a matrix $M \in \mathbb{R}^{m\times n}$, the low-rank matrix completion problem asks us to find a rank-$k$ approximation of $M$ as $UV^\top$ for $U \in \mathbb{R}^{m\times k}$ and $V \in \mathbb{R}^{n\times k}$ by observing only a few entries specified by a set $\Omega \subseteq [m]\times [n]$. In particular, we examine an approach that is widely used in practice: the alternating minimization framework. Jain, Netrapalli and Sanghavi~\cite{jns13} showed that if $M$ has incoherent rows and columns, then alternating minimization provably recovers the matrix $M$ by observing a number of entries that is nearly linear in $n$. While the sample complexity has subsequently been improved~\cite{glz17}, alternating minimization steps are required to be computed exactly. This hinders the development of more efficient algorithms and fails to depict the practical implementation of alternating minimization, where the updates are usually performed approximately in favor of efficiency. In this paper, we take a major step towards a more efficient and error-robust alternating minimization framework. To this end, we develop an analytical framework for alternating minimization that can tolerate a moderate amount of error caused by approximate updates. Moreover, our algorithm runs in time $\widetilde{O}(|\Omega| k)$, which is nearly linear in the time to verify the solution while preserving the sample complexity. This improves upon all prior known alternating minimization approaches, which require $\widetilde{O}(|\Omega| k^2)$ time. Comment: Improves the runtime from $O(mnk)$ to $O(|\Omega| k)$.
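    For reference, the snippet below is the textbook alternating minimization loop with exact least-squares updates over the observed entries; the paper's robust variant that tolerates approximate updates and runs in $\widetilde{O}(|\Omega| k)$ time is not implemented here, and the problem sizes are arbitrary.

```python
import numpy as np

def altmin_complete(M, mask, k, iters=30, lam=1e-6):
    m, n = M.shape
    rng = np.random.default_rng(0)
    U = rng.standard_normal((m, k))
    V = rng.standard_normal((n, k))
    for _ in range(iters):
        for i in range(m):                            # exact least-squares update of row i of U
            cols = np.nonzero(mask[i])[0]
            Vc = V[cols]
            U[i] = np.linalg.solve(Vc.T @ Vc + lam * np.eye(k), Vc.T @ M[i, cols])
        for j in range(n):                            # exact least-squares update of row j of V
            rows = np.nonzero(mask[:, j])[0]
            Ur = U[rows]
            V[j] = np.linalg.solve(Ur.T @ Ur + lam * np.eye(k), Ur.T @ M[rows, j])
    return U, V

rng = np.random.default_rng(1)
m, n, k = 60, 50, 3
M = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))   # ground-truth rank-k matrix
mask = rng.random((m, n)) < 0.4                                  # observed entries Omega
U, V = altmin_complete(M, mask, k)
print(np.linalg.norm(U @ V.T - M) / np.linalg.norm(M))           # small relative error
```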

    A Fast Optimization View: Reformulating Single Layer Attention in LLM Based on Tensor and SVM Trick, and Solving It in Matrix Multiplication Time

    Large language models (LLMs) have played a pivotal role in revolutionizing various facets of our daily existence. Solving attention regression is a fundamental task in optimizing LLMs. In this work, we focus on giving a provable guarantee for the one-layer attention network objective function $L(X,Y) = \sum_{j_0 = 1}^n \sum_{i_0 = 1}^d ( \langle \langle \exp( \mathsf{A}_{j_0} x ) , {\bf 1}_n \rangle^{-1} \exp( \mathsf{A}_{j_0} x ), A_{3} Y_{*,i_0} \rangle - b_{j_0,i_0} )^2$. Here $\mathsf{A} \in \mathbb{R}^{n^2 \times d^2}$ is the Kronecker product of $A_1 \in \mathbb{R}^{n \times d}$ and $A_2 \in \mathbb{R}^{n \times d}$, $A_3$ is a matrix in $\mathbb{R}^{n \times d}$, and $\mathsf{A}_{j_0} \in \mathbb{R}^{n \times d^2}$ is the $j_0$-th block of $\mathsf{A}$. The matrices $X, Y \in \mathbb{R}^{d \times d}$ are the variables we want to learn; $B \in \mathbb{R}^{n \times d}$ is given, with $b_{j_0,i_0} \in \mathbb{R}$ its entry at the $j_0$-th row and $i_0$-th column; $Y_{*,i_0} \in \mathbb{R}^d$ is the $i_0$-th column of $Y$; and $x \in \mathbb{R}^{d^2}$ is the vectorization of $X$. In a multi-layer LLM network, the matrix $B \in \mathbb{R}^{n \times d}$ can be viewed as the output of a layer, and $A_1 = A_2 = A_3 \in \mathbb{R}^{n \times d}$ can be viewed as the input of a layer. The matrix version of $x$ can be viewed as $QK^\top$ and $Y$ can be viewed as $V$. We provide an iterative greedy algorithm to train the loss function $L(X,Y)$ up to accuracy $\epsilon$ that runs in $\widetilde{O}( ({\cal T}_{\mathrm{mat}}(n,n,d) + {\cal T}_{\mathrm{mat}}(n,d,d) + d^{2\omega}) \log(1/\epsilon) )$ time. Here ${\cal T}_{\mathrm{mat}}(a,b,c)$ denotes the time to multiply an $a \times b$ matrix by a $b \times c$ matrix, and $\omega \approx 2.37$ denotes the exponent of matrix multiplication.
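    The snippet below is a NumPy transcription of this objective written in matrix form, following the abstract's interpretation of the matrix version of $x$ as $QK^\top$ and $Y$ as $V$: it compares the row-normalized $\exp(A_1 X A_2^\top)$ applied to $A_3 Y$ against the target $B$. The paper's iterative greedy training algorithm is not reproduced, and the test data are arbitrary.

```python
import numpy as np

def attention_loss(A1, A2, A3, X, Y, B):
    S = np.exp(A1 @ X @ A2.T)              # exp(A1 X A2^T), entrywise
    P = S / S.sum(axis=1, keepdims=True)   # row-wise softmax normalization
    return np.sum((P @ (A3 @ Y) - B) ** 2)

rng = np.random.default_rng(0)
n, d = 6, 3
A1 = A2 = A3 = 0.3 * rng.standard_normal((n, d))   # shared layer input, as in the abstract
X = rng.standard_normal((d, d))                    # plays the role of Q K^T
Y = rng.standard_normal((d, d))                    # plays the role of V
B = rng.standard_normal((n, d))                    # target output of the layer
print(attention_loss(A1, A2, A3, X, Y, B))
```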

    Query Complexity of Active Learning for Function Family With Nearly Orthogonal Basis

    Many machine learning algorithms require large amounts of labeled data to deliver state-of-the-art results. In applications such as medical diagnosis and fraud detection, even though there is an abundance of unlabeled data, it is costly to label the data by experts, experiments, or simulations. Active learning algorithms aim to reduce the number of required labeled data points while preserving performance. For many convex optimization problems such as linear regression and $p$-norm regression, there are theoretical bounds on the number of labels required to achieve a certain accuracy. We call this the query complexity of active learning. However, today's active learning algorithms require the underlying learned function to have an orthogonal basis. For example, when applying active learning to linear regression, the requirement is that the target function be a linear combination of a set of orthogonal linear functions, and active learning can then find the coefficients of these linear functions. We present a theoretical result showing that active learning does not need an orthogonal basis but rather only requires a nearly orthogonal basis. We provide the corresponding theoretical proofs for function families with a nearly orthogonal basis, together with their applications in an algorithmically efficient active learning framework.
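    As a generic baseline for how an active learner chooses which labels to query in linear regression (not the paper's nearly-orthogonal-basis analysis), the sketch below samples rows by their leverage scores, queries only those labels, and solves a reweighted least-squares problem; the oracle and data are made up for the example.

```python
import numpy as np

def leverage_scores(A):
    Q, _ = np.linalg.qr(A)            # thin QR; squared row norms of Q are leverage scores
    return np.sum(Q ** 2, axis=1)

def active_least_squares(A, label_oracle, num_queries, rng):
    tau = leverage_scores(A)
    probs = tau / tau.sum()
    idx = rng.choice(len(A), size=num_queries, replace=True, p=probs)
    w = 1.0 / np.sqrt(num_queries * probs[idx])        # importance-sampling reweighting
    y = np.array([label_oracle(i) for i in idx])       # only these labels are queried
    return np.linalg.lstsq(w[:, None] * A[idx], w * y, rcond=None)[0]

rng = np.random.default_rng(0)
n, d = 5000, 10
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
y_full = A @ x_true + 0.01 * rng.standard_normal(n)    # labels, revealed only on query
x_hat = active_least_squares(A, lambda i: y_full[i], num_queries=200, rng=rng)
print(np.linalg.norm(x_hat - x_true))                  # small error using only 200 labels
```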