GradientCoin: A Peer-to-Peer Decentralized Large Language Models
Since the proposal of the Bitcoin electronic cash system in 2008, Bitcoin has
fundamentally changed the economic system over the past decade. Since 2022,
large language models (LLMs) such as GPT have outperformed humans in many
real-life tasks. However, these large language models have several practical
issues. For example, the model is centralized and controlled by a specific
unit. One weakness is that if that unit decides to shut down the model, it
cannot be used anymore. The second weakness is the lack of guaranteed
discrepancy behind this model, as certain dishonest units may design their own
models and feed them unhealthy training data.
In this work, we propose a purely theoretical design of a decentralized LLM
that operates similarly to a Bitcoin cash system. However, implementing such a
system might encounter various practical difficulties. Furthermore, this new
system is unlikely to perform better than the standard Bitcoin system in
economics. Therefore, the motivation for designing such a system is limited. It
is likely that only two types of people would be interested in setting up a
practical system for it:
1. Those who prefer to use decentralized ChatGPT-like software.
2. Those who believe that the purpose of carbon-based life is to create
silicon-based life, such as Optimus Prime in Transformers.
The second type may be interested because it is possible that one day an AI
system like this will awaken and become the next level of intelligence on this
planet.
Federated Empirical Risk Minimization via Second-Order Method
Many convex optimization problems with important applications in machine
learning are formulated as empirical risk minimization (ERM). There are several
examples: linear and logistic regression, LASSO, kernel regression, quantile
regression, $\ell_p$-norm regression, support vector machines (SVM), and mean-field
variational inference. To improve data privacy, federated learning is proposed
in machine learning as a framework for training deep learning models on the
network edge without sharing data between participating nodes. In this work, we
present an interior point method (IPM) to solve a general ERM problem under the
federated learning setting. We show that the communication complexity of each
iteration of our IPM is $\widetilde{O}(d^3)$, where $d$ is the dimension (i.e., the
number of features) of the dataset.
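The abstract does not spell out the protocol, so here is a minimal sketch of the general idea behind second-order federated ERM: each node summarizes its local data by a $d$-dimensional gradient and a $d \times d$ Hessian, so per-iteration communication depends on the feature dimension $d$ rather than on the number of local samples. This is an illustrative Newton step for $\ell_2$-regularized logistic regression, not the paper's interior point method; the function names and the two-node toy setup are assumptions made for the example.

    import numpy as np

    def local_newton_terms(X, y, w, lam):
        """Per-node gradient and Hessian of regularized logistic loss on local data.

        Only these d- and d x d-sized summaries are communicated, so the message
        size per iteration depends on the feature dimension d, not on the number
        of local samples (illustrative sketch, not the paper's IPM)."""
        z = X @ w
        p = 1.0 / (1.0 + np.exp(-z))                               # predicted probabilities
        grad = X.T @ (p - y) + lam * w                             # d-dimensional gradient
        S = p * (1.0 - p)                                          # logistic weights
        hess = X.T @ (X * S[:, None]) + lam * np.eye(X.shape[1])   # d x d Hessian
        return grad, hess

    def federated_newton_step(shards, w, lam=1e-3):
        """Aggregate node summaries at the coordinator and take one Newton step."""
        d = w.shape[0]
        grad, hess = np.zeros(d), np.zeros((d, d))
        for X, y in shards:                       # one (X, y) pair per participating node
            g, H = local_newton_terms(X, y, w, lam)
            grad += g
            hess += H
        return w - np.linalg.solve(hess, grad)

    # toy usage: two nodes share a model and run a few Newton iterations
    rng = np.random.default_rng(0)
    d = 5
    w_true = rng.normal(size=d)
    shards = []
    for _ in range(2):
        X = rng.normal(size=(200, d))
        y = (X @ w_true + 0.1 * rng.normal(size=200) > 0).astype(float)
        shards.append((X, y))
    w = np.zeros(d)
    for _ in range(10):
        w = federated_newton_step(shards, w)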
Low Rank Matrix Completion via Robust Alternating Minimization in Nearly Linear Time
Given a matrix $M \in \mathbb{R}^{m \times n}$, the low rank matrix completion
problem asks us to find a rank-$k$ approximation of $M$ as $UV^\top$ for
$U \in \mathbb{R}^{m \times k}$ and $V \in \mathbb{R}^{n \times k}$ by only observing a
few entries specified by a set of entries $\Omega \subseteq [m] \times [n]$. In
particular, we examine an approach that is widely used in practice -- the
alternating minimization framework. Jain, Netrapalli and Sanghavi~\cite{jns13}
showed that if $M$ has incoherent rows and columns, then alternating
minimization provably recovers the matrix by observing a number of entries that
is nearly linear in $n$. While the sample complexity has been subsequently
improved~\cite{glz17}, alternating minimization steps are required to be
computed exactly. This hinders the development of more efficient algorithms and
fails to depict the practical implementation of alternating minimization, where
the updates are usually performed approximately in favor of efficiency.
In this paper, we take a major step towards a more efficient and error-robust
alternating minimization framework. To this end, we develop an analytical
framework for alternating minimization that can tolerate a moderate amount of
error caused by approximate updates. Moreover, our algorithm runs in time
$\widetilde{O}(|\Omega| k)$, which is nearly linear in the time to verify the
solution while preserving the sample complexity. This improves upon all prior
known alternating minimization approaches, which require $\widetilde{O}(|\Omega| k^2)$ time.
Comment: Improves the runtime from $\widetilde{O}(|\Omega| k^2)$ to $\widetilde{O}(|\Omega| k)$.
A Fast Optimization View: Reformulating Single Layer Attention in LLM Based on Tensor and SVM Trick, and Solving It in Matrix Multiplication Time
Large language models (LLMs) have played a pivotal role in revolutionizing
various facets of our daily existence. Solving attention regression is a
fundamental task in optimizing LLMs. In this work, we focus on giving a
provable guarantee for the one-layer attention network objective function
$L(X,Y) = \sum_{j_0=1}^n \sum_{i_0=1}^d \big( \langle \langle \exp( \mathsf{A}_{j_0} x ), {\bf 1}_n \rangle^{-1} \exp( \mathsf{A}_{j_0} x ), A_3 Y_{*,i_0} \rangle - b_{j_0,i_0} \big)^2$.
Here $\mathsf{A} \in \mathbb{R}^{n^2 \times d^2}$ is the Kronecker product between $A_1 \in \mathbb{R}^{n \times d}$ and
$A_2 \in \mathbb{R}^{n \times d}$. $A_3$ is a matrix in $\mathbb{R}^{n \times d}$, and $\mathsf{A}_{j_0} \in \mathbb{R}^{n \times d^2}$ is the $j_0$-th block of
$\mathsf{A}$. The $X, Y \in \mathbb{R}^{d \times d}$ are variables we want to
learn. $B \in \mathbb{R}^{n \times d}$, and $b_{j_0,i_0} \in \mathbb{R}$ is one
entry at the $j_0$-th row and $i_0$-th column of $B$; $Y_{*,i_0} \in \mathbb{R}^d$ is the $i_0$-th column vector of $Y$, and $x \in \mathbb{R}^{d^2}$ is the
vectorization of $X$.
In a multi-layer LLM network, the matrix $B \in \mathbb{R}^{n \times d}$ can
be viewed as the output of a layer, and $A_1 = A_2 = A_3 \in \mathbb{R}^{n \times d}$ can be viewed as the input of a layer. The matrix version of $x$ can
be viewed as $QK^\top$ and $Y$ can be viewed as $V$. We provide an iterative
greedy algorithm to train the loss function $L(X,Y)$ up to accuracy $\epsilon$ that runs in
$\widetilde{O}\big(({\cal T}_{\mathrm{mat}}(n,n,d) + {\cal T}_{\mathrm{mat}}(n,d,d) + d^{2\omega}) \log(1/\epsilon)\big)$ time. Here ${\cal T}_{\mathrm{mat}}(a,b,c)$ denotes the time of multiplying an $a \times b$ matrix with
another $b \times c$ matrix, and $\omega \approx 2.37$ denotes the exponent of
matrix multiplication.
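In matrix form, the objective above measures how well a softmax-normalized attention map reproduces the target $B$. Below is a small NumPy sketch of evaluating that loss (only the forward computation, not the paper's iterative greedy solver), with toy shapes and scalings chosen purely for illustration.

    import numpy as np

    def attention_objective(A1, A2, A3, B, X, Y):
        """Squared loss of a softmax one-layer attention map against target B.

        Computes D(X)^{-1} exp(A1 X A2^T) A3 Y, where D(X) normalizes each row
        of the entrywise exponential, and returns the squared error against B."""
        S = np.exp(A1 @ X @ A2.T)             # n x n unnormalized attention scores
        A = S / S.sum(axis=1, keepdims=True)  # row-wise softmax normalization D^{-1} S
        pred = A @ A3 @ Y                     # n x d prediction
        return 0.5 * np.linalg.norm(pred - B) ** 2

    # toy usage with self-attention-style shapes A1 = A2 = A3 = layer input
    rng = np.random.default_rng(0)
    n, d = 8, 4
    A = rng.normal(size=(n, d)) / np.sqrt(d)
    X = rng.normal(size=(d, d)) / d           # plays the role of Q K^T
    Y = rng.normal(size=(d, d)) / d           # plays the role of V
    B = rng.normal(size=(n, d))               # output of the layer, used as target
    loss = attention_objective(A, A, A, B, X, Y)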
New Subset Selection Algorithms for Low Rank Approximation: Offline and Online
Subset selection for the rank-$k$ approximation of an $n \times d$ matrix
offers improvements in the interpretability of matrices, as well as a variety
of computational savings. This problem is well-understood when the error
measure is the Frobenius norm, with various tight algorithms known even in
challenging models such as the online model, where an algorithm must select the
column subset irrevocably when the columns arrive one by one. In contrast, for
other matrix losses, optimal trade-offs between the subset size and
approximation quality have not been settled, even in the offline setting. We
give a number of results towards closing these gaps.
In the offline setting, we achieve nearly optimal bicriteria algorithms in
two settings. First, we remove a factor from a result of [SWZ19] when
the loss function is any entrywise loss with an approximate triangle inequality
and at least linear growth. Our result is tight for the entrywise $\ell_1$ loss. We give
a similar improvement for entrywise $\ell_p$ losses, improving on the previously
known distortion. Our results come from a
technique which replaces the use of a well-conditioned basis with a slightly
larger spanning set for which any vector can be expressed as a linear
combination with small Euclidean norm. We show that this technique also gives
the first oblivious $\ell_p$ subspace embeddings with nearly optimal distortion, closing a long line of work.
In the online setting, we give the first online subset selection algorithm
for $\ell_p$ subspace approximation and entrywise $\ell_p$ low rank
approximation by implementing sensitivity sampling online, which is challenging
due to the sequential nature of sensitivity sampling. Our main technique is an
online algorithm for detecting when an approximately optimal subspace changes
substantially.
Comment: To appear in STOC 2023; abstract shortened.
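As a point of reference for the Frobenius-norm setting the abstract describes as well understood, a standard offline column subset selection heuristic samples columns by their rank-$k$ leverage scores and projects onto their span. The sketch below is only this classical baseline, not the paper's bicriteria or online $\ell_p$ algorithms; the sample size and toy matrix are illustrative assumptions.

    import numpy as np

    def leverage_score_column_sample(A, k, s, rng):
        """Sample s columns of A with probability proportional to rank-k leverage scores.

        Scores come from the top-k right singular subspace; this is the standard
        Frobenius-norm subset selection baseline, not the paper's algorithm."""
        _, _, Vt = np.linalg.svd(A, full_matrices=False)
        scores = np.sum(Vt[:k] ** 2, axis=0)          # rank-k column leverage scores
        probs = scores / scores.sum()
        return rng.choice(A.shape[1], size=s, replace=False, p=probs)

    def project_onto_columns(A, idx):
        """Best Frobenius-norm approximation of A within the span of the chosen columns."""
        C = A[:, idx]
        return C @ np.linalg.pinv(C) @ A

    # toy usage: approximate a noisy rank-5 matrix from 15 sampled columns
    rng = np.random.default_rng(0)
    A = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 80)) + 0.01 * rng.normal(size=(100, 80))
    idx = leverage_score_column_sample(A, k=5, s=15, rng=rng)
    rel_err = np.linalg.norm(A - project_onto_columns(A, idx)) / np.linalg.norm(A)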
Large Scale Kernel Methods for Fun and Profit
Kernel methods are among the most flexible classes of machine learning models with strong theoretical guarantees. Wide classes of functions can be approximated arbitrarily well with kernels, while fast convergence and learning rates have been formally shown to hold. Exact kernel methods are known to scale poorly with increasing dataset size, and we believe that one of the factors limiting their usage in modern machine learning is the lack of scalable and easy to use algorithms and software.
The main goal of this thesis is to study kernel methods from the point of view of efficient learning, with particular emphasis on large-scale data, but also on low-latency training, and user efficiency. We improve the state-of-the-art for scaling kernel solvers to datasets with billions of points using the Falkon algorithm, which combines random projections with fast optimization. Running it on GPUs, we show how to fully utilize available computing power for training kernel machines. To boost the ease-of-use of approximate kernel solvers, we propose an algorithm for automated hyperparameter tuning. By minimizing a penalized loss function, a model can be learned together with its hyperparameters, reducing the time needed for user-driven experimentation. In the setting of multi-class learning, we show that – under stringent but realistic assumptions on the separation between classes – a wide set of algorithms needs much fewer data points than in the more general setting (without assumptions on class separation) to reach the same accuracy.
The first part of the thesis develops a framework for efficient and scalable kernel machines. This raises the question of whether our approaches can be used successfully in real-world applications, especially compared to alternatives based on deep learning which are often deemed hard to beat.
The second part aims to investigate this question on two main applications, chosen because of the paramount importance of having an efficient algorithm. First, we consider the problem of instance segmentation of images taken from the iCub robot. Here Falkon is used as part of a larger pipeline, but the efficiency afforded by our solver is essential to ensure smooth human-robot interactions. In the second instance, we consider time-series forecasting of wind speed, analysing the relevance of different physical variables on the predictions themselves. We investigate different schemes to adapt i.i.d. learning to the time-series setting. Overall, this work aims to demonstrate, through novel algorithms and examples, that kernel methods are up to computationally demanding tasks, and that there are concrete applications in which their use is warranted and more efficient than that of other, more complex, and less theoretically grounded models.
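To make the scaling point concrete, here is a bare-bones Nyström-style approximate kernel ridge regression solver, the kind of estimator Falkon accelerates further with preconditioned conjugate gradients and GPU kernels. This sketch uses a plain direct solve and illustrative hyperparameters, so it is only a stand-in for the actual Falkon solver.

    import numpy as np

    def rbf_kernel(X, Z, gamma=1.0):
        """Gaussian (RBF) kernel matrix between rows of X and rows of Z."""
        d2 = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * X @ Z.T
        return np.exp(-gamma * d2)

    def nystrom_krr_fit(X, y, m, lam=1e-3, gamma=1.0, rng=None):
        """Nystrom-approximate kernel ridge regression with m random centers.

        Solves (Knm^T Knm + lam * n * Kmm) alpha = Knm^T y, so the cost scales
        with n * m^2 rather than n^3 as in the exact kernel solver."""
        rng = rng or np.random.default_rng(0)
        centers = X[rng.choice(len(X), size=m, replace=False)]
        Knm = rbf_kernel(X, centers, gamma)
        Kmm = rbf_kernel(centers, centers, gamma)
        alpha = np.linalg.solve(Knm.T @ Knm + lam * len(X) * Kmm, Knm.T @ y)
        return centers, alpha

    def nystrom_krr_predict(Xtest, centers, alpha, gamma=1.0):
        """Evaluate the fitted Nystrom model at new points."""
        return rbf_kernel(Xtest, centers, gamma) @ alpha

    # toy usage: regression on a noisy sine with 100 inducing centers
    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(2000, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=2000)
    centers, alpha = nystrom_krr_fit(X, y, m=100, gamma=2.0, rng=rng)
    pred = nystrom_krr_predict(X[:5], centers, alpha, gamma=2.0)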
Linear-scaling kernels for protein sequences and small molecules outperform deep learning while providing uncertainty quantitation and improved interpretability
Gaussian processes (GPs) are Bayesian models which provide several advantages
for regression tasks in machine learning, such as reliable quantitation of
uncertainty and improved interpretability. Their adoption has been precluded by
their excessive computational cost and by the difficulty in adapting them for
analyzing sequences (e.g. amino acid and nucleotide sequences) and graphs (e.g.
ones representing small molecules). In this study, we develop efficient and
scalable approaches for fitting GP models as well as fast convolution kernels
which scale linearly with graph or sequence size. We implement these
improvements by building an open-source Python library called xGPR. We compare
the performance of xGPR with the reported performance of various deep learning
models on 20 benchmarks, including small molecule, protein sequence and tabular
data. We show that xGPR achieves highly competitive performance with much
shorter training time. Furthermore, we also develop new kernels for sequence
and graph data and show that xGPR generally outperforms convolutional neural
networks on predicting key properties of proteins and small molecules.
Importantly, xGPR provides uncertainty information not available from typical
deep learning models. Additionally, xGPR provides a representation of the input
data that can be used for clustering and data visualization. These results
demonstrate that xGPR provides a powerful and generic tool that can be broadly
useful in protein engineering and drug discovery.
Comment: This is a revised version of the original manuscript with additional
experiments.
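xGPR's convolution kernels for sequences and graphs are beyond a short example, but the basic mechanism behind linear-scaling approximate GPs can be sketched with generic random Fourier features for a tabular RBF kernel: training cost grows linearly in the number of data points while the model still returns predictive variances. Everything here (feature count, lengthscale, noise level, function names) is an illustrative assumption, not xGPR's implementation.

    import numpy as np

    def rff_features(X, W, b):
        """Random Fourier features approximating an RBF kernel: z(x) = sqrt(2/D) cos(Wx + b)."""
        return np.sqrt(2.0 / W.shape[0]) * np.cos(X @ W.T + b)

    def rff_gp_fit(X, y, D=300, lengthscale=1.0, noise=0.1, rng=None):
        """Approximate GP regression as Bayesian linear regression on D random features."""
        rng = rng or np.random.default_rng(0)
        W = rng.normal(scale=1.0 / lengthscale, size=(D, X.shape[1]))
        b = rng.uniform(0, 2 * np.pi, size=D)
        Z = rff_features(X, W, b)
        A = Z.T @ Z + noise**2 * np.eye(D)      # posterior precision (up to scaling)
        mean_w = np.linalg.solve(A, Z.T @ y)    # posterior mean of feature weights
        cov_w = noise**2 * np.linalg.inv(A)     # posterior covariance of feature weights
        return W, b, mean_w, cov_w

    def rff_gp_predict(Xtest, model):
        """Predictive mean and variance, the uncertainty output the abstract highlights."""
        W, b, mean_w, cov_w = model
        Z = rff_features(Xtest, W, b)
        mean = Z @ mean_w
        var = np.einsum('ij,jk,ik->i', Z, cov_w, Z)
        return mean, var

    # toy usage: predictive means and variances at two test points
    rng = np.random.default_rng(1)
    X = rng.uniform(-3, 3, size=(500, 1))
    y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=500)
    model = rff_gp_fit(X, y, rng=rng)
    mu, var = rff_gp_predict(np.array([[0.0], [2.5]]), model)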
On the Noise Sensitivity of the Randomized SVD
The randomized singular value decomposition (R-SVD) is a popular
sketching-based algorithm for efficiently computing the partial SVD of a large
matrix. When the matrix is low-rank, the R-SVD produces its partial SVD
exactly; but when the rank is large, it only yields an approximation.
Motivated by applications in data science and principal component analysis
(PCA), we analyze the R-SVD under a low-rank signal plus noise measurement
model; specifically, when its input is a spiked random matrix. The singular
values produced by the R-SVD are shown to exhibit a BBP-like phase transition:
when the SNR exceeds a certain detectability threshold, which depends on the
dimension reduction factor, the largest singular value is an outlier; below the
threshold, no outlier emerges from the bulk of singular values. We further
compute asymptotic formulas for the overlap between the ground truth signal
singular vectors and the approximations produced by the R-SVD.
Dimensionality reduction has the adverse effect of amplifying the noise in a
highly nonlinear manner. Our results demonstrate the statistical advantage --
in both signal detection and estimation -- of the R-SVD over more naive
sketched PCA variants; the advantage is especially dramatic when the sketching
dimension is small. Our analysis is asymptotically exact, and substantially
more fine-grained than existing operator-norm error bounds for the R-SVD, which
largely fail to give meaningful error estimates in the moderate SNR regime. It
applies for a broad family of sketching matrices previously considered in the
literature, including Gaussian i.i.d. sketches, random projections, and the
sub-sampled Hadamard transform, among others.
Lastly, we derive an optimal singular value shrinker for singular values and
vectors obtained through the R-SVD, which may be useful for applications in
matrix denoising.
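The algorithm being analyzed is the standard sketch-then-project randomized SVD. A compact NumPy version with a Gaussian sketch, followed by a toy rank-one spiked-matrix experiment of the kind the analysis considers (the SNR, sketch size and seed are illustrative assumptions), is shown below.

    import numpy as np

    def randomized_svd(A, rank, oversample=10, rng=None):
        """Basic randomized SVD: Gaussian sketch, orthogonalize, SVD of the small projection."""
        rng = rng or np.random.default_rng(0)
        m, n = A.shape
        Omega = rng.normal(size=(n, rank + oversample))   # Gaussian sketching matrix
        Q, _ = np.linalg.qr(A @ Omega)                    # orthonormal basis for the sketched range
        U_small, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
        U = Q @ U_small
        return U[:, :rank], s[:rank], Vt[:rank]

    # toy spiked model: rank-1 signal plus noise, compare R-SVD output to ground truth
    rng = np.random.default_rng(0)
    m, n, snr = 400, 400, 4.0
    u = rng.normal(size=m)
    u /= np.linalg.norm(u)
    v = rng.normal(size=n)
    v /= np.linalg.norm(v)
    A = snr * np.outer(u, v) + rng.normal(size=(m, n)) / np.sqrt(n)
    U, s, Vt = randomized_svd(A, rank=1, oversample=20, rng=rng)
    overlap = abs(U[:, 0] @ u)     # cosine between estimated and true signal singular vector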
Structured Mixture Models
Finite mixture models are a staple of model-based clustering approaches for distinguishing subgroups. A common mixture model is the finite Gaussian mixture model, whose degrees of freedom scale quadratically with increasing data dimension. Methods in the literature often tackle the degrees of freedom of the Gaussian mixture model by sharing parameters between the eigendecomposition of covariance matrices across all mixture components. We posit finite Gaussian mixture models with alternate forms of parameter sharing by imposing additional structure on the parameters, such as sharing parameters with other components as a convex combination of the corresponding parent components or imposing a sequence of hierarchical clustering structure in orthogonal subspaces with common parameters across levels. Estimation procedures using the Expectation-Maximization (EM) algorithm are derived throughout, with application to simulated and real-world datasets. Moreover, the proposed model structures have an interpretable meaning that can shed light on clustering analyses performed by practitioners in the context of their data.
The EM algorithm is a popular estimation method for tackling issues of latent data, such as in finite mixture models where component memberships are often latent. One aspect of the EM algorithm that hampers estimation is a slow rate of convergence, which affects the estimation of finite Gaussian mixture models. To explore avenues of improvement, we study the extrapolation of the sequence of conditional expectations admitting general EM procedures, with minimal modifications for many common models. With the same mindset of accelerating iterative algorithms, we also examine the use of approximate sketching methods in estimating generalized linear models via iteratively re-weighted least squares, with emphasis on practical data infrastructure constraints. We propose a sketching method that controls for both data transfer and computation costs, the former of which is often overlooked in asymptotic complexity analyses, and we are able to achieve an approximate result in much faster wall-clock time compared to the exact solution on real-world hardware, while also estimating standard errors in addition to point estimates.
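As background for the estimation machinery discussed above, the generic EM updates for a Gaussian mixture can be written in a few lines. The sketch below is a plain univariate, unstructured two-component version; the structured parameter sharing and the extrapolation-based acceleration proposed in the thesis are not reproduced here, and the toy data are an assumption for illustration.

    import numpy as np

    def em_gmm_1d(x, K=2, iters=100, seed=0):
        """EM for a univariate K-component Gaussian mixture (generic, unstructured).

        E-step: posterior component responsibilities; M-step: closed-form
        updates of weights, means and variances."""
        rng = np.random.default_rng(seed)
        n = len(x)
        pi = np.full(K, 1.0 / K)                        # mixing weights
        mu = rng.choice(x, size=K, replace=False)       # initial means drawn from the data
        var = np.full(K, np.var(x))
        for _ in range(iters):
            # E-step: responsibilities r[i, k] = P(component k | x_i)
            log_dens = (-0.5 * (x[:, None] - mu) ** 2 / var
                        - 0.5 * np.log(2 * np.pi * var) + np.log(pi))
            log_dens -= log_dens.max(axis=1, keepdims=True)
            r = np.exp(log_dens)
            r /= r.sum(axis=1, keepdims=True)
            # M-step: weighted maximum likelihood updates
            nk = r.sum(axis=0)
            pi = nk / n
            mu = (r * x[:, None]).sum(axis=0) / nk
            var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        return pi, mu, var

    # toy usage: two well-separated components
    rng = np.random.default_rng(1)
    x = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(3, 1.0, 700)])
    pi, mu, var = em_gmm_1d(x, K=2)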
A Framework for Statistical Inference via Randomized Algorithms
Randomized algorithms, such as randomized sketching or projections, are a
promising approach to ease the computational burden in analyzing large
datasets. However, randomized algorithms also produce non-deterministic
outputs, leading to the problem of evaluating their accuracy. In this paper, we
develop a statistical inference framework for quantifying the uncertainty of
the outputs of randomized algorithms. We develop appropriate statistical
methods -- sub-randomization, multi-run plug-in and multi-run aggregation
inference -- by using multiple runs of the same randomized algorithm, or by
estimating the unknown parameters of the limiting distribution. As an example,
we develop methods for statistical inference for least squares parameters via
random sketching using matrices with i.i.d. entries, or uniform partial
orthogonal matrices. For this, we characterize the limiting distribution of
estimators obtained via sketch-and-solve as well as partial sketching methods.
The analysis of i.i.d. sketches uses a trigonometric interpolation argument to
establish a differential equation for the limiting expected characteristic
function and find the dependence on the kurtosis of the entries of the
sketching matrix. The results are supported via a broad range of simulations
- …
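To illustrate the setting, here is a minimal sketch-and-solve least squares routine with an i.i.d. Gaussian sketch, run several times so that the run-to-run spread of its outputs is visible. The crude two-standard-deviation interval at the end is only a stand-in for the sub-randomization and multi-run procedures developed in the paper, and the problem sizes are illustrative assumptions.

    import numpy as np

    def sketch_and_solve(X, y, m, rng):
        """One sketch-and-solve run: compress (X, y) with an i.i.d. Gaussian sketch
        of m rows, then solve the small least squares problem."""
        S = rng.normal(size=(m, X.shape[0])) / np.sqrt(m)
        beta, *_ = np.linalg.lstsq(S @ X, S @ y, rcond=None)
        return beta

    # multiple independent runs of the same randomized algorithm; the spread of the
    # outputs is what multi-run style inference procedures exploit
    rng = np.random.default_rng(0)
    n, d, m = 20000, 5, 500
    X = rng.normal(size=(n, d))
    beta_true = np.arange(1.0, d + 1)
    y = X @ beta_true + rng.normal(size=n)
    runs = np.stack([sketch_and_solve(X, y, m, rng) for _ in range(30)])
    point = runs.mean(axis=0)                          # aggregated point estimate
    spread = runs.std(axis=0, ddof=1)                  # run-to-run variability per coordinate
    crude_interval = np.stack([point - 2 * spread, point + 2 * spread])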