190 research outputs found

    Randomized algorithms for low-rank matrix approximation: Design, analysis, and applications

    Full text link
    This survey explores modern approaches for computing low-rank approximations of high-dimensional matrices by means of the randomized SVD, randomized subspace iteration, and randomized block Krylov iteration. The paper compares the procedures via theoretical analyses and numerical studies to highlight how the best choice of algorithm depends on spectral properties of the matrix and the computational resources available. Despite superior performance for many problems, randomized block Krylov iteration has not been widely adopted in computational science. The paper strengthens the case for this method in three ways. First, it presents new pseudocode that can significantly reduce computational costs. Second, it provides a new analysis that yields simple, precise, and informative error bounds. Last, it showcases applications to challenging scientific problems, including principal component analysis for genetic data and spectral clustering for molecular dynamics data.Comment: 60 pages, 14 figure

    Federated Empirical Risk Minimization via Second-Order Method

    Full text link
    Many convex optimization problems with important applications in machine learning are formulated as empirical risk minimization (ERM). There are several examples: linear and logistic regression, LASSO, kernel regression, quantile regression, pp-norm regression, support vector machines (SVM), and mean-field variational inference. To improve data privacy, federated learning is proposed in machine learning as a framework for training deep learning models on the network edge without sharing data between participating nodes. In this work, we present an interior point method (IPM) to solve a general ERM problem under the federated learning setting. We show that the communication complexity of each iteration of our IPM is O~(d3/2)\tilde{O}(d^{3/2}), where dd is the dimension (i.e., number of features) of the dataset

    Low Rank Matrix Completion via Robust Alternating Minimization in Nearly Linear Time

    Full text link
    Given a matrix MRm×nM\in \mathbb{R}^{m\times n}, the low rank matrix completion problem asks us to find a rank-kk approximation of MM as UVUV^\top for URm×kU\in \mathbb{R}^{m\times k} and VRn×kV\in \mathbb{R}^{n\times k} by only observing a few entries specified by a set of entries Ω[m]×[n]\Omega\subseteq [m]\times [n]. In particular, we examine an approach that is widely used in practice -- the alternating minimization framework. Jain, Netrapalli and Sanghavi~\cite{jns13} showed that if MM has incoherent rows and columns, then alternating minimization provably recovers the matrix MM by observing a nearly linear in nn number of entries. While the sample complexity has been subsequently improved~\cite{glz17}, alternating minimization steps are required to be computed exactly. This hinders the development of more efficient algorithms and fails to depict the practical implementation of alternating minimization, where the updates are usually performed approximately in favor of efficiency. In this paper, we take a major step towards a more efficient and error-robust alternating minimization framework. To this end, we develop an analytical framework for alternating minimization that can tolerate moderate amount of errors caused by approximate updates. Moreover, our algorithm runs in time O~(Ωk)\widetilde O(|\Omega| k), which is nearly linear in the time to verify the solution while preserving the sample complexity. This improves upon all prior known alternating minimization approaches which require O~(Ωk2)\widetilde O(|\Omega| k^2) time.Comment: Improve the runtime from O(mnk)O(mnk) to $O|\Omega| k)

    A Fast Optimization View: Reformulating Single Layer Attention in LLM Based on Tensor and SVM Trick, and Solving It in Matrix Multiplication Time

    Full text link
    Large language models (LLMs) have played a pivotal role in revolutionizing various facets of our daily existence. Solving attention regression is a fundamental task in optimizing LLMs. In this work, we focus on giving a provable guarantee for the one-layer attention network objective function L(X,Y)=j0=1ni0=1d(exp(Aj0x),1n1exp(Aj0x),A3Y,i0bj0,i0)2L(X,Y) = \sum_{j_0 = 1}^n \sum_{i_0 = 1}^d ( \langle \langle \exp( \mathsf{A}_{j_0} x ) , {\bf 1}_n \rangle^{-1} \exp( \mathsf{A}_{j_0} x ), A_{3} Y_{*,i_0} \rangle - b_{j_0,i_0} )^2. Here ARn2×d2\mathsf{A} \in \mathbb{R}^{n^2 \times d^2} is Kronecker product between A1Rn×dA_1 \in \mathbb{R}^{n \times d} and A2Rn×dA_2 \in \mathbb{R}^{n \times d}. A3A_3 is a matrix in Rn×d\mathbb{R}^{n \times d}, Aj0Rn×d2\mathsf{A}_{j_0} \in \mathbb{R}^{n \times d^2} is the j0j_0-th block of A\mathsf{A}. The X,YRd×dX, Y \in \mathbb{R}^{d \times d} are variables we want to learn. BRn×dB \in \mathbb{R}^{n \times d} and bj0,i0Rb_{j_0,i_0} \in \mathbb{R} is one entry at j0j_0-th row and i0i_0-th column of BB, Y,i0RdY_{*,i_0} \in \mathbb{R}^d is the i0i_0-column vector of YY, and xRd2x \in \mathbb{R}^{d^2} is the vectorization of XX. In a multi-layer LLM network, the matrix BRn×dB \in \mathbb{R}^{n \times d} can be viewed as the output of a layer, and A1=A2=A3Rn×dA_1= A_2 = A_3 \in \mathbb{R}^{n \times d} can be viewed as the input of a layer. The matrix version of xx can be viewed as QKQK^\top and YY can be viewed as VV. We provide an iterative greedy algorithm to train loss function L(X,Y)L(X,Y) up ϵ\epsilon that runs in O~((Tmat(n,n,d)+Tmat(n,d,d)+d2ω)log(1/ϵ))\widetilde{O}( ({\cal T}_{\mathrm{mat}}(n,n,d) + {\cal T}_{\mathrm{mat}}(n,d,d) + d^{2\omega}) \log(1/\epsilon) ) time. Here Tmat(a,b,c){\cal T}_{\mathrm{mat}}(a,b,c) denotes the time of multiplying a×ba \times b matrix another b×cb \times c matrix, and ω2.37\omega\approx 2.37 denotes the exponent of matrix multiplication

    Parallel Randomized Tucker Decomposition Algorithms

    Full text link
    The Tucker tensor decomposition is a natural extension of the singular value decomposition (SVD) to multiway data. We propose to accelerate Tucker tensor decomposition algorithms by using randomization and parallelization. We present two algorithms that scale to large data and many processors, significantly reduce both computation and communication cost compared to previous deterministic and randomized approaches, and obtain nearly the same approximation errors. The key idea in our algorithms is to perform randomized sketches with Kronecker-structured random matrices, which reduces computation compared to unstructured matrices and can be implemented using a fundamental tensor computational kernel. We provide probabilistic error analysis of our algorithms and implement a new parallel algorithm for the structured randomized sketch. Our experimental results demonstrate that our combination of randomization and parallelization achieves accurate Tucker decompositions much faster than alternative approaches. We observe up to a 16X speedup over the fastest deterministic parallel implementation on 3D simulation data

    New Subset Selection Algorithms for Low Rank Approximation: Offline and Online

    Full text link
    Subset selection for the rank kk approximation of an n×dn\times d matrix AA offers improvements in the interpretability of matrices, as well as a variety of computational savings. This problem is well-understood when the error measure is the Frobenius norm, with various tight algorithms known even in challenging models such as the online model, where an algorithm must select the column subset irrevocably when the columns arrive one by one. In contrast, for other matrix losses, optimal trade-offs between the subset size and approximation quality have not been settled, even in the offline setting. We give a number of results towards closing these gaps. In the offline setting, we achieve nearly optimal bicriteria algorithms in two settings. First, we remove a k\sqrt k factor from a result of [SWZ19] when the loss function is any entrywise loss with an approximate triangle inequality and at least linear growth. Our result is tight for the 1\ell_1 loss. We give a similar improvement for entrywise p\ell_p losses for p>2p>2, improving a previous distortion of k11/pk^{1-1/p} to k1/21/pk^{1/2-1/p}. Our results come from a technique which replaces the use of a well-conditioned basis with a slightly larger spanning set for which any vector can be expressed as a linear combination with small Euclidean norm. We show that this technique also gives the first oblivious p\ell_p subspace embeddings for 1<p<21<p<2 with O~(d1/p)\tilde O(d^{1/p}) distortion, which is nearly optimal and closes a long line of work. In the online setting, we give the first online subset selection algorithm for p\ell_p subspace approximation and entrywise p\ell_p low rank approximation by implementing sensitivity sampling online, which is challenging due to the sequential nature of sensitivity sampling. Our main technique is an online algorithm for detecting when an approximately optimal subspace changes substantially.Comment: To appear in STOC 2023; abstract shortene

    Large Scale Kernel Methods for Fun and Profit

    Get PDF
    Kernel methods are among the most flexible classes of machine learning models with strong theoretical guarantees. Wide classes of functions can be approximated arbitrarily well with kernels, while fast convergence and learning rates have been formally shown to hold. Exact kernel methods are known to scale poorly with increasing dataset size, and we believe that one of the factors limiting their usage in modern machine learning is the lack of scalable and easy to use algorithms and software. The main goal of this thesis is to study kernel methods from the point of view of efficient learning, with particular emphasis on large-scale data, but also on low-latency training, and user efficiency. We improve the state-of-the-art for scaling kernel solvers to datasets with billions of points using the Falkon algorithm, which combines random projections with fast optimization. Running it on GPUs, we show how to fully utilize available computing power for training kernel machines. To boost the ease-of-use of approximate kernel solvers, we propose an algorithm for automated hyperparameter tuning. By minimizing a penalized loss function, a model can be learned together with its hyperparameters, reducing the time needed for user-driven experimentation. In the setting of multi-class learning, we show that – under stringent but realistic assumptions on the separation between classes – a wide set of algorithms needs much fewer data points than in the more general setting (without assumptions on class separation) to reach the same accuracy. The first part of the thesis develops a framework for efficient and scalable kernel machines. This raises the question of whether our approaches can be used successfully in real-world applications, especially compared to alternatives based on deep learning which are often deemed hard to beat. The second part aims to investigate this question on two main applications, chosen because of the paramount importance of having an efficient algorithm. First, we consider the problem of instance segmentation of images taken from the iCub robot. Here Falkon is used as part of a larger pipeline, but the efficiency afforded by our solver is essential to ensure smooth human-robot interactions. In the second instance, we consider time-series forecasting of wind speed, analysing the relevance of different physical variables on the predictions themselves. We investigate different schemes to adapt i.i.d. learning to the time-series setting. Overall, this work aims to demonstrate, through novel algorithms and examples, that kernel methods are up to computationally demanding tasks, and that there are concrete applications in which their use is warranted and more efficient than that of other, more complex, and less theoretically grounded models

    Linear-scaling kernels for protein sequences and small molecules outperform deep learning while providing uncertainty quantitation and improved interpretability

    Full text link
    Gaussian process (GP) is a Bayesian model which provides several advantages for regression tasks in machine learning such as reliable quantitation of uncertainty and improved interpretability. Their adoption has been precluded by their excessive computational cost and by the difficulty in adapting them for analyzing sequences (e.g. amino acid and nucleotide sequences) and graphs (e.g. ones representing small molecules). In this study, we develop efficient and scalable approaches for fitting GP models as well as fast convolution kernels which scale linearly with graph or sequence size. We implement these improvements by building an open-source Python library called xGPR. We compare the performance of xGPR with the reported performance of various deep learning models on 20 benchmarks, including small molecule, protein sequence and tabular data. We show that xGRP achieves highly competitive performance with much shorter training time. Furthermore, we also develop new kernels for sequence and graph data and show that xGPR generally outperforms convolutional neural networks on predicting key properties of proteins and small molecules. Importantly, xGPR provides uncertainty information not available from typical deep learning models. Additionally, xGPR provides a representation of the input data that can be used for clustering and data visualization. These results demonstrate that xGPR provides a powerful and generic tool that can be broadly useful in protein engineering and drug discovery.Comment: This is a revised version of the original manuscript with additional experiment

    On the Noise Sensitivity of the Randomized SVD

    Full text link
    The randomized singular value decomposition (R-SVD) is a popular sketching-based algorithm for efficiently computing the partial SVD of a large matrix. When the matrix is low-rank, the R-SVD produces its partial SVD exactly; but when the rank is large, it only yields an approximation. Motivated by applications in data science and principal component analysis (PCA), we analyze the R-SVD under a low-rank signal plus noise measurement model; specifically, when its input is a spiked random matrix. The singular values produced by the R-SVD are shown to exhibit a BBP-like phase transition: when the SNR exceeds a certain detectability threshold, that depends on the dimension reduction factor, the largest singular value is an outlier; below the threshold, no outlier emerges from the bulk of singular values. We further compute asymptotic formulas for the overlap between the ground truth signal singular vectors and the approximations produced by the R-SVD. Dimensionality reduction has the adverse affect of amplifying the noise in a highly nonlinear manner. Our results demonstrate the statistical advantage -- in both signal detection and estimation -- of the R-SVD over more naive sketched PCA variants; the advantage is especially dramatic when the sketching dimension is small. Our analysis is asymptotically exact, and substantially more fine-grained than existing operator-norm error bounds for the R-SVD, which largely fail to give meaningful error estimates in the moderate SNR regime. It applies for a broad family of sketching matrices previously considered in the literature, including Gaussian i.i.d. sketches, random projections, and the sub-sampled Hadamard transform, among others. Lastly, we derive an optimal singular value shrinker for singular values and vectors obtained through the R-SVD, which may be useful for applications in matrix denoising

    Structured Mixture Models

    Get PDF
    Finite mixture models are a staple of model-based clustering approaches for distinguishing subgroups. A common mixture model is the finite Gaussian mixture model, whose degrees of freedom scales quadratically with increasing data dimension. Methods in the literature often tackle the degrees of freedom of the Gaussian mixture model by sharing parameters between the eigendecomposition of covariance matrices across all mixture components. We posit finite Gaussian mixture models with alternate forms of parameter sharing by imposing additional structure on the parameters, such as sharing parameters with other components as a convex combination of the corresponding parent components or by imposing a sequence of hierarchical clustering structure in orthogonal subspaces with common parameters across levels. Estimation procedures using the Expectation-Maximization (EM) algorithm are derived throughout, with application to simulated and real-world datasets. As well, the proposed model structures have an interpretable meaning that can shed light on clustering analyses performed by practitioners in the context of their data. The EM algorithm is a popular estimation method for tackling issues of latent data, such as in finite mixture models where component memberships are often latent. One aspect of the EM algorithm that hampers estimation is a slow rate of convergence, which affects the estimation of finite Gaussian mixture models. To explore avenues of improvement, we explore the extrapolation of the sequence of conditional expectations admitting general EM procedures, with minimal modifications for many common models. With the same mindset of accelerating iterative algorithms, we also examine the use of approximate sketching methods in estimating generalized linear models via iteratively re-weighted least squares, with emphasis on practical data infrastructure constraints. We propose a sketching method that controls for both data transfer and computation costs, the former of which is often overlooked in asymptotic complexity analyses, and are able to achieve an approximate result in much faster wall-clock time compared to the exact solution on real-world hardware, and can estimate standard errors in addition to point estimates
    corecore