495 research outputs found

    Near Optimal Linear Algebra in the Online and Sliding Window Models

    We initiate the study of numerical linear algebra in the sliding window model, where only the most recent W updates in a stream form the underlying data set. We first introduce a unified row-sampling based framework that gives randomized algorithms for spectral approximation, low-rank approximation/projection-cost preservation, and ℓ1-subspace embeddings in the sliding window model, which often use nearly optimal space and achieve nearly input-sparsity runtime. Our algorithms are based on "reverse online" versions of offline sampling distributions such as (ridge) leverage scores, ℓ1 sensitivities, and Lewis weights to quantify both the importance and the recency of a row. Rather surprisingly, our row-sampling framework implies connections to the well-studied online model; our structural results also give the first sample-optimal (up to lower-order terms) online algorithm for low-rank approximation/projection-cost preservation. Using this powerful primitive, we give online algorithms for column/row subset selection and principal component analysis that resolve the main open question of Bhaskara et al. (FOCS 2019). We also give the first online algorithm for ℓ1-subspace embeddings. We further formalize the connection between the online model and the sliding window model by introducing an additional unified framework for deterministic algorithms, using a merge-and-reduce paradigm and the concept of online coresets. Our sampling-based algorithms in the row-arrival online model yield online coresets, giving deterministic algorithms for spectral approximation, low-rank approximation/projection-cost preservation, and ℓ1-subspace embeddings in the sliding window model that use nearly optimal space.
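    The row-sampling primitive above keeps a row with probability governed by an online (ridge) leverage score, i.e. its importance relative to the rows that arrived before it. The toy NumPy sketch below is not the paper's sliding-window algorithm; it only illustrates online ridge leverage score sampling in the row-arrival model, with illustrative choices for the regularization lam and the oversampling constant c.

```python
import numpy as np

def online_ridge_leverage_sample(rows, lam=1.0, c=10.0, seed=0):
    """Toy online row sampler: keep row a_i with probability
    min(1, c * tau_i), where tau_i = a_i^T (M + lam*I)^{-1} a_i is the
    ridge leverage score of a_i against the covariance M of the rows
    seen so far. (A toy: it stores the d x d covariance explicitly.)"""
    rng = np.random.default_rng(seed)
    d = rows.shape[1]
    M = np.zeros((d, d))                  # covariance of the prefix seen so far
    sample = []
    for a in rows:
        tau = float(a @ np.linalg.solve(M + lam * np.eye(d), a))
        p = min(1.0, c * tau)
        if rng.random() < p:
            sample.append(a / np.sqrt(p))  # rescale so S^T S is unbiased
        M += np.outer(a, a)
    return np.array(sample)

# Usage: the reweighted sample S should satisfy S^T S ~ A^T A spectrally
# when c is large enough; here we just report the relative error.
A = np.random.default_rng(1).normal(size=(5000, 10))
S = online_ridge_leverage_sample(A, lam=1.0, c=10.0)
err = np.linalg.norm(S.T @ S - A.T @ A, 2) / np.linalg.norm(A.T @ A, 2)
print(f"kept {S.shape[0]} of {A.shape[0]} rows, relative spectral error {err:.3f}")
```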

    Revisiting Co-Occurring Directions: Sharper Analysis and Efficient Algorithm for Sparse Matrices

    We study the streaming model for approximate matrix multiplication (AMM), in the setting where the algorithm may take only one pass over the data with limited memory. The state-of-the-art deterministic sketching algorithm for streaming AMM is co-occurring directions (COD), which has much smaller approximation errors than randomized algorithms and empirically outperforms other deterministic sketching methods. In this paper, we provide a tighter error bound for COD whose leading term accounts for the potential approximate low-rank structure and the correlation of the input matrices. We prove that COD is space-optimal with respect to our improved error bound. We also propose a variant of COD for sparse matrices with theoretical guarantees. Experiments on real-world sparse datasets show that the proposed algorithm is more efficient than baseline methods.
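    For context, co-occurring directions maintains a pair of small sketches B_x, B_y (a two-matrix analogue of frequent directions) so that B_x B_y^T approximates X Y^T after one pass over the columns. The NumPy sketch below is a minimal rendition of the basic COD update as it is usually described (QR of each sketch, SVD of R_x R_y^T, shrinkage by a middle singular value), not the sparse variant proposed in this paper; treat the details as an assumption-laden illustration.

```python
import numpy as np

def cod_sketch(X, Y, ell):
    """Basic co-occurring directions: stream over the shared column index of
    X and Y and maintain sketches Bx, By with ell columns each so that
    Bx @ By.T approximates X @ Y.T. Assumes ell <= min(X.shape[0], Y.shape[0])."""
    mx, n = X.shape
    my, _ = Y.shape
    Bx = np.zeros((mx, ell))
    By = np.zeros((my, ell))
    nz = 0                                   # number of occupied sketch columns
    for i in range(n):
        if nz == ell:                        # sketch is full: compress
            Qx, Rx = np.linalg.qr(Bx)        # reduced QR, Rx is ell x ell
            Qy, Ry = np.linalg.qr(By)
            U, s, Vt = np.linalg.svd(Rx @ Ry.T)
            delta = s[ell // 2]              # shrink by a middle singular value
            shrink = np.sqrt(np.maximum(s - delta, 0.0))
            Bx = Qx @ (U * shrink)           # column-scale U and V by shrink
            By = Qy @ (Vt.T * shrink)
            nz = int(np.count_nonzero(shrink))  # zeroed columns are free again
        Bx[:, nz] = X[:, i]
        By[:, nz] = Y[:, i]
        nz += 1
    return Bx, By

# Usage: correlated inputs with a shared low-rank signal, so a small sketch
# captures most of the cross-covariance.
rng = np.random.default_rng(0)
n = 5000
Z = rng.normal(size=(10, n))
X = rng.normal(size=(50, 10)) @ Z + 0.1 * rng.normal(size=(50, n))
Y = rng.normal(size=(40, 10)) @ Z + 0.1 * rng.normal(size=(40, n))
Bx, By = cod_sketch(X, Y, ell=30)
rel = np.linalg.norm(X @ Y.T - Bx @ By.T, 2) / np.linalg.norm(X @ Y.T, 2)
print("relative spectral error of the sketched product:", rel)
```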

    On the efficiency of finding and using tabular data summaries: scalability, accuracy, and hardness

    Tabular data is ubiquitous in modern computer science. However, these tables can be large, so computing statistics over them is inefficient in both time and space. This thesis is concerned with finding and using small summaries of large tables for scalable and accurate approximation of the data's properties, or with showing that such a summary is hard to obtain in small space. This perspective yields the following results:
    • We introduce projected frequency analysis over an n × d binary table. If the query columns S are revealed only after observing the data, then we show that space exponential in d is required for a constant-factor approximation to statistics such as the number of distinct elements on the columns in S. We present algorithms that use less space than a brute-force approach while tolerating some super-constant error in the frequency estimation.
    • We find small-space deterministic summaries for a variety of linear-algebraic problems in all p-norms for p ≥ 1, including finding rows of high leverage, subspace embedding, regression, and low-rank approximation.
    • We implement and compare various summary techniques for efficient training of large-scale regression models. We show that a sparse random projection can lead to fast model training despite weaker theoretical guarantees than its dense competitors (a minimal sketch of this idea appears below). For ridge regression, we show that a deterministic summary can reduce the number of gradient steps needed to train the model compared to random projections. We demonstrate the practicality of our approaches through experiments, showing that small-space summaries can lead to close-to-optimal solutions.
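    The last bullet refers to sketch-and-solve regression with a sparse random projection. A minimal sketch of that idea, assuming a CountSketch-style projection (one random ±1 entry per column of the sketching matrix) and ordinary least squares rather than the full pipeline studied in the thesis:

```python
import numpy as np

def countsketch(A, b, k, seed=0):
    """Apply a CountSketch-style sparse random projection S (k rows, one
    nonzero +/-1 entry per column) to the rows of [A | b]: each input row
    is hashed to one of k buckets and added with a random sign."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    buckets = rng.integers(0, k, size=n)
    signs = rng.choice([-1.0, 1.0], size=n)
    SA = np.zeros((k, A.shape[1]))
    Sb = np.zeros(k)
    np.add.at(SA, buckets, signs[:, None] * A)   # scatter-add signed rows
    np.add.at(Sb, buckets, signs * b)
    return SA, Sb

# Sketch-and-solve: fit the small k x d problem instead of the n x d one.
rng = np.random.default_rng(1)
n, d, k = 100_000, 20, 2_000
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
SA, Sb = countsketch(A, b, k)
x_sketch, *_ = np.linalg.lstsq(SA, Sb, rcond=None)
x_exact, *_ = np.linalg.lstsq(A, b, rcond=None)
print("relative coefficient error:",
      np.linalg.norm(x_sketch - x_exact) / np.linalg.norm(x_exact))
```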