Block CUR: Decomposing Matrices using Groups of Columns
A common problem in large-scale data analysis is to approximate a matrix
using a combination of specifically sampled rows and columns, known as CUR
decomposition. Unfortunately, in many real-world environments, the ability to
sample specific individual rows or columns of the matrix is limited by either
system constraints or cost. In this paper, we consider matrix approximation by
sampling predefined \emph{blocks} of columns (or rows) from the matrix. We
present an algorithm for sampling useful column blocks and provide novel
guarantees for the quality of the approximation. This algorithm has applications
in problems as diverse as biometric data analysis and distributed computing. We
demonstrate the effectiveness of the proposed algorithms for computing the
Block CUR decomposition of large matrices in a distributed setting with
multiple nodes in a compute cluster, where such blocks correspond to columns
(or rows) of the matrix stored on the same node, which can be retrieved with
much less overhead than retrieving individual columns stored across different
nodes. In the biometric setting, the rows correspond to different users and
columns correspond to users' biometric reaction to external stimuli, {\em
e.g.,}~watching video content, at a particular time instant. There is
significant cost in acquiring each user's reaction to lengthy content, so we
sample a few important scenes to approximate the biometric response. An
individual time sample in this use case cannot be queried in isolation due to
the lack of context that caused that biometric reaction. Instead, collections
of time segments ({\em i.e.,} blocks) must be presented to the user. The
practical application of these algorithms is shown via experimental results
using real-world user biometric data from a content testing environment.
Comment: shorter version to appear in ECML-PKDD 201
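As a rough illustration of the block-sampling idea, here is a minimal column-only sketch (a CX-style approximation rather than a full CUR; squared-Frobenius-norm block scores stand in for the paper's sampling distribution, and all names are hypothetical):

    import numpy as np

    def block_cx(A, block_size, n_blocks, rng=None):
        # Partition the columns of A into contiguous, predefined blocks.
        rng = np.random.default_rng(rng)
        m, n = A.shape
        blocks = [np.arange(s, min(s + block_size, n))
                  for s in range(0, n, block_size)]
        # Sample whole blocks with probability proportional to their squared
        # Frobenius norm (a simple stand-in for the paper's block scores).
        scores = np.array([np.linalg.norm(A[:, b]) ** 2 for b in blocks])
        chosen = rng.choice(len(blocks), size=n_blocks, replace=False,
                            p=scores / scores.sum())
        cols = np.concatenate([blocks[i] for i in chosen])
        C = A[:, cols]                    # the sampled column blocks
        X = np.linalg.pinv(C) @ A         # least-squares fit of A onto span(C)
        return C, X, cols

    # usage: C @ X approximates A while only ever querying whole column blocks
    A = np.random.default_rng(0).standard_normal((100, 60))
    C, X, cols = block_cx(A, block_size=10, n_blocks=3, rng=0)
    print(np.linalg.norm(A - C @ X) / np.linalg.norm(A))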
Optimal CUR Matrix Decompositions
The CUR decomposition of an $m \times n$ matrix $A$ finds an $m \times c$
matrix $C$ with a subset of $c < n$ columns of $A$, together with an
$r \times n$ matrix $R$ with a subset of $r < m$ rows of $A$, as well as a
$c \times r$ low-rank matrix $U$ such that the matrix $CUR$ approximates the
matrix $A$; that is, $\|A - CUR\|_F^2 \le (1+\epsilon)\|A - A_k\|_F^2$, where
$\|\cdot\|_F$ denotes the Frobenius norm and $A_k$ is the best matrix of rank
$k$ constructed via the SVD. We present input-sparsity-time and deterministic
algorithms for constructing such a CUR decomposition where $c = O(k/\epsilon)$,
$r = O(k/\epsilon)$, and $\mathrm{rank}(U) = k$. Up to constant factors, our
algorithms are simultaneously optimal in $c$, $r$, and $\mathrm{rank}(U)$.
Comment: small revision in lemma 4.
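A small numerical illustration of the quantities in this guarantee (uniform column/row sampling and a pseudoinverse-based middle factor are simplifying assumptions; the paper's algorithms additionally enforce $\mathrm{rank}(U) = k$ and run in input-sparsity time):

    import numpy as np

    rng = np.random.default_rng(0)
    m, n, k, c, r = 200, 150, 5, 40, 40
    A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))   # rank-k signal
    A += 0.01 * rng.standard_normal((m, n))                         # small noise

    cols = rng.choice(n, size=c, replace=False)      # toy uniform column sample
    rows = rng.choice(m, size=r, replace=False)      # toy uniform row sample
    C, R = A[:, cols], A[rows, :]
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)    # best-fit middle factor

    # Compare the CUR error against the best rank-k error from the SVD.
    Uk, sk, Vkt = np.linalg.svd(A, full_matrices=False)
    A_k = (Uk[:, :k] * sk[:k]) @ Vkt[:k, :]
    print(np.linalg.norm(A - C @ U @ R, "fro") ** 2,
          np.linalg.norm(A - A_k, "fro") ** 2)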
Variant Ranker: a web-tool to rank genomic data according to functional significance
BACKGROUND: The increasing volume and complexity of high-throughput genomic data make analysis and prioritization of variants difficult for researchers with limited bioinformatics skills. Variant Ranker allows researchers to rank identified variants and determine the most confident variants for experimental validation.
RESULTS: We describe Variant Ranker, a user-friendly, simple web-based tool for ranking, filtering and annotation of coding and non-coding variants. Variant Ranker facilitates the identification of causal variants based on novelty, effect and annotation information. The algorithm implements and aggregates multiple prediction algorithm scores, conservation scores, allelic frequencies, clinical information and additional open-source annotations using accessible databases via ANNOVAR. The available information for a variant is transformed into user-specified weights, which are in turn encoded into the ranking algorithm. Through its different modules, users can (i) rank a list of variants, (ii) perform genotype filtering for case-control samples, (iii) filter large amounts of high-throughput data based on custom filter requirements and apply different models of inheritance, and (iv) perform downstream functional enrichment analysis through network visualization. Using networks, users can identify clusters of genes that belong to multiple ontology categories (like pathways, gene ontology and disease categories) and therefore expedite scientific discoveries. We demonstrate the utility of Variant Ranker to identify causal genes using real and synthetic datasets. Our results indicate that Variant Ranker exhibits excellent performance by correctly identifying and ranking the candidate genes.
CONCLUSIONS: Variant Ranker is a freely available web server at http://paschou-lab.mbg.duth.gr/Software.html . This tool will enable users to prioritise potentially causal variants and is applicable to a wide range of sequencing data.
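The weighted-ranking step can be pictured with a minimal sketch (the column names, scaling and weights below are hypothetical placeholders, not Variant Ranker's actual scoring scheme; real annotations would come from ANNOVAR):

    import pandas as pd

    # Hypothetical, already-normalized annotation scores for three variants.
    variants = pd.DataFrame({
        "variant":      ["chr1:123A>G", "chr2:456C>T", "chr7:789G>A"],
        "prediction":   [0.9, 0.4, 0.7],   # aggregated prediction-algorithm score
        "conservation": [0.8, 0.2, 0.6],   # conservation score
        "rarity":       [0.95, 0.1, 0.5],  # 1 - population allele frequency
    })

    # User-specified weights are folded into a single ranking score.
    weights = {"prediction": 0.5, "conservation": 0.2, "rarity": 0.3}
    variants["rank_score"] = sum(w * variants[c] for c, w in weights.items())
    print(variants.sort_values("rank_score", ascending=False))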
Sketching Algorithms for Sparse Dictionary Learning: PTAS and Turnstile Streaming
Sketching algorithms have recently proven to be a powerful approach both for
designing low-space streaming algorithms as well as fast polynomial time
approximation schemes (PTAS). In this work, we develop new techniques to extend
the applicability of sketching-based approaches to the sparse dictionary
learning and the Euclidean $k$-means clustering problems. In particular, we
initiate the study of the challenging setting where the dictionary/clustering
assignment for each of the input points must be output, which has
surprisingly received little attention in prior work. On the fast algorithms
front, we obtain a new approach for designing PTAS's for the $k$-means
clustering problem, which generalizes to the first PTAS for the sparse
dictionary learning problem. On the streaming algorithms front, we obtain new
upper bounds and lower bounds for dictionary learning and $k$-means clustering.
In particular, given a design matrix in a turnstile stream, we show a space
upper bound for sparse dictionary learning, a space upper bound for $k$-means
clustering, as well as a space upper bound for $k$-means clustering on random
order row insertion streams with a natural "bounded sensitivity" assumption. On
the lower bounds side, we obtain a general lower bound for $k$-means clustering,
as well as a lower bound for algorithms which can estimate the cost of a single
fixed set of candidate centers.
Comment: To appear in NeurIPS 202
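For intuition only, here is a generic sketch-and-solve illustration for $k$-means with explicit per-point assignments (a random Gaussian sketch plus an off-the-shelf solver; this is not the paper's streaming or PTAS construction):

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    n, d, k, sketch_dim = 1000, 500, 10, 50
    X = rng.standard_normal((n, d))

    # Compress the feature dimension with a random sketch, then cluster.
    S = rng.standard_normal((d, sketch_dim)) / np.sqrt(sketch_dim)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X @ S)

    # The per-point assignments are exactly the output the abstract asks for.
    print(labels[:10])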
Randomized Extended Kaczmarz for Solving Least-Squares
We present a randomized iterative algorithm that exponentially converges in
expectation to the minimum Euclidean norm least squares solution of a given
linear system of equations. The expected number of arithmetic operations
required to obtain an estimate of given accuracy is proportional to the squared
condition number of the system multiplied by the number of non-zero entries of
the input matrix. The proposed algorithm is an extension of the randomized
Kaczmarz method that was analyzed by Strohmer and Vershynin.
Comment: 19 Pages, 5 figures; code is available at
https://github.com/zouzias/RE
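A minimal dense-matrix sketch of the iteration, following the usual description of randomized extended Kaczmarz (the fixed iteration count and the interleaving of the two steps are simplifying assumptions):

    import numpy as np

    def randomized_extended_kaczmarz(A, b, iters=20000, rng=None):
        # Converges in expectation to the minimum-norm least-squares solution.
        rng = np.random.default_rng(rng)
        m, n = A.shape
        row_norms = np.sum(A ** 2, axis=1)
        col_norms = np.sum(A ** 2, axis=0)
        p_row, p_col = row_norms / row_norms.sum(), col_norms / col_norms.sum()
        x, z = np.zeros(n), b.astype(float).copy()
        for _ in range(iters):
            # Column step: drive z toward the part of b outside the range of A.
            j = rng.choice(n, p=p_col)
            z -= (A[:, j] @ z / col_norms[j]) * A[:, j]
            # Row step: Kaczmarz projection for the system A x = b - z.
            i = rng.choice(m, p=p_row)
            x += ((b[i] - z[i] - A[i] @ x) / row_norms[i]) * A[i]
        return x

    A = np.random.default_rng(0).standard_normal((300, 50))
    b = np.random.default_rng(1).standard_normal(300)
    x = randomized_extended_kaczmarz(A, b, rng=2)
    print(np.linalg.norm(x - np.linalg.lstsq(A, b, rcond=None)[0]))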
Fast approximation of matrix coherence and statistical leverage
The statistical leverage scores of a matrix $A$ are the squared row-norms of
the matrix containing its (top) left singular vectors and the coherence is the
largest leverage score. These quantities are of interest in recently-popular
problems such as matrix completion and Nystr\"{o}m-based low-rank matrix
approximation as well as in large-scale statistical data analysis applications
more generally; moreover, they are of interest since they define the key
structural nonuniformity that must be dealt with in developing fast randomized
matrix algorithms. Our main result is a randomized algorithm that takes as
input an arbitrary $n \times d$ matrix $A$, with $n \gg d$, and that returns as
output relative-error approximations to all $n$ of the statistical leverage
scores. The proposed algorithm runs (under assumptions on the precise values of
$n$ and $d$) in $O(nd \log n)$ time, as opposed to the $O(nd^2)$ time required
by the na\"{i}ve algorithm that involves computing an orthogonal basis for the
range of $A$. Our analysis may be viewed in terms of computing a relative-error
approximation to an underconstrained least-squares approximation problem, or,
relatedly, it may be viewed as an application of Johnson-Lindenstrauss type
ideas. Several practically-important extensions of our basic result are also
described, including the approximation of so-called cross-leverage scores, the
extension of these ideas to matrices with $n \approx d$, and the extension to
streaming environments.
Comment: 29 pages; conference version is in ICML; journal version is in JML
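The quantities themselves are easy to state in code; below is an exact computation next to a toy sketched variant (a dense Gaussian sketch stands in for the fast transforms, and the final product is not formed in the stated running time, so this only conveys the idea):

    import numpy as np

    def leverage_scores_exact(A):
        # Squared row norms of an orthonormal basis for the range of A.
        Q, _ = np.linalg.qr(A)
        return np.sum(Q ** 2, axis=1)

    def leverage_scores_sketched(A, sketch_rows=None, rng=None):
        # Orthogonalize against a sketch of A instead of A itself.
        rng = np.random.default_rng(rng)
        n, d = A.shape
        s = sketch_rows or 4 * d
        S = rng.standard_normal((s, n)) / np.sqrt(s)   # stand-in for a fast transform
        _, R = np.linalg.qr(S @ A)                     # captures the column geometry of A
        B = A @ np.linalg.inv(R)                       # approximately orthonormal columns
        return np.sum(B ** 2, axis=1)

    A = np.random.default_rng(0).standard_normal((2000, 20))
    exact, approx = leverage_scores_exact(A), leverage_scores_sketched(A, rng=1)
    print(exact.max(), approx.max())   # the largest score is the coherence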
Solving -means on High-dimensional Big Data
In recent years, there have been major efforts to develop data stream
algorithms that process inputs in one pass over the data with little memory
requirement. For the $k$-means problem, this has led to the development of
several $(1+\epsilon)$-approximations (under the assumption that $k$ is a
constant), but also to the design of algorithms that are extremely fast in
practice and compute solutions of high accuracy. However, when not only the
length of the stream but also the dimensionality of the input points is high,
current methods reach their limits.
We propose two algorithms, piecy and piecy-mr, that are based on the recently
developed data stream algorithm BICO and can process high dimensional data in
one pass and output a solution of high quality. While piecy is suited for high
dimensional data with a medium number of points, piecy-mr is meant for high
dimensional data that comes in a very long stream. We provide an extensive
experimental study to evaluate piecy and piecy-mr that shows the strength of
the new algorithms.
Comment: 23 pages, 9 figures, published at the 14th International Symposium on
Experimental Algorithms - SEA 201
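A toy two-level scheme conveys the flavor of piecing up a long, high-dimensional stream (it uses plain weighted k-means on per-chunk summaries rather than BICO, so it is an illustrative stand-in for piecy/piecy-mr, not their implementation):

    import numpy as np
    from sklearn.cluster import KMeans

    def chunked_kmeans(stream_chunks, k, per_chunk=50, seed=0):
        # Summarize each chunk by weighted centers, then cluster the summaries.
        centers, weights = [], []
        for chunk in stream_chunks:
            km = KMeans(n_clusters=min(per_chunk, len(chunk)), n_init=3,
                        random_state=seed).fit(chunk)
            centers.append(km.cluster_centers_)
            weights.append(np.bincount(km.labels_, minlength=km.n_clusters))
        C, w = np.vstack(centers), np.concatenate(weights)
        final = KMeans(n_clusters=k, n_init=10, random_state=seed)
        final.fit(C, sample_weight=w)        # weighted k-means on the summaries
        return final.cluster_centers_

    rng = np.random.default_rng(0)
    chunks = [rng.standard_normal((1000, 100)) for _ in range(5)]
    print(chunked_kmeans(chunks, k=10).shape)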
Near Optimal Linear Algebra in the Online and Sliding Window Models
We initiate the study of numerical linear algebra in the sliding window
model, where only the most recent $W$ updates in a stream form the underlying
data set. We first introduce a unified row-sampling based framework that gives
randomized algorithms for spectral approximation, low-rank
approximation/projection-cost preservation, and $\ell_1$-subspace embeddings in
the sliding window model, which often use nearly optimal space and achieve
nearly input sparsity runtime. Our algorithms are based on "reverse online"
versions of offline sampling distributions such as (ridge) leverage scores,
sensitivities, and Lewis weights to quantify both the importance and
the recency of a row. Our row-sampling framework rather surprisingly implies
connections to the well-studied online model; our structural results also give
the first sample optimal (up to lower order terms) online algorithm for
low-rank approximation/projection-cost preservation. Using this powerful
primitive, we give online algorithms for column/row subset selection and
principal component analysis that resolve the main open question of Bhaskara
et al. (FOCS 2019). We also give the first online algorithm for
$\ell_1$-subspace embeddings. We further formalize the connection between the
online model and the sliding window model by introducing an additional unified
framework for deterministic algorithms using a merge and reduce paradigm and
the concept of online coresets. Our sampling based algorithms in the
row-arrival online model yield online coresets, giving deterministic algorithms
for spectral approximation, low-rank approximation/projection-cost
preservation, and $\ell_1$-subspace embeddings in the sliding window model that
use nearly optimal space.
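One building block behind such row-sampling frameworks is online (ridge) leverage score sampling, sketched below (the ridge term, the oversampling constant, and the choice to score each row against only the rows kept so far are illustrative assumptions, not the paper's exact procedure):

    import numpy as np

    def online_leverage_sample(rows, d, c=2.0, lam=1e-3, rng=None):
        # Keep a reweighted subset of rows as they arrive: one pass, no revisits.
        rng = np.random.default_rng(rng)
        M = lam * np.eye(d)                  # regularized covariance of kept rows
        kept = []
        for a in rows:
            tau = float(a @ np.linalg.solve(M, a))   # online ridge leverage score
            p = min(1.0, c * tau)                    # c is an oversampling constant
            if rng.random() < p:
                kept.append(a / np.sqrt(p))          # rescale to keep A^T A unbiased
                M += np.outer(a, a)                  # only kept rows update the score matrix
        return np.array(kept)

    A = np.random.default_rng(1).standard_normal((5000, 20))
    B = online_leverage_sample(A, d=20, rng=1)
    print(A.shape, B.shape)    # B^T B spectrally approximates A^T A with far fewer rows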
On landmark selection and sampling in high-dimensional data analysis
In recent years, the spectral analysis of appropriately defined kernel
matrices has emerged as a principled way to extract the low-dimensional
structure often prevalent in high-dimensional data. Here we provide an
introduction to spectral methods for linear and nonlinear dimension reduction,
emphasizing ways to overcome the computational limitations currently faced by
practitioners with massive datasets. In particular, a data subsampling or
landmark selection process is often employed to construct a kernel based on
partial information, followed by an approximate spectral analysis termed the
Nystr\"{o}m extension. We provide a quantitative framework to analyse this
procedure, and use it to demonstrate algorithmic performance bounds on a range
of practical approaches designed to optimize the landmark selection process. We
compare the practical implications of these bounds by way of real-world
examples drawn from the field of computer vision, whereby low-dimensional
manifold structure is shown to emerge from high-dimensional video data streams.
Comment: 18 pages, 6 figures, submitted for publication
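A minimal sketch of the Nystr\"{o}m extension itself, with uniform landmark selection as the simplest baseline (the RBF kernel and parameter choices are illustrative; the bounds discussed above concern smarter landmark-selection schemes):

    import numpy as np

    def rbf_kernel(X, Y, gamma=0.5):
        sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)

    def nystrom_approximation(X, n_landmarks=50, gamma=0.5, rng=None):
        # Approximate the full kernel matrix from a few landmark points.
        rng = np.random.default_rng(rng)
        idx = rng.choice(len(X), size=n_landmarks, replace=False)
        C = rbf_kernel(X, X[idx], gamma)     # n x m cross-kernel
        W = C[idx]                           # m x m kernel among the landmarks
        return C @ np.linalg.pinv(W) @ C.T   # K is approximated by C W^+ C^T

    X = np.random.default_rng(0).standard_normal((500, 10))
    K_approx = nystrom_approximation(X, n_landmarks=50, rng=0)
    print(K_approx.shape)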