Block CUR: Decomposing Matrices using Groups of Columns
A common problem in large-scale data analysis is to approximate a matrix using a combination of specifically sampled rows and columns, known as a CUR decomposition. Unfortunately, in many real-world environments, the ability to sample specific individual rows or columns of the matrix is limited by either system constraints or cost. In this paper, we consider matrix approximation by sampling predefined \emph{blocks} of columns (or rows) from the matrix. We present an algorithm for sampling useful column blocks and provide novel guarantees for the quality of the approximation. This algorithm has applications in problems ranging from biometric data analysis to distributed computing. We demonstrate the effectiveness of the proposed algorithms for computing the Block CUR decomposition of large matrices in a distributed setting with multiple nodes in a compute cluster, where such blocks correspond to columns (or rows) of the matrix stored on the same node, which can be retrieved with much less overhead than retrieving individual columns stored across different nodes. In the biometric setting, the rows correspond to different users and the columns correspond to users' biometric reactions to external stimuli, e.g., watching video content, at a particular time instant. There is significant cost in acquiring each user's reaction to lengthy content, so we sample a few important scenes to approximate the biometric response. An individual time sample in this use case cannot be queried in isolation due to the lack of context that caused that biometric reaction. Instead, collections of time segments (i.e., blocks) must be presented to the user. The practical application of these algorithms is shown via experimental results using real-world user biometric data from a content testing environment.
Comment: shorter version to appear in ECML-PKDD 201
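A minimal sketch of the block-sampling idea, in Python: predefined column blocks are drawn with probability proportional to their squared Frobenius norm (one plausible importance measure; the paper's actual sampling distribution and guarantees are not reproduced here), and the matrix is approximated by projecting onto the sampled blocks.

import numpy as np

def block_column_approx(A, block_size, n_blocks, seed=0):
    # Partition the columns into predefined consecutive blocks.
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    starts = np.arange(0, n, block_size)
    # Sample blocks with probability proportional to squared Frobenius norm.
    w = np.array([np.linalg.norm(A[:, s:s + block_size], 'fro') ** 2
                  for s in starts])
    chosen = rng.choice(len(starts), size=n_blocks, replace=False,
                        p=w / w.sum())
    cols = np.concatenate([np.arange(s, min(s + block_size, n))
                           for s in starts[chosen]])
    C = A[:, cols]                       # sampled column blocks
    return C @ (np.linalg.pinv(C) @ A)   # project A onto their span

A = np.random.default_rng(1).standard_normal((100, 60))
A_hat = block_column_approx(A, block_size=10, n_blocks=3)
print(np.linalg.norm(A - A_hat, 'fro') / np.linalg.norm(A, 'fro'))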
Optimal CUR Matrix Decompositions
The CUR decomposition of an $m \times n$ matrix $A$ finds an $m \times c$ matrix $C$ with a subset of $c < n$ columns of $A$, together with an $r \times n$ matrix $R$ with a subset of $r < m$ rows of $A$, as well as a $c \times r$ low-rank matrix $U$ such that the matrix $CUR$ approximates the matrix $A$; that is, $\|A - CUR\|_F^2 \le (1+\varepsilon)\|A - A_k\|_F^2$, where $\|\cdot\|_F$ denotes the Frobenius norm and $A_k$ is the best matrix of rank $k$ constructed via the SVD. We present input-sparsity-time and deterministic algorithms for constructing such a CUR decomposition with $c = O(k/\varepsilon)$, $r = O(k/\varepsilon)$, and $\mathrm{rank}(U) = k$. Up to constant factors, our algorithms are simultaneously optimal in $c$, $r$, and $\mathrm{rank}(U)$.
Comment: small revision in lemma 4.
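One useful fact behind CUR constructions: once $C$ and $R$ are fixed, the middle factor minimizing the Frobenius error is $U = C^+ A R^+$. The sketch below checks this numerically, using uniform random selection purely as a stand-in for the paper's careful column and row sampling.

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 8)) @ rng.standard_normal((8, 40))  # near rank 8

cols = rng.choice(40, size=12, replace=False)  # stand-ins for careful selection
rows = rng.choice(50, size=12, replace=False)
C, R = A[:, cols], A[rows, :]

U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)  # optimal middle factor
err = np.linalg.norm(A - C @ U @ R, 'fro') ** 2

k = 8
s = np.linalg.svd(A, compute_uv=False)
best_rank_k = (s[k:] ** 2).sum()               # ||A - A_k||_F^2 via the SVD
print(err, best_rank_k)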
Variant Ranker: a web-tool to rank genomic data according to functional significance
BACKGROUND: The increasing volume and complexity of high-throughput genomic data make the analysis and prioritization of variants difficult for researchers with limited bioinformatics skills. Variant Ranker allows researchers to rank identified variants and determine the most confident variants for experimental validation.
RESULTS: We describe Variant Ranker, a user-friendly, simple web-based tool for the ranking, filtering and annotation of coding and non-coding variants. Variant Ranker facilitates the identification of causal variants based on novelty, effect and annotation information. The algorithm implements and aggregates multiple prediction-algorithm scores, conservation scores, allelic frequencies, clinical information and additional open-source annotations using accessible databases via ANNOVAR. The available information for a variant is transformed into user-specified weights, which are in turn encoded into the ranking algorithm. Through its different modules, users can (i) rank a list of variants, (ii) perform genotype filtering for case-control samples, (iii) filter large amounts of high-throughput data based on custom filter requirements and apply different models of inheritance, and (iv) perform downstream functional enrichment analysis through network visualization. Using networks, users can identify clusters of genes that belong to multiple ontology categories (such as pathways, gene ontology and disease categories) and thereby expedite scientific discoveries. We demonstrate the utility of Variant Ranker in identifying causal genes using real and synthetic datasets. Our results indicate that Variant Ranker exhibits excellent performance by correctly identifying and ranking the candidate genes.
CONCLUSIONS: Variant Ranker is a freely available web server at http://paschou-lab.mbg.duth.gr/Software.html . This tool will enable users to prioritise potentially causal variants and is applicable to a wide range of sequencing data.
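A toy version of the weighted aggregation described above, with hypothetical annotation columns and weights (the tool's actual normalization and scoring rules are not spelled out in the abstract):

import numpy as np

def rank_variants(scores, weights):
    # scores: (n_variants, n_annotations), e.g. prediction-algorithm scores,
    # conservation scores, rarity (1 - allele frequency).
    # weights: user-specified importance of each annotation.
    s = (scores - scores.min(axis=0)) / (np.ptp(scores, axis=0) + 1e-12)
    total = s @ np.asarray(weights, dtype=float)
    return np.argsort(-total)               # indices, most confident first

scores = np.array([[0.9, 0.8, 0.99],        # variant A
                   [0.2, 0.9, 0.50],        # variant B
                   [0.7, 0.1, 0.95]])       # variant C
print(rank_variants(scores, weights=[2.0, 1.0, 1.0]))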
Solving k-means on High-dimensional Big Data
In recent years, there have been major efforts to develop data stream algorithms that process inputs in one pass over the data with little memory. For the $k$-means problem, this has led to the development of several $(1+\varepsilon)$-approximations (under the assumption that $k$ is a constant), but also to the design of algorithms that are extremely fast in practice and compute solutions of high accuracy. However, when not only the length of the stream but also the dimensionality of the input points is high, current methods reach their limits.
We propose two algorithms, piecy and piecy-mr, that build on the recently developed data stream algorithm BICO and can process high-dimensional data in one pass, outputting a solution of high quality. While piecy is suited for high-dimensional data with a medium number of points, piecy-mr is meant for high-dimensional data that comes in a very long stream. We provide an extensive experimental study evaluating piecy and piecy-mr that shows the strength of the new algorithms.
Comment: 23 pages, 9 figures, published at the 14th International Symposium on Experimental Algorithms - SEA 201
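A rough one-pass scheme in the same spirit, assuming each chunk of the stream is summarized by weighted centroids that are clustered again at the end; this is an illustrative merge-style stand-in, not the actual piecy/piecy-mr or BICO pipelines.

import numpy as np
from sklearn.cluster import KMeans

def chunked_kmeans(stream_chunks, k, summary_size=100):
    # Summarize each chunk by weighted centroids, then cluster the summary.
    points, weights = [], []
    for chunk in stream_chunks:
        m = min(summary_size, len(chunk))
        km = KMeans(n_clusters=m, n_init=1).fit(chunk)
        points.append(km.cluster_centers_)
        weights.append(np.bincount(km.labels_, minlength=m))
    P, w = np.vstack(points), np.concatenate(weights)
    return KMeans(n_clusters=k, n_init=10).fit(P, sample_weight=w).cluster_centers_

rng = np.random.default_rng(0)
chunks = [rng.standard_normal((1000, 50)) + rng.integers(0, 5)
          for _ in range(10)]
print(chunked_kmeans(chunks, k=5).shape)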
Randomized Extended Kaczmarz for Solving Least-Squares
We present a randomized iterative algorithm that converges exponentially in expectation to the minimum Euclidean-norm least-squares solution of a given linear system of equations. The expected number of arithmetic operations required to obtain an estimate of given accuracy is proportional to the squared condition number of the system multiplied by the number of non-zero entries of the input matrix. The proposed algorithm is an extension of the randomized Kaczmarz method that was analyzed by Strohmer and Vershynin.
Comment: 19 pages, 5 figures; code is available at https://github.com/zouzias/RE
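A compact NumPy sketch of the randomized extended Kaczmarz iteration, with rows and columns sampled proportionally to their squared Euclidean norms; the fixed iteration count stands in for a proper stopping rule.

import numpy as np

def rek(A, b, iters=50_000, seed=0):
    rng = np.random.default_rng(seed)
    m, n = A.shape
    row_p = (A ** 2).sum(axis=1); row_p /= row_p.sum()
    col_p = (A ** 2).sum(axis=0); col_p /= col_p.sum()
    x, z = np.zeros(n), b.astype(float).copy()
    for _ in range(iters):
        j = rng.choice(n, p=col_p)       # column step: drive z toward the
        c = A[:, j]                      # component of b outside range(A)
        z -= (c @ z) / (c @ c) * c
        i = rng.choice(m, p=row_p)       # row step: Kaczmarz update toward
        r = A[i]                         # the consistent system Ax = b - z
        x += (b[i] - z[i] - r @ x) / (r @ r) * r
    return x

A = np.random.default_rng(1).standard_normal((30, 10))
b = np.random.default_rng(2).standard_normal(30)
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.linalg.norm(rek(A, b) - x_ls))  # should be small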
A Sparse Stress Model
Force-directed layout methods constitute the most common approach to draw
general graphs. Among them, stress minimization produces layouts of
comparatively high quality but also imposes comparatively high computational
demands. We propose a speed-up method based on the aggregation of terms in the objective function. It is akin to aggregating repulsion from far-away nodes during spring embedding but transfers the idea from the layout space into a
preprocessing phase. An initial experimental study informs a method to select
representatives, and subsequent more extensive experiments indicate that our
method yields better approximations of minimum-stress layouts in less time than related methods.
Comment: Appears in the Proceedings of the 24th International Symposium on Graph Drawing and Network Visualization (GD 2016)
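To make the objective concrete: stress minimization seeks positions minimizing $\sum_{i<j} w_{ij}(\|x_i - x_j\| - d_{ij})^2$ with weights $w_{ij} = d_{ij}^{-2}$. The toy variant below restricts each node's terms to a random set of representatives, a crude stand-in for the paper's representative selection.

import numpy as np

def sparse_stress_layout(D, n_pivots=20, iters=200, lr=0.01, seed=0):
    # D: matrix of target graph-theoretic distances.
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    pivots = rng.choice(n, size=min(n_pivots, n), replace=False)
    X = rng.standard_normal((n, 2))
    for _ in range(iters):
        for p in pivots:
            delta = X - X[p]                        # offsets from pivot p
            dist = np.linalg.norm(delta, axis=1)
            dist[p] = 1.0                           # avoid division by zero
            w = 1.0 / np.maximum(D[:, p], 1e-9) ** 2
            coef = w * (dist - D[:, p]) / dist      # gradient coefficients
            coef[p] = 0.0
            X -= lr * coef[:, None] * delta         # gradient step on stress
    return X

pts = np.random.default_rng(1).standard_normal((100, 2))
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
print(sparse_stress_layout(D).shape)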
On landmark selection and sampling in high-dimensional data analysis
In recent years, the spectral analysis of appropriately defined kernel
matrices has emerged as a principled way to extract the low-dimensional
structure often prevalent in high-dimensional data. Here we provide an
introduction to spectral methods for linear and nonlinear dimension reduction,
emphasizing ways to overcome the computational limitations currently faced by
practitioners with massive datasets. In particular, a data subsampling or
landmark selection process is often employed to construct a kernel based on
partial information, followed by an approximate spectral analysis termed the
Nyström extension. We provide a quantitative framework to analyse this
procedure, and use it to demonstrate algorithmic performance bounds on a range
of practical approaches designed to optimize the landmark selection process. We
compare the practical implications of these bounds by way of real-world
examples drawn from the field of computer vision, whereby low-dimensional
manifold structure is shown to emerge from high-dimensional video data streams.
Comment: 18 pages, 6 figures, submitted for publication
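A minimal sketch of the Nyström extension for an RBF kernel, taking the landmark set as given (choosing it well is precisely what the paper analyses); the kernel, bandwidth and eigenvalue scaling here are illustrative choices.

import numpy as np

def nystrom(X, landmarks, gamma=0.1, k=5):
    # Cross-kernel C between all points and landmarks; W among landmarks.
    def rbf(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)
    C = rbf(X, X[landmarks])                 # n x m
    W = C[landmarks]                         # m x m
    vals, vecs = np.linalg.eigh(W)           # eigendecompose the small matrix
    vals, vecs = vals[::-1][:k], vecs[:, ::-1][:, :k]
    U = C @ vecs / np.maximum(vals, 1e-12)   # extend eigenvectors to all points
    return U, vals * (len(X) / len(landmarks))

X = np.random.default_rng(0).standard_normal((500, 5))
landmarks = np.random.default_rng(1).choice(500, size=50, replace=False)
U, approx_eigvals = nystrom(X, landmarks)
print(U.shape, approx_eigvals)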
BlinkML: Efficient Maximum Likelihood Estimation with Probabilistic Guarantees
The rising volume of datasets has made training machine learning (ML) models
a major computational cost in the enterprise. Given the iterative nature of
model and parameter tuning, many analysts use a small sample of their entire
data during their initial stage of analysis to make quick decisions (e.g., what
features or hyperparameters to use) and use the entire dataset only in later
stages (i.e., when they have converged to a specific model). This sampling,
however, is performed in an ad-hoc fashion. Most practitioners cannot precisely
capture the effect of sampling on the quality of their model, and eventually on
their decision-making process during the tuning phase. Moreover, without
systematic support for sampling operators, many optimizations and reuse
opportunities are lost.
In this paper, we introduce BlinkML, a system for fast, quality-guaranteed ML
training. BlinkML allows users to make error-computation tradeoffs: instead of
training a model on their full data (i.e., full model), BlinkML can quickly
train an approximate model with quality guarantees using a sample. The quality
guarantees ensure that, with high probability, the approximate model makes the
same predictions as the full model. BlinkML currently supports any ML model
that relies on maximum likelihood estimation (MLE), which includes Generalized
Linear Models (e.g., linear regression, logistic regression, max entropy
classifier, Poisson regression) as well as PPCA (Probabilistic Principal
Component Analysis). Our experiments show that BlinkML can speed up the
training of large-scale ML tasks by 6.26x-629x while guaranteeing the same
predictions, with 95% probability, as the full model.
Comment: 22 pages, SIGMOD 201
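The core idea can be illustrated by comparing a model trained on a sample against one trained on the full data, here with scikit-learn logistic regression on synthetic data; this demonstrates only the sampled-training tradeoff, not BlinkML's guarantee machinery (which exploits the asymptotic normality of maximum likelihood estimators).

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((200_000, 20))
y = (X @ rng.standard_normal(20) + 0.5 * rng.standard_normal(200_000)) > 0

idx = rng.choice(len(X), size=5_000, replace=False)        # small sample
approx = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
full = LogisticRegression(max_iter=1000).fit(X, y)         # full model

agree = (approx.predict(X) == full.predict(X)).mean()
print(f"sample model agrees with full model on {agree:.1%} of points")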
A Matrix Hyperbolic Cosine Algorithm and Applications
In this paper, we generalize Spencer's hyperbolic cosine algorithm to the matrix-valued setting. We apply the proposed algorithm to several problems by analyzing its computational efficiency under two special cases of matrices: one in which the matrices have a group structure, and another in which they have rank one. As an application of the former case, we present a deterministic algorithm that, given the multiplication table of a finite group of size $n$, constructs an expanding Cayley graph of logarithmic degree in near-optimal $O(n^2 \log^3 n)$ time. For the latter case, we present a fast deterministic algorithm for spectral sparsification of positive semi-definite matrices, which implies an improved deterministic algorithm for spectral graph sparsification of dense graphs. In addition, we give an elementary connection between spectral sparsification of positive semi-definite matrices and element-wise matrix sparsification. As a consequence, we obtain improved element-wise sparsification algorithms for diagonally dominant-like matrices.
Comment: 16 pages, simplified proof and corrected acknowledgement of prior work in (current) Section
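For intuition, Spencer's scalar potential lifts to matrices as $\mathrm{tr}\cosh(M) = \mathrm{tr}(e^{M} + e^{-M})/2$. The toy sign-balancing below greedily keeps this potential small for a running sum of symmetric matrices; the paper's actual algorithms and applications go well beyond this sketch.

import numpy as np
from scipy.linalg import expm

def trace_cosh(M):
    # Matrix hyperbolic cosine potential: tr(exp(M) + exp(-M)) / 2.
    return np.trace(expm(M) + expm(-M)) / 2.0

def balance_signs(mats):
    # Pick each sign so the potential of the running signed sum stays small.
    S = np.zeros_like(mats[0])
    signs = []
    for A in mats:
        s = 1 if trace_cosh(S + A) <= trace_cosh(S - A) else -1
        S += s * A
        signs.append(s)
    return signs, S

rng = np.random.default_rng(0)
mats = [np.diag(rng.choice([-1.0, 1.0], size=4)) for _ in range(50)]
signs, S = balance_signs(mats)
print(np.linalg.norm(S, 2))   # spectral norm of the signed sum stays small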
