199 research outputs found

    Block CUR: Decomposing Matrices using Groups of Columns

    Full text link
    A common problem in large-scale data analysis is to approximate a matrix using a combination of specifically sampled rows and columns, known as CUR decomposition. Unfortunately, in many real-world environments, the ability to sample specific individual rows or columns of the matrix is limited by either system constraints or cost. In this paper, we consider matrix approximation by sampling predefined \emph{blocks} of columns (or rows) from the matrix. We present an algorithm for sampling useful column blocks and provide novel guarantees for the quality of the approximation. This algorithm has application in problems as diverse as biometric data analysis to distributed computing. We demonstrate the effectiveness of the proposed algorithms for computing the Block CUR decomposition of large matrices in a distributed setting with multiple nodes in a compute cluster, where such blocks correspond to columns (or rows) of the matrix stored on the same node, which can be retrieved with much less overhead than retrieving individual columns stored across different nodes. In the biometric setting, the rows correspond to different users and columns correspond to users' biometric reaction to external stimuli, {\em e.g.,}~watching video content, at a particular time instant. There is significant cost in acquiring each user's reaction to lengthy content so we sample a few important scenes to approximate the biometric response. An individual time sample in this use case cannot be queried in isolation due to the lack of context that caused that biometric reaction. Instead, collections of time segments ({\em i.e.,} blocks) must be presented to the user. The practical application of these algorithms is shown via experimental results using real-world user biometric data from a content testing environment.Comment: shorter version to appear in ECML-PKDD 201

    Optimal CUR Matrix Decompositions

    Full text link
    The CUR decomposition of an m×nm \times n matrix AA finds an m×cm \times c matrix CC with a subset of c<nc < n columns of A,A, together with an r×nr \times n matrix RR with a subset of r<mr < m rows of A,A, as well as a c×rc \times r low-rank matrix UU such that the matrix CURC U R approximates the matrix A,A, that is, ACURF2(1+ϵ)AAkF2 || A - CUR ||_F^2 \le (1+\epsilon) || A - A_k||_F^2, where .F||.||_F denotes the Frobenius norm and AkA_k is the best m×nm \times n matrix of rank kk constructed via the SVD. We present input-sparsity-time and deterministic algorithms for constructing such a CUR decomposition where c=O(k/ϵ)c=O(k/\epsilon) and r=O(k/ϵ)r=O(k/\epsilon) and rank(U)=k(U) = k. Up to constant factors, our algorithms are simultaneously optimal in c,r,c, r, and rank(U)(U).Comment: small revision in lemma 4.

    Variant Ranker: a web-tool to rank genomic data according to functional significance

    Get PDF
    BACKGROUND: The increasing volume and complexity of high-throughput genomic data make analysis and prioritization of variants difficult for researchers with limited bioinformatics skills. Variant Ranker allows researchers to rank identified variants and determine the most confident variants for experimental validation. RESULTS: We describe Variant Ranker, a user-friendly simple web-based tool for ranking, filtering and annotation of coding and non-coding variants. Variant Ranker facilitates the identification of causal variants based on novelty, effect and annotation information. The algorithm implements and aggregates multiple prediction algorithm scores, conservation scores, allelic frequencies, clinical information and additional open-source annotations using accessible databases via ANNOVAR. The available information for a variant is transformed into user-specified weights, which are in turn encoded into the ranking algorithm. Through its different modules, users can (i) rank a list of variants (ii) perform genotype filtering for case-control samples (iii) filter large amounts of high-throughput data based on user custom filter requirements and apply different models of inheritance (iv) perform downstream functional enrichment analysis through network visualization. Using networks, users can identify clusters of genes that belong to multiple ontology categories (like pathways, gene ontology, disease categories) and therefore expedite scientific discoveries. We demonstrate the utility of Variant Ranker to identify causal genes using real and synthetic datasets. Our results indicate that Variant Ranker exhibits excellent performance by correctly identifying and ranking the candidate genes CONCLUSIONS: Variant Ranker is a freely available web server on http://paschou-lab.mbg.duth.gr/Software.html . This tool will enable users to prioritise potentially causal variants and is applicable to a wide range of sequencing data

    Solving kk-means on High-dimensional Big Data

    Full text link
    In recent years, there have been major efforts to develop data stream algorithms that process inputs in one pass over the data with little memory requirement. For the kk-means problem, this has led to the development of several (1+ε)(1+\varepsilon)-approximations (under the assumption that kk is a constant), but also to the design of algorithms that are extremely fast in practice and compute solutions of high accuracy. However, when not only the length of the stream is high but also the dimensionality of the input points, then current methods reach their limits. We propose two algorithms, piecy and piecy-mr that are based on the recently developed data stream algorithm BICO that can process high dimensional data in one pass and output a solution of high quality. While piecy is suited for high dimensional data with a medium number of points, piecy-mr is meant for high dimensional data that comes in a very long stream. We provide an extensive experimental study to evaluate piecy and piecy-mr that shows the strength of the new algorithms.Comment: 23 pages, 9 figures, published at the 14th International Symposium on Experimental Algorithms - SEA 201

    Randomized Extended Kaczmarz for Solving Least-Squares

    Full text link
    We present a randomized iterative algorithm that exponentially converges in expectation to the minimum Euclidean norm least squares solution of a given linear system of equations. The expected number of arithmetic operations required to obtain an estimate of given accuracy is proportional to the square condition number of the system multiplied by the number of non-zeros entries of the input matrix. The proposed algorithm is an extension of the randomized Kaczmarz method that was analyzed by Strohmer and Vershynin.Comment: 19 Pages, 5 figures; code is available at https://github.com/zouzias/RE

    A Sparse Stress Model

    Full text link
    Force-directed layout methods constitute the most common approach to draw general graphs. Among them, stress minimization produces layouts of comparatively high quality but also imposes comparatively high computational demands. We propose a speed-up method based on the aggregation of terms in the objective function. It is akin to aggregate repulsion from far-away nodes during spring embedding but transfers the idea from the layout space into a preprocessing phase. An initial experimental study informs a method to select representatives, and subsequent more extensive experiments indicate that our method yields better approximations of minimum-stress layouts in less time than related methods.Comment: Appears in the Proceedings of the 24th International Symposium on Graph Drawing and Network Visualization (GD 2016

    On landmark selection and sampling in high-dimensional data analysis

    Full text link
    In recent years, the spectral analysis of appropriately defined kernel matrices has emerged as a principled way to extract the low-dimensional structure often prevalent in high-dimensional data. Here we provide an introduction to spectral methods for linear and nonlinear dimension reduction, emphasizing ways to overcome the computational limitations currently faced by practitioners with massive datasets. In particular, a data subsampling or landmark selection process is often employed to construct a kernel based on partial information, followed by an approximate spectral analysis termed the Nystrom extension. We provide a quantitative framework to analyse this procedure, and use it to demonstrate algorithmic performance bounds on a range of practical approaches designed to optimize the landmark selection process. We compare the practical implications of these bounds by way of real-world examples drawn from the field of computer vision, whereby low-dimensional manifold structure is shown to emerge from high-dimensional video data streams.Comment: 18 pages, 6 figures, submitted for publicatio

    BlinkML: Efficient Maximum Likelihood Estimation with Probabilistic Guarantees

    Full text link
    The rising volume of datasets has made training machine learning (ML) models a major computational cost in the enterprise. Given the iterative nature of model and parameter tuning, many analysts use a small sample of their entire data during their initial stage of analysis to make quick decisions (e.g., what features or hyperparameters to use) and use the entire dataset only in later stages (i.e., when they have converged to a specific model). This sampling, however, is performed in an ad-hoc fashion. Most practitioners cannot precisely capture the effect of sampling on the quality of their model, and eventually on their decision-making process during the tuning phase. Moreover, without systematic support for sampling operators, many optimizations and reuse opportunities are lost. In this paper, we introduce BlinkML, a system for fast, quality-guaranteed ML training. BlinkML allows users to make error-computation tradeoffs: instead of training a model on their full data (i.e., full model), BlinkML can quickly train an approximate model with quality guarantees using a sample. The quality guarantees ensure that, with high probability, the approximate model makes the same predictions as the full model. BlinkML currently supports any ML model that relies on maximum likelihood estimation (MLE), which includes Generalized Linear Models (e.g., linear regression, logistic regression, max entropy classifier, Poisson regression) as well as PPCA (Probabilistic Principal Component Analysis). Our experiments show that BlinkML can speed up the training of large-scale ML tasks by 6.26x-629x while guaranteeing the same predictions, with 95% probability, as the full model.Comment: 22 pages, SIGMOD 201

    A Matrix Hyperbolic Cosine Algorithm and Applications

    Full text link
    In this paper, we generalize Spencer's hyperbolic cosine algorithm to the matrix-valued setting. We apply the proposed algorithm to several problems by analyzing its computational efficiency under two special cases of matrices; one in which the matrices have a group structure and an other in which they have rank-one. As an application of the former case, we present a deterministic algorithm that, given the multiplication table of a finite group of size nn, it constructs an expanding Cayley graph of logarithmic degree in near-optimal O(n^2 log^3 n) time. For the latter case, we present a fast deterministic algorithm for spectral sparsification of positive semi-definite matrices, which implies an improved deterministic algorithm for spectral graph sparsification of dense graphs. In addition, we give an elementary connection between spectral sparsification of positive semi-definite matrices and element-wise matrix sparsification. As a consequence, we obtain improved element-wise sparsification algorithms for diagonally dominant-like matrices.Comment: 16 pages, simplified proof and corrected acknowledging of prior work in (current) Section
    corecore