49 research outputs found

    Fast approximation of matrix coherence and statistical leverage

    The statistical leverage scores of a matrix $A$ are the squared row-norms of the matrix containing its (top) left singular vectors, and the coherence is the largest leverage score. These quantities are of interest in recently popular problems such as matrix completion and Nyström-based low-rank matrix approximation, as well as in large-scale statistical data analysis applications more generally; moreover, they are of interest since they define the key structural nonuniformity that must be dealt with in developing fast randomized matrix algorithms. Our main result is a randomized algorithm that takes as input an arbitrary $n \times d$ matrix $A$, with $n \gg d$, and that returns as output relative-error approximations to all $n$ of the statistical leverage scores. The proposed algorithm runs (under assumptions on the precise values of $n$ and $d$) in $O(n d \log n)$ time, as opposed to the $O(nd^2)$ time required by the naïve algorithm that involves computing an orthogonal basis for the range of $A$. Our analysis may be viewed in terms of computing a relative-error approximation to an underconstrained least-squares approximation problem, or, relatedly, it may be viewed as an application of Johnson-Lindenstrauss type ideas. Several practically important extensions of our basic result are also described, including the approximation of so-called cross-leverage scores, the extension of these ideas to matrices with $n \approx d$, and the extension to streaming environments. Comment: 29 pages; conference version is in ICML; journal version is in JML
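To make the definition in this abstract concrete, here is a minimal sketch of the naïve exact computation it mentions: orthonormalise the columns of the matrix, then take squared row norms. This is not the paper's randomized algorithm. The function names `leverage_scores` and `coherence` are hypothetical, and the sketch assumes a full-column-rank matrix given as a list of rows, using modified Gram-Schmidt as one possible way to obtain an orthogonal basis.

```python
def leverage_scores(A):
    """Exact leverage scores of a full-column-rank n x d matrix (list of rows).

    Naive O(n d^2) baseline: build an orthonormal basis Q for the range of A,
    then return the squared Euclidean norm of each row of Q.
    """
    n, d = len(A), len(A[0])
    # Store the columns of A; they are orthonormalised one at a time below.
    Q = [[float(A[i][j]) for i in range(n)] for j in range(d)]
    for j in range(d):
        # Modified Gram-Schmidt: remove components along earlier basis vectors.
        for k in range(j):
            dot = sum(Q[k][i] * Q[j][i] for i in range(n))
            Q[j] = [Q[j][i] - dot * Q[k][i] for i in range(n)]
        norm = sum(x * x for x in Q[j]) ** 0.5
        if norm < 1e-12:
            raise ValueError("matrix is not full column rank")
        Q[j] = [x / norm for x in Q[j]]
    # Leverage score of row i = sum over basis vectors of the squared i-th entry.
    return [sum(Q[j][i] ** 2 for j in range(d)) for i in range(n)]


def coherence(A):
    """Coherence as defined in the abstract: the largest leverage score."""
    return max(leverage_scores(A))


if __name__ == "__main__":
    # Rows 0 and 1 span the whole column space; row 2 contributes nothing.
    print(leverage_scores([[1, 0], [0, 1], [0, 0]]))  # [1.0, 1.0, 0.0]
```

For a full-column-rank input the scores sum to $d$, which is a convenient sanity check on any approximation scheme.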

    Topics in Matrix Sampling Algorithms

    We study three fundamental problems of Linear Algebra, lying at the heart of various Machine Learning applications, namely: 1) "Low-rank Column-based Matrix Approximation". We are given a matrix A and a target rank k. The goal is to select a subset of columns of A and, by using only these columns, compute a rank k approximation to A that is as good as the rank k approximation that would have been obtained by using all the columns; 2) "Coreset Construction in Least-Squares Regression". We are given a matrix A and a vector b. Consider the (over-constrained) least-squares problem of minimizing ||Ax-b|| over all vectors x in D. The domain D represents the constraints on the solution and can be arbitrary. The goal is to select a subset of the rows of A and b and, by using only these rows, find a solution vector that is as good as the solution vector that would have been obtained by using all the rows; 3) "Feature Selection in K-means Clustering". We are given a set of points described with respect to a large number of features. The goal is to select a subset of the features and, by using only this subset, obtain a k-partition of the points that is as good as the partition that would have been obtained by using all the features. We present novel algorithms for all three problems mentioned above. Our results can be viewed as follow-up research to a line of work known as "Matrix Sampling Algorithms". [Frieze, Kannan, Vempala, 1998] presented the first such algorithm for the Low-rank Matrix Approximation problem. Since then, such algorithms have been developed for several other problems, e.g. Graph Sparsification and Linear Equation Solving. Our contributions to this line of research are: (i) improved algorithms for Low-rank Matrix Approximation and Regression; (ii) algorithms for a new problem domain (K-means Clustering). Comment: PhD Thesis, 150 page

    Understanding forest health with Remote sensing-Part II-A review of approaches and data models

    Stress in forest ecosystems (FES) occurs as a result of land-use intensification, disturbances, resource limitations or unsustainable management, causing changes in forest health (FH) at various scales from the local to the global scale. Reactions to such stress depend on the phylogeny of forest species or communities and the characteristics of their impacting drivers and processes. There are many approaches to monitor indicators of FH using in-situ forest inventory and experimental studies, but they are generally limited to sample points or small areas, as well as being time- and labour-inte

    Extending the definition of modularity to directed graphs with overlapping communities

    Complex network topologies present interesting and surprising properties, such as community structures, which can be exploited to optimize communication, to find new efficient and context-aware routing algorithms, or simply to understand the dynamics and meaning of relationships among nodes. Complex networks are gaining more and more importance as a reference model and are a powerful interpretation tool for many different kinds of natural, biological and social networks, where directed relationships and the contextual belonging of nodes to many different communities are a matter of fact. This paper starts from the definition of the modularity function, given by M. Newman to evaluate the goodness of network community decompositions, and extends it to the more general case of directed graphs with overlapping community structures. Interesting properties of the proposed extension are discussed, a method for finding overlapping communities is proposed, and results of its application to benchmark case studies are reported. We also propose a new dataset which could be used as a reference benchmark for the identification of overlapping community structures. Comment: 22 pages, 11 figure
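As a reference point for the starting definition named in this abstract, the following is a minimal sketch of the standard Newman modularity for an undirected, unweighted graph with non-overlapping communities, $Q = \sum_c \left[ \frac{m_c}{m} - \left( \frac{d_c}{2m} \right)^2 \right]$, where $m$ is the number of edges, $m_c$ the number of edges inside community $c$, and $d_c$ the total degree of its nodes. It does not implement the paper's extension to directed graphs or overlapping communities. The function name `modularity` and the input layout (an edge list plus a node-to-community dictionary) are assumptions made for illustration.

```python
def modularity(edges, community):
    """Newman modularity of a partition of an undirected, unweighted graph.

    edges:     list of (u, v) pairs, one entry per undirected edge
    community: dict mapping every node to its (single) community label
    """
    m = len(edges)
    if m == 0:
        raise ValueError("graph has no edges")
    degree = {}
    internal = {}  # number of edges with both endpoints in the same community
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        if community[u] == community[v]:
            c = community[u]
            internal[c] = internal.get(c, 0) + 1
    total_degree = {}  # sum of node degrees per community
    for node, k in degree.items():
        c = community[node]
        total_degree[c] = total_degree.get(c, 0) + k
    # Observed fraction of internal edges minus the fraction expected
    # when edges are placed at random with the same degree sequence.
    return sum(
        internal.get(c, 0) / m - (total_degree[c] / (2 * m)) ** 2
        for c in total_degree
    )


if __name__ == "__main__":
    # Two triangles joined by a single bridge edge (2, 3).
    edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
    labels = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
    print(modularity(edges, labels))  # 5/14, about 0.357
```

Placing every node in one community gives a modularity of 0, which is the baseline any proposed decomposition has to beat.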

    Radar vision in the mapping of forest biodiversity from space

    Recent progress in remote sensing provides much-needed, large-scale spatio-temporal information on habitat structures important for biodiversity conservation. Here we examine the potential of a newly launched satellite-borne radar system (Sentinel-1) to map the biodiversity of twelve taxa across five temperate forest regions in central Europe. We show that the sensitivity of radar to habitat structure is similar to that of airborne laser scanning (ALS), the current gold standard in the measurement of forest structure. Our models of different facets of biodiversity reveal that radar performs as well as ALS; median R² over twelve taxa by ALS and radar are 0.51 and 0.57 respectively for the first non-metric multidimensional scaling axes representing assemblage composition. We further demonstrate the promising predictive ability of radar-derived data with external validation based on the species composition of birds and saproxylic beetles. Establishing new area-wide biodiversity monitoring by remote sensing will require the coupling of radar data to stratified and standardized collected local species data

    A Survey of Bayesian Statistical Approaches for Big Data

    The modern era is characterised as an era of information or Big Data. This has motivated a huge literature on new methods for extracting information and insights from these data. A natural question is how these approaches differ from those that were available prior to the advent of Big Data. We present a review of published studies that present Bayesian statistical approaches specifically for Big Data and discuss the reported and perceived benefits of these approaches. We conclude by addressing the question of whether focusing only on improving computational algorithms and infrastructure will be enough to face the challenges of Big Data

    Distance Matrix Reconstruction from Incomplete Distance Information for Sensor Network Localization

    No full text
    This paper initiates the principled study of distance reconstruction for distance-based node localization. We address an important issue in node localization by showing that the highly incomplete set of inter-node distance measurements obtained in ad-hoc node deployments carries sufficient information for the accurate reconstruction of the missing distances, even in the presence of noise. We provide an efficient and provably accurate algorithm for this reconstruction, and we show that the resulting error is bounded, decreasing at a rate that is inversely proportional to √n, the square root of the number of nodes in the region of deployment. Although this result is applicable to many localization schemes, in this paper we illustrate its use in conjunction with the popular MultiDimensional Scaling algorithm. Our analysis reveals valuable insights and key factors to consider during the sensor network setup phase, to improve the quality of the position estimates.