7 research outputs found

    Compressive Clustering of High-Dimensional Data


    Sensing Structured Signals with Active and Ensemble Methods

    Modern problems in signal processing and machine learning involve the analysis of data that is high-volume, high-dimensional, or both. In one example, scientists studying the environment must choose their set of measurements from an infinite set of possible sample locations. In another, performing inference on high-resolution images involves operating on vectors whose dimensionality is on the order of tens of thousands. To combat the challenges presented by these and other applications, researchers rely on two key features intrinsic to many large datasets. First, large volumes of data can often be accurately represented by a few key points, allowing for efficient processing, summary, and collection of data. Second, high-dimensional data often has low-dimensional intrinsic structure that can be leveraged for processing and storage. This thesis leverages these facts to develop and analyze algorithms capable of handling the challenges presented by modern data.

    The first scenario considered in this thesis is that of monitoring regions of low oxygen concentration (hypoxia) in lakes via an autonomous robot. Tracking the spatial extent of such hypoxic regions is of great interest and importance to scientists studying the Great Lakes, but current systems rely heavily on hydrodynamic models and a very small number of measurements at predefined sample locations. Existing active learning algorithms minimize the samples required to determine the spatial extent but do not consider the distance traveled during the estimation procedure. We propose a novel active learning algorithm for tracking such regions that balances both the number of measurements taken and the distance traveled in estimating the boundary of the hypoxic zone.

    The second scenario considered is learning a union of subspaces (UoS) model that best fits a given collection of points. This model can be viewed as a generalization of principal components analysis (PCA) in which data vectors are drawn from one of several low-dimensional linear subspaces of the ambient space, and it has applications in image segmentation and object recognition. The problem of automatically sorting the data according to nearest subspace is known as subspace clustering, and existing unsupervised algorithms perform this task well in many situations. However, state-of-the-art algorithms do not fully leverage the problem geometry, and the resulting clustering errors are far from the best possible using the UoS model. We present two novel means of bridging this gap. We first present a method of incorporating semi-supervised information into existing unsupervised subspace clustering algorithms in the form of pairwise constraints between items. We next study an ensemble algorithm for unsupervised subspace clustering that functions by combining the outputs from many efficient but inaccurate base clusterings to achieve state-of-the-art performance. Finally, we perform the first principled study of model selection for subspace clustering, in which we define clustering quality metrics that do not rely on the ground truth and evaluate their ability to reliably predict clustering accuracy.

    The contributions of this thesis demonstrate the applicability of tools from signal processing and machine learning to problems ranging from scientific exploration to computer vision. By utilizing inherent structure in the data, we develop algorithms that are efficient in terms of computational complexity and other realistic costs, making them truly practical for modern problems in data science.

    PhD, Electrical Engineering: Systems. University of Michigan, Horace H. Rackham School of Graduate Studies.
    https://deepblue.lib.umich.edu/bitstream/2027.42/140795/1/lipor_1.pd
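
    To make the union-of-subspaces model above concrete, here is a minimal K-subspaces-style sketch, assuming data arranged as columns of a matrix; it is not the thesis's method, and every function and parameter name is hypothetical. The routine alternates between assigning each point to the subspace with the smallest projection residual and refitting each subspace from the top singular vectors of its assigned points.

        # Minimal K-subspaces-style sketch of the union-of-subspaces (UoS) model.
        # This is an illustration, not the thesis's algorithm; every name and
        # parameter below is hypothetical.
        import numpy as np

        def k_subspaces(X, n_clusters, dim, n_iters=50, seed=0):
            """Cluster the columns of X (D x N) into n_clusters subspaces of dimension dim."""
            rng = np.random.default_rng(seed)
            D, N = X.shape
            labels = rng.integers(n_clusters, size=N)  # random initial assignment
            bases = [np.linalg.qr(rng.standard_normal((D, dim)))[0]
                     for _ in range(n_clusters)]       # random orthonormal bases

            for _ in range(n_iters):
                # Assignment step: residual of projecting each point onto each subspace.
                residuals = np.stack([np.linalg.norm(X - U @ (U.T @ X), axis=0)
                                      for U in bases])            # shape (n_clusters, N)
                new_labels = residuals.argmin(axis=0)
                if np.array_equal(new_labels, labels):
                    break
                labels = new_labels
                # Update step: refit each basis with the top singular vectors
                # of the points currently assigned to that subspace.
                for k in range(n_clusters):
                    Xk = X[:, labels == k]
                    if Xk.shape[1] >= dim:
                        U, _, _ = np.linalg.svd(Xk, full_matrices=False)
                        bases[k] = U[:, :dim]
            return labels, bases

    The alternation mirrors k-means, with the projection residual onto a subspace playing the role of distance to a centroid; the semi-supervised, ensemble, and model-selection contributions summarized in the abstract address this same clustering problem with more sophisticated machinery.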

    Robust and Efficient Data Clustering with Signal Processing on Graphs

    Data is pervasive in today's world, and has been for quite some time. With the increasing volume of data to process, there is a need for techniques that are faster than, and at least as accurate as, those we already have. In particular, the last decade saw the rapid growth of social networks and of ubiquitous sensing (through smartphones and the Internet of Things). These phenomena, together with progress in bioinformatics and traffic monitoring, have pushed forward research on graph analysis and called for more efficient techniques.

    Clustering is an important field of machine learning because it is an unsupervised technique: one does not need a ground truth about the data to start learning. With it, one can extract meaningful patterns from large data sources without requiring an expert to annotate a portion of the data, which can be very costly. However, the clustering techniques designed so far tend to be computationally demanding and have trouble scaling to the size of today's problems. The emergence of Graph Signal Processing (GSP), which applies traditional signal processing techniques to signals defined on graphs rather than on time, has provided additional tools for efficient graph analysis. By considering the clustering assignment as a signal lying on the nodes of the graph, one can apply the tools of GSP to improve graph clustering and, more generally, data clustering at large.

    In this thesis, we present several techniques that use some of the latest developments in GSP to improve the scalability of clustering while aiming for an accuracy close to that of Spectral Clustering, a well-known graph clustering technique with a solid mathematical foundation. On the one hand, we explore the benefits of random signal filtering, from both practical and theoretical standpoints, for determining the eigenvectors of the graph Laplacian. In practice, this requires designing polynomial approximations of the step function, for which we provide an accelerated heuristic. We use this line of work to reduce the complexity of dynamic graph clustering, the problem of partitioning a graph that evolves over time at each snapshot. We also use it to propose a fast method for determining the subspace spanned by the first eigenvectors of any symmetric matrix. This subspace is useful for clustering, as it lies at the heart of Spectral Clustering, but it goes beyond that: it also serves in graph visualization (with Laplacian Eigenmaps) and data mining (with Principal Components Projection). On the other hand, inspired by the latest work on graph filter localization, we propose an extremely fast clustering technique that performs clustering using only graph filtering and combines the results to obtain a partition of the nodes.

    These contributions are complemented by experiments on both synthetic datasets and real-world problems. Since we believe that research should be shared in order to progress, all the experiments in this thesis are publicly available on my personal GitHub account.
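
    As a rough illustration of the random-signal-filtering idea described above, the sketch below low-pass filters a few random graph signals with a polynomial of the normalized Laplacian and runs k-means on the filtered node features. It is only a caricature of the approach, assuming a dense adjacency matrix: a plain power of a smoothing operator replaces the accelerated polynomial approximation of the step function, and all names and parameters are assumptions made for the example.

        # Rough sketch of clustering with filtered random graph signals. A plain
        # power of a smoothing operator stands in for the polynomial step-filter
        # approximation discussed above; names and parameters are illustrative
        # assumptions, not the thesis's implementation.
        import numpy as np
        from scipy.sparse.csgraph import laplacian
        from sklearn.cluster import KMeans

        def filtered_signal_clustering(A, n_clusters, n_signals=30, poly_order=10, seed=0):
            """Cluster the nodes of a graph given its symmetric adjacency matrix A (dense, N x N)."""
            rng = np.random.default_rng(seed)
            N = A.shape[0]
            L = np.asarray(laplacian(A, normed=True))   # normalized graph Laplacian
            S = np.eye(N) - 0.5 * L                     # smoothing operator: low graph
                                                        # frequencies kept, high ones damped
            R = rng.standard_normal((N, n_signals))     # random graph signals, one per column
            for _ in range(poly_order):                 # apply the (polynomial) low-pass filter
                R = S @ R
            Q, _ = np.linalg.qr(R)                      # orthonormalize the filtered signals
            # Each row of Q is a node feature vector approximating a spectral embedding.
            return KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(Q)

    Each filtering step is just one matrix-vector multiply per random signal, so the embedding is obtained without explicitly computing any eigenvectors; avoiding that eigendecomposition is the kind of scalability gain the thesis pursues.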
