
    Sensing Structured Signals with Active and Ensemble Methods

    Modern problems in signal processing and machine learning involve the analysis of data that is high-volume, high-dimensional, or both. In one example, scientists studying the environment must choose their set of measurements from an infinite set of possible sample locations. In another, performing inference on high-resolution images involves operating on vectors whose dimensionality is on the order of tens of thousands. To combat the challenges presented by these and other applications, researchers rely on two key features intrinsic to many large datasets. First, large volumes of data can often be accurately represented by a few key points, allowing for efficient processing, summary, and collection of data. Second, high-dimensional data often has low-dimensional intrinsic structure that can be leveraged for processing and storage. This thesis leverages these facts to develop and analyze algorithms capable of handling the challenges presented by modern data. The first scenario considered in this thesis is that of monitoring regions of low oxygen concentration (hypoxia) in lakes via an autonomous robot. Tracking the spatial extent of such hypoxic regions is of great interest and importance to scientists studying the Great Lakes, but current systems rely heavily on hydrodynamic models and a very small number of measurements at predefined sample locations. Existing active learning algorithms minimize the samples required to determine the spatial extent but do not consider the distance traveled during the estimation procedure. We propose a novel active learning algorithm for tracking such regions that balances both the number of measurements taken and the distance traveled in estimating the boundary of the hypoxic zone. The second scenario considered is learning a union of subspaces (UoS) model that best fits a given collection of points. 
This model can be viewed as a generalization of principal components analysis (PCA) in which data vectors are drawn from one of several low-dimensional linear subspaces of the ambient space and has applications in image segmentation and object recognition. The problem of automatically sorting the data according to nearest subspace is known as subspace clustering, and existing unsupervised algorithms perform this task well in many situations. However, state-of-the-art algorithms do not fully leverage the problem geometry, and the resulting clustering errors are far from the best possible using the UoS model. We present two novel means of bridging this gap. We first present a method of incorporating semi-supervised information into existing unsupervised subspace clustering algorithms in the form of pairwise constraints between items. We next study an ensemble algorithm for unsupervised subspace clustering that functions by combining the outputs from many efficient but inaccurate base clusterings to achieve state-of-the-art performance. Finally, we perform the first principled study of model selection for subspace clustering, in which we define clustering quality metrics that do not rely on the ground truth and evaluate their ability to reliably predict clustering accuracy. The contributions of this thesis demonstrate the applicability of tools from signal processing and machine learning to problems ranging from scientific exploration to computer vision. By utilizing inherent structure in the data, we develop algorithms that are efficient in terms of computational complexity and other realistic costs, making them truly practical for modern problems in data science.
PhD thesis, Electrical Engineering: Systems, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/140795/1/lipor_1.pd

    A Finite-Horizon Approach to Active Level Set Estimation

    We consider the problem of active learning in the context of spatial sampling for level set estimation (LSE), where the goal is to localize all regions where a function of interest lies above/below a given threshold as quickly as possible. We present a finite-horizon search procedure to perform LSE in one dimension while optimally balancing both the final estimation error and the distance traveled for a fixed number of samples. A tuning parameter is used to trade off between the estimation accuracy and distance traveled. We show that the resulting optimization problem can be solved in closed form and that the resulting policy generalizes existing approaches to this problem. We then show how this approach can be used to perform level set estimation in higher dimensions under the popular Gaussian process model. Empirical results on synthetic data indicate that as the cost of travel increases, our method's ability to treat distance nonmyopically allows it to significantly improve on the state of the art. On real air quality data, our approach achieves roughly one fifth the estimation error at less than half the cost of competing algorithms.
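The trade-off described above, between shrinking the uncertainty about the boundary and limiting travel, can be illustrated with a toy one-dimensional search. This is a greedy heuristic sketch, not the paper's closed-form finite-horizon policy: the function name, the candidate grid, and the assumption that `f` is decreasing are all illustrative choices.

```python
import numpy as np

def boundary_search(f, threshold, lo, hi, pos, lam, n_samples):
    """Toy 1D level-set boundary search. At each step, choose the sample
    location minimizing a weighted sum of (a) the worst-case width of the
    remaining uncertainty interval and (b) the distance traveled from the
    current position, with lam controlling the trade-off.
    Assumes f is decreasing in x, so it crosses `threshold` once."""
    history = []
    for _ in range(n_samples):
        # candidate sample points strictly inside the current interval
        cands = np.linspace(lo, hi, 50)[1:-1]
        # worst-case remaining width after sampling at x: the larger half
        widths = np.maximum(cands - lo, hi - cands)
        travel = np.abs(cands - pos)
        x = cands[np.argmin(widths + lam * travel)]
        if f(x) > threshold:   # f decreasing: boundary lies to the right
            lo = x
        else:
            hi = x
        pos = x
        history.append(x)
    return 0.5 * (lo + hi), history

# With lam = 0 this reduces to (near-)bisection; larger lam keeps samples
# close to the current position at the cost of slower interval shrinkage.
est, hist = boundary_search(lambda x: 1 - x, 0.5, 0.0, 1.0, 0.0, 0.0, 12)
```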

    When less is more: How increasing the complexity of machine learning strategies for geothermal energy assessments may not lead toward better estimates

    Previous moderate- and high-temperature geothermal resource assessments of the western United States utilized data-driven methods and expert decisions to estimate resource favorability. Although expert decisions can add confidence to the modeling process by ensuring reasonable models are employed, expert decisions also introduce human and, thereby, model bias. This bias can present a source of error that reduces the predictive performance of the models and confidence in the resulting resource estimates. Our study aims to develop robust data-driven methods with the goals of reducing bias and improving predictive ability. We present and compare nine favorability maps for geothermal resources in the western United States using data from the U.S. Geological Survey's 2008 geothermal resource assessment. Two favorability maps are created using the expert decision-dependent methods from the 2008 assessment (i.e., weight-of-evidence and logistic regression). With the same data, we then create six different favorability maps using logistic regression (without underlying expert decisions), XGBoost, and support-vector machines paired with two training strategies. The training strategies are customized to address the inherent challenges of applying machine learning to the geothermal training data, which have no negative examples and severe class imbalance. We also create another favorability map using an artificial neural network. We demonstrate that modern machine learning approaches can improve upon systems built with expert decisions. We also find that XGBoost, a non-linear algorithm, produces greater agreement with the 2008 results than linear logistic regression without expert decisions, because the expert decisions in the 2008 assessment rendered the otherwise linear approaches non-linear, even though the 2008 assessment used only linear methods.
The F1 scores for all approaches appear low (F1 score < 0.10), do not improve with increasing model complexity, and, therefore, indicate the fundamental limitations of the input features (i.e., training data). Until improved feature data are incorporated into the assessment process, simple non-linear algorithms (e.g., XGBoost) perform equally well or better than more complex methods (e.g., artificial neural networks) and remain easier to interpret.
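The effect of severe class imbalance on F1 can be reproduced with a small synthetic experiment (all numbers here are illustrative, not the study's data): with roughly one positive cell per thousand, even a classifier with perfect recall and a modest 1% false-positive rate is held to a low F1.

```python
import numpy as np

def f1_score(y_true, y_pred):
    """Binary F1 from scratch; positives are the rare class."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Synthetic map: ~1 positive per 1000 cells, classifier keeps every true
# positive but also flags ~1% of all cells spuriously.
rng = np.random.default_rng(0)
y_true = (rng.random(100_000) < 0.001).astype(int)
y_pred = y_true.copy()
y_pred[rng.random(100_000) < 0.01] = 1
f1 = f1_score(y_true, y_pred)   # well below 0.3 despite perfect recall
```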

    Robust blind calibration via total least squares

    This paper considers the problem of blindly calibrating large sensor networks to account for unknown gain and offset in each sensor. Under the assumption that the true signals measured by the sensors lie in a known lower-dimensional subspace, previous work has shown that blind calibration is possible. In practical scenarios, perfect signal subspace knowledge is difficult to obtain. In this paper, we show that a solution robust to misspecification of the signal subspace can be obtained using total least squares (TLS) estimation. This formulation provides significant performance benefits over the standard least squares approach, as we show. Next, we extend this TLS algorithm to incorporate exact knowledge of a few sensor gains, termed partially-blind total least squares. Index Terms: blind calibration, sensor networks, total least squares.
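The classical total least squares construction underlying this approach can be sketched in a few lines of NumPy. This is the generic SVD-based TLS solver for A x ≈ b, allowing errors in both A and b, not the paper's blind or partially-blind calibration algorithm; the toy data and names are illustrative.

```python
import numpy as np

def tls(A, b):
    """Total least squares solution of A x ≈ b via the standard SVD recipe:
    stack [A | b], take the right singular vector for the smallest singular
    value, and normalize so its last entry is -1."""
    Z = np.column_stack([A, b])
    _, _, Vt = np.linalg.svd(Z)
    v = Vt[-1]                 # right singular vector of the smallest sigma
    return -v[:-1] / v[-1]

# Toy check: recover coefficients when BOTH the regressors and the
# observations carry noise, the regime where TLS beats ordinary least squares.
rng = np.random.default_rng(1)
x_true = np.array([2.0, -1.0])
A = rng.normal(size=(200, 2))
b = A @ x_true
x_hat = tls(A + 0.01 * rng.normal(size=A.shape),
            b + 0.01 * rng.normal(size=b.shape))
```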

    Clustering Quality Metrics for Subspace Clustering

    We study the problem of clustering validation, i.e., clustering evaluation without knowledge of ground-truth labels, for the increasingly popular framework known as subspace clustering. Existing clustering quality metrics (CQMs) rely heavily on a notion of distance between points, but common metrics fail to capture the geometry of subspace clustering. We propose a novel point-to-point pseudometric for points lying on a union of subspaces and show how this allows for the application of existing CQMs to the subspace clustering problem. We provide theoretical and empirical justification for the proposed point-to-point distance, and then demonstrate on a number of common benchmark datasets that our proposed methods generally outperform existing graph-based CQMs in terms of choosing the best clustering and the number of clusters.
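To see why a subspace-aware pseudometric differs from ordinary distance, consider a toy example (this is not the paper's construction, just an illustration of the idea): the sine of the angle between the lines spanned by two points is a pseudometric under which any two points on the same one-dimensional subspace are at distance zero, no matter how far apart they are in Euclidean terms.

```python
import numpy as np

def line_pseudometric(x, y):
    """Toy subspace-aware pseudometric: sin of the angle between the lines
    spanned by x and y. Zero whenever x and y lie on the same line through
    the origin, regardless of scale or sign."""
    c = abs(x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return np.sqrt(max(0.0, 1.0 - c * c))

x = np.array([1.0, 2.0, 3.0])
d_same = line_pseudometric(x, -4.0 * x)                    # same line: 0
d_diff = line_pseudometric(x, np.array([3.0, -2.0, 1.0]))  # nearly orthogonal: ~0.99
```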

    A Supervised Learning Approach to Water Quality Parameter Prediction and Fault Detection

    Water quality parameters such as dissolved oxygen and turbidity play a key role in policy decisions regarding the maintenance and use of the nation's major bodies of water. In particular, the United States Geological Survey (USGS) maintains a massive suite of sensors throughout the nation's waterways that are used to inform such decisions, with all data made available to the public. However, the corresponding measurements are regularly corrupted due to sensor faults, fouling, and decalibration, and hence USGS scientists are forced to spend costly time and resources manually examining data to look for anomalies. We present a method of automatically detecting such events using supervised machine learning. We first present an extensive study of which water quality parameters can be reliably predicted, using support vector machines and gradient boosting algorithms for regression. We then show that the trained predictors can be used to automatically detect sensor decalibration, providing a system that could be easily deployed by the USGS to reduce the resources needed to maintain data fidelity.
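The detection scheme described, predict one parameter from the others and flag large residuals, can be sketched with a linear predictor (the study uses SVMs and gradient boosting; a least-squares model keeps the sketch dependency-free, and the feature names and fault injection are entirely synthetic).

```python
import numpy as np

def fit_predictor(X, y):
    """Least-squares predictor of one water-quality parameter from the others."""
    Xb = np.column_stack([X, np.ones(len(X))])   # add intercept column
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def flag_decalibration(w, X, y, k=4.0):
    """Flag samples whose prediction residual exceeds k standard deviations,
    a simple proxy for drift or decalibration events."""
    Xb = np.column_stack([X, np.ones(len(X))])
    resid = y - Xb @ w
    return np.abs(resid) > k * resid.std()

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))               # stand-ins for co-located sensors
y = X @ np.array([0.5, -1.0, 2.0]) + 0.1 * rng.normal(size=500)
w = fit_predictor(X, y)
y_faulty = y.copy()
y_faulty[100:110] += 3.0                    # injected drift event
flags = flag_decalibration(w, X, y_faulty)  # flags the injected window
```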

    A Novel Framework for Deep Learning from Pairwise Constraints

    We consider the problem of deep semi-supervised classification, where label information is obtained in the form of pairwise constraints. Existing approaches to this problem begin with a clustering network and utilize custom loss functions to encourage the learned representations to conform to the obtained constraints. We present a novel framework that seamlessly integrates pairwise constrained clustering, semi-supervised classification, and supervised classification. This approach leverages advances in unsupervised learning by jointly training a Siamese network and autoencoder to learn a representation that is amenable to both clustering and classification. The resulting framework outperforms existing approaches on common image recognition datasets.
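The Siamese side of such a framework is typically trained with a contrastive loss over constrained pairs. The following is a minimal NumPy sketch of that loss, not the paper's joint objective: must-link pairs are pulled together, cannot-link pairs pushed at least a margin apart.

```python
import numpy as np

def contrastive_loss(za, zb, same, margin=1.0):
    """Pairwise (Siamese) contrastive loss.
    za, zb: (n, d) embeddings of the two sides of each pair.
    same:   1 for must-link pairs, 0 for cannot-link pairs."""
    d = np.linalg.norm(za - zb, axis=1)
    pull = same * d ** 2                                  # attract must-links
    push = (1 - same) * np.maximum(0.0, margin - d) ** 2  # repel cannot-links
    return np.mean(pull + push)

# A close must-link pair and a well-separated cannot-link pair both incur
# little penalty, so the total loss is small.
za = np.array([[0.0, 0.0], [0.0, 0.0]])
zb = np.array([[0.1, 0.0], [2.0, 0.0]])
same = np.array([1, 0])
loss = contrastive_loss(za, zb, same)
```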

    K-Subspaces for Sequential Data

    We study the problem of clustering high-dimensional temporal data such as video sequences of human motion, where points that arrive sequentially in time are likely to belong to the same cluster. State-of-the-art approaches to this problem rely on the union-of-subspaces model, where points lie near one of KK unknown low-dimensional subspaces. We propose the first approach to sequential subspace clustering based on the popular KK-Subspaces (KSS) formulation, which we refer to as Temporal KK-Subspaces (TKSS). We show how sequential information can be incorporated into the KSS problem and provide an efficient algorithm for approximate minimization of the resulting cost function, proving convergence to a local minimum. Results on benchmark datasets show that TKSS achieves state-of-the-art performance, obtaining an accuracy increase of over 10% compared to existing methods
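One way to picture how sequential information enters a K-Subspaces-style method is the following sketch (an illustrative alternating scheme, not the paper's TKSS algorithm): the assignment step becomes a dynamic program in which a penalty `lam` discourages label changes between consecutive points, and each subspace is then refit by SVD of its assigned points. The contiguous-chunk warm start is also an assumption of this sketch.

```python
import numpy as np

def temporal_kss(X, K, dim, lam=1.0, n_iter=10):
    """Illustrative sequential K-subspaces. X: (n, D) points in time order.
    Alternates (1) a Viterbi-style assignment pass with switching penalty lam
    and (2) SVD refits of each subspace. Returns labels in {0..K-1}."""
    n, D = X.shape
    # warm start: split the sequence into K contiguous chunks
    labels = np.minimum(np.arange(n) * K // n, K - 1)
    bases = [np.zeros((D, dim)) for _ in range(K)]
    for _ in range(n_iter):
        for k in range(K):                       # subspace update step
            pts = X[labels == k]
            if len(pts) >= dim:
                U, _, _ = np.linalg.svd(pts.T, full_matrices=False)
                bases[k] = U[:, :dim]
        # squared residual of every point against every subspace: (n, K)
        R = np.stack([np.linalg.norm(X - (X @ U) @ U.T, axis=1) ** 2
                      for U in bases], axis=1)
        # dynamic program over the sequence: cost[k] = best cost ending in k
        cost = R[0].copy()
        back = np.zeros((n, K), dtype=int)
        switch = lam * (1 - np.eye(K))           # 0 to stay, lam to switch
        for t in range(1, n):
            trans = cost[None, :] + switch       # trans[new_label, prev_label]
            back[t] = np.argmin(trans, axis=1)
            cost = R[t] + trans.min(axis=1)
        labels = np.zeros(n, dtype=int)
        labels[-1] = int(np.argmin(cost))
        for t in range(n - 1, 0, -1):            # backtrack best label path
            labels[t - 1] = back[t, labels[t]]
    return labels

# Two 1-D subspaces in R^5, 50 time-ordered points from each plus noise:
# the temporal penalty keeps each segment's labels constant.
rng = np.random.default_rng(3)
t = rng.normal(size=100)
u1, u2 = np.eye(5)[0], np.eye(5)[1]
X = np.concatenate([np.outer(t[:50], u1), np.outer(t[50:], u2)])
X = X + 0.01 * rng.normal(size=(100, 5))
labels = temporal_kss(X, 2, 1, lam=0.5)
```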