
    Structures in High-Dimensional Data: Intrinsic Dimension and Cluster Analysis

    With today's improved measurement and data-storage technologies it has become common to collect data in search of hypotheses rather than to test them, that is, to do exploratory data analysis. Finding patterns and structures in the data is the main goal. This thesis deals with two kinds of structures that can convey relationships between different parts of data in a high-dimensional space: manifolds and clusters. They are in a way opposites of each other: a manifold structure shows that it is plausible to connect two distant points through the manifold, while a clustering shows that it is plausible to separate two nearby points by assigning them to different clusters. But clusters and manifolds can also coincide: each cluster can be a manifold of its own.

    The first paper in this thesis concerns one specific aspect of a manifold structure, namely its dimension, also called the intrinsic dimension of the data. A novel estimator of intrinsic dimension, taking advantage of "the curse of dimensionality", is proposed and evaluated. It is shown to have, in general, less bias than estimators from the literature, and it can therefore better distinguish manifolds of different dimensions.

    The second and third papers in this thesis concern cluster analysis of data generated by flow cytometry, a high-throughput single-cell measurement technology. In this area, clustering is routinely performed by manual assignment of data in two-dimensional plots to identify cell populations. This is a tedious and subjective task, especially since the data often has four, eight, twelve or even more dimensions, and the analyst must decide which two dimensions to view together, and in which order. The second paper proposes a new pipeline for automated cell population identification, which can process multiple flow cytometry samples in parallel using a hierarchical model that shares information between the clusterings of the samples; corresponding clusters in different samples are thereby made similar while variation in cluster location and shape is still allowed. The third and final paper investigates statistical tests for unimodality as a tool for quality control of automated cell population identification algorithms. It is shown that the different tests embody different interpretations of unimodality and thus accept different kinds of clusters as sufficiently close to unimodal.
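    The estimator proposed in the thesis is not reproduced here. As a point of reference, the sketch below implements the classical Levina-Bickel maximum-likelihood estimator of intrinsic dimension, a standard baseline from this literature; the neighbourhood size k, the helper name, and the Swiss-roll-style test data are illustrative assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def mle_intrinsic_dimension(X, k=10):
    """Levina-Bickel MLE of intrinsic dimension (a standard baseline,
    not the novel estimator proposed in the thesis)."""
    tree = cKDTree(X)
    dists, _ = tree.query(X, k=k + 1)   # k nearest neighbours plus the point itself
    dists = dists[:, 1:]                # drop the zero self-distance
    # Per-point estimate: inverse mean log-ratio of the k-th neighbour
    # distance to the closer neighbour distances.
    log_ratios = np.log(dists[:, -1][:, None] / dists[:, :-1])
    return np.mean((k - 1) / np.sum(log_ratios, axis=1))

# Points on a 2-D manifold (a Swiss-roll-like surface) embedded in 3-D
rng = np.random.default_rng(0)
t = rng.uniform(0, 4 * np.pi, 2000)
h = rng.uniform(0, 5, 2000)
X = np.column_stack([t * np.cos(t), h, t * np.sin(t)])
print(mle_intrinsic_dimension(X))   # should print a value close to 2
```

    Estimators of this family are known to be biased downwards as the intrinsic dimension grows, which is the kind of bias the thesis's estimator is designed to reduce.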

    Learning effective stochastic differential equations from microscopic simulations: combining stochastic numerics and deep learning

    We identify effective stochastic differential equations (SDE) for coarse observables of fine-grained particle- or agent-based simulations; these SDE then provide coarse surrogate models of the fine-scale dynamics. We approximate the drift and diffusivity functions in these effective SDE through neural networks, which can be thought of as effective stochastic ResNets. The loss function is inspired by, and embodies, the structure of established stochastic numerical integrators (here, Euler-Maruyama and Milstein); our approximations can thus benefit from the error analysis of these underlying numerical schemes. They also lend themselves naturally to "physics-informed" gray-box identification when approximate coarse models, such as mean-field equations, are available. Our approach does not require long trajectories, works on scattered snapshot data, and is designed to naturally handle different time steps per snapshot. We consider both the case where the coarse collective observables are known in advance and the case where they must be found in a data-driven manner.

    Comment: 19 pages, includes supplemental material
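    To make the loss construction concrete, here is a minimal PyTorch sketch (an illustration under simplifying assumptions, not the authors' code): for a one-dimensional observable, the Euler-Maruyama scheme turns each snapshot pair (x0, x1) with step dt into a Gaussian transition density, and the negative log-likelihood of that density is the training loss. The network sizes and the toy Ornstein-Uhlenbeck data are assumptions.

```python
import torch
import torch.nn as nn

class EffectiveSDE(nn.Module):
    """Neural drift f(x) and diffusivity sigma(x) of a 1-D effective SDE."""
    def __init__(self, hidden=64):
        super().__init__()
        self.drift = nn.Sequential(nn.Linear(1, hidden), nn.Tanh(), nn.Linear(hidden, 1))
        self.log_sigma = nn.Sequential(nn.Linear(1, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def loss(self, x0, x1, dt):
        # Euler-Maruyama transition: x1 | x0 ~ N(x0 + f(x0) dt, sigma(x0)^2 dt).
        # Only snapshot pairs are needed, and every pair may carry its own dt.
        mean = x0 + self.drift(x0) * dt
        var = torch.exp(2 * self.log_sigma(x0)) * dt
        return 0.5 * (torch.log(2 * torch.pi * var) + (x1 - mean) ** 2 / var).mean()

# Toy snapshot pairs from an Ornstein-Uhlenbeck process dX = -X dt + 0.5 dW
torch.manual_seed(0)
x0 = torch.randn(4096, 1)
dt = torch.full((4096, 1), 0.01)
x1 = x0 - x0 * dt + 0.5 * torch.sqrt(dt) * torch.randn_like(x0)

model = EffectiveSDE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(2000):
    opt.zero_grad()
    model.loss(x0, x1, dt).backward()
    opt.step()
```

    Swapping the Gaussian Euler-Maruyama density for a Milstein-based transition density changes only the loss term, which is how the approximations inherit the error analysis of the chosen integrator.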

    Managing distributed situation awareness in a team of agents

    The research presented in this thesis investigates how best to manage Distributed Situation Awareness (DSA) for a team of agents tasked to conduct search activity with limited resources (battery life, memory use, computational power, etc.). In the first part of the thesis, an algorithm to coordinate agents (e.g., UAVs) is developed. It is based on Delaunay triangulation, with the aim of supporting efficient, adaptable, scalable, and predictable search. Results from simulations and from physical experiments with UAVs show that the developed method performs well in terms of resource utilisation, adaptability, scalability, and predictability compared with existing fixed-pattern, pseudorandom, and hybrid methods.

    The second part of the thesis employs Bayesian Belief Networks (BBNs) to define and manage DSA based on the information obtained from the agents' search activity. Algorithms and methods were developed to describe how agents update the BBN to model the system's DSA, predict plausible future states of the agents' search area, handle uncertainties, manage agents' beliefs (based on sensor differences), monitor agents' interactions, and maintain an adaptable BBN for DSA management using structural learning. The evaluation uses environment situation information obtained from the agents' sensors during search activity, and the results demonstrate superior performance over well-known alternative methods in terms of situation prediction accuracy, uncertainty handling, and adaptability.

    The thesis's main contributions are therefore: (i) a simple search planning algorithm that combines the strengths of fixed-pattern and pseudorandom methods and offers good resource utilisation, scalability, adaptability, and predictability; (ii) a formal model of DSA using BBNs that can be updated and learnt during the mission; and (iii) an investigation of the relationship between agents' search coordination and DSA management.
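    As a flavour of the geometric idea only (the thesis's coordination algorithm is more involved than this), the sketch below uses SciPy's Delaunay triangulation to turn seed points in a search area into triangle-centroid waypoints and splits them among agents; the area size, seed count, and round-robin assignment are assumptions.

```python
import numpy as np
from scipy.spatial import Delaunay

rng = np.random.default_rng(1)
seeds = rng.uniform(0, 100, size=(30, 2))       # hypothetical 100 x 100 search area
tri = Delaunay(seeds)                           # triangulate the seed points
centroids = seeds[tri.simplices].mean(axis=1)   # one candidate waypoint per triangle

# Round-robin split of the waypoints across the team
n_agents = 3
assignments = {a: centroids[a::n_agents] for a in range(n_agents)}
for a, waypoints in assignments.items():
    print(f"agent {a}: {len(waypoints)} waypoints")
```

    A triangulation-based decomposition of this kind is predictable (the waypoints are fixed once the seeds are) yet easy to adapt by re-triangulating when seeds are added or removed, matching the adaptability and predictability goals stated above.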

    Bayesian score calibration for approximate models

    Scientists continue to develop increasingly complex mechanistic models to reflect their knowledge more realistically. Statistical inference using these models can be challenging since the corresponding likelihood function is often intractable and model simulation may be computationally burdensome. Fortunately, in many of these situations it is possible to adopt a surrogate model or approximate likelihood function. It may be convenient to conduct Bayesian inference directly with the surrogate, but this can result in bias and poor uncertainty quantification. In this paper we propose a new method for adjusting approximate posterior samples to reduce bias and produce more accurate uncertainty quantification. We do this by optimizing a transform of the approximate posterior that maximizes a scoring rule. Our approach requires only a (fixed) small number of complex model simulations and is numerically stable. We demonstrate good performance of the new method on several examples of increasing complexity.

    Comment: 27 pages, 8 figures, 5 tables
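    The following one-dimensional sketch illustrates the calibration idea in its simplest form (my illustration, not the authors' algorithm, which works with general scoring rules): given a small batch of simulated true parameters and corresponding approximate-posterior samples, learn an affine map of the samples that maximizes the average Gaussian log score of the true parameters. The simulated bias and sample sizes are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
M, S = 50, 200                          # model simulations, posterior samples each
theta_true = rng.normal(0, 1, M)        # parameters drawn from the prior
# Stand-in approximate posteriors: biased by +0.5
approx = theta_true[:, None] + 0.5 + 0.3 * rng.normal(size=(M, S))

def neg_log_score(params):
    a, log_b = params
    z = a + np.exp(log_b) * approx      # affinely transformed posterior samples
    mu, sd = z.mean(axis=1), z.std(axis=1)
    # Gaussian log score of each true parameter under its transformed posterior
    ll = -0.5 * np.log(2 * np.pi * sd**2) - (theta_true - mu) ** 2 / (2 * sd**2)
    return -ll.mean()

res = minimize(neg_log_score, x0=[0.0, 0.0])
print(f"shift {res.x[0]:.2f}, scale {np.exp(res.x[1]):.2f}")  # shift near -0.5
```

    Only the M complex-model simulations are needed to fit the transform, which is what keeps the cost of the adjustment fixed and small.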

Probabilistic Model Discovery, Relational Learning and Scalable Inference

    Department of Computer Science and Engineering

    This thesis studies problems in compositionality for machine learning models in several settings, including relational learning, scalability, and deep models. Compositionality refers to the process of building complex objects out of small ones, a concept worth bringing into machine learning because it appears at many scales, from the atomic to planetary structures. The machine learning models in this thesis center around Gaussian processes whose covariance functions are constructed compositionally. The proposed approach builds methods that can explore the compositional model space automatically and efficiently, and strives to make the obtained models interpretable.

    These problems are both important and challenging. Multivariate or relational learning is standard in time series analysis across many domains, yet existing methods of compositional learning do not extend to such settings, since the explosion of the model space makes them infeasible. Learning compositional structures is already a time-consuming task; although approximation methods exist, they do not work well for compositional covariances, which makes it even harder to propose a scalable approach without sacrificing model performance. Finally, analyzing hierarchical deep Gaussian processes is notoriously difficult, especially when incorporating different covariance functions: previous work focuses on a single covariance function and is difficult to generalize to other cases.

    The goal of this thesis is to propose solutions to these problems. The first contribution is a general framework for modeling multiple time series which provides descriptive relations between the series. Second, the thesis presents efficient probabilistic approaches to the model search problem, which was previously solved by exhaustive enumeration and evaluation. Furthermore, a scalable inference method for Gaussian processes is proposed, providing accurate approximation with guaranteed error bounds. Last but not least, to address existing issues in deep Gaussian processes, the thesis presents a unified theoretical framework that explains their pathologies, with better error bounds for various kernels than existing work, together with rates of convergence.
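    As a small, concrete example of a compositionally constructed covariance function (a scikit-learn sketch of the general idea, not the thesis's search procedure): sums and products of base kernels are again valid kernels, so a structure such as "smooth trend + periodic component + noise" can be written directly. The kernel choices and toy data are assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ExpSineSquared, WhiteKernel

# Compositional covariance: smooth trend + periodic component + observation noise
kernel = (RBF(length_scale=10.0)
          + ExpSineSquared(length_scale=1.0, periodicity=12.0)
          + WhiteKernel(noise_level=0.1))

rng = np.random.default_rng(3)
X = np.linspace(0, 60, 120)[:, None]
y = 0.05 * X.ravel() + np.sin(2 * np.pi * X.ravel() / 12) + 0.1 * rng.normal(size=120)

gp = GaussianProcessRegressor(kernel=kernel).fit(X, y)
print(gp.kernel_)   # the fitted compositional kernel with learned hyperparameters
```

    A compositional model search, as studied in the thesis, explores a space of such kernel expressions; the explosion of that space under multiple time series is precisely the scalability problem described above.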
