
    Interpretable statistics for complex modelling: quantile and topological learning

    As the complexity of our data has increased exponentially in recent decades, so has our need for interpretable features. This thesis revolves around two paradigms for approaching this quest for insight. In the first part we focus on parametric models, where the problem of interpretability can be seen as a "parametrization selection". We introduce a quantile-centric parametrization and show the advantages of our proposal in the context of regression, where it bridges the gap between classical generalized linear (mixed) models and increasingly popular quantile methods. The second part of the thesis, concerned with topological learning, tackles the problem from a non-parametric perspective. Because topology can be thought of as a way of characterizing data in terms of their connectivity structure, it allows complex and possibly high-dimensional data to be represented through a few features, such as the number of connected components, loops and voids. We illustrate how Topological Data Analysis, the emerging branch of statistics devoted to recovering topological structures in data, can be exploited for both exploratory and inferential purposes, with special emphasis on kernels that preserve the topological information in the data. Finally, we show with an application how these two approaches can borrow strength from one another in identifying and describing brain activity through fMRI data from the ABIDE project.
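
As a rough illustration of one such feature, the sketch below counts connected components of a point cloud at a single fixed scale: two points are joined when they lie within `scale` of each other. This is not the thesis's method; the function name `count_components`, the `scale` parameter and the sample points are all assumptions made for the example. Persistent homology tracks how this count changes across all scales, which this sketch does not do.

```python
import math


def count_components(points, scale):
    """Count connected components of the graph that links any two
    points whose Euclidean distance is at most `scale` (hypothetical helper)."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= scale:
                parent[find(i)] = find(j)  # merge the two components

    return len({find(i) for i in range(len(points))})


# Illustrative data: two tight pairs of points that are far from each other.
cloud = [(0.0, 0.0), (0.5, 0.0), (5.0, 5.0), (5.4, 5.0)]
for scale in (0.1, 1.0, 10.0):
    print(scale, count_components(cloud, scale))
```

At a very small scale every point is its own component, at an intermediate scale the two pairs appear, and at a large scale everything merges into one.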

    Geometric Cross-Modal Comparison of Heterogeneous Sensor Data

    In this work, we address the problem of cross-modal comparison of aerial data streams. A variety of simulated automobile trajectories are sensed using two different modalities: full-motion video, and radio-frequency (RF) signals received by detectors at various locations. The information represented by the two modalities is compared using self-similarity matrices (SSMs) corresponding to time-ordered point clouds in the feature spaces of each of these data sources; we note that these feature spaces can be of entirely different scale and dimensionality. Several metrics for comparing SSMs are explored, including a recent time-warping technique that can simultaneously handle local time warping and partial matches, while also controlling for the change in geometry between the feature spaces of the two modalities. We note that this technique is quite general, and does not depend on the choice of modalities. In this particular setting, we demonstrate that the cross-modal distance between SSMs corresponding to the same trajectory type is smaller than the cross-modal distance between SSMs corresponding to distinct trajectory types, and we formalize this observation via precision-recall metrics in experiments. Finally, we comment on promising implications of these ideas for future integration into multiple-hypothesis tracking systems. Comment: 10 pages, 13 figures, Proceedings of IEEE Aeroconf 201
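
To make the SSM idea concrete, here is a minimal sketch that builds a self-similarity matrix for each modality and compares the two after rescaling. It is an assumption-laden toy: the function names, the max-normalization and the Frobenius comparison are chosen for illustration and are not the time-warping technique described in the abstract. It also assumes both streams have the same number of time samples.

```python
import math


def self_similarity_matrix(points):
    """Pairwise Euclidean distances between time-ordered samples."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]


def normalize(ssm):
    """Divide by the largest entry so matrices from differently scaled
    feature spaces become comparable (an illustrative choice)."""
    peak = max((v for row in ssm for v in row), default=0.0)
    if peak == 0:
        return [row[:] for row in ssm]
    return [[v / peak for v in row] for row in ssm]


def frobenius_distance(a, b):
    """Entrywise L2 distance between two equally sized matrices."""
    return math.sqrt(sum((x - y) ** 2
                         for row_a, row_b in zip(a, b)
                         for x, y in zip(row_a, row_b)))


# Hypothetical data: the same straight-line motion seen in a 2-D feature
# space and in a 3-D feature space with a different scale.
video = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
rf = [(0.0, 0.0, 0.0), (0.0, 10.0, 0.0), (0.0, 20.0, 0.0)]

d = frobenius_distance(normalize(self_similarity_matrix(video)),
                       normalize(self_similarity_matrix(rf)))
print(d)
```

Because an SSM depends only on distances within one stream, the two modalities never need to share a coordinate system, which is why feature spaces of different dimensionality can still be compared.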

    Advanced Statistical Methods for Atomic-Level Quantification of Multi-Component Alloys

    This thesis comprises a collection of papers whose common theme is data analysis of high entropy alloys (HEAs). The experimental technique used to view these alloys at the nano-scale produces a dataset that, while composed of approximately 10^7 atoms, is corrupted by observational noise and sparsity. Our goal is to develop statistical methods to quantify the atomic structure of these materials. Understanding the atomic structure of these materials involves three parts: (1) determining the crystal structure of the material; (2) finding the optimal transformation onto a reference structure; (3) finding the optimal matching between structures and the lattice constant. By identifying these elements, we may map a noisy and sparse representation of an HEA onto its reference structure and determine the probabilities of different elemental types that are immediately adjacent, i.e., first neighbors, or are one level removed, i.e., second neighbors. With these elemental descriptors of a material, researchers may then develop interaction potentials for molecular dynamics simulations and make accurate predictions about these novel metallic alloys.

    Shape-based Feature Engineering for Solar Flare Prediction

    Solar flares are caused by magnetic eruptions in active regions (ARs) on the surface of the Sun. These events can have significant impacts on human activity, many of which can be mitigated with enough advance warning from good forecasts. To date, machine learning-based flare-prediction methods have employed physics-based attributes of the AR images as features; more recently, some work has used features deduced automatically by deep learning methods (such as convolutional neural networks). We describe a suite of novel shape-based features extracted from magnetogram images of the Sun using the tools of computational topology and computational geometry. We evaluate these features in the context of a multi-layer perceptron (MLP) neural network and compare their performance against the traditional physics-based attributes. We show that these abstract shape-based features outperform the features chosen by human experts, and that a combination of the two feature sets improves the forecasting capability even further. Comment: To be published in Proceedings for Innovative Applications of Artificial Intelligence Conference 202

    Topological Data Analysis in Sub-cellular Motion Reconstruction and Filament Networks Classification

    Topological Data Analysis is a powerful tool for image data analysis. In this dissertation, we study cell physiology through the sub-cellular motions of organelles and the generation process of filament networks, relying on the topology of the cellular image data. We first develop a novel, automated algorithm that tracks organelle movements and reconstructs their trajectories on stacks of microscopy image data. Our tracking method proceeds in three steps: (i) identification, (ii) localization, and (iii) linking, and does not assume a specific motion model. This method combines topological data analysis principles with Ensemble Kalman Filtering in the computation of the associated nerve during the linking step. Moreover, we demonstrate the success of our method in several applications. We then study filament networks as a classification problem and propose a distance-based classifier. This algorithm combines topological data analysis with a supervised machine learning framework and is built on persistence diagrams of the data. We adopt a new metric, the dcp distance, on the space of persistence diagrams, and show that it is useful in capturing the geometric differences between filament networks. Furthermore, our classifier classifies filament networks with high accuracy.

    Statistical Parameter Selection for Clustering Persistence Diagrams

    In urgent decision-making applications, ensemble simulations are an important way to determine different outcome scenarios based on currently available data. In this paper, we analyze the output of ensemble simulations by considering so-called persistence diagrams, which are reduced representations of the original data, motivated by the extraction of topological features. Based on a recently published progressive algorithm for the clustering of persistence diagrams, we determine the optimal number of clusters, and therefore the number of significantly different outcome scenarios, by minimizing established statistical score functions. Furthermore, we present a proof-of-concept prototype implementation of the statistical selection of the number of clusters and provide the results of an experimental study in which this implementation was applied to real-world ensemble data sets.