Interpretable statistics for complex modelling: quantile and topological learning
As the complexity of our data has increased exponentially in recent decades, so has our
need for interpretable features. This thesis revolves around two paradigms to approach
this quest for insights.
In the first part we focus on parametric models, where the problem of interpretability
can be seen as a “parametrization selection”. We introduce a quantile-centric
parametrization and we show the advantages of our proposal in the context of regression,
where it bridges the gap between classical generalized linear (mixed)
models and increasingly popular quantile methods.
The second part of the thesis, concerned with topological learning, tackles the
problem from a non-parametric perspective. As topology can be thought of as a way
of characterizing data in terms of their connectivity structure, it allows complex and
possibly high-dimensional data to be represented through a few features, such as the number of
connected components, loops and voids. We illustrate how the emerging branch of
statistics devoted to recovering topological structures in the data, Topological Data
Analysis, can be exploited for both exploratory and inferential purposes, with a special
emphasis on kernels that preserve the topological information in the data.
Finally, we show with an application how these two approaches can borrow strength
from one another in the identification and description of brain activity through fMRI
data from the ABIDE project.
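As a rough illustration of one of the topological features mentioned above, the sketch below counts the connected components of a point cloud at a fixed scale. It is not taken from the thesis: the function name `count_components`, the scale parameter `eps`, and the toy data are assumptions made for this example, and it covers only components, not loops or voids.

```python
import math
from itertools import combinations

def count_components(points, eps):
    """Count connected components of the graph that links points at distance <= eps."""
    parent = list(range(len(points)))

    def find(i):
        # Walk up to the root of i's set, shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if math.dist(points[i], points[j]) <= eps:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[ri] = rj  # merge the two sets

    return len({find(i) for i in range(len(points))})

# Hypothetical data: two well-separated clusters of three points each.
cloud = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
         (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]

for eps in (0.05, 0.2, 10.0):
    print(eps, count_components(cloud, eps))
```

Tracking how this count changes as `eps` grows is the intuition behind persistence: the two clusters survive over a wide range of scales, while the six isolated points merge almost immediately.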
Geometric Cross-Modal Comparison of Heterogeneous Sensor Data
In this work, we address the problem of cross-modal comparison of aerial data
streams. A variety of simulated automobile trajectories are sensed using two
different modalities: full-motion video, and radio-frequency (RF) signals
received by detectors at various locations. The information represented by the
two modalities is compared using self-similarity matrices (SSMs) corresponding
to time-ordered point clouds in feature spaces of each of these data sources;
we note that these feature spaces can be of entirely different scale and
dimensionality. Several metrics for comparing SSMs are explored, including a
cutting-edge time-warping technique that can simultaneously handle local time
warping and partial matches, while also controlling for the change in geometry
between feature spaces of the two modalities. We note that this technique is
quite general, and does not depend on the choice of modalities. In this
particular setting, we demonstrate that the cross-modal distance between SSMs
corresponding to the same trajectory type is smaller than the cross-modal
distance between SSMs corresponding to distinct trajectory types, and we
formalize this observation via precision-recall metrics in experiments.
Finally, we comment on promising implications of these ideas for future
integration into multiple-hypothesis tracking systems.
Comment: 10 pages, 13 figures, Proceedings of IEEE Aeroconf 201
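The idea of comparing modalities through self-similarity matrices can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's method: the helpers `ssm` and `frobenius_gap` are hypothetical names, the trajectories are made up, both sequences are assumed to have the same length, and a plain Frobenius difference stands in for the time-warping technique described above.

```python
import math

def ssm(points):
    """Pairwise Euclidean distances of a time-ordered point cloud, scaled to [0, 1]."""
    d = [[math.dist(p, q) for q in points] for p in points]
    peak = max(max(row) for row in d)
    if peak == 0:
        return d  # all points coincide; nothing to rescale
    return [[v / peak for v in row] for row in d]

def frobenius_gap(a, b):
    """Frobenius norm of the difference between two equally sized matrices."""
    return math.sqrt(sum((x - y) ** 2
                         for row_a, row_b in zip(a, b)
                         for x, y in zip(row_a, row_b)))

# The same straight-line motion seen in a 2-D and a rescaled 3-D feature space.
line_2d = [(float(t), 0.0) for t in range(5)]
line_3d = [(10.0 * t, 10.0 * t, 10.0 * t) for t in range(5)]
# A different motion: out and back along a line.
back_2d = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (1.0, 0.0), (0.0, 0.0)]

same = frobenius_gap(ssm(line_2d), ssm(line_3d))
diff = frobenius_gap(ssm(line_2d), ssm(back_2d))
print(same, diff)
```

Because each SSM only records distances within its own feature space, the 2-D and 3-D versions of the straight line produce nearly identical matrices after rescaling, while the out-and-back trajectory does not.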
Advanced Statistical Methods for Atomic-Level Quantification of Multi-Component Alloys
This thesis comprises a collection of papers whose common theme is data analysis of high entropy alloys (HEAs). The experimental technique used to view these alloys at the nano-scale produces a dataset that, while comprising approximately 10^7 atoms, is corrupted by observational noise and sparsity. Our goal is to develop statistical methods to quantify the atomic structure of these materials. Understanding the atomic structure of these materials involves three parts: (1) determining the crystal structure of the material; (2) finding the optimal transformation onto a reference structure; and (3) finding the optimal matching between structures and the lattice constant. By identifying these elements, we may map a noisy and sparse representation of an HEA onto its reference structure and determine the probabilities of different elemental types being either immediately adjacent (first neighbors) or one level removed (second neighbors). With these elemental descriptors of a material, researchers may then develop interaction potentials for molecular dynamics simulations and make accurate predictions about these novel metallic alloys.
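The final step, estimating how often pairs of element types occur as first neighbors, can be illustrated with a toy example. This sketch assumes the atoms have already been mapped onto a reference lattice; the function `neighbor_pair_probabilities`, the `cutoff` parameter, and the four-atom data set are hypothetical and are not drawn from the thesis.

```python
import math
from collections import Counter
from itertools import combinations

def neighbor_pair_probabilities(atoms, cutoff):
    """Empirical probability of each unordered element pair among atoms within `cutoff`."""
    counts = Counter()
    for (pos_a, elem_a), (pos_b, elem_b) in combinations(atoms, 2):
        if math.dist(pos_a, pos_b) <= cutoff:
            # Sort the labels so that ("Ni", "Fe") and ("Fe", "Ni") share one key.
            counts[tuple(sorted((elem_a, elem_b)))] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pair: n / total for pair, n in counts.items()}

# Hypothetical toy lattice: a unit square with alternating element types.
atoms = [((0.0, 0.0, 0.0), "Fe"), ((1.0, 0.0, 0.0), "Ni"),
         ((0.0, 1.0, 0.0), "Ni"), ((1.0, 1.0, 0.0), "Fe")]

# A cutoff just above the lattice spacing keeps edges and drops diagonals.
probs = neighbor_pair_probabilities(atoms, cutoff=1.1)
print(probs)
```

Raising the cutoff to include the next shell of distances would give the corresponding second-neighbor statistics, provided the first-neighbor pairs are excluded from that count.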
Shape-based Feature Engineering for Solar Flare Prediction
Solar flares are caused by magnetic eruptions in active regions (ARs) on the
surface of the sun. These events can have significant impacts on human
activity, many of which can be mitigated with enough advance warning from good
forecasts. To date, machine learning-based flare-prediction methods have
employed physics-based attributes of the AR images as features; more recently,
there has been some work that uses features deduced automatically by deep
learning methods (such as convolutional neural networks). We describe a suite
of novel shape-based features extracted from magnetogram images of the Sun
using the tools of computational topology and computational geometry. We
evaluate these features in the context of a multi-layer perceptron (MLP) neural
network and compare their performance against the traditional physics-based
attributes. We show that these abstract shape-based features outperform the
features chosen by the human experts, and that a combination of the two feature
sets improves the forecasting capability even further.
Comment: To be published in Proceedings for Innovative Applications of
Artificial Intelligence Conference 202
Topological Data Analysis in Sub-cellular Motion Reconstruction and Filament Networks Classification
Topological Data Analysis is a powerful tool in image data analysis. In this dissertation, we focus on studying cell physiology through the sub-cellular motions of organelles and the generation process of filament networks, relying on the topology of the cellular image data. We first develop a novel, automated algorithm that tracks organelle movements and reconstructs their trajectories on stacks of microscopy image data. Our tracking method proceeds in three steps: (i) identification, (ii) localization, and (iii) linking, and does not assume a specific motion model. This method combines topological data analysis principles with Ensemble Kalman Filtering in the computation of the associated nerve during the linking step. Moreover, we demonstrate the success of our method in several applications. We then study filament networks as a classification problem and propose a distance-based classifier. This algorithm combines topological data analysis with a supervised machine learning framework and is built on persistence diagrams of the data. We adopt a new metric, the dcp distance, on the space of persistence diagrams, and show that it is useful in capturing the geometric differences between filament networks. Furthermore, our classifier classifies filament networks with a high accuracy rate.
Statistical Parameter Selection for Clustering Persistence Diagrams
In urgent decision-making applications, ensemble simulations are an important way to determine different outcome scenarios based on currently available data. In this paper, we analyze the output of ensemble simulations by considering so-called persistence diagrams, which are reduced representations of the original data, motivated by the extraction of topological features. Based on a recently published progressive algorithm for the clustering of persistence diagrams, we determine the optimal number of clusters, and therefore the number of significantly different outcome scenarios, by minimizing established statistical score functions. Furthermore, we present a proof-of-concept prototype implementation of the statistical selection of the number of clusters and provide the results of an experimental study in which this implementation has been applied to real-world ensemble data sets.