    Effective data reduction algorithm for topological data analysis

    One of the most interesting tools to have recently entered the data science toolbox is topological data analysis (TDA). With the explosion of available data sizes and dimensions, identifying and extracting the underlying structure of a given dataset is a fundamental challenge in data science, and TDA provides a methodology for analyzing the shape of a dataset using tools and perspectives from algebraic topology. However, its computational complexity quickly makes it infeasible to process large datasets, especially high-dimensional ones. Here, we introduce a preprocessing strategy called the Characteristic Lattice Algorithm (CLA), which allows users to reduce the size of a given dataset as desired while maintaining its geometric and topological features, in order to make the computation of TDA feasible or to shorten its computation time. In addition, we derive a stability theorem and an upper bound on the barcode errors for CLA based on the bottleneck distance. Comment: 13 pages, 10 figures, 2 tables
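
The abstract does not spell out the steps of CLA, so the following is only a generic lattice-based reduction written for illustration, not the authors' algorithm. It assumes a cubical lattice of side `delta` and replaces all points falling in the same cell by that cell's center; the function name `lattice_reduce` and the choice of the cell center as representative are assumptions.

```python
import math


def lattice_reduce(points, delta):
    """Replace each point by the center of the lattice cell that contains it.

    points: iterable of equal-length coordinate tuples
    delta:  side length of the cubical lattice cells (assumed parameter)
    Returns the sorted list of centers of all occupied cells.
    """
    # Index of the cell containing each point, one integer per coordinate.
    cells = {tuple(math.floor(x / delta) for x in p) for p in points}
    # One representative (the cell center) per occupied cell.
    return sorted(tuple((k + 0.5) * delta for k in cell) for cell in cells)


if __name__ == "__main__":
    cloud = [(0.1, 0.1), (0.2, 0.3), (1.6, 0.1)]
    print(lattice_reduce(cloud, 1.0))
```

In this sketch every point lies within half a cell diagonal, delta * sqrt(d) / 2 in dimension d, of its representative, so a larger `delta` gives a smaller output at the cost of a coarser approximation of the shape.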

    Topologically faithful image segmentation via induced matching of persistence barcodes

    Image segmentation is a widely researched field in which neural networks find broad application across many areas of technology. Some of the most popular approaches to training segmentation networks employ loss functions that optimize pixel overlap, an objective that is insufficient for many segmentation tasks. In recent years, the limitations of these losses have fueled a growing interest in topology-aware methods, which aim to recover the correct topology of the segmented structures. However, so far, none of the existing approaches achieves a spatially correct matching between the topological features of ground truth and prediction. In this work, we propose the first topologically and feature-wise accurate metric and loss function for supervised image segmentation, which we term Betti matching. We show how induced matchings guarantee the spatially correct matching between barcodes in a segmentation setting. Furthermore, we propose an efficient algorithm to compute the Betti matching of images. We show that the Betti matching error is an interpretable metric for evaluating the topological correctness of segmentations, and that it is more sensitive than the well-established Betti number error. Moreover, the differentiability of the Betti matching loss enables its use as a loss function. It improves the topological performance of segmentation networks across six diverse datasets while preserving the volumetric performance. Our code is available at https://github.com/nstucki/Betti-matching
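
To make the baseline concrete, here is a minimal sketch of the Betti number error restricted to dimension 0 (connected components) for 2D binary masks. It does not implement Betti matching; the function names and the choice of 4-connectivity are assumptions made for illustration.

```python
from collections import deque


def betti0(mask):
    """Count 4-connected foreground components of a 2D binary mask."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    components = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                # New component found: flood-fill it with a breadth-first search.
                components += 1
                seen[r][c] = True
                queue = deque([(r, c)])
                while queue:
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return components


def betti0_error(prediction, ground_truth):
    """Absolute difference in the number of connected components."""
    return abs(betti0(prediction) - betti0(ground_truth))


if __name__ == "__main__":
    # One component each, but in different places: the error is still 0.
    print(betti0_error([[1, 0, 0]], [[0, 0, 1]]))
```

The usage example shows why a count-based error is insensitive to location: the two masks share no foreground pixel, yet their component counts agree.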

    A Geometric Perspective on Sparse Filtrations

    We present a geometric perspective on sparse filtrations used in topological data analysis. This new perspective leads to much simpler proofs, while also being more general, applying equally to Rips filtrations and Cech filtrations for any convex metric. We also give an algorithm for finding the simplices in such a filtration and prove that vertex removal can be implemented as a sequence of elementary edge collapses.

    Computing multiparameter persistent homology through a discrete Morse-based approach

    Persistent homology allows for tracking topological features, like loops, holes and their higher-dimensional analogues, along a single-parameter family of nested shapes. Computing descriptors for complex data characterized by multiple parameters is becoming a major challenge in several applications, including physics, chemistry, medicine, and geography. Multiparameter persistent homology generalizes persistent homology to allow for the exploration and analysis of shapes endowed with multiple filtering functions. Still, computational constraints prevent multiparameter persistent homology from being a feasible tool for analyzing large data sets. We consider discrete Morse theory as a strategy to reduce the computation of multiparameter persistent homology by working on a reduced dataset. We propose a new preprocessing algorithm, well suited for parallel and distributed implementations, and we provide the first evaluation of the impact of multiparameter persistent homology on computations.
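
As a small illustration of the single-parameter case described in the first sentence, the sketch below computes 0-dimensional persistence pairs for the sublevel sets of a function sampled on a line, using union-find and the elder rule. It is not the multiparameter or discrete Morse-based method of the paper; the function name and the 1D setting are assumptions.

```python
def sublevel_persistence_0d(values):
    """Return sorted (birth, death) pairs of connected components.

    Samples enter the filtration in order of increasing value; neighbours
    on the line are joined as soon as both are present.
    """
    n = len(values)
    parent = [None] * n          # None means "not yet in the filtration"
    birth = {}                   # root index -> birth value of its component

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    pairs = []
    for i in sorted(range(n), key=lambda j: values[j]):
        parent[i] = i
        birth[i] = values[i]
        for j in (i - 1, i + 1):
            if 0 <= j < n and parent[j] is not None:
                ri, rj = find(i), find(j)
                if ri == rj:
                    continue
                # Elder rule: the component born later dies at this value.
                young, old = (ri, rj) if birth[ri] > birth[rj] else (rj, ri)
                if birth[young] < values[i]:      # skip zero-length bars
                    pairs.append((birth[young], values[i]))
                parent[young] = old
    # Components that never merge into an older one live forever.
    for root in {find(i) for i in range(n)}:
        pairs.append((birth[root], float("inf")))
    return sorted(pairs)


if __name__ == "__main__":
    print(sublevel_persistence_0d([2, 5, 1, 4, 3]))
```

For the sample input, the local minima 2 and 3 give bars that end at the peaks 5 and 4 separating them from the global minimum 1, whose bar never ends.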

    An Encoding for Order-Preserving Matching

    Encoding data structures store enough information to answer the queries they are meant to support but not enough to recover their underlying datasets. In this paper we give the first encoding data structure for the challenging problem of order-preserving pattern matching. This problem was introduced only a few years ago but has already attracted significant attention because of its applications in data analysis. Two strings are said to be an order-preserving match if the relative order of their characters is the same: e.g., (4, 1, 3, 2) and (10, 3, 7, 5) are an order-preserving match. We show how, given a string S[1..n] over an arbitrary alphabet of size sigma and a constant c >= 1, we can build an O(n log log n)-bit encoding such that later, given a pattern P[1..m] with m >= log^c n, we can return the number of order-preserving occurrences of P in S in O(m) time. Within the same time bound we can also return the starting position of some order-preserving match for P in S (if such a match exists). We prove that our space bound is within a constant factor of optimal if log(sigma) = Omega(log log n); our query time is optimal if log(sigma) = Omega(log n). Our space bound contrasts with the Omega(n log n) bits needed in the worst case to store S itself, an index for order-preserving pattern matching with no restrictions on the pattern length, or an index for standard pattern matching even with restrictions on the pattern length. Moreover, we can build our encoding knowing only how each character compares to O(log^c n) neighbouring characters.
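
The definition of an order-preserving match can be checked directly with a naive sketch like the one below. It is a brute-force illustration of the problem statement only, not the O(n log log n)-bit encoding or the O(m)-time query of the paper; the function names are assumptions, and equal characters are treated as sharing a rank.

```python
def order_signature(seq):
    """Map each element to its rank among the distinct values of seq."""
    ranks = {value: rank for rank, value in enumerate(sorted(set(seq)))}
    return [ranks[value] for value in seq]


def is_order_preserving_match(a, b):
    """True if a and b have the same length and the same relative order."""
    return len(a) == len(b) and order_signature(a) == order_signature(b)


def count_op_occurrences(text, pattern):
    """Count windows of text that order-match pattern (brute force)."""
    m = len(pattern)
    target = order_signature(pattern)
    return sum(
        1
        for i in range(len(text) - m + 1)
        if order_signature(text[i:i + m]) == target
    )


if __name__ == "__main__":
    # The example from the abstract.
    print(is_order_preserving_match((4, 1, 3, 2), (10, 3, 7, 5)))  # True
```

Each window is re-ranked from scratch here, so the count takes roughly O(n m log m) time and needs the full text, which is exactly the cost the encoding in the abstract is designed to avoid.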

    4D Seismic History Matching Incorporating Unsupervised Learning

    The work presented in this paper focuses on the history matching of reservoirs by integrating 4D seismic data into the inversion process using machine learning techniques. A new integrated scheme for the reconstruction of petrophysical properties with a modified Ensemble Smoother with Multiple Data Assimilation (ES-MDA) in a synthetic reservoir is proposed. The permeability field inside the reservoir is parametrised with an unsupervised learning approach, namely K-means with Singular Value Decomposition (K-SVD). This is combined with the Orthogonal Matching Pursuit (OMP) technique, which is typical of sparsity-promoting regularisation schemes. Moreover, seismic attributes, in particular acoustic impedance, are parametrised with the Discrete Cosine Transform (DCT). This novel combination of techniques from machine learning, sparsity regularisation, seismic imaging and history matching aims to address the ill-posedness of the inversion of historical production data efficiently using ES-MDA. In the numerical experiments provided, I demonstrate that these sparse representations of the petrophysical properties and the seismic attributes make it possible to match the true production data more closely and to quantify the propagating waterfront better than more traditional methods that do not use comparable parametrisation techniques.
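
The idea of parametrising a field by a few DCT coefficients can be sketched on a toy 1D signal as follows. This is only an assumed illustration of truncated DCT parametrisation: the paper applies the DCT to acoustic impedance, and neither its dimensionality, its truncation rule, nor the K-SVD, OMP and ES-MDA components are reproduced here. All function names are hypothetical.

```python
import math


def dct(signal):
    """Orthonormal DCT-II of a real sequence."""
    n = len(signal)
    coeffs = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        total = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, x in enumerate(signal))
        coeffs.append(scale * total)
    return coeffs


def idct(coeffs):
    """Inverse of dct (orthonormal DCT-III)."""
    n = len(coeffs)
    signal = []
    for i in range(n):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
            total += scale * c * math.cos(math.pi * (i + 0.5) * k / n)
        signal.append(total)
    return signal


def keep_largest(coeffs, k):
    """Zero all but the k coefficients of largest magnitude."""
    order = sorted(range(len(coeffs)), key=lambda j: abs(coeffs[j]), reverse=True)
    keep = set(order[:k])
    return [c if j in keep else 0.0 for j, c in enumerate(coeffs)]


if __name__ == "__main__":
    field = [1.0, 1.2, 1.1, 3.0, 3.1, 2.9, 1.0, 0.9]
    sparse = keep_largest(dct(field), 3)      # 3 parameters instead of 8
    print([round(v, 2) for v in idct(sparse)])
```

An inversion scheme would then update only the retained coefficients rather than every grid value, which is one way a sparse representation can reduce the number of unknowns.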