27 research outputs found

    Persistent Homology in Multivariate Data Visualization

    Technological advances of recent years have changed the way research is done. When describing complex phenomena, it is now possible to measure and model a myriad of different aspects pertaining to them. This increasing number of variables, however, poses significant challenges for the visual analysis and interpretation of such multivariate data. Yet, the effective visualization of structures in multivariate data is of paramount importance for building models, forming hypotheses, and understanding intrinsic properties of the underlying phenomena. This thesis provides novel visualization techniques that advance the field of multivariate visual data analysis by helping represent and comprehend the structure of high-dimensional data. In contrast to approaches that focus on visualizing multivariate data directly or by means of their geometrical features, the methods developed in this thesis focus on their topological properties. More precisely, these methods provide structural descriptions that are driven by persistent homology, a technique from the emerging field of computational topology. Such descriptions are developed in two separate parts of this thesis. The first part deals with the qualitative visualization of topological features in multivariate data. It presents novel visualization methods that directly depict topological information, thus permitting the comparison of structural features in a qualitative manner. The techniques described in this part serve as low-dimensional representations that make the otherwise high-dimensional topological features accessible. We show how to integrate them into data analysis workflows based on clustering in order to obtain more information about the underlying data. The efficacy of such combined workflows is demonstrated by analysing complex multivariate data sets from cultural heritage and political science, for example, whose structures are hidden to common visualization techniques. 
The second part of this thesis is concerned with the quantitative visualization of topological features. It describes novel methods that measure different aspects of multivariate data in order to provide quantifiable information about them. Here, the topological characteristics serve as a feature descriptor. Using these descriptors, the visualization techniques in this part focus on augmenting and improving existing data analysis processes. Among others, they deal with the visualization of high-dimensional regression models, the visualization of errors in embeddings of multivariate data, as well as the assessment and visualization of the results of different clustering algorithms. All the methods presented in this thesis are evaluated and analysed on different data sets in order to show their robustness. This thesis demonstrates that the combination of geometrical and topological methods may support, complement, and surpass existing approaches for multivariate visual data analysis.

    Filtration Curves for Graph Representation

    The two predominant approaches to graph comparison in recent years are based on (i) enumerating matching subgraphs or (ii) comparing neighborhoods of nodes. In this work, we complement these two perspectives with a third way of representing graphs: using filtration curves from topological data analysis that capture both edge weight information and global graph structure. Filtration curves are highly efficient to compute and lead to expressive representations of graphs, which we demonstrate on graph classification benchmark datasets. Our work opens the door to a new form of graph representation in data mining.
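As a rough illustration of the idea, a filtration curve can be obtained by adding edges in order of increasing weight and recording a graph descriptor at each threshold. The sketch below assumes the descriptor is the number of connected components. The abstract does not name a descriptor, so this choice, the function name `filtration_curve`, and the edge format `(u, v, weight)` are all assumptions made for illustration.

```python
def filtration_curve(num_nodes, weighted_edges):
    """Return [(threshold, component_count), ...] for increasing edge weights.

    Assumption: nodes are integers 0..num_nodes-1 and each edge is a
    (u, v, weight) triple. The descriptor tracked here (number of connected
    components) is an illustrative choice, not taken from the abstract.
    """
    parent = list(range(num_nodes))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    components = num_nodes
    curve = []
    for u, v, w in sorted(weighted_edges, key=lambda e: e[2]):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            components -= 1
        if curve and curve[-1][0] == w:
            # Several edges share this weight: keep one point per threshold.
            curve[-1] = (w, components)
        else:
            curve.append((w, components))
    return curve


edges = [(0, 1, 1.0), (1, 2, 2.0), (2, 3, 2.0), (0, 2, 3.0)]
curve = filtration_curve(4, edges)
```

Two graphs can then be compared by evaluating their curves on a common set of thresholds, which is one plausible way to turn the curves into fixed-length feature vectors.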

    A Survey of Topological Machine Learning Methods

    The last decade saw an enormous boost in the field of computational topology: methods and concepts from algebraic and differential topology, formerly confined to the realm of pure mathematics, have demonstrated their utility in numerous areas such as computational biology, personalised medicine, and time-dependent data analysis, to name a few. The newly-emerging domain comprising topology-based techniques is often referred to as topological data analysis (TDA). Next to their applications in the aforementioned areas, TDA methods have also proven to be effective in supporting, enhancing, and augmenting both classical machine learning and deep learning models. In this paper, we review the state of the art of a nascent field we refer to as “topological machine learning,” i.e., the successful symbiosis of topology-based methods and machine learning algorithms, such as deep neural networks. We identify common threads, current applications, and future challenges.

    A Persistent Weisfeiler–Lehman Procedure for Graph Classification

    The Weisfeiler-Lehman graph kernel exhibits competitive performance in many graph classification tasks. However, its subtree features are not able to capture connected components and cycles, topological features known for characterising graphs. To extract such features, we leverage propagated node label information and transform unweighted graphs into metric ones. This permits us to augment the subtree features with topological information obtained using persistent homology, a concept from topological data analysis. Our method, which we formalise as a generalisation of Weisfeiler-Lehman subtree features, exhibits favourable classification accuracy and its improvements in predictive performance are mainly driven by including cycle information.
    ISSN: 2640-349
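The subtree features referred to in this abstract build on the standard Weisfeiler-Lehman relabelling step, in which each node's label is combined with the sorted multiset of its neighbours' labels. The following is a minimal sketch of that single step only. It does not include the persistent homology extension the abstract describes, and the function name `wl_iteration` and the dictionary-based graph format are assumptions for illustration.

```python
def wl_iteration(adjacency, labels):
    """Perform one Weisfeiler-Lehman relabelling step.

    Assumption: `adjacency` maps each node to a list of its neighbours and
    `labels` maps each node to a sortable label (integers here).
    """
    # Signature = own label plus the sorted multiset of neighbour labels.
    signatures = {
        node: (labels[node], tuple(sorted(labels[n] for n in neighbours)))
        for node, neighbours in adjacency.items()
    }
    # Compress each distinct signature to a small integer label.
    compressed = {
        sig: index for index, sig in enumerate(sorted(set(signatures.values())))
    }
    return {node: compressed[sig] for node, sig in signatures.items()}


# Path graph 0 - 1 - 2 with identical initial labels.
path = {0: [1], 1: [0, 2], 2: [1]}
new_labels = wl_iteration(path, {0: 0, 1: 0, 2: 0})
```

Note that the compressed labels here are local to one graph. Comparing several graphs would require a signature dictionary shared across all of them.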

    Topological Autoencoders

    We propose a novel approach for preserving topological structures of the input space in latent representations of autoencoders. Using persistent homology, a technique from topological data analysis, we calculate topological signatures of both the input and latent space to derive a topological loss term. Under weak theoretical assumptions, we construct this loss in a differentiable manner, such that the encoding learns to retain multi-scale connectivity information. We show that our approach is theoretically well-founded and that it exhibits favourable latent representations on a synthetic manifold as well as on real-world image data sets, while preserving low reconstruction errors.
    ISSN: 2640-349
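To make "multi-scale connectivity information" more concrete: for a point cloud under the Vietoris-Rips filtration, the scales at which connected components merge coincide with the edge lengths of a minimum spanning tree. The sketch below computes those merge scales with Kruskal's algorithm. It illustrates only this connectivity signature, not the differentiable loss from the abstract, and the function name `zero_dim_persistence` is hypothetical.

```python
import math
from itertools import combinations


def zero_dim_persistence(points):
    """Return the scales at which connected components of a point cloud merge.

    These are the minimum-spanning-tree edge lengths, found with Kruskal's
    algorithm. Assumption: `points` is a list of equal-length coordinate tuples.
    """
    n = len(points)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # All pairwise distances, shortest first.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    merge_scales = []
    for distance, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            merge_scales.append(distance)
    return merge_scales


scales = zero_dim_persistence([(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)])
```

Comparing such merge scales between an input batch and its latent codes is one way to picture what a connectivity-preserving loss might penalise.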

    Stable topological signatures for metric trees through graph approximations

    The rising field of Topological Data Analysis (TDA) provides a new approach to learning from data through persistence diagrams, which are topological signatures that quantify topological properties of data in a comparable manner. For point clouds, these diagrams are often derived from the Vietoris-Rips filtration, based on the metric with which the data is equipped, which allows one to deduce topological patterns such as components and cycles of the underlying space. In metric trees, these diagrams often fail to capture other crucial topological properties, such as the leaves and multifurcations that are present. Prior methods and results for persistent homology attempting to overcome this issue mainly target Rips graphs, which are often unfavorable in the case of non-uniform density across the point cloud. We therefore introduce a new theoretical foundation for learning a wider variety of topological patterns through any given graph. Given particular powerful functions defining persistence diagrams to summarize topological patterns, including the normalized centrality or eccentricity, we prove a new stability result, explicitly bounding the bottleneck distance between the true and empirical diagrams for metric trees. This bound is tight if the metric distortion obtained through the graph and its maximal edge-weight are small. Through a case study of gene expression data, we demonstrate that our newly introduced diagrams provide novel quality measures and insights into cell trajectory inference.
    ISSN: 0167-8655; ISSN: 1872-734

    Enhancing statistical power in temporal biomarker discovery through representative shapelet mining

    Motivation: Temporal biomarker discovery in longitudinal data is based on detecting reoccurring trajectories, the so-called shapelets. The search for shapelets requires considering all subsequences in the data. While the accompanying issue of multiple testing has been mitigated in previous work, the redundancy and overlap of the detected shapelets result in an a priori unbounded number of highly similar and structurally meaningless shapelets. As a consequence, current temporal biomarker discovery methods are impractical and underpowered.
    Results: We find that the pre- or post-processing of shapelets does not sufficiently increase the power and practical utility. Consequently, we present a novel method for temporal biomarker discovery: Statistically Significant Submodular Subset Shapelet Mining (S5M), which retrieves short subsequences that (i) occur in the data, (ii) are statistically significantly associated with the phenotype, and (iii) are of manageable quantity while maximizing structural diversity. Structural diversity is achieved by pruning non-representative shapelets via submodular optimization. This increases the statistical power and utility of S5M compared to state-of-the-art approaches on simulated and real-world datasets. For patients admitted to the intensive care unit (ICU) showing signs of severe organ failure, we find temporal patterns in the sequential organ failure assessment score that are associated with in-ICU mortality.
    Availability and implementation: S5M is an option in the python package of S3M: github.com/BorgwardtLab/S3M.
    ISSN: 1367-4803; ISSN: 1460-205
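For readers unfamiliar with shapelets, a common way to decide whether a shapelet "occurs" in a time series is to take the minimum distance between the shapelet and every subsequence of the same length. The sketch below shows that generic building block with Euclidean distance. It is not the S5M or S3M implementation, and the function name `shapelet_distance` is hypothetical.

```python
import math


def shapelet_distance(series, shapelet):
    """Return the smallest Euclidean distance between `shapelet` and any
    equally long contiguous subsequence of `series`.

    Assumption: both inputs are lists of numbers and the series is at least
    as long as the shapelet.
    """
    m = len(shapelet)
    return min(
        math.sqrt(sum((series[start + k] - shapelet[k]) ** 2 for k in range(m)))
        for start in range(len(series) - m + 1)
    )


series = [0, 1, 2, 3, 2, 1]
exact_match = shapelet_distance(series, [2, 3, 2])
near_match = shapelet_distance(series, [1, 2, 4])
```

A series would then be counted as containing the shapelet when this distance falls below a chosen threshold, with the threshold itself being a modelling decision.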

    Set Functions for Time Series

    Despite the eminent successes of deep neural networks, many architectures are often hard to transfer to irregularly-sampled and asynchronous time series that commonly occur in real-world datasets, especially in healthcare applications. This paper proposes a novel approach for classifying irregularly-sampled time series with unaligned measurements, focusing on high scalability and data efficiency. Our method SeFT (Set Functions for Time Series) is based on recent advances in differentiable set function learning and is extremely parallelizable with a beneficial memory footprint, thus scaling well to large datasets of long time series and online monitoring scenarios. Furthermore, our approach permits quantifying per-observation contributions to the classification outcome. We extensively compare our method with existing algorithms on multiple healthcare time series datasets and demonstrate that it performs competitively whilst significantly reducing runtime.
    ISSN: 2640-349
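The general idea of treating a time series as a set can be sketched as follows: every measurement becomes a (time, value, channel) triple, each triple is embedded independently, and the embeddings are pooled with an order-independent operation. The code below is a toy illustration under those assumptions. The fixed embedding in `encode` stands in for a learned network, and none of the names come from the SeFT code base.

```python
import math


def to_observation_set(series_by_channel):
    """Flatten {channel: [(time, value), ...]} into (time, value, channel) triples.

    Channels may have different lengths and unaligned time stamps.
    """
    return [
        (t, v, channel)
        for channel, samples in series_by_channel.items()
        for t, v in samples
    ]


def encode(observation):
    # Hypothetical fixed embedding; a real model would learn this mapping.
    t, v, channel = observation
    return (math.sin(t), v, float(channel))


def set_summary(observations):
    """Mean-pool the per-observation embeddings (assumes a non-empty set).

    Mean pooling does not depend on the order of the observations.
    """
    embedded = [encode(o) for o in observations]
    count = len(embedded)
    return tuple(sum(dimension) / count for dimension in zip(*embedded))


record = {0: [(0.0, 1.0), (2.5, 3.0)], 1: [(1.0, 10.0)]}
observations = to_observation_set(record)
summary = set_summary(observations)
shuffled_summary = set_summary(list(reversed(observations)))
```

Because the pooling step ignores order, no alignment or imputation of the channels is needed before the summary is computed, which is the property that makes set-based encodings attractive for irregular sampling.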