Progressive Wasserstein Barycenters of Persistence Diagrams
This paper presents an efficient algorithm for the progressive approximation
of Wasserstein barycenters of persistence diagrams, with applications to the
visual analysis of ensemble data. Given a set of scalar fields, our approach
enables the computation of a persistence diagram which is representative of the
set, and which visually conveys the number, data ranges and saliences of the
main features of interest found in the set. Such representative diagrams are
obtained by computing explicitly the discrete Wasserstein barycenter of the set
of persistence diagrams, a notoriously computationally intensive task. In
particular, we revisit efficient algorithms for Wasserstein distance
approximation [12,51] to extend previous work on barycenter estimation [94]. We
present a new fast algorithm, which progressively approximates the barycenter
by iteratively increasing the computation accuracy as well as the number of
persistent features in the output diagram. This progressive strategy drastically
improves convergence in practice and makes it possible to design an interruptible
algorithm capable of respecting computation time constraints. This enables the
approximation of Wasserstein barycenters within interactive times. We present
an application to ensemble clustering where we revisit the k-means algorithm to
exploit our barycenters and compute, within execution time constraints,
meaningful clusters of ensemble data along with their barycenter diagram.
Extensive experiments on synthetic and real-life data sets report that our
algorithm converges to barycenters that are qualitatively meaningful with
regard to the applications, and quantitatively comparable to previous
techniques, while offering an order of magnitude speedup when run until
convergence (without time constraint). Our algorithm can be trivially
parallelized to provide additional speedups in practice on standard
workstations. [...
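To make the underlying metric concrete, the following is a minimal sketch of the 2-Wasserstein distance between two tiny persistence diagrams. It is not the paper's algorithm: it enumerates every assignment by brute force instead of using the approximation schemes the abstract cites, and it assumes a squared Euclidean ground cost. The names `wasserstein2`, `pair_cost`, and `diagonal_cost` are hypothetical.

```python
from itertools import permutations
from typing import List, Tuple

Point = Tuple[float, float]  # (birth, death) of one persistence pair


def diagonal_cost(p: Point) -> float:
    # Squared Euclidean distance from (b, d) to its projection on the diagonal.
    return (p[1] - p[0]) ** 2 / 2.0


def pair_cost(p: Point, q: Point) -> float:
    # Squared Euclidean distance between two off-diagonal points (assumed ground cost).
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def wasserstein2(d1: List[Point], d2: List[Point]) -> float:
    """Exact 2-Wasserstein distance by exhaustive search (tiny diagrams only)."""
    n, m = len(d1), len(d2)
    size = n + m  # each diagram is padded with diagonal slots for the other's points
    cost = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i < n and j < m:
                cost[i][j] = pair_cost(d1[i], d2[j])
            elif i < n:
                cost[i][j] = diagonal_cost(d1[i])  # point of d1 sent to the diagonal
            elif j < m:
                cost[i][j] = diagonal_cost(d2[j])  # point of d2 sent to the diagonal
            # diagonal-to-diagonal matches stay at zero cost
    best = min(
        sum(cost[i][perm[i]] for i in range(size))
        for perm in permutations(range(size))
    )
    return best ** 0.5
```

The factorial cost of the enumeration is exactly what makes approximate and progressive schemes attractive on real diagrams; this sketch is only practical for a handful of points.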
Statistical Parameter Selection for Clustering Persistence Diagrams
In urgent decision-making applications, ensemble simulations are an important way to determine different outcome scenarios based on currently available data. In this paper, we analyze the output of ensemble simulations by considering so-called persistence diagrams, which are reduced representations of the original data, motivated by the extraction of topological features. Based on a recently published progressive algorithm for the clustering of persistence diagrams, we determine the optimal number of clusters, and therefore the number of significantly different outcome scenarios, by minimizing established statistical score functions. Furthermore, we present a proof-of-concept prototype implementation of the statistical selection of the number of clusters and provide the results of an experimental study in which this implementation was applied to real-world ensemble data sets.
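The selection step described above can be illustrated with a minimal sketch: cluster the data for each candidate number of clusters, evaluate a score, and keep the minimizer. Everything here is a stand-in chosen for brevity. Plain 1-D k-means replaces the clustering of persistence diagrams, and the score (within-cluster sum of squares plus a linear penalty on k) is a hypothetical placeholder, not one of the score functions used in the paper. The names `kmeans_1d`, `select_k`, and `penalty` are assumptions.

```python
from typing import Iterable, List, Sequence


def kmeans_1d(data: Sequence[float], k: int, iterations: int = 50) -> float:
    """Run plain Lloyd k-means on 1-D data and return the within-cluster SSE."""
    points = sorted(data)
    n = len(points)
    # Deterministic initialization: spread the initial centers over the sorted data.
    centers = [points[int((i + 0.5) * n / k)] for i in range(k)]
    for _ in range(iterations):
        clusters: List[List[float]] = [[] for _ in range(k)]
        for x in points:
            nearest = min(range(k), key=lambda c: abs(x - centers[c]))
            clusters[nearest].append(x)
        # Keep the previous center when a cluster ends up empty.
        updated = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if updated == centers:
            break
        centers = updated
    return sum(min((x - c) ** 2 for c in centers) for x in points)


def select_k(data: Sequence[float], candidates: Iterable[int], penalty: float = 1.0) -> int:
    """Return the candidate k that minimizes SSE + penalty * k (placeholder score)."""
    scores = {k: kmeans_1d(data, k) + penalty * k for k in candidates}
    return min(scores, key=scores.get)


data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 10.0, 10.1, 10.2]
best_k = select_k(data, range(1, 6))
```

The penalty weight controls how strongly additional clusters are discouraged, so the selected k depends on it; a principled score function removes that free choice.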
Design of Progressive Visualization Systems for Exploring Large-Scale Data
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, February 2020. Jinwook Seo.
Understanding data through interactive visualization, also known as visual analytics, is a common and necessary practice in modern data science. However, as data sizes have increased at unprecedented rates, the computation latency of visualization systems has become a significant hurdle to visual analytics. The goal of this dissertation is to design a series of systems for progressive visual analytics (PVA), a visual analytics paradigm that can provide intermediate results during computation and allow visual exploration of these results, to address the scalability hurdle. To support the interactive exploration of data with billions of records, we first introduce SwiftTuna, an interactive visualization system with scalable visualization and computation components. Our performance benchmark demonstrates that it can handle data with four billion records, giving responsive feedback every few seconds without precomputation. Second, we present PANENE, a progressive algorithm for the Approximate k-Nearest Neighbor (AKNN) problem. PANENE brings useful machine learning methods into visual analytics, which has been challenging due to their long initial latency resulting from AKNN computation. In particular, we accelerate t-Distributed Stochastic Neighbor Embedding (t-SNE), a popular non-linear dimensionality reduction technique, which enables the responsive visualization of data with a few hundred columns. These two contributions address the scalability issues stemming from a large number of rows and columns in data, respectively. Third, from the users' perspective, we focus on improving the trustworthiness of intermediate knowledge gained from uncertain results in PVA. We propose a novel PVA concept, Progressive Visual Analytics with Safeguards, and introduce PVA-Guards, safeguards people can leave on uncertain intermediate knowledge that needs to be verified. 
We also present a proof-of-concept system, ProReveal, designed and developed to integrate seven safeguards into progressive data exploration. Our user study demonstrates that people not only successfully created PVA-Guards on ProReveal but also voluntarily used PVA-Guards to manage the uncertainty of their knowledge. Finally, summarizing the three studies, we discuss design challenges for progressive systems as well as future research agendas for PVA.
CHAPTER 1. Introduction
1.1 Background and Motivation
1.2 Thesis Statement and Research Questions
1.3 Thesis Contributions
1.3.1 Responsive and Incremental Visual Exploration of Large-scale Multidimensional Data
1.3.2 Progressive Computation of Approximate k-Nearest Neighbors and Responsive t-SNE
1.3.3 Progressive Visual Analytics with Safeguards
1.4 Structure of Dissertation
CHAPTER 2. Related Work
2.1 Progressive Visual Analytics
2.1.1 Definitions
2.1.2 System Latency and Human Factors
2.1.3 Users, Tasks, and Models
2.1.4 Techniques, Algorithms, and Systems
2.1.5 Uncertainty Visualization
2.2 Approaches for Scalable Visualization Systems
2.3 The k-Nearest Neighbor (KNN) Problem
2.4 t-Distributed Stochastic Neighbor Embedding
CHAPTER 3. SwiftTuna: Responsive and Incremental Visual Exploration of Large-scale Multidimensional Data
3.1 The SwiftTuna Design
3.1.1 Design Considerations
3.1.2 System Overview
3.1.3 Scalable Visualization Components
3.1.4 Visualization Cards
3.1.5 User Interface and Interaction
3.2 Responsive Querying
3.2.1 Querying Pipeline
3.2.2 Prompt Responses
3.2.3 Incremental Processing
3.3 Evaluation: Performance Benchmark
3.3.1 Study Design
3.3.2 Results and Discussion
3.4 Implementation
3.5 Summary
CHAPTER 4. PANENE: A Progressive Algorithm for Indexing and Querying Approximate k-Nearest Neighbors
4.1 Approximate k-Nearest Neighbor
4.1.1 A Sequential Algorithm
4.1.2 An Online Algorithm
4.1.3 A Progressive Algorithm
4.1.4 Filtered AKNN Search
4.2 k-Nearest Neighbor Lookup Table
4.3 Benchmark
4.3.1 Online and Progressive k-d Trees
4.3.2 k-Nearest Neighbor Lookup Tables
4.4 Applications
4.4.1 Progressive Regression and Density Estimation
4.4.2 Responsive t-SNE
4.5 Implementation
4.6 Discussion
4.7 Summary
CHAPTER 5. ProReveal: Progressive Visual Analytics with Safeguards
5.1 Progressive Visual Analytics with Safeguards
5.1.1 Definition
5.1.2 Examples
5.1.3 Design Considerations
5.2 ProReveal
5.3 Evaluation
5.4 Discussion
5.5 Summary
CHAPTER 6. Discussion
6.1 Lessons Learned
6.2 Limitations
CHAPTER 7. Conclusion
7.1 Thesis Contributions Revisited
7.2 Future Research Agenda
7.3 Final Remarks
Abstract (Korean)
Acknowledgments (Korean)
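The progressive paradigm summarized in the dissertation abstract, where intermediate results are delivered while computation is still running, can be illustrated with a minimal sketch. It is not taken from SwiftTuna, PANENE, or ProReveal. The generator below processes data in chunks and yields a running mean together with a rough 95% margin of error after each chunk, assuming the rows arrive in random order and ignoring any finite-population correction. The name `progressive_mean` and its parameters are hypothetical.

```python
import math
from typing import Iterator, Sequence, Tuple


def progressive_mean(
    values: Sequence[float], chunk_size: int
) -> Iterator[Tuple[int, float, float]]:
    """Yield (rows processed, running mean, approximate 95% margin) per chunk."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's online update)
    for start in range(0, len(values), chunk_size):
        for x in values[start:start + chunk_size]:
            count += 1
            delta = x - mean
            mean += delta / count
            m2 += delta * (x - mean)
        if count > 1:
            std_error = math.sqrt(m2 / (count - 1) / count)
        else:
            std_error = float("inf")  # a single row gives no spread estimate
        # Intermediate result: usable immediately, refined by later chunks.
        yield count, mean, 1.96 * std_error


estimates = list(progressive_mean([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], chunk_size=2))
```

A consumer such as a chart can redraw after every yielded tuple and stop iterating at any time, which is the interruptibility that progressive systems rely on.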