Navigating Diverse Datasets in the Face of Uncertainty
When exploring big volumes of data, one of the challenging aspects is their diversity
of origin. Multiple files that have not yet been ingested into a database system may
contain information of interest to a researcher, who must curate, understand and sieve
their content before being able to extract knowledge.
Performance is one of the greatest difficulties in exploring these datasets. On the
one hand, examining non-indexed, unprocessed files can be inefficient. On the other
hand, any processing done before the data is understood introduces latency and
potentially unnecessary work if the chosen schema matches the data poorly. We have
surveyed the state of the art and, fortunately, there exist multiple proposed solutions
for handling data in situ efficiently.
Another major difficulty is matching files from multiple origins since their schema
and layout may not be compatible or properly documented. Most surveyed solutions
overlook this problem, especially for numeric, uncertain data, as is typical in fields
like astronomy.
The main objective of our research is to assist data scientists during the exploration
of unprocessed, numerical, raw data distributed across multiple files based solely on
its intrinsic distribution.
In this thesis, we first introduce the concept of Equally-Distributed Dependencies,
which provides the foundations to match this kind of dataset. We propose PresQ,
a novel algorithm that finds quasi-cliques on hypergraphs based on their expected
statistical properties. The probabilistic approach of PresQ can be successfully
exploited to mine EDDs between diverse datasets when the underlying populations can
be assumed to be the same.
Finally, we propose a two-sample statistical test based on Self-Organizing Maps
(SOM). This method can outperform, in terms of power, other classifier-based
two-sample tests, being in some cases comparable to kernel-based methods, with the
advantage of being interpretable.
Both PresQ and the SOM-based statistical test can provide insights that drive
serendipitous discoveries
Design of Progressive Visualization Systems for Large-scale Data Exploration
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, 2020. 2. Jinwook Seo.
Understanding data through interactive visualization, also known as visual analytics, is a common and necessary practice in modern data science. However, as data sizes have increased at unprecedented rates, the computation latency of visualization systems has become a significant hurdle to visual analytics. The goal of this dissertation is to design a series of systems for progressive visual analytics (PVA), a visual analytics paradigm that can provide intermediate results during computation and allow visual exploration of these results, to address the scalability hurdle. To support the interactive exploration of data with billions of records, we first introduce SwiftTuna, an interactive visualization system with scalable visualization and computation components. Our performance benchmark demonstrates that it can handle data with four billion records, giving responsive feedback every few seconds without precomputation. Second, we present PANENE, a progressive algorithm for the Approximate k-Nearest Neighbor (AKNN) problem. PANENE brings useful machine learning methods into visual analytics, which has been challenging due to their long initial latency resulting from AKNN computation. In particular, we accelerate t-Distributed Stochastic Neighbor Embedding (t-SNE), a popular non-linear dimensionality reduction technique, which enables the responsive visualization of data with a few hundred columns. These two contributions address the scalability issues stemming from a large number of rows and a large number of columns in data, respectively. Third, from the users' perspective, we focus on improving the trustworthiness of intermediate knowledge gained from uncertain results in PVA. We propose a novel PVA concept, Progressive Visual Analytics with Safeguards, and introduce PVA-Guards, safeguards people can leave on uncertain intermediate knowledge that needs to be verified. We also present a proof-of-concept system, ProReveal, designed and developed to integrate seven safeguards into progressive data exploration. Our user study demonstrates that people not only successfully created PVA-Guards on ProReveal but also voluntarily used PVA-Guards to manage the uncertainty of their knowledge. Finally, summarizing the three studies, we discuss design challenges for progressive systems as well as future research agendas for PVA.
CHAPTER 1. Introduction
1.1 Background and Motivation
1.2 Thesis Statement and Research Questions
1.3 Thesis Contributions
1.3.1 Responsive and Incremental Visual Exploration of Large-scale Multidimensional Data
1.3.2 Progressive Computation of Approximate k-Nearest Neighbors and Responsive t-SNE
1.3.3 Progressive Visual Analytics with Safeguards
1.4 Structure of Dissertation
CHAPTER 2. Related Work
2.1 Progressive Visual Analytics
2.1.1 Definitions
2.1.2 System Latency and Human Factors
2.1.3 Users, Tasks, and Models
2.1.4 Techniques, Algorithms, and Systems
2.1.5 Uncertainty Visualization
2.2 Approaches for Scalable Visualization Systems
2.3 The k-Nearest Neighbor (KNN) Problem
2.4 t-Distributed Stochastic Neighbor Embedding
CHAPTER 3. SwiftTuna: Responsive and Incremental Visual Exploration of Large-scale Multidimensional Data
3.1 The SwiftTuna Design
3.1.1 Design Considerations
3.1.2 System Overview
3.1.3 Scalable Visualization Components
3.1.4 Visualization Cards
3.1.5 User Interface and Interaction
3.2 Responsive Querying
3.2.1 Querying Pipeline
3.2.2 Prompt Responses
3.2.3 Incremental Processing
3.3 Evaluation: Performance Benchmark
3.3.1 Study Design
3.3.2 Results and Discussion
3.4 Implementation
3.5 Summary
CHAPTER 4. PANENE: A Progressive Algorithm for Indexing and Querying Approximate k-Nearest Neighbors
4.1 Approximate k-Nearest Neighbor
4.1.1 A Sequential Algorithm
4.1.2 An Online Algorithm
4.1.3 A Progressive Algorithm
4.1.4 Filtered AKNN Search
4.2 k-Nearest Neighbor Lookup Table
4.3 Benchmark
4.3.1 Online and Progressive k-d Trees
4.3.2 k-Nearest Neighbor Lookup Tables
4.4 Applications
4.4.1 Progressive Regression and Density Estimation
4.4.2 Responsive t-SNE
4.5 Implementation
4.6 Discussion
4.7 Summary
CHAPTER 5. ProReveal: Progressive Visual Analytics with Safeguards
5.1 Progressive Visual Analytics with Safeguards
5.1.1 Definition
5.1.2 Examples
5.1.3 Design Considerations
5.2 ProReveal
5.3 Evaluation
5.4 Discussion
5.5 Summary
CHAPTER 6. Discussion
6.1 Lessons Learned
6.2 Limitations
CHAPTER 7. Conclusion
7.1 Thesis Contributions Revisited
7.2 Future Research Agenda
7.3 Final Remarks
Abstract (Korean)
Acknowledgments (Korean)
Docto
You can't always sketch what you want: Understanding Sensemaking in Visual Query Systems
Visual query systems (VQSs) empower users to interactively search for line
charts with desired visual patterns, typically specified using intuitive
sketch-based interfaces. Despite decades of past work on VQSs, these efforts
have not translated to adoption in practice, possibly because VQSs are largely
evaluated in unrealistic lab-based settings. To remedy this gap in adoption, we
collaborated with experts from three diverse domains---astronomy, genetics, and
material science---via a year-long user-centered design process to develop a
VQS that supports their workflow and analytical needs, and evaluate how VQSs
can be used in practice. Our study results reveal that ad-hoc sketch-only
querying is not as commonly used as prior work suggests, since analysts are
often unable to precisely express their patterns of interest. In addition, we
characterize three essential sensemaking processes supported by our enhanced
VQS. We discover that participants employ all three processes, but in different
proportions, depending on the analytical needs in each domain. Our findings
suggest that all three sensemaking processes must be integrated in order to
make future VQSs useful for a wide range of analytical inquiries.
Comment: Accepted for presentation at IEEE VAST 2019, to be held October 20-25
in Vancouver, Canada. Paper will also be published in a special issue of IEEE
Transactions on Visualization and Computer Graphics (TVCG) IEEE VIS
(InfoVis/VAST/SciVis) 2019 ACM 2012 CCS - Human-centered computing,
Visualization, Visualization design and evaluation method
Hillview: A trillion-cell spreadsheet for big data
Hillview is a distributed spreadsheet for browsing very large datasets that
cannot be handled by a single machine. As a spreadsheet, Hillview provides a
high degree of interactivity that permits data analysts to explore information
quickly along many dimensions while switching visualizations on a whim. To
provide the required responsiveness, Hillview introduces visualization
sketches, or vizketches, as a simple idea to produce compact data
visualizations. Vizketches combine algorithmic techniques for data
summarization with computer graphics principles for efficient rendering. While
simple, vizketches are effective at scaling the spreadsheet by parallelizing
computation, reducing communication, providing progressive visualizations, and
offering precise accuracy guarantees. Using Hillview running on eight servers,
we can navigate and visualize datasets of tens of billions of rows and
trillions of cells, much beyond the published capabilities of competing
systems
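
The idea of a compact, mergeable summary can be illustrated with a small sketch. The code below is not Hillview's implementation or API; `HistogramSketch`, `summarize`, and `merge` are hypothetical names, and a fixed-bucket histogram is assumed as the summary. Each partition is summarized independently (so the work can run in parallel), and only the small bucket counts, not the rows, need to be combined.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Iterable, Tuple


@dataclass(frozen=True)
class HistogramSketch:
    """Bucket counts over [lo, hi). Hypothetical type, for illustration only."""
    lo: float
    hi: float
    counts: Tuple[int, ...]


def summarize(values: Iterable[float], lo: float, hi: float, buckets: int) -> HistogramSketch:
    """Summarize one data partition into a fixed number of bucket counts."""
    counts = [0] * buckets
    width = (hi - lo) / buckets
    for v in values:
        if lo <= v < hi:
            # Clamp to guard against floating-point rounding at the upper edge.
            idx = min(int((v - lo) / width), buckets - 1)
            counts[idx] += 1
    return HistogramSketch(lo, hi, tuple(counts))


def merge(a: HistogramSketch, b: HistogramSketch) -> HistogramSketch:
    """Combine two summaries built with the same range and bucket count."""
    if (a.lo, a.hi, len(a.counts)) != (b.lo, b.hi, len(b.counts)):
        raise ValueError("sketches must share range and bucket count")
    return HistogramSketch(a.lo, a.hi, tuple(x + y for x, y in zip(a.counts, b.counts)))


# Three partitions, as if they lived on three different servers.
partitions = [[1, 2, 3], [4, 5, 6, 7], [8, 9]]
partials = [summarize(p, lo=0, hi=10, buckets=5) for p in partitions]
merged = reduce(merge, partials)
print(merged.counts)
```

Because the merge is an element-wise sum, the result does not depend on the order in which partial summaries arrive, which is what makes partial results displayable while other partitions are still being processed.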
DesignSense: A Visual Analytics Interface for Navigating Generated Design Spaces
Generative Design (GD) produces many design alternatives and promises novel and performant solutions to architectural design problems. The success of GD rests on the ability to navigate the generated alternatives in a way that is unhindered by their number and in a manner that reflects design judgment, with its quantitative and qualitative dimensions. I address this challenge by critically analyzing the literature on design space navigation (DSN) tools through a set of iteratively developed lenses. The lenses are informed by domain experts' feedback and behavioural studies on design navigation under choice-overload conditions. The lessons from the analysis shaped DesignSense, which is a DSN tool that relies on visual analytics techniques for selecting, inspecting, clustering and grouping alternatives. Furthermore, I present case studies of navigating realistic GD datasets from architecture and game design. Finally, I conduct a formative focus group evaluation with design professionals that shows the tool's potential and highlights future directions
MFA-DVR: Direct Volume Rendering of MFA Models
3D volume rendering is widely used to reveal insightful intrinsic patterns of
volumetric datasets across many domains. However, the complex structures and
varying scales of volumetric data can make efficiently generating high-quality
volume rendering results a challenging task. Multivariate functional
approximation (MFA) is a new data model that addresses some of the critical
challenges: high-order evaluation of both value and derivative anywhere in the
spatial domain, compact representation for large-scale volumetric data, and
uniform representation of both structured and unstructured data. In this paper,
we present MFA-DVR, the first direct volume rendering pipeline utilizing the
MFA model, for both structured and unstructured volumetric datasets. We
demonstrate improved rendering quality using MFA-DVR on both synthetic and real
datasets through a comparative study. We show that MFA-DVR not only generates
more faithful volume rendering than using local filters but also performs
faster on high-order interpolations on structured and unstructured datasets.
MFA-DVR is implemented in the existing volume rendering pipeline of the
Visualization Toolkit (VTK) to be accessible by the scientific visualization
community
Balancing Interactive Data Management of Massive Data with Situational Awareness through Smart Aggregation
Designing a visualization system capable of processing, managing, and presenting massive data sets while maximizing the user's situational awareness (SA) is a challenging, but important, research question in visual analytics. Traditional data management and interactive retrieval approaches have often focused on solving the data overload problem at the expense of the user's SA. This paper discusses various data management strategies and the strengths and limitations of each approach in providing the user with SA. A new data management strategy, coined Smart Aggregation, is presented as a powerful approach to overcome the challenges of both massive data sets and maintaining SA. By combining automatic data aggregation with user-defined controls on what, how, and when data should be aggregated, we present a visualization system that can handle massive amounts of data while affording the user with the best possible SA. This approach ensures that a system is always usable in terms of both system resources and human perceptual resources. We have implemented our Smart Aggregation approach in a visual analytics system called VIAssist (Visual Assistant for Information Assurance Analysis) to facilitate exploration, discovery, and SA in th
Efficient Point Clustering for Visualization
The visualization of large spatial point data sets constitutes a problem with respect to runtime and quality. A visualization of raw data often leads to occlusion and clutter and thus a loss of information. Furthermore, particularly mobile devices have problems in displaying millions of data items. Often, thinning via sampling is not the optimal choice because users want to see distributional patterns, cardinalities and outliers. In particular for visual analytics, an aggregation of this type of data is very valuable for providing an interactive user experience. This thesis defines the problem of visual point clustering that leads to proportional circle maps. It furthermore introduces a set of quality measures that assess different aspects of resulting circle representations.
The Circle Merging Quadtree constitutes a novel and efficient method to produce visual point clusterings via aggregation. It outperforms comparable methods both in runtime and when evaluated with the aforementioned quality measures. Moreover, the introduction of a preprocessing step leads to further substantial performance improvements and a guaranteed stability of the Circle Merging Quadtree. This thesis furthermore addresses the incorporation of miscellaneous attributes into the aggregation. It discusses means to provide statistical values for numerical and textual attributes that are suitable for side-views such as plots and data tables. The incorporation of multiple data sets or data sets that contain class attributes poses another problem for aggregation and visualization. This thesis provides methods for extending the Circle Merging Quadtree to output pie chart maps or maps that contain circle packings. For the latter variant, this thesis provides results of a user study that investigates the methods and the introduced quality criteria.
In the context of providing methods for interactive data visualization, this thesis finally presents the VAT System, where VAT stands for visualization, analysis and transformation. This system constitutes an exploratory geographical information system that implements principles of visual analytics for working with spatio-temporal data. This thesis details the user interface concept for facilitating exploratory analysis and provides the results of two user studies that assess the approach