8 research outputs found

    Selecting adequate samples for approximate decision support queries

    Get PDF
    For highly selective queries, a simple random sample of records drawn from a large data warehouse may not contain sufficient number of records that satisfy the query conditions. Efficient sampling schemes for such queries require innovative techniques that can access records that are relevant to each specific query. In drawing the sample, it is advantageous to know what would be an adequate sample size for a given query. This paper proposes methods for picking adequate samples that ensure approximate query results with a desired level of accuracy. A special index based on a structure known as the k-MDI Tree is used to draw samples. An unbiased estimator named inverse simple random sampling without replacement is adapted to estimate adequate sample sizes for queries. The methods are evaluated experimentally on a large real life data set. The results of evaluation show that adequate sample sizes can be determined such that errors in outputs of most queries are wtihin the acceptable limit of 5%

    Rapid Sampling for Visualizations with Ordering Guarantees

    Get PDF
    Visualizations are frequently used as a means to understand trends and gather insights from datasets, but often take a long time to generate. In this paper, we focus on the problem of rapidly generating approximate visualizations while preserving crucial visual proper- ties of interest to analysts. Our primary focus will be on sampling algorithms that preserve the visual property of ordering; our techniques will also apply to some other visual properties. For instance, our algorithms can be used to generate an approximate visualization of a bar chart very rapidly, where the comparisons between any two bars are correct. We formally show that our sampling algorithms are generally applicable and provably optimal in theory, in that they do not take more samples than necessary to generate the visualizations with ordering guarantees. They also work well in practice, correctly ordering output groups while taking orders of magnitude fewer samples and much less time than conventional sampling schemes.Comment: Tech Report. 17 pages. Condensed version to appear in VLDB Vol. 8 No.

    Dynamic Prefetching of Data Tiles for Interactive Visualization

    Get PDF
    In this paper, we present ForeCache, a general-purpose tool for exploratory browsing of large datasets. ForeCache utilizes a client-server architecture, where the user interacts with a lightweight client-side interface to browse datasets, and the data to be browsed is retrieved from a DBMS running on a back-end server. We assume a detail-on-demand browsing paradigm, and optimize the back-end support for this paradigm by inserting a separate middleware layer in front of the DBMS. To improve response times, the middleware layer fetches data ahead of the user as she explores a dataset. We consider two different mechanisms for prefetching: (a) learning what to fetch from the user's recent movements, and (b) using data characteristics (e.g., histograms) to find data similar to what the user has viewed in the past. We incorporate these mechanisms into a single prediction engine that adjusts its prediction strategies over time, based on changes in the user's behavior. We evaluated our prediction engine with a user study, and found that our dynamic prefetching strategy provides: (1) significant improvements in overall latency when compared with non-prefetching systems (430% improvement); and (2) substantial improvements in both prediction accuracy (25% improvement) and latency (88% improvement) relative to existing prefetching techniques

    Interactive visualization of big data leveraging databases for scalable computation

    Get PDF
    Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2013.Cataloged from PDF version of thesis.Includes bibliographical references (pages 55-57).Modern database management systems (DBMS) have been designed to efficiently store, manage and perform computations on massive amounts of data. In contrast, many existing visualization systems do not scale seamlessly from small data sets to enormous ones. We have designed a three-tiered visualization system called ScalaR to deal with this issue. ScalaR dynamically performs resolution reduction when the expected result of a DBMS query is too large to be effectively rendered on existing screen real estate. Instead of running the original query, ScalaR inserts aggregation, sampling or filtering operations to reduce the size of the result. This thesis presents the design and implementation of ScalaR, and shows results for two example applications, visualizing earthquake records and satellite imagery data, stored in SciDB as the back-end DBMS.by Leilani Marie Battle.S.M

    λŒ€μš©λŸ‰ 데이터 탐색을 μœ„ν•œ 점진적 μ‹œκ°ν™” μ‹œμŠ€ν…œ 섀계

    Get PDF
    ν•™μœ„λ…Όλ¬Έ(박사)--μ„œμšΈλŒ€ν•™κ΅ λŒ€ν•™μ› :κ³΅κ³ΌλŒ€ν•™ 컴퓨터곡학뢀,2020. 2. μ„œμ§„μš±.Understanding data through interactive visualization, also known as visual analytics, is a common and necessary practice in modern data science. However, as data sizes have increased at unprecedented rates, the computation latency of visualization systems becomes a significant hurdle to visual analytics. The goal of this dissertation is to design a series of systems for progressive visual analytics (PVA)β€”a visual analytics paradigm that can provide intermediate results during computation and allow visual exploration of these resultsβ€”to address the scalability hurdle. To support the interactive exploration of data with billions of records, we first introduce SwiftTuna, an interactive visualization system with scalable visualization and computation components. Our performance benchmark demonstrates that it can handle data with four billion records, giving responsive feedback every few seconds without precomputation. Second, we present PANENE, a progressive algorithm for the Approximate k-Nearest Neighbor (AKNN) problem. PANENE brings useful machine learning methods into visual analytics, which has been challenging due to their long initial latency resulting from AKNN computation. In particular, we accelerate t-Distributed Stochastic Neighbor Embedding (t-SNE), a popular non-linear dimensionality reduction technique, which enables the responsive visualization of data with a few hundred columns. Each of these two contributions aims to address the scalability issues stemming from a large number of rows or columns in data, respectively. Third, from the users' perspective, we focus on improving the trustworthiness of intermediate knowledge gained from uncertain results in PVA. We propose a novel PVA concept, Progressive Visual Analytics with Safeguards, and introduce PVA-Guards, safeguards people can leave on uncertain intermediate knowledge that needs to be verified. We also present a proof-of-concept system, ProReveal, designed and developed to integrate seven safeguards into progressive data exploration. Our user study demonstrates that people not only successfully created PVA-Guards on ProReveal but also voluntarily used PVA-Guards to manage the uncertainty of their knowledge. Finally, summarizing the three studies, we discuss design challenges for progressive systems as well as future research agendas for PVA.ν˜„λŒ€ 데이터 μ‚¬μ΄μ–ΈμŠ€μ—μ„œ μΈν„°λž™ν‹°λΈŒν•œ μ‹œκ°ν™”λ₯Ό 톡해 데이터λ₯Ό μ΄ν•΄ν•˜λŠ” 것은 ν•„μˆ˜μ μΈ 뢄석 방법 쀑 ν•˜λ‚˜μ΄λ‹€. κ·ΈλŸ¬λ‚˜, 졜근 λ°μ΄ν„°μ˜ 크기가 폭발적으둜 μ¦κ°€ν•˜λ©΄μ„œ 데이터 크기둜 인해 λ°œμƒν•˜λŠ” 지연 μ‹œκ°„μ΄ μΈν„°λž™ν‹°λΈŒν•œ μ‹œκ°μ  뢄석에 큰 걸림돌이 λ˜μ—ˆλ‹€. λ³Έ μ—°κ΅¬μ—μ„œλŠ” μ΄λŸ¬ν•œ ν™•μž₯μ„± 문제λ₯Ό ν•΄κ²°ν•˜κΈ° μœ„ν•΄ 점진적 μ‹œκ°μ  뢄석(Progressive Visual Analytics)을 μ§€μ›ν•˜λŠ” 일련의 μ‹œμŠ€ν…œμ„ λ””μžμΈν•˜κ³  κ°œλ°œν•œλ‹€. μ΄λŸ¬ν•œ 점진적 μ‹œκ°μ  뢄석 μ‹œμŠ€ν…œμ€ 데이터 μ²˜λ¦¬κ°€ μ™„μ „νžˆ λλ‚˜μ§€ μ•Šλ”λΌλ„ 쀑간 뢄석 κ²°κ³Όλ₯Ό μ‚¬μš©μžμ—κ²Œ μ œκ³΅ν•¨μœΌλ‘œμ¨ λ°μ΄ν„°μ˜ 크기둜 인해 λ°œμƒν•˜λŠ” 지연 μ‹œκ°„ 문제λ₯Ό μ™„ν™”ν•  수 μžˆλ‹€. 첫째둜, μˆ˜μ‹­μ–΅ 건의 행을 κ°€μ§€λŠ” 데이터λ₯Ό μ‹œκ°μ μœΌλ‘œ 탐색할 수 μžˆλŠ” SwiftTuna μ‹œμŠ€ν…œμ„ μ œμ•ˆν•œλ‹€. 데이터 처리 및 μ‹œκ°μ  ν‘œν˜„μ˜ ν™•μž₯성을 λͺ©ν‘œλ‘œ 개발된 이 μ‹œμŠ€ν…œμ€, μ•½ 40μ–΅ 건의 행을 가진 데이터에 λŒ€ν•œ μ‹œκ°ν™”λ₯Ό μ „μ²˜λ¦¬ 없이 수 μ΄ˆλ§ˆλ‹€ μ—…λ°μ΄νŠΈν•  수 μžˆλŠ” κ²ƒμœΌλ‘œ λ‚˜νƒ€λ‚¬λ‹€. λ‘˜μ§Έλ‘œ, 근사적 k-μ΅œκ·Όμ ‘μ (Approximate k-Nearest Neighbor) 문제λ₯Ό μ μ§„μ μœΌλ‘œ κ³„μ‚°ν•˜λŠ” PANENE μ•Œκ³ λ¦¬μ¦˜μ„ μ œμ•ˆν•œλ‹€. 근사적 k-μ΅œκ·Όμ ‘μ  λ¬Έμ œλŠ” μ—¬λŸ¬ 기계 ν•™μŠ΅ κΈ°λ²•μ—μ„œ μ“°μž„μ—λ„ λΆˆκ΅¬ν•˜κ³  초기 계산 μ‹œκ°„μ΄ κΈΈμ–΄μ„œ μΈν„°λž™ν‹°λΈŒν•œ μ‹œμŠ€ν…œμ— μ μš©ν•˜κΈ° νž˜λ“  ν•œκ³„κ°€ μžˆμ—ˆλ‹€. PANENE μ•Œκ³ λ¦¬μ¦˜μ€ μ΄λŸ¬ν•œ κΈ΄ 초기 계산 μ‹œκ°„μ„ 획기적으둜 κ°œμ„ ν•˜μ—¬ λ‹€μ–‘ν•œ 기계 ν•™μŠ΅ 기법을 μ‹œκ°μ  뢄석에 ν™œμš©ν•  수 μžˆλ„λ‘ ν•œλ‹€. 특히, μœ μš©ν•œ λΉ„μ„ ν˜•μ  차원 κ°μ†Œ 기법인 t-뢄포 ν™•λ₯ μ  μž„λ² λ”©(t-Distributed Stochastic Neighbor Embedding)을 κ°€μ†ν•˜μ—¬ 수백 개의 차원을 κ°€μ§€λŠ” 데이터λ₯Ό λΉ λ₯Έ μ‹œκ°„ 내에 μ‚¬μ˜ν•  수 μžˆλ‹€. μœ„μ˜ 두 μ‹œμŠ€ν…œκ³Ό μ•Œκ³ λ¦¬μ¦˜μ΄ λ°μ΄ν„°μ˜ ν–‰ λ˜λŠ” μ—΄μ˜ 개수둜 μΈν•œ ν™•μž₯μ„± 문제λ₯Ό ν•΄κ²°ν•˜κ³ μž ν–ˆλ‹€λ©΄, μ„Έ 번째 μ‹œμŠ€ν…œμ—μ„œλŠ” 점진적 μ‹œκ°μ  λΆ„μ„μ˜ 신뒰도 문제λ₯Ό κ°œμ„ ν•˜κ³ μž ν•œλ‹€. 점진적 μ‹œκ°μ  λΆ„μ„μ—μ„œ μ‚¬μš©μžμ—κ²Œ μ£Όμ–΄μ§€λŠ” 쀑간 계산 κ²°κ³ΌλŠ” μ΅œμ’… 결과의 κ·Όμ‚¬μΉ˜μ΄λ―€λ‘œ λΆˆν™•μ‹€μ„±μ΄ μ‘΄μž¬ν•œλ‹€. λ³Έ μ—°κ΅¬μ—μ„œλŠ” μ„Έμ΄ν”„κ°€λ“œλ₯Ό μ΄μš©ν•œ 점진적 μ‹œκ°μ  뢄석(Progressive Visual Analytics with Safeguards)μ΄λΌλŠ” μƒˆλ‘œμš΄ κ°œλ…μ„ μ œμ•ˆν•œλ‹€. 이 κ°œλ…μ€ μ‚¬μš©μžκ°€ 점진적 νƒμƒ‰μ—μ„œ λ§ˆμ£Όν•˜λŠ” λΆˆν™•μ‹€ν•œ 쀑간 지식에 μ„Έμ΄ν”„κ°€λ“œλ₯Ό 남길 수 μžˆλ„λ‘ ν•˜μ—¬ νƒμƒ‰μ—μ„œ 얻은 μ§€μ‹μ˜ 정확도λ₯Ό μΆ”ν›„ 검증할 수 μžˆλ„λ‘ ν•œλ‹€. λ˜ν•œ, μ΄λŸ¬ν•œ κ°œλ…μ„ μ‹€μ œλ‘œ κ΅¬ν˜„ν•˜μ—¬ νƒ‘μž¬ν•œ ProReveal μ‹œμŠ€ν…œμ„ μ†Œκ°œν•œλ‹€. ProRevealλ₯Ό μ΄μš©ν•œ μ‚¬μš©μž μ‹€ν—˜μ—μ„œ μ‚¬μš©μžλ“€μ€ μ„Έμ΄ν”„κ°€λ“œλ₯Ό μ„±κ³΅μ μœΌλ‘œ λ§Œλ“€ 수 μžˆμ—ˆμ„ 뿐만 μ•„λ‹ˆλΌ, 쀑간 μ§€μ‹μ˜ λΆˆν™•μ‹€μ„±μ„ 닀루기 μœ„ν•΄ μ„Έμ΄ν”„κ°€λ“œλ₯Ό 자발적으둜 μ΄μš©ν•œλ‹€λŠ” 것을 μ•Œ 수 μžˆμ—ˆλ‹€. λ§ˆμ§€λ§‰μœΌλ‘œ, μœ„ μ„Έ 가지 μ—°κ΅¬μ˜ κ²°κ³Όλ₯Ό μ’…ν•©ν•˜μ—¬ 점진적 μ‹œκ°μ  뢄석 μ‹œμŠ€ν…œμ„ κ΅¬ν˜„ν•  λ•Œμ˜ λ””μžμΈμ  λ‚œμ œμ™€ ν–₯ν›„ 연ꡬ λ°©ν–₯을 λͺ¨μƒ‰ν•œλ‹€.CHAPTER1. Introduction 2 1.1 Background and Motivation 2 1.2 Thesis Statement and Research Questions 5 1.3 Thesis Contributions 5 1.3.1 Responsive and Incremental Visual Exploration of Large-scale Multidimensional Data 6 1.3.2 ProgressiveComputation of Approximate k-Nearest Neighbors and Responsive t-SNE 7 1.3.3 Progressive Visual Analytics with Safeguards 8 1.4 Structure of Dissertation 9 CHAPTER2. Related Work 11 2.1 Progressive Visual Analytics 11 2.1.1 Definitions 11 2.1.2 System Latency and Human Factors 13 2.1.3 Users, Tasks, and Models 15 2.1.4 Techniques, Algorithms, and Systems. 17 2.1.5 Uncertainty Visualization 19 2.2 Approaches for Scalable Visualization Systems 20 2.3 The k-Nearest Neighbor (KNN) Problem 22 2.4 t-Distributed Stochastic Neighbor Embedding 26 CHAPTER3. SwiTuna: Responsive and Incremental Visual Exploration of Large-scale Multidimensional Data 28 3.1 The SwiTuna Design 31 3.1.1 Design Considerations 32 3.1.2 System Overview 33 3.1.3 Scalable Visualization Components 36 3.1.4 Visualization Cards 40 3.1.5 User Interface and Interaction 42 3.2 Responsive Querying 44 3.2.1 Querying Pipeline 44 3.2.2 Prompt Responses 47 3.2.3 Incremental Processing 47 3.3 Evaluation: Performance Benchmark 49 3.3.1 Study Design 49 3.3.2 Results and Discussion 52 3.4 Implementation 56 3.5 Summary 56 CHAPTER4. PANENE:AProgressive Algorithm for IndexingandQuerying Approximate k-Nearest Neighbors 58 4.1 Approximate k-Nearest Neighbor 61 4.1.1 A Sequential Algorithm 62 4.1.2 An Online Algorithm 63 4.1.3 A Progressive Algorithm 66 4.1.4 Filtered AKNN Search 71 4.2 k-Nearest Neighbor Lookup Table 72 4.3 Benchmark. 78 4.3.1 Online and Progressive k-d Trees 78 4.3.2 k-Nearest Neighbor Lookup Tables 83 4.4 Applications 85 4.4.1 Progressive Regression and Density Estimation 85 4.4.2 Responsive t-SNE 87 4.5 Implementation 92 4.6 Discussion 92 4.7 Summary 93 CHAPTER5. ProReveal: Progressive Visual Analytics with Safeguards 95 5.1 Progressive Visual Analytics with Safeguards 98 5.1.1 Definition 98 5.1.2 Examples 101 5.1.3 Design Considerations 103 5.2 ProReveal 105 5.3 Evaluation 121 5.4 Discussion 127 5.5 Summary 130 CHAPTER6. Discussion 132 6.1 Lessons Learned 132 6.2 Limitations 135 CHAPTER7. Conclusion 137 7.1 Thesis Contributions Revisited 137 7.2 Future Research Agenda 139 7.3 Final Remarks 141 Abstract (Korean) 155 Acknowledgments (Korean) 157Docto

    Incremental, approximate database queries and uncertainty for exploratory visualization

    No full text

    Scalable visual analytics over voluminous spatiotemporal data

    Get PDF
    2018 Fall.Includes bibliographical references.Visualization is a critical part of modern data analytics. This is especially true of interactive and exploratory visual analytics, which encourages speedy discovery of trends, patterns, and connections in data by allowing analysts to rapidly change what data is displayed and how it is displayed. Unfortunately, the explosion of data production in recent years has led to problems of scale as storage, processing, querying, and visualization have struggled to keep pace with data volumes. Visualization of spatiotemporal data pose unique challenges, thanks in part to high-dimensionality in the input feature space, interactions between features, and the production of voluminous, high-resolution outputs. In this dissertation, we address challenges associated with supporting interactive, exploratory visualization of voluminous spatiotemporal datasets and underlying phenomena. This requires the visualization of millions of entities and changes to these entities as the spatiotemporal phenomena unfolds. The rendering and propagation of spatiotemporal phenomena must be both accurate and timely. Key contributions of this dissertation include: 1) the temporal and spatial coupling of spatially localized models to enable the visualization of phenomena at far greater geospatial scales; 2) the ability to directly compare and contrast diverging spatiotemporal outcomes that arise from multiple exploratory "what-if" queries; and 3) the computational framework required to support an interactive user experience in a heavily resource-constrained environment. We additionally provide support for collaborative and competitive exploration with multiple synchronized clients
    corecore