1,058 research outputs found

    CubiST: A New Algorithm for Improving the Performance of Ad-hoc OLAP Queries

    Get PDF
    Being able to efficiently answer arbitrary OLAP queries that aggregate along any combination of dimensions over numerical and categorical attributes has been a continued, major concern in data warehousing. In this paper, we introduce a new data structure, called Statistics Tree (ST), together with an efficient algorithm called CubiST, for evaluating ad-hoc OLAP queries on top of a relational data warehouse. We are focusing on a class of queries called cube queries, which generalize the data cube operator. CubiST represents a drastic departure from existing relational (ROLAP) and multi-dimensional (MOLAP) approaches in that it does not use the familiar view lattice to compute and materialize new views from existing views in some heuristic fashion. CubiST is the first OLAP algorithm that needs only one scan over the detailed data set and can efficiently answer any cube query without additional I/O when the ST fits into memory. We have implemented CubiST and our experiments have demonstrated significant improvements in performance and scalability over existing ROLAP/MOLAP approaches

    ๋Œ€์šฉ๋Ÿ‰ ๋ฐ์ดํ„ฐ ํƒ์ƒ‰์„ ์œ„ํ•œ ์ ์ง„์  ์‹œ๊ฐํ™” ์‹œ์Šคํ…œ ์„ค๊ณ„

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ(๋ฐ•์‚ฌ)--์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› :๊ณต๊ณผ๋Œ€ํ•™ ์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€,2020. 2. ์„œ์ง„์šฑ.Understanding data through interactive visualization, also known as visual analytics, is a common and necessary practice in modern data science. However, as data sizes have increased at unprecedented rates, the computation latency of visualization systems becomes a significant hurdle to visual analytics. The goal of this dissertation is to design a series of systems for progressive visual analytics (PVA)โ€”a visual analytics paradigm that can provide intermediate results during computation and allow visual exploration of these resultsโ€”to address the scalability hurdle. To support the interactive exploration of data with billions of records, we first introduce SwiftTuna, an interactive visualization system with scalable visualization and computation components. Our performance benchmark demonstrates that it can handle data with four billion records, giving responsive feedback every few seconds without precomputation. Second, we present PANENE, a progressive algorithm for the Approximate k-Nearest Neighbor (AKNN) problem. PANENE brings useful machine learning methods into visual analytics, which has been challenging due to their long initial latency resulting from AKNN computation. In particular, we accelerate t-Distributed Stochastic Neighbor Embedding (t-SNE), a popular non-linear dimensionality reduction technique, which enables the responsive visualization of data with a few hundred columns. Each of these two contributions aims to address the scalability issues stemming from a large number of rows or columns in data, respectively. Third, from the users' perspective, we focus on improving the trustworthiness of intermediate knowledge gained from uncertain results in PVA. We propose a novel PVA concept, Progressive Visual Analytics with Safeguards, and introduce PVA-Guards, safeguards people can leave on uncertain intermediate knowledge that needs to be verified. We also present a proof-of-concept system, ProReveal, designed and developed to integrate seven safeguards into progressive data exploration. Our user study demonstrates that people not only successfully created PVA-Guards on ProReveal but also voluntarily used PVA-Guards to manage the uncertainty of their knowledge. Finally, summarizing the three studies, we discuss design challenges for progressive systems as well as future research agendas for PVA.ํ˜„๋Œ€ ๋ฐ์ดํ„ฐ ์‚ฌ์ด์–ธ์Šค์—์„œ ์ธํ„ฐ๋ž™ํ‹ฐ๋ธŒํ•œ ์‹œ๊ฐํ™”๋ฅผ ํ†ตํ•ด ๋ฐ์ดํ„ฐ๋ฅผ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์€ ํ•„์ˆ˜์ ์ธ ๋ถ„์„ ๋ฐฉ๋ฒ• ์ค‘ ํ•˜๋‚˜์ด๋‹ค. ๊ทธ๋Ÿฌ๋‚˜, ์ตœ๊ทผ ๋ฐ์ดํ„ฐ์˜ ํฌ๊ธฐ๊ฐ€ ํญ๋ฐœ์ ์œผ๋กœ ์ฆ๊ฐ€ํ•˜๋ฉด์„œ ๋ฐ์ดํ„ฐ ํฌ๊ธฐ๋กœ ์ธํ•ด ๋ฐœ์ƒํ•˜๋Š” ์ง€์—ฐ ์‹œ๊ฐ„์ด ์ธํ„ฐ๋ž™ํ‹ฐ๋ธŒํ•œ ์‹œ๊ฐ์  ๋ถ„์„์— ํฐ ๊ฑธ๋ฆผ๋Œ์ด ๋˜์—ˆ๋‹ค. ๋ณธ ์—ฐ๊ตฌ์—์„œ๋Š” ์ด๋Ÿฌํ•œ ํ™•์žฅ์„ฑ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ์ ์ง„์  ์‹œ๊ฐ์  ๋ถ„์„(Progressive Visual Analytics)์„ ์ง€์›ํ•˜๋Š” ์ผ๋ จ์˜ ์‹œ์Šคํ…œ์„ ๋””์ž์ธํ•˜๊ณ  ๊ฐœ๋ฐœํ•œ๋‹ค. ์ด๋Ÿฌํ•œ ์ ์ง„์  ์‹œ๊ฐ์  ๋ถ„์„ ์‹œ์Šคํ…œ์€ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ๊ฐ€ ์™„์ „ํžˆ ๋๋‚˜์ง€ ์•Š๋”๋ผ๋„ ์ค‘๊ฐ„ ๋ถ„์„ ๊ฒฐ๊ณผ๋ฅผ ์‚ฌ์šฉ์ž์—๊ฒŒ ์ œ๊ณตํ•จ์œผ๋กœ์จ ๋ฐ์ดํ„ฐ์˜ ํฌ๊ธฐ๋กœ ์ธํ•ด ๋ฐœ์ƒํ•˜๋Š” ์ง€์—ฐ ์‹œ๊ฐ„ ๋ฌธ์ œ๋ฅผ ์™„ํ™”ํ•  ์ˆ˜ ์žˆ๋‹ค. ์ฒซ์งธ๋กœ, ์ˆ˜์‹ญ์–ต ๊ฑด์˜ ํ–‰์„ ๊ฐ€์ง€๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ์‹œ๊ฐ์ ์œผ๋กœ ํƒ์ƒ‰ํ•  ์ˆ˜ ์žˆ๋Š” SwiftTuna ์‹œ์Šคํ…œ์„ ์ œ์•ˆํ•œ๋‹ค. ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ ๋ฐ ์‹œ๊ฐ์  ํ‘œํ˜„์˜ ํ™•์žฅ์„ฑ์„ ๋ชฉํ‘œ๋กœ ๊ฐœ๋ฐœ๋œ ์ด ์‹œ์Šคํ…œ์€, ์•ฝ 40์–ต ๊ฑด์˜ ํ–‰์„ ๊ฐ€์ง„ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ ์‹œ๊ฐํ™”๋ฅผ ์ „์ฒ˜๋ฆฌ ์—†์ด ์ˆ˜ ์ดˆ๋งˆ๋‹ค ์—…๋ฐ์ดํŠธํ•  ์ˆ˜ ์žˆ๋Š” ๊ฒƒ์œผ๋กœ ๋‚˜ํƒ€๋‚ฌ๋‹ค. ๋‘˜์งธ๋กœ, ๊ทผ์‚ฌ์  k-์ตœ๊ทผ์ ‘์ (Approximate k-Nearest Neighbor) ๋ฌธ์ œ๋ฅผ ์ ์ง„์ ์œผ๋กœ ๊ณ„์‚ฐํ•˜๋Š” PANENE ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์ œ์•ˆํ•œ๋‹ค. ๊ทผ์‚ฌ์  k-์ตœ๊ทผ์ ‘์  ๋ฌธ์ œ๋Š” ์—ฌ๋Ÿฌ ๊ธฐ๊ณ„ ํ•™์Šต ๊ธฐ๋ฒ•์—์„œ ์“ฐ์ž„์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ  ์ดˆ๊ธฐ ๊ณ„์‚ฐ ์‹œ๊ฐ„์ด ๊ธธ์–ด์„œ ์ธํ„ฐ๋ž™ํ‹ฐ๋ธŒํ•œ ์‹œ์Šคํ…œ์— ์ ์šฉํ•˜๊ธฐ ํž˜๋“  ํ•œ๊ณ„๊ฐ€ ์žˆ์—ˆ๋‹ค. PANENE ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ์ด๋Ÿฌํ•œ ๊ธด ์ดˆ๊ธฐ ๊ณ„์‚ฐ ์‹œ๊ฐ„์„ ํš๊ธฐ์ ์œผ๋กœ ๊ฐœ์„ ํ•˜์—ฌ ๋‹ค์–‘ํ•œ ๊ธฐ๊ณ„ ํ•™์Šต ๊ธฐ๋ฒ•์„ ์‹œ๊ฐ์  ๋ถ„์„์— ํ™œ์šฉํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•œ๋‹ค. ํŠนํžˆ, ์œ ์šฉํ•œ ๋น„์„ ํ˜•์  ์ฐจ์› ๊ฐ์†Œ ๊ธฐ๋ฒ•์ธ t-๋ถ„ํฌ ํ™•๋ฅ ์  ์ž„๋ฒ ๋”ฉ(t-Distributed Stochastic Neighbor Embedding)์„ ๊ฐ€์†ํ•˜์—ฌ ์ˆ˜๋ฐฑ ๊ฐœ์˜ ์ฐจ์›์„ ๊ฐ€์ง€๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ๋น ๋ฅธ ์‹œ๊ฐ„ ๋‚ด์— ์‚ฌ์˜ํ•  ์ˆ˜ ์žˆ๋‹ค. ์œ„์˜ ๋‘ ์‹œ์Šคํ…œ๊ณผ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ๋ฐ์ดํ„ฐ์˜ ํ–‰ ๋˜๋Š” ์—ด์˜ ๊ฐœ์ˆ˜๋กœ ์ธํ•œ ํ™•์žฅ์„ฑ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ณ ์ž ํ–ˆ๋‹ค๋ฉด, ์„ธ ๋ฒˆ์งธ ์‹œ์Šคํ…œ์—์„œ๋Š” ์ ์ง„์  ์‹œ๊ฐ์  ๋ถ„์„์˜ ์‹ ๋ขฐ๋„ ๋ฌธ์ œ๋ฅผ ๊ฐœ์„ ํ•˜๊ณ ์ž ํ•œ๋‹ค. ์ ์ง„์  ์‹œ๊ฐ์  ๋ถ„์„์—์„œ ์‚ฌ์šฉ์ž์—๊ฒŒ ์ฃผ์–ด์ง€๋Š” ์ค‘๊ฐ„ ๊ณ„์‚ฐ ๊ฒฐ๊ณผ๋Š” ์ตœ์ข… ๊ฒฐ๊ณผ์˜ ๊ทผ์‚ฌ์น˜์ด๋ฏ€๋กœ ๋ถˆํ™•์‹ค์„ฑ์ด ์กด์žฌํ•œ๋‹ค. ๋ณธ ์—ฐ๊ตฌ์—์„œ๋Š” ์„ธ์ดํ”„๊ฐ€๋“œ๋ฅผ ์ด์šฉํ•œ ์ ์ง„์  ์‹œ๊ฐ์  ๋ถ„์„(Progressive Visual Analytics with Safeguards)์ด๋ผ๋Š” ์ƒˆ๋กœ์šด ๊ฐœ๋…์„ ์ œ์•ˆํ•œ๋‹ค. ์ด ๊ฐœ๋…์€ ์‚ฌ์šฉ์ž๊ฐ€ ์ ์ง„์  ํƒ์ƒ‰์—์„œ ๋งˆ์ฃผํ•˜๋Š” ๋ถˆํ™•์‹คํ•œ ์ค‘๊ฐ„ ์ง€์‹์— ์„ธ์ดํ”„๊ฐ€๋“œ๋ฅผ ๋‚จ๊ธธ ์ˆ˜ ์žˆ๋„๋ก ํ•˜์—ฌ ํƒ์ƒ‰์—์„œ ์–ป์€ ์ง€์‹์˜ ์ •ํ™•๋„๋ฅผ ์ถ”ํ›„ ๊ฒ€์ฆํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•œ๋‹ค. ๋˜ํ•œ, ์ด๋Ÿฌํ•œ ๊ฐœ๋…์„ ์‹ค์ œ๋กœ ๊ตฌํ˜„ํ•˜์—ฌ ํƒ‘์žฌํ•œ ProReveal ์‹œ์Šคํ…œ์„ ์†Œ๊ฐœํ•œ๋‹ค. ProReveal๋ฅผ ์ด์šฉํ•œ ์‚ฌ์šฉ์ž ์‹คํ—˜์—์„œ ์‚ฌ์šฉ์ž๋“ค์€ ์„ธ์ดํ”„๊ฐ€๋“œ๋ฅผ ์„ฑ๊ณต์ ์œผ๋กœ ๋งŒ๋“ค ์ˆ˜ ์žˆ์—ˆ์„ ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ, ์ค‘๊ฐ„ ์ง€์‹์˜ ๋ถˆํ™•์‹ค์„ฑ์„ ๋‹ค๋ฃจ๊ธฐ ์œ„ํ•ด ์„ธ์ดํ”„๊ฐ€๋“œ๋ฅผ ์ž๋ฐœ์ ์œผ๋กœ ์ด์šฉํ•œ๋‹ค๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ์—ˆ๋‹ค. ๋งˆ์ง€๋ง‰์œผ๋กœ, ์œ„ ์„ธ ๊ฐ€์ง€ ์—ฐ๊ตฌ์˜ ๊ฒฐ๊ณผ๋ฅผ ์ข…ํ•ฉํ•˜์—ฌ ์ ์ง„์  ์‹œ๊ฐ์  ๋ถ„์„ ์‹œ์Šคํ…œ์„ ๊ตฌํ˜„ํ•  ๋•Œ์˜ ๋””์ž์ธ์  ๋‚œ์ œ์™€ ํ–ฅํ›„ ์—ฐ๊ตฌ ๋ฐฉํ–ฅ์„ ๋ชจ์ƒ‰ํ•œ๋‹ค.CHAPTER1. Introduction 2 1.1 Background and Motivation 2 1.2 Thesis Statement and Research Questions 5 1.3 Thesis Contributions 5 1.3.1 Responsive and Incremental Visual Exploration of Large-scale Multidimensional Data 6 1.3.2 ProgressiveComputation of Approximate k-Nearest Neighbors and Responsive t-SNE 7 1.3.3 Progressive Visual Analytics with Safeguards 8 1.4 Structure of Dissertation 9 CHAPTER2. Related Work 11 2.1 Progressive Visual Analytics 11 2.1.1 Definitions 11 2.1.2 System Latency and Human Factors 13 2.1.3 Users, Tasks, and Models 15 2.1.4 Techniques, Algorithms, and Systems. 17 2.1.5 Uncertainty Visualization 19 2.2 Approaches for Scalable Visualization Systems 20 2.3 The k-Nearest Neighbor (KNN) Problem 22 2.4 t-Distributed Stochastic Neighbor Embedding 26 CHAPTER3. SwiTuna: Responsive and Incremental Visual Exploration of Large-scale Multidimensional Data 28 3.1 The SwiTuna Design 31 3.1.1 Design Considerations 32 3.1.2 System Overview 33 3.1.3 Scalable Visualization Components 36 3.1.4 Visualization Cards 40 3.1.5 User Interface and Interaction 42 3.2 Responsive Querying 44 3.2.1 Querying Pipeline 44 3.2.2 Prompt Responses 47 3.2.3 Incremental Processing 47 3.3 Evaluation: Performance Benchmark 49 3.3.1 Study Design 49 3.3.2 Results and Discussion 52 3.4 Implementation 56 3.5 Summary 56 CHAPTER4. PANENE:AProgressive Algorithm for IndexingandQuerying Approximate k-Nearest Neighbors 58 4.1 Approximate k-Nearest Neighbor 61 4.1.1 A Sequential Algorithm 62 4.1.2 An Online Algorithm 63 4.1.3 A Progressive Algorithm 66 4.1.4 Filtered AKNN Search 71 4.2 k-Nearest Neighbor Lookup Table 72 4.3 Benchmark. 78 4.3.1 Online and Progressive k-d Trees 78 4.3.2 k-Nearest Neighbor Lookup Tables 83 4.4 Applications 85 4.4.1 Progressive Regression and Density Estimation 85 4.4.2 Responsive t-SNE 87 4.5 Implementation 92 4.6 Discussion 92 4.7 Summary 93 CHAPTER5. ProReveal: Progressive Visual Analytics with Safeguards 95 5.1 Progressive Visual Analytics with Safeguards 98 5.1.1 Definition 98 5.1.2 Examples 101 5.1.3 Design Considerations 103 5.2 ProReveal 105 5.3 Evaluation 121 5.4 Discussion 127 5.5 Summary 130 CHAPTER6. Discussion 132 6.1 Lessons Learned 132 6.2 Limitations 135 CHAPTER7. Conclusion 137 7.1 Thesis Contributions Revisited 137 7.2 Future Research Agenda 139 7.3 Final Remarks 141 Abstract (Korean) 155 Acknowledgments (Korean) 157Docto

    GeoLens: enabling interactive visual analytics over large-scale, multidimensional geospatial datasets

    Get PDF
    2015 Spring.Includes bibliographical references.With the rapid increase of scientific data volumes, interactive tools that enable effective visual representation for scientists are needed. This is critical when scientists are manipulating voluminous datasets and especially when they need to explore datasets interactively to develop their hypotheses. In this paper, we present an interactive visual analytics framework, GeoLens. GeoLens provides fast and expressive interactions with voluminous geospatial datasets. We provide an expressive visual query evaluation scheme to support advanced interactive visual analytics technique, such as brushing and linking. To achieve this, we designed and developed the geohash based image tile generation algorithm that automatically adjusts the range of data to access based on the minimum acceptable size of the image tile. In addition, we have also designed an autonomous histogram generation algorithm that generates histograms of user-defined data subsets that do not have pre-computed data properties. Using our approach, applications can generate histograms of datasets containing millions of data points with sub-second latency. The work builds on our visual query coordinating scheme that evaluates geospatial query and orchestrates data aggregation in a distributed storage environment while preserving data locality and minimizing data movements. This paper includes empirical benchmarks of our framework encompassing a billion-file dataset published by the National Climactic Data Center

    Clustering-Based Materialized View Selection in Data Warehouses

    Full text link
    Materialized view selection is a non-trivial task. Hence, its complexity must be reduced. A judicious choice of views must be cost-driven and influenced by the workload experienced by the system. In this paper, we propose a framework for materialized view selection that exploits a data mining technique (clustering), in order to determine clusters of similar queries. We also propose a view merging algorithm that builds a set of candidate views, as well as a greedy process for selecting a set of views to materialize. This selection is based on cost models that evaluate the cost of accessing data using views and the cost of storing these views. To validate our strategy, we executed a workload of decision-support queries on a test data warehouse, with and without using our strategy. Our experimental results demonstrate its efficiency, even when storage space is limited

    CubiST++: Evaluating Ad-Hoc CUBE Queries Using Statistics Trees

    Get PDF
    We report on a new, efficient encoding for the data cube, which results in a drastic speed-up of OLAP queries that aggregate along any combination of dimensions over numerical and categorical attributes. We are focusing on a class of queries called cube queries, which return aggregated values rather than sets of tuples. Our approach, termed CubiST++ (Cubing with Statistics Trees Plus Families), represents a drastic departure from existing relational (ROLAP) and multi-dimensional (MOLAP) approaches in that it does not use the view lattice to compute and materialize new views from existing views in some heuristic fashion. Instead, CubiST++ encodes all possible aggregate views in the leaves of a new data structure called statistics tree (ST) during a one-time scan of the detailed data. In order to optimize the queries involving constraints on hierarchy levels of the underlying dimensions, we select and materialize a family of candidate trees, which represent superviews over the different hierarchical levels of the dimensions. Given a query, our query evaluation algorithm selects the smallest tree in the family, which can provide the answer. Extensive evaluations of our prototype implementation have demonstrated its superior run-time performance and scalability when compared with existing MOLAP and ROLAP systems

    Using Fuzzy Linguistic Representations to Provide Explanatory Semantics for Data Warehouses

    Get PDF
    A data warehouse integrates large amounts of extracted and summarized data from multiple sources for direct querying and analysis. While it provides decision makers with easy access to such historical and aggregate data, the real meaning of the data has been ignored. For example, "whether a total sales amount 1,000 items indicates a good or bad sales performance" is still unclear. From the decision makers' point of view, the semantics rather than raw numbers which convey the meaning of the data is very important. In this paper, we explore the use of fuzzy technology to provide this semantics for the summarizations and aggregates developed in data warehousing systems. A three layered data warehouse semantic model, consisting of quantitative (numerical) summarization, qualitative (categorical) summarization, and quantifier summarization, is proposed for capturing and explicating the semantics of warehoused data. Based on the model, several algebraic operators are defined. We also extend the SQL language to allow for flexible queries against such enhanced data warehouses

    Tsunami: A Learned Multi-dimensional Index for Correlated Data and Skewed Workloads

    Full text link
    Filtering data based on predicates is one of the most fundamental operations for any modern data warehouse. Techniques to accelerate the execution of filter expressions include clustered indexes, specialized sort orders (e.g., Z-order), multi-dimensional indexes, and, for high selectivity queries, secondary indexes. However, these schemes are hard to tune and their performance is inconsistent. Recent work on learned multi-dimensional indexes has introduced the idea of automatically optimizing an index for a particular dataset and workload. However, the performance of that work suffers in the presence of correlated data and skewed query workloads, both of which are common in real applications. In this paper, we introduce Tsunami, which addresses these limitations to achieve up to 6X faster query performance and up to 8X smaller index size than existing learned multi-dimensional indexes, in addition to up to 11X faster query performance and 170X smaller index size than optimally-tuned traditional indexes
    • โ€ฆ
    corecore