52 research outputs found

    A design methodology for data warehouses

    The objective of this work is to develop a design methodology for data warehouses. It is based on the three-level modeling approach, with emphasis on conceptual modeling. Logical design for the relational model and physical tuning in this environment are also treated.

    Integrating the UB-Tree into a Database System Kernel

    Multidimensional access methods have shown high potential for significant performance improvements in various application domains.

    IDEAS-1997-2021-Final-Programs

    This document records the final program for each of the 26 meetings of the International Database Engineering and Applications Symposium (IDEAS) from 1997 through 2021. These meetings were held in various locations on three continents. Most of the papers published during these years are in the digital libraries of IEEE (1997-2007) or ACM (2008-2021).

    Query Optimization and Execution for Multi-Dimensional OLAP

    Online Analytical Processing (OLAP) is a database paradigm that supports the rich analysis of multi-dimensional data. While current OLAP tools are primarily constructed as extensions to conventional relational databases, the unique modeling and processing requirements of OLAP systems often make for a relatively awkward fit with RDBM systems in general, and their embedded string-based query languages in particular. In this thesis, we discuss the design, implementation, and evaluation of a robust multi-dimensional OLAP server. We focus on several distinct but related themes. To begin, we investigate the integration of an open source embedded storage engine with our own OLAP-specific indexing and access methods. We then present a comprehensive OLAP query algebra that allows developers to create expressive OLAP queries in native client languages such as Java. By utilizing a formal algebraic model, we are able to support an intuitive object-oriented query API, as well as a powerful query optimization and execution engine. The thesis describes both the optimization methodology and the related algorithms for the efficient execution of the associated query plans. The end result of our research is a comprehensive OLAP DBMS prototype that demonstrates new opportunities for improving the accessibility, functionality, and performance of current OLAP database management systems.

    Query Workload-Aware Index Structures for Range Searches in 1D, 2D, and High-Dimensional Spaces

    Most current database management systems are optimized for single query execution. Yet queries often arrive as part of a query workload. There is therefore a need for index structures that take into consideration the existence of multiple queries in a workload and efficiently produce accurate results for the entire workload. These index structures should scale to large amounts of data as well as large query workloads. The main objective of this dissertation is to design scalable index structures that are optimized for range query workloads. Range queries are an important type of query with wide-ranging applications. There are no existing index structures that are optimized for efficient execution of range query workloads. There are also unique challenges that need to be addressed for range queries in 1D, 2D, and high-dimensional spaces. In this work, I introduce novel cost models, index selection algorithms, and storage mechanisms that tackle these challenges and efficiently process a given range query workload in 1D, 2D, and high-dimensional spaces. In particular, I introduce the index structures HCS (for 1D spaces), cSHB (for 2D spaces), and PSLSH (for high-dimensional spaces), which are designed specifically to handle range query workloads efficiently and to address the unique challenges arising from their respective spaces. I experimentally show the effectiveness of the proposed index structures by comparing them with state-of-the-art techniques. Dissertation/Thesis. Doctoral Dissertation, Computer Science 201

    Efficient Processing of Range Queries in Main Memory

    Database systems employ index structures to accelerate search queries. In recent years, the research community has proposed many in-memory approaches that, unlike disk-based systems, optimize for cache misses rather than disk I/O and exploit the increased parallel capabilities of modern CPUs. However, these techniques mainly focus on single-key lookups and neglect equally important range queries. Range queries are a ubiquitous operator in data management, commonly used in numerous domains such as genomic analysis, sensor networks, or online analytical processing. The main goal of this dissertation is thus to improve the capabilities of main-memory database systems with regard to executing range queries. To this end, we first propose a cache-optimized, updateable main-memory index structure, the cache-sensitive skip list, which targets the execution of range queries on single database columns. Second, we study the performance of multidimensional range queries on modern hardware, where data are stored in main memory and processors support SIMD instructions and multi-threading. We re-evaluate a previous rule of thumb suggesting that, on disk-based systems, scans outperform index structures for selectivities of approximately 15-20% or more. To increase the practical relevance of our analysis, we also contribute a novel benchmark consisting of several realistic multidimensional range queries applied to real-world genomic data. Third, based on the outcomes of our experimental analysis, we devise a novel, fast, and space-efficient main-memory index structure, the BB-Tree, which supports multidimensional range and point queries and provides a parallel search operator that leverages the multi-threading capabilities of modern CPUs.

    A Survey on Spatial Indexing

    Spatial information processing has been a focus of research over the previous decade. In spatial databases, data associated with spatial coordinates and extents are retrieved based on spatial proximity. A large number of spatial indexes have been proposed to support efficient indexing and retrieval of spatial objects in large databases. The goal of this paper is to review advanced access-method techniques. The paper classifies the existing multidimensional access methods according to their type of indexing and their performance on spatial queries. K-d trees outperform quadtrees without requiring additional memory.

    Query-driven learning for automating exploratory analytics in large-scale data management systems

    As organizations collect petabytes of data, analysts spend most of their time trying to extract insights. Although data analytic systems have become extremely efficient and sophisticated, the data exploration phase is still a laborious task with high productivity, monetary, and mental costs. This dissertation presents the Query-Driven learning methodology, in which multiple systems and frameworks are introduced to address the need for more efficient methods to analyze large data sets. Countless queries are executed daily in large deployments and are often left unexploited, but we believe they are of immense value. This work describes how Machine Learning can be used to expedite the data exploration process by (a) estimating the results of aggregate queries, (b) explaining data spaces through interpretable Machine Learning models, and (c) identifying data space regions that could be of interest to the data analyst. Unlike related work in all the associated domains, the proposed solutions do not utilize any of the underlying data. Because of that, they are extremely efficient, decoupled from the underlying infrastructure, and easily adapted. This dissertation is a first account of how the Query-Driven methodology can be used effectively to expedite the data exploration process, focusing solely on extracting knowledge from queries and not from data.

    Design of Progressive Visualization Systems for Large-Scale Data Exploration

    Thesis (Ph.D.), Seoul National University Graduate School, College of Engineering, Department of Computer Science and Engineering, 2020. 2. Jinwook Seo. Understanding data through interactive visualization, also known as visual analytics, is a common and necessary practice in modern data science. However, as data sizes have increased at unprecedented rates, the computation latency of visualization systems has become a significant hurdle to visual analytics. The goal of this dissertation is to design a series of systems for progressive visual analytics (PVA), a visual analytics paradigm that can provide intermediate results during computation and allow visual exploration of these results, to address the scalability hurdle. To support the interactive exploration of data with billions of records, we first introduce SwiftTuna, an interactive visualization system with scalable visualization and computation components. Our performance benchmark demonstrates that it can handle data with four billion records, giving responsive feedback every few seconds without precomputation. Second, we present PANENE, a progressive algorithm for the Approximate k-Nearest Neighbor (AKNN) problem. PANENE brings useful machine learning methods into visual analytics, which has been challenging due to their long initial latency resulting from AKNN computation. In particular, we accelerate t-Distributed Stochastic Neighbor Embedding (t-SNE), a popular non-linear dimensionality reduction technique, which enables the responsive visualization of data with a few hundred columns. These two contributions address the scalability issues stemming from a large number of rows and columns in data, respectively. Third, from the users' perspective, we focus on improving the trustworthiness of intermediate knowledge gained from uncertain results in PVA. We propose a novel PVA concept, Progressive Visual Analytics with Safeguards, and introduce PVA-Guards, safeguards people can leave on uncertain intermediate knowledge that needs to be verified. We also present a proof-of-concept system, ProReveal, designed and developed to integrate seven safeguards into progressive data exploration. Our user study demonstrates that people not only successfully created PVA-Guards on ProReveal but also voluntarily used PVA-Guards to manage the uncertainty of their knowledge. Finally, summarizing the three studies, we discuss design challenges for progressive systems as well as future research agendas for PVA.
    Contents: Chapter 1. Introduction: 1.1 Background and Motivation; 1.2 Thesis Statement and Research Questions; 1.3 Thesis Contributions (1.3.1 Responsive and Incremental Visual Exploration of Large-scale Multidimensional Data; 1.3.2 Progressive Computation of Approximate k-Nearest Neighbors and Responsive t-SNE; 1.3.3 Progressive Visual Analytics with Safeguards); 1.4 Structure of Dissertation. Chapter 2. Related Work: 2.1 Progressive Visual Analytics (2.1.1 Definitions; 2.1.2 System Latency and Human Factors; 2.1.3 Users, Tasks, and Models; 2.1.4 Techniques, Algorithms, and Systems; 2.1.5 Uncertainty Visualization); 2.2 Approaches for Scalable Visualization Systems; 2.3 The k-Nearest Neighbor (KNN) Problem; 2.4 t-Distributed Stochastic Neighbor Embedding. Chapter 3. SwiftTuna: Responsive and Incremental Visual Exploration of Large-scale Multidimensional Data: 3.1 The SwiftTuna Design (3.1.1 Design Considerations; 3.1.2 System Overview; 3.1.3 Scalable Visualization Components; 3.1.4 Visualization Cards; 3.1.5 User Interface and Interaction); 3.2 Responsive Querying (3.2.1 Querying Pipeline; 3.2.2 Prompt Responses; 3.2.3 Incremental Processing); 3.3 Evaluation: Performance Benchmark (3.3.1 Study Design; 3.3.2 Results and Discussion); 3.4 Implementation; 3.5 Summary. Chapter 4. PANENE: A Progressive Algorithm for Indexing and Querying Approximate k-Nearest Neighbors: 4.1 Approximate k-Nearest Neighbor (4.1.1 A Sequential Algorithm; 4.1.2 An Online Algorithm; 4.1.3 A Progressive Algorithm; 4.1.4 Filtered AKNN Search); 4.2 k-Nearest Neighbor Lookup Table; 4.3 Benchmark (4.3.1 Online and Progressive k-d Trees; 4.3.2 k-Nearest Neighbor Lookup Tables); 4.4 Applications (4.4.1 Progressive Regression and Density Estimation; 4.4.2 Responsive t-SNE); 4.5 Implementation; 4.6 Discussion; 4.7 Summary. Chapter 5. ProReveal: Progressive Visual Analytics with Safeguards: 5.1 Progressive Visual Analytics with Safeguards (5.1.1 Definition; 5.1.2 Examples; 5.1.3 Design Considerations); 5.2 ProReveal; 5.3 Evaluation; 5.4 Discussion; 5.5 Summary. Chapter 6. Discussion: 6.1 Lessons Learned; 6.2 Limitations. Chapter 7. Conclusion: 7.1 Thesis Contributions Revisited; 7.2 Future Research Agenda; 7.3 Final Remarks. Abstract (Korean). Acknowledgments (Korean). Docto

    Partial Replica Location And Selection For Spatial Datasets

    As the size of scientific datasets continues to grow, we will not be able to store enormous datasets on a single grid node, but must distribute them across many grid nodes. The implementation of partial or incomplete replicas, which represent only a subset of a larger dataset, has been an active topic of research. Partial Spatial Replicas extend this functionality to spatial data, allowing us to distribute a spatial dataset in pieces over several locations. We investigate solutions to the partial spatial replica selection problems. First, we describe and develop two designs for a Spatial Replica Location Service (SRLS), which must return the set of replicas that intersect with a query region. Integrating a relational database, a spatial data structure, and grid computing software, we build a scalable solution that works well even for several million replicas. In our SRLS, we have improved performance by designing an R-tree structure in the backend database and by aggregating several queries into one larger query, which reduces overhead. We also use the Morton space-filling curve during R-tree construction, which improves spatial locality. In addition, we describe R-tree Prefetching (RTP), which effectively utilizes modern multi-processor architectures. Second, we present and implement a fast replica selection algorithm in which a set of partial replicas is chosen from a set of candidates so that retrieval performance is maximized. Using an R-tree based heuristic algorithm, we achieve O(n log n) complexity for this NP-complete problem. We describe a model for disk access performance that takes filesystem prefetching into account and is sufficiently accurate for spatial replica selection. Making a few simplifying assumptions, we present a fast replica selection algorithm for partial spatial replicas. The algorithm uses a greedy approach that attempts to maximize performance by choosing a collection of replica subsets that allow fast data retrieval by a client machine. Experiments show that the performance of the solution found by our algorithm is, on average, at least 91% and 93.4% of the performance of the optimal solution in 4-node and 8-node tests, respectively.
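
    The abstract above mentions ordering data along the Morton (Z-order) space-filling curve during R-tree construction. The following is only an illustrative sketch of that general idea, not the authors' implementation. It assumes 2D rectangles given as (xmin, ymin, xmax, ymax) on a non-negative integer grid with coordinates below 2^16, and the function names part1by1, morton2d, and sort_by_morton are hypothetical.

```python
def part1by1(n: int) -> int:
    """Spread the low 16 bits of n so a zero bit sits between each original bit."""
    n &= 0x0000FFFF
    n = (n | (n << 8)) & 0x00FF00FF
    n = (n | (n << 4)) & 0x0F0F0F0F
    n = (n | (n << 2)) & 0x33333333
    n = (n | (n << 1)) & 0x55555555
    return n


def morton2d(x: int, y: int) -> int:
    """Interleave the bits of x (even positions) and y (odd positions)."""
    return part1by1(x) | (part1by1(y) << 1)


def sort_by_morton(rects):
    """Order rectangles by the Morton code of their (integer) center point.

    Assumption: sorting by center is one common choice before packing
    consecutive entries into R-tree leaves; the source does not say which
    point of each replica's extent is used.
    """
    def key(r):
        cx = (r[0] + r[2]) // 2
        cy = (r[1] + r[3]) // 2
        return morton2d(cx, cy)

    return sorted(rects, key=key)
```

    Rectangles that are adjacent in the sorted order tend to be close in space, so grouping consecutive entries into the same leaf node is one way such an ordering can improve spatial locality.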