1,120 research outputs found

    Bridging the gap between algorithmic and learned index structures

    Get PDF
    Index structures such as B-trees and bloom filters are the well-established petrol engines of database systems. However, these structures do not fully exploit patterns in data distribution. To address this, researchers have suggested using machine learning models as electric engines that can entirely replace index structures. Such a paradigm shift in data system design, however, opens many unsolved design challenges. More research is needed to understand the theoretical guarantees and design efficient support for insertion and deletion. In this thesis, we adopt a different position: index algorithms are good enough, and instead of going back to the drawing board to fit data systems with learned models, we should develop lightweight hybrid engines that build on the benefits of both algorithmic and learned index structures. The indexes that we suggest provide the theoretical performance guarantees and updatability of algorithmic indexes while using position prediction models to leverage the data distributions and thereby improve the performance of the index structure. We investigate the potential for minimal modifications to algorithmic indexes such that they can leverage data distribution similar to how learned indexes work. In this regard, we propose and explore the use of helping models that boost classical index performance using techniques from machine learning. Our suggested approach inherits performance guarantees from its algorithmic baseline index, but at the same time it considers the data distribution to improve performance considerably. We study single-dimensional range indexes, spatial indexes, and stream indexing, and show that the suggested approach results in range indexes that outperform the algorithmic indexes and have comparable performance to the read-only, fully learned indexes and hence can be reliably used as a default index structure in a database engine. Besides, we consider the updatability of the indexes and suggest solutions for updating the index, notably when the data distribution drastically changes over time (e.g., for indexing data streams). In particular, we propose a specific learning-augmented index for indexing a sliding window with timestamps in a data stream. Additionally, we highlight the limitations of learned indexes for low-latency lookup on real- world data distributions. To tackle this issue, we suggest adding an algorithmic enhancement layer to a learned model to correct the prediction error with a small memory latency. This approach enables efficient modelling of the data distribution and resolves the local biases of a learned model at the cost of roughly one memory lookup.Open Acces

    Feature Study on a Programmable Network Traffic Classifier

    Get PDF

    λŒ€μš©λŸ‰ 데이터 탐색을 μœ„ν•œ 점진적 μ‹œκ°ν™” μ‹œμŠ€ν…œ 섀계

    Get PDF
    ν•™μœ„λ…Όλ¬Έ(박사)--μ„œμšΈλŒ€ν•™κ΅ λŒ€ν•™μ› :κ³΅κ³ΌλŒ€ν•™ 컴퓨터곡학뢀,2020. 2. μ„œμ§„μš±.Understanding data through interactive visualization, also known as visual analytics, is a common and necessary practice in modern data science. However, as data sizes have increased at unprecedented rates, the computation latency of visualization systems becomes a significant hurdle to visual analytics. The goal of this dissertation is to design a series of systems for progressive visual analytics (PVA)β€”a visual analytics paradigm that can provide intermediate results during computation and allow visual exploration of these resultsβ€”to address the scalability hurdle. To support the interactive exploration of data with billions of records, we first introduce SwiftTuna, an interactive visualization system with scalable visualization and computation components. Our performance benchmark demonstrates that it can handle data with four billion records, giving responsive feedback every few seconds without precomputation. Second, we present PANENE, a progressive algorithm for the Approximate k-Nearest Neighbor (AKNN) problem. PANENE brings useful machine learning methods into visual analytics, which has been challenging due to their long initial latency resulting from AKNN computation. In particular, we accelerate t-Distributed Stochastic Neighbor Embedding (t-SNE), a popular non-linear dimensionality reduction technique, which enables the responsive visualization of data with a few hundred columns. Each of these two contributions aims to address the scalability issues stemming from a large number of rows or columns in data, respectively. Third, from the users' perspective, we focus on improving the trustworthiness of intermediate knowledge gained from uncertain results in PVA. We propose a novel PVA concept, Progressive Visual Analytics with Safeguards, and introduce PVA-Guards, safeguards people can leave on uncertain intermediate knowledge that needs to be verified. We also present a proof-of-concept system, ProReveal, designed and developed to integrate seven safeguards into progressive data exploration. Our user study demonstrates that people not only successfully created PVA-Guards on ProReveal but also voluntarily used PVA-Guards to manage the uncertainty of their knowledge. Finally, summarizing the three studies, we discuss design challenges for progressive systems as well as future research agendas for PVA.ν˜„λŒ€ 데이터 μ‚¬μ΄μ–ΈμŠ€μ—μ„œ μΈν„°λž™ν‹°λΈŒν•œ μ‹œκ°ν™”λ₯Ό 톡해 데이터λ₯Ό μ΄ν•΄ν•˜λŠ” 것은 ν•„μˆ˜μ μΈ 뢄석 방법 쀑 ν•˜λ‚˜μ΄λ‹€. κ·ΈλŸ¬λ‚˜, 졜근 λ°μ΄ν„°μ˜ 크기가 폭발적으둜 μ¦κ°€ν•˜λ©΄μ„œ 데이터 크기둜 인해 λ°œμƒν•˜λŠ” 지연 μ‹œκ°„μ΄ μΈν„°λž™ν‹°λΈŒν•œ μ‹œκ°μ  뢄석에 큰 걸림돌이 λ˜μ—ˆλ‹€. λ³Έ μ—°κ΅¬μ—μ„œλŠ” μ΄λŸ¬ν•œ ν™•μž₯μ„± 문제λ₯Ό ν•΄κ²°ν•˜κΈ° μœ„ν•΄ 점진적 μ‹œκ°μ  뢄석(Progressive Visual Analytics)을 μ§€μ›ν•˜λŠ” 일련의 μ‹œμŠ€ν…œμ„ λ””μžμΈν•˜κ³  κ°œλ°œν•œλ‹€. μ΄λŸ¬ν•œ 점진적 μ‹œκ°μ  뢄석 μ‹œμŠ€ν…œμ€ 데이터 μ²˜λ¦¬κ°€ μ™„μ „νžˆ λλ‚˜μ§€ μ•Šλ”λΌλ„ 쀑간 뢄석 κ²°κ³Όλ₯Ό μ‚¬μš©μžμ—κ²Œ μ œκ³΅ν•¨μœΌλ‘œμ¨ λ°μ΄ν„°μ˜ 크기둜 인해 λ°œμƒν•˜λŠ” 지연 μ‹œκ°„ 문제λ₯Ό μ™„ν™”ν•  수 μžˆλ‹€. 첫째둜, μˆ˜μ‹­μ–΅ 건의 행을 κ°€μ§€λŠ” 데이터λ₯Ό μ‹œκ°μ μœΌλ‘œ 탐색할 수 μžˆλŠ” SwiftTuna μ‹œμŠ€ν…œμ„ μ œμ•ˆν•œλ‹€. 데이터 처리 및 μ‹œκ°μ  ν‘œν˜„μ˜ ν™•μž₯성을 λͺ©ν‘œλ‘œ 개발된 이 μ‹œμŠ€ν…œμ€, μ•½ 40μ–΅ 건의 행을 가진 데이터에 λŒ€ν•œ μ‹œκ°ν™”λ₯Ό μ „μ²˜λ¦¬ 없이 수 μ΄ˆλ§ˆλ‹€ μ—…λ°μ΄νŠΈν•  수 μžˆλŠ” κ²ƒμœΌλ‘œ λ‚˜νƒ€λ‚¬λ‹€. λ‘˜μ§Έλ‘œ, 근사적 k-μ΅œκ·Όμ ‘μ (Approximate k-Nearest Neighbor) 문제λ₯Ό μ μ§„μ μœΌλ‘œ κ³„μ‚°ν•˜λŠ” PANENE μ•Œκ³ λ¦¬μ¦˜μ„ μ œμ•ˆν•œλ‹€. 근사적 k-μ΅œκ·Όμ ‘μ  λ¬Έμ œλŠ” μ—¬λŸ¬ 기계 ν•™μŠ΅ κΈ°λ²•μ—μ„œ μ“°μž„μ—λ„ λΆˆκ΅¬ν•˜κ³  초기 계산 μ‹œκ°„μ΄ κΈΈμ–΄μ„œ μΈν„°λž™ν‹°λΈŒν•œ μ‹œμŠ€ν…œμ— μ μš©ν•˜κΈ° νž˜λ“  ν•œκ³„κ°€ μžˆμ—ˆλ‹€. PANENE μ•Œκ³ λ¦¬μ¦˜μ€ μ΄λŸ¬ν•œ κΈ΄ 초기 계산 μ‹œκ°„μ„ 획기적으둜 κ°œμ„ ν•˜μ—¬ λ‹€μ–‘ν•œ 기계 ν•™μŠ΅ 기법을 μ‹œκ°μ  뢄석에 ν™œμš©ν•  수 μžˆλ„λ‘ ν•œλ‹€. 특히, μœ μš©ν•œ λΉ„μ„ ν˜•μ  차원 κ°μ†Œ 기법인 t-뢄포 ν™•λ₯ μ  μž„λ² λ”©(t-Distributed Stochastic Neighbor Embedding)을 κ°€μ†ν•˜μ—¬ 수백 개의 차원을 κ°€μ§€λŠ” 데이터λ₯Ό λΉ λ₯Έ μ‹œκ°„ 내에 μ‚¬μ˜ν•  수 μžˆλ‹€. μœ„μ˜ 두 μ‹œμŠ€ν…œκ³Ό μ•Œκ³ λ¦¬μ¦˜μ΄ λ°μ΄ν„°μ˜ ν–‰ λ˜λŠ” μ—΄μ˜ 개수둜 μΈν•œ ν™•μž₯μ„± 문제λ₯Ό ν•΄κ²°ν•˜κ³ μž ν–ˆλ‹€λ©΄, μ„Έ 번째 μ‹œμŠ€ν…œμ—μ„œλŠ” 점진적 μ‹œκ°μ  λΆ„μ„μ˜ 신뒰도 문제λ₯Ό κ°œμ„ ν•˜κ³ μž ν•œλ‹€. 점진적 μ‹œκ°μ  λΆ„μ„μ—μ„œ μ‚¬μš©μžμ—κ²Œ μ£Όμ–΄μ§€λŠ” 쀑간 계산 κ²°κ³ΌλŠ” μ΅œμ’… 결과의 κ·Όμ‚¬μΉ˜μ΄λ―€λ‘œ λΆˆν™•μ‹€μ„±μ΄ μ‘΄μž¬ν•œλ‹€. λ³Έ μ—°κ΅¬μ—μ„œλŠ” μ„Έμ΄ν”„κ°€λ“œλ₯Ό μ΄μš©ν•œ 점진적 μ‹œκ°μ  뢄석(Progressive Visual Analytics with Safeguards)μ΄λΌλŠ” μƒˆλ‘œμš΄ κ°œλ…μ„ μ œμ•ˆν•œλ‹€. 이 κ°œλ…μ€ μ‚¬μš©μžκ°€ 점진적 νƒμƒ‰μ—μ„œ λ§ˆμ£Όν•˜λŠ” λΆˆν™•μ‹€ν•œ 쀑간 지식에 μ„Έμ΄ν”„κ°€λ“œλ₯Ό 남길 수 μžˆλ„λ‘ ν•˜μ—¬ νƒμƒ‰μ—μ„œ 얻은 μ§€μ‹μ˜ 정확도λ₯Ό μΆ”ν›„ 검증할 수 μžˆλ„λ‘ ν•œλ‹€. λ˜ν•œ, μ΄λŸ¬ν•œ κ°œλ…μ„ μ‹€μ œλ‘œ κ΅¬ν˜„ν•˜μ—¬ νƒ‘μž¬ν•œ ProReveal μ‹œμŠ€ν…œμ„ μ†Œκ°œν•œλ‹€. ProRevealλ₯Ό μ΄μš©ν•œ μ‚¬μš©μž μ‹€ν—˜μ—μ„œ μ‚¬μš©μžλ“€μ€ μ„Έμ΄ν”„κ°€λ“œλ₯Ό μ„±κ³΅μ μœΌλ‘œ λ§Œλ“€ 수 μžˆμ—ˆμ„ 뿐만 μ•„λ‹ˆλΌ, 쀑간 μ§€μ‹μ˜ λΆˆν™•μ‹€μ„±μ„ 닀루기 μœ„ν•΄ μ„Έμ΄ν”„κ°€λ“œλ₯Ό 자발적으둜 μ΄μš©ν•œλ‹€λŠ” 것을 μ•Œ 수 μžˆμ—ˆλ‹€. λ§ˆμ§€λ§‰μœΌλ‘œ, μœ„ μ„Έ 가지 μ—°κ΅¬μ˜ κ²°κ³Όλ₯Ό μ’…ν•©ν•˜μ—¬ 점진적 μ‹œκ°μ  뢄석 μ‹œμŠ€ν…œμ„ κ΅¬ν˜„ν•  λ•Œμ˜ λ””μžμΈμ  λ‚œμ œμ™€ ν–₯ν›„ 연ꡬ λ°©ν–₯을 λͺ¨μƒ‰ν•œλ‹€.CHAPTER1. Introduction 2 1.1 Background and Motivation 2 1.2 Thesis Statement and Research Questions 5 1.3 Thesis Contributions 5 1.3.1 Responsive and Incremental Visual Exploration of Large-scale Multidimensional Data 6 1.3.2 ProgressiveComputation of Approximate k-Nearest Neighbors and Responsive t-SNE 7 1.3.3 Progressive Visual Analytics with Safeguards 8 1.4 Structure of Dissertation 9 CHAPTER2. Related Work 11 2.1 Progressive Visual Analytics 11 2.1.1 Definitions 11 2.1.2 System Latency and Human Factors 13 2.1.3 Users, Tasks, and Models 15 2.1.4 Techniques, Algorithms, and Systems. 17 2.1.5 Uncertainty Visualization 19 2.2 Approaches for Scalable Visualization Systems 20 2.3 The k-Nearest Neighbor (KNN) Problem 22 2.4 t-Distributed Stochastic Neighbor Embedding 26 CHAPTER3. SwiTuna: Responsive and Incremental Visual Exploration of Large-scale Multidimensional Data 28 3.1 The SwiTuna Design 31 3.1.1 Design Considerations 32 3.1.2 System Overview 33 3.1.3 Scalable Visualization Components 36 3.1.4 Visualization Cards 40 3.1.5 User Interface and Interaction 42 3.2 Responsive Querying 44 3.2.1 Querying Pipeline 44 3.2.2 Prompt Responses 47 3.2.3 Incremental Processing 47 3.3 Evaluation: Performance Benchmark 49 3.3.1 Study Design 49 3.3.2 Results and Discussion 52 3.4 Implementation 56 3.5 Summary 56 CHAPTER4. PANENE:AProgressive Algorithm for IndexingandQuerying Approximate k-Nearest Neighbors 58 4.1 Approximate k-Nearest Neighbor 61 4.1.1 A Sequential Algorithm 62 4.1.2 An Online Algorithm 63 4.1.3 A Progressive Algorithm 66 4.1.4 Filtered AKNN Search 71 4.2 k-Nearest Neighbor Lookup Table 72 4.3 Benchmark. 78 4.3.1 Online and Progressive k-d Trees 78 4.3.2 k-Nearest Neighbor Lookup Tables 83 4.4 Applications 85 4.4.1 Progressive Regression and Density Estimation 85 4.4.2 Responsive t-SNE 87 4.5 Implementation 92 4.6 Discussion 92 4.7 Summary 93 CHAPTER5. ProReveal: Progressive Visual Analytics with Safeguards 95 5.1 Progressive Visual Analytics with Safeguards 98 5.1.1 Definition 98 5.1.2 Examples 101 5.1.3 Design Considerations 103 5.2 ProReveal 105 5.3 Evaluation 121 5.4 Discussion 127 5.5 Summary 130 CHAPTER6. Discussion 132 6.1 Lessons Learned 132 6.2 Limitations 135 CHAPTER7. Conclusion 137 7.1 Thesis Contributions Revisited 137 7.2 Future Research Agenda 139 7.3 Final Remarks 141 Abstract (Korean) 155 Acknowledgments (Korean) 157Docto
    • …
    corecore