338 research outputs found

    Approximate Data Analytics Systems

    Today, most modern online services use big data analytics systems to extract useful information from raw digital data. The data normally arrives as a continuous stream, at high speed and in huge volumes, and the cost of handling this massive data can be significant. Providing interactive latency when processing it is often impractical, because the data grows exponentially, faster even than Moore's law predicts. To overcome this problem, approximate computing has recently emerged as a promising solution. Approximate computing is based on the observation that many modern applications are amenable to an approximate, rather than exact, output. Unlike traditional computing, approximate computing tolerates lower accuracy to achieve lower latency by computing over a partial subset of the input data instead of the entire input. Unfortunately, advances in approximate computing are primarily geared towards batch analytics and cannot provide low-latency guarantees for stream processing, where new data continuously arrives as an unbounded stream. In this thesis, we design and implement approximate computing techniques for processing and interacting with high-speed, large-scale stream data to achieve low latency and efficient resource utilization. To achieve these goals, we designed and built the following approximate data analytics systems:
    • StreamApprox: a data stream analytics system for approximate computing. It supports approximate computing for low-latency stream analytics transparently and can adapt to rapid fluctuations in input data streams. For this system we designed an online adaptive stratified reservoir sampling algorithm that produces approximate output with bounded error.
    • IncApprox: a data analytics system for incremental approximate computing. It combines approximate and incremental computing in stream processing to achieve high throughput and low latency with efficient resource utilization. For this system we designed an online stratified sampling algorithm that uses self-adjusting computation to produce an incrementally updated approximate output with bounded error.
    • PrivApprox: a data stream analytics system for privacy-preserving approximate computing. It supports high-utility, low-latency data analytics while preserving users' privacy, by combining privacy-preserving data analytics with approximate computing.
    • ApproxJoin: an approximate distributed join system. It improves the performance of joins, critical but expensive operations in big data systems. Here we employed a sketching technique (Bloom filters) to avoid shuffling non-joinable data items over the network, and we proposed a novel sampling mechanism that executes during the join to obtain an unbiased, representative sample of the join output.
    Our evaluation, based on micro-benchmarks and real-world case studies, shows that these systems achieve significant performance speedups over state-of-the-art systems while tolerating negligible accuracy loss in the analytics output. In addition, our systems let users systematically trade off accuracy against throughput/latency, and they require no or only minor modifications to existing applications.
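    At the heart of StreamApprox is its online adaptive stratified reservoir sampling algorithm. As a rough sketch of the underlying idea only (the thesis's algorithm is additionally online-adaptive with formal error bounds, which this toy version omits), stratified reservoir sampling keeps one classic Algorithm-R reservoir per substream and scales per-stratum sample means by the observed counts:

```python
import random

class StratifiedReservoirSampler:
    """Keeps an independent reservoir sample per stratum (substream).

    A simplified illustration of stratified reservoir sampling; the
    StreamApprox algorithm is additionally online-adaptive, which this
    sketch omits.
    """

    def __init__(self, per_stratum_size, seed=None):
        self.k = per_stratum_size          # reservoir capacity per stratum
        self.reservoirs = {}               # stratum -> sampled items
        self.seen = {}                     # stratum -> items observed so far
        self.rng = random.Random(seed)

    def insert(self, stratum, item):
        res = self.reservoirs.setdefault(stratum, [])
        n = self.seen.get(stratum, 0) + 1
        self.seen[stratum] = n
        if len(res) < self.k:
            res.append(item)               # fill phase
        else:
            j = self.rng.randrange(n)      # classic Algorithm R replacement
            if j < self.k:
                res[j] = item

    def estimate_sum(self):
        # Scale each stratum's sample mean by its true item count.
        total = 0.0
        for s, res in self.reservoirs.items():
            if res:
                total += self.seen[s] * (sum(res) / len(res))
        return total
```

    Keeping a reservoir per stratum prevents a high-volume substream from crowding out a low-volume one, which is what makes such estimates robust when input rates fluctuate.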

    Efficient Implementation of Extended Learned Bloom Filter

    Master's thesis, Seoul National University Graduate School, College of Engineering, Dept. of Computer Science and Engineering, August 2021 (advisor: Hyoung-Joo Kim). Existing data structures aim for constant performance regardless of the data distribution. However, research under the name of learned indexes is showing that performance can be improved by exploiting the data distribution. In this study, we focus on extending and implementing the learned Bloom filter, one type of learned index. We call the result an extended learned Bloom filter: a data structure that adds a learned hash function to the structure of the learned Bloom filter. Experiments show that its false positive rate improves over the original learned Bloom filter when the ratio between the learned hash function and the auxiliary filter is tuned via the hyperparameter α. Additionally, we introduce a model precision problem that arose while implementing the extended learned Bloom filter and show that it can be solved by using 64-bit floating point. Finally, we show that the performance of the extended learned Bloom filter can be improved by tuning the model, and we examine how the learned hash function contributes to this improvement.
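    For context, a plain (non-extended) learned Bloom filter composes a score model with an auxiliary filter that catches the model's false negatives. The sketch below uses a caller-supplied scoring callable as a stand-in for a trained model; the learned hash function and the hyperparameter α of the extended variant are not modeled here:

```python
import hashlib

class BloomFilter:
    """Plain Bloom filter, used as the auxiliary (backup) filter."""
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _hashes(self, key):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for h in self._hashes(key):
            self.bits[h] = 1

    def __contains__(self, key):
        return all(self.bits[h] for h in self._hashes(key))

class LearnedBloomFilter:
    """Learned Bloom filter: a score model plus a backup filter holding the
    model's false negatives. `model` is any callable returning a score in
    [0, 1]; here it is a stand-in, not a trained network."""
    def __init__(self, model, threshold, keys, backup_bits=8192, backup_hashes=4):
        self.model = model
        self.tau = threshold
        self.backup = BloomFilter(backup_bits, backup_hashes)
        for key in keys:
            if model(key) < threshold:     # model would miss this key,
                self.backup.add(key)       # so the backup filter must hold it

    def __contains__(self, key):
        # No false negatives: every stored key passes one of the two checks.
        return self.model(key) >= self.tau or key in self.backup
```

    False positives come from two sources, model-side (scores above τ for non-keys) and backup-filter-side, which is why tuning the split between the learned component and the auxiliary filter matters.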

    Bridging the gap between algorithmic and learned index structures

    Index structures such as B-trees and Bloom filters are the well-established petrol engines of database systems. However, these structures do not fully exploit patterns in the data distribution. To address this, researchers have suggested using machine learning models as electric engines that entirely replace index structures. Such a paradigm shift in data system design, however, opens many unsolved design challenges: more research is needed to understand the theoretical guarantees and to design efficient support for insertion and deletion. In this thesis, we adopt a different position: index algorithms are good enough, and instead of going back to the drawing board to fit data systems with learned models, we should develop lightweight hybrid engines that build on the benefits of both algorithmic and learned index structures. The indexes we suggest provide the theoretical performance guarantees and updatability of algorithmic indexes while using position-prediction models to leverage the data distribution and thereby improve the performance of the index structure. We investigate the potential for minimal modifications to algorithmic indexes so that they can leverage the data distribution in the way learned indexes do. In this regard, we propose and explore the use of helping models that boost classical index performance using techniques from machine learning. Our approach inherits performance guarantees from its algorithmic baseline index while considering the data distribution to improve performance considerably. We study single-dimensional range indexes, spatial indexes, and stream indexing, and show that the suggested approach yields range indexes that outperform the algorithmic indexes and have performance comparable to read-only, fully learned indexes, and hence can reliably serve as a default index structure in a database engine.
    Besides, we consider the updatability of the indexes and suggest solutions for updating them, notably when the data distribution changes drastically over time (e.g., when indexing data streams). In particular, we propose a specific learning-augmented index for indexing a sliding window with timestamps in a data stream. Additionally, we highlight the limitations of learned indexes for low-latency lookup on real-world data distributions. To tackle this issue, we suggest adding an algorithmic enhancement layer to a learned model to correct the prediction error at a small memory latency. This approach enables efficient modelling of the data distribution and resolves the local biases of a learned model at the cost of roughly one memory lookup.
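    The helping-model idea can be illustrated with a least-squares line that predicts a key's rank in a sorted array, combined with an algorithmic correction step: a binary search confined to the model's worst-case error window. This is a toy sketch of learning-augmented indexing under those assumptions, not the thesis's proposed structures:

```python
import bisect

class HybridIndex:
    """Sorted-array index: a linear position-prediction model plus a bounded
    binary search that corrects the prediction. The 'model' here is just a
    least-squares line fitted to (key, rank) pairs."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        # Fit rank ~ a*key + b by least squares (the helping model).
        mean_k = sum(self.keys) / n
        mean_r = (n - 1) / 2
        var = sum((k - mean_k) ** 2 for k in self.keys)
        cov = sum((k - mean_k) * (i - mean_r) for i, k in enumerate(self.keys))
        self.a = cov / var if var else 0.0
        self.b = mean_r - self.a * mean_k
        # Record the worst-case prediction error: the correction bound.
        self.err = max(abs(self._predict(k) - i) for i, k in enumerate(self.keys))

    def _predict(self, key):
        return min(max(int(round(self.a * key + self.b)), 0), len(self.keys) - 1)

    def lookup(self, key):
        p = self._predict(key)
        lo = max(0, p - self.err)
        hi = min(len(self.keys), p + self.err + 1)
        i = bisect.bisect_left(self.keys, key, lo, hi)  # search the error window only
        return i if i < len(self.keys) and self.keys[i] == key else None
```

    Because the search window always contains the true position of any stored key, lookups are never wrong; the model only shrinks the region the binary search has to inspect, which is exactly the guarantees-plus-distribution trade-off described above.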

    Weiterentwicklung analytischer Datenbanksysteme (Advancing Analytical Database Systems)

    This thesis contributes to the state of the art in analytical database systems. First, we identify and explore extensions to better support analytics on event streams. Second, we propose a novel polygon index to enable efficient geospatial data processing in main memory. Third, we contribute a new deep learning approach to cardinality estimation, the core problem in cost-based query optimization.

    Fast Succinct Retrieval and Approximate Membership Using Ribbon

    A retrieval data structure for a static function f: S → {0,1}^r supports queries that return f(x) for any x ∈ S. Retrieval data structures can be used to implement a static approximate membership query data structure (AMQ), i.e., a Bloom filter alternative, with false positive rate 2^{-r}. The information-theoretic lower bound for both tasks is r|S| bits. While succinct theoretical constructions using (1+o(1))r|S| bits were known, these could not achieve very small overheads in practice, either because they hide an unfavorable space-time tradeoff in the asymptotic costs or because small overheads would only be reached for physically impossible input sizes. With bumped ribbon retrieval (BuRR), we present the first practical succinct retrieval data structure. In an extensive experimental evaluation, BuRR achieves space overheads well below 1% while being faster than most previously used retrieval data structures (which typically have space overheads at least an order of magnitude larger) and faster than classical Bloom filters (with space overhead ≥ 44%). This efficiency, including favorable constants, stems from a combination of simplicity, word parallelism, and high locality. We additionally describe homogeneous ribbon filter AMQs, which are even simpler and faster at the price of slightly larger space overhead.
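    BuRR itself solves banded linear systems over GF(2); as a simpler stand-in that exposes the same retrieval interface and the retrieval-to-AMQ reduction described above, the sketch below builds a peeling-based 3-wise XOR retrieval structure (about 23% space overhead, so far from succinct). The class names are illustrative, not from the paper:

```python
import hashlib

def _digest(seed, tag, key):
    d = hashlib.sha256(f"{seed}:{tag}:{key}".encode()).digest()
    return int.from_bytes(d[:8], "big")

class XorRetrieval:
    """3-wise XOR retrieval: each key owns three cells (one per block) whose
    XOR equals the stored value. Built by hypergraph peeling."""

    def __init__(self, mapping, seed=0):
        self.seed = seed
        self.s = int(0.41 * len(mapping)) + 8    # cells per block
        self.m = 3 * self.s
        while True:                              # retry peeling with a new seed
            order = self._peel(list(mapping))
            if order is not None:
                break
            self.seed += 1
        self.cells = [0] * self.m
        for key, c in reversed(order):           # assign in reverse peel order
            others = 0
            for c2 in self._cells_of(key):
                if c2 != c:
                    others ^= self.cells[c2]
            self.cells[c] = others ^ mapping[key]

    def _cells_of(self, key):
        # one cell per block, so a key's three cells are always distinct
        return [i * self.s + _digest(self.seed, i, key) % self.s for i in range(3)]

    def _peel(self, keys):
        cell_keys = [set() for _ in range(self.m)]
        for k in keys:
            for c in self._cells_of(k):
                cell_keys[c].add(k)
        order, stack = [], [c for c in range(self.m) if len(cell_keys[c]) == 1]
        while stack:
            c = stack.pop()
            if len(cell_keys[c]) != 1:
                continue
            (k,) = cell_keys[c]
            order.append((k, c))
            for c2 in self._cells_of(k):
                cell_keys[c2].discard(k)
                if len(cell_keys[c2]) == 1:
                    stack.append(c2)
        return order if len(order) == len(keys) else None

    def get(self, key):
        acc = 0
        for c in self._cells_of(key):
            acc ^= self.cells[c]
        return acc

class RetrievalAMQ:
    """AMQ from retrieval: store an r-bit fingerprint per key; a query is a
    member iff the retrieved value matches its own fingerprint, giving a
    false positive rate of roughly 2^-r."""

    def __init__(self, keys, r=8):
        self.r = r
        self.table = XorRetrieval({k: self._fp(k) for k in keys})

    def _fp(self, key):
        return _digest(10**9, "fp", key) & ((1 << self.r) - 1)

    def __contains__(self, key):
        return self.table.get(key) == self._fp(key)
```

    For x ∈ S the retrieved value is exact, so there are no false negatives; for x ∉ S the retrieved value is effectively random, matching the fingerprint with probability about 2^{-r}, which is the reduction the abstract refers to.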