338 research outputs found

    Approximate Data Analytics Systems

    Today, most modern online services use big data analytics systems to extract useful information from raw digital data. The data normally arrives as a continuous stream, at high speed and in huge volumes, and the cost of handling this massive data can be significant. Providing interactive latency when processing it is often impractical, because the data grows exponentially, faster even than Moore's law predicts. To overcome this problem, approximate computing has recently emerged as a promising solution. Approximate computing is based on the observation that many modern applications are amenable to an approximate, rather than exact, output. Unlike traditional computing, approximate computing tolerates lower accuracy to achieve lower latency by computing over a partial subset of the input data instead of the entire input. Unfortunately, advances in approximate computing are primarily geared towards batch analytics and cannot provide low-latency guarantees for stream processing, where new data continuously arrives as an unbounded stream. In this thesis, we design and implement approximate computing techniques for processing and interacting with high-speed, large-scale stream data to achieve low latency and efficient resource utilization. To achieve these goals, we designed and built the following approximate data analytics systems:
    • StreamApprox: a data stream analytics system for approximate computing. It supports approximate computing for low-latency stream analytics transparently and can adapt to rapid fluctuations in input data streams. For this system we designed an online adaptive stratified reservoir sampling algorithm that produces approximate output with bounded error.
    • IncApprox: a data analytics system for incremental approximate computing. It combines approximate and incremental computing in stream processing to achieve high throughput and low latency with efficient resource utilization. For this system we designed an online stratified sampling algorithm that uses self-adjusting computation to produce an incrementally updated approximate output with bounded error.
    • PrivApprox: a data stream analytics system for privacy-preserving approximate computing. It supports high-utility, low-latency data analytics while preserving users' privacy, by combining privacy-preserving data analytics with approximate computing.
    • ApproxJoin: an approximate distributed join system. It improves the performance of joins, critical but expensive operations in big data systems. Here we employed a sketching technique (Bloom filters) to avoid shuffling non-joinable data items over the network, and we proposed a novel sampling mechanism that executes during the join to obtain an unbiased, representative sample of the join output.
    Our evaluation, based on micro-benchmarks and real-world case studies, shows that these systems achieve significant performance speedups over state-of-the-art systems while tolerating negligible accuracy loss in the analytics output. In addition, our systems let users systematically trade off accuracy against throughput/latency, and they require no or only minor modifications to existing applications.
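    At the heart of StreamApprox is its online adaptive stratified reservoir sampling algorithm. As a rough sketch of the underlying idea only (the thesis's algorithm is additionally online-adaptive with formal error bounds, which this toy version omits), stratified reservoir sampling keeps one classic Algorithm-R reservoir per substream and scales per-stratum sample means by the observed counts:

```python
import random

class StratifiedReservoirSampler:
    """Keeps an independent reservoir sample per stratum (substream).

    A simplified illustration of stratified reservoir sampling; the
    StreamApprox algorithm is additionally online-adaptive, which this
    sketch omits.
    """

    def __init__(self, per_stratum_size, seed=None):
        self.k = per_stratum_size          # reservoir capacity per stratum
        self.reservoirs = {}               # stratum -> sampled items
        self.seen = {}                     # stratum -> items observed so far
        self.rng = random.Random(seed)

    def insert(self, stratum, item):
        res = self.reservoirs.setdefault(stratum, [])
        n = self.seen.get(stratum, 0) + 1
        self.seen[stratum] = n
        if len(res) < self.k:
            res.append(item)               # fill phase
        else:
            j = self.rng.randrange(n)      # classic Algorithm R replacement
            if j < self.k:
                res[j] = item

    def estimate_sum(self):
        # Scale each stratum's sample mean by its true item count.
        total = 0.0
        for s, res in self.reservoirs.items():
            if res:
                total += self.seen[s] * (sum(res) / len(res))
        return total
```

    Keeping a reservoir per stratum prevents a high-volume substream from crowding out a low-volume one, which is what makes such estimates robust when input rates fluctuate.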

    Efficient Implementation of Extended Learned Bloom Filter

    Master's thesis, Seoul National University Graduate School, College of Engineering, Dept. of Computer Science and Engineering, August 2021 (advisor: Hyoung-Joo Kim). Existing data structures aim for constant performance regardless of the data distribution. However, research under the name of learned indexes is showing that performance can be improved by exploiting the data distribution. In this study, we focus on extending and implementing the learned Bloom filter, one type of learned index. We call the result an extended learned Bloom filter: a data structure that adds a learned hash function to the structure of the learned Bloom filter. Experiments show that its false positive rate improves over the original learned Bloom filter when the ratio between the learned hash function and the auxiliary filter is tuned via the hyperparameter α. Additionally, we introduce a model precision problem that arose while implementing the extended learned Bloom filter and show that it can be solved by using 64-bit floating point. Finally, we show that the performance of the extended learned Bloom filter can be improved by tuning the model, and we examine how the learned hash function contributes to this improvement.
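    For context, a plain (non-extended) learned Bloom filter composes a score model with an auxiliary filter that catches the model's false negatives. The sketch below uses a caller-supplied scoring callable as a stand-in for a trained model; the learned hash function and the hyperparameter α of the extended variant are not modeled here:

```python
import hashlib

class BloomFilter:
    """Plain Bloom filter, used as the auxiliary (backup) filter."""
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _hashes(self, key):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for h in self._hashes(key):
            self.bits[h] = 1

    def __contains__(self, key):
        return all(self.bits[h] for h in self._hashes(key))

class LearnedBloomFilter:
    """Learned Bloom filter: a score model plus a backup filter holding the
    model's false negatives. `model` is any callable returning a score in
    [0, 1]; here it is a stand-in, not a trained network."""
    def __init__(self, model, threshold, keys, backup_bits=8192, backup_hashes=4):
        self.model = model
        self.tau = threshold
        self.backup = BloomFilter(backup_bits, backup_hashes)
        for key in keys:
            if model(key) < threshold:     # model would miss this key,
                self.backup.add(key)       # so the backup filter must hold it

    def __contains__(self, key):
        # No false negatives: every stored key passes one of the two checks.
        return self.model(key) >= self.tau or key in self.backup
```

    False positives come from two sources, model-side (scores above τ for non-keys) and backup-filter-side, which is why tuning the split between the learned component and the auxiliary filter matters.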

    Bridging the gap between algorithmic and learned index structures

    Index structures such as B-trees and Bloom filters are the well-established petrol engines of database systems. However, these structures do not fully exploit patterns in the data distribution. To address this, researchers have suggested using machine learning models as electric engines that entirely replace index structures. Such a paradigm shift in data system design, however, opens many unsolved design challenges: more research is needed to understand the theoretical guarantees and to design efficient support for insertion and deletion. In this thesis, we adopt a different position: index algorithms are good enough, and instead of going back to the drawing board to fit data systems with learned models, we should develop lightweight hybrid engines that build on the benefits of both algorithmic and learned index structures. The indexes we suggest provide the theoretical performance guarantees and updatability of algorithmic indexes while using position-prediction models to leverage the data distribution and thereby improve the performance of the index structure. We investigate the potential for minimal modifications to algorithmic indexes so that they can leverage the data distribution in the way learned indexes do. In this regard, we propose and explore the use of helping models that boost classical index performance using techniques from machine learning. Our approach inherits performance guarantees from its algorithmic baseline index while considering the data distribution to improve performance considerably. We study single-dimensional range indexes, spatial indexes, and stream indexing, and show that the suggested approach yields range indexes that outperform the algorithmic indexes and have performance comparable to read-only, fully learned indexes, and hence can reliably serve as a default index structure in a database engine.
    Besides, we consider the updatability of the indexes and suggest solutions for updating them, notably when the data distribution changes drastically over time (e.g., when indexing data streams). In particular, we propose a specific learning-augmented index for indexing a sliding window with timestamps in a data stream. Additionally, we highlight the limitations of learned indexes for low-latency lookup on real-world data distributions. To tackle this issue, we suggest adding an algorithmic enhancement layer to a learned model to correct the prediction error at a small memory latency. This approach enables efficient modelling of the data distribution and resolves the local biases of a learned model at the cost of roughly one memory lookup.
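    The helping-model idea can be illustrated with a least-squares line that predicts a key's rank in a sorted array, combined with an algorithmic correction step: a binary search confined to the model's worst-case error window. This is a toy sketch of learning-augmented indexing under those assumptions, not the thesis's proposed structures:

```python
import bisect

class HybridIndex:
    """Sorted-array index: a linear position-prediction model plus a bounded
    binary search that corrects the prediction. The 'model' here is just a
    least-squares line fitted to (key, rank) pairs."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        # Fit rank ~ a*key + b by least squares (the helping model).
        mean_k = sum(self.keys) / n
        mean_r = (n - 1) / 2
        var = sum((k - mean_k) ** 2 for k in self.keys)
        cov = sum((k - mean_k) * (i - mean_r) for i, k in enumerate(self.keys))
        self.a = cov / var if var else 0.0
        self.b = mean_r - self.a * mean_k
        # Record the worst-case prediction error: the correction bound.
        self.err = max(abs(self._predict(k) - i) for i, k in enumerate(self.keys))

    def _predict(self, key):
        return min(max(int(round(self.a * key + self.b)), 0), len(self.keys) - 1)

    def lookup(self, key):
        p = self._predict(key)
        lo = max(0, p - self.err)
        hi = min(len(self.keys), p + self.err + 1)
        i = bisect.bisect_left(self.keys, key, lo, hi)  # search the error window only
        return i if i < len(self.keys) and self.keys[i] == key else None
```

    Because the search window always contains the true position of any stored key, lookups are never wrong; the model only shrinks the region the binary search has to inspect, which is exactly the guarantees-plus-distribution trade-off described above.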

    Weiterentwicklung analytischer Datenbanksysteme (Advancing Analytical Database Systems)

    This thesis contributes to the state of the art in analytical database systems. First, we identify and explore extensions to better support analytics on event streams. Second, we propose a novel polygon index to enable efficient geospatial data processing in main memory. Third, we contribute a new deep learning approach to cardinality estimation, the core problem in cost-based query optimization.

    Fast Succinct Retrieval and Approximate Membership Using Ribbon

    A retrieval data structure for a static function f: S → {0,1}^r supports queries that return f(x) for any x ∈ S. Retrieval data structures can be used to implement a static approximate membership query data structure (AMQ), i.e., a Bloom filter alternative, with false positive rate 2^{-r}. The information-theoretic lower bound for both tasks is r|S| bits. While succinct theoretical constructions using (1+o(1))r|S| bits were known, these could not achieve very small overheads in practice, either because they hide an unfavorable space-time tradeoff in the asymptotic costs or because small overheads would only be reached for physically impossible input sizes. With bumped ribbon retrieval (BuRR), we present the first practical succinct retrieval data structure. In an extensive experimental evaluation, BuRR achieves space overheads well below 1% while being faster than most previously used retrieval data structures (which typically have space overheads at least an order of magnitude larger) and faster than classical Bloom filters (with space overhead ≥ 44%). This efficiency, including favorable constants, stems from a combination of simplicity, word parallelism, and high locality. We additionally describe homogeneous ribbon filter AMQs, which are even simpler and faster at the price of slightly larger space overhead.
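    BuRR itself solves banded linear systems over GF(2); as a simpler stand-in that exposes the same retrieval interface and the retrieval-to-AMQ reduction described above, the sketch below builds a peeling-based 3-wise XOR retrieval structure (about 23% space overhead, so far from succinct). The class names are illustrative, not from the paper:

```python
import hashlib

def _digest(seed, tag, key):
    d = hashlib.sha256(f"{seed}:{tag}:{key}".encode()).digest()
    return int.from_bytes(d[:8], "big")

class XorRetrieval:
    """3-wise XOR retrieval: each key owns three cells (one per block) whose
    XOR equals the stored value. Built by hypergraph peeling."""

    def __init__(self, mapping, seed=0):
        self.seed = seed
        self.s = int(0.41 * len(mapping)) + 8    # cells per block
        self.m = 3 * self.s
        while True:                              # retry peeling with a new seed
            order = self._peel(list(mapping))
            if order is not None:
                break
            self.seed += 1
        self.cells = [0] * self.m
        for key, c in reversed(order):           # assign in reverse peel order
            others = 0
            for c2 in self._cells_of(key):
                if c2 != c:
                    others ^= self.cells[c2]
            self.cells[c] = others ^ mapping[key]

    def _cells_of(self, key):
        # one cell per block, so a key's three cells are always distinct
        return [i * self.s + _digest(self.seed, i, key) % self.s for i in range(3)]

    def _peel(self, keys):
        cell_keys = [set() for _ in range(self.m)]
        for k in keys:
            for c in self._cells_of(k):
                cell_keys[c].add(k)
        order, stack = [], [c for c in range(self.m) if len(cell_keys[c]) == 1]
        while stack:
            c = stack.pop()
            if len(cell_keys[c]) != 1:
                continue
            (k,) = cell_keys[c]
            order.append((k, c))
            for c2 in self._cells_of(k):
                cell_keys[c2].discard(k)
                if len(cell_keys[c2]) == 1:
                    stack.append(c2)
        return order if len(order) == len(keys) else None

    def get(self, key):
        acc = 0
        for c in self._cells_of(key):
            acc ^= self.cells[c]
        return acc

class RetrievalAMQ:
    """AMQ from retrieval: store an r-bit fingerprint per key; a query is a
    member iff the retrieved value matches its own fingerprint, giving a
    false positive rate of roughly 2^-r."""

    def __init__(self, keys, r=8):
        self.r = r
        self.table = XorRetrieval({k: self._fp(k) for k in keys})

    def _fp(self, key):
        return _digest(10**9, "fp", key) & ((1 << self.r) - 1)

    def __contains__(self, key):
        return self.table.get(key) == self._fp(key)
```

    For x ∈ S the retrieved value is exact, so there are no false negatives; for x ∉ S the retrieved value is effectively random, matching the fingerprint with probability about 2^{-r}, which is the reduction the abstract refers to.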