838 research outputs found

The Case for Learned Index Structures

Author: Abadi M.
Armbrust M.
Böhm M.
Chang F.
Goodfellow I.
Grossi R.
Lehman T. J.
Litwin W.
Magdon-Ismail M.
Miller D. J.
Moerkotte G.
Sutskever I.
You S.
Publication date: 30/04/2018

Indexes are models: a B-Tree-Index can be seen as a model to map a key to the position of a record within a sorted array, a Hash-Index as a model to map a key to a position of a record within an unsorted array, and a BitMap-Index as a model to indicate if a data record exists or not. In this exploratory research paper, we start from this premise and posit that all existing index structures can be replaced with other types of models, including deep-learning models, which we term learned indexes. The key idea is that a model can learn the sort order or structure of lookup keys and use this signal to effectively predict the position or existence of records. We theoretically analyze under which conditions learned indexes outperform traditional index structures and describe the main challenges in designing learned index structures. Our initial results show that by using neural nets we are able to outperform cache-optimized B-Trees by up to 70% in speed while saving an order of magnitude in memory over several real-world data sets. More importantly, though, we believe that the idea of replacing core components of a data management system with learned models has far-reaching implications for future system designs and that this work provides just a glimpse of what might be possible.
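To make the "index as a model" idea concrete, here is a minimal sketch, assuming a single least-squares line in place of the neural nets the abstract mentions. The class name `LinearLearnedIndex`, the use of the maximum training error as a search window, and the final bounded binary search are illustrative assumptions, not the paper's implementation.

```python
import bisect


class LinearLearnedIndex:
    """Toy learned index: a fitted line predicts a key's position in a sorted array.

    Hypothetical illustration only; the paper's own models are not shown in the abstract.
    """

    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        if n == 0:
            raise ValueError("need at least one key")
        # Ordinary least-squares fit of position ~ slope * key + intercept.
        mean_key = sum(self.keys) / n
        mean_pos = (n - 1) / 2
        var = sum((k - mean_key) ** 2 for k in self.keys)
        cov = sum((k - mean_key) * (i - mean_pos) for i, k in enumerate(self.keys))
        self.slope = cov / var if var else 0.0
        self.intercept = mean_pos - self.slope * mean_key
        # The worst prediction error over the stored keys bounds the search window.
        self.max_err = max(abs(i - self._predict(k)) for i, k in enumerate(self.keys))

    def _predict(self, key):
        return int(round(self.slope * key + self.intercept))

    def lookup(self, key):
        """Return the position of key in the sorted array, or None if absent."""
        n = len(self.keys)
        guess = self._predict(key)
        # Clamp the window [guess - max_err, guess + max_err] to the array bounds.
        lo = min(max(0, guess - self.max_err), n)
        hi = max(lo, min(n, guess + self.max_err + 1))
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < n and self.keys[i] == key:
            return i
        return None
```

The model only narrows the search: correctness for stored keys comes from the recorded error bound, so a poorly fitting model costs lookup time rather than correctness.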

arXiv.org e-Print Archive

Crossref

Data Structures and Algorithms for Scalable NDN Forwarding

Author: Yuan Haowei
Publication venue: Washington University Open Scholarship
Publication date: 15/12/2015

Named Data Networking (NDN) is a recently proposed general-purpose network architecture that aims to address the limitations of the Internet Protocol (IP), while maintaining its strengths. NDN takes an information-centric approach, focusing on named data rather than computer addresses. In NDN, the content is identified by its name, and each NDN packet has a name that specifies the content it is fetching or delivering. Since there are no source and destination addresses in an NDN packet, it is forwarded based on a lookup of its name in the forwarding plane, which consists of the Forwarding Information Base (FIB), Pending Interest Table (PIT), and Content Store (CS). In addition, as an in-network caching element, a scalable Repository (Repo) design is needed to provide large-scale long-term content storage in NDN networks. Scalable NDN forwarding is a challenge. Compared to the well-understood approaches to IP forwarding, NDN forwarding performs lookups on packet names, which have variable and unbounded lengths, increasing the lookup complexity. The lookup tables are larger than in IP, requiring more memory space. Moreover, NDN forwarding has a read-write data plane, requiring per-packet updates at line rates. Designing and evaluating a scalable NDN forwarding node architecture is a major effort within the overall NDN research agenda. The goal of this dissertation is to demonstrate that scalable NDN forwarding is feasible with the proposed data structures and algorithms. First, we propose a FIB lookup design based on the binary search of hash tables that provides a reliable longest name prefix lookup performance baseline for future NDN research. We have demonstrated 10 Gbps forwarding throughput with 256-byte packets and one billion synthetic forwarding rules, each containing up to seven name components. Second, we explore data structures and algorithms to optimize the FIB design based on the specific characteristics of real-world forwarding datasets. Third, we propose a fingerprint-only PIT design that reduces the memory requirements in the core routers. Lastly, we discuss the Content Store design issues and demonstrate that the NDN Repo implementation can leverage many of the existing databases and storage systems to improve performance.
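The first contribution, longest name prefix lookup by binary search of hash tables, can be sketched as follows. This is a minimal illustration that assumes the generic binary-search-on-prefix-lengths scheme with marker entries, applied to name components. The class `NameFib`, the marker layout, and the example names are hypothetical and are not taken from the dissertation.

```python
class NameFib:
    """Toy FIB: one hash table per prefix length (in name components).

    Markers are placed along each prefix's binary-search path so that a lookup
    probes O(log max_len) tables. Hypothetical sketch, not the author's design.
    """

    def __init__(self, routes):
        # routes maps a name prefix such as "/edu/wustl" to a next hop.
        self.real = {self._split(p): hop for p, hop in routes.items()}
        self.max_len = max((len(p) for p in self.real), default=0)
        # tables[m] maps an m-component prefix to its best matching real prefix.
        self.tables = [dict() for _ in range(self.max_len + 1)]
        for prefix in self.real:
            lo, hi = 1, self.max_len
            while lo <= hi:
                mid = (lo + hi) // 2
                if mid < len(prefix):
                    self.tables[mid].setdefault(prefix[:mid], None)  # marker
                    lo = mid + 1
                elif mid == len(prefix):
                    self.tables[mid][prefix] = None  # the real entry itself
                    break
                else:
                    hi = mid - 1
        # Precompute, for every entry and marker, the longest real prefix of it.
        for table in self.tables:
            for key in table:
                table[key] = self._longest_real(key)

    @staticmethod
    def _split(name):
        return tuple(c for c in name.split("/") if c)

    def _longest_real(self, key):
        for k in range(len(key), 0, -1):
            if key[:k] in self.real:
                return key[:k]
        return None

    def lookup(self, name):
        """Return the next hop of the longest matching prefix, or None."""
        comps = self._split(name)
        best = None
        lo, hi = 1, self.max_len  # same search range as used for marker placement
        while lo <= hi:
            mid = (lo + hi) // 2
            if mid <= len(comps) and comps[:mid] in self.tables[mid]:
                found = self.tables[mid][comps[:mid]]
                if found is not None:
                    best = found
                lo = mid + 1  # a hit means a longer match may exist
            else:
                hi = mid - 1  # a miss rules out all longer prefixes
        if best is None and () in self.real:
            best = ()  # optional default route "/"
        return self.real.get(best) if best is not None else None
```

Storing the best matching real prefix in each marker avoids backtracking when a marker leads the search toward longer lengths that turn out not to match.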

Washington University St. Louis: Open Scholarship

Network Function Modeling and Performance Estimation

Author: Baldi Mario
Sapio Amedeo
Publication venue: IAES
Publication date: 01/01/2018

This work introduces a methodology for modeling network functions that focuses on identifying recurring execution patterns as basic building blocks and aims to provide a platform-independent representation. By mapping each modeling building block onto specific hardware, the performance of the network function can be estimated in terms of the maximum throughput that the network function can achieve on the specific execution platform. The approach is such that once the basic modeling building blocks have been mapped, the estimate can be computed automatically for any modeled network function. Experimental results on several sample network functions show that although our approach cannot be very accurate without taking traffic characteristics into consideration, it is very valuable for those applications where even loose estimates are key. One such example is orchestration in network functions virtualization (NFV) platforms, as well as in general virtualization platforms where virtual machine placement is also based on the performance of the network services offered to them. Being able to automatically estimate the performance of a virtualized network function (VNF) on different execution hardware enables optimal placement of the VNFs themselves as well as the virtual hosts they serve, while efficiently utilizing available resources.
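As a rough illustration of the estimation step, the sketch below assumes the simplest possible mapping: each building block has a fixed per-invocation cycle cost on a given platform, and maximum throughput is the clock rate divided by the cycles spent per packet. The function name, the block names, and every number are hypothetical. The abstract does not give the paper's actual cost model.

```python
def estimate_max_throughput(block_counts, cycles_per_block, clock_hz):
    """Estimate packets per second for a modeled network function.

    block_counts:     how many times each building block runs per packet
    cycles_per_block: assumed per-invocation cost of each block on the target hardware
    clock_hz:         clock frequency of the target core
    All three inputs are illustrative assumptions, not measured values.
    """
    cycles_per_packet = sum(
        count * cycles_per_block[name] for name, count in block_counts.items()
    )
    if cycles_per_packet <= 0:
        raise ValueError("the model must consume at least one cycle per packet")
    return clock_hz / cycles_per_packet


# Hypothetical model of a network function and a hypothetical hardware mapping.
model = {"header_parse": 1, "hash_lookup": 2, "checksum": 1}
hardware = {"header_parse": 40, "hash_lookup": 120, "checksum": 40}
packets_per_second = estimate_max_throughput(model, hardware, clock_hz=2_000_000_000)
```

Once the per-block costs are fixed for a platform, the same function can be reused for any network function expressed with those blocks, which mirrors the automation claim in the abstract.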

IAES journal

ZENODO

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

Open Access Repository

Institute of Advanced Engineering and Science

Loom: Query-aware Partitioning of Online Graphs

Author: Aiston Jack
Firth Hugo
Missier Paolo
Publication date: 17/11/2017

As with general graph processing systems, partitioning data over a cluster of machines improves the scalability of graph database management systems. However, these systems will incur additional network cost during the execution of a query workload, due to inter-partition traversals. Workload-agnostic partitioning algorithms typically minimise the likelihood of any edge crossing partition boundaries. However, these partitioners are sub-optimal with respect to many workloads, especially queries, which may require more frequent traversal of specific subsets of inter-partition edges. Furthermore, they are largely unsuited to operating incrementally on dynamic, growing graphs. We present a new graph partitioning algorithm, Loom, that operates on a stream of graph updates and continuously allocates the new vertices and edges to partitions, taking into account a query workload of graph pattern expressions along with their relative frequencies. First, we capture the most common patterns of edge traversals which occur when executing queries. We then compare sub-graphs, which present themselves incrementally in the graph update stream, against these common patterns. Finally, we attempt to allocate each match to a single partition, reducing the number of inter-partition edges within frequently traversed sub-graphs and improving average query performance. Loom is extensively evaluated over several large test graphs with realistic query workloads and various orderings of the graph updates. We demonstrate that, given a workload, our prototype produces partitionings of significantly better quality than the existing streaming graph partitioning algorithms Fennel and LDG.
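For context on the baselines named above, here is a minimal sketch of a workload-agnostic streaming partitioner in the style of the Linear Deterministic Greedy (LDG) heuristic: each arriving vertex goes to the partition holding most of its already-placed neighbours, discounted by how full that partition is. This is not Loom, which additionally matches sub-graphs against query patterns. The function name, the input format, and the tie-breaking rule are assumptions made for illustration.

```python
def ldg_partition(vertex_stream, num_parts, capacity):
    """Assign streamed vertices to partitions with an LDG-style greedy rule.

    vertex_stream yields (vertex, neighbours) pairs in arrival order.
    Score of partition p for a vertex: (neighbours already in p) * (1 - size_p / capacity).
    Ties go to the least loaded partition, then the lowest index (an assumption).
    """
    assignment = {}
    sizes = [0] * num_parts
    for vertex, neighbours in vertex_stream:
        # Count neighbours that have already been placed in each partition.
        counts = [0] * num_parts
        for n in neighbours:
            p = assignment.get(n)
            if p is not None:
                counts[p] += 1
        candidates = [p for p in range(num_parts) if sizes[p] < capacity]
        if not candidates:
            raise ValueError("all partitions are full")
        best = max(
            candidates,
            key=lambda p: (counts[p] * (1 - sizes[p] / capacity), -sizes[p]),
        )
        assignment[vertex] = best
        sizes[best] += 1
    return assignment
```

Because the rule only looks at edges, two sub-graphs that a query always traverses together can still be split when capacity runs short, which is the gap a query-aware partitioner such as Loom targets.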

arXiv.org e-Print Archive

University of Birmingham Research Portal

Design of Progressive Visualization Systems for Large-Scale Data Exploration

Author: 조재민
Publication venue: Seoul National University Graduate School
Publication date: 01/02/2020

Doctoral dissertation (Ph.D.), Seoul National University Graduate School, College of Engineering, Department of Computer Science and Engineering, February 2020. 서진욱. Understanding data through interactive visualization, also known as visual analytics, is a common and necessary practice in modern data science. However, as data sizes have increased at unprecedented rates, the computation latency of visualization systems has become a significant hurdle to visual analytics. The goal of this dissertation is to design a series of systems for progressive visual analytics (PVA), a visual analytics paradigm that provides intermediate results during computation and allows visual exploration of these results, to address the scalability hurdle. To support the interactive exploration of data with billions of records, we first introduce SwiftTuna, an interactive visualization system with scalable visualization and computation components. Our performance benchmark demonstrates that it can handle data with four billion records, giving responsive feedback every few seconds without precomputation. Second, we present PANENE, a progressive algorithm for the Approximate k-Nearest Neighbor (AKNN) problem. PANENE brings useful machine learning methods into visual analytics, which has been challenging because of the long initial latency of AKNN computation. In particular, we accelerate t-Distributed Stochastic Neighbor Embedding (t-SNE), a popular non-linear dimensionality reduction technique, which enables the responsive visualization of data with a few hundred columns. These two contributions address the scalability issues stemming from a large number of rows and a large number of columns in data, respectively. Third, from the users' perspective, we focus on improving the trustworthiness of intermediate knowledge gained from uncertain results in PVA. We propose a novel PVA concept, Progressive Visual Analytics with Safeguards, and introduce PVA-Guards, safeguards people can leave on uncertain intermediate knowledge that needs to be verified. We also present a proof-of-concept system, ProReveal, designed and developed to integrate seven safeguards into progressive data exploration. Our user study demonstrates that people not only successfully created PVA-Guards on ProReveal but also voluntarily used PVA-Guards to manage the uncertainty of their knowledge. Finally, summarizing the three studies, we discuss design challenges for progressive systems as well as future research agendas for PVA.

Contents: Chapter 1. Introduction: 1.1 Background and Motivation; 1.2 Thesis Statement and Research Questions; 1.3 Thesis Contributions (1.3.1 Responsive and Incremental Visual Exploration of Large-scale Multidimensional Data; 1.3.2 Progressive Computation of Approximate k-Nearest Neighbors and Responsive t-SNE; 1.3.3 Progressive Visual Analytics with Safeguards); 1.4 Structure of Dissertation. Chapter 2. Related Work: 2.1 Progressive Visual Analytics (2.1.1 Definitions; 2.1.2 System Latency and Human Factors; 2.1.3 Users, Tasks, and Models; 2.1.4 Techniques, Algorithms, and Systems; 2.1.5 Uncertainty Visualization); 2.2 Approaches for Scalable Visualization Systems; 2.3 The k-Nearest Neighbor (KNN) Problem; 2.4 t-Distributed Stochastic Neighbor Embedding. Chapter 3. SwiftTuna: Responsive and Incremental Visual Exploration of Large-scale Multidimensional Data: 3.1 The SwiftTuna Design (3.1.1 Design Considerations; 3.1.2 System Overview; 3.1.3 Scalable Visualization Components; 3.1.4 Visualization Cards; 3.1.5 User Interface and Interaction); 3.2 Responsive Querying (3.2.1 Querying Pipeline; 3.2.2 Prompt Responses; 3.2.3 Incremental Processing); 3.3 Evaluation: Performance Benchmark (3.3.1 Study Design; 3.3.2 Results and Discussion); 3.4 Implementation; 3.5 Summary. Chapter 4. PANENE: A Progressive Algorithm for Indexing and Querying Approximate k-Nearest Neighbors: 4.1 Approximate k-Nearest Neighbor (4.1.1 A Sequential Algorithm; 4.1.2 An Online Algorithm; 4.1.3 A Progressive Algorithm; 4.1.4 Filtered AKNN Search); 4.2 k-Nearest Neighbor Lookup Table; 4.3 Benchmark (4.3.1 Online and Progressive k-d Trees; 4.3.2 k-Nearest Neighbor Lookup Tables); 4.4 Applications (4.4.1 Progressive Regression and Density Estimation; 4.4.2 Responsive t-SNE); 4.5 Implementation; 4.6 Discussion; 4.7 Summary. Chapter 5. ProReveal: Progressive Visual Analytics with Safeguards: 5.1 Progressive Visual Analytics with Safeguards (5.1.1 Definition; 5.1.2 Examples; 5.1.3 Design Considerations); 5.2 ProReveal; 5.3 Evaluation; 5.4 Discussion; 5.5 Summary. Chapter 6. Discussion: 6.1 Lessons Learned; 6.2 Limitations. Chapter 7. Conclusion: 7.1 Thesis Contributions Revisited; 7.2 Future Research Agenda; 7.3 Final Remarks. Abstract (Korean). Acknowledgments (Korean).

SNU Open Repository and Archive