1,479 research outputs found

    Efficient Subgraph Similarity Search on Large Probabilistic Graph Databases

    Full text link
    Many studies have been conducted on seeking the efficient solution for subgraph similarity search over certain (deterministic) graphs due to its wide application in many fields, including bioinformatics, social network analysis, and Resource Description Framework (RDF) data management. All these works assume that the underlying data are certain. However, in reality, graphs are often noisy and uncertain due to various factors, such as errors in data extraction, inconsistencies in data integration, and privacy preserving purposes. Therefore, in this paper, we study subgraph similarity search on large probabilistic graph databases. Different from previous works assuming that edges in an uncertain graph are independent of each other, we study the uncertain graphs where edges' occurrences are correlated. We formally prove that subgraph similarity search over probabilistic graphs is #P-complete, thus, we employ a filter-and-verify framework to speed up the search. In the filtering phase,we develop tight lower and upper bounds of subgraph similarity probability based on a probabilistic matrix index, PMI. PMI is composed of discriminative subgraph features associated with tight lower and upper bounds of subgraph isomorphism probability. Based on PMI, we can sort out a large number of probabilistic graphs and maximize the pruning capability. During the verification phase, we develop an efficient sampling algorithm to validate the remaining candidates. The efficiency of our proposed solutions has been verified through extensive experiments.Comment: VLDB201

    ProS: Data Series Progressive k-NN Similarity Search and Classification with Probabilistic Quality Guarantees

    Full text link
    Existing systems dealing with the increasing volume of data series cannot guarantee interactive response times, even for fundamental tasks such as similarity search. Therefore, it is necessary to develop analytic approaches that support exploration and decision making by providing progressive results, before the final and exact ones have been computed. Prior works lack both efficiency and accuracy when applied to large-scale data series collections. We present and experimentally evaluate ProS, a new probabilistic learning-based method that provides quality guarantees for progressive Nearest Neighbor (NN) query answering. We develop our method for k-NN queries and demonstrate how it can be applied with the two most popular distance measures, namely, Euclidean and Dynamic Time Warping (DTW). We provide both initial and progressive estimates of the final answer that are getting better during the similarity search, as well suitable stopping criteria for the progressive queries. Moreover, we describe how this method can be used in order to develop a progressive algorithm for data series classification (based on a k-NN classifier), and we additionally propose a method designed specifically for the classification task. Experiments with several and diverse synthetic and real datasets demonstrate that our prediction methods constitute the first practical solutions to the problem, significantly outperforming competing approaches. This paper was published in the VLDB Journal (2022)

    λŒ€μš©λŸ‰ 데이터 탐색을 μœ„ν•œ 점진적 μ‹œκ°ν™” μ‹œμŠ€ν…œ 섀계

    Get PDF
    ν•™μœ„λ…Όλ¬Έ(박사)--μ„œμšΈλŒ€ν•™κ΅ λŒ€ν•™μ› :κ³΅κ³ΌλŒ€ν•™ 컴퓨터곡학뢀,2020. 2. μ„œμ§„μš±.Understanding data through interactive visualization, also known as visual analytics, is a common and necessary practice in modern data science. However, as data sizes have increased at unprecedented rates, the computation latency of visualization systems becomes a significant hurdle to visual analytics. The goal of this dissertation is to design a series of systems for progressive visual analytics (PVA)β€”a visual analytics paradigm that can provide intermediate results during computation and allow visual exploration of these resultsβ€”to address the scalability hurdle. To support the interactive exploration of data with billions of records, we first introduce SwiftTuna, an interactive visualization system with scalable visualization and computation components. Our performance benchmark demonstrates that it can handle data with four billion records, giving responsive feedback every few seconds without precomputation. Second, we present PANENE, a progressive algorithm for the Approximate k-Nearest Neighbor (AKNN) problem. PANENE brings useful machine learning methods into visual analytics, which has been challenging due to their long initial latency resulting from AKNN computation. In particular, we accelerate t-Distributed Stochastic Neighbor Embedding (t-SNE), a popular non-linear dimensionality reduction technique, which enables the responsive visualization of data with a few hundred columns. Each of these two contributions aims to address the scalability issues stemming from a large number of rows or columns in data, respectively. Third, from the users' perspective, we focus on improving the trustworthiness of intermediate knowledge gained from uncertain results in PVA. We propose a novel PVA concept, Progressive Visual Analytics with Safeguards, and introduce PVA-Guards, safeguards people can leave on uncertain intermediate knowledge that needs to be verified. We also present a proof-of-concept system, ProReveal, designed and developed to integrate seven safeguards into progressive data exploration. Our user study demonstrates that people not only successfully created PVA-Guards on ProReveal but also voluntarily used PVA-Guards to manage the uncertainty of their knowledge. Finally, summarizing the three studies, we discuss design challenges for progressive systems as well as future research agendas for PVA.ν˜„λŒ€ 데이터 μ‚¬μ΄μ–ΈμŠ€μ—μ„œ μΈν„°λž™ν‹°λΈŒν•œ μ‹œκ°ν™”λ₯Ό 톡해 데이터λ₯Ό μ΄ν•΄ν•˜λŠ” 것은 ν•„μˆ˜μ μΈ 뢄석 방법 쀑 ν•˜λ‚˜μ΄λ‹€. κ·ΈλŸ¬λ‚˜, 졜근 λ°μ΄ν„°μ˜ 크기가 폭발적으둜 μ¦κ°€ν•˜λ©΄μ„œ 데이터 크기둜 인해 λ°œμƒν•˜λŠ” 지연 μ‹œκ°„μ΄ μΈν„°λž™ν‹°λΈŒν•œ μ‹œκ°μ  뢄석에 큰 걸림돌이 λ˜μ—ˆλ‹€. λ³Έ μ—°κ΅¬μ—μ„œλŠ” μ΄λŸ¬ν•œ ν™•μž₯μ„± 문제λ₯Ό ν•΄κ²°ν•˜κΈ° μœ„ν•΄ 점진적 μ‹œκ°μ  뢄석(Progressive Visual Analytics)을 μ§€μ›ν•˜λŠ” 일련의 μ‹œμŠ€ν…œμ„ λ””μžμΈν•˜κ³  κ°œλ°œν•œλ‹€. μ΄λŸ¬ν•œ 점진적 μ‹œκ°μ  뢄석 μ‹œμŠ€ν…œμ€ 데이터 μ²˜λ¦¬κ°€ μ™„μ „νžˆ λλ‚˜μ§€ μ•Šλ”λΌλ„ 쀑간 뢄석 κ²°κ³Όλ₯Ό μ‚¬μš©μžμ—κ²Œ μ œκ³΅ν•¨μœΌλ‘œμ¨ λ°μ΄ν„°μ˜ 크기둜 인해 λ°œμƒν•˜λŠ” 지연 μ‹œκ°„ 문제λ₯Ό μ™„ν™”ν•  수 μžˆλ‹€. 첫째둜, μˆ˜μ‹­μ–΅ 건의 행을 κ°€μ§€λŠ” 데이터λ₯Ό μ‹œκ°μ μœΌλ‘œ 탐색할 수 μžˆλŠ” SwiftTuna μ‹œμŠ€ν…œμ„ μ œμ•ˆν•œλ‹€. 데이터 처리 및 μ‹œκ°μ  ν‘œν˜„μ˜ ν™•μž₯성을 λͺ©ν‘œλ‘œ 개발된 이 μ‹œμŠ€ν…œμ€, μ•½ 40μ–΅ 건의 행을 가진 데이터에 λŒ€ν•œ μ‹œκ°ν™”λ₯Ό μ „μ²˜λ¦¬ 없이 수 μ΄ˆλ§ˆλ‹€ μ—…λ°μ΄νŠΈν•  수 μžˆλŠ” κ²ƒμœΌλ‘œ λ‚˜νƒ€λ‚¬λ‹€. λ‘˜μ§Έλ‘œ, 근사적 k-μ΅œκ·Όμ ‘μ (Approximate k-Nearest Neighbor) 문제λ₯Ό μ μ§„μ μœΌλ‘œ κ³„μ‚°ν•˜λŠ” PANENE μ•Œκ³ λ¦¬μ¦˜μ„ μ œμ•ˆν•œλ‹€. 근사적 k-μ΅œκ·Όμ ‘μ  λ¬Έμ œλŠ” μ—¬λŸ¬ 기계 ν•™μŠ΅ κΈ°λ²•μ—μ„œ μ“°μž„μ—λ„ λΆˆκ΅¬ν•˜κ³  초기 계산 μ‹œκ°„μ΄ κΈΈμ–΄μ„œ μΈν„°λž™ν‹°λΈŒν•œ μ‹œμŠ€ν…œμ— μ μš©ν•˜κΈ° νž˜λ“  ν•œκ³„κ°€ μžˆμ—ˆλ‹€. PANENE μ•Œκ³ λ¦¬μ¦˜μ€ μ΄λŸ¬ν•œ κΈ΄ 초기 계산 μ‹œκ°„μ„ 획기적으둜 κ°œμ„ ν•˜μ—¬ λ‹€μ–‘ν•œ 기계 ν•™μŠ΅ 기법을 μ‹œκ°μ  뢄석에 ν™œμš©ν•  수 μžˆλ„λ‘ ν•œλ‹€. 특히, μœ μš©ν•œ λΉ„μ„ ν˜•μ  차원 κ°μ†Œ 기법인 t-뢄포 ν™•λ₯ μ  μž„λ² λ”©(t-Distributed Stochastic Neighbor Embedding)을 κ°€μ†ν•˜μ—¬ 수백 개의 차원을 κ°€μ§€λŠ” 데이터λ₯Ό λΉ λ₯Έ μ‹œκ°„ 내에 μ‚¬μ˜ν•  수 μžˆλ‹€. μœ„μ˜ 두 μ‹œμŠ€ν…œκ³Ό μ•Œκ³ λ¦¬μ¦˜μ΄ λ°μ΄ν„°μ˜ ν–‰ λ˜λŠ” μ—΄μ˜ 개수둜 μΈν•œ ν™•μž₯μ„± 문제λ₯Ό ν•΄κ²°ν•˜κ³ μž ν–ˆλ‹€λ©΄, μ„Έ 번째 μ‹œμŠ€ν…œμ—μ„œλŠ” 점진적 μ‹œκ°μ  λΆ„μ„μ˜ 신뒰도 문제λ₯Ό κ°œμ„ ν•˜κ³ μž ν•œλ‹€. 점진적 μ‹œκ°μ  λΆ„μ„μ—μ„œ μ‚¬μš©μžμ—κ²Œ μ£Όμ–΄μ§€λŠ” 쀑간 계산 κ²°κ³ΌλŠ” μ΅œμ’… 결과의 κ·Όμ‚¬μΉ˜μ΄λ―€λ‘œ λΆˆν™•μ‹€μ„±μ΄ μ‘΄μž¬ν•œλ‹€. λ³Έ μ—°κ΅¬μ—μ„œλŠ” μ„Έμ΄ν”„κ°€λ“œλ₯Ό μ΄μš©ν•œ 점진적 μ‹œκ°μ  뢄석(Progressive Visual Analytics with Safeguards)μ΄λΌλŠ” μƒˆλ‘œμš΄ κ°œλ…μ„ μ œμ•ˆν•œλ‹€. 이 κ°œλ…μ€ μ‚¬μš©μžκ°€ 점진적 νƒμƒ‰μ—μ„œ λ§ˆμ£Όν•˜λŠ” λΆˆν™•μ‹€ν•œ 쀑간 지식에 μ„Έμ΄ν”„κ°€λ“œλ₯Ό 남길 수 μžˆλ„λ‘ ν•˜μ—¬ νƒμƒ‰μ—μ„œ 얻은 μ§€μ‹μ˜ 정확도λ₯Ό μΆ”ν›„ 검증할 수 μžˆλ„λ‘ ν•œλ‹€. λ˜ν•œ, μ΄λŸ¬ν•œ κ°œλ…μ„ μ‹€μ œλ‘œ κ΅¬ν˜„ν•˜μ—¬ νƒ‘μž¬ν•œ ProReveal μ‹œμŠ€ν…œμ„ μ†Œκ°œν•œλ‹€. ProRevealλ₯Ό μ΄μš©ν•œ μ‚¬μš©μž μ‹€ν—˜μ—μ„œ μ‚¬μš©μžλ“€μ€ μ„Έμ΄ν”„κ°€λ“œλ₯Ό μ„±κ³΅μ μœΌλ‘œ λ§Œλ“€ 수 μžˆμ—ˆμ„ 뿐만 μ•„λ‹ˆλΌ, 쀑간 μ§€μ‹μ˜ λΆˆν™•μ‹€μ„±μ„ 닀루기 μœ„ν•΄ μ„Έμ΄ν”„κ°€λ“œλ₯Ό 자발적으둜 μ΄μš©ν•œλ‹€λŠ” 것을 μ•Œ 수 μžˆμ—ˆλ‹€. λ§ˆμ§€λ§‰μœΌλ‘œ, μœ„ μ„Έ 가지 μ—°κ΅¬μ˜ κ²°κ³Όλ₯Ό μ’…ν•©ν•˜μ—¬ 점진적 μ‹œκ°μ  뢄석 μ‹œμŠ€ν…œμ„ κ΅¬ν˜„ν•  λ•Œμ˜ λ””μžμΈμ  λ‚œμ œμ™€ ν–₯ν›„ 연ꡬ λ°©ν–₯을 λͺ¨μƒ‰ν•œλ‹€.CHAPTER1. Introduction 2 1.1 Background and Motivation 2 1.2 Thesis Statement and Research Questions 5 1.3 Thesis Contributions 5 1.3.1 Responsive and Incremental Visual Exploration of Large-scale Multidimensional Data 6 1.3.2 ProgressiveComputation of Approximate k-Nearest Neighbors and Responsive t-SNE 7 1.3.3 Progressive Visual Analytics with Safeguards 8 1.4 Structure of Dissertation 9 CHAPTER2. Related Work 11 2.1 Progressive Visual Analytics 11 2.1.1 Definitions 11 2.1.2 System Latency and Human Factors 13 2.1.3 Users, Tasks, and Models 15 2.1.4 Techniques, Algorithms, and Systems. 17 2.1.5 Uncertainty Visualization 19 2.2 Approaches for Scalable Visualization Systems 20 2.3 The k-Nearest Neighbor (KNN) Problem 22 2.4 t-Distributed Stochastic Neighbor Embedding 26 CHAPTER3. SwiTuna: Responsive and Incremental Visual Exploration of Large-scale Multidimensional Data 28 3.1 The SwiTuna Design 31 3.1.1 Design Considerations 32 3.1.2 System Overview 33 3.1.3 Scalable Visualization Components 36 3.1.4 Visualization Cards 40 3.1.5 User Interface and Interaction 42 3.2 Responsive Querying 44 3.2.1 Querying Pipeline 44 3.2.2 Prompt Responses 47 3.2.3 Incremental Processing 47 3.3 Evaluation: Performance Benchmark 49 3.3.1 Study Design 49 3.3.2 Results and Discussion 52 3.4 Implementation 56 3.5 Summary 56 CHAPTER4. PANENE:AProgressive Algorithm for IndexingandQuerying Approximate k-Nearest Neighbors 58 4.1 Approximate k-Nearest Neighbor 61 4.1.1 A Sequential Algorithm 62 4.1.2 An Online Algorithm 63 4.1.3 A Progressive Algorithm 66 4.1.4 Filtered AKNN Search 71 4.2 k-Nearest Neighbor Lookup Table 72 4.3 Benchmark. 78 4.3.1 Online and Progressive k-d Trees 78 4.3.2 k-Nearest Neighbor Lookup Tables 83 4.4 Applications 85 4.4.1 Progressive Regression and Density Estimation 85 4.4.2 Responsive t-SNE 87 4.5 Implementation 92 4.6 Discussion 92 4.7 Summary 93 CHAPTER5. ProReveal: Progressive Visual Analytics with Safeguards 95 5.1 Progressive Visual Analytics with Safeguards 98 5.1.1 Definition 98 5.1.2 Examples 101 5.1.3 Design Considerations 103 5.2 ProReveal 105 5.3 Evaluation 121 5.4 Discussion 127 5.5 Summary 130 CHAPTER6. Discussion 132 6.1 Lessons Learned 132 6.2 Limitations 135 CHAPTER7. Conclusion 137 7.1 Thesis Contributions Revisited 137 7.2 Future Research Agenda 139 7.3 Final Remarks 141 Abstract (Korean) 155 Acknowledgments (Korean) 157Docto
    • …
    corecore