
    Clustering-Based Pre-Processing Approaches To Improve Similarity Join Techniques

    Research on similarity join techniques is becoming one of the growing practical areas of study, especially with the increasing availability of vast amounts of digital data from more and more source systems. This research focuses on pre-processing clustering-based techniques to improve existing similarity join approaches. Identifying and extracting the same real-world entities from different data sources remains a significant challenge in the digital information era: dissimilar extracts may represent the same real-world entity because of inconsistent values and naming conventions, incorrect or missing data values, or incomplete information. Therefore, discovering efficient and accurate approaches to determine the similarity of data objects or values is of both theoretical and practical significance. Semantic problems arise even around the concept of similarity itself, regarding its usage and foundations. Existing similarity join approaches often take a very specific view of similarity measures, with pre-defined predicates that represent a narrow notion of similarity for a given scenario; the predicates have been assumed to be a group of clustering-related attributes [MSW 72] on the join. Identifying entities for data integration purposes requires a broader view of similarity; for instance, a number of generic similarity measures are useful in a given data integration system. This study focused on string similarity join, based on the Levenshtein (edit) distance and q-grams. Its goal was to propose effective and efficient pre-processing clustering-based techniques that identify clustering-related predicates, based on either attribute values or data values, to improve existing similarity join techniques in enterprise data integration scenarios.
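    The combination mentioned above, an edit-distance join accelerated by a q-gram filter, can be sketched as follows. This is a minimal illustration of the generic technique, not the pre-processing approach the abstract proposes; all names and the threshold k are illustrative.

```python
from collections import Counter
from itertools import product

def qgrams(s, q=2):
    """Pad and split a string into overlapping q-grams."""
    padded = "#" * (q - 1) + s + "#" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def similarity_join(left, right, k=1, q=2):
    """Pairs within edit distance k; q-gram counting prunes cheaply first."""
    results = []
    for a, b in product(left, right):
        # Count filter: strings within edit distance k must share at least
        # max(|a|, |b|) + q - 1 - k*q q-grams (multiset overlap), so most
        # dissimilar pairs never reach the expensive edit-distance check.
        overlap = sum((Counter(qgrams(a, q)) & Counter(qgrams(b, q))).values())
        if overlap >= max(len(a), len(b)) + q - 1 - k * q \
                and edit_distance(a, b) <= k:
            results.append((a, b))
    return results
```

    Because the filter only prunes and every surviving pair is verified exactly, the join never produces false positives or false negatives.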

    The space complexity of inner product filters

    Motivated by the problem of filtering candidate pairs in inner product similarity joins, we study the following inner product estimation problem: given parameters $d \in \mathbf{N}$, $\alpha > \beta \geq 0$, and unit vectors $x, y \in \mathbf{R}^d$, consider the task of distinguishing between the cases $\langle x, y\rangle \leq \beta$ and $\langle x, y\rangle \geq \alpha$, where $\langle x, y\rangle = \sum_{i=1}^d x_i y_i$ is the inner product of the vectors $x$ and $y$. The goal is to distinguish these cases based on information about each vector encoded independently in a bit string of the shortest possible length. In contrast to much work on compressing vectors using randomized dimensionality reduction, we seek to solve the problem deterministically, with no probability of error. Inner product estimation can be solved in general by estimating $\langle x, y\rangle$ with an additive error bounded by $\varepsilon = \alpha - \beta$. We show that $d \log_2\left(\tfrac{\sqrt{1-\beta}}{\varepsilon}\right) \pm \Theta(d)$ bits of information about each vector are necessary and sufficient. Our upper bound is constructive and improves a known upper bound of $d \log_2(1/\varepsilon) + O(d)$ by up to a factor of 2 when $\beta$ is close to 1. The lower bound holds even in a stronger model where one of the vectors is known exactly, and an arbitrary estimation function is allowed. (To appear at ICDT 2020.)
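    The sufficiency direction can be illustrated with a much cruder (and far less bit-efficient) deterministic scheme than the paper's: round every coordinate to a grid of step $s = \varepsilon/(3\sqrt{d})$. For unit vectors the quantization then perturbs the inner product by less than $\varepsilon/2$, so thresholding at the midpoint $(\alpha+\beta)/2$ decides the promise problem with no probability of error. In a sketch like the one below, the integer codes $\mathrm{round}(x_i/s)$ are what would actually be transmitted; function names are illustrative.

```python
import math

def quantize(v, s):
    """Round each coordinate to the nearest multiple of the step s."""
    return [round(c / s) * s for c in v]

def decide_inner_product(x, y, alpha, beta):
    """True iff the promise case <x,y> >= alpha holds (False for <= beta).

    With step s = eps / (3*sqrt(d)), each coordinate error is at most s/2,
    so for unit vectors |<xq,yq> - <x,y>| <= s*sqrt(d) + (s^2)*d/4 < eps/2,
    making the comparison against the midpoint (alpha+beta)/2 always correct.
    """
    d, eps = len(x), alpha - beta
    s = eps / (3 * math.sqrt(d))
    xq, yq = quantize(x, s), quantize(y, s)
    est = sum(a * b for a, b in zip(xq, yq))
    return est >= (alpha + beta) / 2
```

    This naive scheme spends roughly $\log_2(\sqrt{d}/\varepsilon)$ bits per coordinate; the paper's contribution is showing that $\log_2(\sqrt{1-\beta}/\varepsilon) + O(1)$ bits per coordinate suffice.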

    PLF-Join: An Efficient MapReduce Algorithm for Vector Similarity Join

    Master's thesis, Seoul National University Graduate School, Department of Electrical and Computer Engineering, February 2015. Advisor: Sang-goo Lee. Vector similarity join is the problem of finding, in a set of vectors, all pairs whose similarity exceeds a given threshold. It is used in many applications such as near-duplicate detection in web pages, recommendation, and social data mining. However, it requires O(n^2) time, where n is the number of vectors, and this impractical complexity makes vector similarity join hard to apply to many real-world problems. Hence, many Hadoop MapReduce algorithms have been proposed to compute vector similarity joins quickly. The state-of-the-art algorithm uses prefix filtering and length filtering to reduce the time taken by the vector similarity join operation. To reduce this time further, we propose a variant algorithm that cuts the overhead of network I/O. Along with the MapReduce algorithm, we propose an efficient pre-processing technique that facilitates vector similarity join computation.
    Contents: 1. Introduction; 2. Preliminary (Problem Definition; Filtering Predicates: Prefix Filtering, Length Filtering); 3. Related Works (V-SMART Algorithm; VCL Algorithm; Bjoin Algorithm); 4. Pre-Processing (Pre-Processing Methods of Previous Research; StdSort: Sorting with Standard Deviation); 5. PLF-Join (Job 1: Filter Dissimilar Pairs; Job 2: Re-import the Vectors; Job 3: Calculate Similarity; Example of PLF-Join); 6. Experiment (Experiment Setup; Time Performance; StdSort Experiment; Combining PLF-Join and StdSort); 7. Conclusion.
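    The prefix-filtering predicate from the preliminaries can be sketched for set records under Jaccard similarity, a common formulation of the same idea (the thesis applies it to vectors; the helper names here are illustrative):

```python
from collections import defaultdict
from math import ceil

def jaccard(a, b):
    """Jaccard similarity of two sets."""
    return len(a & b) / len(a | b)

def prefix_filter_join(records, t):
    """All-pairs Jaccard join: a pair can reach similarity t only if the
    first len(r) - ceil(t * len(r)) + 1 tokens (in a fixed global order)
    of the two records share a token, so candidates are generated from an
    inverted index over those prefixes only, then verified exactly."""
    sorted_recs = [sorted(r) for r in records]   # fixed global token order
    index = defaultdict(list)                    # token -> earlier record ids
    results = []
    for i, r in enumerate(sorted_recs):
        plen = len(r) - ceil(t * len(r)) + 1
        seen = set()
        for tok in r[:plen]:
            for j in index[tok]:
                if j not in seen:
                    seen.add(j)
                    if jaccard(set(sorted_recs[j]), set(r)) >= t:
                        results.append((j, i))   # verified candidate pair
            index[tok].append(i)
    return results
```

    Length filtering composes naturally with this: records whose sizes differ by more than a factor of t can be skipped before the prefix check.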

    Answering Complex Questions by Joining Multi-Document Evidence with Quasi Knowledge Graphs

    Direct answering of questions that involve multiple entities and relations is a challenge for text-based QA. This problem is most pronounced when answers can be found only by joining evidence from multiple documents. Curated knowledge graphs (KGs) may yield good answers, but are limited by their inherent incompleteness and potential staleness. This paper presents QUEST, a method that can answer complex questions directly from textual sources on-the-fly, by computing similarity joins over partial results from different documents. Our method is completely unsupervised, avoiding training-data bottlenecks and able to cope with rapidly evolving ad hoc topics and formulation styles in user questions. QUEST builds a noisy quasi KG with node and edge weights, consisting of dynamically retrieved entity names and relational phrases. It augments this graph with types and semantic alignments, and computes the best answers via an algorithm for Group Steiner Trees. We evaluate QUEST on benchmarks of complex questions and show that it substantially outperforms state-of-the-art baselines.
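    One ingredient of this pipeline, joining entity mentions retrieved from different documents by name similarity, can be illustrated with a simple token-overlap grouping. QUEST itself uses richer similarity measures and semantic alignments; this greedy sketch with illustrative names is only a stand-in.

```python
def token_jaccard(a, b):
    """Jaccard similarity over lowercased word tokens of two mention strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def align_mentions(mentions, t=0.5):
    """Greedily merge mentions: each joins the first group whose
    representative (first member) is similar enough, else starts a group."""
    groups = []
    for m in mentions:
        for g in groups:
            if token_jaccard(g[0], m) >= t:
                g.append(m)
                break
        else:
            groups.append([m])
    return groups
```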

    Towards Analytics Aware Ontology Based Access to Static and Streaming Data (Extended Version)

    Real-time analytics that requires integration and aggregation of heterogeneous and distributed streaming and static data is a typical task in many industrial scenarios, such as diagnostics of turbines at Siemens. The ontology-based data access (OBDA) approach has great potential to facilitate such tasks; however, it has a number of limitations in dealing with analytics that restrict its use in important industrial applications. Based on our experience with Siemens, we argue that to overcome those limitations, OBDA should be extended to become analytics-, source-, and cost-aware. In this work we propose such an extension. In particular, we propose an ontology, mapping, and query language for OBDA in which aggregate and other analytical functions are first-class citizens. Moreover, we develop query optimisation techniques that allow analytical tasks over static and streaming data to be processed efficiently. We implement our approach in a system and evaluate it on Siemens turbine data.

    Neo: A Learned Query Optimizer

    Query optimization is one of the most challenging problems in database systems. Despite the progress made over the past decades, query optimizers remain extremely complex components that require a great deal of hand-tuning for specific workloads and datasets. Motivated by this shortcoming and inspired by recent advances in applying machine learning to data management challenges, we introduce Neo (Neural Optimizer), a novel learning-based query optimizer that relies on deep neural networks to generate query execution plans. Neo bootstraps its query optimization model from existing optimizers and continues to learn from incoming queries, building upon its successes and learning from its failures. Furthermore, Neo naturally adapts to underlying data patterns and is robust to estimation errors. Experimental results demonstrate that Neo, even when bootstrapped from a simple optimizer like PostgreSQL, can learn a model that offers performance similar to state-of-the-art commercial optimizers and, in some cases, even surpasses them.
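    The role of Neo's value network can be illustrated by plugging a scoring function into plan search. In the sketch below the "model" is just a toy cost formula over left-deep join orders, and the fixed join selectivity is an assumption of this sketch, not anything Neo does; in Neo, the score would come from the trained network and guide a best-first search over partial plans.

```python
from itertools import permutations

def plan_score(order, card, selectivity=0.01):
    """Stand-in for a learned value network: scores a left-deep join order
    by summing the (toy) intermediate result sizes it produces."""
    size = card[order[0]]
    cost = 0.0
    for table in order[1:]:
        size *= card[table] * selectivity   # assumed fixed join selectivity
        cost += size
    return cost

def best_left_deep_plan(tables, card):
    """Exhaustive search guided by the score (fine for a handful of tables;
    Neo instead uses the network to steer search without enumeration)."""
    return min(permutations(tables), key=lambda o: plan_score(o, card))
```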