
    Clustering-Based Pre-Processing Approaches To Improve Similarity Join Techniques

    Research on similarity join techniques is becoming one of the growing practical areas of study, especially with the increasing availability of vast amounts of digital data from more and more source systems. This research focuses on pre-processing clustering-based techniques to improve existing similarity join approaches. Identifying and extracting the same real-world entities from different data sources remains a significant challenge in the digital information era: dissimilar extracts may represent the same real-world entity because of inconsistent values and naming conventions, incorrect or missing data values, or incomplete information. Therefore, discovering efficient and accurate approaches to determine the similarity of data objects or values is of both theoretical and practical significance. Semantic problems arise even around the concept of similarity itself, regarding its usage and foundations. Existing similarity join approaches often take a very specific view of similarity measures, with pre-defined predicates that represent a narrow notion of similarity for a given scenario; the predicates have been assumed to be a group of clustering-related attributes [MSW 72] on the join. Identifying entities for data integration purposes requires a broader view of similarity; for instance, a number of generic similarity measures are useful in a given data integration system. This study focused on string similarity join, based on the Levenshtein (edit) distance and q-grams. Its goal was to propose effective and efficient pre-processing clustering-based techniques that identify clustering-related predicates, based on either attribute values or data values, to improve existing similarity join techniques in enterprise data integration scenarios.
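    The combination mentioned above, an edit-distance join accelerated by a q-gram filter, can be sketched as follows. This is a minimal illustration of the generic technique, not the pre-processing approach the abstract proposes; all names and the threshold k are illustrative.

```python
from collections import Counter
from itertools import product

def qgrams(s, q=2):
    """Pad and split a string into overlapping q-grams."""
    padded = "#" * (q - 1) + s + "#" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def similarity_join(left, right, k=1, q=2):
    """Pairs within edit distance k; q-gram counting prunes cheaply first."""
    results = []
    for a, b in product(left, right):
        # Count filter: strings within edit distance k must share at least
        # max(|a|, |b|) + q - 1 - k*q q-grams (multiset overlap), so most
        # dissimilar pairs never reach the expensive edit-distance check.
        overlap = sum((Counter(qgrams(a, q)) & Counter(qgrams(b, q))).values())
        if overlap >= max(len(a), len(b)) + q - 1 - k * q \
                and edit_distance(a, b) <= k:
            results.append((a, b))
    return results
```

    Because the filter only prunes and every surviving pair is verified exactly, the join never produces false positives or false negatives.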

    The space complexity of inner product filters

    Motivated by the problem of filtering candidate pairs in inner product similarity joins, we study the following inner product estimation problem: given parameters $d \in \mathbf{N}$, $\alpha > \beta \geq 0$, and unit vectors $x, y \in \mathbf{R}^d$, consider the task of distinguishing between the cases $\langle x, y\rangle \leq \beta$ and $\langle x, y\rangle \geq \alpha$, where $\langle x, y\rangle = \sum_{i=1}^d x_i y_i$ is the inner product of the vectors $x$ and $y$. The goal is to distinguish these cases based on information about each vector encoded independently in a bit string of the shortest possible length. In contrast to much work on compressing vectors using randomized dimensionality reduction, we seek to solve the problem deterministically, with no probability of error. Inner product estimation can be solved in general by estimating $\langle x, y\rangle$ with an additive error bounded by $\varepsilon = \alpha - \beta$. We show that $d \log_2\left(\tfrac{\sqrt{1-\beta}}{\varepsilon}\right) \pm \Theta(d)$ bits of information about each vector are necessary and sufficient. Our upper bound is constructive and improves a known upper bound of $d \log_2(1/\varepsilon) + O(d)$ by up to a factor of 2 when $\beta$ is close to 1. The lower bound holds even in a stronger model where one of the vectors is known exactly, and an arbitrary estimation function is allowed. (To appear at ICDT 2020.)
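    The sufficiency direction can be illustrated with a much cruder (and far less bit-efficient) deterministic scheme than the paper's: round every coordinate to a grid of step $s = \varepsilon/(3\sqrt{d})$. For unit vectors the quantization then perturbs the inner product by less than $\varepsilon/2$, so thresholding at the midpoint $(\alpha+\beta)/2$ decides the promise problem with no probability of error. In a sketch like the one below, the integer codes $\mathrm{round}(x_i/s)$ are what would actually be transmitted; function names are illustrative.

```python
import math

def quantize(v, s):
    """Round each coordinate to the nearest multiple of the step s."""
    return [round(c / s) * s for c in v]

def decide_inner_product(x, y, alpha, beta):
    """True iff the promise case <x,y> >= alpha holds (False for <= beta).

    With step s = eps / (3*sqrt(d)), each coordinate error is at most s/2,
    so for unit vectors |<xq,yq> - <x,y>| <= s*sqrt(d) + (s^2)*d/4 < eps/2,
    making the comparison against the midpoint (alpha+beta)/2 always correct.
    """
    d, eps = len(x), alpha - beta
    s = eps / (3 * math.sqrt(d))
    xq, yq = quantize(x, s), quantize(y, s)
    est = sum(a * b for a, b in zip(xq, yq))
    return est >= (alpha + beta) / 2
```

    This naive scheme spends roughly $\log_2(\sqrt{d}/\varepsilon)$ bits per coordinate; the paper's contribution is showing that $\log_2(\sqrt{1-\beta}/\varepsilon) + O(1)$ bits per coordinate suffice.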

    PLF-Join: An Efficient MapReduce Algorithm for Vector Similarity Join

    Master's thesis, Seoul National University Graduate School, Department of Electrical and Computer Engineering, February 2015. Advisor: Sang-goo Lee. Vector similarity join is the problem of finding, in a set of vectors, all pairs whose similarity exceeds a given threshold. It is used in many applications such as near-duplicate detection in web pages, recommendation, and social data mining. However, it requires O(n^2) time, where n is the number of vectors, and this impractical complexity makes vector similarity join hard to apply to many real-world problems. Hence, many Hadoop MapReduce algorithms have been proposed to compute vector similarity joins quickly. The state-of-the-art algorithm uses prefix filtering and length filtering to reduce the time taken by the vector similarity join operation. To reduce this time further, we propose a variant algorithm that cuts the overhead of network I/O. Along with the MapReduce algorithm, we propose an efficient pre-processing technique that facilitates vector similarity join computation.
    Contents: 1. Introduction; 2. Preliminary (Problem Definition; Filtering Predicates: Prefix Filtering, Length Filtering); 3. Related Works (V-SMART Algorithm; VCL Algorithm; Bjoin Algorithm); 4. Pre-Processing (Pre-Processing Methods of Previous Research; StdSort: Sorting with Standard Deviation); 5. PLF-Join (Job 1: Filter Dissimilar Pairs; Job 2: Re-import the Vectors; Job 3: Calculate Similarity; Example of PLF-Join); 6. Experiment (Experiment Setup; Time Performance; StdSort Experiment; Combining PLF-Join and StdSort); 7. Conclusion.
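    The prefix-filtering predicate from the preliminaries can be sketched for set records under Jaccard similarity, a common formulation of the same idea (the thesis applies it to vectors; the helper names here are illustrative):

```python
from collections import defaultdict
from math import ceil

def jaccard(a, b):
    """Jaccard similarity of two sets."""
    return len(a & b) / len(a | b)

def prefix_filter_join(records, t):
    """All-pairs Jaccard join: a pair can reach similarity t only if the
    first len(r) - ceil(t * len(r)) + 1 tokens (in a fixed global order)
    of the two records share a token, so candidates are generated from an
    inverted index over those prefixes only, then verified exactly."""
    sorted_recs = [sorted(r) for r in records]   # fixed global token order
    index = defaultdict(list)                    # token -> earlier record ids
    results = []
    for i, r in enumerate(sorted_recs):
        plen = len(r) - ceil(t * len(r)) + 1
        seen = set()
        for tok in r[:plen]:
            for j in index[tok]:
                if j not in seen:
                    seen.add(j)
                    if jaccard(set(sorted_recs[j]), set(r)) >= t:
                        results.append((j, i))   # verified candidate pair
            index[tok].append(i)
    return results
```

    Length filtering composes naturally with this: records whose sizes differ by more than a factor of t can be skipped before the prefix check.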

    Answering Complex Questions by Joining Multi-Document Evidence with Quasi Knowledge Graphs

    Direct answering of questions that involve multiple entities and relations is a challenge for text-based QA. This problem is most pronounced when answers can be found only by joining evidence from multiple documents. Curated knowledge graphs (KGs) may yield good answers, but are limited by their inherent incompleteness and potential staleness. This paper presents QUEST, a method that can answer complex questions directly from textual sources on-the-fly, by computing similarity joins over partial results from different documents. Our method is completely unsupervised, avoiding training-data bottlenecks and able to cope with rapidly evolving ad hoc topics and formulation styles in user questions. QUEST builds a noisy quasi KG with node and edge weights, consisting of dynamically retrieved entity names and relational phrases. It augments this graph with types and semantic alignments, and computes the best answers via an algorithm for Group Steiner Trees. We evaluate QUEST on benchmarks of complex questions and show that it substantially outperforms state-of-the-art baselines.
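    One ingredient of this pipeline, joining entity mentions retrieved from different documents by name similarity, can be illustrated with a simple token-overlap grouping. QUEST itself uses richer similarity measures and semantic alignments; this greedy sketch with illustrative names is only a stand-in.

```python
def token_jaccard(a, b):
    """Jaccard similarity over lowercased word tokens of two mention strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def align_mentions(mentions, t=0.5):
    """Greedily merge mentions: each joins the first group whose
    representative (first member) is similar enough, else starts a group."""
    groups = []
    for m in mentions:
        for g in groups:
            if token_jaccard(g[0], m) >= t:
                g.append(m)
                break
        else:
            groups.append([m])
    return groups
```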

    Towards Analytics Aware Ontology Based Access to Static and Streaming Data (Extended Version)

    Real-time analytics that requires integration and aggregation of heterogeneous and distributed streaming and static data is a typical task in many industrial scenarios, such as diagnostics of turbines at Siemens. The ontology-based data access (OBDA) approach has great potential to facilitate such tasks; however, it has a number of limitations in dealing with analytics that restrict its use in important industrial applications. Based on our experience with Siemens, we argue that to overcome those limitations, OBDA should be extended to become analytics-, source-, and cost-aware. In this work we propose such an extension. In particular, we propose an ontology, mapping, and query language for OBDA in which aggregate and other analytical functions are first-class citizens. Moreover, we develop query optimisation techniques that allow analytical tasks over static and streaming data to be processed efficiently. We implement our approach in a system and evaluate it on Siemens turbine data.

    Neo: A Learned Query Optimizer

    Query optimization is one of the most challenging problems in database systems. Despite the progress made over the past decades, query optimizers remain extremely complex components that require a great deal of hand-tuning for specific workloads and datasets. Motivated by this shortcoming and inspired by recent advances in applying machine learning to data management challenges, we introduce Neo (Neural Optimizer), a novel learning-based query optimizer that relies on deep neural networks to generate query execution plans. Neo bootstraps its query optimization model from existing optimizers and continues to learn from incoming queries, building upon its successes and learning from its failures. Furthermore, Neo naturally adapts to underlying data patterns and is robust to estimation errors. Experimental results demonstrate that Neo, even when bootstrapped from a simple optimizer like PostgreSQL, can learn a model that offers performance similar to state-of-the-art commercial optimizers and, in some cases, even surpasses them.
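    The role of Neo's value network can be illustrated by plugging a scoring function into plan search. In the sketch below the "model" is just a toy cost formula over left-deep join orders, and the fixed join selectivity is an assumption of this sketch, not anything Neo does; in Neo, the score would come from the trained network and guide a best-first search over partial plans.

```python
from itertools import permutations

def plan_score(order, card, selectivity=0.01):
    """Stand-in for a learned value network: scores a left-deep join order
    by summing the (toy) intermediate result sizes it produces."""
    size = card[order[0]]
    cost = 0.0
    for table in order[1:]:
        size *= card[table] * selectivity   # assumed fixed join selectivity
        cost += size
    return cost

def best_left_deep_plan(tables, card):
    """Exhaustive search guided by the score (fine for a handful of tables;
    Neo instead uses the network to steer search without enumeration)."""
    return min(permutations(tables), key=lambda o: plan_score(o, card))
```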