3,211 research outputs found

Subgraph Pattern Matching over Uncertain Graphs with Identity Linkage Uncertainty

Author: Deshpande Amol
Getoor Lise
Kimmig Angelika
Moustafa Walaa Eldin
Publication date: 30/05/2013

There is a growing need for methods that can capture uncertainties and answer queries over graph-structured data. Two common types of uncertainty are uncertainty over the attribute values of nodes and uncertainty over the existence of edges. In this paper, we combine those with identity uncertainty. Identity uncertainty represents uncertainty over the mapping from objects mentioned in the data, or references, to the underlying real-world entities. We propose the notion of a probabilistic entity graph (PEG), a probabilistic graph model that defines a distribution over possible graphs at the entity level. The model takes into account node attribute uncertainty, edge existence uncertainty, and identity uncertainty, and thus enables us to systematically reason about all three types of uncertainties in a uniform manner. We introduce a general framework for constructing a PEG given uncertain data at the reference level and develop highly efficient algorithms to answer subgraph pattern matching queries in this setting. Our algorithms are based on two novel ideas: context-aware path indexing and reduction by join-candidates, which drastically reduce the query search space. A comprehensive experimental evaluation shows that our approach outperforms baseline implementations by orders of magnitude.

arXiv.org e-Print Archive

CiteSeerX

Crossref

Online Research @ Cardiff

End-to-End Entity Resolution for Big Data: A Survey

Author: Christophides Vassilis
Efthymiou Vasilis
Palpanas Themis
Papadakis George
Stefanidis Kostas
Publication date: 01/02/1988

One of the most important tasks for improving data quality and the reliability of data analytics results is Entity Resolution (ER). ER aims to identify different descriptions that refer to the same real-world entity, and remains a challenging problem. While previous works have studied specific aspects of ER (and mostly in traditional settings), in this survey we provide, for the first time, an end-to-end view of modern ER workflows, and of the novel aspects of entity indexing and matching methods that cope with more than one of the Big Data characteristics simultaneously. We present the basic concepts, processing steps and execution strategies that have been proposed by different communities, i.e., database, semantic Web and machine learning, in order to cope with the loose structuredness, extreme diversity, high speed and large scale of entity descriptions used by real-world applications. Finally, we provide a synthetic discussion of the existing approaches, and conclude with a detailed presentation of open research directions.

arXiv.org e-Print Archive

University of Richmond

Entity linkage for heterogeneous, uncertain, and volatile data

Author: Ioannou Ekaterini
Publication venue: Hannover : Gottfried Wilhelm Leibniz Universität Hannover
Publication date: 01/01/2011

[no abstract]

Institutionelles Repositorium der Leibniz Universität Hannover

Indeterministic Handling of Uncertain Decisions in Deduplication

Author: Batini C.
Benjelloun O.
Dechter R.
Fabian Panse
Koch C.
Koudas N.
Maurice van Keulen
Norbert Ritter
Ravikumar P. D.
Sen P.
Wang Y. R.
Widom J.
Publication venue: Association for Computing Machinery (ACM)

Crossref

Entity Identity Reconciliation based Big Data Federation A MDE approach

Author: Domínguez Mayo Francisco José
Escalona Cuaresma María José
García García Julián Alberto
González Enríquez José
Goto Masatomo
Lee Vivian
Publication venue: Association for Information Systems (AIS)
Publication date: 01/01/2015

“Information is power” is a sentence attributed to Francis Bacon that has acquired great importance in the current information era. However, too much information can be a negative aspect. The term “infoxication” refers to the difficulty a person can have in understanding an issue and making decisions when too much information is present. With the increasing relevance of open data and big databases, the application of mechanisms and solutions to manage information is critical. This paper introduces the problem of unique identification and data reconciliation and discusses how to solve this problem in big and open data environments. The problem of data reconciliation across multiple databases and the unique identification of entities is not new, but how effective are classical mechanisms in the new internet environment? In this paper, a solution based on model-driven engineering and virtual graphs is presented in order to improve the processing of information in big open repositories. The paper illustrates the idea with a real example of the proper exploitation of heritage information in the south of Spain.

Ministerio de Ciencia e Innovación TIN2013-46928-C3-3-

idUS. Depósito de Investigación Universidad de Sevilla

Entity Identity Reconciliation based Big Data Federation-A MDE approach

Author: Domínguez-Mayo Francisco Jose
Enríquez Jose Gonzalez
Escalona María José
García García Julián Alberto
Goto Masatomo
Lee Vivian
Publication venue: AIS Electronic Library (AISeL)
Publication date: 01/01/2015

AIS Electronic Library (AISeL)

idUS. Depósito de Investigación Universidad de Sevilla

Dynamic sorted neighborhood indexing for real-time entity resolution

Author: Christen P.
Gayler R. W.
Liang Huizhi
Ramadan B.
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/10/2015

Real-time Entity Resolution (ER) is the process of matching query records in subsecond time with records in a database that represent the same real-world entity. Indexing techniques are generally used to efficiently extract a set of candidate records from the database that are similar to a query record, and that are to be compared with the query record in more detail. The sorted neighborhood indexing method, which sorts a database and compares records within a sliding window, has been successfully used for ER of large static databases. However, because it is based on static sorted arrays and is designed for batch ER that resolves all records in a database rather than resolving those relating to a single query record, this technique is not suitable for real-time ER on dynamic databases that are constantly updated. We propose a tree-based technique that facilitates dynamic indexing based on the sorted neighborhood method, which can be used for real-time ER, and investigate both static and adaptive window approaches. We propose an approach to reduce query matching times by precalculating the similarities between attribute values stored in neighboring tree nodes. We also propose a multitree solution where different sorting keys are used to reduce the effects of errors and variations in attribute values on matching quality by building several distinct index trees. We experimentally evaluate our proposed techniques on large real datasets, as well as on synthetic data with different data quality characteristics. Our results show that as the index grows, neither record insertion times nor query times increase appreciably, and that using multiple trees gives noticeable improvements in matching quality with only a small increase in query time. Compared to earlier indexing techniques for real-time ER, our approach achieves significantly reduced indexing and query matching times while maintaining high matching accuracy.
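
For orientation, the classic batch sorted neighborhood method mentioned in the abstract (sort the records, then compare only those inside a sliding window) can be sketched as follows. This is a minimal illustration of the static baseline only, not the tree-based dynamic index the paper proposes; the function name `sorted_neighborhood_pairs`, the sample records and the choice of the whole string as sorting key are assumptions made for the example.

```python
from typing import Callable, Sequence, Set, Tuple


def sorted_neighborhood_pairs(
    records: Sequence[str],
    key: Callable[[str], str],
    window: int,
) -> Set[Tuple[int, int]]:
    """Return candidate pairs (i, j), i < j, of record indices.

    Two records become a candidate pair when their positions in the
    key-sorted order are fewer than `window` apart.
    """
    if window < 2:
        raise ValueError("window must be at least 2")
    # Sort record indices by the (hypothetical) sorting key.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs: Set[Tuple[int, int]] = set()
    for pos, i in enumerate(order):
        # Pair each record with the next window - 1 records in sorted order.
        for j in order[pos + 1 : pos + window]:
            pairs.add((min(i, j), max(i, j)))
    return pairs


# Illustrative data (assumed, not from the paper).
records = ["smith john", "smyth john", "jones mary", "smith jon", "jonas mary"]
candidates = sorted_neighborhood_pairs(records, key=lambda r: r, window=2)
```

Only the candidate pairs would then be passed to a detailed similarity comparison. Because the sorted order is a static array, inserting a new record means re-sorting, which is the limitation for dynamic databases that the abstract points out.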

Central Archive at the University of Reading

The Australian National University

Towards trajectory anonymization: a generalization-based approach

Author: Atzori Maurizio
Güç Barış
Nergiz Mehmet Ercan
Saygın Yücel
Publication venue: IIIA-CSIC
Publication date: 01/01/2009

Trajectory datasets are becoming popular due to the massive usage of GPS and location-based services. In this paper, we address privacy issues regarding the identification of individuals in static trajectory datasets. We first adapt the notion of k-anonymity to trajectories and propose a novel generalization-based approach for anonymization of trajectories. We further show that releasing anonymized trajectories may still leak private information. Therefore, we propose a randomization-based reconstruction algorithm for releasing anonymized trajectory data and also show how the underlying techniques can be adapted to other anonymity standards. The experimental results on real and synthetic trajectory datasets show the effectiveness of the proposed techniques.

Archivio istituzionale della ricerca - Università di Cagliari

Sabanci University Research Database