One of the problems in data integration is data overlap: different data sources hold data on the same real-world entities. Much development time in data integration projects is devoted to entity resolution. Advanced similarity measurement techniques are often used to remove semantic duplicates from the integration result or to resolve other semantic conflicts, but it proves impossible to eliminate all semantic problems in data integration. An often-used rule of thumb states that about 90% of the development effort is devoted to solving the remaining 10% of hard cases. In an attempt to significantly decrease human effort at data integration time, we have proposed an approach that stores any remaining semantic uncertainty and conflicts in a probabilistic database, so that the integration result can already be meaningfully used. The main development effort in our approach is devoted to defining and tuning knowledge rules and thresholds. Rules and thresholds directly impact the size and quality of the integration result. We measure integration quality indirectly, by measuring the quality of answers to queries on the integrated data set in an information retrieval-like way. The main contribution of this report is an experimental investigation of the effects and sensitivity of rule definition and threshold tuning on the integration quality. The investigation demonstrates that our approach indeed reduces development effort, rather than merely shifting it to rule definition and threshold tuning: setting rough safe thresholds and defining only a few rules suffices to produce a ‘good enough’ integration that can be meaningfully used.

Keijzer, A. de

Keulen, M. van

English

van Keulen, Maurice

de Keijzer, Ander

University of Twente Research Information

Qualitative Effects of Knowledge Rules in Probabilistic Data Integration

NARCIS 


CiteSeerX


