    Indeterministic Handling of Uncertain Decisions in Duplicate Detection

    In current research, duplicate detection is usually considered a deterministic approach in which tuples are either declared to be duplicates or not. Most often, however, it is not completely clear whether two tuples represent the same real-world entity. Deterministic approaches ignore this uncertainty, which in turn can lead to false decisions. In this paper, we present an indeterministic approach for handling uncertain decisions in a duplicate detection process by using a probabilistic target schema. Thus, instead of deciding between multiple possible worlds, all these worlds can be modeled in the resulting data. This approach minimizes the negative impact of false decisions. Furthermore, the duplicate detection process becomes almost fully automatic, and human effort can be reduced to a large extent. Unfortunately, a full-indeterministic approach is by definition too expensive (in time as well as in storage) and hence impractical. For that reason, we additionally introduce several semi-indeterministic methods for heuristically reducing the set of indeterministically handled decisions in a meaningful way.
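
    As an illustration of the core idea, the following minimal sketch keeps both outcomes of an uncertain match as weighted possible worlds instead of forcing a yes/no decision. All names (PossibleWorld, indeterministic_decide) and the naive merge are illustrative assumptions, not the paper's concrete technique.

        from dataclasses import dataclass

        @dataclass
        class PossibleWorld:
            tuples: list          # the tuples that exist in this world
            probability: float    # probability that this world is the true one

        def indeterministic_decide(t1, t2, score):
            """Keep both outcomes of an uncertain duplicate decision.

            `score` is assumed to be a normalized similarity in [0, 1],
            interpreted directly as the probability that t1 and t2 are
            duplicates.
            """
            merged = {**t1, **t2}  # naive merge; real systems resolve conflicts
            return [
                PossibleWorld([merged], score),        # "duplicate" world
                PossibleWorld([t1, t2], 1.0 - score),  # "non-duplicate" world
            ]

        worlds = indeterministic_decide(
            {"name": "J. Smith", "city": "Berlin"},
            {"name": "John Smith", "city": "Berlin"},
            score=0.7,
        )
        for w in worlds:
            print(w.probability, w.tuples)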

    Duplicate Detection in Probabilistic Data

    Collected data often contains uncertainties. Probabilistic databases have been proposed to manage uncertain data. To combine data from multiple autonomous probabilistic databases, an integration of probabilistic data has to be performed. Until now, however, data integration approaches have focused on the integration of certain source data (relational or XML); there is no work so far on the integration of uncertain (especially probabilistic) source data. In this paper, we present a first step towards a concise consolidation of probabilistic data. We focus on duplicate detection as a representative and essential step in an integration process. We present techniques for identifying multiple probabilistic representations of the same real-world entities. Furthermore, to increase the efficiency of the duplicate detection process, we introduce search space reduction methods adapted to probabilistic data.
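
    One natural way to compare two probabilistic tuples, sketched below under the assumption that each attribute is a distribution over alternative values, is to take the expectation of an ordinary string similarity over all value combinations; the function names are hypothetical, not taken from the paper.

        from difflib import SequenceMatcher

        def string_sim(a, b):
            """Ordinary (certain-data) string similarity in [0, 1]."""
            return SequenceMatcher(None, a, b).ratio()

        def expected_attr_sim(dist1, dist2):
            """Expected similarity of one attribute, where each argument maps
            an alternative value to its probability (probabilities sum to 1)."""
            return sum(p1 * p2 * string_sim(v1, v2)
                       for v1, p1 in dist1.items()
                       for v2, p2 in dist2.items())

        # Two probabilistic representations of (possibly) the same person:
        t1_name = {"John Smith": 0.8, "Jon Smith": 0.2}
        t2_name = {"John Smith": 0.6, "J. Smith": 0.4}
        print(round(expected_attr_sim(t1_name, t2_name), 3))

    A search space reduction in this setting could, for example, block tuples on their most probable attribute value so that only tuples sharing a blocking key are compared; this, too, is an assumption about one plausible adaptation rather than the paper's method.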

    Data Incompleteness Due to the Insufficient Modeling Power of Currently Dominant Data Models

    In the most widely used data models, especially the relational data model, information about the values of individual object properties is stored in attributes. In many cases (e.g., with partial information), however, a representation by single elements of the attribute's domain is not possible and requires the use of special concepts (e.g., null values). In currently used models, these concepts are only insufficiently designed for the actual requirements, so the originally available information often cannot be recovered from the stored data. Previous approaches to remedying this problem have, for various reasons, failed to gain acceptance. The work described here therefore makes a proposal intended both to reduce the loss of information during data storage and to avoid the weaknesses of previous solutions with regard to their lack of acceptance. The former is made possible by the use of several distinct null values; the latter rests mainly on avoiding serious deviations from the currently prevailing models. Since this in turn requires retaining important and fundamental concepts, the different null values must be evaluated in three-valued logic. Besides compatibility with the prevailing models, this approach also offers the advantage of low model complexity and enables intuitive handling of systems based on the newly designed model.
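
    A minimal sketch of the proposal as the abstract describes it: several distinguishable null values are stored, yet comparisons involving any of them still evaluate in three-valued logic, which is what keeps the model close to existing systems. The null kinds and function names below are illustrative assumptions.

        from enum import Enum

        class Null(Enum):
            UNKNOWN = "a value exists but is not known"
            INAPPLICABLE = "the attribute does not apply to this object"
            NO_INFORMATION = "not even known whether a value exists"

        TRUE, FALSE, MAYBE = "true", "false", "maybe"  # three truth values

        def equals(a, b):
            """Three-valued equality: any null operand yields MAYBE; the
            different null kinds matter when interpreting the stored data,
            not for the logic of the comparison itself."""
            if isinstance(a, Null) or isinstance(b, Null):
                return MAYBE
            return TRUE if a == b else FALSE

        print(equals("Berlin", "Berlin"))              # true
        print(equals(Null.UNKNOWN, "Berlin"))          # maybe
        print(equals(Null.INAPPLICABLE, Null.UNKNOWN)) # maybe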

    Indeterministic Handling of Uncertain Decisions in Deduplication

    In current research and practice, deduplication is usually considered a deterministic approach in which tuples are either declared to be duplicates or not. In ambiguous situations, however, it is often not completely clear-cut which tuples represent the same real-world entity. In deterministic approaches, many realistic possibilities may be ignored, which in turn can lead to false decisions. In this paper, we present an indeterministic approach to deduplication that uses a probabilistic target model, including techniques for a proper probabilistic interpretation of similarity matching results. Thus, instead of deciding on the single most likely situation, all realistic situations are modeled in the resultant data. This approach minimizes the negative impact of false decisions. Furthermore, the deduplication process becomes almost fully automatic, and human effort can be reduced to a large extent. To increase applicability, we introduce several semi-indeterministic methods that heuristically reduce the set of indeterministically handled decisions in several meaningful ways. We also describe a full-indeterministic method for theoretical and presentational reasons.
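
    One plausible semi-indeterministic heuristic, sketched here as an assumption rather than as one of the paper's actual methods, resolves clearly matching and clearly non-matching pairs deterministically and keeps only the ambiguous middle band of match scores indeterministic, which bounds the number of possible worlds that must be stored.

        def classify(score, t_low=0.3, t_high=0.9):
            """Two-threshold decision rule over a match score in [0, 1];
            the threshold values are arbitrary example choices."""
            if score >= t_high:
                return "duplicate"        # deterministic: merge the tuples
            if score <= t_low:
                return "non-duplicate"    # deterministic: keep them apart
            return "indeterministic"      # keep both possible worlds

        for s in (0.95, 0.55, 0.10):
            print(s, "->", classify(s))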
