43 research outputs found

    Certainty of outlier and boundary points processing in data mining

    Full text link
    Data certainty is one of the issues in the real-world applications which is caused by unwanted noise in data. Recently, more attentions have been paid to overcome this problem. We proposed a new method based on neutrosophic set (NS) theory to detect boundary and outlier points as challenging points in clustering methods. Generally, firstly, a certainty value is assigned to data points based on the proposed definition in NS. Then, certainty set is presented for the proposed cost function in NS domain by considering a set of main clusters and noise cluster. After that, the proposed cost function is minimized by gradient descent method. Data points are clustered based on their membership degrees. Outlier points are assigned to noise cluster and boundary points are assigned to main clusters with almost same membership degrees. To show the effectiveness of the proposed method, two types of datasets including 3 datasets in Scatter type and 4 datasets in UCI type are used. Results demonstrate that the proposed cost function handles boundary and outlier points with more accurate membership degrees and outperforms existing state of the art clustering methods.Comment: Conference Paper, 6 page

    Duplicate Detection in Probabilistic Data

    Get PDF
    Collected data often contains uncertainties. Probabilistic databases have been proposed to manage uncertain data. To combine data from multiple autonomous probabilistic databases, an integration of probabilistic data has to be performed. Until now, however, data integration approaches have focused on the integration of certain source data (relational or XML). There is no work on the integration of uncertain (esp. probabilistic) source data so far. In this paper, we present a first step towards a concise consolidation of probabilistic data. We focus on duplicate detection as a representative and essential step in an integration process. We present techniques for identifying multiple probabilistic representations of the same real-world entities. Furthermore, for increasing the efficiency of the duplicate detection process we introduce search space reduction methods adapted to probabilistic data

    Indeterministic Handling of Uncertain Decisions in Duplicate Detection

    Get PDF
    In current research, duplicate detection is usually considered as a deterministic approach in which tuples are either declared as duplicates or not. However, most often it is not completely clear whether two tuples represent the same real-world entity or not. In deterministic approaches, however, this uncertainty is ignored, which in turn can lead to false decisions. In this paper, we present an indeterministic approach for handling uncertain decisions in a duplicate detection process by using a probabilistic target schema. Thus, instead of deciding between multiple possible worlds, all these worlds can be modeled in the resulting data. This approach minimizes the negative impacts of false decisions. Furthermore, the duplicate detection process becomes almost fully automatic and human effort can be reduced to a large extent. Unfortunately, a full-indeterministic approach is by definition too expensive (in time as well as in storage) and hence impractical. For that reason, we additionally introduce several semi-indeterministic methods for heuristically reducing the set of indeterministic handled decisions in a meaningful way

    Provenance Circuits for Trees and Treelike Instances (Extended Version)

    Full text link
    Query evaluation in monadic second-order logic (MSO) is tractable on trees and treelike instances, even though it is hard for arbitrary instances. This tractability result has been extended to several tasks related to query evaluation, such as counting query results [3] or performing query evaluation on probabilistic trees [10]. These are two examples of the more general problem of computing augmented query output, that is referred to as provenance. This article presents a provenance framework for trees and treelike instances, by describing a linear-time construction of a circuit provenance representation for MSO queries. We show how this provenance can be connected to the usual definitions of semiring provenance on relational instances [20], even though we compute it in an unusual way, using tree automata; we do so via intrinsic definitions of provenance for general semirings, independent of the operational details of query evaluation. We show applications of this provenance to capture existing counting and probabilistic results on trees and treelike instances, and give novel consequences for probability evaluation.Comment: 48 pages. Presented at ICALP'1

    The Dichotomy of Conjunctive Queries on Probabilistic Structures

    Full text link
    We show that for every conjunctive query, the complexity of evaluating it on a probabilistic database is either \PTIME or #\P-complete, and we give an algorithm for deciding whether a given conjunctive query is \PTIME or #\P-complete. The dichotomy property is a fundamental result on query evaluation on probabilistic databases and it gives a complete classification of the complexity of conjunctive queries

    SEMISTRUCTURED PROBABILISTIC OBJECT QUERY LANGUAGE (A Query Language for Semistructured Probabilistic Data)

    Get PDF
    This work presents SPOQL, a structured query language for Semistructured Probabilistic Object (SPO) model [4]. The original query language for semistructured probabilistic database management system [20], SP-Algebra [4], has limitations such as complex functional notation and unfamiliarity to application programmers. SPOQL alleviates these problems by providing a user friendly and familiar SQL-like declarative syntax for writing queries against SPDBMS. We show that parsing SPOQL queries is a more involving task than parsing SQL queries. We describe the evaluation algorithm for SPOQL queries that we have implemented
    corecore