Certainty of outlier and boundary points processing in data mining
Data certainty is an issue in real-world applications that is caused by
unwanted noise in data. Recently, more attention has been paid to
overcoming this problem. We propose a new method based on neutrosophic set (NS)
theory to detect boundary and outlier points, which are challenging points in
clustering methods. First, a certainty value is assigned to data
points based on the proposed definition in NS. Then, a certainty set is presented
for the proposed cost function in the NS domain by considering a set of main
clusters and a noise cluster. After that, the proposed cost function is minimized
by the gradient descent method. Data points are clustered based on their membership
degrees. Outlier points are assigned to the noise cluster, and boundary points are
assigned to main clusters with almost the same membership degrees. To show the
effectiveness of the proposed method, two types of datasets are used: 3
datasets of Scatter type and 4 datasets of UCI type. Results
demonstrate that the proposed cost function handles boundary and outlier points
with more accurate membership degrees and outperforms existing state-of-the-art
clustering methods. Comment: Conference Paper, 6 pages
Duplicate Detection in Probabilistic Data
Collected data often contains uncertainties. Probabilistic databases have been proposed to manage uncertain data. To combine data from multiple autonomous probabilistic databases, an integration of probabilistic data has to be performed. Until now, however, data integration approaches have focused on the integration of certain source data (relational or XML). There is no work on the integration of uncertain (esp. probabilistic) source data so far. In this paper, we present a first step towards a concise consolidation of probabilistic data. We focus on duplicate detection as a representative and essential step in an integration process. We present techniques for identifying multiple probabilistic representations of the same real-world entities. Furthermore, to increase the efficiency of the duplicate detection process, we introduce search space reduction methods adapted to probabilistic data.
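As a rough illustration of comparing two probabilistic representations of an entity, the sketch below computes an expected similarity over all pairs of alternatives. The data layout (a list of `(value, probability)` alternatives per tuple), the function names, and the use of `difflib` string similarity are assumptions made for this example; the abstract does not specify the paper's actual matching technique.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Plain string similarity in [0, 1] (an illustrative choice)."""
    return SequenceMatcher(None, a, b).ratio()


def expected_similarity(x, y) -> float:
    """Expected similarity of two probabilistic tuples.

    x and y are lists of (value, probability) alternatives. Each pair of
    alternatives contributes its similarity weighted by the product of the
    two probabilities (assumes the tuples are independent).
    """
    return sum(p * q * similarity(a, b) for a, p in x for b, q in y)


# Hypothetical example: one uncertain name compared with a certain one.
x = [("John Smith", 0.7), ("Jon Smith", 0.3)]
y = [("John Smith", 1.0)]
score = expected_similarity(x, y)
```

A score close to 1 suggests that the two tuples describe the same real-world entity; any cutoff for declaring a duplicate would have to be chosen per application.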
Indeterministic Handling of Uncertain Decisions in Duplicate Detection
In current research, duplicate detection is usually considered as a deterministic approach in which tuples are either declared as duplicates or not. However, most often it is not completely clear whether two tuples represent the same real-world entity or not. In deterministic approaches, this uncertainty is ignored, which in turn can lead to false decisions. In this paper, we present an indeterministic approach for handling uncertain decisions in a duplicate detection process by using a probabilistic target schema. Thus, instead of deciding between multiple possible worlds, all these worlds can be modeled in the resulting data. This approach minimizes the negative impacts of false decisions. Furthermore, the duplicate detection process becomes almost fully automatic and human effort can be reduced to a large extent. Unfortunately, a fully indeterministic approach is by definition too expensive (in time as well as in storage) and hence impractical. For that reason, we additionally introduce several semi-indeterministic methods for heuristically reducing the set of indeterministically handled decisions in a meaningful way.
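One way to picture a semi-indeterministic decision is a two-threshold rule: clear cases are decided deterministically, and only the pairs in between are kept as uncertain, with a probability attached. The thresholds `t_low` and `t_high`, the linear mapping from similarity to probability, and the function name are all assumptions for illustration, not the methods of the paper.

```python
def classify_pair(sim: float, t_low: float = 0.4, t_high: float = 0.8):
    """Classify a tuple pair from its similarity score in [0, 1].

    Returns (label, probability_of_duplicate). Only pairs that fall strictly
    between the two thresholds are handled indeterministically.
    """
    if sim >= t_high:
        return ("duplicate", 1.0)
    if sim <= t_low:
        return ("non-duplicate", 0.0)
    # Hypothetical choice: scale the score linearly between the thresholds.
    p = (sim - t_low) / (t_high - t_low)
    return ("possible", p)


# A "possible" pair corresponds to two possible worlds in the result:
# one where the tuples are merged (probability p) and one where they
# stay separate (probability 1 - p).
label, p = classify_pair(0.6)
```

Moving the thresholds closer together shrinks the set of indeterministically handled decisions, at the price of more potentially false deterministic ones.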
Provenance Circuits for Trees and Treelike Instances (Extended Version)
Query evaluation in monadic second-order logic (MSO) is tractable on trees
and treelike instances, even though it is hard for arbitrary instances. This
tractability result has been extended to several tasks related to query
evaluation, such as counting query results [3] or performing query evaluation
on probabilistic trees [10]. These are two examples of the more general problem
of computing augmented query output, that is referred to as provenance. This
article presents a provenance framework for trees and treelike instances, by
describing a linear-time construction of a circuit provenance representation
for MSO queries. We show how this provenance can be connected to the usual
definitions of semiring provenance on relational instances [20], even though we
compute it in an unusual way, using tree automata; we do so via intrinsic
definitions of provenance for general semirings, independent of the operational
details of query evaluation. We show applications of this provenance to capture
existing counting and probabilistic results on trees and treelike instances,
and give novel consequences for probability evaluation. Comment: 48 pages. Presented at ICALP'1
The Dichotomy of Conjunctive Queries on Probabilistic Structures
We show that for every conjunctive query, the complexity of evaluating it on
a probabilistic database is either PTIME or #P-complete, and we give an
algorithm for deciding whether a given conjunctive query is PTIME or
#P-complete. The dichotomy property is a fundamental result on query
evaluation on probabilistic databases and it gives a complete classification of
the complexity of conjunctive queries.
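To give a feel for what such a decision procedure looks like, here is a minimal sketch for a much simpler setting than the one in the abstract: Boolean conjunctive queries without self-joins on tuple-independent databases, where the known criterion is that a query is PTIME exactly when it is hierarchical. The query encoding (a dict from relation name to its set of variables) and the function name are assumptions for this example; the general algorithm for every conjunctive query is more involved and is not attempted here.

```python
from itertools import combinations


def is_hierarchical(atoms: dict) -> bool:
    """Check the hierarchical property of a self-join-free conjunctive query.

    atoms maps each relation name to the set of variables it uses. The query
    is hierarchical if, for every two variables, the sets of atoms containing
    them are either disjoint or one is contained in the other.
    """
    variables = set().union(*atoms.values())
    at = {v: {name for name, vs in atoms.items() if v in vs} for v in variables}
    for x, y in combinations(variables, 2):
        a, b = at[x], at[y]
        if a & b and not (a <= b or b <= a):
            return False
    return True


# R(x), S(x, y): hierarchical, so PTIME in this restricted setting.
easy = is_hierarchical({"R": {"x"}, "S": {"x", "y"}})

# R(x), S(x, y), T(y): the classic non-hierarchical (hard) query.
hard = is_hierarchical({"R": {"x"}, "S": {"x", "y"}, "T": {"y"}})
```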
SEMISTRUCTURED PROBABILISTIC OBJECT QUERY LANGUAGE (A Query Language for Semistructured Probabilistic Data)
This work presents SPOQL, a structured query language for the Semistructured Probabilistic Object (SPO) model [4]. The original query language for the semistructured probabilistic database management system (SPDBMS) [20], SP-Algebra [4], has limitations such as complex functional notation and unfamiliarity to application programmers. SPOQL alleviates these problems by providing a user-friendly and familiar SQL-like declarative syntax for writing queries against SPDBMS. We show that parsing SPOQL queries is a more involved task than parsing SQL queries. We describe the evaluation algorithm for SPOQL queries that we have implemented.