Certainty of outlier and boundary points processing in data mining
Data certainty is an issue in real-world applications because of unwanted
noise in data, and it has recently received increasing attention. We propose a
new method based on neutrosophic set (NS) theory to detect boundary and outlier
points, which are challenging for clustering methods. First, a certainty value
is assigned to each data point based on the proposed definition in NS. Then, a
certainty set is presented for the proposed cost function in the NS domain by
considering a set of main clusters and a noise cluster. The proposed cost
function is then minimized by gradient descent. Data points are clustered based
on their membership degrees: outlier points are assigned to the noise cluster,
and boundary points are assigned to main clusters with almost equal membership
degrees. To show the effectiveness of the proposed method, two types of
datasets are used: 3 scatter-type datasets and 4 UCI datasets. Results
demonstrate that the proposed cost function handles boundary and outlier points
with more accurate membership degrees and outperforms existing state-of-the-art
clustering methods.
Comment: Conference Paper, 6 pages
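The abstract does not give the NS cost function, so the sketch below swaps in a different, classical technique, noise clustering with inverse-squared-distance memberships, only to illustrate the behaviour described: outliers land in the noise cluster, while boundary points receive nearly equal memberships in the main clusters. The function name `memberships`, the parameter `noise_distance`, and the example coordinates are all assumptions made for illustration, not the authors' method.

```python
import math

def memberships(point, centers, noise_distance, eps=1e-12):
    """Return (cluster_memberships, noise_membership) for one point.

    Assumption: memberships are proportional to inverse squared distance
    (fuzzifier m = 2), and the noise cluster lies at the same constant
    distance `noise_distance` from every point.
    """
    inv = [1.0 / (math.dist(point, c) ** 2 + eps) for c in centers]
    inv_noise = 1.0 / noise_distance ** 2
    total = sum(inv) + inv_noise
    return [w / total for w in inv], inv_noise / total

# Two hypothetical main clusters centred at (0, 0) and (4, 0).
centers = [(0.0, 0.0), (4.0, 0.0)]

# A point near the first centre, one midway between both, one far away.
core, core_noise = memberships((0.1, 0.0), centers, noise_distance=3.0)
boundary, boundary_noise = memberships((2.0, 0.0), centers, noise_distance=3.0)
outlier, outlier_noise = memberships((2.0, 30.0), centers, noise_distance=3.0)
```

Here the midway point gets the same membership in both main clusters, and the distant point puts almost all of its membership in the noise cluster. In the paper these degrees come from minimizing the proposed cost function by gradient descent rather than from a closed-form rule like this one.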
08421 Abstracts Collection -- Uncertainty Management in Information Systems
From October 12 to 17, 2008, the Dagstuhl Seminar 08421 "Uncertainty Management in Information Systems" was held in Schloss Dagstuhl – Leibniz Center for Informatics. The abstracts of the plenary and session talks given during the seminar, as well as those of the demos shown, are collected in this paper.
Enhancing Mobile Object Classification Using Geo-referenced Maps and Evidential Grids
Evidential grids have recently shown interesting properties for mobile object
perception. Evidential grids are a generalisation of Bayesian occupancy grids
using Dempster-Shafer theory. In particular, these grids can efficiently
handle partial information. The novelty of this article is to propose a
perception scheme enhanced by geo-referenced maps used as an additional source
of information, which is fused with a sensor grid. The paper presents the key
stages of such a data fusion process. An adaptation of the conjunctive
combination rule is presented to refine the analysis of the conflicting information. The
method uses temporal accumulation to make the distinction between stationary
and mobile objects, and applies contextual discounting for modelling
information obsolescence. As a result, the method is able to better
characterise the occupied cells by differentiating, for instance, moving
objects, parked cars, urban infrastructure and buildings. Experiments carried
out on real-world data illustrate the benefits of such an approach.
Comment: 6 pp. arXiv admin note: substantial text overlap with arXiv:1207.101
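As a rough illustration of the fusion step, the sketch below applies the standard unnormalised conjunctive rule and classical (not contextual) discounting to a single grid cell over the frame {Free, Occupied}. It is not the adapted rule or the contextual discounting from the paper; the mass values and the names `sensor`, `prior_map`, and `alpha` are assumptions chosen for the example.

```python
from itertools import product

FREE, OCC = frozenset({"F"}), frozenset({"O"})
UNKNOWN = FREE | OCC      # the whole frame: total ignorance
CONFLICT = frozenset()    # empty set: mass given to contradictory evidence

def conjunctive(m1, m2):
    """Unnormalised conjunctive rule: m(A) = sum of m1(B) * m2(C) over B & C == A."""
    out = {}
    for (b, mb), (c, mc) in product(m1.items(), m2.items()):
        a = b & c
        out[a] = out.get(a, 0.0) + mb * mc
    return out

def discount(m, alpha):
    """Classical discounting: scale every mass by (1 - alpha), give alpha to UNKNOWN."""
    out = {a: (1.0 - alpha) * v for a, v in m.items() if a != UNKNOWN}
    out[UNKNOWN] = (1.0 - alpha) * m.get(UNKNOWN, 0.0) + alpha
    return out

# Hypothetical masses for one cell: the sensor suggests "occupied",
# while the geo-referenced map suggests "free" (for example, a road).
sensor = {OCC: 0.7, UNKNOWN: 0.3}
prior_map = {FREE: 0.6, UNKNOWN: 0.4}

fused = conjunctive(sensor, prior_map)   # conflict mass lands on the empty set
aged = discount(sensor, alpha=0.5)       # older sensor data becomes less committed
```

With these numbers, the disagreement between sensor and map shows up as mass 0.42 on the empty set, which is the kind of conflicting information the paper's adapted rule is meant to analyse further. Discounting the sensor mass by 0.5 halves its support for "occupied" and moves the remainder to ignorance.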
Big data and the SP theory of intelligence
This article is about how the "SP theory of intelligence" and its realisation
in the "SP machine" may, with advantage, be applied to the management and
analysis of big data. The SP system -- introduced in the article and fully
described elsewhere -- may help to overcome the problem of variety in big data:
it has potential as "a universal framework for the representation and
processing of diverse kinds of knowledge" (UFK), helping to reduce the
diversity of formalisms and formats for knowledge and the different ways in
which they are processed. It has strengths in the unsupervised learning or
discovery of structure in data, in pattern recognition, in the parsing and
production of natural language, in several kinds of reasoning, and more. It
lends itself to the analysis of streaming data, helping to overcome the problem
of velocity in big data. Central in the workings of the system is lossless
compression of information: making big data smaller and reducing problems of
storage and management. There is potential for substantial economies in the
transmission of data, for big cuts in the use of energy in computing, for
faster processing, and for smaller and lighter computers. The system provides a
handle on the problem of veracity in big data, with potential to assist in the
management of errors and uncertainties in data. It lends itself to the
visualisation of knowledge structures and inferential processes. A
highly parallel, open-source version of the SP machine would provide a means for
researchers everywhere to explore what can be done with the system and to
create new versions of it.
Comment: Accepted for publication in IEEE Access
Similarity search and data mining techniques for advanced database systems.
Modern automated methods for measurement, collection, and analysis of data in industry and science are providing more and more data with drastically increasing structural complexity. On the one hand, this growing complexity is justified by the need for a richer and more precise description of real-world objects; on the other hand, it is justified by the rapid progress in measurement and analysis techniques that allow the user a versatile exploration of objects. In order to manage the huge volume of such complex data, advanced database systems are employed. In contrast to conventional database systems that support exact match queries, the user of these advanced database systems focuses on applying similarity search and data mining techniques.
Based on an analysis of typical advanced database systems — such as biometrical, biological, multimedia, moving, and CAD-object database systems — the following three challenging characteristics of complexity are detected: uncertainty (probabilistic feature vectors), multiple instances (a set of homogeneous feature vectors), and multiple representations (a set of heterogeneous feature vectors). Therefore, the goal of this thesis is to develop similarity search and data mining techniques that are capable of handling uncertain, multi-instance, and multi-represented objects.
The first part of this thesis deals with similarity search techniques. Object identification is a similarity search technique that is typically used for the recognition of objects from image, video, or audio data. Thus, we develop a novel probabilistic model for object identification. Based on it, two novel types of identification queries are defined. In order to process the novel query types efficiently, we introduce an index structure called Gauss-tree. In addition, we specify further probabilistic models and query types for uncertain multi-instance objects and uncertain spatial objects. Based on the index structure, we develop algorithms for an efficient processing of these query types. Practical benefits of using probabilistic feature vectors are demonstrated on a real-world application for video similarity search. Furthermore, a similarity search technique is presented that is based on aggregated multi-instance objects, and that is suitable for video similarity search. This technique takes multiple representations into account in order to achieve better effectiveness.
The second part of this thesis deals with two major data mining techniques: clustering and classification. Since privacy preservation is a very important demand of distributed advanced applications, we propose using uncertainty for data obfuscation in order to provide privacy preservation during clustering. Furthermore, a model-based and a density-based clustering method for multi-instance objects are developed. Afterwards, original extensions and enhancements of the density-based clustering algorithms DBSCAN and OPTICS for handling multi-represented objects are introduced. Since several advanced database systems like biological or multimedia database systems handle predefined, very large class systems, two novel classification techniques for large class sets that benefit from using multiple representations are defined. The first classification method is based on the idea of a k-nearest-neighbor classifier. It employs a novel density-based technique to reduce training instances and exploits the entropy impurity of the local neighborhood in order to weight a given representation. The second technique addresses hierarchically-organized class systems. It uses a novel hierarchical, supervised method for the reduction of large multi-instance objects, e.g. audio or video, and applies support vector machines for efficient hierarchical classification of multi-represented objects. User benefits of this technique are demonstrated by a prototype that performs a classification of large music collections.
The effectiveness and efficiency of all proposed techniques are discussed and verified by comparison with conventional approaches in versatile experimental evaluations on real-world datasets.
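The abstract mentions weighting a representation by the entropy impurity of the local neighborhood but does not give the formula. The sketch below computes the Shannon entropy of the class labels among a query's nearest neighbors in each representation, then applies a hypothetical weighting (one minus the normalised entropy, rescaled to sum to one). The weighting rule, the representation names, and the labels are assumptions for illustration, not the thesis's definition.

```python
import math
from collections import Counter

def entropy_impurity(labels):
    """Shannon entropy, in bits, of the class labels in a neighborhood."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def representation_weights(neighbor_labels, n_classes):
    """Hypothetical rule: weight = 1 - entropy / log2(n_classes), normalised.

    `neighbor_labels` maps a representation name to the class labels of the
    query's k nearest neighbors in that representation.
    """
    max_entropy = math.log2(n_classes)
    raw = {rep: 1.0 - entropy_impurity(lbls) / max_entropy
           for rep, lbls in neighbor_labels.items()}
    total = sum(raw.values())
    if total <= 0:  # every neighborhood is maximally impure: fall back to uniform
        return {rep: 1.0 / len(raw) for rep in raw}
    return {rep: w / total for rep, w in raw.items()}

# Hypothetical 5-nearest-neighbor labels for one song in two representations.
neighbors = {
    "timbre": ["rock", "rock", "rock", "rock", "rock"],
    "rhythm": ["rock", "jazz", "rock", "jazz", "pop"],
}
weights = representation_weights(neighbors, n_classes=3)
```

A representation whose neighborhood is pure (all neighbors share one class) gets a high weight, while one whose neighbors are spread over many classes is down-weighted.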