17,042 research outputs found
Analysis of approximate nearest neighbor searching with clustered point sets
We present an empirical analysis of data structures for approximate nearest
neighbor searching. We compare the well-known optimized kd-tree splitting
method against two alternative splitting methods. The first, called the
sliding-midpoint method, which attempts to balance the goals of producing
subdivision cells of bounded aspect ratio, while not producing any empty cells.
The second, called the minimum-ambiguity method is a query-based approach. In
addition to the data points, it is also given a training set of query points
for preprocessing. It employs a simple greedy algorithm to select the splitting
plane that minimizes the average amount of ambiguity in the choice of the
nearest neighbor for the training points. We provide an empirical analysis
comparing these two methods against the optimized kd-tree construction for a
number of synthetically generated data and query sets. We demonstrate that for
clustered data and query sets, these algorithms can provide significant
improvements over the standard kd-tree construction for approximate nearest
neighbor searching.Comment: 20 pages, 8 figures. Presented at ALENEX '99, Baltimore, MD, Jan
15-16, 199
Efficient mining of frequent item sets on large uncertain databases
The data handled in emerging applications like location-based services, sensor monitoring systems, and data integration, are often inexact in nature. In this paper, we study the important problem of extracting frequent item sets from a large uncertain database, interpreted under the Possible World Semantics (PWS). This issue is technically challenging, since an uncertain database contains an exponential number of possible worlds. By observing that the mining process can be modeled as a Poisson binomial distribution, we develop an approximate algorithm, which can efficiently and accurately discover frequent item sets in a large uncertain database. We also study the important issue of maintaining the mining result for a database that is evolving (e.g., by inserting a tuple). Specifically, we propose incremental mining algorithms, which enable Probabilistic Frequent Item set (PFI) results to be refreshed. This reduces the need of re-executing the whole mining algorithm on the new database, which is often more expensive and unnecessary. We examine how an existing algorithm that extracts exact item sets, as well as our approximate algorithm, can support incremental mining. All our approaches support both tuple and attribute uncertainty, which are two common uncertain database models. We also perform extensive evaluation on real and synthetic data sets to validate our approaches. © 1989-2012 IEEE.published_or_final_versio
Efficient Analysis of Pattern and Association Rule Mining Approaches
The process of data mining produces various patterns from a given data
source. The most recognized data mining tasks are the process of discovering
frequent itemsets, frequent sequential patterns, frequent sequential rules and
frequent association rules. Numerous efficient algorithms have been proposed to
do the above processes. Frequent pattern mining has been a focused topic in
data mining research with a good number of references in literature and for
that reason an important progress has been made, varying from performant
algorithms for frequent itemset mining in transaction databases to complex
algorithms, such as sequential pattern mining, structured pattern mining,
correlation mining. Association Rule mining (ARM) is one of the utmost current
data mining techniques designed to group objects together from large databases
aiming to extract the interesting correlation and relation among huge amount of
data. In this article, we provide a brief review and analysis of the current
status of frequent pattern mining and discuss some promising research
directions. Additionally, this paper includes a comparative study between the
performance of the described approaches.Comment: 14 pages, 3 figures. arXiv admin note: text overlap with
arXiv:1312.4800; and with arXiv:1109.2427 by other author
NCeSS Project : Data mining for social scientists
We will discuss the work being undertaken on the NCeSS data mining project, a one year project at the University of Manchester which began at the start of 2007, to develop data mining tools of value to the social science community. Our primary goal is to produce a
suite of data mining codes, supported by a web interface, to allow social scientists to mine their datasets in a straightforward way and hence, gain new insights into their data. In order to fully define the requirements, we are looking at a range of typical datasets to find out what
forms they take and the applications and algorithms that will be required. In this paper, we will describe a number of these datasets and will discuss how easily data mining techniques can be used to extract information from the data that would either not be possible or would be
too time consuming by more standard methods
Approximate k-nearest neighbour based spatial clustering using k-d tree
Different spatial objects that vary in their characteristics, such as
molecular biology and geography, are presented in spatial areas. Methods to
organize, manage, and maintain those objects in a structured manner are
required. Data mining raised different techniques to overcome these
requirements. There are many major tasks of data mining, but the mostly used
task is clustering. Data set within the same cluster share common features that
give each cluster its characteristics. In this paper, an implementation of
Approximate kNN-based spatial clustering algorithm using the K-d tree is
proposed. The major contribution achieved by this research is the use of the
k-d tree data structure for spatial clustering, and comparing its performance
to the brute-force approach. The results of the work performed in this paper
revealed better performance using the k-d tree, compared to the traditional
brute-force approach
- …
