Indexing Metric Spaces for Exact Similarity Search
With the continued digitalization of societal processes, we are seeing an
explosion in available data. This is referred to as big data. In a research
setting, three aspects of the data are often viewed as the main sources of
challenges when attempting to enable value creation from big data: volume,
velocity and variety. Many studies address volume or velocity, while far fewer
studies concern variety. Metric space is ideal for addressing variety
because it can accommodate any type of data as long as its associated distance
notion satisfies the triangle inequality. To accelerate search in metric space,
a collection of indexing techniques for metric data has been proposed.
However, existing surveys each offer only narrow coverage, and no
comprehensive empirical study of those techniques exists. We offer a survey of
all the existing metric indexes that can support exact similarity search, by i)
summarizing all the existing partitioning, pruning and validation techniques
used for metric indexes, ii) providing the time and storage complexity analysis
on the index construction, and iii) reporting on a comprehensive empirical
comparison of their similarity query processing performance. Empirical
comparisons are used to evaluate index performance during search because
differences in similarity query processing are hard to see from complexity
analysis, and because query performance depends on pruning and validation
abilities that are related to the data distribution. This article aims at
revealing the strengths and weaknesses of different indexing techniques in
order to offer guidance on selecting an appropriate indexing technique for a
given setting, and at directing future research on metric indexes.
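
To illustrate how the triangle inequality lets a metric index skip distance computations, here is a minimal sketch of pivot-based pruning for a range query. It is not taken from the survey: the function names (`build_pivot_table`, `range_query`), the flat pivot table, and the choice of pivots are assumptions made for illustration only, and the sketch assumes at least one pivot.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def build_pivot_table(
    data: Sequence[T], pivots: Sequence[T], dist: Callable[[T, T], float]
) -> List[List[float]]:
    """Precompute d(x, p) for every object x and every pivot p (hypothetical layout)."""
    return [[dist(x, p) for p in pivots] for x in data]


def range_query(
    query: T,
    radius: float,
    data: Sequence[T],
    pivots: Sequence[T],
    table: List[List[float]],
    dist: Callable[[T, T], float],
) -> Tuple[List[T], int]:
    """Return the objects within `radius` of `query` and the number of distance calls."""
    q_to_p = [dist(query, p) for p in pivots]
    computed = len(pivots)
    results = []
    for x, row in zip(data, table):
        # Triangle inequality: |d(q, p) - d(x, p)| <= d(q, x) for every pivot p,
        # so the largest such gap is a lower bound on the true distance.
        lower_bound = max(abs(qp - xp) for qp, xp in zip(q_to_p, row))
        if lower_bound > radius:
            continue  # pruned without computing d(q, x)
        computed += 1
        if dist(query, x) <= radius:  # validation step
            results.append(x)
    return results, computed
```

The pruning is exact for any `dist` that satisfies the triangle inequality: an object is discarded only when its lower bound already exceeds the radius, so the result matches a linear scan while typically using fewer distance computations.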
Mining Aircraft Telemetry Data With Evolutionary Algorithms
The Ganged Phased Array Radar - Risk Mitigation System (GPAR-RMS) was a
mobile ground-based sense-and-avoid system for Unmanned Aircraft System (UAS)
operations developed by the University of North Dakota. GPAR-RMS detected proximate
aircraft with various sensor systems, including a 2D radar and an Automatic Dependent
Surveillance - Broadcast (ADS-B) receiver. Information about those aircraft was then
displayed to UAS operators via visualization software developed by the University of
North Dakota. The Risk Mitigation (RM) subsystem for GPAR-RMS was designed to
estimate the current risk of midair collision, between the Unmanned Aircraft (UA) and a
General Aviation (GA) aircraft flying under Visual Flight Rules (VFR) in the surrounding
airspace, for UAS operations in Class E airspace (i.e. below 18,000 feet MSL). However,
accurate probabilistic models for the behavior of pilots of GA aircraft flying under VFR
in Class E airspace were needed before the RM subsystem could be implemented.
In this dissertation the author presents the results of data mining an aircraft
telemetry data set from a consecutive nine-month period in 2011. This aircraft telemetry
data set consisted of Flight Data Monitoring (FDM) data obtained from Garmin G1000
devices onboard every Cessna 172 in the University of North Dakota's training fleet.
Data from aircraft which were potentially within the controlled airspace surrounding
controlled airports were excluded. Also, GA aircraft in the FDM data flying in Class E
airspace were assumed to be flying under VFR, which is usually a valid assumption.
Complex subpaths were discovered from the aircraft telemetry data set using a novel
application of an ant colony algorithm. Then, probabilistic models were data mined from
those subpaths using extensions of the Genetic K-Means (GKA) and Expectation-
Maximization (EM) algorithms.
The results obtained from the subpath discovery and data mining suggest a pilot
flying a GA aircraft near to an uncontrolled airport will perform different maneuvers than
a pilot flying a GA aircraft far from an uncontrolled airport, irrespective of the altitude of
the GA aircraft. However, since only aircraft telemetry data from the University of North
Dakota's training fleet were data mined, these results are not likely to be applicable to GA
aircraft operating in a non-training environment.
A Distributed and Approximated Nearest Neighbors Algorithm for an Efficient Large Scale Mean Shift Clustering
In this paper we target the class of modal clustering methods where clusters
are defined in terms of the local modes of the probability density function
which generates the data. The most well-known modal clustering method is the
k-means clustering. Mean Shift clustering is a generalization of the k-means
clustering which computes arbitrarily shaped clusters defined as the basins
of attraction to the local modes created by the density gradient ascent paths.
Despite its potential, the Mean Shift approach is a computationally expensive
method for unsupervised learning. Thus, we introduce two contributions aiming
to provide clustering algorithms with a linear time complexity, as opposed to
the quadratic time complexity for the exact Mean Shift clustering. First, we
propose a scalable procedure to approximate the density gradient ascent.
Second, we present a scalable cluster labeling technique. Both
propositions are based on Locality Sensitive Hashing (LSH) to approximate
nearest neighbors. These two techniques may be used for moderate-sized
datasets. Furthermore, we show that using our proposed approximations of the
density gradient ascent as a pre-processing step in other clustering methods
can also improve dedicated classification metrics. For the latter, a
distributed implementation written for the Spark/Scala ecosystem is proposed.
For all these considered clustering methods, we present experimental results
illustrating their labeling accuracy and their potential to solve concrete
problems. Comment: Algorithms are available at
https://github.com/Clustering4Ever/Clustering4Eve
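
For orientation, the exact procedure that the abstract describes as quadratic can be sketched as follows. This is not the authors' LSH-based method or their Spark/Scala code: it is a plain Mean Shift with a flat kernel, written in Python, and the names `mean_shift_step`, `mean_shift`, `label_by_mode` and the parameters `bandwidth` and `tol` are assumptions for illustration. The full scan in `mean_shift_step` is the neighbor search that an approximate nearest-neighbor scheme such as LSH would replace.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def mean_shift_step(point: Point, data: Sequence[Point], bandwidth: float) -> Point:
    """Move `point` to the mean of the data points within `bandwidth` (flat kernel)."""
    # Exact neighbor search: a full scan, hence quadratic cost over all points.
    neighbors = [x for x in data if math.dist(point, x) <= bandwidth]
    if not neighbors:
        return point
    dim = len(point)
    return tuple(sum(x[i] for x in neighbors) / len(neighbors) for i in range(dim))


def mean_shift(data: Sequence[Point], bandwidth: float, iterations: int = 20) -> List[Point]:
    """Run a fixed number of gradient ascent steps from every data point."""
    shifted = list(data)
    for _ in range(iterations):
        shifted = [mean_shift_step(p, data, bandwidth) for p in shifted]
    return shifted


def label_by_mode(modes: Sequence[Point], tol: float) -> List[int]:
    """Give the same label to points whose ascent paths end within `tol` of each other."""
    centers: List[Point] = []
    labels: List[int] = []
    for m in modes:
        for i, c in enumerate(centers):
            if math.dist(m, c) <= tol:
                labels.append(i)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return labels
```

Calling `label_by_mode(mean_shift(data, bandwidth), tol)` yields one label per input point. The number of clusters is not fixed in advance; it follows from how many distinct modes the ascent paths reach.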
Evaluating tradeoff between recall and performance of GPU permutation index
Query-by-content, by means of similarity search, is a fundamental operation for applications that deal with multimedia data. For this kind of query, it is meaningless to look for elements exactly equal to a given query object. Instead, we need to measure the dissimilarity between the query object and each database object. This search problem can be formalized with the concept of metric space. In this scenario, search efficiency is understood as minimizing the number of distance calculations required to answer a query. Building an index can be a solution, but for very large metric databases this is not enough: it is also necessary to speed up queries by using high-performance computing, such as GPUs, and in some cases it is reasonable to accept a fast answer even if it is inexact. In this work we evaluate the tradeoff between the answer quality and time performance of our implementation of the Permutation Index, on a pure GPU architecture, used to solve multiple approximate similarity searches on metric databases in parallel.
WPDP - XIII Workshop procesamiento distribuido y paralelo
Red de Universidades con Carreras en Informática (RedUNCI)
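
The idea behind a permutation index can be sketched sequentially on the CPU. This is not the GPU implementation evaluated in the paper: the function names and the `fraction` parameter are assumptions for illustration. Each object is represented by the order in which it sees a set of pivots, candidates are ranked by how similar their pivot order is to the query's (Spearman footrule), and only a fraction of the database is compared with the real distance.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def permutation(obj: T, pivots: Sequence[T], dist: Callable[[T, T], float]) -> List[int]:
    """Pivot indices ordered by increasing distance to `obj` (ties broken by index)."""
    return sorted(range(len(pivots)), key=lambda i: (dist(obj, pivots[i]), i))


def footrule(perm_a: Sequence[int], perm_b: Sequence[int]) -> int:
    """Spearman footrule: total displacement of each pivot between two permutations."""
    pos_b = {p: r for r, p in enumerate(perm_b)}
    return sum(abs(r - pos_b[p]) for r, p in enumerate(perm_a))


def build_index(
    data: Sequence[T], pivots: Sequence[T], dist: Callable[[T, T], float]
) -> List[List[int]]:
    """Store one pivot permutation per database object."""
    return [permutation(x, pivots, dist) for x in data]


def approx_knn(
    query: T,
    k: int,
    data: Sequence[T],
    index: Sequence[Sequence[int]],
    pivots: Sequence[T],
    dist: Callable[[T, T], float],
    fraction: float,
) -> List[T]:
    """Approximate k-NN: rank by permutation similarity, then check a fraction exactly."""
    q_perm = permutation(query, pivots, dist)
    order = sorted(range(len(data)), key=lambda i: footrule(q_perm, index[i]))
    # `fraction` trades answer quality for time: 1.0 checks every object (exact answer).
    n_candidates = max(k, int(len(data) * fraction))
    candidates = order[:n_candidates]
    candidates.sort(key=lambda i: dist(query, data[i]))
    return [data[i] for i in candidates[:k]]
```

Lowering `fraction` reduces the number of real distance computations at the risk of missing true neighbors, which is the quality versus time tradeoff the abstract refers to.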