2,578 research outputs found

    Indexing Metric Spaces for Exact Similarity Search

    With the continued digitalization of societal processes, we are seeing an explosion in available data. This is referred to as big data. In a research setting, three aspects of the data are often viewed as the main sources of challenges when attempting to enable value creation from big data: volume, velocity and variety. Many studies address volume or velocity, while far fewer studies concern variety. Metric space is ideal for addressing variety because it can accommodate any type of data as long as its associated distance notion satisfies the triangle inequality. To accelerate search in metric space, a collection of indexing techniques for metric data have been proposed. However, existing surveys each offer only narrow coverage, and no comprehensive empirical study of those techniques exists. We offer a survey of all the existing metric indexes that can support exact similarity search, by i) summarizing all the existing partitioning, pruning and validation techniques used for metric indexes, ii) providing the time and storage complexity analysis of index construction, and iii) reporting on a comprehensive empirical comparison of their similarity query processing performance. Here, empirical comparisons are used to evaluate index performance during search because complexity analysis reveals little difference between the indexes in similarity query processing, and query performance depends on pruning and validation abilities, which in turn depend on the data distribution. This article aims at revealing the strengths and weaknesses of different indexing techniques in order to offer guidance on selecting an appropriate indexing technique for a given setting, and at directing future research on metric indexes.
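The abstract above relies on the triangle inequality as the basis for pruning in metric indexes. The following is a minimal sketch of that idea, not one of the surveyed indexes: a table of precomputed object-to-pivot distances is used to skip objects whose lower bound on the distance to the query already exceeds the search radius. The function names (`build_pivot_table`, `range_query`) and the choice of Euclidean distance are assumptions made for illustration.

```python
import math

def build_pivot_table(data, pivots, dist):
    """Precompute d(x, p) for every object x and every pivot p."""
    return [[dist(x, p) for p in pivots] for x in data]

def range_query(data, pivots, table, q, radius, dist):
    """Exact range query with pivot-based pruning.

    By the triangle inequality, |d(q, p) - d(x, p)| <= d(q, x) for any
    pivot p, so an object can be discarded without computing d(q, x)
    whenever that lower bound exceeds the radius.
    Returns the matching objects and the number of distances computed
    against database objects.
    """
    q_to_pivots = [dist(q, p) for p in pivots]
    results, computed = [], 0
    for x, row in zip(data, table):
        lower = max(abs(dq - dx) for dq, dx in zip(q_to_pivots, row))
        if lower > radius:
            continue  # pruned: x cannot be within the radius
        computed += 1  # validation: the real distance is needed
        if dist(q, x) <= radius:
            results.append(x)
    return results, computed

# Hypothetical toy data, for illustration only.
data = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (10.0, 10.0)]
pivots = [(0.0, 0.0)]
table = build_pivot_table(data, pivots, math.dist)
matches, n_computed = range_query(data, pivots, table, (0.5, 0.0), 1.0, math.dist)
```

Because the lower bound never overestimates the true distance, pruning removes only objects that cannot qualify, so the result stays exact; how many distance computations are saved depends on the pivots and on the data distribution.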

    Mining Aircraft Telemetry Data With Evolutionary Algorithms

    The Ganged Phased Array Radar - Risk Mitigation System (GPAR-RMS) was a mobile ground-based sense-and-avoid system for Unmanned Aircraft System (UAS) operations developed by the University of North Dakota. GPAR-RMS detected proximate aircraft with various sensor systems, including a 2D radar and an Automatic Dependent Surveillance - Broadcast (ADS-B) receiver. Information about those aircraft was then displayed to UAS operators via visualization software developed by the University of North Dakota. The Risk Mitigation (RM) subsystem for GPAR-RMS was designed to estimate the current risk of midair collision, between the Unmanned Aircraft (UA) and a General Aviation (GA) aircraft flying under Visual Flight Rules (VFR) in the surrounding airspace, for UAS operations in Class E airspace (i.e. below 18,000 feet MSL). However, accurate probabilistic models for the behavior of pilots of GA aircraft flying under VFR in Class E airspace were needed before the RM subsystem could be implemented. In this dissertation the author presents the results of data mining an aircraft telemetry data set from a consecutive nine-month period in 2011. This aircraft telemetry data set consisted of Flight Data Monitoring (FDM) data obtained from Garmin G1000 devices onboard every Cessna 172 in the University of North Dakota's training fleet. Data from aircraft which were potentially within the controlled airspace surrounding controlled airports were excluded. Also, GA aircraft in the FDM data flying in Class E airspace were assumed to be flying under VFR, which is usually a valid assumption. Complex subpaths were discovered from the aircraft telemetry data set using a novel application of an ant colony algorithm. Then, probabilistic models were data mined from those subpaths using extensions of the Genetic K-Means (GKA) and Expectation-Maximization (EM) algorithms.
    The results obtained from the subpath discovery and data mining suggest that a pilot flying a GA aircraft near an uncontrolled airport will perform different maneuvers than a pilot flying a GA aircraft far from an uncontrolled airport, irrespective of the altitude of the GA aircraft. However, since only aircraft telemetry data from the University of North Dakota's training fleet were data mined, these results are not likely to be applicable to GA aircraft operating in a non-training environment.

    A Distributed and Approximated Nearest Neighbors Algorithm for an Efficient Large Scale Mean Shift Clustering

    In this paper we target the class of modal clustering methods where clusters are defined in terms of the local modes of the probability density function which generates the data. The most well-known modal clustering method is the k-means clustering. Mean Shift clustering is a generalization of the k-means clustering which computes arbitrarily shaped clusters defined as the basins of attraction to the local modes created by the density gradient ascent paths. Despite its potential, the Mean Shift approach is a computationally expensive method for unsupervised learning. Thus, we introduce two contributions aiming to provide clustering algorithms with a linear time complexity, as opposed to the quadratic time complexity for the exact Mean Shift clustering. First, we propose a scalable procedure to approximate the density gradient ascent. Second, we present a scalable cluster labeling technique. Both propositions are based on Locality Sensitive Hashing (LSH) to approximate nearest neighbors. These two techniques may be used for moderate-sized datasets. Furthermore, we show that using our proposed approximations of the density gradient ascent as a pre-processing step in other clustering methods can also improve dedicated classification metrics. For the latter, a distributed implementation, written for the Spark/Scala ecosystem, is proposed. For all these considered clustering methods, we present experimental results illustrating their labeling accuracy and their potential to solve concrete problems. Comment: Algorithms are available at https://github.com/Clustering4Ever/Clustering4Eve
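For context, the exact procedure that the abstract describes as quadratic can be sketched as follows. This is not the authors' LSH-based method and not code from the linked repository; it is a minimal exact Mean Shift with a flat kernel, where every point is repeatedly moved to the mean of its neighbors until it stops at a mode. All names (`mean_shift_step`, `mean_shift`, `label_by_mode`) and parameters (`bandwidth`, `merge_radius`) are assumptions made for illustration.

```python
import math

def mean_shift_step(point, data, bandwidth):
    """Move a point to the mean of the data within `bandwidth` (flat kernel)."""
    neighbors = [x for x in data if math.dist(point, x) <= bandwidth]
    if not neighbors:
        return point
    dim = len(point)
    return tuple(sum(x[i] for x in neighbors) / len(neighbors) for i in range(dim))

def mean_shift(data, bandwidth, max_iter=100, tol=1e-6):
    """Run gradient ascent from every data point; return one mode per point.

    Each step scans the whole data set for neighbors, which is the
    quadratic cost that an approximate neighbor search aims to avoid.
    """
    modes = []
    for p in data:
        cur = tuple(p)
        for _ in range(max_iter):
            nxt = mean_shift_step(cur, data, bandwidth)
            if math.dist(cur, nxt) < tol:
                break
            cur = nxt
        modes.append(cur)
    return modes

def label_by_mode(modes, merge_radius):
    """Give the same label to points whose modes lie within `merge_radius`."""
    centers, labels = [], []
    for m in modes:
        for idx, c in enumerate(centers):
            if math.dist(m, c) <= merge_radius:
                labels.append(idx)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return labels

# Hypothetical 1-D toy data with two well-separated groups.
points = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,), (5.2,)]
labels = label_by_mode(mean_shift(points, bandwidth=1.0), merge_radius=0.5)
```

In this sketch the neighbor scan in `mean_shift_step` is the step an LSH-based approximation would replace, by restricting the search to points that share a hash bucket with the current point.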

    Evaluating tradeoff between recall and performance of GPU permutation index

    Query-by-content, by means of similarity search, is a fundamental operation for applications that deal with multimedia data. For this kind of query it is meaningless to look for elements exactly equal to the given query object. Instead, we need to measure the dissimilarity between the query object and each database object. This search problem can be formalized with the concept of a metric space. In this scenario, search efficiency is understood as minimizing the number of distance calculations required to answer a query. Building an index can be a solution, but with very large metric databases this is not enough: it is also necessary to speed up queries by using high-performance computing, such as GPUs, and in some cases it is reasonable to accept a fast answer even if it is inexact. In this work we evaluate the tradeoff between the answer quality and time performance of our implementation of the Permutation Index, on a pure GPU architecture, used to solve multiple approximate similarity searches on metric databases in parallel. WPDP - XIII Workshop Procesamiento Distribuido y Paralelo. Red de Universidades con Carreras en Informática (RedUNCI)
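The recall versus cost tradeoff mentioned above can be illustrated with a small sequential sketch of a permutation-based index. This is not the authors' GPU implementation; it assumes the commonly described scheme in which each object is represented by the order in which it sees a set of pivots, candidates are ranked by the Spearman footrule distance between permutations, and only a fraction of the database is compared with the real distance. The names `permutation`, `footrule`, `build_index`, `knn_approx` and the `fraction` parameter are hypothetical.

```python
import math

def permutation(obj, pivots, dist):
    """Return, for each pivot, its rank when pivots are sorted by distance to obj."""
    order = sorted(range(len(pivots)), key=lambda i: dist(obj, pivots[i]))
    pos = [0] * len(pivots)
    for rank, i in enumerate(order):
        pos[i] = rank
    return pos

def footrule(a, b):
    """Spearman footrule: sum of rank differences between two permutations."""
    return sum(abs(x - y) for x, y in zip(a, b))

def build_index(data, pivots, dist):
    """The index stores one permutation per database object."""
    return [permutation(x, pivots, dist) for x in data]

def knn_approx(q, data, index, pivots, dist, k, fraction):
    """Approximate k-NN: evaluate real distances on only a fraction of the data.

    A larger `fraction` costs more distance computations and tends to give
    higher recall; fraction=1.0 degenerates to an exact sequential scan.
    """
    q_perm = permutation(q, pivots, dist)
    ranked = sorted(range(len(data)), key=lambda i: footrule(q_perm, index[i]))
    n_candidates = max(k, int(len(data) * fraction))
    candidates = sorted(ranked[:n_candidates], key=lambda i: dist(q, data[i]))
    return [data[i] for i in candidates[:k]]

# Hypothetical toy data; pivots are picked arbitrarily from the data.
data = [(float(i), float(i % 3)) for i in range(20)]
pivots = [data[0], data[7], data[19]]
index = build_index(data, pivots, math.dist)
query = (4.2, 1.1)
exact = knn_approx(query, data, index, pivots, math.dist, k=3, fraction=1.0)
approx = knn_approx(query, data, index, pivots, math.dist, k=3, fraction=0.3)
recall = len(set(approx) & set(exact)) / len(exact)
```

Varying `fraction` and recording `recall` against the number of real distance evaluations gives the kind of quality versus time curve the abstract refers to, here on a CPU and for a single query at a time.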