398 research outputs found

    Towards a Scalable Dynamic Spatial Database System

    Get PDF
    With the rise of GPS-enabled smartphones and other similar mobile devices, massive amounts of location data are available. However, no scalable solutions for soft real-time spatial queries on large sets of moving objects have yet emerged. In this paper we explore and measure the limits of actual algorithms and implementations regarding different application scenarios. And finally we propose a novel distributed architecture to solve the scalability issues.Comment: (2012

    Advance of the Access Methods

    Get PDF
    The goal of this paper is to outline the advance of the access methods in the last ten years as well as to make review of all available in the accessible bibliography methods

    Linking Spatial Video and GIS

    Get PDF
    Spatial Video is any form of geographically referenced videographic data. The forms in which it is acquired, stored and used vary enormously; as does the standard of accuracy in the spatial data and the quality of the video footage. This research deals with a specific form of Spatial Video where these data have been captured from a moving road-network survey vehicle. The spatial data are GPS sentences while the video orientation is approximately orthogonal and coincident with the direction of travel. GIS that use these data are usually bespoke standalone systems or third party extensions to existing platforms. They specialise in using the video as a visual enhancement with limited spatial functionality and interoperability. While enormous amounts of these data exist, they do not have a generalised, cross-platform spatial data structure that is suitable for use within a GIS. The objectives of this research have been to define, develop and implement a novel Spatial Video data structure and demonstrate how this can achieve a spatial approach to the study of video. This data structure is called a Viewpoint and represents the capture location and geographical extent of each video frame. It is generalised to represent any form or format of Spatial Video. It is shown how a Viewpoint improves on existing data structure methodologies and how it can be theoretically defined in 3D space. A 2D implementation is then developed where Viewpoints are constructed from the spatial and camera parameters of each survey in the study area. A number of problems are defined and solutions provided towards the implementation of a post-processing system to calculate, index and store each video frame Viewpoint in a centralised spatial database. From this spatial database a number of geospatial analysis approaches are demonstrated that represent novel ways of using and studying Spatial Video based on the Viewpoint data structure. Also, a unique application is developed where the Viewpoints are used as a spatial control to dynamically access and play video in a location aware system. While video has been to date largely ignored as a GIS spatial data source; it is shown through this novel Viewpoint implementation and the geospatial analysis demonstrations that this need not be the case anymore

    Generalized and efficient outlier detection for spatial, temporal, and high-dimensional data mining

    Get PDF
    Knowledge Discovery in Databases (KDD) ist der Prozess, nicht-triviale Muster aus großen Datenbanken zu extrahieren, mit dem Ziel, dass diese bisher unbekannt, potentiell nützlich, statistisch fundiert und verständlich sind. Der Prozess umfasst mehrere Schritte wie die Selektion, Vorverarbeitung, Evaluierung und den Analyseschritt, der als Data-Mining bekannt ist. Eine der zentralen Aufgabenstellungen im Data-Mining ist die Ausreißererkennung, das Identifizieren von Beobachtungen, die ungewöhnlich sind und mit der Mehrzahl der Daten inkonsistent erscheinen. Solche seltene Beobachtungen können verschiedene Ursachen haben: Messfehler, ungewöhnlich starke (aber dennoch genuine) Abweichungen, beschädigte oder auch manipulierte Daten. In den letzten Jahren wurden zahlreiche Verfahren zur Erkennung von Ausreißern vorgeschlagen, die sich oft nur geringfügig zu unterscheiden scheinen, aber in den Publikationen experimental als ``klar besser'' dargestellt sind. Ein Schwerpunkt dieser Arbeit ist es, die unterschiedlichen Verfahren zusammenzuführen und in einem gemeinsamen Formalismus zu modularisieren. Damit wird einerseits die Analyse der Unterschiede vereinfacht, andererseits aber die Flexibilität der Verfahren erhöht, indem man Module hinzufügen oder ersetzen und damit die Methode an geänderte Anforderungen und Datentypen anpassen kann. Um die Vorteile der modularisierten Struktur zu zeigen, werden (i) zahlreiche bestehende Algorithmen in dem Schema formalisiert, (ii) neue Module hinzugefügt, um die Robustheit, Effizienz, statistische Aussagekraft und Nutzbarkeit der Bewertungsfunktionen zu verbessern, mit denen die existierenden Methoden kombiniert werden können, (iii) Module modifiziert, um bestehende und neue Algorithmen auf andere, oft komplexere, Datentypen anzuwenden wie geographisch annotierte Daten, Zeitreihen und hochdimensionale Räume, (iv) mehrere Methoden in ein Verfahren kombiniert, um bessere Ergebnisse zu erzielen, (v) die Skalierbarkeit auf große Datenmengen durch approximative oder exakte Indizierung verbessert. Ausgangspunkt der Arbeit ist der Algorithmus Local Outlier Factor (LOF). Er wird zunächst mit kleinen Erweiterungen modifiziert, um die Robustheit und die Nutzbarkeit der Bewertung zu verbessern. Diese Methoden werden anschließend in einem gemeinsamen Rahmen zur Erkennung lokaler Ausreißer formalisiert, um die entsprechenden Vorteile auch in anderen Algorithmen nutzen zu können. Durch Abstraktion von einem einzelnen Vektorraum zu allgemeinen Datentypen können auch räumliche und zeitliche Beziehungen analysiert werden. Die Verwendung von Unterraum- und Korrelations-basierten Nachbarschaften ermöglicht dann, einen neue Arten von Ausreißern in beliebig orientierten Projektionen zu erkennen. Verbesserungen bei den Bewertungsfunktionen erlauben es, die Bewertung mit der statistischen Intuition einer Wahrscheinlichkeit zu interpretieren und nicht nur eine Ausreißer-Rangfolge zu erstellen wie zuvor. Verbesserte Modelle generieren auch Erklärungen, warum ein Objekt als Ausreißer bewertet wurde. Anschließend werden für verschiedene Module Verbesserungen eingeführt, die unter anderem ermöglichen, die Algorithmen auf wesentlich größere Datensätze anzuwenden -- in annähernd linearer statt in quadratischer Zeit --, indem man approximative Nachbarschaften bei geringem Verlust an Präzision und Effektivität erlaubt. Des weiteren wird gezeigt, wie mehrere solcher Algorithmen mit unterschiedlichen Intuitionen gleichzeitig benutzt und die Ergebnisse in einer Methode kombiniert werden können, die dadurch unterschiedliche Arten von Ausreißern erkennen kann. Schließlich werden für reale Datensätze neue Ausreißeralgorithmen konstruiert, die auf das spezifische Problem angepasst sind. Diese neuen Methoden erlauben es, so aufschlussreiche Ergebnisse zu erhalten, die mit den bestehenden Methoden nicht erreicht werden konnten. Da sie aus den Bausteinen der modularen Struktur entwickelt wurden, ist ein direkter Bezug zu den früheren Ansätzen gegeben. Durch Verwendung der Indexstrukturen können die Algorithmen selbst auf großen Datensätzen effizient ausgeführt werden.Knowledge Discovery in Databases (KDD) is the process of extracting non-trivial patterns in large data bases, with the focus of extracting novel, potentially useful, statistically valid and understandable patterns. The process involves multiple phases including selection, preprocessing, evaluation and the analysis step which is known as Data Mining. One of the key techniques of Data Mining is outlier detection, that is the identification of observations that are unusual and seemingly inconsistent with the majority of the data set. Such rare observations can have various reasons: they can be measurement errors, unusually extreme (but valid) measurements, data corruption or even manipulated data. Over the previous years, various outlier detection algorithms have been proposed that often appear to be only slightly different than previous but ``clearly outperform'' the others in the experiments. A key focus of this thesis is to unify and modularize the various approaches into a common formalism to make the analysis of the actual differences easier, but at the same time increase the flexibility of the approaches by allowing the addition and replacement of modules to adapt the methods to different requirements and data types. To show the benefits of the modularized structure, (i) several existing algorithms are formalized within the new framework (ii) new modules are added that improve the robustness, efficiency, statistical validity and score usability and that can be combined with existing methods (iii) modules are modified to allow existing and new algorithms to run on other, often more complex data types including spatial, temporal and high-dimensional data spaces (iv) the combination of multiple algorithm instances into an ensemble method is discussed (v) the scalability to large data sets is improved using approximate as well as exact indexing. The starting point is the Local Outlier Factor (LOF) algorithm, which is extended with slight modifications to increase robustness and the usability of the produced scores. In order to get the same benefits for other methods, these methods are abstracted to a general framework for local outlier detection. By abstracting from a single vector space, other data types that involve spatial and temporal relationships can be analyzed. The use of subspace and correlation neighborhoods allows the algorithms to detect new kinds of outliers in arbitrarily oriented subspaces. Improvements in the score normalization bring back a statistic intuition of probabilities to the outlier scores that previously were only useful for ranking objects, while improved models also offer explanations of why an object was considered to be an outlier. Subsequently, for different modules found in the framework improved modules are presented that for example allow to run the same algorithms on significantly larger data sets -- in approximately linear complexity instead of quadratic complexity -- by accepting approximated neighborhoods at little loss in precision and effectiveness. Additionally, multiple algorithms with different intuitions can be run at the same time, and the results combined into an ensemble method that is able to detect outliers of different types. Finally, new outlier detection methods are constructed; customized for the specific problems of these real data sets. The new methods allow to obtain insightful results that could not be obtained with the existing methods. Since being constructed from the same building blocks, there however exists a strong and explicit connection to the previous approaches, and by using the indexing strategies introduced earlier, the algorithms can be executed efficiently even on large data sets

    A Sweep-Plane Algorithm for Calculating the Isolation of Mountains

    Get PDF
    One established metric to classify the significance of a mountain peak is its isolation. It specifies the distance between a peak and the closest point of higher elevation. Peaks with high isolation dominate their surroundings and provide a nice view from the top. With the availability of worldwide Digital Elevation Models (DEMs), the isolation of all mountain peaks can be computed automatically. Previous algorithms run in worst case time that is quadratic in the input size. We present a novel sweep-plane algorithm that runs in time ?(nlog n+pT_NN) where n is the input size, p the number of considered peaks and T_NN the time for a 2D nearest-neighbor query in an appropriate geometric search tree. We refine this to a two-level approach that has high locality and good parallel scalability. Our implementation reduces the time for calculating the isolation of every peak on Earth from hours to minutes while improving precision

    High-Dimensional Spatio-Temporal Indexing

    Get PDF
    There exist numerous indexing methods which handle either spatio-temporal or high-dimensional data well. However, those indexing methods which handle spatio-temporal data well have certain drawbacks when confronted with high-dimensional data. As the most efficient spatio-temporal indexing methods are based on the R-tree and its variants, they face the well known problems in high-dimensional space. Furthermore, most high-dimensional indexing methods try to reduce the number of dimensions in the data being indexed and compress the information given by all dimensions into few dimensions but are not able to store now - relative data. One of the most efficient high-dimensional indexing methods, the Pyramid Technique, is able to handle high-dimensional point-data only. Nonetheless, we take this technique and extend it such that it is able to handle spatio-temporal data as well. We introduce a technique for querying in this structure with spatio-temporal queries. We compare our technique, the Spatio-Temporal Pyramid Adapter (STPA), to the RST-tree for in-memory and on-disk applications. We show that for high dimensions, the extra query-cost for reducing the dimensionality in the Pyramid Technique is clearly exceeded by the rising query-cost in the RST-tree. Concluding, we address the main drawbacks and advantages of our technique

    SVS-JOIN : efficient spatial visual similarity join for geo-multimedia

    Get PDF
    In the big data era, massive amount of multimedia data with geo-tags has been generated and collected by smart devices equipped with mobile communications module and position sensor module. This trend has put forward higher request on large-scale geo-multimedia retrieval. Spatial similarity join is one of the significant problems in the area of spatial database. Previous works focused on spatial textual document search problem, rather than geo-multimedia retrieval. In this paper, we investigate a novel geo-multimedia retrieval paradigm named spatial visual similarity join (SVS-JOIN for short), which aims to search similar geo-image pairs in both aspects of geo-location and visual content. Firstly, the definition of SVS-JOIN is proposed and then we present the geographical similarity and visual similarity measurement. Inspired by the approach for textual similarity join, we develop an algorithm named SVS-JOIN B by combining the PPJOIN algorithm and visual similarity. Besides, an extension of it named SVS-JOIN G is developed, which utilizes spatial grid strategy to improve the search efficiency. To further speed up the search, a novel approach called SVS-JOIN Q is carefully designed, in which a quadtree and a global inverted index are employed. Comprehensive experiments are conducted on two geo-image datasets and the results demonstrate that our solution can address the SVS-JOIN problem effectively and efficiently

    Scaling k-Nearest Neighbors Queries (The Right Way)

    Get PDF
    Recently parallel / distributed processing approaches have been proposed for processing k-Nearest Neighbours (kNN) queries over very large (multidimensional) datasets aiming to ensure scalability. However, this is typically achieved at the expense of efficiency. With this paper we offer a novel approach that alleviates the performance problems associated with state of the art methods. The essence of our approach, which differentiates it from related research, rests on (i) adopting a coordinator-based distributed processing algorithm, instead of those employed over data-parallel executionengines (such as Hadoop/MapReduce or Spark), and (ii) on a way to organize data, to structure computation, and to index the stored datasets that ensures that only a very small number of data items are retrieved from the underlying data store, communicated over the network, and processed by the coordinatorfor every kNN query. Our approach also pays special attention to ensuring scalability in addition to low query processing times. Overall, kNN queries can be processed in just tens of milliseconds (as opposed to the tens of) seconds required by state of the art. We have implemented our approach, usinga NoSQL DB (HBase) as the data store, and we compare it against the state-of-the-art: the Hadoop-based Spatial Hadoop (SHadoop) and the Spark-based Simba methods. We employ different datasets of various sizes, showcasing the contributed performance advantages. Our approach outperforms the stateof the art, by 2-3 orders of magnitude, and consistently for dataset sizes ranging from hundreds of millions to hundreds of billions of data points. We also show that the key constituent performance overheads incurred during query processing (such as the number of data items retrieved from the data store, the required network bandwidth, and the processing time at the coordinator) scale very well, ensuring the overall scalability of the approach

    Multilateratin index.

    Get PDF
    We present an alternative method for pre-processing and storing point data, particularly for Geospatial points, by storing multilateration distances to fixed points rather than coordinates such as Latitude and Longitude. We explore the use of this data to improve query performance for some distance related queries such as nearest neighbor and query-within-radius (i.e. “find all points in a set P within distance d of query point q”). Further, we discuss the problem of “Network Adequacy” common to medical and communications businesses, to analyze questions such as “are at least 90% of patients living within 50 miles of a covered emergency room.” This is in fact the class of question that led to the creation of our pre-processing and algorithms, and is a generalization of a class of Nearest-Neighbor problems. We hypothesize that storing the distances from fixed points (typically three, as in trilateration) as an alternative to Latitude and Longitude can be used to improve performance on distance functions when large numbers of points are involved, allowing algorithms that are efficient for Nearest Neighbor and Network Adequacy queries. This effectively creates a coordinate system where the coordinates are the trilateration distances. We explore this alternative coordinate system and the theoretical, technical, and practical implications of using it. Multilateration itself is a common technique in surveying and geo-location widely used in cartography, surveying, and orienteering, although algorithmic use of these concepts for NN-style problems are scarce. GPS uses the concept of detecting the distance of a device to multiple satellites to determine the location of the device; a concept known as true-range multilateration. However while the approach is common, the distance values from multilateration are typically immediately converted to Latitude/Longitude and then discarded. Here we attempt to use those intermediate distance values to computational benefit. Conceptually, our multilateration construction is applicable to metric spaces in any number of dimensions. Rather than requiring the complex pre-calculated tree structures (as in Ball and KD-Trees)(Liu, Moore, and Gray 2006), or high cost pre-calculated nearest-neighbor graphs (as in FAISS)(Johnson, Douze, and Jégou 2017), we rely only on sorted arrays as indexes. This approach also allows for processing computationally intensive distance queries (such as nearest-neighbor) in a way that is easily implemented with data manipulation languages such as SQL. We experiment with simple algorithms using the multilateration index to exploit these features. Weset up experiments for Nearest Neighbor and Network Adequacy on high computational cost distance functions, on various sized data sets to compare our performance to other existing algorithms. Our results include a roughly 10x performance improvement to existing query logic using SQL engines, and a 30x performance gain in Cython - compared to other NN algorithms using the popular ann-benchmark tool - when the cost of the atomic distance calculation itself is high, such as with geodesic distances on earth requiring high precision. While we focus primarily on geospatial data, potential applications to this approach extend to any distance-measured n-dimensional metric space where the distance function itself has a high computational cost
    corecore