
    Enhancing In-Memory Spatial Indexing with Learned Search

    Spatial data is ubiquitous. Massive amounts of data are generated every day from a plethora of sources such as billions of GPS-enabled devices (e.g., cell phones, cars, and sensors), consumer-based applications (e.g., Uber and Strava), and social media platforms (e.g., location-tagged posts on Facebook, Twitter, and Instagram). This exponential growth in spatial data has led the research community to build systems and applications for efficient spatial data processing. In this study, we apply a recently developed machine-learned search technique for single-dimensional sorted data to spatial indexing. Specifically, we partition spatial data using six traditional spatial partitioning techniques and employ machine-learned search within each partition to support point, range, distance, and spatial join queries. Adhering to the latest research trends, we tune the partitioning techniques to be instance-optimized. By tuning each partitioning technique for optimal performance, we demonstrate that: (i) grid-based index structures outperform tree-based index structures (from 1.23× to 2.47×), (ii) learning-enhanced variants of commonly used spatial index structures outperform their original counterparts (from 1.44× to 53.34× faster), (iii) machine-learned search within a partition is faster than binary search by 11.79%-39.51% when filtering on one dimension, (iv) the benefit of machine-learned search diminishes in the presence of other compute-intensive operations (e.g., scan costs in higher-selectivity queries, Haversine distance computation, and point-in-polygon tests), and (v) index lookup is the bottleneck for tree-based structures, which could potentially be reduced by linearizing the indexed partitions.
    Additional Key Words and Phrases: spatial data, indexing, machine-learning, spatial queries, geospatial
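
    For intuition, the Python sketch below illustrates machine-learned search over one sorted dimension: a simple linear model predicts a key's position within a sorted partition, and a binary search restricted to the model's worst-case training error corrects the prediction. The class name, the linear model, and the error-bound strategy are illustrative choices, not the authors' implementation.

# Illustrative sketch of machine-learned search over one sorted dimension.
# A least-squares line maps key -> position; lookups binary-search only the
# window [prediction - max_error, prediction + max_error]. Names here are
# hypothetical, not the paper's implementation.
import bisect

class LearnedSortedSearch:
    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        # Fit position ~ slope * key + intercept by least squares on (key, rank).
        mean_k = sum(self.keys) / n
        mean_p = (n - 1) / 2
        var_k = sum((k - mean_k) ** 2 for k in self.keys) or 1.0
        cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(self.keys))
        self.slope = cov / var_k
        self.intercept = mean_p - self.slope * mean_k
        # Worst-case prediction error on the indexed keys bounds the search window.
        self.err = max(abs(self._predict(k) - i) for i, k in enumerate(self.keys))

    def _predict(self, key):
        return int(self.slope * key + self.intercept)

    def lower_bound(self, key):
        # Search only inside the error window instead of the whole partition.
        pred = self._predict(key)
        lo = max(0, pred - self.err)
        hi = min(len(self.keys), pred + self.err + 1)
        return bisect.bisect_left(self.keys, key, lo, hi)

# Usage: point lookup on the sorted x-coordinates of one partition's contents.
idx = LearnedSortedSearch([1.0, 2.5, 2.7, 4.0, 8.1, 9.3])
print(idx.lower_bound(4.0))  # -> 3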

    Interactive visual data exploration with subjective feedback : an information-theoretic approach

    Visual exploration of high-dimensional real-valued datasets is a fundamental task in exploratory data analysis (EDA). Existing projection methods for data visualization use predefined criteria to choose the representation of the data. There is a lack of methods that (i) use information on what the user has already learned from the data and (ii) show patterns that the user does not yet know. We construct a theoretical model where identified patterns can be input as knowledge to the system. The knowledge syntax is intuitive, such as "this set of points forms a cluster", and requires no knowledge of mathematics. This background knowledge is used to find a maximum entropy distribution of the data, after which the user is provided with the data projections for which the data and the maximum entropy distribution differ the most, hence showing the user aspects of the data that are maximally informative given the background knowledge. We study the computational performance of our model and present use cases on synthetic and real data. We find that the model allows the user to learn information efficiently from various data sources and works sufficiently fast in practice. In addition, we provide an open-source EDA demonstrator system implementing our model with tailored interactive visualizations. We conclude that the information-theoretic approach to EDA, where patterns observed by a user are formalized as constraints, provides a principled, intuitive, and efficient basis for constructing an EDA system.
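
    The following Python sketch illustrates the underlying idea under simplifying assumptions (it is not the paper's algorithm): the background knowledge is represented by a Gaussian, which is the maximum entropy distribution under mean and covariance constraints, and the most informative projection is taken to be the direction along which the data's variance exceeds the background model's variance the most, computed via a generalized eigenvalue problem.

# A minimal numeric sketch of the idea (not the paper's exact algorithm):
# the user's current background knowledge is a Gaussian; we return the
# projection direction along which the data deviates most from it, measured
# here by the variance ratio (a generalized eigenvalue problem).
import numpy as np

def most_informative_direction(X, bg_mean, bg_cov):
    """Return the unit direction w maximizing Var_data(w.x) / Var_background(w.x)."""
    Xc = X - bg_mean
    data_cov = Xc.T @ Xc / len(X)
    # Whiten with respect to the background model, then take the top eigenvector.
    L = np.linalg.cholesky(bg_cov + 1e-9 * np.eye(bg_cov.shape[0]))
    Linv = np.linalg.inv(L)
    eigvals, eigvecs = np.linalg.eigh(Linv @ data_cov @ Linv.T)
    w = Linv.T @ eigvecs[:, -1]          # map back from whitened space
    return w / np.linalg.norm(w)

# Usage: the background model assumes independent unit-variance features, but
# the data has an unexpected correlation -> that direction is surfaced.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
X[:, 0] += 4.0 * X[:, 2]                 # inject a pattern the user does not know yet
w = most_informative_direction(X, bg_mean=np.zeros(3), bg_cov=np.eye(3))
print(np.round(w, 2))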

    What is Normal, What is Strange, and What is Missing in a Knowledge Graph: Unified Characterization via Inductive Summarization

    Knowledge graphs (KGs) store highly heterogeneous information about the world in the structure of a graph, and are useful for tasks such as question answering and reasoning. However, they often contain errors and are missing information. Vibrant research in KG refinement has worked to resolve these issues, tailoring techniques to either detect specific types of errors or complete a KG. In this work, we introduce a unified solution to KG characterization by formulating the problem as unsupervised KG summarization with a set of inductive, soft rules, which describe what is normal in a KG, and thus can be used to identify what is abnormal, whether it be strange or missing. Unlike first-order logic rules, our rules are labeled, rooted graphs, i.e., patterns that describe the expected neighborhood around a (seen or unseen) node, based on its type and information in the KG. Stepping away from the traditional support/confidence-based rule mining techniques, we propose KGist, Knowledge Graph Inductive SummarizaTion, which learns a summary of inductive rules that best compress the KG according to the Minimum Description Length principle, a formulation that we are the first to use in the context of KG rule mining. We apply our rules to three large KGs (NELL, DBpedia, and Yago), and tasks such as compression, various types of error detection, and identification of incomplete information. We show that KGist outperforms task-specific, supervised and unsupervised baselines in error detection and incompleteness identification (identifying the location of up to 93% of missing entities, over 10% more than baselines), while also being efficient for large knowledge graphs.
    Comment: 10 pages, plus 2 pages of references. 5 figures. Accepted at The Web Conference 2020.
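
    As a rough illustration of the MDL formulation (a toy Python sketch, not KGist itself), the snippet below greedily keeps a candidate rule only when the bits it saves by explaining KG edges exceed the bits needed to encode the rule; the cost terms and rule representation are crude placeholders.

# Toy illustration of MDL-style rule selection: each candidate "rule" explains
# a set of KG edges; a rule is kept only if the data bits it saves exceed the
# model bits needed to state it. All costs are stand-ins (log2 counts).
import math

def description_length(num_unexplained_edges, rules, bits_per_edge, bits_per_rule):
    model_cost = len(rules) * bits_per_rule
    data_cost = num_unexplained_edges * bits_per_edge
    return model_cost + data_cost

def greedy_mdl_select(all_edges, candidate_rules, bits_per_rule=64.0):
    bits_per_edge = math.log2(max(2, len(all_edges)))   # cost to spell out one edge
    unexplained = set(all_edges)
    chosen = []
    # Try the rules that explain the most edges first.
    for name, covered in sorted(candidate_rules.items(), key=lambda kv: -len(kv[1])):
        newly_covered = unexplained & covered
        if len(newly_covered) * bits_per_edge > bits_per_rule:   # rule pays for itself
            chosen.append(name)
            unexplained -= newly_covered
    return chosen, description_length(len(unexplained), chosen, bits_per_edge, bits_per_rule)

# Usage on a tiny fake KG: r1 explains many edges and is kept; r2 explains too few.
edges = {(f"person{i}", "bornIn", f"city{i % 3}") for i in range(40)}
edges |= {(f"person{i}", "worksIn", "city0") for i in range(5)}
rules = {
    "r1: person --bornIn--> city": {e for e in edges if e[1] == "bornIn"},
    "r2: person --worksIn--> city": {e for e in edges if e[1] == "worksIn"},
}
print(greedy_mdl_select(edges, rules))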

    State Management for Efficient Event Pattern Detection

    Event stream processing (ESP) systems continuously evaluate queries over event streams to detect user-specified patterns with low latency. The challenge is that query processing is stateful: systems maintain partial matches, whose number grows exponentially in the number of processed events. State management is further complicated by the dynamicity of streams and the need to integrate remote data. First, heterogeneous event sources yield dynamic streams with unpredictable input rates, data distributions, and query selectivities. During peak times, exhaustive processing is unreasonable, and systems must resort to best-effort processing. Second, queries may require remote data to select a specific event for a pattern. Such dependencies are problematic: fetching the remote data interrupts the stream processing, yet without event selection based on remote data, the growth of partial matches is amplified. In this dissertation, I present strategies for optimised state management in event pattern detection. First, I enable best-effort processing with load shedding that discards both input events and partial matches. I carefully select the shedding elements to satisfy a latency bound while striving for a minimal loss in result quality. Second, to efficiently integrate remote data, I decouple the fetching of remote data from its use in query evaluation by a caching mechanism. To this end, I hide the transmission latency by prefetching remote data based on anticipated use and by lazy evaluation that postpones the event selection based on remote data to avoid interruptions. A cost model determines when to fetch which remote data items and how long to keep them in the cache. I evaluated the above techniques with queries over synthetic and real-world data. I show that the load shedding technique significantly improves the recall of pattern detection over baseline approaches, while the technique for remote data integration significantly reduces the pattern detection latency.
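
    The Python sketch below illustrates the load-shedding idea under simplifying assumptions (it is not the dissertation's algorithm): when the queued events and partial matches exceed what can be processed within the latency bound, the items with the lowest estimated utility are discarded first. The utility function and per-item cost estimate are hypothetical placeholders.

# Illustrative sketch of utility-based load shedding: keep only as many queued
# items (input events or partial matches) as fit in the latency budget,
# preferring those most likely to contribute to future matches.
import heapq

def shed_load(queue, latency_bound_ms, cost_per_item_ms, utility):
    """Return (kept, dropped) so that processing the kept items meets the bound."""
    capacity = int(latency_bound_ms // cost_per_item_ms)
    if len(queue) <= capacity:
        return queue, []
    kept = heapq.nlargest(capacity, queue, key=utility)   # highest-utility items survive
    kept_ids = set(map(id, kept))
    dropped = [item for item in queue if id(item) not in kept_ids]
    return kept, dropped

# Usage: partial matches scored by how close they are to completing the pattern.
partial_matches = [{"events_seen": k, "pattern_len": 5} for k in (1, 1, 2, 4, 4, 3)]
kept, dropped = shed_load(partial_matches, latency_bound_ms=4, cost_per_item_ms=1,
                          utility=lambda m: m["events_seen"] / m["pattern_len"])
print(len(kept), "kept,", len(dropped), "dropped")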