2,333 research outputs found

    Overview of query optimization in XML database systems

    Get PDF

    Fast and Tiny Structural Self-Indexes for XML

    Full text link
    XML document markup is highly repetitive and therefore well compressible using dictionary-based methods such as DAGs or grammars. In the context of selectivity estimation, grammar-compressed trees have previously been used as a synopsis for structural XPath queries. Here a fully-fledged index over such grammars is presented. The index allows arbitrary tree algorithms to be executed with a slow-down that is comparable to the space improvement. More interestingly, certain algorithms execute much faster over the index (because no decompression occurs). For example, for structural XPath count queries, evaluating over the index is faster than previous XPath implementations, often by two orders of magnitude. The index also allows XML results (including texts) to be serialized faster than previous systems, by a factor of ca. 2-3. This is due to efficient copy handling of grammar repetitions, and because materialization is avoided entirely. In order to compare with twig join implementations, we implemented a materializer which writes out pre-order numbers of result nodes, and show its competitiveness. Comment: 13 pages
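    The speed-up on count queries comes from sharing: a repeated subtree is evaluated once and its result reused, with no decompression. Below is a minimal Python sketch of that idea (hash-consing a tree into a DAG, then memoising a label-count query over the shared nodes); the class and function names are illustrative and not the paper's implementation.

```python
# Sketch: counting nodes with a given label over a DAG-compressed tree.
# Repeated subtrees are shared, so the count is computed once per distinct
# subtree instead of once per occurrence.

class Node:
    """An XML element: a label plus child subtrees."""
    def __init__(self, label, children=()):
        self.label = label
        self.children = tuple(children)

def build_dag(root, pool=None):
    """Hash-cons the tree bottom-up so identical subtrees share one object."""
    if pool is None:
        pool = {}
    shared = tuple(build_dag(c, pool) for c in root.children)
    key = (root.label, tuple(id(c) for c in shared))
    if key not in pool:
        pool[key] = Node(root.label, shared)
    return pool[key]

def count_label(dag_root, label):
    """Count occurrences of `label`, memoised per shared subtree."""
    memo = {}
    def go(node):
        if id(node) in memo:
            return memo[id(node)]
        total = (node.label == label) + sum(go(c) for c in node.children)
        memo[id(node)] = total
        return total
    return go(dag_root)

# A highly repetitive document: 1000 structurally identical <item> subtrees.
items = [Node("item", [Node("name"), Node("price")]) for _ in range(1000)]
dag = build_dag(Node("catalog", items))
print(count_label(dag, "price"))  # 1000, yet go() visits only 4 distinct nodes
```

    A full grammar compressor also shares repeated child sequences, not just whole subtrees, but the memoisation principle is the same.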

    Selectivity estimation on set containment search

    Full text link
    © Springer Nature Switzerland AG 2019. In this paper, we study the problem of selectivity estimation on set containment search. Given a query record Q and a record dataset S, we aim to accurately and efficiently estimate the selectivity of set containment search of query Q over S. The problem has many important applications in commercial fields and scientific studies. To the best of our knowledge, this is the first work to study this important problem. We first extend existing distinct-value estimation techniques to solve this problem and develop an inverted list and G-KMV sketch based approach, IL-GKMV. Our analysis shows that the performance of IL-GKMV degrades as the vocabulary size increases. Motivated by the limitations of existing techniques and the inherent challenges of the problem, we develop effective and efficient sampling approaches and propose an ordered trie structure based sampling approach named OT-Sampling. OT-Sampling partitions records based on element frequency and occurrence patterns and is significantly more accurate than the simple random sampling method and IL-GKMV. To further enhance performance, a divide-and-conquer based sampling approach, DC-Sampling, is presented with an inclusion/exclusion prefix to explore pruning opportunities. We theoretically analyse the proposed techniques with regard to various accuracy estimators. Our comprehensive experiments on six real datasets verify the effectiveness and efficiency of the proposed techniques.
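    For context, the simple random sampling baseline that the paper's approaches improve on can be stated in a few lines. Below is a minimal Python sketch (all names are illustrative; this is not the paper's OT-Sampling or DC-Sampling code).

```python
# Sketch: estimating set-containment selectivity by simple random sampling.
import random

def exact_selectivity(query, dataset):
    """Ground truth: fraction of records R in `dataset` with query ⊆ R."""
    q = set(query)
    return sum(q.issubset(r) for r in dataset) / len(dataset)

def sampled_selectivity(query, dataset, sample_size, seed=0):
    """Unbiased estimate from a uniform random sample of records."""
    rng = random.Random(seed)
    q = set(query)
    sample = rng.sample(dataset, min(sample_size, len(dataset)))
    return sum(q.issubset(r) for r in sample) / len(sample)

# Toy data: records drawn over a small vocabulary of 50 elements.
rng = random.Random(42)
data = [frozenset(rng.sample(range(50), rng.randint(3, 10))) for _ in range(10000)]
query = [1, 7]
print(exact_selectivity(query, data))         # exact containment selectivity
print(sampled_selectivity(query, data, 500))  # cheap estimate from 500 records
```

    The estimator is unbiased, but its variance grows as the true selectivity shrinks, which is one motivation for the structured sampling schemes the paper proposes.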

    The European road safety decision support system. A clearinghouse of road safety risks and measures, Deliverable 8.3 of the H2020 project SafetyCube

    Get PDF
    Safety CaUsation, Benefits and Efficiency (SafetyCube) is a European Commission supported Horizon 2020 project with the objective of developing an innovative road safety Decision Support System (DSS) that will enable policy-makers and stakeholders to select and implement the most appropriate strategies, measures and cost-effective approaches to reduce casualties of all road user types and all severities. The core of the SafetyCube project is a comprehensive analysis of accident risks and the effectiveness and cost-benefit of safety measures, focusing on road users, infrastructure, vehicles and post-impact care, framed within a Safe System approach, with road safety stakeholders at the national level, the EU and beyond involved at all stages. The present Deliverable (8.3) outlines the methods and outputs of SafetyCube Task 8.3 - ‘Decision Support System of road safety risks and measures’. A Glossary of the SafetyCube DSS is available in the Appendix of this report. The identification and assessment of user needs for a road safety DSS was conducted on the basis of a broad stakeholder consultation. Dedicated stakeholder workshops yielded comments and input on the SafetyCube methodology, the structure of the DSS and the identification of road safety "hot topics" for human behaviour, infrastructure and vehicles. Additionally, a review of existing decision support systems was carried out; their functions and contents were assessed, indicating that despite their usefulness they are of relatively narrow scope. …

    Low-latency, query-driven analytics over voluminous multidimensional, spatiotemporal datasets

    Get PDF
    2017 Summer. Includes bibliographical references. Ubiquitous data collection from sources such as remote sensing equipment, networked observational devices, location-based services, and sales tracking has led to the accumulation of voluminous datasets; IDC projects that by 2020 we will generate 40 zettabytes of data per year, while Gartner and ABI estimate 20-35 billion new devices will be connected to the Internet in the same time frame. The storage and processing requirements of these datasets far exceed the capabilities of modern computing hardware, which has led to the development of distributed storage frameworks that can scale out by assimilating more computing resources as necessary. While challenging in its own right, storing and managing voluminous datasets is only the precursor to a broader field of study: extracting knowledge, insights, and relationships from the underlying datasets. The basic building block of this knowledge discovery process is analytic queries, encompassing both query instrumentation and evaluation. This dissertation is centered around query-driven exploratory and predictive analytics over voluminous, multidimensional datasets. Both of these types of analysis represent a higher-level abstraction over classical query models; rather than indexing every discrete value for subsequent retrieval, our framework autonomously learns the relationships and interactions between dimensions in the dataset (including time series and geospatial aspects), and makes the information readily available to users. This functionality includes statistical synopses, correlation analysis, hypothesis testing, probabilistic structures, and predictive models that not only enable the discovery of nuanced relationships between dimensions, but also allow future events and trends to be predicted. This requires specialized data structures and partitioning algorithms, along with adaptive reductions in the search space and management of the inherent trade-off between timeliness and accuracy. The algorithms presented in this dissertation were evaluated empirically on real-world geospatial time-series datasets in a production environment, and are broadly applicable across other storage frameworks
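    One building block mentioned above, the statistical synopsis, can be illustrated compactly: keep streaming statistics per partition so exploratory queries read the synopsis instead of scanning raw records. Below is a minimal Python sketch using Welford's online mean/variance per coarse spatial cell; the partitioning key and field names are assumptions for illustration, not the dissertation's framework.

```python
# Sketch: a per-partition statistical synopsis maintained online, so queries
# over a spatial cell are answered from (count, mean, variance) summaries.
from collections import defaultdict

class Synopsis:
    """Streaming mean/variance for one partition (Welford's algorithm)."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

# Partition observations by a coarse spatial key (here: rounded lat/lon).
synopses = defaultdict(Synopsis)
readings = [(40.01, -105.27, 21.4), (40.02, -105.28, 22.1), (39.99, -105.25, 20.8)]
for lat, lon, temperature in readings:
    synopses[(round(lat), round(lon))].add(temperature)

for cell, s in synopses.items():
    print(cell, s.n, round(s.mean, 2), round(s.variance, 3))
```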

    Keyword-based search in peer-to-peer networks

    Get PDF
    Ph.D. (Doctor of Philosophy)

    Versioned index data structures for time travel text search

    Get PDF
    In this work we develop a system for letting users search versioned documents – i.e., collections containing multiple versions of the same document – specifying their validity range by means of time intervals. To this end, we enhance the widely-used Terrier open-source IR system by means of two strategies for index versioning: (i) the Baseline Approach (BA) and (ii) the Mapping Approach (MA).
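    The core idea of a versioned index can be sketched compactly: each posting carries a validity interval, and a time-travel query keeps only postings whose interval contains the query time. Below is a minimal conceptual sketch in Python; it is not Terrier's BA or MA implementation, and all names are illustrative.

```python
# Sketch: an inverted index whose postings carry validity intervals,
# supporting "search as of time t" over versioned documents.
from collections import defaultdict

class VersionedIndex:
    def __init__(self):
        # term -> list of (doc_id, valid_from, valid_to) postings
        self.postings = defaultdict(list)

    def add(self, doc_id, terms, valid_from, valid_to):
        """Index one document version with its validity interval [from, to)."""
        for term in set(terms):
            self.postings[term].append((doc_id, valid_from, valid_to))

    def search(self, term, at_time):
        """Return doc ids whose indexed version was valid at `at_time`."""
        return [d for d, lo, hi in self.postings[term] if lo <= at_time < hi]

idx = VersionedIndex()
idx.add("d1", ["index", "search"], valid_from=2010, valid_to=2014)   # version 1
idx.add("d1", ["index", "ranking"], valid_from=2014, valid_to=2020)  # version 2
print(idx.search("search", at_time=2012))  # ['d1']
print(idx.search("search", at_time=2016))  # []
```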

    A Scalable Clustering Algorithm for High-dimensional Data Streams over Sliding Windows

    Get PDF
    Thesis (Ph.D.) -- Department of Electrical and Computer Engineering, College of Engineering, Graduate School, Seoul National University, August 2017. Sang-goo Lee. Data stream clustering over sliding windows generates clustering results whenever a window moves. However, iterative clustering using all data in a window is highly inefficient in terms of memory and computation time. In this thesis, we address the problem of data stream clustering over sliding windows using sliding window aggregation and nearest neighbor search techniques. Our algorithm constructs and maintains temporal group features as a summary of the window using the sliding window aggregation technique. The technique divides a window into disjoint chunks, computes partial aggregates over each chunk, and merges the partial aggregates to compute overall aggregates. To keep the summary at a constant size, the algorithm reduces it by joining nearest neighbors. We exploit Locality-Sensitive Hashing for fast nearest neighbor search and show that it can serve as an effective method for reducing synopses while minimizing the impact on quality. In addition, we suggest a re-clustering policy, which decides whether to append a new summary to pre-existing clusters or to perform clustering on the whole summary. Our experiments on real-world and synthetic datasets demonstrate that our algorithm can achieve a significant improvement when performing continuous clustering on data streams with sliding windows. Contents: 1. Introduction; 2. Preliminaries and Related Work (Data Streams; Window Models; k-Means Clustering; Coreset; Group Features; Related Work; Problem Statement); 3. GFCS: Group Feature-based Data Stream Clustering with Sliding Windows (2-Level Coresets Construction; 2-Level Coresets Maintenance; Clustering on 2-Level Coresets); 4. CSCS: Coreset-based Data Stream Clustering with Sliding Windows (Coreset Construction based on Nearest Neighbor Search; Coreset Construction based on Locality-Sensitive Hashing; Re-clustering Policy); 5. Empirical Evaluation of Data Stream Clustering with Sliding Windows (Experimental Setup; Experimental Results); 6. Application: Documents Clustering (Vector Representation of Documents; Extension to Other Clustering Algorithms; Evaluation); 7. Conclusion; Appendix (Experimental Results of GFCS and CSCS; Experimental Results of Document Clustering)
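    The summary structure described above works because group features are additive, which is what makes per-chunk partial aggregates mergeable as the window slides. Below is a minimal one-dimensional Python sketch of that mechanism; the names are illustrative, and the thesis itself operates on high-dimensional vectors and uses Locality-Sensitive Hashing to choose which summaries to merge.

```python
# Sketch: sliding-window aggregation with additive group features.
# A group feature (N, LS, SS) summarises a chunk of points; chunk summaries
# merge in O(1), so the window slides without re-reading expired points.
from collections import deque

class GroupFeature:
    """Additive summary: count N, linear sum LS, squared sum SS."""
    def __init__(self, n=0, ls=0.0, ss=0.0):
        self.n, self.ls, self.ss = n, ls, ss

    def add(self, x):
        self.n += 1
        self.ls += x
        self.ss += x * x

    def merge(self, other):
        return GroupFeature(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

    @property
    def centroid(self):
        return self.ls / self.n

window = deque()  # one GroupFeature per chunk, oldest first
CHUNKS_PER_WINDOW = 4

def new_chunk(points):
    """Summarise an arriving chunk and expire the oldest one if needed."""
    gf = GroupFeature()
    for x in points:
        gf.add(x)
    window.append(gf)
    if len(window) > CHUNKS_PER_WINDOW:
        window.popleft()  # slide: drop the expired chunk's summary

for batch in ([1.0, 1.2], [0.9, 1.1], [5.0, 5.2], [4.8, 5.1], [1.0, 0.8]):
    new_chunk(batch)

total = GroupFeature()
for gf in window:
    total = total.merge(gf)
print(total.n, round(total.centroid, 2))  # aggregate over the current window
```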