8 research outputs found

    Ranked Queries in Index Data Structures

    A ranked query is a query which returns the top-ranking elements of a set, sorted by rank, where the rank corresponds to some preference function defined on the items of the set. This thesis investigates the problem of adding rank query capabilities to several index data structures on top of their existing functionality. First, we introduce the concept of rank-sensitive data structures, based on the existing concept of output-sensitive data structures. Rank-sensitive data structures are output-sensitive data structures which are additionally given a ranking of the items stored and, as a result of a query, return only the k best-ranking items satisfying the given query, sorted according to rank, where k is specified at query time. We explore several ways of adding rank-sensitivity to different data structures and the trade-offs this incurs. The second part of the work deals with the first efficient dynamic version of the Cartesian tree – a data structure intrinsically related to rank queries.
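
    The thesis builds on the link between Cartesian trees and rank queries. As a purely illustrative Python sketch (not the thesis's construction; all names are mine), the code below builds a max-Cartesian tree over an array of rank values in O(n) time and then answers a top-k query by best-first search over the tree, touching only O(k) nodes beyond the root.

```python
import heapq

def build_cartesian_tree(rank):
    """Builds a max-Cartesian tree over rank[] in O(n) using a stack.
    Returns (root, left, right); left/right hold child indices, -1 = none."""
    n = len(rank)
    left, right, stack = [-1] * n, [-1] * n, []
    for i in range(n):
        last = -1
        while stack and rank[stack[-1]] < rank[i]:
            last = stack.pop()
        left[i] = last
        if stack:
            right[stack[-1]] = i
        stack.append(i)
    return stack[0], left, right

def top_k(rank, root, left, right, k):
    """Reports the k best-ranking positions in rank order: repeatedly pop the
    best node seen so far and push its children (heap order = rank order)."""
    out, heap = [], [(-rank[root], root)]
    while heap and len(out) < k:
        _, v = heapq.heappop(heap)
        out.append(v)
        for c in (left[v], right[v]):
            if c != -1:
                heapq.heappush(heap, (-rank[c], c))
    return out
```

    For example, with r = [3, 1, 4, 1, 5], calling root, left, right = build_cartesian_tree(r) and then top_k(r, root, left, right, 2) returns [4, 2], the positions of the two highest ranks in rank order.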

    Implementation and analysis of a Top-K retrieval system for strings

    Given a text T which is the union of d documents of strings, D = {d1, d2, ..., dd}, the emphasis of this thesis is to provide a practical framework to retrieve the K most relevant documents for a given pattern P, which comes as a query. This cannot be done directly, as going through every occurrence of the query pattern may prove expensive if the number of documents the pattern occurs in is much larger than the number of documents (K) that we require. Some advanced query functionality is required, beyond simply listing the documents the pattern occurs in, because a defined notion of most relevant must be provided. Therefore, an index needs to be built beforehand on T so that the documents can be retrieved very quickly. Traditionally, inverted indexes have proven effective in retrieving the Top-K documents. However, inverted indexes have certain disadvantages, which can be overcome by using other data structures such as suffix trees and suffix arrays. A framework was originally provided by Muthukrishnan [29] that takes advantage of the number of relevant documents being smaller than the number of occurrences of the query pattern. He considered two metrics for relevance, frequency and proximity, and provided a framework that took O(n log n) space. Recently, Hon et al. [14] provided a framework that takes O(n) space and retrieves the Top-K documents with better query times, O(P + K log K), for arbitrary score functions. In this thesis we study the practicality of this index and provide added functionality, based on the index, to retrieve Top-K documents for specific cases such as phrase searching. We also provide functionality to output the K most relevant documents (according to PageRank) when two patterns are given as queries.
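
    To make the contrast with the index-based framework concrete, here is a toy Python baseline (illustrative names and a deliberately naive suffix-array construction, not the thesis's index): it locates every occurrence of the pattern and then aggregates per document, which is exactly the per-occurrence cost that the O(n)-space framework avoids.

```python
from collections import Counter
import bisect
import heapq

def build_index(docs):
    """Toy index: concatenate the documents with a separator, record which
    document each text position belongs to, and build a suffix array by plain
    sorting (quadratic toy construction, only suitable for small inputs)."""
    text = "\x00".join(docs) + "\x00"    # assumes '\x00' never occurs in a document
    doc_of, d = [], 0
    for ch in text:
        doc_of.append(d)
        if ch == "\x00":
            d += 1
    suffixes = sorted(range(len(text)), key=lambda i: text[i:])
    return text, suffixes, doc_of

def top_k_docs(index, pattern, k):
    """Baseline Top-K retrieval: find the suffix-array range of the pattern,
    count occurrences per document, and keep the K most frequent documents
    (frequency playing the role of the relevance score)."""
    text, suffixes, doc_of = index
    key = lambda i: text[i:i + len(pattern)]
    lo = bisect.bisect_left(suffixes, pattern, key=key)    # key= needs Python 3.10+
    hi = bisect.bisect_right(suffixes, pattern, key=key)
    freq = Counter(doc_of[suffixes[i]] for i in range(lo, hi))
    return heapq.nlargest(k, freq.items(), key=lambda kv: kv[1])
```

    The work per query here grows with the total number of occurrences of the pattern, whereas the framework studied in the thesis answers in O(P + K log K) time regardless of how often the pattern occurs.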

    Optimal Path-Decomposition of Tries

    In this thesis, we consider the path-decomposition representation of prefix trees (tries). We show that, given query probabilities for every word in the trie, the heavy-path strategy produces the optimal trie with respect to the number of node accesses. We show how to implement the heavy-path strategy in O(N) time for a trie containing n words with total length N. To prove this result, we give a complete characterization of the choices made by the optimal decomposition strategy. Using this characterization, we describe how to efficiently support dynamic operations on the path-decomposed trie while preserving optimality, in O(σ · |w|) time for alphabet size σ and word length |w|. We also give entropy-based bounds on the number of node accesses per query with respect to the query probabilities. Finally, we show theoretical and experimental results on the performance of heavy-path versus max-score, another popular path-decomposition strategy.
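
    As a small illustration of the heavy-path rule the thesis analyzes (a minimal sketch assuming each word carries its query probability; names are mine, and this is not the O(N)-time implementation from the thesis), the code below builds a trie, accumulates the probability mass below every node, and marks as heavy the child whose subtree carries the largest mass.

```python
class Node:
    __slots__ = ("children", "weight", "heavy")
    def __init__(self):
        self.children = {}    # character -> Node
        self.weight = 0.0     # total query probability of words in this subtree
        self.heavy = None     # character leading to the heavy child, if any

def build_trie(weighted_words):
    """weighted_words: iterable of (word, query probability) pairs."""
    root = Node()
    for word, p in weighted_words:
        node = root
        node.weight += p
        for c in word:
            node = node.children.setdefault(c, Node())
            node.weight += p
    return root

def mark_heavy_paths(node):
    """Heavy-path rule: at every node, follow the child whose subtree has the
    largest total probability; the other children start new paths."""
    if node.children:
        node.heavy = max(node.children, key=lambda c: node.children[c].weight)
        for child in node.children.values():
            mark_heavy_paths(child)
```

    In the path-decomposed representation each such path collapses into a single node, so a lookup pays one node access per path it crosses; always following the heaviest child keeps that number small for likely words.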

    Space-Efficient Data Structures for Collections of Textual Data

    This thesis focuses on the design of succinct and compressed data structures for collections of string-based data, specifically sequences of semi-structured documents in textual format, sets of strings, and sequences of strings. The study of such collections is motivated by a large number of applications both in theory and practice. For textual semi-structured data, we introduce the concept of the semi-index, a succinct construction that speeds up access to documents encoded with textual semi-structured formats, such as JSON and XML, by storing separately a compact description of their parse trees, hence avoiding the need to re-parse the documents every time they are read. For string dictionaries, we describe a data structure based on a path decomposition of the compacted trie built on the string set. The tree topology is encoded using succinct data structures, while the node labels are compressed using a simple dictionary-based scheme. We also describe a variant of the path-decomposed trie for scored string sets, where each string has a score. This data structure can efficiently support top-k completion queries, that is, given a string p and an integer k, return the k highest-scored strings among those prefixed by p. For sequences of strings, we introduce the problem of compressed indexed sequences of strings, that is, representing indexed sequences of strings in nearly optimal compressed space, both in the static and dynamic settings, while supporting random access, searching, and counting operations, both for exact matches and prefix search. We present a new data structure, the Wavelet Trie, that solves the problem by combining a Patricia trie with a wavelet tree. The Wavelet Trie improves on the state-of-the-art compressed data structures for sequences by supporting a dynamic alphabet and prefix queries. Finally, we discuss the practical implementation of the succinct primitives used throughout the thesis for the experiments. These primitives are implemented as part of a publicly available library, Succinct, using state-of-the-art algorithms along with some improvements.
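
    The top-k completion semantics can be illustrated with a plain pointer-based trie (a toy Python sketch with invented names, not the succinct path-decomposed structure described above): every node stores the best score in its subtree, so a best-first search from the prefix node can stop after emitting k strings.

```python
import heapq

class TNode:
    __slots__ = ("children", "score", "best")
    def __init__(self):
        self.children = {}           # character -> TNode
        self.score = None            # score of the string ending here, if any
        self.best = float("-inf")    # best score anywhere in this subtree

def insert(root, s, score):
    node, path = root, [root]
    for c in s:
        node = node.children.setdefault(c, TNode())
        path.append(node)
    node.score = score
    for n in path:                   # keep the subtree upper bounds current
        n.best = max(n.best, score)

def top_k_completions(root, prefix, k):
    """Best-first search below the prefix node; `best` upper-bounds every
    completion in a subtree, so strings pop out in decreasing score order."""
    node = root
    for c in prefix:
        node = node.children.get(c)
        if node is None:
            return []
    out, heap = [], [(-node.best, prefix, node)]
    while heap and len(out) < k:
        neg, s, n = heapq.heappop(heap)
        if n is None:                        # a completed string
            out.append((s, -neg))
            continue
        if n.score is not None:
            heapq.heappush(heap, (-n.score, s, None))
        for c, child in n.children.items():
            heapq.heappush(heap, (-child.best, s + c, child))
    return out
```

    The thesis obtains the same query semantics in compressed space by keeping the scores on the path-decomposed trie instead of on an explicit pointer-based one.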

    Algorithms and Data Structures for Geometric Intersection Query Problems

    University of Minnesota Ph.D. dissertation. September 2017. Major: Computer Science. Advisor: Ravi Janardan. 1 computer file (PDF); xi, 126 pages. The focus of this thesis is geometric intersection queries (GIQ), a topic that has been very well studied by the computational geometry community and the database community. In a GIQ problem, the user is not interested in the entire input geometric dataset, but only in a small subset of it, and requests an informative summary of that small subset of data. Formally, the goal is to preprocess a set A of n geometric objects into a data structure so that, given a query geometric object q, a certain aggregation function can be applied efficiently to the objects of A intersecting q. The classical aggregation functions studied in the literature are reporting or counting the objects of A intersecting q. In many applications, the same set A is queried several times, in which case one would like to answer a query faster by preprocessing A into a data structure. The goal is to organize the data into a data structure which occupies a small amount of space and yet responds to any user query in real time. In this thesis the study of GIQ problems is conducted from the point of view of a computational geometry researcher. Given a model of computation and a GIQ problem, what are the best possible upper bounds (resp., lower bounds) on the space and the query time that can be achieved by a data structure? Also, what is the relative hardness of various GIQ problems and aggregation functions? Here relative hardness means that, given two GIQ problems A and B (or two aggregation functions f(A, q) and g(A, q)), which of them can be answered faster by a computer (assuming data structures for both of them occupy asymptotically the same amount of space)? This thesis presents results which increase our understanding of the above questions. For many GIQ problems, data structures with optimal (or near-optimal) space and query time bounds have been achieved. The geometric settings studied are primarily orthogonal range searching, where the input is a set of points and the query is an axis-aligned rectangle, and the dual setting of rectangle stabbing, where the input is a set of axis-aligned rectangles and the query is a point. The aggregation functions studied are primarily reporting, top-k, and approximate counting. Most of the data structures are built for the internal-memory model (word-RAM or pointer-machine model), but in some settings they are generic enough to be efficient in the I/O model as well.
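
    For readers unfamiliar with the orthogonal range searching setting, the sketch below is a textbook 2-d tree for rectangle reporting queries (purely illustrative; it is not one of the optimal structures developed in the thesis, and its query bound is only O(sqrt(n) + k)).

```python
def build(points, depth=0):
    """Builds a simple 2-d tree over (x, y) points, splitting on x and y
    alternately at the median."""
    if not points:
        return None
    axis = depth % 2
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"pt": points[mid], "axis": axis,
            "lo": build(points[:mid], depth + 1),
            "hi": build(points[mid + 1:], depth + 1)}

def report(node, rect, out):
    """Appends to `out` every point of the subtree inside the axis-aligned
    rectangle rect = (x1, x2, y1, y2), pruning subtrees that cannot intersect it."""
    if node is None:
        return
    x1, x2, y1, y2 = rect
    px, py = node["pt"]
    if x1 <= px <= x2 and y1 <= py <= y2:
        out.append(node["pt"])
    split = node["pt"][node["axis"]]
    low, high = (x1, x2) if node["axis"] == 0 else (y1, y2)
    if low <= split:                 # left/bottom subtree may still contain hits
        report(node["lo"], rect, out)
    if split <= high:                # right/top subtree may still contain hits
        report(node["hi"], rect, out)
```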

    Black women, gender & families

    Abstract. Output-sensitive data structures result from preprocessing n items and are capable of reporting the items satisfying an on-line query in O(t(n) + ℓ) time, where t(n) is the cost of traversing the structure and ℓ ≤ n is the number of reported items satisfying the query. In this paper we focus on rank-sensitive data structures, which are additionally given a ranking of the n items, so that just the top k best-ranking items should be reported at query time, sorted in rank order, at a cost of O(t(n) + k) time. Note that k is part of the query as a parameter under the control of the user (as opposed to ℓ, which is query-dependent). We explore the problem of adding rank-sensitivity to data structures such as suffix trees or range trees, where the ℓ items satisfying the query form O(polylog(n)) intervals of consecutive entries, from which we choose the top k best-ranking ones. Letting s(n) be the number of items (including their copies) stored in the original data structures, we increase the space by an additional term of O(s(n) lg^ε n) memory words of space, each of O(lg n) bits, for any positive constant ε < 1. We allow for changing the ranking on the fly during the lifetime of the data structures, with ranking values in 0 ... O(n). In this case, query time becomes O(t(n) + k) plus O(lg n / lg lg n) per interval; each change in the ranking and each insertion/deletion of an item takes O(lg n) time; the additional term in space occupancy increases to O(s(n) lg n / lg lg n).
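
    A minimal Python sketch of the selection step described above (illustrative names; the per-interval maxima are found here by linear scans, whereas the paper maintains them with the structures it describes): given the O(polylog(n)) intervals of consecutive entries produced by the underlying query, it reports the k best-ranking positions in rank order by repeatedly popping the overall maximum and splitting its interval around it.

```python
import heapq

def top_k_from_intervals(rank, intervals, k):
    """rank[i] is the rank of item i; intervals is a list of half-open index
    ranges (l, r). Returns the k positions with the highest rank, best first."""
    def entry(l, r):
        if l >= r:
            return None
        m = max(range(l, r), key=lambda i: rank[i])    # toy range-maximum query
        return (-rank[m], m, l, r)

    heap = [e for l, r in intervals if (e := entry(l, r)) is not None]
    heapq.heapify(heap)
    out = []
    while heap and len(out) < k:
        _, m, l, r = heapq.heappop(heap)
        out.append(m)
        for e in (entry(l, m), entry(m + 1, r)):       # split around the reported item
            if e is not None:
                heapq.heappush(heap, e)
    return out
```

    Only O(k) intervals beyond the initial ones are ever created, so the cost is dominated by the range-maximum queries, which corresponds to the "per interval" terms in the bounds stated above.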