
    Prospects and limitations of full-text index structures in genome analysis

    The combination of incessant advances in sequencing technology, which produce large amounts of data, and innovative bioinformatics approaches designed to cope with this data flood has led to interesting new results in the life sciences. Given the magnitude of sequence data to be processed, many bioinformatics tools rely on efficient solutions to a variety of complex string problems. These solutions include fast heuristic algorithms and advanced data structures, generally referred to as index structures. Although the importance of index structures is generally known to the bioinformatics community, the design and potency of these data structures, as well as their properties and limitations, are less well understood. Moreover, the last decade has seen a boom in the number of variant index structures featuring complex and diverse memory-time trade-offs. This article presents a comprehensive state-of-the-art overview of the most popular index structures and their recently developed variants. Their features, interrelationships, and the trade-offs they impose, as well as their practical limitations, are explained and compared.

    Succinct Indices for Range Queries with applications to Orthogonal Range Maxima

    We consider the problem of preprocessing $N$ points in 2D, each endowed with a priority, to answer the following queries: given an axis-parallel rectangle, determine the point with the largest priority in the rectangle. Using the ideas of the \emph{effective entropy} of range maxima queries and \emph{succinct indices} for range maxima queries, we obtain a structure that uses $O(N)$ words and answers the above query in $O(\log N \log \log N)$ time. This is a direct improvement of Chazelle's result from FOCS 1985 for this problem: Chazelle required $O(N/\epsilon)$ words to answer queries in $O((\log N)^{1+\epsilon})$ time for any constant $\epsilon > 0$. Comment: To appear in ICALP 201
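
To make the query in the abstract above concrete, here is a minimal brute-force sketch in Python. It is only a linear-scan baseline that states what a range-maximum query returns, not the succinct structure the paper describes; the names `Point` and `range_max` are hypothetical and chosen for illustration.

```python
from typing import NamedTuple, Optional, Sequence


class Point(NamedTuple):
    x: float
    y: float
    priority: float


def range_max(points: Sequence[Point],
              x1: float, x2: float,
              y1: float, y2: float) -> Optional[Point]:
    """Return the highest-priority point inside [x1, x2] x [y1, y2].

    Brute force: O(N) time per query and no preprocessing. Returns None
    when the rectangle contains no point.
    """
    best: Optional[Point] = None
    for p in points:
        inside = x1 <= p.x <= x2 and y1 <= p.y <= y2
        if inside and (best is None or p.priority > best.priority):
            best = p
    return best
```

A baseline like this is mainly useful as a correctness oracle when testing a faster index against random rectangles.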

    Space-efficient data structures for string searching and retrieval

    Let $D = \{d_1, d_2, \ldots\}$ be a collection of string documents of $n$ characters in total, drawn from an alphabet $\Sigma = [\sigma] = \{1, 2, 3, \ldots, \sigma\}$. The top-$k$ document retrieval problem is to maintain $D$ as a data structure such that, whenever a query $Q = (P, k)$ arrives, we can report (the identifiers of) the $k$ documents that are most relevant to the pattern $P$ (of $p$ characters). The relevance of a document $d_r$ with respect to a pattern $P$ is captured by $score(P, d_r)$, which can be any function of the set of locations where $P$ occurs in $d_r$. Finding the documents most relevant to a user query is the central task of any web search engine. In the case of web data, the documents can be demarcated along word boundaries. All search engines use an inverted index as the backbone data structure. For each word occurring in the document collection, the inverted index stores the list of documents in which it appears. It is often augmented with relevance scores and/or positional information. However, when the data consists of strings (e.g., in bioinformatics or Asian-language texts), there are no word boundaries, and the queries are arbitrary substrings rather than valid words. In this case, string data structures have to be used, and the central approach is to use a suffix tree (or string B-tree) with appropriate augmenting data structures. The work by Hon, Shah and Vitter [FOCS 2009] and Navarro and Nekrich [SODA 2012] resulted in a linear-space data structure with optimal $O(p+k)$ query time for this problem. This was based on a geometric interpretation of the query. We extend this central problem in two areas that are important for massive data sets. First, we consider an external-memory, disk-based index, for which we give near-optimal results. Next, we consider compression aspects of the data structure, reducing the storage space. This is a central goal of the active research field of succinct data structures. We present several results that improve upon previous ones and are currently the best known space-time trade-offs in this area.
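
The top-k query defined in the abstract above can be sketched as a brute-force scan in Python. This is an illustrative baseline only, not the suffix-tree-based structure the thesis builds on; `occurrences` and `top_k` are hypothetical names, and the default score (number of occurrences) is just one assumed instance of "any function of the set of locations".

```python
import heapq
from typing import Callable, List, Sequence


def occurrences(pattern: str, doc: str) -> List[int]:
    """Return all start positions of pattern in doc (overlaps allowed)."""
    positions = []
    start = doc.find(pattern)
    while start != -1:
        positions.append(start)
        start = doc.find(pattern, start + 1)
    return positions


def top_k(docs: Sequence[str], pattern: str, k: int,
          score: Callable[[List[int]], float] = len) -> List[int]:
    """Return identifiers of the k documents with the highest score.

    The score is computed from the list of locations where the pattern
    occurs; documents without any occurrence are never reported. Ties
    are broken in favour of the smaller document identifier.
    """
    scored = []
    for doc_id, doc in enumerate(docs):
        locations = occurrences(pattern, doc)
        if locations:
            scored.append((score(locations), doc_id))
    best = heapq.nlargest(k, scored, key=lambda t: (t[0], -t[1]))
    return [doc_id for _, doc_id in best]
```

Because the pattern is matched as an arbitrary substring, this sketch works without word boundaries, at the cost of scanning every document on every query.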

    Random Access in Persistent Strings and Segment Selection

    We consider compact representations of collections of similar strings that support random access queries. The collection of strings is given by a rooted tree where edges are labeled by an edit operation (inserting, deleting, or replacing a character) and a node represents the string obtained by applying the sequence of edit operations on the path from the root to the node. The goal is to compactly represent the entire collection while supporting fast random access to any part of a string in the collection. This problem captures natural scenarios such as representing the past history of an edited document or representing highly repetitive collections. Given a tree with $n$ nodes, we show how to represent the corresponding collection in $O(n)$ space and $O(\log n / \log \log n)$ query time. This improves the previous time-space trade-offs for the problem. Additionally, we show a lower bound proving that the query time is optimal for any solution using near-linear space. To achieve our bounds for random access in persistent strings, we show how to reduce the problem to the following natural geometric selection problem on line segments. Consider a set of horizontal line segments in the plane. Given parameters $i$ and $j$, a segment selection query returns the $j$th smallest segment (the segment with the $j$th smallest $y$-coordinate) among the segments crossing the vertical line through $x$-coordinate $i$. The segment selection problem is to preprocess a set of horizontal line segments into a compact data structure that supports fast segment selection queries. We present a solution that uses $O(n)$ space and supports segment selection queries in $O(\log n / \log \log n)$ time, where $n$ is the number of segments. Furthermore, we prove that this query time is also optimal for any solution using near-linear space. Comment: Extended abstract at ISAAC 202
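
The segment selection query described in the abstract above can be stated as a short brute-force Python sketch. It filters and sorts on every query, so it only pins down the semantics and says nothing about the paper's data structure; the tuple layout `(x_left, x_right, y)` and the name `segment_select` are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Segment = Tuple[float, float, float]  # (x_left, x_right, y), assumed layout


def segment_select(segments: List[Segment], i: float, j: int) -> Optional[Segment]:
    """Return the segment with the j-th smallest y (1-based) among the
    horizontal segments crossing the vertical line x = i.

    Returns None if fewer than j segments cross that line.
    """
    crossing = sorted(
        (s for s in segments if s[0] <= i <= s[1]),
        key=lambda s: s[2],
    )
    if 1 <= j <= len(crossing):
        return crossing[j - 1]
    return None
```

Each query here costs O(n log n) time, which is the gap a preprocessed structure is meant to close.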

    Succinct Color Searching in One Dimension

    In this paper we study succinct data structures for one-dimensional color reporting and color counting problems. We are given a set of $n$ points with integer coordinates in the range $[1,m]$, and every point is assigned a color from the set $\{1, \ldots, \sigma\}$. A color reporting query asks for the list of distinct colors that occur in a query interval $[a,b]$, and a color counting query asks for the number of distinct colors in $[a,b]$. We describe a succinct data structure that answers approximate color counting queries in $O(1)$ time and uses $\mathcal{B}(n,m) + O(n) + o(\mathcal{B}(n,m))$ bits, where $\mathcal{B}(n,m)$ is the minimum number of bits required to represent an arbitrary set of size $n$ from a universe of $m$ elements. Thus we show, somewhat counterintuitively, that it is not necessary to store the colors of points in order to answer approximate color counting queries. In the special case when points are in the rank space (i.e., when $n = m$), our data structure needs only $O(n)$ bits. We also show that $\Omega(n)$ bits are necessary in that case. Then we turn to succinct data structures for color reporting. We describe a data structure that uses $\mathcal{B}(n,m) + nH_d(S) + o(\mathcal{B}(n,m)) + o(n \lg \sigma)$ bits and answers queries in $O(k+1)$ time, where $k$ is the number of colors in the answer and $nH_d(S)$ (with $d = \log_\sigma n$) is the $d$-th order empirical entropy of the color sequence. Finally, we consider succinct color reporting under restricted updates. Our dynamic data structure uses $nH_d(S) + o(n \lg \sigma)$ bits and supports queries in $O(k+1)$ time.
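
As a reference point for the two query types defined in the abstract above, here is a minimal brute-force Python sketch. It stores every color explicitly and scans all points, so it is the opposite of succinct and gives exact rather than approximate counts; `color_report` and `color_count` are hypothetical names used only for illustration.

```python
from typing import List, Sequence, Tuple

ColoredPoint = Tuple[int, int]  # (coordinate, color), assumed layout


def color_report(points: Sequence[ColoredPoint], a: int, b: int) -> List[int]:
    """Return the distinct colors of points with coordinate in [a, b], sorted."""
    return sorted({color for x, color in points if a <= x <= b})


def color_count(points: Sequence[ColoredPoint], a: int, b: int) -> int:
    """Return the exact number of distinct colors in [a, b]."""
    return len(color_report(points, a, b))
```

Both functions take O(n) time per query; the paper's structures answer reporting in time proportional to the output size and approximate counting in constant time.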

    Space-Efficient Data Structures in the Word-RAM and Bitprobe Models

    This thesis studies data structures in the word-RAM and bitprobe models, with an emphasis on space efficiency. In the word-RAM model of computation, the space cost of a data structure is measured in terms of the number of $w$-bit words stored in memory, and the cost of answering a query is measured in terms of the number of read, write, and arithmetic operations that must be performed. In the bitprobe model, like the word-RAM model, the space cost is measured in terms of the number of bits stored in memory, but the query cost is measured solely in terms of the number of bit accesses, or probes, that are performed. First, we examine the problem of succinctly representing a partially ordered set, or poset, in the word-RAM model with word size $\Theta(\lg n)$ bits. A succinct representation of a combinatorial object is one that occupies space matching the information-theoretic lower bound to within lower-order terms. We show how to represent a poset on $n$ vertices using a data structure that occupies $n^2/4 + o(n^2)$ bits and can answer precedence (i.e., less-than) queries in constant time. Since the transitive closure of a directed acyclic graph is a poset, this implies that we can support reachability queries on an arbitrary directed graph in the same space bound. As far as we are aware, this is the first representation of an arbitrary directed graph that supports reachability queries in constant time and stores fewer than $\binom{n}{2}$ bits. We also consider several additional query operations. Second, we examine the problem of supporting range queries on strings of $n$ characters (or, equivalently, arrays of $n$ elements) in the word-RAM model with word size $\Theta(\lg n)$ bits. We focus on the specific problem of answering range majority queries: given a range, report the character that is the majority among those in the range, if one exists. We show that these queries can be supported in constant time using a linear-space (in words) data structure. We generalize this result in several directions, considering various frequency thresholds, geometric variants of the problem, and dynamism. These results are in stark contrast to recent work on the similar range mode problem, in which the query operation asks for the mode (i.e., most frequent) character in a given range. The current best linear-space data structures for the range mode problem take $\tilde{O}(n^{1/2})$ time per query. Third, we examine the deterministic membership (or dictionary) problem in the bitprobe model. This problem asks us to store a set of $n$ elements drawn from a universe $[1,u]$ such that membership queries can always be answered in $t$ bit probes. We present several new fully explicit results for this problem, in particular for the case when $n = 2$, answering an open problem posed by Radhakrishnan, Shah, and Shannigrahi [ESA 2010]. We also present a general strategy for the membership problem that can be used to solve many related fundamental problems, such as rank, counting, and emptiness queries. Finally, we conclude with a list of open problems and avenues for future work.
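
The range majority query described in the abstract above can be illustrated with a small Python sketch. It uses the Boyer-Moore voting scan followed by a verification pass, a swapped-in textbook technique that takes time linear in the range length, not the constant-time structure from the thesis. The name `range_majority` and the inclusive index convention are assumptions, and "majority" is taken to mean strictly more than half of the range.

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def range_majority(seq: Sequence[T], lo: int, hi: int) -> Optional[T]:
    """Return the element occurring in more than half of seq[lo..hi]
    (both ends inclusive), or None if no such element exists."""
    window = seq[lo:hi + 1]
    candidate, votes = None, 0
    # Voting pass: a true majority element always survives as candidate.
    for x in window:
        if votes == 0:
            candidate, votes = x, 1
        elif x == candidate:
            votes += 1
        else:
            votes -= 1
    # Verification pass: the surviving candidate need not be a majority.
    if len(window) > 0 and 2 * window.count(candidate) > len(window):
        return candidate
    return None
```

The verification step matters: on the range "ab" the voting pass still leaves a candidate, but neither character occupies more than half of the range, so the query reports that no majority exists.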