11 research outputs found

    Augmenting Suffix Trees, with Applications

    On Optimal Top-K String Retrieval

    Let $\mathcal{D} = \{d_1, d_2, d_3, \ldots, d_D\}$ be a given set of $D$ (string) documents of total length $n$. The top-$k$ document retrieval problem is to index $\mathcal{D}$ such that when a pattern $P$ of length $p$ and a parameter $k$ come as a query, the index returns the $k$ documents most relevant to the pattern $P$. Hon et al. \cite{HSV09} gave the first linear-space framework to solve this problem in $O(p + k\log k)$ time. This was improved by Navarro and Nekrich \cite{NN12} to $O(p + k)$. These results are powerful enough to support arbitrary relevance functions such as frequency, proximity, PageRank, etc. In many applications such as desktop or email search, the data resides on disk, and hence disk-bound indexes are needed. Despite continued progress on this problem in its theoretical, practical, and compression aspects, non-trivial bounds in the external memory model have so far been elusive. Internal memory (or RAM) solutions decompose the problem into $O(p)$ subproblems and thus incur an additive factor of $O(p)$. In external memory, these approaches lead to $O(p)$ I/Os instead of the optimal $O(p/B)$ I/O term, where $B$ is the block size. We re-interpret the problem independently of $p$, as interval stabbing with priority over a tree-shaped structure. This leads to a linear-space index in external memory supporting top-$k$ queries (with unsorted outputs) in near-optimal $O(p/B + \log_B n + \log^{(h)} n + k/B)$ I/Os for any constant $h$, where $\log^{(1)} n = \log n$ and $\log^{(h)} n = \log(\log^{(h-1)} n)$. We then obtain an $O(n\log^* n)$-space index with optimal $O(p/B + \log_B n + k/B)$ I/Os. Comment: 3 figures
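    The abstract above defines the iterated logarithm recursively (one logarithm for h = 1, then a logarithm of the previous level) and also uses log* n. As a rough illustration of how quickly these terms shrink, here is a minimal Python sketch. The function names `iterated_log` and `log_star` are hypothetical, and base-2 logarithms are an assumption, since the abstract does not fix a base.

```python
import math


def iterated_log(n: float, h: int) -> float:
    """Apply log2 to n exactly h times (h = 1 gives plain log2 n).

    Assumption: base-2 logarithms; the abstract does not specify a base.
    """
    value = n
    for _ in range(h):
        value = math.log2(value)
    return value


def log_star(n: float) -> int:
    """Count how many times log2 must be applied before the value is <= 1."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count


if __name__ == "__main__":
    n = 65536  # 2**16, chosen so every level is an exact power of two
    for h in (1, 2, 3):
        print(h, iterated_log(n, h))
    print("log*:", log_star(n))
```

    For n = 65536 the successive levels are 16, 4, and 2, which gives some intuition for why the extra additive term in the I/O bound is small for any constant h.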

    Implementation and analysis of a Top-K retrieval system for strings

    Given a text T that is the union of d string documents, D = {d1, d2, ..., dd}, the emphasis of this thesis is to provide a practical framework to retrieve the K most relevant documents for a given pattern P, which comes as a query. This cannot be done directly, as going through every occurrence of the query pattern may prove expensive if the number of documents in which the pattern occurs is much larger than the number of documents (K) required. Compared with merely listing the documents in which the pattern occurs, more advanced query functionality is needed, because a defined notion of "most relevant" must be provided. Therefore, an index needs to be built beforehand on T so that the documents can be retrieved very quickly. Traditionally, inverted indexes have proven effective in retrieving the top-K documents. However, inverted indexes have certain disadvantages, which can be overcome by using other data structures such as suffix trees and suffix arrays. A framework was originally provided by Muthukrishnan [29] that takes advantage of the number of relevant documents being smaller than the number of occurrences of the query pattern. He considered two metrics for relevance, frequency and proximity, and provided a framework that took O(n log n) space. Recently, Hon et al. [14] provided a framework that takes O(n) space to retrieve the top-K documents with improved query time, O(P + K log K), for arbitrary score functions. In this thesis we study the practicality of this index and provide added functionality, based on the index, to retrieve the top-K documents for specific cases such as phrase searching. We also provide functionality to output the K most relevant documents (according to PageRank) when two patterns are given as queries.

    On the ACB compressor

    Context-based compression methods are the most powerful approaches to squeeze arbitrary textual data. They offer a good predictive model for the subsequent data based on the data already seen, without assuming any probability distribution for the input source. In this thesis we analyze the adaptive ACB method (Buyanovsky, 94), which is mostly unexplored in the literature, although preliminary results showed compression ratios comparable (or even superior) to those of the best known data compression utilities. The novel feature of ACB consists of deploying both the previous context and the subsequent content to find a succinct encoding for the latter. We perform a large set of experiments to study the experimental behavior of ACB and to compare it with known compressors, thus devising variations of the basic ACB scheme that appear promising for future developments.

    Space-efficient data structures for string searching and retrieval

    Let D = {d_1, d_2, ...} be a collection of string documents of n characters in total, drawn from an alphabet Sigma = [sigma] = {1, 2, 3, ..., sigma}. The top-k document retrieval problem is to maintain D as a data structure such that, whenever a query Q = (P, k) comes, we can report (the identifiers of) the k documents that are most relevant to the pattern P (of p characters). The relevance of a document d_r with respect to a pattern P is captured by score(P, d_r), which can be any function of the set of locations where P occurs in d_r. Finding the documents most relevant to the user query is the central task of any web search engine. In the case of web data, the documents can be demarcated along word boundaries. All search engines use an inverted index as the backbone data structure. For each word occurring in the document collection, the inverted index stores the list of documents in which it appears. It is often augmented with relevance scores and/or positional information. However, when the data consists of strings (e.g., in bioinformatics or Asian-language texts), there are no word demarcation boundaries, and the queries are arbitrary substrings rather than valid words. In this case, string data structures have to be used, and the central approach is to use a suffix tree (or string B-tree) with appropriate augmenting data structures. The work by Hon, Shah and Vitter [FOCS 2009] and by Navarro and Nekrich [SODA 2012] resulted in a linear-space data structure with optimal O(p+k) query time for this problem. This was based on a geometric interpretation of the query. We extend this central problem in two areas important for massive data sets. First, we consider an external-memory, disk-based index, for which we give near-optimal results. Next, we consider compression aspects of the data structure, reducing the storage space. This is a central goal of the active research field of succinct data structures. We present several results that improve upon previous results and are currently the best known space-time trade-offs in this area.
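    To make the problem statement in the abstract above concrete, the following is a minimal brute-force sketch of a top-k query in Python. It is only a baseline that scans every document, not the indexed suffix-tree solution the abstract refers to. The function names are hypothetical, term frequency is chosen as one example of score(P, d_r), document identifiers are assumed to be 0-based list positions, and ties are broken by smaller identifier.

```python
import heapq
from typing import List


def count_occurrences(pattern: str, text: str) -> int:
    """Count occurrences of pattern in text, including overlapping ones."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def top_k_documents(documents: List[str], pattern: str, k: int) -> List[int]:
    """Return ids of the k documents with the highest frequency score.

    Assumptions (not from the source): score = number of occurrences,
    ids are list positions, ties go to the smaller id, and documents
    that do not contain the pattern are never reported.
    """
    scored = []
    for doc_id, doc in enumerate(documents):
        score = count_occurrences(pattern, doc)
        if score > 0:
            scored.append((score, doc_id))
    # Highest score first; for equal scores prefer the smaller id.
    best = heapq.nlargest(k, scored, key=lambda item: (item[0], -item[1]))
    return [doc_id for _, doc_id in best]


if __name__ == "__main__":
    docs = ["abracadabra", "cadabra", "banana", "xyz"]
    print(top_k_documents(docs, "a", 2))
    print(top_k_documents(docs, "bra", 5))
```

    This scan costs time proportional to the total length n per query, which is exactly what an index with O(p+k) query time is meant to avoid.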

    String Searching with Ranking Constraints and Uncertainty

    Strings play an important role in many areas of computer science. Searching for a pattern in a string or string collection is one of the most classic problems. Different variations of this problem, such as document retrieval, ranked document retrieval, and dictionary matching, have been well studied. The enormous growth of the internet, large genomic projects, sensor networks, and digital libraries necessitates not just efficient algorithms and data structures for general string indexing, but also indexes for texts with fuzzy information and support for queries with different constraints. This dissertation addresses some of these problems and proposes indexing solutions. One such variation is the document retrieval query with included and excluded (forbidden) patterns, where the objective is to retrieve all the relevant documents that contain the included patterns and do not contain the excluded patterns. We continue the previous work on this problem and propose a more efficient solution. We conjecture that any significant improvement over these results is highly unlikely. We also consider the scenario in which the query consists of more than two patterns. The forbidden pattern problem suffers from the drawback that linear-space (in words) solutions are unlikely to yield better than O(sqrt(n/occ)) per-document reporting time, where n is the total length of the documents and occ is the number of output documents. Continuing along this path, we introduce a new variation, namely document retrieval with forbidden extension queries, where the forbidden pattern is an extension of the included pattern. We also address the more general top-k version of the problem, which retrieves the top k documents, where the ranking is based on the PageRank relevance metric. This problem is motivated by search applications. It is also of theoretical interest, as we show that the hardness of the forbidden pattern problem is alleviated in this problem. We achieve linear space and optimal query time for this variation. We also propose succinct indexes for both these problems. Position-restricted pattern matching considers the scenario where only part of the text is searched. We propose a succinct index for this problem with efficient query time. An important application of this problem stems from searching in genomic sequences, where only part of the gene sequence is searched for interesting patterns. The problem of computing discriminating (resp. generic) words is to report all minimal (resp. maximal) extensions of a query pattern that are contained in at most (resp. at least) a given number of documents. These problems are motivated by applications in computational biology, text mining, and automated text classification. We propose succinct indexes for these problems. Strings with uncertainty and fuzzy information play an important role in an increasing number of applications. We propose a general framework for indexing uncertain strings such that a deterministic query string can be searched efficiently. String matching becomes a probabilistic event when a string contains uncertainty, i.e., each position of the string can have different probable characters, each with an associated probability of occurrence. Such uncertain strings are prevalent in various applications such as biological sequence data, event monitoring, and automatic ECG annotation. We consider two basic problems of string searching, namely substring searching and string listing. We formulate these well-known problems for the uncertain-string paradigm and propose exact and approximate solutions for them. We also discuss a constrained variation of orthogonal range searching. Given a set of points, the task of orthogonal range searching is to build a data structure such that all the points inside an orthogonal query region can be reported. We introduce a new variation, namely shared constraint range searching, which naturally arises in constrained pattern matching applications. Shared constraint range searching is a special four-sided range reporting problem in which two constraints are shared, effectively reducing the number of independent constraints. For this problem, we propose a linear-space index that matches the best known bound for the three-dimensional dominance reporting problem. We extend our data structure to the external memory model.
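    The uncertain-string model mentioned in the abstract above (each position carries several candidate characters with probabilities) can be illustrated with a small brute-force sketch in Python. This is not the dissertation's index. It assumes, purely for illustration, that positions are independent so that the probability of a match at a given offset is the product of the per-position probabilities. The names `match_probability` and `occurrences_above` and the list-of-dicts representation are hypothetical.

```python
from typing import Dict, List

# Assumed representation: one dict per position, mapping character -> probability.
UncertainString = List[Dict[str, float]]


def match_probability(uncertain: UncertainString, pattern: str, start: int) -> float:
    """Probability that pattern occurs at offset start.

    Assumption: positions are independent, so probabilities multiply.
    """
    if start < 0 or start + len(pattern) > len(uncertain):
        return 0.0
    prob = 1.0
    for offset, ch in enumerate(pattern):
        prob *= uncertain[start + offset].get(ch, 0.0)
    return prob


def occurrences_above(uncertain: UncertainString, pattern: str, threshold: float) -> List[int]:
    """List offsets where the match probability is at least threshold."""
    last_start = len(uncertain) - len(pattern)
    return [
        i
        for i in range(last_start + 1)
        if match_probability(uncertain, pattern, i) >= threshold
    ]


if __name__ == "__main__":
    s = [{"a": 0.5, "b": 0.5}, {"a": 1.0}, {"a": 0.25, "b": 0.75}]
    print(match_probability(s, "aa", 0))
    print(occurrences_above(s, "aa", 0.3))
```

    A threshold query of this kind is one simple way to turn probabilistic matching into a reporting problem; the scan above takes time proportional to the string length times the pattern length per query.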

    Efficient Indexing for Structured and Unstructured Data

    The collection of digital data is growing at an exponential rate. Data originates from a wide range of sources such as text feeds, biological sequencers, internet traffic over routers, sensors, and many others. To mine intelligent information from these sources, users have to query the data. Indexing techniques aim to reduce the query time by preprocessing the data. The diversity of data sources in the real world makes it imperative to develop application-specific indexing solutions based on the data to be queried. Data can be structured, i.e., relational tables, or unstructured, i.e., free text. Moreover, an increasing number of applications need to seamlessly analyze both kinds of data, making data integration a central issue. Integrating text with structured data needs to account for missing values, errors in the data, etc. Probabilistic models have been proposed recently for this purpose. These models are also useful for applications where uncertainty is inherent in the data, e.g., sensor networks. This dissertation aims to propose efficient indexing solutions for several problems that lie at the intersection of databases and information retrieval, such as joining ranked inputs, full-text document searching, etc. Other well-known problems of ranked retrieval and pattern matching are also studied under probabilistic settings. For each problem, the worst-case theoretical bounds of the proposed solutions are established and/or their practicality is demonstrated by thorough experimentation.