On Optimal Top-K String Retrieval
Let $\mathcal{D} = \{d_1, d_2, d_3, \ldots, d_D\}$ be a given set of $D$
(string) documents of total length $n$. The top-$k$ document retrieval problem
is to index $\mathcal{D}$ such that when a pattern $P$ of length $p$ and a
parameter $k$ come as a query, the index returns the $k$ most relevant
documents to the pattern $P$. Hon et al. \cite{HSV09} gave the first linear
space framework to solve this problem in $O(p + k \log k)$ time. This was
improved by Navarro and Nekrich \cite{NN12} to $O(p + k)$. These results are
powerful enough to support arbitrary relevance functions like frequency,
proximity, PageRank, etc. In many applications like desktop or email search,
the data resides on disk and hence disk-bound indexes are needed. Despite
continued progress on this problem in terms of theoretical, practical and
compression aspects, any non-trivial bounds in the external memory model have so
far been elusive. The internal memory (or RAM) solution to this problem decomposes
the problem into $O(p)$ subproblems and thus incurs an additive factor of
$O(p)$. In external memory, these approaches lead to $O(p)$ I/Os instead
of the optimal $O(p/B)$ I/O term, where $B$ is the block size. We re-interpret the
problem independent of $p$, as interval stabbing with priority over a tree-shaped
structure. This leads us to a linear space index in external memory supporting
top-$k$ queries (with unsorted outputs) in near optimal
$O(p/B + \log_B n + \log^{(h)} n + k/B)$ I/Os for any constant $h$ (where
$\log^{(1)} n = \log n$ and $\log^{(h)} n = \log(\log^{(h-1)} n)$). Then we get
an $O(n \log^* n)$ space index with optimal $O(p/B + \log_B n + k/B)$ I/Os.
Comment: 3 figures
Implementation and analysis of a Top-K retrieval system for strings
Given a text T which is the union of d string documents, D = {d1, d2, ..., dd}, the emphasis of this thesis is to provide a practical framework to retrieve the K most relevant documents for a given pattern P, which comes as a query. This cannot be done directly, as going through every occurrence of the query pattern may prove expensive if the number of documents in which the pattern occurs is much larger than the number of documents (K) that we require. More advanced query functionality is required than simply listing the documents in which the pattern occurs, because a defined notion of "most relevant" must be provided. Therefore, an index needs to be built beforehand on T so that the documents can be retrieved very quickly. Traditionally, inverted indexes have proven effective in retrieving the Top-K documents. However, inverted indexes have certain disadvantages, which can be overcome by using other data structures such as suffix trees and suffix arrays. A framework was originally provided by Muthukrishnan [29] that takes advantage of the number of relevant documents being smaller than the number of occurrences of the query pattern. He considered two metrics for relevance, frequency and proximity, and provided a framework that took O(n log n) space. Recently, Hon et al. [14] provided a framework that takes O(n) space to retrieve the Top-K documents with improved query time, O(P + K log K), for arbitrary score functions. In this thesis we study the practicality of this index and provide added functionality, based on the index, to retrieve Top-K documents for specific cases like phrase searching. We also provide functionality to output the K most relevant documents (according to PageRank) when two patterns are given as queries.
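To make the problem statement in the abstract above concrete, here is a minimal Python sketch of the naive baseline it argues against: scan every document, count occurrences of the pattern, and keep the K highest counts. It is not the index studied in the thesis; the function names `count_occurrences` and `top_k_by_frequency` are hypothetical, and frequency is assumed as the relevance score.

```python
import heapq


def count_occurrences(text, pattern):
    """Count (possibly overlapping) occurrences of pattern in text."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Advance by one character so overlapping matches are counted.
        start = text.find(pattern, start + 1)
    return count


def top_k_by_frequency(documents, pattern, k):
    """Return up to k (doc_id, count) pairs, highest count first.

    Naive baseline: every document is scanned for every query, so the
    cost grows with the total text length rather than with k.
    Ties are broken in favour of the smaller document id.
    """
    scores = []
    for doc_id, text in enumerate(documents):
        count = count_occurrences(text, pattern)
        if count > 0:
            scores.append((doc_id, count))
    return heapq.nlargest(k, scores, key=lambda s: (s[1], -s[0]))
```

For example, `top_k_by_frequency(["banana", "bandana", "cabana", "apple"], "ana", 2)` scans all four documents even though only two are reported, which is the cost an index built in advance is meant to avoid.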
On the ACB compressor
Context-based compression methods are the most powerful approaches to squeeze arbitrary textual data. They offer a good predictive model for the subsequent data based on the already seen one, without assuming any probability distribution for the input source.
In this thesis we analyze the adaptive ACB method (Buyanovsky, 94) which is mostly unexplored in the literature, although preliminary results showed compression ratios comparable (or even superior) to the best known data compression utilities.
The novel feature of ACB consists of deploying both the previous context and the subsequent content to find a succinct encoding for the latter. We perform a large set of experiments to study the behavior of ACB and to compare it with known compressors, thus devising variations of the basic ACB scheme that look promising for future developments.
Space-efficient data structures for string searching and retrieval
Let D = {d_1, d_2, ...} be a collection of string documents of n characters in total, which are drawn from an alphabet set Sigma = [sigma] = {1, 2, 3, ..., sigma}. The top-k document retrieval problem is to maintain D as a data structure, such that whenever a query Q = (P, k) comes, we can report (the identifiers of) those k documents that are most relevant to the pattern P (of p characters). The relevance of a document d_r with respect to a pattern P is captured by score(P, d_r), which can be any function of the set of locations where P occurs in d_r. Finding the most relevant documents to the user query is the central task of any web-search engine. In the case of web data, the documents can be demarcated along word boundaries. All the search engines use an inverted index as the backbone data structure. For each word occurring in the document collection, the inverted index stores the list of documents where it appears. It is often augmented with relevance scores and/or positional information. However, when data consists of strings (e.g., in bioinformatics or Asian language texts), there are no word demarcation boundaries and the queries are arbitrary substrings instead of being proper valid words. In this case, string data structures have to be used, and the central approach is to use a suffix tree (or string B-tree) with appropriate augmenting data structures. The work by Hon, Shah and Vitter [FOCS 2009], and Navarro and Nekrich [SODA 2012] resulted in a linear space data structure with optimal O(p+k) query time for this problem. This was based on a geometric interpretation of the query. We extend this central problem in two important areas of massive data sets. First, we consider an external memory disk-based index, where we give near optimal results. Next, we consider compression aspects of the data structure, reducing the storage space. This is a central goal of the active research field of succinct data structures.
We present several results, which improve upon several previous results, and are currently the best known space-time trade-offs in this area.
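The inverted index described in the abstract above, and its limitation for substring queries, can be illustrated with a short Python sketch. This is only an illustration under simple assumptions (whitespace tokenisation, no relevance scores or positions); the helper names `build_inverted_index` and `lookup` are hypothetical and not taken from the dissertation.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each whitespace-delimited word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for word in text.split():
            index[word].add(doc_id)
    return index


def lookup(index, word):
    """Return the sorted ids of documents that contain `word` as a whole word."""
    return sorted(index.get(word, ()))
```

With the documents `["the cat sat", "the dog", "catalog of cats"]`, `lookup(index, "cat")` reports only document 0, although "cat" also occurs as a substring of "catalog" and "cats" in document 2. Arbitrary substring queries of this kind are what motivate suffix-tree-based indexes.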
String Searching with Ranking Constraints and Uncertainty
Strings play an important role in many areas of computer science. Searching for a pattern in a string or string collection is one of the most classic problems. Different variations of this problem, such as document retrieval, ranked document retrieval and dictionary matching, have been well studied. The enormous growth of the internet, large genomic projects, sensor networks and digital libraries necessitates not just efficient algorithms and data structures for general string indexing, but also indexes for texts with fuzzy information and support for queries with different constraints. This dissertation addresses some of these problems and proposes indexing solutions. One such variation is the document retrieval query for included and excluded/forbidden patterns, where the objective is to retrieve all the relevant documents that contain the included patterns and do not contain the excluded patterns. We continue the previous work done on this problem and propose a more efficient solution. We conjecture that any significant improvement over these results is highly unlikely. We also consider the scenario when the query consists of more than two patterns. The forbidden pattern problem suffers from the drawback that linear space (in words) solutions are unlikely to yield better than O(sqrt(n/occ)) per-document reporting time, where n is the total length of the documents and occ is the number of output documents. Continuing this path, we introduce a new variation, namely document retrieval with forbidden extension query, where the forbidden pattern is an extension of the included pattern. We also address the more general top-k version of the problem, which retrieves the top k documents, where the ranking is based on the PageRank relevance metric. This problem finds motivation in search applications. It also holds theoretical interest, as we show that the hardness of the forbidden pattern problem is alleviated in this problem. We achieve linear space and optimal query time for this variation.
We also propose succinct indexes for both these problems. Position-restricted pattern matching considers the scenario where only part of the text is searched. We propose a succinct index for this problem with efficient query time. An important application for this problem stems from searching in genomic sequences, where only part of the gene sequence is searched for interesting patterns. The problem of computing discriminating (resp. generic) words is to report all minimal (resp. maximal) extensions of a query pattern which are contained in at most (resp. at least) a given number of documents. These problems are motivated by applications in computational biology, text mining and automated text classification. We propose succinct indexes for these problems. Strings with uncertainty and fuzzy information play an important role in increasingly many applications. We propose a general framework for indexing uncertain strings such that a deterministic query string can be searched efficiently. String matching becomes a probabilistic event when a string contains uncertainty, i.e., each position of the string can have different probable characters with an associated probability of occurrence for each character. Such uncertain strings are prevalent in various applications such as biological sequence data, event monitoring and automatic ECG annotations. We consider two basic problems of string searching, namely substring searching and string listing. We formulate these well-known problems for the uncertain strings paradigm and propose exact and approximate solutions for them. We also discuss a constrained variation of orthogonal range searching. Given a set of points, the task of orthogonal range searching is to build a data structure such that all the points inside an orthogonal query region can be reported. We introduce a new variation, namely shared constraint range searching, which naturally arises in constrained pattern matching applications.
Shared constraint range searching is a special four-sided range reporting query problem in which two of the constraints are shared, effectively reducing the number of independent constraints. For this problem, we propose a linear space index that can match the best known bound for the three-dimensional dominance reporting problem. We extend our data structure to the external memory model.
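The notion of matching a deterministic pattern against an uncertain string, mentioned in the abstract above, can be sketched in Python as follows. This is a naive scan for illustration only, not the index proposed in the dissertation. It assumes that each position is represented as a dictionary from characters to probabilities, that positions are independent, and that an occurrence is reported when its probability reaches a user-chosen threshold; all three are assumptions of this sketch, and the function names are hypothetical.

```python
def match_probability(uncertain, pattern, start):
    """Probability that `pattern` occurs at `start`, assuming independent positions."""
    prob = 1.0
    for offset, ch in enumerate(pattern):
        # A character that cannot appear at this position contributes probability 0.
        prob *= uncertain[start + offset].get(ch, 0.0)
        if prob == 0.0:
            break
    return prob


def probable_occurrences(uncertain, pattern, threshold):
    """List (start, probability) for every alignment whose probability is >= threshold."""
    hits = []
    for start in range(len(uncertain) - len(pattern) + 1):
        prob = match_probability(uncertain, pattern, start)
        if prob >= threshold:
            hits.append((start, prob))
    return hits
```

For the uncertain string `[{"a": 1.0}, {"b": 0.5, "c": 0.5}, {"a": 0.8, "b": 0.2}, {"b": 1.0}]` and the pattern `"ab"`, the alignment at position 0 has probability 0.5 and the alignment at position 2 has probability 0.8, so a threshold of 0.6 keeps only the second.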
Efficient Indexing for Structured and Unstructured Data
The collection of digital data is growing at an exponential rate. Data originates from a wide range of sources such as text feeds, biological sequencers, internet traffic over routers, sensors and many others. To mine intelligent information from these sources, users have to query the data. Indexing techniques aim to reduce the query time by preprocessing the data. The diversity of data sources in the real world makes it imperative to develop application-specific indexing solutions based on the data to be queried. Data can be structured, i.e., relational tables, or unstructured, i.e., free text. Moreover, increasingly many applications need to seamlessly analyze both kinds of data, making data integration a central issue. Integrating text with structured data needs to account for missing values, errors in the data, etc. Probabilistic models have been proposed recently for this purpose. These models are also useful for applications where uncertainty is inherent in the data, e.g., sensor networks. This dissertation aims to propose efficient indexing solutions for several problems that lie at the intersection of databases and information retrieval, such as joining ranked inputs, full-text document searching, etc. Other well-known problems of ranked retrieval and pattern matching are also studied under probabilistic settings. For each problem, the worst-case theoretical bounds of the proposed solutions are established and/or their practicality is demonstrated by thorough experimentation.