296 research outputs found

    Reverse-Safe Data Structures for Text Indexing

    We introduce the notion of reverse-safe data structures. These are data structures that prevent the reconstruction of the data they encode (i.e., they cannot be easily reversed). A data structure D is called z-reverse-safe when there exist at least z datasets with the same set of answers as the ones stored by D. The main challenge is to ensure that D stores as many answers to useful queries as possible, is constructed efficiently, and has size close to the size of the original dataset it encodes. Given a text of length n and an integer z, we propose an algorithm which constructs a z-reverse-safe data structure that has size O(n) and answers pattern matching queries of length at most d optimally, where d is maximal for any such z-reverse-safe data structure. The construction algorithm takes O(n^ω log d) time, where ω is the matrix multiplication exponent. We show that, despite the n^ω factor, our engineered implementation takes only a few minutes to finish for million-letter texts. We further show that plugging our method into data analysis applications gives insignificant or no data utility loss. Finally, we show how our technique can be extended to support applications under a realistic adversary model.

    Evaluating linear XPath expressions by pattern-matching automata

    We consider the problem of efficiently evaluating a large number of XPath expressions, especially in the case when they define subscriber profiles for filtering of XML documents. For each document in an XML document stream, the task is to determine those profiles that match the document. In this article we present a new general method for filtering with profiles expressed by linear XPath expressions with child operators (/), descendant operators (//), and wildcards (*). This new filtering algorithm is based on a backtracking deterministic finite automaton derived from the classic Aho-Corasick pattern-matching automaton. This automaton has a size linear in the sum of the sizes of the XPath filters, and the worst-case time bound of the algorithm is much less than the time bound of the simulation of linear-size nondeterministic automata. Our new algorithm has a predecessor that can handle child and descendant operators but not wildcards, and has been shown to be extremely efficient when a document type definition (DTD) has been used to prune out all the wildcards and most of the descendant operators. But in some cases, such as when the DTD is highly recursive, it may not be possible to prune out all wildcards without producing too large a set of filters. Then it is important to have the full generality of an evaluation algorithm, as presented in this article, that can also handle wildcards.
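
The matching semantics of linear XPath filters can be illustrated with a minimal Python sketch. It is not the article's backtracking automaton derived from Aho-Corasick; it is a naive memoized matcher, and the names `parse_linear_xpath` and `matches` are hypothetical. It assumes that a linear expression is a sequence of steps joined by `/` or `//`, that each step is an element name or `*`, and that a document node is represented by the list of element names on its root-to-node path.

```python
import re
from functools import lru_cache


def parse_linear_xpath(expr):
    """Split e.g. '/a//b/*' into [('/', 'a'), ('//', 'b'), ('/', '*')]."""
    steps = re.findall(r"(//|/)([^/]+)", expr)
    # Reject anything that is not a plain sequence of axis + name steps.
    if "".join(axis + name for axis, name in steps) != expr:
        raise ValueError(f"not a linear XPath expression: {expr!r}")
    return steps


def matches(steps, path):
    """Return True if the root-to-node path is selected by the steps."""

    @lru_cache(maxsize=None)
    def go(i, j):
        # i = number of steps already matched, j = path elements consumed.
        if i == len(steps):
            return j == len(path)
        axis, name = steps[i]
        # A child step must match the next element; a descendant step
        # may skip any number of elements before matching.
        candidates = [j] if axis == "/" else range(j, len(path))
        return any(
            (name == "*" or path[k] == name) and go(i + 1, k + 1)
            for k in candidates
            if k < len(path)
        )

    return go(0, 0)


profile = parse_linear_xpath("/a//b/*")
print(matches(profile, ["a", "x", "b", "c"]))
print(matches(parse_linear_xpath("/a/b"), ["a", "x", "b"]))
```

A filter built this way would run `matches` once per profile and per node, which is the per-profile cost that a single shared automaton over all filters is meant to avoid.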

    Motif Discovery with Compact Approaches - Design and Applications

    In the post-genomic era, the ability to predict the behavior, the function, or the structure of biological entities, as well as interactions among them, plays a fundamental role in the discovery of information to help biologists explain biological mechanisms. In this context, appropriate characterization of the structures under analysis, and the exploitation of combinatorial properties of sequences, are crucial steps towards the development of efficient algorithms and data structures for the analysis of biological sequences. Similarity is a fundamental concept in Biology. Several functional and structural properties, and evolutionary mechanisms, can be predicted by comparing new elements with already classified elements, or by comparing elements with a similar structure or function to infer the common mechanism that is at the basis of the observed similar behavior. Such elements are commonly called motifs. Comparison-based methods for sequence analysis find their application in several biological contexts, such as identification of transcription factor binding sites, finding structural and functional similarities in proteins, and phylogeny. Therefore the development of adequate methodologies for motif discovery is of paramount interest for several fields in computational biology. In motif discovery in biosequences, it is common to assume that statistically significant candidates are those that are likely to hide some biologically significant property. For this purpose all the possible candidates are ranked according to some statistics on words (frequency, over/under representation, etc.). Then they are presented in output for further inspection by a biologist, who identifies the most promising subsequences, and tests them in the laboratory to confirm their biological significance.
Therefore, when designing algorithms for motif discovery, besides obviously aiming at time and space efficiency, particular attention should be devoted to the output representation. In fact, even considering fixed-length strings, the size of the candidate set becomes exponential if exhaustive enumeration is applied. This is already true when only exact matches are considered as candidate occurrences, and worsens if some kind of variability is allowed (for example, a fixed number of mismatches). Alternatively, heuristics could be used, however without the guarantee of finding the optimal solution. The computational power of today's computers can partially reduce these effects, in particular for short candidates. However, if the size of the output is too big to be analyzed by human inspection, the risk is to provide biologists with very fast, but useless tools. A possible solution relies on compact approaches. Compact approaches are based on the partition of the search space into classes. The classes must be designed in such a way that the score used to rank the candidates has a monotone behavior within each class. This allows the identification of a representative of each class, which is the element with the highest score. Consequently, it suffices to compute, and report in output, the score only for the representatives. In fact, we are guaranteed that for each element that has not been ranked there is another one (the representative of the class it belongs to) that is at least equally significant. The final user can then be presented with an output that has the size of the partition, rather than the size of the candidate space, with obvious advantages for the human-based analysis that follows the computer-based filtering of the pattern discovery algorithm. Compact approaches find applications both in searching and discovery frameworks.
They can also be applied to several motif models: exact patterns, patterns with given mismatch distribution, patterns with unknown mismatch distribution, profiles (i.e. matrices), and under both i.i.d. and Markov distributions. The purpose of this chapter is to describe the basis of compact approaches and to provide readers with the conceptual tools for applying compact approaches to the design of their algorithms for biosequence analysis. Moreover, examples of compact approaches that have been successfully developed for several motif models (e.g. exact words, co-occurrences, words with mismatches, etc.) will be explained, and experimental results to discuss their power will be presented.
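
The idea of partitioning candidates into classes and reporting one representative per class can be sketched in Python for the exact-word model. This is an illustration under stated assumptions, not the chapter's algorithm: the class definition used here (words that share exactly the same set of start positions in the text) and the function names `occurrence_classes` and `representatives` are choices made for this example, and the sketch assumes a score that does not decrease when a word is extended within its class, so that the longest word can serve as the representative.

```python
from collections import defaultdict


def occurrence_classes(text, max_len):
    """Group all substrings of length <= max_len by their start positions."""
    classes = defaultdict(list)
    for length in range(1, max_len + 1):
        starts = defaultdict(list)
        for i in range(len(text) - length + 1):
            starts[text[i:i + length]].append(i)
        for word, positions in starts.items():
            # Words with identical start positions fall into the same class.
            classes[tuple(positions)].append(word)
    return classes


def representatives(text, max_len):
    """Map the longest word of each class to its number of occurrences."""
    return {
        max(words, key=len): len(positions)
        for positions, words in occurrence_classes(text, max_len).items()
    }


reps = representatives("abcabc", 3)
print(reps)
```

For the text `abcabc` and words of length up to 3 there are 9 distinct candidate words but only 5 classes, so only 5 scores would need to be computed and reported.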

    WAQS: a web-based approximate query system

    The Web is often viewed as a gigantic database that holds vast stores of information and provides ubiquitous accessibility to end-users. Since its inception, the Internet has experienced explosive growth both in the number of users and the amount of content available on it. However, searching for information on the Web has become increasingly difficult. Although query languages have long been part of database management systems, the standard query language, the Structured Query Language (SQL), is not suitable for Web content retrieval. In this dissertation, a new technique for document retrieval on the Web is presented. This technique is designed to allow a detailed retrieval and hence reduce the number of matches returned by typical search engines. The main objective of this technique is to allow the query to be based on not just keywords but also the location of the keywords within the logical structure of a document. In addition, the technique also provides approximate search capabilities based on the notion of Distance and Variable Length Don't Cares. The proposed techniques have been implemented in a system, called Web-Based Approximate Query System, which contains an SQL-like query language called Web-Based Approximate Query Language. Web-Based Approximate Query Language has also been integrated with EnviroDaemon, an environmental domain specific search engine. It provides EnviroDaemon with more detailed searching capabilities than just keyword-based search. Implementation details, technical results and future work are presented in this dissertation.

    Data Structure Lower Bounds for Document Indexing Problems

    We study data structure problems related to document indexing and pattern matching queries and our main contribution is to show that the pointer machine model of computation can be extremely useful in proving high and unconditional lower bounds that cannot be obtained in any other known model of computation with the current techniques. Often our lower bounds match the known space-query time trade-off curve and in fact for all the problems considered, there is a very good and reasonable match between our lower bounds and the known upper bounds, at least for some choice of input parameters. The problems that we consider are set intersection queries (both the reporting variant and the semi-group counting variant), indexing a set of documents for two-pattern queries, or forbidden-pattern queries, or queries with wild-cards, and indexing an input set of gapped-patterns (or two-patterns) to find those matching a document given at query time. Comment: Full version of the conference version that appeared at ICALP 2016, 25 pages.

    Structator: fast index-based search for RNA sequence-structure patterns

    Background: The secondary structure of RNA molecules is intimately related to their function and often more conserved than the sequence. Hence, the important task of searching databases for RNAs requires matching sequence-structure patterns. Unfortunately, current tools for this task have, in the best case, a running time that is only linear in the size of sequence databases. Furthermore, established index data structures for fast sequence matching, like suffix trees or arrays, cannot benefit from the complementarity constraints introduced by the secondary structure of RNAs. Results: We present a novel method and readily applicable software for time efficient matching of RNA sequence-structure patterns in sequence databases. Our approach is based on affix arrays, a recently introduced index data structure, preprocessed from the target database. Affix arrays support bidirectional pattern search, which is required for efficiently handling the structural constraints of the pattern. Structural patterns like stem-loops can be matched inside out, such that the loop region is matched first and then the pairing bases on the boundaries are matched consecutively. This makes it possible to exploit base pairing information for search space reduction and leads to an expected running time that is sublinear in the size of the sequence database. The incorporation of a new chaining approach in the search of RNA sequence-structure patterns enables the description of molecules folding into complex secondary structures with multiple ordered patterns. The chaining approach removes spurious matches from the set of intermediate results, in particular of patterns with little specificity. In benchmark experiments on the Rfam database, our method runs up to two orders of magnitude faster than previous methods. Conclusions: The presented method's sublinear expected running time makes it well suited for RNA sequence-structure pattern matching in large sequence databases.
RNA molecules containing several stem-loop substructures can be described by multiple sequence-structure patterns and their matches are efficiently handled by a novel chaining method. Beyond our algorithmic contributions, we provide with Structator a complete and robust open-source software solution for index-based search of RNA sequence-structure patterns. The Structator software is available at http://www.zbh.uni-hamburg.de/Structator. Deutsche Forschungsgemeinschaft (grant WI 3628/1-1)

    Large-Scale Pattern-Based Information Extraction from the World Wide Web

    Extracting information from text is the task of obtaining structured, machine-processable facts from information that is mentioned in an unstructured manner. It thus allows systems to automatically aggregate information for further analysis, efficient retrieval, automatic validation, or appropriate visualization. This work explores the potential of using textual patterns for Information Extraction from the World Wide Web.

    Pattern Discovery from Biosequences

    In this thesis we have developed novel methods for analyzing biological data: the primary sequences of DNA and proteins, microarray-based gene expression data, and other functional genomics data. The main contribution is the development of the pattern discovery algorithm SPEXS, accompanied by several practical applications for analyzing real biological problems. For performing these biological studies that integrate different types of biological data we have developed a comprehensive web-based biological data analysis environment, Expression Profiler (http://ep.ebi.ac.uk/).

    A Novel Tree Structure for Pattern Matching in Biological Sequences

    This dissertation proposes a novel tree structure, Error Tree (ET), to more efficiently solve the Approximate Pattern Matching problem, a fundamental problem in bioinformatics and information retrieval. The problem involves different matching measures such as the Hamming distance, edit distance, and wildcard matching. The input is usually a text of length n over a fixed alphabet of size Σ, a pattern P of length m, and an integer k. The output is those subsequences in the text that are at a distance ≤ k from P by Hamming distance, edit distance, or wildcard matching. An immediate application of approximate pattern matching is the Planted Motif Search, an important problem in many biological applications such as finding promoters, enhancers, locus control regions, transcription factors, etc. The (l, d)-Planted Motif Search is defined as follows: Given n sequences over an alphabet of size Σ, each of length m, and two integers l and d, find a motif M of length l, where in each sequence there is at least one l-mer (substring of length l) at a Hamming distance of ≤ d from M. Based on the ET structure, our algorithm ET-Motif solves this problem efficiently in time and space. The thesis also discusses how the ET structure may add efficiency when it comes to Genome Assembly and DNA Sequence Compression. Current high-throughput sequencing technologies generate millions or billions of short reads (100-1000 bases) that are sequenced from a genome millions or billions of bases long. The De novo Genome Assembly problem is to assemble the original genome into sequences that are as long and accurate as possible. Although high quality assemblies can be obtained by assembling multiple paired-end libraries with both short and long insert sizes, the latter are costly to generate. Moreover, the recent GAGE-B study showed that a remarkably good assembly quality can be obtained for bacterial genomes by state-of-the-art assemblers run on a single short-insert library with a very high coverage.
This thesis introduces a novel Hierarchical Genome Assembly (HGA) method that takes further advantage of such high coverage by independently assembling disjoint subsets of reads, combining assemblies of the subsets, and finally re-assembling the combined contigs along with the original reads. We empirically evaluate this methodology for eight leading assemblers using seven GAGE-B bacterial datasets consisting of 100bp Illumina HiSeq and 250bp Illumina MiSeq reads with coverage ranging from 100x to ∼200x. The results show that HGA leads to a significant improvement in the quality of the assembly for all evaluated assemblers and datasets. Still, the problem involves a major step, which is overlapping the ends of the reads while allowing a few mismatches (i.e. the approximate matching problem). This requires computing the overlaps between the ends of all pairs of reads. The computation of such overlaps when allowing mismatches is intensive. The ET structure may further speed up this step. Lastly, due to the significant amount of DNA data generated by Next-Generation Sequencing machines, there is an increasing need to compress such data to reduce the storage space and transmission time. Huffman encoding that incorporates DNA sequence characteristics proves to compress DNA data better. Different implementations of Huffman trees, centering on the selection of frequent repeats, are introduced in this thesis. Experimental results demonstrate improvement in the compression ratios for five genomes with lengths ranging from 5Mbp to 50Mbp, compared with the use of a standard Huffman tree algorithm. Hence, the thesis suggests an improvement on all DNA sequence compression algorithms that employ the conventional Huffman encoding. Moreover, approximate repeats can be compressed and further improve the results by encoding the Hamming or edit distance between these repeats. However, computing such distances incurs additional costs in both time and space.
These costs can be reduced by using the ET structure.
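
The two problem definitions in this abstract (approximate pattern matching under Hamming distance, and the (l, d)-Planted Motif Search) can be made concrete with a brute-force Python sketch. It does not implement the Error Tree or ET-Motif, whose details are not given in the abstract; the function names `hamming`, `approx_occurrences`, and `planted_motifs`, and the default DNA alphabet, are assumptions made for this example.

```python
from itertools import product


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def approx_occurrences(text, pattern, k):
    """Start positions where pattern matches text with at most k mismatches."""
    m = len(pattern)
    return [
        i
        for i in range(len(text) - m + 1)
        if hamming(text[i:i + m], pattern) <= k
    ]


def planted_motifs(seqs, l, d, alphabet="ACGT"):
    """All length-l motifs having an l-mer within distance d in every sequence."""
    found = []
    # Exhaustive enumeration of all |alphabet|^l candidates: only for tiny l.
    for candidate in product(alphabet, repeat=l):
        motif = "".join(candidate)
        if all(approx_occurrences(s, motif, d) for s in seqs):
            found.append(motif)
    return found


print(approx_occurrences("ACGTACGA", "ACGT", 1))
print(planted_motifs(["ACGT", "ACGA"], 4, 1))
```

The candidate enumeration grows exponentially with l, and each check rescans every sequence, which is the kind of cost that an index structure for approximate matching is intended to reduce.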