10 research outputs found

    Position Heaps for Parameterized Strings

    We propose a new indexing structure for parameterized strings, called the parameterized position heap. The parameterized position heap is applicable to the parameterized pattern matching problem, where the pattern matches a substring of the text if there exists a bijective mapping from the symbols of the pattern to the symbols of the substring. We propose an online construction algorithm for the parameterized position heap of a text and show that it runs in time linear in the text size. We also show that, using the parameterized position heap, we can find all occurrences of a pattern in the text in time linear in the product of the pattern size and the alphabet size.
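The matching condition in this abstract (a bijection between pattern symbols and substring symbols) can be illustrated with a small sketch. It is a naive quadratic-time check based on the standard "previous occurrence" encoding, not the parameterized position heap itself; the function names `prev_encode` and `p_match_positions` are hypothetical, and the sketch assumes every symbol may be renamed (no fixed symbols).

```python
def prev_encode(s):
    """Encode each symbol as the distance to its previous occurrence (0 if none).

    Two strings of equal length have the same encoding exactly when some
    bijection on symbols maps one onto the other.
    """
    last_seen = {}
    encoded = []
    for i, symbol in enumerate(s):
        encoded.append(i - last_seen[symbol] if symbol in last_seen else 0)
        last_seen[symbol] = i
    return encoded


def p_match_positions(text, pattern):
    """Return all start positions where `pattern` parameterized-matches `text`.

    Naive approach: re-encode every window, O(len(text) * len(pattern)) time.
    """
    m = len(pattern)
    target = prev_encode(pattern)
    return [
        start
        for start in range(len(text) - m + 1)
        if prev_encode(text[start:start + m]) == target
    ]


if __name__ == "__main__":
    # "xxy" matches "aab" (x->a, y->b) and "bba" (x->b, y->a).
    print(p_match_positions("aabbab", "xxy"))
```

An index such as the one the abstract proposes would avoid re-encoding every window; the sketch only makes the definition of a match concrete.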

    Computing regularities in strings: A survey

    The aim of this survey is to provide insight into the sequential algorithms that have been proposed to compute exact “regularities” in strings; that is, covers (or quasiperiods), seeds, repetitions, runs (or maximal periodicities), and repeats. After outlining and evaluating the algorithms that have been proposed for their computation, I suggest possibly productive future directions of research.

    Párhuzamos sztringalgoritmusok és kapcsolódó adatszerkezetek (Parallel String Algorithms and Related Data Structures)

    The thesis discusses the construction of parallel string data structures.

    String Searching with Ranking Constraints and Uncertainty

    Strings play an important role in many areas of computer science. Searching for a pattern in a string or string collection is one of the most classic problems. Variations of this problem, such as document retrieval, ranked document retrieval, and dictionary matching, have been well studied. The enormous growth of the internet, large genomic projects, sensor networks, and digital libraries necessitates not just efficient algorithms and data structures for general string indexing, but also indexes for texts with fuzzy information and support for queries with different constraints. This dissertation addresses some of these problems and proposes indexing solutions. One such variation is the document retrieval query for included and excluded (forbidden) patterns, where the objective is to retrieve all the relevant documents that contain the included patterns and do not contain the excluded patterns. We continue the previous work on this problem and propose a more efficient solution. We conjecture that any significant improvement over these results is highly unlikely. We also consider the scenario where the query consists of more than two patterns. The forbidden pattern problem suffers from the drawback that linear-space (in words) solutions are unlikely to achieve better than O(√(n/occ)) reporting time per document, where n is the total length of the documents and occ is the number of output documents. Continuing along this path, we introduce a new variation, namely document retrieval with a forbidden extension query, where the forbidden pattern is an extension of the included pattern. We also address the more general top-k version of the problem, which retrieves the top k documents, where the ranking is based on the PageRank relevance metric. This problem is motivated by search applications. It also holds theoretical interest, as we show that the hardness of the forbidden pattern problem is alleviated in this problem. We achieve linear space and optimal query time for this variation. We also propose succinct indexes for both of these problems. Position-restricted pattern matching considers the scenario where only part of the text is searched. We propose a succinct index for this problem with efficient query time. An important application of this problem stems from searching in genomic sequences, where only part of the gene sequence is searched for interesting patterns. The problem of computing discriminating (resp. generic) words is to report all minimal (resp. maximal) extensions of a query pattern that are contained in at most (resp. at least) a given number of documents. These problems are motivated by applications in computational biology, text mining, and automated text classification. We propose succinct indexes for these problems. Strings with uncertainty and fuzzy information play an important role in an increasing number of applications. We propose a general framework for indexing uncertain strings such that a deterministic query string can be searched efficiently. String matching becomes a probabilistic event when a string contains uncertainty, i.e., each position of the string can have different probable characters, each with an associated probability of occurrence. Such uncertain strings are prevalent in various applications such as biological sequence data, event monitoring, and automatic ECG annotations. We consider two basic problems of string searching, namely substring searching and string listing. We formulate these well-known problems for the uncertain strings paradigm and propose exact and approximate solutions for them. We also discuss a constrained variation of orthogonal range searching. Given a set of points, the task of orthogonal range searching is to build a data structure such that all the points inside an orthogonal query region can be reported. We introduce a new variation, namely shared constraint range searching, which naturally arises in constrained pattern matching applications. Shared constraint range searching is a special four-sided range reporting query problem where two constraints are shared, effectively reducing the number of independent constraints. For this problem, we propose a linear-space index that matches the best known bound for the three-dimensional dominance reporting problem. We extend our data structure to the external memory model.
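The uncertain-string model described in this abstract (each position carries several candidate characters with probabilities) can be made concrete with a brute-force sketch. It is not the dissertation's index: the representation as a list of dictionaries, the independence of positions, the threshold `tau`, and the names `match_probability` and `probable_occurrences` are all assumptions made for illustration.

```python
def match_probability(uncertain, pattern, start):
    """Probability that `pattern` occurs at `start` in the uncertain string.

    `uncertain` is a list of dicts mapping character -> probability.
    Assumes positions are independent, so probabilities multiply.
    """
    probability = 1.0
    for offset, ch in enumerate(pattern):
        probability *= uncertain[start + offset].get(ch, 0.0)
        if probability == 0.0:
            break  # no point multiplying further
    return probability


def probable_occurrences(uncertain, pattern, tau):
    """List (position, probability) pairs whose probability is at least `tau`."""
    results = []
    for start in range(len(uncertain) - len(pattern) + 1):
        p = match_probability(uncertain, pattern, start)
        if p >= tau:
            results.append((start, p))
    return results


if __name__ == "__main__":
    text = [
        {"a": 0.5, "b": 0.5},
        {"a": 1.0},
        {"a": 0.75, "b": 0.25},
    ]
    # "aa" occurs at 0 with probability 0.5 and at 1 with probability 0.75.
    print(probable_occurrences(text, "aa", tau=0.6))
```

This scan takes time proportional to the text length times the pattern length per query; an index would aim to answer the same query without visiting every position.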

    Constructing minimal acyclic deterministic finite automata

    This thesis is submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Ph.D.) in the FASTAR group of the Department of Computer Science, University of Pretoria, South Africa. I present a number of algorithms for constructing minimal acyclic deterministic finite automata (MADFAs), most of which I originally derived/designed or co-discovered. Being acyclic, such automata represent finite languages and have proven useful in applications such as spellchecking, virus-searching and text indexing. In many of those applications, the automata grow to billions of states, making them difficult to store without using various compression techniques, the most important of which is minimization. Results from the late 1950s show that minimization yields a unique automaton (for a given language), and later results show that minimization of acyclic automata is possible in time linear in the number of states. These two results make for a rich area of algorithmics research; automata and algorithmics research are relatively old fields of computing science, and the discovery/invention of new algorithms in the field is an exciting result. I present both incremental and nonincremental algorithms. With nonincremental techniques, the unminimized acyclic deterministic finite automaton (ADFA) is first constructed and then minimized. As mentioned above, the unminimized ADFA can be very large indeed, often even too large to fit within the virtual memory space of the computer. As a result, incremental techniques for minimization (i.e. the ADFA is minimized during its construction) become interesting. Incremental algorithms frequently have some overhead: if the unminimized ADFA fits easily within physical memory, it may still be faster to use nonincremental techniques. The presentation used in this thesis has a few unusual characteristics: Few other presentations follow a correctness-by-construction style for presenting and deriving algorithms. The presentations given here include correctness arguments or sketches thereof. The presentation is taxonomic, emphasizing the similarities and differences between the algorithms at a fundamental level. While it is possible to present these algorithms in a formal-language-theoretic setting, this thesis remains somewhat closer to the actual implementation issues. In several chapters, new algorithms and interesting new variants of existing algorithms are presented. The thesis gives new presentations of many existing algorithms, all in a common format with common examples. There are extensive links to the existing literature. Thesis (PhD)--University of Pretoria, 2010. Computer Science. Unrestricted.
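The nonincremental approach mentioned in this abstract (build the unminimized ADFA first, then minimize it) can be sketched as follows. This is a generic illustration, not one of the thesis's algorithms: it builds a trie from a word list and merges equivalent states bottom-up using a hash-based register of state signatures. The dictionary-based state layout and the names `build_trie` and `minimize` are hypothetical.

```python
def build_trie(words):
    """Build an unminimized ADFA (a trie) accepting exactly `words`."""
    root = {"final": False, "next": {}}
    for word in words:
        node = root
        for ch in word:
            node = node["next"].setdefault(ch, {"final": False, "next": {}})
        node["final"] = True
    return root


def minimize(root):
    """Merge equivalent states bottom-up.

    Returns (start_id, register), where `register` maps each distinct state
    signature to the id of its state in the minimal automaton. A signature is
    (is_final, sorted outgoing transitions to already-minimized targets), so
    two states share a signature exactly when they accept the same suffixes.
    """
    register = {}

    def canonical_id(node):
        transitions = tuple(
            sorted((ch, canonical_id(child)) for ch, child in node["next"].items())
        )
        signature = (node["final"], transitions)
        if signature not in register:
            register[signature] = len(register)
        return register[signature]

    start_id = canonical_id(root)
    return start_id, register


if __name__ == "__main__":
    trie = build_trie(["tap", "taps", "top", "tops"])
    start, states = minimize(trie)
    # The trie has 8 states; the minimal automaton needs fewer.
    print(len(states))
```

The recursion depth equals the longest word, so this sketch suits short words only. An incremental algorithm, as the abstract notes, would instead merge states while the automaton is being built, so the full trie never has to be held in memory.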

    35th Symposium on Theoretical Aspects of Computer Science: STACS 2018, February 28-March 3, 2018, Caen, France

    28th Annual Symposium on Combinatorial Pattern Matching : CPM 2017, July 4-6, 2017, Warsaw, Poland

    Peer reviewed

    StringMasters 2011 Special Issue
