142 research outputs found

    Subsequence Automata with Default Transitions

    Let $S$ be a string of length $n$ with characters from an alphabet of size $\sigma$. The \emph{subsequence automaton} of $S$ (often called the \emph{directed acyclic subsequence graph}) is the minimal deterministic finite automaton accepting all subsequences of $S$. A straightforward construction shows that the size (number of states and transitions) of the subsequence automaton is $O(n\sigma)$ and that this bound is asymptotically optimal. In this paper, we consider subsequence automata with \emph{default transitions}, that is, special transitions to be taken only if none of the regular transitions match the current character, and which do not consume the current character. We show that with default transitions, much smaller subsequence automata are possible, and provide a full trade-off between the size of the automaton and the \emph{delay}, i.e., the maximum number of consecutive default transitions followed before consuming a character. Specifically, given any integer parameter $k$, $1 < k \leq \sigma$, we present a subsequence automaton with default transitions of size $O(nk\log_{k}\sigma)$ and delay $O(\log_k \sigma)$. Hence, with $k = 2$ we obtain an automaton of size $O(n \log \sigma)$ and delay $O(\log \sigma)$. On the other extreme, with $k = \sigma$, we obtain an automaton of size $O(n \sigma)$ and delay $O(1)$, thus matching the bound for the standard subsequence automaton construction. Finally, we generalize the result to multiple strings. The key component of our result is a novel hierarchical automata construction of independent interest. Comment: Corrected typo
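As a rough illustration of the standard $O(n\sigma)$ construction mentioned in the abstract above (not the default-transition automaton the paper contributes), the following sketch builds one state per prefix length and fills the transition table from right to left. The names `build_subsequence_automaton` and `is_subsequence` are hypothetical, chosen only for this example.

```python
def build_subsequence_automaton(s):
    """Build the transition table of the plain subsequence automaton of s.

    State i means "the first i characters of s are already behind us".
    nxt[i][c] is the state reached on character c, i.e. one past the first
    occurrence of c at index >= i. Every state is accepting; a missing
    entry means the input is rejected.
    """
    n = len(s)
    nxt = [dict() for _ in range(n + 1)]
    # Fill right to left: state i copies state i+1, then overrides s[i].
    for i in range(n - 1, -1, -1):
        nxt[i] = dict(nxt[i + 1])
        nxt[i][s[i]] = i + 1
    return nxt


def is_subsequence(nxt, p):
    """Run p through the automaton; accept if no transition is missing."""
    state = 0
    for c in p:
        if c not in nxt[state]:
            return False
        state = nxt[state][c]
    return True


nxt = build_subsequence_automaton("abcab")
print(is_subsequence(nxt, "acb"))
print(is_subsequence(nxt, "cc"))
```

Each of the $n + 1$ states stores up to $\sigma$ transitions, which is where the $O(n\sigma)$ size comes from; the default transitions studied in the paper are a way to avoid copying nearly identical rows.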

    Compressed Subsequence Matching and Packed Tree Coloring

    We present a new algorithm for subsequence matching in grammar compressed strings. Given a grammar of size $n$ compressing a string of size $N$ and a pattern string of size $m$ over an alphabet of size $\sigma$, our algorithm uses $O(n+\frac{n\sigma}{w})$ space and $O(n+\frac{n\sigma}{w}+m\log N\log w\cdot occ)$ or $O(n+\frac{n\sigma}{w}\log w+m\log N\cdot occ)$ time. Here $w$ is the word size and $occ$ is the number of occurrences of the pattern. Our algorithm uses less space than previous algorithms and is also faster for $occ=o(\frac{n}{\log N})$ occurrences. The algorithm uses a new data structure that allows us to efficiently find the next occurrence of a given character after a given position in a compressed string. This data structure in turn is based on a new data structure for the tree color problem, where the node colors are packed in bit strings. Comment: To appear at CPM '1
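The "next occurrence of a given character after a given position" query is the core primitive in the abstract above. The sketch below shows only the uncompressed analogue, assuming the whole text fits in memory: it stores sorted position lists per character and answers queries by binary search. It is not the paper's grammar-based structure, and the function names are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def index_positions(text):
    """Map each character to the sorted list of indices where it occurs."""
    positions = defaultdict(list)
    for i, ch in enumerate(text):
        positions[ch].append(i)
    return positions


def next_occurrence(positions, ch, after):
    """Smallest index j > after with text[j] == ch, or None if there is none."""
    lst = positions.get(ch, [])
    k = bisect_right(lst, after)
    return lst[k] if k < len(lst) else None


def match_subsequence(positions, pattern, start=-1):
    """Leftmost embedding of pattern strictly after index start, or None."""
    pos = start
    embedding = []
    for ch in pattern:
        pos = next_occurrence(positions, ch, pos)
        if pos is None:
            return None
        embedding.append(pos)
    return embedding


positions = index_positions("abracadabra")
print(match_subsequence(positions, "bcd"))
```

Repeating the query once per pattern character yields the greedy leftmost embedding; the compressed setting replaces the explicit position lists with a structure over the grammar.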

    Faster subsequence recognition in compressed strings

    Computation on compressed strings is one of the key approaches to processing massive data sets. We consider local subsequence recognition problems on strings compressed by straight-line programs (SLP), which is closely related to Lempel-Ziv compression. For an SLP-compressed text of length $\bar m$, and an uncompressed pattern of length $n$, Cégielski et al. gave an algorithm for local subsequence recognition running in time $O(\bar m n^2 \log n)$. We improve the running time to $O(\bar m n^{1.5})$. Our algorithm can also be used to compute the longest common subsequence between a compressed text and an uncompressed pattern in time $O(\bar m n^{1.5})$; the same problem with a compressed pattern is known to be NP-hard.

    Discovering unbounded episodes in sequential data

    One basic goal in the analysis of time-series data is to find frequent interesting episodes, i.e., collections of events occurring frequently together in the input sequence. Most widely known work decides the interestingness of an episode from a fixed user-specified window width or interval that bounds the subsequent sequential association rules. In this paper we present a more intuitive definition that, in turn, allows interesting episodes to grow during the mining without any user-specified help. A convenient algorithm to efficiently discover the proposed unbounded episodes is also implemented. Experimental results confirm that our approach is useful and advantageous. Postprint (published version

    Compact Recognizers of Episode Sequences

    Mikhail J. Atallah, Purdue University. Given two strings $T = a_1 \ldots a_n$ and $P = b_1 \ldots b_m$ over an alphabet $\Sigma$, the problem of testing whether $P$ occurs as a subsequence of $T$ is trivially solved in linear time. It is also known that a simple $O(n \log |\Sigma|)$ time preprocessing of $T$ makes it easy to decide subsequently for any $P$, and in at most $|P| \log |\Sigma|$ character comparisons, whether $P$ is a subsequence of $T$. These problems become more complicated if one asks instead whether $P$ occurs as a subsequence of some substring $Y$ of $T$ of bounded length. This paper presents an automaton built on the textstring $T$ and capable of identifying all distinct minimal substrings $Y$ of $X$ having $P$ as a subsequence. By a substring $Y$ being minimal with respect to $P$, it is meant that $P$ is not a subsequence of any proper substring of $Y$. For every minimal substring $Y$, the automaton recognizes the occurrence of $P$ having lexicographically smallest sequence of symbol positions in $Y$. It is not difficult to realize such an automaton in time and space $O(n^2)$ for a text of $n$ characters. One result of this paper consists of bringing those bounds down to linear or $O(n \log n)$, respectively, depending on whether the alphabet is bounded or of arbitrary size, thereby matching the respective complexities of off-line exact string searching. Having built the automaton, the search for all lexicographically earliest occurrences of $P$ in $X$ is carried out in time $O(n + \sum_{i=1}^{m} rocc_i \cdot i \cdot \log n \cdot \log |\Sigma|)$, where $rocc_i$ is the number of distinct minimal substrings of $T$ having $b_1 \ldots b_i$ as a subsequence. All log factors appearing in the above bounds can be further reduced to log log by resort to known integer-handling data structures. Index Terms: algorithms, pattern matching, subsequence and episode searching, DAWG, suffix automaton, compact subsequence automaton, skip-edge DAWG, forward failure function, skip-link
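To make the notion of a minimal substring in the abstract above concrete, here is a brute-force sketch that lists every window of the text containing the pattern as a subsequence such that no proper sub-window does. It runs in quadratic time in the worst case and is not the automaton described in the paper; `minimal_windows` is a hypothetical name used only here.

```python
def minimal_windows(text, pattern):
    """Return sorted inclusive (start, end) pairs of minimal windows.

    A window is minimal when pattern is a subsequence of text[start:end+1]
    but of no proper substring of it.
    """
    if not pattern:
        return []
    windows = set()
    for i, ch in enumerate(text):
        if ch != pattern[0]:
            continue
        # Forward greedy scan: earliest end of an embedding starting at i.
        j, k = i, 0
        while j < len(text) and k < len(pattern):
            if text[j] == pattern[k]:
                k += 1
            j += 1
        if k < len(pattern):
            break  # no embedding from i, so none from any later start
        end = j - 1
        # Backward greedy scan: latest start of an embedding ending at end.
        j, k = end, len(pattern) - 1
        while k >= 0:
            if text[j] == pattern[k]:
                k -= 1
            j -= 1
        windows.add((j + 1, end))
    return sorted(windows)


print(minimal_windows("abcab", "ab"))
print(minimal_windows("aab", "ab"))
```

The forward pass fixes the earliest possible end and the backward pass then tightens the start, so neither endpoint of a reported window can be moved inward.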

    Bidirectional Growth based Mining and Cyclic Behaviour Analysis of Web Sequential Patterns

    Web sequential patterns are important for analyzing and understanding users' behaviour to improve the quality of service offered by the World Wide Web. Web prefetching is one such technique; it utilizes prefetching rules derived through Cyclic Model Analysis of the mined Web sequential patterns. Prediction is more accurate and the results of prefetching are more satisfying if we use a highly efficient and scalable mining technique such as the Bidirectional Growth based Directed Acyclic Graph. In this paper, we propose a novel algorithm called Bidirectional Growth based mining Cyclic behavior Analysis of web sequential Patterns (BGCAP) that effectively combines these strategies to generate prefetching rules in the form of 2-sequence patterns with Periodicity and threshold of Cyclic Behaviour that can be utilized to effectively prefetch Web pages, thus reducing the user-perceived latency. As BGCAP is based on Bidirectional pattern growth, it performs only (log n+1) levels of recursion for mining n Web sequential patterns. Our experimental results show that generating prefetching rules using BGCAP is 5-10 percent faster for different data sizes and 10-15 percent faster for a fixed data size than TD-Mine. In addition, BGCAP generates about 5-15 percent more prefetching rules than TD-Mine. Comment: 19 page

    Selected Topics in Network Optimization: Aligning Binary Decision Diagrams for a Facility Location Problem and a Search Method for Dynamic Shortest Path Interdiction

    This work deals with three different combinatorial optimization problems: minimizing the total size of a pair of binary decision diagrams (BDDs) under a certain structural property, a variant of the facility location problem, and a dynamic version of the Shortest-Path Interdiction (DSPI) problem. However, these problems all have the following core idea in common: They all stem from representing an optimization problem as a decision diagram. We begin from cases in which such a diagram representation of reasonable size might exist, but finding a small diagram is difficult to achieve. The first problem develops a heuristic for enforcing a structural property for a collection of BDDs, which allows them to be merged into a single one efficiently. In the second problem, we consider a specific combinatorial problem that allows for a natural representation by a pair of BDDs. We use the previous result and ideas developed earlier in the literature to reformulate this problem as a linear program over a single BDD. This approach enables us to obtain sensitivity information, while often enjoying runtimes comparable to a mixed integer program solved with a commercial solver, after we pay the computational overhead of building the diagram (e.g., when re-solving the problem using different costs, but the same graph topology). In the last part, we examine DSPI, for which building the full decision diagram is generally impractical. We formalize the concept of a game tree for the DSPI and design a heuristic based on the idea of building only selected parts of this exponentially-sized decision diagram (which is not binary any more). We use a Monte Carlo Tree Search framework to establish policies that are near optimal. To mitigate the size of the game tree, we leverage previously derived bounds for the DSPI and employ an alpha–beta pruning technique for minimax optimization. We highlight the practicality of these ideas in a series of numerical experiments

    Are there any good digraph width measures?

    Many width measures for directed graphs have been proposed in the last few years in pursuit of generalizing (the notion of) treewidth to directed graphs. However, none of these measures possesses, at the same time, the major properties of treewidth, namely, 1. being algorithmically useful, that is, admitting polynomial-time algorithms for a large class of problems on digraphs of bounded width (e.g. the problems definable in $MSO_1$); 2. having nice structural properties such as being (at least nearly) monotone under taking subdigraphs and some form of arc contractions (a property closely related to characterizability by particular cops-and-robber games). We investigate the question whether the search for directed treewidth counterparts has been unsuccessful by accident, or whether it has been doomed to fail from the beginning. Our main result states that any reasonable width measure for directed graphs which satisfies the two properties above must necessarily be similar to treewidth of the underlying undirected graph.