758 research outputs found

    On Sensitivity of Compact Directed Acyclic Word Graphs

    Full text link
    Compact directed acyclic word graphs (CDAWGs) [Blumer et al. 1987] are a fundamental data structure on strings with applications in text pattern searching, data compression, and pattern discovery. Intuitively, the CDAWG of a string TT is obtained by merging isomorphic subtrees of the suffix tree [Weiner 1973] of the same string TT, thus CDAWGs are a compact indexing structure. In this paper, we investigate the sensitivity of CDAWGs when a single character edit operation (insertion, deletion, or substitution) is performed at the left-end of the input string TT, namely, we are interested in the worst-case increase in the size of the CDAWG after a left-end edit operation. We prove that if ee is the number of edges of the CDAWG for string TT, then the number of new edges added to the CDAWG after a left-end edit operation on TT is less than ee. Further, we present almost matching lower bounds on the sensitivity of CDAWGs for all cases of insertion, deletion, and substitution.Comment: This is a full version of the paper accepted for WORDS 202

    The Rightmost Equal-Cost Position Problem

    Full text link
    LZ77-based compression schemes compress the input text by replacing factors in the text with an encoded reference to a previous occurrence formed by the couple (length, offset). For a given factor, the smallest is the offset, the smallest is the resulting compression ratio. This is optimally achieved by using the rightmost occurrence of a factor in the previous text. Given a cost function, for instance the minimum number of bits used to represent an integer, we define the Rightmost Equal-Cost Position (REP) problem as the problem of finding one of the occurrences of a factor which cost is equal to the cost of the rightmost one. We present the Multi-Layer Suffix Tree data structure that, for a text of length n, at any time i, it provides REP(LPF) in constant time, where LPF is the longest previous factor, i.e. the greedy phrase, a reference to the list of REP({set of prefixes of LPF}) in constant time and REP(p) in time O(|p| log log n) for any given pattern p

    Sliding Window Property Testing for Regular Languages

    Get PDF
    We study the problem of recognizing regular languages in a variant of the streaming model of computation, called the sliding window model. In this model, we are given a size of the sliding window n and a stream of symbols. At each time instant, we must decide whether the suffix of length n of the current stream ("the active window") belongs to a given regular language. Recent works [Moses Ganardi et al., 2018; Moses Ganardi et al., 2016] showed that the space complexity of an optimal deterministic sliding window algorithm for this problem is either constant, logarithmic or linear in the window size n and provided natural language theoretic characterizations of the space complexity classes. Subsequently, [Moses Ganardi et al., 2018] extended this result to randomized algorithms to show that any such algorithm admits either constant, double logarithmic, logarithmic or linear space complexity. In this work, we make an important step forward and combine the sliding window model with the property testing setting, which results in ultra-efficient algorithms for all regular languages. Informally, a sliding window property tester must accept the active window if it belongs to the language and reject it if it is far from the language. We show that for every regular language, there is a deterministic sliding window property tester that uses logarithmic space and a randomized sliding window property tester with two-sided error that uses constant space

    Improving the Performance of Thinning Algorithms with Directed Rooted Acyclic Graphs

    Get PDF
    In this paper we propose a strategy to optimize the performance of thinning algorithms. This solution is obtained by combining three proven strategies for binary images neighborhood exploration, namely modeling the problem with an optimal decision tree, reusing pixels from the previous step of the algorithm, and reducing the code footprint by means of Directed Rooted Acyclic Graphs. A complete and open-source benchmarking suite is also provided. Experimental results confirm that the proposed algorithms clearly outperform classical implementations

    Fully-Functional Bidirectional Burrows-Wheeler Indexes and Infinite-Order De Bruijn Graphs

    Get PDF
    Given a string T on an alphabet of size sigma, we describe a bidirectional Burrows-Wheeler index that takes O(|T| log sigma) bits of space, and that supports the addition and removal of one character, on the left or right side of any substring of T, in constant time. Previously known data structures that used the same space allowed constant-time addition to any substring of T, but they could support removal only from specific substrings of T. We also describe an index that supports bidirectional addition and removal in O(log log |T|) time, and that takes a number of words proportional to the number of left and right extensions of the maximal repeats of T. We use such fully-functional indexes to implement bidirectional, frequency-aware, variable-order de Bruijn graphs with no upper bound on their order, and supporting natural criteria for increasing and decreasing the order during traversal

    PiCo: A Domain-Specific Language for Data Analytics Pipelines

    Get PDF
    In the world of Big Data analytics, there is a series of tools aiming at simplifying programming applications to be executed on clusters. Although each tool claims to provide better programming, data and execution models—for which only informal (and often confusing) semantics is generally provided—all share a common under- lying model, namely, the Dataflow model. Using this model as a starting point, it is possible to categorize and analyze almost all aspects about Big Data analytics tools from a high level perspective. This analysis can be considered as a first step toward a formal model to be exploited in the design of a (new) framework for Big Data analytics. By putting clear separations between all levels of abstraction (i.e., from the runtime to the user API), it is easier for a programmer or software designer to avoid mixing low level with high level aspects, as we are often used to see in state-of-the-art Big Data analytics frameworks. From the user-level perspective, we think that a clearer and simple semantics is preferable, together with a strong separation of concerns. For this reason, we use the Dataflow model as a starting point to build a programming environment with a simplified programming model implemented as a Domain-Specific Language, that is on top of a stack of layers that build a prototypical framework for Big Data analytics. The contribution of this thesis is twofold: first, we show that the proposed model is (at least) as general as existing batch and streaming frameworks (e.g., Spark, Flink, Storm, Google Dataflow), thus making it easier to understand high-level data-processing applications written in such frameworks. As result of this analysis, we provide a layered model that can represent tools and applications following the Dataflow paradigm and we show how the analyzed tools fit in each level. Second, we propose a programming environment based on such layered model in the form of a Domain-Specific Language (DSL) for processing data collections, called PiCo (Pipeline Composition). The main entity of this programming model is the Pipeline, basically a DAG-composition of processing elements. This model is intended to give the user an unique interface for both stream and batch processing, hiding completely data management and focusing only on operations, which are represented by Pipeline stages. Our DSL will be built on top of the FastFlow library, exploiting both shared and distributed parallelism, and implemented in C++11/14 with the aim of porting C++ into the Big Data world

    On some one-sided dynamics of cellular automata

    Get PDF
    A dynamical system consists of a space of all possible world states and a transformation of said space. Cellular automata are dynamical systems where the space is a set of one- or two-way infinite symbol sequences and the transformation is defined by a homogenous local rule. In the setting of cellular automata, the geometry of the underlying space allows one to define one-sided variants of some dynamical properties; this thesis considers some such one-sided dynamics of cellular automata. One main topic are the dynamical concepts of expansivity and that of pseudo-orbit tracing property. Expansivity is a strong form of sensitivity to the initial conditions while pseudo-orbit tracing property is a type of approximability. For cellular automata we define one-sided variants of both of these concepts. We give some examples of cellular automata with these properties and prove, for example, that right-expansive cellular automata are chain-mixing. We also show that left-sided pseudo-orbit tracing property together with right-sided expansivity imply that a cellular automaton has the pseudo-orbit tracing property. Another main topic is conjugacy. Two dynamical systems are conjugate if, in a dynamical sense, they are the same system. We show that for one-sided cellular automata conjugacy is undecidable. In fact the result is stronger and shows that the relations of being a factor or a susbsystem are undecidable, too

    Sliding suffix trees simplified

    Full text link
    Sliding suffix trees (Fiala & Greene, 1989) for an input text TT over an alphabet of size σ\sigma and a sliding window WW of TT can be maintained in O(Tlogσ)O(|T| \log \sigma) time and O(W)O(|W|) space. The two previous approaches that achieve this can be categorized into the credit-based approach of Fiala and Greene (1989) and Larsson (1996, 1999), or the batch-based approach proposed by Senft (2005). Brodnik and Jekovec (2018) showed that the sliding suffix tree can be supplemented with leaf pointers in order to find all occurrences of an online query pattern in the current window, and that leaf pointers can be maintained by credit-based arguments as well. The main difficulty in the credit-based approach is in the maintenance of index-pairs that represent each edge. In this paper, we show that valid edge index-pairs can be derived in constant time from leaf pointers, thus reducing the maintenance of edge index-pairs to the maintenance of leaf pointers. We further propose a new simple method which maintains leaf pointers without using credit-based arguments. Our algorithm and proof of correctness are much simpler compared to the credit-based approach, whose analyses were initially flawed (Senft 2005).Comment: 12 pages + 5 pages of appendix. 18 figures in tota

    Online Spectral Clustering on Network Streams

    Get PDF
    Graph is an extremely useful representation of a wide variety of practical systems in data analysis. Recently, with the fast accumulation of stream data from various type of networks, significant research interests have arisen on spectral clustering for network streams (or evolving networks). Compared with the general spectral clustering problem, the data analysis of this new type of problems may have additional requirements, such as short processing time, scalability in distributed computing environments, and temporal variation tracking. However, to design a spectral clustering method to satisfy these requirements certainly presents non-trivial efforts. There are three major challenges for the new algorithm design. The first challenge is online clustering computation. Most of the existing spectral methods on evolving networks are off-line methods, using standard eigensystem solvers such as the Lanczos method. It needs to recompute solutions from scratch at each time point. The second challenge is the parallelization of algorithms. To parallelize such algorithms is non-trivial since standard eigen solvers are iterative algorithms and the number of iterations can not be predetermined. The third challenge is the very limited existing work. In addition, there exists multiple limitations in the existing method, such as computational inefficiency on large similarity changes, the lack of sound theoretical basis, and the lack of effective way to handle accumulated approximate errors and large data variations over time. In this thesis, we proposed a new online spectral graph clustering approach with a family of three novel spectrum approximation algorithms. Our algorithms incrementally update the eigenpairs in an online manner to improve the computational performance. Our approaches outperformed the existing method in computational efficiency and scalability while retaining competitive or even better clustering accuracy. We derived our spectrum approximation techniques GEPT and EEPT through formal theoretical analysis. The well established matrix perturbation theory forms a solid theoretic foundation for our online clustering method. We facilitated our clustering method with a new metric to track accumulated approximation errors and measure the short-term temporal variation. The metric not only provides a balance between computational efficiency and clustering accuracy, but also offers a useful tool to adapt the online algorithm to the condition of unexpected drastic noise. In addition, we discussed our preliminary work on approximate graph mining with evolutionary process, non-stationary Bayesian Network structure learning from non-stationary time series data, and Bayesian Network structure learning with text priors imposed by non-parametric hierarchical topic modeling
    corecore