
    Adaptive Computation of the Swap-Insert Correction Distance

    The Swap-Insert Correction distance from a string $S$ of length $n$ to another string $L$ of length $m \geq n$ on the alphabet $[1..d]$ is the minimum number of insertions, and of swaps of pairs of adjacent symbols, converting $S$ into $L$. Unlike other correction distances, computing it is NP-hard in the size $d$ of the alphabet. We describe an algorithm computing this distance in time within $O(d^2 n m g^{d-1})$, where $n_\alpha$ is the number of occurrences of $\alpha$ in $S$, $m_\alpha$ is the number of occurrences of $\alpha$ in $L$, and $g = \max_{\alpha \in [1..d]} \min\{n_\alpha, m_\alpha - n_\alpha\}$ measures the difficulty of the instance. The difficulty $g$ is bounded from above by various terms, such as the length of the shorter string $S$ and the maximum number of occurrences of a single character in $S$. These results illustrate how, in many cases, the correction distance between two strings can be easier to compute than in the worst case.

    Comment: 16 pages, no figures, long version of the extended abstract accepted to SPIRE 201
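    As a rough illustration (not code from the paper), the difficulty measure $g$ can be computed directly from symbol counts; the helper name difficulty_g and the example strings below are ours:

```python
from collections import Counter

def difficulty_g(S: str, L: str) -> int:
    """Difficulty measure from the abstract: g = max over symbols a of
    min(n_a, m_a - n_a), with n_a, m_a the counts of a in S and L."""
    n, m = Counter(S), Counter(L)
    return max(min(n[a], m[a] - n[a]) for a in m)

# Example (ours): L extends S by one 'b', so g = min(1, 2 - 1) = 1.
print(difficulty_g("aba", "abba"))  # -> 1
```

    When $g$ is small, for instance when $L$ barely extends $S$, the factor $g^{d-1}$ collapses and the $O(d^2 n m g^{d-1})$ bound approaches $O(d^2 n m)$, which is the sense in which $g$ captures instance difficulty.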

    A Reformulation of Matrix Graph Grammars with Boolean Complexes

    Prior publication in the Electronic Journal of Combinatorics. Graph transformation is concerned with the manipulation of graphs by means of rules. Graph grammars have traditionally been studied using techniques from category theory. In previous work, we introduced Matrix Graph Grammars (MGG) as a purely algebraic approach to the study of graph dynamics, based on representing simple graphs by their adjacency matrices. The observation that, in addition to positive information, a rule implicitly defines negative conditions for its application (edges cannot become dangling and cannot be added twice, as we work with simple digraphs) has led to a representation of graphs as two matrices encoding positive and negative information. Using this representation, we reformulate the main concepts in MGG and introduce new ideas. In particular, we present (i) a new formulation of productions together with an abstraction of them (so-called swaps), (ii) the notion of coherence, which checks whether a production sequence can potentially be applied, (iii) the minimal graph enabling the applicability of a sequence, and (iv) the conditions for compatibility of sequences (lack of dangling edges) and G-congruence (whether two sequences have the same minimal initial graph).

    This work has been partially sponsored by the Spanish Ministry of Science and Innovation, project METEORIC (TIN2008-02081/TIN).
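    The following sketch (ours, not the paper's formalism) conveys the flavor of the matrix-algebraic view: a simple digraph is a boolean adjacency matrix, a production is sketched as a pair of erase/add matrices, and the negative conditions forbid erasing absent edges or adding duplicates:

```python
import numpy as np

# A simple digraph on 3 nodes as a boolean adjacency matrix.
G = np.array([[0, 1, 0],
              [0, 0, 1],
              [0, 0, 0]], dtype=bool)

# A production sketched as edges to erase (E) and edges to add (R).
# This encoding is our illustration; the paper's productions also
# carry a negative (nihilation) matrix and handle node deletion.
E = np.array([[0, 1, 0], [0, 0, 0], [0, 0, 0]], dtype=bool)
R = np.array([[0, 0, 1], [0, 0, 0], [0, 0, 0]], dtype=bool)

# Negative conditions in spirit: every erased edge must exist, and
# no added edge may already exist (we work with simple digraphs).
assert (G & E).sum() == E.sum() and not (G & R).any()

H = (G & ~E) | R  # apply the production purely algebraically
print(H.astype(int))
```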

    Matching Lenses: Alignment and View Update

    Bidirectional programming languages have been proposed as a practical approach to the view update problem. Programs in these languages, often called lenses, can be read in two ways: from left to right as functions mapping sources to views, and from right to left as functions mapping updated views back to updated sources. Lenses address the view update problem by making it possible to define a view and its associated update policy together. One issue that has not received sufficient attention in the design of bidirectional languages is alignment. In general, to correctly propagate an update to a view, a lens needs to match up the pieces of the edited view with the corresponding pieces of the underlying source. Unfortunately, existing bidirectional languages are extremely limited in their treatment of alignment: they only support simple strategies that do not suffice for many examples of practical interest. In this paper, we propose a novel framework of matching lenses that extends basic lenses with new mechanisms for calculating and using alignments. We enrich the types of lenses with “chunks” that identify the locations of data that should be re-aligned after updates, and we formulate refined behavioral laws that capture essential constraints on the handling of chunks. To demonstrate the utility of our approach, we develop a core language of matching lenses for string data, and we extend it with primitives for describing a number of useful alignment heuristics.
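    A minimal sketch (ours, not the combinators of the matching-lenses language) of the two readings of a lens, with key-based alignment standing in for the paper's chunks:

```python
from dataclasses import dataclass
from typing import Callable

# A lens is a get/put pair; all names here are our illustration.
@dataclass
class Lens:
    get: Callable  # source -> view
    put: Callable  # (updated view, old source) -> updated source

def get(src):
    # The view exposes only (key, name); the role stays source-side.
    return [(k, name) for (k, name, _role) in src]

def put(view, src):
    # Align edited view entries with source entries by key, so hidden
    # source data follows its record even if the view was reordered.
    roles = {k: role for (k, _name, role) in src}
    return [(k, name, roles.get(k, "?")) for (k, name) in view]

people = Lens(get, put)
src = [(1, "ana", "editor"), (2, "bob", "author")]
edited = [(2, "bob"), (1, "anna")]  # reorder and rename in the view
print(people.put(edited, src))
# [(2, 'bob', 'author'), (1, 'anna', 'editor')] - roles follow keys
```

    A purely positional put would instead hand ana's role to bob after the reorder; making such alignment strategies programmable is the gap the matching-lenses framework addresses.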

    A Conditional Random Field for Discriminatively-trained Finite-state String Edit Distance

    The need to measure sequence similarity arises in information extraction, object identity, data mining, biological sequence analysis, and other domains. This paper presents discriminative string-edit CRFs, a finite-state conditional random field model for edit sequences between strings. Conditional random fields have advantages over generative approaches to this problem, such as pair HMMs or the work of Ristad and Yianilos, because as conditionally-trained methods they enable the use of complex, arbitrary actions and features of the input strings. As in generative models, the training data does not have to specify the edit sequences between the given string pairs. Unlike generative models, however, our model is trained on both positive and negative instances of string pairs. We present positive experimental results on several data sets.
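    For orientation, the dynamic program underneath such models is an ordinary weighted edit distance over insert/delete/substitute operations; in the CRF, the per-operation scores become log-potentials computed from features of the strings and trained discriminatively. The fixed weights below are purely our illustration:

```python
def weighted_edit_distance(x: str, y: str,
                           w_ins=1.0, w_del=1.0, w_sub=1.0) -> float:
    """Standard edit-distance DP with per-operation weights. In a
    string-edit CRF these weights would be feature-based, learned
    scores rather than constants."""
    n, m = len(x), len(y)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + w_del
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + w_ins
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if x[i - 1] == y[j - 1] else w_sub
            d[i][j] = min(d[i - 1][j] + w_del,
                          d[i][j - 1] + w_ins,
                          d[i - 1][j - 1] + sub)
    return d[n][m]

print(weighted_edit_distance("ristad", "yianilos"))
```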

    Database Streaming Compression on Memory-Limited Machines

    Dynamic Huffman compression algorithms operate on data streams with a bounded symbol list; the complete list of symbols must be contained in main memory or secondary storage. A horizontal-format transaction database that is streaming can have a very large item list, and a tree with many nodes taxes both the primary memory of the processing hardware and the processing time needed to maintain the tree dynamically. This research investigated Huffman compression of a transaction-streaming database with a very large symbol list, where each item in the transaction database schema’s item list is a symbol to compress. The constraint of a large symbol list is, in this research, equivalent to the constraint of a memory-limited machine: a large symbol set will result if each item in a large database item list is a symbol to compress in a database stream. In addition, database streams may have a temporal component spanning months or years. Finally, the horizontal format is the format best suited to a streaming transaction database because the transaction IDs are not known beforehand.

    This research prototypes an algorithm that compresses a transaction database stream. The memory-limited dynamic Huffman algorithm has several advantages. Dynamic Huffman algorithms are single-pass algorithms, and in many instances a second pass over the data is not possible, such as with streaming databases. Previous dynamic Huffman algorithms are not memory-limited: their memory use is asymptotically O(n), where n is the number of distinct item IDs, since memory must grow to fit the n items. The improvement of the new memory-limited dynamic Huffman algorithm is that it has an O(k) asymptotic memory requirement, where k is the maximum number of nodes in the Huffman tree, k < n, and k is a user-chosen constant. The new memory-limited dynamic Huffman algorithm compresses horizontally encoded transaction databases that do not contain long runs of 0’s or 1’s.
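    The dissertation's algorithm is not reproduced here; the sketch below (ours) only conveys how a hard cap of k tracked symbols keeps memory at O(k): evict the least-frequent entry when an unseen item arrives, then build a Huffman code over the survivors. A real dynamic (FGK/Vitter-style) coder would update the tree incrementally and emit an escape code for unseen or evicted symbols:

```python
from collections import Counter
import heapq

def bounded_code(stream, k):
    """Track at most k distinct items (our eviction policy, for
    illustration only), then return Huffman code lengths."""
    counts = Counter()
    for item in stream:
        if item not in counts and len(counts) >= k:
            victim, _ = min(counts.items(), key=lambda kv: kv[1])
            del counts[victim]  # O(k) memory regardless of |alphabet|
        counts[item] += 1
    # Standard Huffman construction over the k survivors.
    heap = [(c, [s]) for s, c in counts.items()]
    heapq.heapify(heap)
    depth = {s: 0 for s in counts}
    while len(heap) > 1:
        c1, s1 = heapq.heappop(heap)
        c2, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            depth[s] += 1  # one level deeper in the merged subtree
        heapq.heappush(heap, (c1 + c2, s1 + s2))
    return depth  # code length per surviving item

print(bounded_code("abracadabra", k=3))
```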