1,126 research outputs found

    A comparative evaluation of name-matching algorithms

    Full text link
    Name matching—recognizing when two different strings are likely to denote the same entity—is an important task in many legal information systems, such as case-management systems. The naming conventions peculiar to legal cases limit the effectiveness of generic approximate string-matching algorithms in this task. This paper proposes a three-stage framework for name matching, identifies how each stage in the framework addresses the naming variations that typically arise in legal cases, describes several alternative approaches to each stage, and evaluates the performance of various combinations of the alternatives on a representative collection of names drawn from a United States District Court case management system. The best tradeoff between accuracy and efficiency in this collection was achieved by algorithms that standardize capitalization, spacing, and punctuation; filter redundant terms; index using an abstraction function that is both order-insensitive and tolerant of small numbers of omissions or additions; and compare names in a symmetrical, word-by-word fashion.
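
    As a rough illustration of the kind of three-stage pipeline the abstract describes, the Python sketch below standardizes a name, filters low-information terms, builds an order-insensitive index key, and scores pairs word by word. The stop-term list, the set-based index key, and the Jaccard-style scoring are placeholder assumptions for illustration, not the paper's actual algorithms.

```python
import re
import string

# Illustrative stop terms; the paper's redundant-term filter is not reproduced here.
STOP_TERMS = {"inc", "llc", "co", "corp", "the", "of", "and"}

def standardize(name: str) -> list[str]:
    """Stage 1: normalize capitalization, spacing, and punctuation."""
    name = name.lower()
    name = name.translate(str.maketrans("", "", string.punctuation))
    return [t for t in re.split(r"\s+", name) if t]

def filter_terms(tokens: list[str]) -> list[str]:
    """Stage 2: drop terms that carry little discriminating power."""
    return [t for t in tokens if t not in STOP_TERMS]

def index_key(tokens: list[str]) -> frozenset[str]:
    """Stage 3a: an order-insensitive abstraction used to retrieve candidates.
    A token set ignores word order and, with a partial-overlap lookup (not
    shown), tolerates a few added or missing words."""
    return frozenset(tokens)

def word_by_word_score(a: list[str], b: list[str]) -> float:
    """Stage 3b: symmetric word-by-word comparison (Jaccard overlap here)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

x = filter_terms(standardize("Smith & Jones, Inc."))
y = filter_terms(standardize("SMITH and JONES"))
print(word_by_word_score(x, y))  # 1.0: the two names collapse to the same tokens
```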

    IST Austria Technical Report

    Get PDF
    The edit distance between two (untimed) traces is the minimum cost of a sequence of edit operations (insertion, deletion, or substitution) needed to transform one trace to the other. Edit distances have been extensively studied in the untimed setting, and form the basis for approximate matching of sequences in different domains such as coding theory, parsing, and speech recognition. In this paper, we lift the study of edit distances from untimed languages to the timed setting. We define an edit distance between timed words which incorporates both the edit distance between the untimed words and the absolute difference in timestamps. Our edit distance between two timed words is computable in polynomial time. Further, we show that the edit distance between a timed word and a timed language generated by a timed automaton, defined as the edit distance between the word and the closest word in the language, is PSPACE-complete. While computing the edit distance between two timed automata is undecidable, we show that the approximate version, where we decide if the edit distance between two timed automata is either less than a given parameter or more than $\delta$ away from the parameter, for $\delta > 0$, can be solved in exponential space and is EXPSPACE-hard. Our definitions and techniques can be generalized to the setting of hybrid systems, and we show analogous decidability results for rectangular automata.
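
    A dynamic-programming sketch of a distance of this shape is given below. The specific cost model (unit cost for insertions and deletions, and, for aligned positions, the absolute timestamp difference plus 1 if the letters differ) is an assumption chosen for illustration; it is not necessarily the paper's exact definition.

```python
def timed_edit_distance(u, v):
    """DP over prefixes of two timed words u, v, each a list of (letter, timestamp)
    pairs. Assumed cost model: insertions/deletions cost 1; aligning two positions
    costs |t1 - t2| plus 1 if the letters differ."""
    n, m = len(u), len(v)
    # dp[i][j] = distance between u[:i] and v[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            (a, ta), (b, tb) = u[i - 1], v[j - 1]
            align = dp[i - 1][j - 1] + abs(ta - tb) + (0 if a == b else 1)
            dp[i][j] = min(dp[i - 1][j] + 1,   # delete from u
                           dp[i][j - 1] + 1,   # insert into u
                           align)              # match / substitute
    return dp[n][m]

print(timed_edit_distance([("a", 0.5), ("b", 1.2)], [("a", 0.7), ("c", 1.2)]))
# ≈ 1.2: a 0.2 timestamp shift on 'a' plus 1 for substituting b -> c (under the assumed costs)
```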

    Edit Distance for Pushdown Automata

    Get PDF
    The edit distance between two words $w_1, w_2$ is the minimal number of word operations (letter insertions, deletions, and substitutions) necessary to transform $w_1$ to $w_2$. The edit distance generalizes to languages $\mathcal{L}_1, \mathcal{L}_2$, where the edit distance from $\mathcal{L}_1$ to $\mathcal{L}_2$ is the minimal number $k$ such that for every word from $\mathcal{L}_1$ there exists a word in $\mathcal{L}_2$ with edit distance at most $k$. We study the edit distance computation problem between pushdown automata and their subclasses. The problem of computing the edit distance to a pushdown automaton is undecidable, and in practice the interesting question is to compute the edit distance from a pushdown automaton (the implementation, a standard model for programs with recursion) to a regular language (the specification). In this work, we present a complete picture of decidability and complexity for the following problems: (1) deciding whether, for a given threshold $k$, the edit distance from a pushdown automaton to a finite automaton is at most $k$, and (2) deciding whether the edit distance from a pushdown automaton to a finite automaton is finite. Comment: An extended version of a paper accepted to ICALP 2015 with the same title. The paper has been accepted to the LMCS journal.
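
    For intuition, the sketch below computes the regular-language special case: the edit distance from a single word to the language of a finite automaton, via a Levenshtein-style dynamic program over (prefix of the word, automaton state). The NFA encoding and the unit edit costs are assumptions for illustration; the paper's constructions for pushdown automata are substantially more involved.

```python
def word_to_nfa_edit_distance(word, states, start, accepting, delta):
    """Edit distance from `word` to the language of an NFA. `delta` is assumed to
    be a dict mapping (state, letter) -> set of successor states."""
    INF = float("inf")

    def close_under_insertions(dist):
        # Inserting a letter costs 1 and lets the NFA take one transition.
        changed = True
        while changed:
            changed = False
            for (q, _a), targets in delta.items():
                for r in targets:
                    if dist[q] + 1 < dist[r]:
                        dist[r] = dist[q] + 1
                        changed = True
        return dist

    # dist[q] = cheapest cost of editing the consumed prefix into some word
    # that drives the NFA from `start` to state q.
    dist = close_under_insertions({q: (0 if q == start else INF) for q in states})
    for ch in word:
        nxt = {q: dist[q] + 1 for q in states}          # delete ch
        for (q, a), targets in delta.items():
            for r in targets:
                cost = dist[q] + (0 if a == ch else 1)  # match / substitute
                nxt[r] = min(nxt[r], cost)
        dist = close_under_insertions(nxt)
    return min(dist[q] for q in accepting)

# L(A) = a b*; the distance from "cb" is 1 (substitute c -> a).
delta = {("s", "a"): {"t"}, ("t", "b"): {"t"}}
print(word_to_nfa_edit_distance("cb", {"s", "t"}, "s", {"t"}, delta))  # 1
```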

    Weighted Transducers for Robustness Verification

    Get PDF
    Automata theory provides us with fundamental notions such as languages, membership, emptiness and inclusion that in turn allow us to specify and verify properties of reactive systems in a useful manner. However, these notions all yield "yes"/"no" answers that sometimes fall short of being satisfactory answers when the models being analyzed are imperfect, and the observations made are prone to errors. To address this issue, a common engineering approach is not just to verify that a system satisfies a property, but to check whether it does so robustly. We present notions of robustness that place a metric on words, thus providing a natural notion of distance between words. Such a metric naturally leads to a topological neighborhood of words and languages, leading to quantitative and robust versions of the membership, emptiness and inclusion problems. More generally, we consider weighted transducers to model the cost of errors. Such a transducer models neighborhoods of words by providing the cost of rewriting a word into another. The main contribution of this work is to study robustness verification problems in the context of weighted transducers. We provide algorithms for solving the robust and quantitative versions of the membership and inclusion problems while providing useful motivating case studies, including approximate pattern matching problems to detect clinically relevant events in a large type-1 diabetes dataset.
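
    The approximate-pattern-matching case study can be pictured with a standard Sellers-style sketch: report the positions in a long symbol sequence at which some substring lies within a given edit-cost budget of a pattern. The uniform substitution and indel weights, and the toy event signature, are placeholders; a weighted transducer would supply per-symbol rewrite costs instead.

```python
def approximate_occurrences(pattern, text, max_cost, sub_cost=1, indel_cost=1):
    """Report end positions in `text` where some substring is within weighted
    edit cost `max_cost` of `pattern` (classic Sellers-style matching)."""
    m = len(pattern)
    # col[i] = cheapest cost of matching pattern[:i] against some substring of
    # text ending at the current position.
    col = [i * indel_cost for i in range(m + 1)]
    hits = []
    for j, ch in enumerate(text, start=1):
        prev = col
        col = [0]  # a match may start anywhere, so the empty pattern prefix costs 0
        for i in range(1, m + 1):
            match = prev[i - 1] + (0 if pattern[i - 1] == ch else sub_cost)
            col.append(min(match,
                           prev[i] + indel_cost,      # delete ch from the text window
                           col[i - 1] + indel_cost))  # insert pattern[i-1]
        if col[m] <= max_cost:
            hits.append(j)
    return hits

# Toy discretized trace; the symbols and the event signature "HHLL" are placeholders.
print(approximate_occurrences("HHLL", "nnHHLLnnHHxLLn", max_cost=1))
# prints the end positions of all matches within one edit of the pattern
```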

    Testing Membership for Timed Automata

    Full text link
    Given a timed automaton which admits thick components and a timed word $x$, we present a tester which decides if $x$ is in the language of the automaton or if $x$ is $\epsilon$-far from the language, using finitely many samples taken from the weighted time distribution $\mu$ associated with an input $x$. We introduce a distance between timed words, the timed edit distance, which generalizes the classical edit distance. A timed word $x$ is $\epsilon$-far from a timed language if its relative distance to the language is greater than $\epsilon$. Comment: 26 pages.
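
    The $\epsilon$-far notion itself is easy to state in code, as in the small sketch below. Normalizing by the length of $x$, and enumerating candidate language words at all, are simplifications assumed here for illustration; the point of the paper is precisely to avoid such enumeration by sampling finitely many times from $\mu$.

```python
def is_eps_far(x, candidate_words, timed_edit_distance, eps):
    """x is eps-far from a language if its relative distance to the closest word
    exceeds eps. Enumerating `candidate_words` is only feasible for a finite
    sample of the language, and dividing by len(x) is an assumed choice of
    'relative' distance; both are simplifications for illustration."""
    d = min(timed_edit_distance(x, w) for w in candidate_words)
    return d / max(len(x), 1) > eps
```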

    IST Austria Technical Report

    Get PDF
    The edit distance between two words w1, w2 is the minimal number of word operations (letter insertions, deletions, and substitutions) necessary to transform w1 to w2. The edit distance generalizes to languages L1, L2, where the edit distance is the minimal number k such that for every word from L1 there exists a word in L2 with edit distance at most k. We study the edit distance computation problem between pushdown automata and their subclasses. The problem of computing edit distance to a pushdown automaton is undecidable, and in practice, the interesting question is to compute the edit distance from a pushdown automaton (the implementation, a standard model for programs with recursion) to a regular language (the specification). In this work, we present a complete picture of decidability and complexity for deciding whether, for a given threshold k, the edit distance from a pushdown automaton to a finite automaton is at most k.

    Unsupervised Detection of Cell-Assembly Sequences by Similarity-Based Clustering

    Get PDF
    Neurons that fire in a fixed temporal pattern (i.e., "cell assemblies") are hypothesized to be a fundamental unit of neural information processing. Several methods are available for the detection of cell assemblies without a time structure. However, the systematic detection of cell assemblies with time structure has been challenging, especially in large datasets, due to the lack of efficient methods for handling the time structure. Here, we show a method to detect a variety of cell-assembly activity patterns recurring in noisy neural population activities at multiple timescales. The key innovation is the use of a computer-science method for comparing strings ("edit similarity") to group spikes into assemblies. We validated the method using artificial data and experimental data, which were previously recorded from the hippocampus of male Long-Evans rats and the prefrontal cortex of male Brown Norway/Fisher hybrid rats. From the hippocampus, we could simultaneously extract place-cell sequences occurring on different timescales during navigation and awake replay. From the prefrontal cortex, we could discover multiple spike sequences of neurons encoding different segments of a goal-directed task. Unlike conventional event-driven statistical approaches, our method detects cell assemblies without creating event-locked averages. Thus, the method offers a novel analytical tool for deciphering the neural code during arbitrary behavioral and mental processes.
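
    A toy version of the similarity-based grouping can be sketched as follows: represent each activity window as the sequence of neuron IDs in firing order, score pairs with an edit-style similarity, and group windows greedily. difflib's ratio stands in for the gap-penalized edit similarity used in the paper, and the greedy single-linkage grouping is only a stand-in for its clustering step.

```python
from difflib import SequenceMatcher

def edit_similarity(seq_a, seq_b):
    """Similarity between two spike sequences, each a list of neuron IDs ordered
    by spike time. difflib's subsequence-based ratio is used as a stand-in for
    the paper's edit similarity."""
    return SequenceMatcher(None, seq_a, seq_b).ratio()

def cluster_sequences(sequences, threshold=0.7):
    """Greedy single-linkage grouping: a sequence joins the first cluster that
    contains a member at least `threshold`-similar to it. Illustrative only."""
    clusters = []
    for seq in sequences:
        for cluster in clusters:
            if any(edit_similarity(seq, member) >= threshold for member in cluster):
                cluster.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters

# Toy spike sequences (neuron IDs in firing order); the third is unrelated noise.
windows = [[1, 2, 3, 4], [1, 2, 4], [7, 9, 8, 2]]
print(cluster_sequences(windows))  # the first two group together, the third stays alone
```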

    Fast Filter-and-Refine Algorithms for Subsequence Selection

    Get PDF