1,126 research outputs found
A comparative evaluation of name-matching algorithms
Name matching—recognizing when two different strings are likely to denote the same entity—is an important task in many legal information systems, such as case-management systems. The naming conventions peculiar to legal cases limit the effectiveness of generic approximate string-matching algorithms in this task. This paper proposes a three-stage framework for name matching, identifies how each stage in the framework addresses the naming variations that typically arise in legal cases, describes several alternative approaches to each stage, and evaluates the performance of various combinations of the alternatives on a representative collection of names drawn from a United States District Court case management system. The best tradeoff between accuracy and efficiency in this collection was achieved by algorithms that standardize capitalization, spacing, and punctuation; filter redundant terms; index using an abstraction function that is both order-insensitive and tolerant of small numbers of omissions or additions; and compare names in a symmetrical, word-by-word fashion. 1
IST Austria Technical Report
The edit distance between two (untimed) traces is the minimum cost of a sequence of edit operations (insertion, deletion, or substitution) needed to transform one trace to the other. Edit distances have been extensively studied in the untimed setting, and form the basis for approximate matching of sequences in different domains such as coding theory, parsing, and speech recognition.
In this paper, we lift the study of edit distances from untimed languages to the timed setting. We define an edit distance between timed words which incorporates both the edit distance between the untimed words and the absolute difference in timestamps. Our edit distance between two timed words is computable in polynomial time. Further, we show that the edit distance between a timed word and a timed language generated by a timed automaton, defined as the edit distance between the word and the closest word in the language, is PSPACE-complete. While computing the edit distance between two timed automata is undecidable, we show that the approximate version, where we decide if the edit distance between two timed automata is either less than a given parameter or more than delta away from the parameter, for delta>0, can be solved in exponential space and is EXPSPACE-hard. Our definitions and techniques can be generalized to the setting of hybrid systems, and we show analogous decidability results for rectangular automata
Edit Distance for Pushdown Automata
The edit distance between two words is the minimal number of word
operations (letter insertions, deletions, and substitutions) necessary to
transform to . The edit distance generalizes to languages
, where the edit distance from to
is the minimal number such that for every word from
there exists a word in with edit distance at
most . We study the edit distance computation problem between pushdown
automata and their subclasses. The problem of computing edit distance to a
pushdown automaton is undecidable, and in practice, the interesting question is
to compute the edit distance from a pushdown automaton (the implementation, a
standard model for programs with recursion) to a regular language (the
specification). In this work, we present a complete picture of decidability and
complexity for the following problems: (1)~deciding whether, for a given
threshold , the edit distance from a pushdown automaton to a finite
automaton is at most , and (2)~deciding whether the edit distance from a
pushdown automaton to a finite automaton is finite.Comment: An extended version of a paper accepted to ICALP 2015 with the same
title. The paper has been accepted to the LMCS journa
Weighted Transducers for Robustness Verification
Automata theory provides us with fundamental notions such as languages, membership, emptiness and inclusion that in turn allow us to specify and verify properties of reactive systems in a useful manner. However, these notions all yield "yes"/"no" answers that sometimes fall short of being satisfactory answers when the models being analyzed are imperfect, and the observations made are prone to errors. To address this issue, a common engineering approach is not just to verify that a system satisfies a property, but whether it does so robustly. We present notions of robustness that place a metric on words, thus providing a natural notion of distance between words. Such a metric naturally leads to a topological neighborhood of words and languages, leading to quantitative and robust versions of the membership, emptiness and inclusion problems. More generally, we consider weighted transducers to model the cost of errors. Such a transducer models neighborhoods of words by providing the cost of rewriting a word into another. The main contribution of this work is to study robustness verification problems in the context of weighted transducers. We provide algorithms for solving the robust and quantitative versions of the membership and inclusion problems while providing useful motivating case studies including approximate pattern matching problems to detect clinically relevant events in a large type-1 diabetes dataset
Testing Membership for Timed Automata
Given a timed automata which admits thick components and a timed word , we
present a tester which decides if is in the language of the automaton or if
is -far from the language, using finitely many samples taken from
the weighted time distribution associated with an input . We introduce
a distance between timed words, the {\em timed edit distance}, which
generalizes the classical edit distance. A timed word is -far
from a timed language if its relative distance to the language is greater than
.Comment: 26 page
IST Austria Technical Report
The edit distance between two words w1, w2 is the minimal number of word operations (letter insertions, deletions, and substitutions) necessary to transform w1 to w2. The edit distance generalizes to languages L1, L2, where the edit distance is the minimal number k such that for every word from L1 there exists a word in L2 with edit distance at most k. We study the edit distance computation problem between pushdown automata and their subclasses.
The problem of computing edit distance to a pushdown automaton is undecidable, and in practice, the interesting question is to compute the edit distance from a pushdown automaton (the implementation, a standard model for programs with recursion) to a regular language (the specification). In this work, we present a complete picture of decidability and complexity for deciding whether, for a given threshold k, the edit distance from a pushdown automaton to a finite automaton is at most k
Unsupervised Detection of Cell-Assembly Sequences by Similarity-Based Clustering
Neurons which fire in a fixed temporal pattern (i.e., "cell assemblies") are hypothesized to be a fundamental unit of neural information processing. Several methods are available for the detection of cell assemblies without a time structure. However, the systematic detection of cell assemblies with time structure has been challenging, especially in large datasets, due to the lack of efficient methods for handling the time structure. Here, we show a method to detect a variety of cell-assembly activity patterns, recurring in noisy neural population activities at multiple timescales. The key innovation is the use of a computer science method to comparing strings ("edit similarity"), to group spikes into assemblies. We validated the method using artificial data and experimental data, which were previously recorded from the hippocampus of male Long-Evans rats and the prefrontal cortex of male Brown Norway/Fisher hybrid rats. From the hippocampus, we could simultaneously extract place-cell sequences occurring on different timescales during navigation and awake replay. From the prefrontal cortex, we could discover multiple spike sequences of neurons encoding different segments of a goal-directed task. Unlike conventional event-driven statistical approaches, our method detects cell assemblies without creating event-locked averages. Thus, the method offers a novel analytical tool for deciphering the neural code during arbitrary behavioral and mental processes
- …