1,686 research outputs found
Discovering Restricted Regular Expressions with Interleaving
Discovering a concise schema from given XML documents is an important problem
in XML applications. In this paper, we focus on the problem of learning an
unordered schema from a given set of XML examples, which is actually a problem
of learning a restricted regular expression with interleaving using positive
example strings. Schemas with interleaving could present meaningful knowledge
that cannot be disclosed by previous inference techniques. Moreover, inference
of the minimal schema with interleaving is challenging. The problem of finding
a minimal schema with interleaving is shown to be NP-hard. Therefore, we
develop an approximation algorithm and a heuristic solution to tackle the
problem using techniques different from known inference algorithms. We do
experiments on real-world data sets to demonstrate the effectiveness of our
approaches. Our heuristic algorithm is shown to produce results that are very
close to optimal.Comment: 12 page
Deciding Definability by Deterministic Regular Expressions
International audienceWe investigate the complexity of deciding whether a given regular language can be defined with a deterministic regular expression. Our main technical result shows that the problem is Pspace-complete if the input language is represented as a regular expression or nondeterministic finite automaton. The problem becomes Expspace-complete if the language is represented as a regular expression with counters
Automated Code Extraction from Discussion Board Text Dataset
This study introduces and investigates the capabilities of three different
text mining approaches, namely Latent Semantic Analysis, Latent Dirichlet
Analysis, and Clustering Word Vectors, for automating code extraction from a
relatively small discussion board dataset. We compare the outputs of each
algorithm with a previous dataset that was manually coded by two human raters.
The results show that even with a relatively small dataset, automated
approaches can be an asset to course instructors by extracting some of the
discussion codes, which can be used in Epistemic Network Analysis.Comment: LaTeX; typos corrected at page
String Sanitization Under Edit Distance
Let W be a string of length n over an alphabet Σ, k be a positive integer, and be a set of length-k substrings of W. The ETFS problem asks us to construct a string X_{ED} such that: (i) no string of occurs in X_{ED}; (ii) the order of all other length-k substrings over Σ is the same in W and in X_{ED}; and (iii) X_{ED} has minimal edit distance to W. When W represents an individual’s data and represents a set of confidential substrings, algorithms solving ETFS can be applied for utility-preserving string sanitization [Bernardini et al., ECML PKDD 2019]. Our first result here is an algorithm to solve ETFS in (kn²) time, which improves on the state of the art [Bernardini et al., arXiv 2019] by a factor of |Σ|. Our algorithm is based on a non-trivial modification of the classic dynamic programming algorithm for computing the edit distance between two strings. Notably, we also show that ETFS cannot be solved in (n^{2-δ}) time, for any δ>0, unless the strong exponential time hypothesis is false. To achieve this, we reduce the edit distance problem, which is known to admit the same conditional lower bound [Bringmann and Künnemann, FOCS 2015], to ETFS
A Kleene Theorem for Nominal Automata
Nominal automata are a widely studied class of automata designed to recognise languages over infinite alphabets. In this paper, we present a Kleene theorem for nominal automata by providing a syntax to denote regular nominal languages. We use regular expressions with explicit binders for creation and destruction of names and pinpoint an exact property of these expressions - namely memory-finiteness - identifying a subclass of expressions denoting exactly regular nominal languages
- …