1,686 research outputs found

    Discovering Restricted Regular Expressions with Interleaving

    Full text link
    Discovering a concise schema from given XML documents is an important problem in XML applications. In this paper, we focus on the problem of learning an unordered schema from a given set of XML examples, which is actually a problem of learning a restricted regular expression with interleaving using positive example strings. Schemas with interleaving could present meaningful knowledge that cannot be disclosed by previous inference techniques. Moreover, inference of the minimal schema with interleaving is challenging. The problem of finding a minimal schema with interleaving is shown to be NP-hard. Therefore, we develop an approximation algorithm and a heuristic solution to tackle the problem using techniques different from known inference algorithms. We do experiments on real-world data sets to demonstrate the effectiveness of our approaches. Our heuristic algorithm is shown to produce results that are very close to optimal.Comment: 12 page

    Deciding Definability by Deterministic Regular Expressions

    Get PDF
    International audienceWe investigate the complexity of deciding whether a given regular language can be defined with a deterministic regular expression. Our main technical result shows that the problem is Pspace-complete if the input language is represented as a regular expression or nondeterministic finite automaton. The problem becomes Expspace-complete if the language is represented as a regular expression with counters

    Automated Code Extraction from Discussion Board Text Dataset

    Full text link
    This study introduces and investigates the capabilities of three different text mining approaches, namely Latent Semantic Analysis, Latent Dirichlet Analysis, and Clustering Word Vectors, for automating code extraction from a relatively small discussion board dataset. We compare the outputs of each algorithm with a previous dataset that was manually coded by two human raters. The results show that even with a relatively small dataset, automated approaches can be an asset to course instructors by extracting some of the discussion codes, which can be used in Epistemic Network Analysis.Comment: LaTeX; typos corrected at page

    String Sanitization Under Edit Distance

    Get PDF
    Let W be a string of length n over an alphabet Σ, k be a positive integer, and be a set of length-k substrings of W. The ETFS problem asks us to construct a string X_{ED} such that: (i) no string of occurs in X_{ED}; (ii) the order of all other length-k substrings over Σ is the same in W and in X_{ED}; and (iii) X_{ED} has minimal edit distance to W. When W represents an individual’s data and represents a set of confidential substrings, algorithms solving ETFS can be applied for utility-preserving string sanitization [Bernardini et al., ECML PKDD 2019]. Our first result here is an algorithm to solve ETFS in (kn²) time, which improves on the state of the art [Bernardini et al., arXiv 2019] by a factor of |Σ|. Our algorithm is based on a non-trivial modification of the classic dynamic programming algorithm for computing the edit distance between two strings. Notably, we also show that ETFS cannot be solved in (n^{2-δ}) time, for any δ>0, unless the strong exponential time hypothesis is false. To achieve this, we reduce the edit distance problem, which is known to admit the same conditional lower bound [Bringmann and Künnemann, FOCS 2015], to ETFS

    A Kleene Theorem for Nominal Automata

    Get PDF
    Nominal automata are a widely studied class of automata designed to recognise languages over infinite alphabets. In this paper, we present a Kleene theorem for nominal automata by providing a syntax to denote regular nominal languages. We use regular expressions with explicit binders for creation and destruction of names and pinpoint an exact property of these expressions - namely memory-finiteness - identifying a subclass of expressions denoting exactly regular nominal languages
    • …
    corecore