Discovering Restricted Regular Expressions with Interleaving

Author: A. Ignatiev
E.M. Gold
G.J. Bex
J. Pei
R.W. Bailey
S. Tsukiyama
Publication date: 01/04/2015

Discovering a concise schema from given XML documents is an important problem in XML applications. In this paper, we focus on the problem of learning an unordered schema from a given set of XML examples, which amounts to learning a restricted regular expression with interleaving from positive example strings. Schemas with interleaving can capture meaningful knowledge that previous inference techniques cannot disclose. Inferring the minimal schema with interleaving is challenging, however: the problem of finding a minimal schema with interleaving is shown to be NP-hard. Therefore, we develop an approximation algorithm and a heuristic solution to tackle the problem using techniques different from known inference algorithms. We conduct experiments on real-world data sets to demonstrate the effectiveness of our approaches. Our heuristic algorithm is shown to produce results that are very close to optimal. Comment: 12 page
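To make the interleaving operator mentioned in this abstract concrete, here is a minimal sketch that tests whether a word is an interleaving (shuffle) of two fixed strings, each keeping its own internal order. It only illustrates the operator; it is not the paper's inference algorithm, and the function name `is_interleaving` is a hypothetical chosen for this example.

```python
from functools import lru_cache


def is_interleaving(word: str, left: str, right: str) -> bool:
    """Return True if `word` merges `left` and `right`, preserving the order of each."""
    if len(word) != len(left) + len(right):
        return False

    @lru_cache(maxsize=None)
    def match(i: int, j: int) -> bool:
        # i symbols of `left` and j symbols of `right` have been consumed,
        # so the next symbol of `word` to explain is at position i + j.
        if i == len(left) and j == len(right):
            return True
        k = i + j
        if i < len(left) and left[i] == word[k] and match(i + 1, j):
            return True
        return j < len(right) and right[j] == word[k] and match(i, j + 1)

    return match(0, 0)


# Illustrative strings only: "ab" and "cd" stand in for two ordered groups of XML elements.
print(is_interleaving("acbd", "ab", "cd"))  # True: order within each group is kept
print(is_interleaving("badc", "ab", "cd"))  # False: "b" appears before "a"
```

An unordered schema accepts any such merge of its sub-expressions, which is why a plain concatenation-based regular expression would need to enumerate every ordering explicitly.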

arXiv.org e-Print Archive

Crossref

Deciding Definability by Deterministic Regular Expressions

Author: Czerwiński Wojciech
David Claire
Losemann Katja
Martens Wim
Publication venue: HAL CCSD
Publication date: 01/01/2013

We investigate the complexity of deciding whether a given regular language can be defined with a deterministic regular expression. Our main technical result shows that the problem is PSPACE-complete if the input language is represented as a regular expression or nondeterministic finite automaton. The problem becomes EXPSPACE-complete if the language is represented as a regular expression with counters.

Hal-Diderot

HAL-Ecole des Ponts ParisTech

HAL - UPEC / UPEM

Automated Code Extraction from Discussion Board Text Dataset

Author: Folkestad James
Ghaffari Sadaf
Luther Yanye
Moraes Marcia
Saravani Sina Mahdipour
Publication date: 18/04/2023

This study introduces and investigates the capabilities of three different text mining approaches, namely Latent Semantic Analysis, Latent Dirichlet Analysis, and Clustering Word Vectors, for automating code extraction from a relatively small discussion board dataset. We compare the outputs of each algorithm with a previous dataset that was manually coded by two human raters. The results show that even with a relatively small dataset, automated approaches can be an asset to course instructors by extracting some of the discussion codes, which can be used in Epistemic Network Analysis. Comment: LaTeX; typos corrected at page

arXiv.org e-Print Archive

String Sanitization Under Edit Distance

Author: Bernardini Giulia
Chen Huiping
Gortz Inge Li
Loukides Grigorios
Pisanti Nadia
Pissis Solon P.
Stougie Leen
Sweering Michelle
Weimann Oren
Publication date: 01/01/2020

Let W be a string of length n over an alphabet Σ, k be a positive integer, and S be a set of length-k substrings of W. The ETFS problem asks us to construct a string X_{ED} such that: (i) no string of S occurs in X_{ED}; (ii) the order of all other length-k substrings over Σ is the same in W and in X_{ED}; and (iii) X_{ED} has minimal edit distance to W. When W represents an individual’s data and S represents a set of confidential substrings, algorithms solving ETFS can be applied for utility-preserving string sanitization [Bernardini et al., ECML PKDD 2019]. Our first result here is an algorithm to solve ETFS in O(kn²) time, which improves on the state of the art [Bernardini et al., arXiv 2019] by a factor of |Σ|. Our algorithm is based on a non-trivial modification of the classic dynamic programming algorithm for computing the edit distance between two strings. Notably, we also show that ETFS cannot be solved in O(n^{2-δ}) time, for any δ>0, unless the strong exponential time hypothesis is false. To achieve this, we reduce the edit distance problem, which is known to admit the same conditional lower bound [Bringmann and Künnemann, FOCS 2015], to ETFS.
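For reference, the classic dynamic programming algorithm for edit distance that this abstract builds on can be sketched as follows. This is the textbook unit-cost (Levenshtein) version only, assuming insertions, deletions, and substitutions each cost 1; it is not the ETFS algorithm, and the function name `edit_distance` is a hypothetical chosen for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance between `a` and `b`, computed row by row."""
    # prev[j] holds the distance between the processed prefix of `a` and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if ca == cb else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # 3
```

The table has (|a|+1)·(|b|+1) cells, each filled in constant time, which gives the quadratic running time that the conditional lower bound cited in the abstract refers to.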

Archivio istituzionale della ricerca - Università di Trieste

VU Research Portal

CWI's Institutional Repository

University of Birmingham Research Portal

INRIA a CCSD electronic archive server

Dagstuhl Research Online Publication Server

A Kleene Theorem for Nominal Automata

Author: Brunet Paul
Silva Alexandra
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 46th International Colloquium on Automata, Languages, and Programming (ICALP 2019)
Publication date: 01/01/2019

Nominal automata are a widely studied class of automata designed to recognise languages over infinite alphabets. In this paper, we present a Kleene theorem for nominal automata by providing a syntax to denote regular nominal languages. We use regular expressions with explicit binders for creation and destruction of names and pinpoint an exact property of these expressions - namely memory-finiteness - identifying a subclass of expressions denoting exactly regular nominal languages.

UCL Discovery

Dagstuhl Research Online Publication Server