Search CORE

31,392 research outputs found

A Grammatical Inference Approach to Language-Based Anomaly Detection in XML

Author: Lampesberger Harald
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2013
Field of study

False-positives are a problem in anomaly-based intrusion detection systems. To counter this issue, we discuss anomaly detection for the eXtensible Markup Language (XML) in a language-theoretic view. We argue that many XML-based attacks target the syntactic level, i.e. the tree structure or element content, and syntax validation of XML documents reduces the attack surface. XML offers so-called schemas for validation, but in real world, schemas are often unavailable, ignored or too general. In this work-in-progress paper we describe a grammatical inference approach to learn an automaton from example XML documents for detecting documents with anomalous syntax. We discuss properties and expressiveness of XML to understand limits of learnability. Our contributions are an XML Schema compatible lexical datatype system to abstract content in XML and an algorithm to learn visibly pushdown automata (VPA) directly from a set of examples. The proposed algorithm does not require the tree representation of XML, so it can process large documents or streams. The resulting deterministic VPA then allows stream validation of documents to recognize deviations in the underlying tree structure or datatypes.Comment: Paper accepted at First Int. Workshop on Emerging Cyberthreats and Countermeasures ECTCM 201

arXiv.org e-Print Archive

CiteSeerX

Crossref

Learning Moore Machines from Input-Output Traces

Author: A Gupta
A Solar-Lezama
AV Aleksandrov
AW Biermann
B Jonsson
C Higuera de la
CL Heitmeyer
D Angluin
D Lee
EM Gold
EM Gold
F Aarts
F Aarts
F Howar
HI Akram
IP Buzhinsky
J Oncina
K Meinke
K Takahashi
KJ Lang
LPJ Veelenturf
M Shahbaz
M Spichakova
MA Colón
MJ Heule
O Grinchtein
P Dupont
R Alur
R Dorofeeva
S Cassel
T Berg
TM Mitchell
TS Chow
V Ulyantsev
X Jin
Z Kohavi
Publication venue
Publication date: 02/09/2016
Field of study

The problem of learning automata from example traces (but no equivalence or membership queries) is fundamental in automata learning theory and practice. In this paper we study this problem for finite state machines with inputs and outputs, and in particular for Moore machines. We develop three algorithms for solving this problem: (1) the PTAP algorithm, which transforms a set of input-output traces into an incomplete Moore machine and then completes the machine with self-loops; (2) the PRPNI algorithm, which uses the well-known RPNI algorithm for automata learning to learn a product of automata encoding a Moore machine; and (3) the MooreMI algorithm, which directly learns a Moore machine using PTAP extended with state merging. We prove that MooreMI has the fundamental identification in the limit property. We also compare the algorithms experimentally in terms of the size of the learned machine and several notions of accuracy, introduced in this paper. Finally, we compare with OSTIA, an algorithm that learns a more general class of transducers, and find that OSTIA generally does not learn a Moore machine, even when fed with a characteristic sample

arXiv.org e-Print Archive

Crossref

The incremental use of morphological information and lexicalization in data-driven dependency parsing

Author: Eryigit Gulsen
Eryiğit Gülşen
Nivre Joakim
Oflazer Kemal
Publication venue: 'Springer Fachmedien Wiesbaden GmbH'
Publication date: 01/12/2006
Field of study

Typological diversity among the natural languages of the world poses interesting challenges for the models and algorithms used in syntactic parsing. In this paper, we apply a data-driven dependency parser to Turkish, a language characterized by rich morphology and flexible constituent order, and study the effect of employing varying amounts of morpholexical information on parsing performance. The investigations show that accuracy can be improved by using representations based on inflectional groups rather than word forms, confirming earlier studies. In addition, lexicalization and the use of rich morphological features are found to have a positive effect. By combining all these techniques, we obtain the highest reported accuracy for parsing the Turkish Treebank

Sabanci University Research Database

Learn with SAT to Minimize B\"uchi Automata

Author: Barth Stephan
Hofmann Martin
Publication venue: 'Open Publishing Association'
Publication date: 01/10/2012
Field of study

We describe a minimization procedure for nondeterministic B\"uchi automata (NBA). For an automaton A another automaton A_min with the minimal number of states is learned with the help of a SAT-solver. This is done by successively computing automata A' that approximate A in the sense that they accept a given finite set of positive examples and reject a given finite set of negative examples. In the course of the procedure these example sets are successively increased. Thus, our method can be seen as an instance of a generic learning algorithm based on a "minimally adequate teacher" in the sense of Angluin. We use a SAT solver to find an NBA for given sets of positive and negative examples. We use complementation via construction of deterministic parity automata to check candidates computed in this manner for equivalence with A. Failure of equivalence yields new positive or negative examples. Our method proved successful on complete samplings of small automata and of quite some examples of bigger automata. We successfully ran the minimization on over ten thousand automata with mostly up to ten states, including the complements of all possible automata with two states and alphabet size three and discuss results and runtimes; single examples had over 100 states.Comment: In Proceedings GandALF 2012, arXiv:1210.202

arXiv.org e-Print Archive

Directory of Open Access Journals

Learning probability distributions generated by finite-state machines

Author: Castro Rabal Jorge
Gavaldà Mestre Ricard
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2016
Field of study

We review methods for inference of probability distributions generated by probabilistic automata and related models for sequence generation. We focus on methods that can be proved to learn in the inference in the limit and PAC formal models. The methods we review are state merging and state splitting methods for probabilistic deterministic automata and the recently developed spectral method for nondeterministic probabilistic automata. In both cases, we derive them from a high-level algorithm described in terms of the Hankel matrix of the distribution to be learned, given as an oracle, and then describe how to adapt that algorithm to account for the error introduced by a finite sample.Peer ReviewedPostprint (author's final draft

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC