A Grammatical Inference Approach to Language-Based Anomaly Detection in XML
False positives are a problem in anomaly-based intrusion detection systems.
To counter this issue, we discuss anomaly detection for the eXtensible Markup
Language (XML) in a language-theoretic view. We argue that many XML-based
attacks target the syntactic level, i.e. the tree structure or element content,
and syntax validation of XML documents reduces the attack surface. XML offers
so-called schemas for validation, but in the real world, schemas are often
unavailable, ignored or too general. In this work-in-progress paper we describe
a grammatical inference approach to learn an automaton from example XML
documents for detecting documents with anomalous syntax.
We discuss properties and expressiveness of XML to understand limits of
learnability. Our contributions are an XML Schema compatible lexical datatype
system to abstract content in XML and an algorithm to learn visibly pushdown
automata (VPA) directly from a set of examples. The proposed algorithm does not
require the tree representation of XML, so it can process large documents or
streams. The resulting deterministic VPA then allows stream validation of
documents to recognize deviations in the underlying tree structure or
datatypes.
Comment: Paper accepted at First Int. Workshop on Emerging Cyberthreats and
Countermeasures ECTCM 201
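To illustrate the stream-validation step described in the abstract above, here is a minimal sketch of running a deterministic VPA over a stream of XML-like events. It is not the paper's learning algorithm: the automaton is hand-built rather than inferred, and the event encoding, the class name `VPA`, the state names, and the `int` datatype label are all assumptions made for this example.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical event encoding: ("open", tag), ("close", tag) or ("text", datatype).
Event = Tuple[str, str]


class VPA:
    """Deterministic visibly pushdown automaton over XML-like events."""

    def __init__(
        self,
        start: str,
        accepting: Set[str],
        call: Dict[Tuple[str, str], Tuple[str, str]],
        ret: Dict[Tuple[str, str, str], str],
        internal: Dict[Tuple[str, str], str],
    ) -> None:
        self.start = start
        self.accepting = accepting
        self.call = call          # (state, tag) -> (next state, pushed symbol)
        self.ret = ret            # (state, tag, popped symbol) -> next state
        self.internal = internal  # (state, datatype) -> next state

    def accepts(self, events: Iterable[Event]) -> bool:
        state = self.start
        stack: List[str] = []
        for kind, label in events:
            if kind == "open":        # call symbol: always pushes
                step = self.call.get((state, label))
                if step is None:
                    return False
                state, symbol = step
                stack.append(symbol)
            elif kind == "close":     # return symbol: always pops
                if not stack:
                    return False
                nxt = self.ret.get((state, label, stack.pop()))
                if nxt is None:
                    return False
                state = nxt
            else:                     # internal symbol: stack untouched
                nxt = self.internal.get((state, label))
                if nxt is None:
                    return False
                state = nxt
        # Accept only well-nested input that ends in an accepting state.
        return not stack and state in self.accepting


# Toy automaton: <a> containing any number of <b> elements with "int" content.
toy = VPA(
    start="q0",
    accepting={"qf"},
    call={("q0", "a"): ("qa", "A"), ("qa", "b"): ("qb", "B")},
    ret={("qbt", "b", "B"): "qa", ("qa", "a", "A"): "qf"},
    internal={("qb", "int"): "qbt"},
)

good = [("open", "a"), ("open", "b"), ("text", "int"), ("close", "b"), ("close", "a")]
bad = [("open", "a"), ("open", "b"), ("text", "string"), ("close", "b"), ("close", "a")]
print(toy.accepts(good), toy.accepts(bad))
```

Because the stack operation is determined by the kind of event alone, the validator reads events one at a time and never needs the full document tree in memory.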
Learning probability distributions generated by finite-state machines
We review methods for inference of probability distributions generated by probabilistic automata and related models for sequence generation. We focus on methods that can be proved to learn in the inference in the limit and PAC formal models. The methods we review are state merging and state splitting methods for probabilistic deterministic automata and the recently developed spectral method for nondeterministic probabilistic automata. In both cases, we derive them from a high-level algorithm described in terms of the Hankel matrix of the distribution to be learned, given as an oracle, and then describe how to adapt that algorithm to account for the error introduced by a finite sample.
Peer Reviewed. Postprint (author's final draft)
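As a rough illustration of the Hankel matrix mentioned in the abstract above, the sketch below estimates a finite block of it from a sample of strings. The entry for prefix u and suffix v is the probability of the concatenation uv, estimated here by relative frequency. The function name `empirical_hankel` and the choice of prefix and suffix sets are assumptions for this example, not taken from the paper.

```python
from collections import Counter
from typing import List, Sequence


def empirical_hankel(
    sample: Sequence[str],
    prefixes: Sequence[str],
    suffixes: Sequence[str],
) -> List[List[float]]:
    """Estimate the Hankel block H[u][v] = p(uv) from a finite sample."""
    counts = Counter(sample)
    n = len(sample)
    # A Counter returns 0 for strings that never occur in the sample.
    return [[counts[u + v] / n for v in suffixes] for u in prefixes]


sample = ["", "a", "a", "aa"]
H = empirical_hankel(sample, prefixes=["", "a"], suffixes=["", "a"])
for row in H:
    print(row)
```

With a finite sample the entries are only estimates, which is the source of the error that the reviewed methods have to account for.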
A Survey of State Merging Strategies for DFA Identification in the Limit
Identification of deterministic finite automata (DFAs) has an extensive history, both in passive learning and in active learning. Intractability results by Gold [5] and Angluin [1] show that finding the smallest automaton consistent with a set of accepted and rejected strings is NP-complete. Nevertheless, a lot of work has been done on learning DFAs from examples within specific heuristics, starting with Trakhtenbrot and Barzdin's algorithm [15], rediscovered and applied to the discipline of grammatical inference by Gold [5]. Many other algorithms have been developed, the convergence of most of which is based on characteristic sets: RPNI (Regular Positive and Negative Inference) by J. Oncina and P. García [11, 12], Traxbar by K. Lang [8], EDSM (Evidence Driven State Merging), Windowed EDSM and Blue-Fringe EDSM by K. Lang, B. Pearlmutter and R. Price [9], SAGE (Self-Adaptive Greedy Estimate) by H. Juillé [7], etc. This paper provides a comprehensive study of the most important state merging strategies developed so far.
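State merging algorithms of the kind surveyed above typically start from a prefix tree acceptor (PTA) built from the labelled examples and then try to merge its states. The sketch below covers only that first step, under the assumption that states are numbered in order of creation. The function name `build_pta` and the data layout are choices made for this example, not taken from any of the cited algorithms.

```python
from typing import Dict, Iterable, Optional, Tuple

Transitions = Dict[Tuple[int, str], int]
Labels = Dict[int, Optional[bool]]


def build_pta(
    positive: Iterable[str], negative: Iterable[str]
) -> Tuple[Transitions, Labels]:
    """Build a prefix tree acceptor; state 0 is the root (empty prefix)."""
    transitions: Transitions = {}
    # True = accepting, False = rejecting, None = no example ends here.
    labels: Labels = {0: None}
    for words, label in ((positive, True), (negative, False)):
        for word in words:
            state = 0
            for symbol in word:
                key = (state, symbol)
                if key not in transitions:
                    new_state = len(labels)  # next unused state number
                    transitions[key] = new_state
                    labels[new_state] = None
                state = transitions[key]
            if labels[state] is not None and labels[state] != label:
                raise ValueError(f"inconsistent label for {word!r}")
            labels[state] = label
    return transitions, labels


transitions, labels = build_pta(positive=["a", "ab"], negative=["b"])
print(transitions)
print(labels)
```

A merging strategy would then repeatedly pick pairs of states from this tree and merge them whenever the result never puts an accepting and a rejecting label on the same state.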
Sensor Synthesis for POMDPs with Reachability Objectives
Partially observable Markov decision processes (POMDPs) are widely used in
probabilistic planning problems in which an agent interacts with an environment
using noisy and imprecise sensors. We study a setting in which the sensors are
only partially defined and the goal is to synthesize "weakest" additional
sensors, such that in the resulting POMDP, there is a small-memory policy for
the agent that almost-surely (with probability 1) satisfies a reachability
objective. We show that the problem is NP-complete, and present a symbolic
algorithm by encoding the problem into SAT instances. We illustrate trade-offs
between the amount of memory of the policy and the number of additional sensors
on a simple example. We have implemented our approach and consider three
classical POMDP examples from the literature, and show that in all the examples
the number of sensors can be significantly decreased (as compared to the
existing solutions in the literature) without increasing the complexity of the
policies.
Comment: arXiv admin note: text overlap with arXiv:1511.0845
Coding-theorem Like Behaviour and Emergence of the Universal Distribution from Resource-bounded Algorithmic Probability
Previously referred to as 'miraculous' in the scientific literature because
of its powerful properties and its wide application as an optimal solution to the
problem of induction/inference, (approximations to) Algorithmic Probability
(AP) and the associated Universal Distribution are (or should be) of the
greatest importance in science. Here we investigate the emergence, the rates of
emergence and convergence, and the Coding-theorem like behaviour of AP in
Turing-subuniversal models of computation. We investigate empirical
distributions of computing models in the Chomsky hierarchy. We introduce
measures of algorithmic probability and algorithmic complexity based upon
resource-bounded computation, in contrast to previously thoroughly investigated
distributions produced from the output distribution of Turing machines. This
approach allows for numerical approximations to algorithmic
(Kolmogorov-Chaitin) complexity-based estimations at each of the levels of a
computational hierarchy. We demonstrate that all these estimations are
correlated in rank and that they converge both in rank and values as a function
of computational power, despite fundamental differences between computational
models. In the context of natural processes that operate below the Turing
universal level because of finite resources and physical degradation, the
investigation of natural biases stemming from algorithmic rules may shed light
on the distribution of outcomes. We show that up to 60% of the
simplicity/complexity bias in distributions produced even by the weakest of the
computational models can be accounted for by Algorithmic Probability in its
approximation to the Universal Distribution.
Comment: 27 pages main text, 39 pages including supplement. Online complexity
calculator: http://complexitycalculator.com