370 research outputs found
Efficient Inclusion Checking for Deterministic Tree Automata and XML Schemas
Special issue of LATA'08.International audienceWe present algorithms for testing language inclusion L(A) ⊆ L(B) between tree automata in time O(|A| |B|) where B is deterministic (bottom-up or top-down). We extend our algorithms for testing inclusion of automata for unranked trees A in deterministic DTDs or deterministic EDTDs with restrained competition D in time O(|A| |Σ| |D|). Previous algorithms were less efficient or less general
A Grammatical Inference Approach to Language-Based Anomaly Detection in XML
False-positives are a problem in anomaly-based intrusion detection systems.
To counter this issue, we discuss anomaly detection for the eXtensible Markup
Language (XML) in a language-theoretic view. We argue that many XML-based
attacks target the syntactic level, i.e. the tree structure or element content,
and syntax validation of XML documents reduces the attack surface. XML offers
so-called schemas for validation, but in real world, schemas are often
unavailable, ignored or too general. In this work-in-progress paper we describe
a grammatical inference approach to learn an automaton from example XML
documents for detecting documents with anomalous syntax.
We discuss properties and expressiveness of XML to understand limits of
learnability. Our contributions are an XML Schema compatible lexical datatype
system to abstract content in XML and an algorithm to learn visibly pushdown
automata (VPA) directly from a set of examples. The proposed algorithm does not
require the tree representation of XML, so it can process large documents or
streams. The resulting deterministic VPA then allows stream validation of
documents to recognize deviations in the underlying tree structure or
datatypes.Comment: Paper accepted at First Int. Workshop on Emerging Cyberthreats and
Countermeasures ECTCM 201
Efficient Inclusion Checking for Deterministic Tree Automata and DTDs
International audienceWe present a new algorithm for testing language inclusion L(A) ⊆ L(B)L(A) between tree automata in time O(|A| |B|) where B is deterministic. We extend this algorithm for testing inclusion between automata for unranked trees A and deterministic DTDs D in time O(|A| |Σ| |D|). No previous algorithms with these complexities exist. A journal extension is available at http://hal.inria.fr/inria-00366082
Query Induction with Schema-Guided Pruning Strategies
International audienceInference algorithms for tree automata that define node selecting queries in unranked trees rely on tree pruning strategies. These impose additional assumptions on node selection that are needed to compensate for small numbers of annotated examples. Pruning-based heuristics in query learning algorithms for Web information extraction often boost the learning quality and speed up the learning process. We will distinguish the class of regular queries that are stable under a given schema-guided pruning strategy, and show that this class is learnable with polynomial time and data. Our learning algorithm is obtained by adding pruning heuristics to the traditional learning algorithm for tree automata from positive and negative examples. While justified by a formal learning model, our learning algorithm for stable queries also performs very well in practice of XML information extraction
Earliest Query Answering for Deterministic Nested Word Automata
International audienceEarliest query answering (EQA) is an objective of many recent streaming algorithms for XML query answering, that aim for close to optimal memory management. In this paper, we show that EQA is infeasible even for a small fragment of Forward XPath except if P=NP. We then present an EQA algorithm for queries and schemas defined by deterministic nested word automata (dNWAs) and distinguish a large class of dNWAs for which streaming query answering is feasible in polynomial space and time
Transformations Between Different Types of Unranked Bottom-Up Tree Automata
We consider the representational state complexity of unranked tree automata.
The bottom-up computation of an unranked tree automaton may be either
deterministic or nondeterministic, and further variants arise depending on
whether the horizontal string languages defining the transitions are
represented by a DFA or an NFA. Also, we consider for unranked tree automata
the alternative syntactic definition of determinism introduced by Cristau et
al. (FCT'05, Lect. Notes Comput. Sci. 3623, pp. 68-79).
We establish upper and lower bounds for the state complexity of conversions
between different types of unranked tree automata.Comment: In Proceedings DCFS 2010, arXiv:1008.127
Deterministic Automata for Unordered Trees
Automata for unordered unranked trees are relevant for defining schemas and
queries for data trees in Json or Xml format. While the existing notions are
well-investigated concerning expressiveness, they all lack a proper notion of
determinism, which makes it difficult to distinguish subclasses of automata for
which problems such as inclusion, equivalence, and minimization can be solved
efficiently. In this paper, we propose and investigate different notions of
"horizontal determinism", starting from automata for unranked trees in which
the horizontal evaluation is performed by finite state automata. We show that a
restriction to confluent horizontal evaluation leads to polynomial-time
emptiness and universality, but still suffers from coNP-completeness of the
emptiness of binary intersections. Finally, efficient algorithms can be
obtained by imposing an order of horizontal evaluation globally for all
automata in the class. Depending on the choice of the order, we obtain
different classes of automata, each of which has the same expressiveness as
CMso.Comment: In Proceedings GandALF 2014, arXiv:1408.556
Schema-Guided Induction of Monadic Queries
International audienceThe induction of monadic node selecting queries from partially annotated XML-trees is a key task in Web information extraction. We show how to integrate schema guidance into an RPNI-based learning algorithm, in which monadic queries are represented by pruning node selecting tree transducers. We present experimental results on schema guidance by the DTD of HTML
- …