280 research outputs found
Algorithms and implementation of functional dependency discovery in XML : a thesis presented in partial fulfilment of the requirements for the degree of Master of Information Sciences in Information Systems at Massey University
1.1 Background Following the advent of the web, there has been great demand for data interchange between applications over internet infrastructure. XML (Extensible Markup Language) provides a structured representation of data, supported by broad adoption and easy deployment. As a subset of SGML (Standard Generalized Markup Language), XML has been standardized by the World Wide Web Consortium (W3C) [Bray et al., 2004]. XML is becoming the prevalent data exchange format on the World Wide Web and is increasingly significant for storing semi-structured data. After its initial release in 1996, it has evolved and been applied extensively in all fields where the exchange of structured documents in electronic form is required. With the growing popularity of XML, the issue of functional dependency in XML has recently received well-deserved attention. The driving force for the study of dependencies in XML is that they are as crucial to XML schema design as to relational database (RDB) design [Abiteboul et al., 1995].
A Grammatical Inference Approach to Language-Based Anomaly Detection in XML
False-positives are a problem in anomaly-based intrusion detection systems.
To counter this issue, we discuss anomaly detection for the eXtensible Markup
Language (XML) in a language-theoretic view. We argue that many XML-based
attacks target the syntactic level, i.e. the tree structure or element content,
and syntax validation of XML documents reduces the attack surface. XML offers
so-called schemas for validation, but in the real world, schemas are often
unavailable, ignored or too general. In this work-in-progress paper we describe
a grammatical inference approach to learn an automaton from example XML
documents for detecting documents with anomalous syntax.
We discuss properties and expressiveness of XML to understand limits of
learnability. Our contributions are an XML Schema compatible lexical datatype
system to abstract content in XML and an algorithm to learn visibly pushdown
automata (VPA) directly from a set of examples. The proposed algorithm does not
require the tree representation of XML, so it can process large documents or
streams. The resulting deterministic VPA then allows stream validation of
documents to recognize deviations in the underlying tree structure or
datatypes.
Comment: Paper accepted at First Int. Workshop on Emerging Cyberthreats and
Countermeasures ECTCM 201
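The stack discipline behind stream validation can be illustrated with a deliberately coarse sketch. It is not the algorithm from the paper: instead of learning a full visibly pushdown automaton with datatypes, it only records which (parent, child) element pairs occur in the examples, pushing on each start tag (call symbol) and popping on each end tag (return symbol). The names `ROOT`, `walk`, `learn`, and `is_anomalous` are hypothetical.

```python
import xml.parsers.expat

ROOT = "#root"  # hypothetical marker for the bottom of the stack


def walk(xml_text, on_call):
    """Stream through xml_text and call on_call(parent, child) per start tag."""
    stack = [ROOT]

    def start(name, attrs):
        on_call(stack[-1], name)
        stack.append(name)  # call symbol: push

    def end(name):
        stack.pop()  # return symbol: pop

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.Parse(xml_text, True)


def learn(examples):
    """Collect every (parent, child) pair seen in the example documents."""
    allowed = set()
    for doc in examples:
        walk(doc, lambda parent, child: allowed.add((parent, child)))
    return allowed


def is_anomalous(allowed, xml_text):
    """Return True if the document uses a (parent, child) pair never seen."""
    unseen = []

    def check(parent, child):
        if (parent, child) not in allowed:
            unseen.append((parent, child))

    walk(xml_text, check)
    return bool(unseen)
```

Because the parser reports events as it reads, no tree is built, which mirrors the streaming property the abstract emphasizes. Everything else about the learned model is an assumption made for brevity.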
Schematron Schema Inference
XML is a popular language for data exchange. However, many XML documents have no schema, or their schema is outdated. This thesis builds on work on automatic schema inference for sets of XML documents and focuses on inferring Schematron schemas. Schematron is a language that validates XML documents using rules only, rather than against a complete grammar as is typical for DTD or XML Schema. Because Schematron schema generation has not been explored much, this thesis analyzes the basic problems, proposes several approaches, and describes their advantages and disadvantages.
Department of Software Engineering, Faculty of Mathematics and Physics
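The difference between rule-based and grammar-based validation can be shown with a small sketch. It does not use Schematron itself: each rule pairs a context path with an assertion, loosely mirroring the idea of Schematron rules, and is evaluated with Python's standard `xml.etree.ElementTree`. The rule set, the element names, and the `validate` function are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Each hypothetical rule: (context path, assertion on the element, message).
RULES = [
    (".//book", lambda e: e.find("title") is not None, "book must have a title"),
    (".//book", lambda e: e.get("id") is not None, "book must have an id attribute"),
]


def validate(xml_text, rules=RULES):
    """Return the messages of all assertions that fail on the document."""
    root = ET.fromstring(xml_text)
    failures = []
    for context, check, message in rules:
        # Apply the assertion to every element selected by the context path.
        for element in root.findall(context):
            if not check(element):
                failures.append(message)
    return failures
```

Elements that no rule mentions are never rejected, which is the practical contrast with a DTD or XML Schema, where the grammar must account for the whole document.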
Multiple hierarchies : new aspects of an old solution
In this paper, we present the Multiple Annotation approach, which solves two problems: the problem of annotating overlapping structures, and the problem that occurs when documents should be annotated according to different, possibly heterogeneous tag sets. This approach has many advantages: it is based on XML, the modeling of alternative annotations is possible, each level can be viewed separately, and new levels can be added at any time. The files can be regarded as an interrelated unit, with the text serving as the implicit link. Two representations of the information contained in the multiple files (one in Prolog and one in XML) are described. These representations serve as a base for several applications
Discovering Restricted Regular Expressions with Interleaving
Discovering a concise schema from given XML documents is an important problem
in XML applications. In this paper, we focus on the problem of learning an
unordered schema from a given set of XML examples, which is actually a problem
of learning a restricted regular expression with interleaving using positive
example strings. Schemas with interleaving could present meaningful knowledge
that cannot be disclosed by previous inference techniques. Moreover, inference
of the minimal schema with interleaving is challenging. The problem of finding
a minimal schema with interleaving is shown to be NP-hard. Therefore, we
develop an approximation algorithm and a heuristic solution to tackle the
problem using techniques different from known inference algorithms. We do
experiments on real-world data sets to demonstrate the effectiveness of our
approaches. Our heuristic algorithm is shown to produce results that are very
close to optimal.
Comment: 12 page
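To make the notion of an unordered schema concrete, here is a much-simplified sketch. It is neither the approximation algorithm nor the heuristic from the paper: it ignores ordering and grouping entirely and only infers, from positive examples, how often each symbol may occur, joining the results with the interleaving operator `&`. The function name `infer_unordered_schema` is hypothetical.

```python
from collections import Counter


def infer_unordered_schema(examples):
    """Infer a per-symbol multiplicity expression from positive examples.

    Each example is a sequence of child element names.
    """
    counts = [Counter(example) for example in examples]
    symbols = sorted(set().union(*counts))
    factors = []
    for symbol in symbols:
        lo = min(c[symbol] for c in counts)
        hi = max(c[symbol] for c in counts)
        if lo >= 1 and hi == 1:
            suffix = ""   # exactly once in every example
        elif lo == 0 and hi == 1:
            suffix = "?"  # optional
        elif lo >= 1:
            suffix = "+"  # at least once, sometimes repeated
        else:
            suffix = "*"  # sometimes absent, sometimes repeated
        factors.append(symbol + suffix)
    return " & ".join(factors)
```

For the examples `["a", "b", "c"]`, `["c", "a"]`, and `["b", "a", "c", "c"]` this yields `a & b? & c+`. A sketch like this can overgeneralize badly, which is why finding a minimal schema is the hard part of the problem.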
XML Document Schema Inference
XML has become a popular format for data exchange and storage. The internal structure of XML documents is described by a schema, which plays an important role in data manipulation. This thesis proposes a method for inferring a schema from a given set of XML documents. We surveyed existing published methods and seek to improve them with user interaction, which did not appear in any of the algorithms we examined. The inference builds on the formal foundation of grammatical inference.
Efficient Inclusion Checking for Deterministic Tree Automata and XML Schemas
Special issue of LATA'08. International audience. We present algorithms for testing language inclusion L(A) ⊆ L(B) between tree automata in time O(|A| |B|) where B is deterministic (bottom-up or top-down). We extend our algorithms for testing inclusion of automata for unranked trees A in deterministic DTDs or deterministic EDTDs with restrained competition D in time O(|A| |Σ| |D|). Previous algorithms were less efficient or less general.
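The role determinism plays in the O(|A| |B|) bound can be seen in a simpler setting. The sketch below works on word automata, not on the tree automata treated in the paper: A may be nondeterministic, B is a deterministic automaton whose missing transitions lead to an implicit rejecting sink, and inclusion is decided by exploring reachable state pairs. The dictionary layout and the name `is_included` are assumptions made for illustration.

```python
from collections import deque


def is_included(a, b):
    """Decide L(A) ⊆ L(B) for an NFA `a` and a deterministic automaton `b`.

    a: {"init": set, "final": set, "delta": {state: {symbol: set_of_states}}}
    b: {"init": state, "final": set, "delta": {(state, symbol): state}}
    A missing transition in b leads to None, an implicit rejecting sink.
    """
    start = [(p, b["init"]) for p in a["init"]]
    seen = set(start)
    queue = deque(start)
    while queue:
        p, q = queue.popleft()
        # A accepts some word on which B ends in a non-final state.
        if p in a["final"] and q not in b["final"]:
            return False
        for symbol, targets in a["delta"].get(p, {}).items():
            q_next = b["delta"].get((q, symbol))  # None once in the sink
            for p_next in targets:
                pair = (p_next, q_next)
                if pair not in seen:
                    seen.add(pair)
                    queue.append(pair)
    return True
```

Since B is deterministic, each state of A is paired with a single state of B per word read, so the search visits at most |A| · (|B| + 1) pairs and no complementation by subset construction is needed.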