
    Algorithms and implementation of functional dependency discovery in XML : a thesis presented in partial fulfilment of the requirements for the degree of Master of Information Sciences in Information Systems at Massey University

    1.1 Background. Following the advent of the web, there has been great demand for data interchange between applications over internet infrastructure. XML (eXtensible Markup Language) provides a structured representation of data, supported by broad adoption and easy deployment. A subset of SGML (Standard Generalized Markup Language), XML has been standardized by the World Wide Web Consortium (W3C) [Bray et al., 2004]. It is becoming the prevalent data exchange format on the World Wide Web and is increasingly significant for storing semi-structured data. Since its initial release in 1996, it has evolved and been applied extensively in all fields that require the exchange of structured documents in electronic form. With the growing popularity of XML, the issue of functional dependencies in XML has recently received well-deserved attention. The study of dependencies in XML is driven by the fact that they are as crucial to XML schema design as they are to relational database (RDB) design [Abiteboul et al., 1995].

    A Grammatical Inference Approach to Language-Based Anomaly Detection in XML

    False positives are a problem in anomaly-based intrusion detection systems. To counter this issue, we discuss anomaly detection for the eXtensible Markup Language (XML) from a language-theoretic view. We argue that many XML-based attacks target the syntactic level, i.e. the tree structure or element content, and that syntax validation of XML documents reduces the attack surface. XML offers so-called schemas for validation, but in the real world, schemas are often unavailable, ignored or too general. In this work-in-progress paper we describe a grammatical inference approach that learns an automaton from example XML documents in order to detect documents with anomalous syntax. We discuss properties and expressiveness of XML to understand the limits of learnability. Our contributions are an XML Schema compatible lexical datatype system to abstract content in XML and an algorithm to learn visibly pushdown automata (VPA) directly from a set of examples. The proposed algorithm does not require the tree representation of XML, so it can process large documents or streams. The resulting deterministic VPA then allows stream validation of documents to recognize deviations in the underlying tree structure or datatypes. Comment: Paper accepted at First Int. Workshop on Emerging Cyberthreats and Countermeasures ECTCM 201
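The idea of abstracting element content into lexical datatypes and reading a document as a stream of open, close and text events can be illustrated with a minimal sketch. This is not the datatype system or the learning algorithm from the paper: the pattern list `LEXICAL_TYPES`, the function names and the event tuples are all assumptions made for illustration, and only four simplified datatypes are recognised.

```python
import re
import xml.sax

# Hypothetical, heavily simplified lexical patterns, loosely modelled on
# XML Schema datatypes. The paper's own datatype system is not reproduced here.
LEXICAL_TYPES = [
    ("integer", re.compile(r"^[+-]?\d+$")),
    ("decimal", re.compile(r"^[+-]?(\d+\.\d*|\.\d+)$")),
    ("boolean", re.compile(r"^(true|false)$")),
    ("date", re.compile(r"^\d{4}-\d{2}-\d{2}$")),
]


def abstract_text(text):
    """Map text content to the first matching datatype name, else 'string'."""
    value = text.strip()
    for name, pattern in LEXICAL_TYPES:
        if pattern.match(value):
            return name
    return "string"


class EventAbstraction(xml.sax.ContentHandler):
    """Collects (kind, label) events without building a document tree."""

    def __init__(self):
        super().__init__()
        self.events = []
        self._buffer = []

    def _flush_text(self):
        # SAX may deliver text in several chunks, so join them first.
        text = "".join(self._buffer)
        self._buffer = []
        if text.strip():
            self.events.append(("text", abstract_text(text)))

    def startElement(self, name, attrs):
        self._flush_text()
        self.events.append(("open", name))

    def endElement(self, name):
        self._flush_text()
        self.events.append(("close", name))

    def characters(self, content):
        self._buffer.append(content)


def abstract_events(xml_text):
    """Return the abstracted event sequence for an XML string."""
    handler = EventAbstraction()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.events
```

An automaton learned from examples would consume events of this kind one at a time, pushing on `open` and popping on `close`. The sketch collects them in a list only to keep the example short.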

    Schematron Schema Inference

    XML is a popular language for data exchange. However, many XML documents have no schema, or their schema is outdated. This thesis builds on work on automatic schema inference for sets of XML documents and focuses on inferring Schematron schemas. Schematron is a language that validates XML documents using rules only, rather than against a complete grammar, as is typical for DTD or XML Schema. Because Schematron schema generation is a little-explored area, this thesis analyzes the basic problems, proposes several approaches, and describes their advantages and disadvantages. Department of Software Engineering, Faculty of Mathematics and Physics.

    Multiple hierarchies : new aspects of an old solution

    In this paper, we present the Multiple Annotation approach, which solves two problems: annotating overlapping structures, and annotating documents according to different, possibly heterogeneous tag sets. This approach has many advantages: it is based on XML, alternative annotations can be modeled, each level can be viewed separately, and new levels can be added at any time. The files can be regarded as an interrelated unit, with the text serving as the implicit link. Two representations of the information contained in the multiple files (one in Prolog and one in XML) are described. These representations serve as a basis for several applications.

    Discovering Restricted Regular Expressions with Interleaving

    Discovering a concise schema from given XML documents is an important problem in XML applications. In this paper, we focus on the problem of learning an unordered schema from a given set of XML examples, which amounts to learning a restricted regular expression with interleaving from positive example strings. Schemas with interleaving can present meaningful knowledge that previous inference techniques cannot disclose. Inferring the minimal schema with interleaving is challenging, however: the problem of finding a minimal schema with interleaving is shown to be NP-hard. Therefore, we develop an approximation algorithm and a heuristic solution to tackle the problem using techniques different from known inference algorithms. We run experiments on real-world data sets to demonstrate the effectiveness of our approaches. Our heuristic algorithm is shown to produce results that are very close to optimal. Comment: 12 page
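To make the notion of an unordered schema learned from positive examples concrete, here is a deliberately naive sketch. It is not the approximation algorithm or the heuristic from the paper: it assumes every symbol may interleave freely with every other symbol and only infers a multiplicity per symbol from occurrence counts. The function names and the `&` notation for interleaving are assumptions made for illustration.

```python
from collections import Counter


def infer_multiplicities(samples):
    """Infer one quantifier per symbol from positive example strings.

    samples: list of sequences of child element names.
    Returns a dict mapping symbol -> '', '?', '+' or '*'.
    """
    symbols = sorted({symbol for sample in samples for symbol in sample})
    quantifiers = {}
    for symbol in symbols:
        counts = [Counter(sample)[symbol] for sample in samples]
        lowest, highest = min(counts), max(counts)
        if lowest == 0 and highest <= 1:
            quantifiers[symbol] = "?"   # optional, at most once
        elif lowest == 0:
            quantifiers[symbol] = "*"   # optional, may repeat
        elif highest == 1:
            quantifiers[symbol] = ""    # exactly once in every example
        else:
            quantifiers[symbol] = "+"   # at least once, may repeat
    return quantifiers


def to_expression(quantifiers):
    """Render the result as a fully interleaved expression, e.g. 'a+ & b?'."""
    return " & ".join(symbol + q for symbol, q in quantifiers.items())


samples = [
    ["title", "author", "author"],
    ["author", "title", "year"],
    ["title", "author"],
]
print(to_expression(infer_multiplicities(samples)))
```

Because all ordering information is discarded, the result accepts every permutation of the children. A schema that mixes ordered parts with interleaved parts is more concise and more precise, and finding a minimal one is the hard part of the problem.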

    XML Document Schema Inference

    XML has become a popular format for data exchange and storage. The internal structure of XML documents is described by a schema, which plays an important role in data manipulation. In this thesis, we design a method that creates a corresponding schema for a given set of input XML documents. Having examined the published automatic methods and their disadvantages, we improve them with user interaction, which none of the examined algorithms offers. The schema inference builds on the formal foundation of grammatical inference.

    Efficient Inclusion Checking for Deterministic Tree Automata and XML Schemas

    Special issue of LATA'08. We present algorithms for testing language inclusion L(A) ⊆ L(B) between tree automata in time O(|A| |B|), where B is deterministic (bottom-up or top-down). We extend our algorithms to test inclusion of automata for unranked trees A in deterministic DTDs or deterministic EDTDs with restrained competition D in time O(|A| |Σ| |D|). Previous algorithms were less efficient or less general.
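The inclusion test L(A) ⊆ L(B) for a deterministic bottom-up B can be sketched as a reachability computation over pairs of states. The sketch below is a naive fixed-point version for ranked trees only: it does not achieve the O(|A| |B|) bound and does not cover unranked trees, DTDs or EDTDs. The rule encoding, the `SINK` convention and the function name `included` are assumptions made for illustration.

```python
from itertools import product

SINK = None  # implicit rejecting state used when B has no matching transition


def included(rules_a, final_a, delta_b, final_b):
    """Check L(A) ⊆ L(B) for bottom-up tree automata over ranked trees.

    rules_a: iterable of (symbol, child_states, target) for a possibly
             nondeterministic automaton A; child_states is a tuple.
    delta_b: dict mapping (symbol, child_states) -> state for a
             deterministic automaton B.
    """
    rules_a = list(rules_a)
    # (q, p) is reachable if some tree can evaluate to q in A and evaluates to p in B.
    reachable = set()
    changed = True
    while changed:
        changed = False
        for symbol, children, target in rules_a:
            # For each child state of A, the B-states it is already paired with.
            options = [[p for (q, p) in reachable if q == child] for child in children]
            for combo in product(*options):
                p = delta_b.get((symbol, combo), SINK)
                if (target, p) not in reachable:
                    reachable.add((target, p))
                    changed = True
    # Inclusion fails exactly when A can accept a tree that B rejects.
    return all(p in final_b for (q, p) in reachable if q in final_a)


# B accepts trees over leaves a, b and binary f that contain no b leaf.
delta_b = {("a", ()): "clean", ("b", ()): "dirty"}
for left in ("clean", "dirty"):
    for right in ("clean", "dirty"):
        both_clean = left == "clean" and right == "clean"
        delta_b[("f", (left, right))] = "clean" if both_clean else "dirty"

only_a = [("a", (), "qa"), ("f", ("qa", "qa"), "qa")]
all_trees = [("a", (), "q"), ("b", (), "q"), ("f", ("q", "q"), "q")]

print(included(only_a, {"qa"}, delta_b, {"clean"}))
print(included(all_trees, {"q"}, delta_b, {"clean"}))
```

Determinism of B is what makes this work without complementing B explicitly: each tree reaches exactly one B-state, so a single reachable pair with an accepting A-state and a rejecting B-state is a witness against inclusion.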