
    A Database Approach to Content-based XML Retrieval

    This paper describes a first prototype system for content-based retrieval from XML data. The system's design supports both XPath queries and complex information retrieval queries based on a language modelling approach to information retrieval. Evaluation using the INEX benchmark shows that it is beneficial if the system is biased to retrieve large XML fragments over small fragments.
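    Not part of the abstract, but the language-modelling idea with a bias toward large fragments can be sketched in a few lines of Python. The element names, the Jelinek-Mercer smoothing weight `lam`, and the log-length prior below are illustrative assumptions, not the paper's actual retrieval model.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter

def fragment_text(elem):
    """Concatenate all text below an element (one retrievable XML fragment)."""
    return " ".join(elem.itertext()).split()

def score(query, elem, coll_counts, coll_len, lam=0.5, length_prior=True):
    """Smoothed unigram language-model score of a query against one fragment.

    P(term|fragment) is interpolated with the collection model (Jelinek-Mercer);
    an optional log-length prior biases retrieval toward larger fragments,
    mirroring the bias the abstract reports as beneficial on INEX.
    """
    words = fragment_text(elem)
    counts, n = Counter(words), max(len(words), 1)
    s = 0.0
    for term in query:
        p_frag = counts[term] / n
        p_coll = coll_counts[term] / max(coll_len, 1)
        s += math.log(lam * p_frag + (1 - lam) * p_coll + 1e-12)
    if length_prior:
        s += math.log(n)  # prefer large fragments over small ones
    return s

# Usage sketch: rank every element of a toy document as a candidate fragment.
doc = ET.fromstring("<article><sec>xml retrieval models</sec><sec>databases</sec></article>")
all_words = fragment_text(doc)
coll_counts, coll_len = Counter(all_words), len(all_words)
query = ["xml", "retrieval"]
ranked = sorted(doc.iter(), key=lambda e: score(query, e, coll_counts, coll_len), reverse=True)
print([e.tag for e in ranked])
```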

    A Grammatical Inference Approach to Language-Based Anomaly Detection in XML

    False positives are a problem in anomaly-based intrusion detection systems. To counter this issue, we discuss anomaly detection for the eXtensible Markup Language (XML) in a language-theoretic view. We argue that many XML-based attacks target the syntactic level, i.e. the tree structure or element content, and that syntax validation of XML documents reduces the attack surface. XML offers so-called schemas for validation, but in the real world, schemas are often unavailable, ignored or too general. In this work-in-progress paper we describe a grammatical inference approach to learn an automaton from example XML documents for detecting documents with anomalous syntax. We discuss properties and expressiveness of XML to understand the limits of learnability. Our contributions are an XML Schema compatible lexical datatype system to abstract content in XML and an algorithm to learn visibly pushdown automata (VPA) directly from a set of examples. The proposed algorithm does not require the tree representation of XML, so it can process large documents or streams. The resulting deterministic VPA then allows stream validation of documents to recognize deviations in the underlying tree structure or datatypes.
    Comment: Paper accepted at First Int. Workshop on Emerging Cyberthreats and Countermeasures ECTCM 201
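    As a rough illustration of the stream-validation idea (start tags as call symbols, end tags as return symbols of a visibly pushdown automaton, no tree ever built), here is a minimal Python sketch. The transition tables and the toy schema are invented for illustration and are not the learned automata from the paper.

```python
import xml.sax

class VPAValidator(xml.sax.ContentHandler):
    """Stream-validates tag structure against a toy deterministic VPA.

    Start tags are 'call' symbols (push), end tags are 'return' symbols (pop);
    the document is accepted only if the run ends in an accepting state with
    an empty stack. The transition tables are illustrative, not learned.
    """
    def __init__(self, call, ret, start, accept):
        super().__init__()
        self.call, self.ret = call, ret      # call / return transition tables
        self.state, self.stack = start, []
        self.accept, self.ok = accept, True

    def startElement(self, name, attrs):
        key = (self.state, name)
        if key not in self.call:
            self.ok = False
            return
        next_state, stack_symbol = self.call[key]
        self.stack.append(stack_symbol)
        self.state = next_state

    def endElement(self, name):
        if not self.ok or not self.stack:
            self.ok = False
            return
        key = (self.state, name, self.stack.pop())
        if key not in self.ret:
            self.ok = False
            return
        self.state = self.ret[key]

    def accepted(self):
        return self.ok and not self.stack and self.state in self.accept

# Toy VPA for documents of the form <order><item/>...</order>.
call = {("q0", "order"): ("q1", "O"), ("q1", "item"): ("q2", "I")}
ret = {("q2", "item", "I"): "q1", ("q1", "order", "O"): "qf"}
validator = VPAValidator(call, ret, start="q0", accept={"qf"})
xml.sax.parseString(b"<order><item/><item/></order>", validator)
print("structurally valid" if validator.accepted() else "anomalous")
```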

    Rewrite-based Verification of XML Updates

    We consider problems of access control for updates of XML documents. In the context of XML programming, types can be viewed as hedge automata, and static type checking amounts to verifying that a program always converts valid source documents into valid output documents. Given a set of update operations, we are particularly interested in checking safety properties such as preservation of document types along any sequence of updates. We are also interested in the related policy consistency problem, that is, detecting whether a sequence of authorized operations can simulate a forbidden one. We reduce these questions to type checking problems, solved by computing variants of hedge automata characterizing the set of ancestors and descendants of the initial document type for the closure of parameterized rewrite rules.
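    To make the "types as hedge automata" view concrete, the Python sketch below checks membership of a document in a toy hedge-automaton type and shows an update that breaks type preservation. This is a runtime check for illustration only; the paper performs the verification statically via automata constructions over the rewrite rules, and the states, rules and example update here are assumptions.

```python
import re
import xml.etree.ElementTree as ET

def run_hedge_automaton(elem, rules):
    """Bottom-up run of a toy hedge automaton.

    `rules` maps an element tag to (horizontal_regex, state) pairs: the states
    assigned to the children, concatenated left to right, must match the regex
    for the parent to receive that state. Returns None if no rule applies.
    (States are single characters so that concatenation stays unambiguous.)
    """
    child_states = "".join(run_hedge_automaton(c, rules) or "?" for c in elem)
    for horizontal, state in rules.get(elem.tag, []):
        if re.fullmatch(horizontal, child_states):
            return state
    return None

def has_type(doc, rules, accepting):
    return run_hedge_automaton(doc, rules) in accepting

# Toy document type: a <list> holds one or more <item> elements.
rules = {"item": [("", "i")], "list": [("i+", "l")]}

doc = ET.fromstring("<list><item/><item/></list>")
print(has_type(doc, rules, {"l"}))   # True: document is well-typed

# An update inserting a foreign element is not type-preserving.
doc.append(ET.Element("note"))
print(has_type(doc, rules, {"l"}))   # False: the update breaks the type
```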

    Multiple hierarchies: new aspects of an old solution

    In this paper, we present the Multiple Annotation approach, which solves two problems: the problem of annotating overlapping structures, and the problem of annotating documents according to different, possibly heterogeneous tag sets. This approach has many advantages: it is based on XML, the modeling of alternative annotations is possible, each level can be viewed separately, and new levels can be added at any time. The files can be regarded as an interrelated unit, with the text serving as the implicit link. Two representations of the information contained in the multiple files (one in Prolog and one in XML) are described. These representations serve as a basis for several applications.
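    A minimal sketch of the underlying idea, with the base text serving as the implicit link between independent annotation files, follows. The level names, character offsets and attribute names are hypothetical and do not reproduce the paper's actual Prolog or XML representations.

```python
import xml.etree.ElementTree as ET

TEXT = "Peter sleeps"

# Two independent annotation levels over the same base text; the text itself
# (via character offsets) is the implicit link between the files.
SYNTAX = """<level name="syntax">
  <seg start="0" end="5" cat="NP"/>
  <seg start="6" end="12" cat="VP"/>
</level>"""

PROSODY = """<level name="prosody">
  <seg start="0" end="12" cat="phrase"/>
</level>"""

def segments(layer_xml):
    """Flatten one annotation level into (start, end, level, category) tuples."""
    root = ET.fromstring(layer_xml)
    return [(int(s.get("start")), int(s.get("end")), root.get("name"), s.get("cat"))
            for s in root.iter("seg")]

# View both levels as one interrelated unit, ordered by position in the text;
# overlapping structures (the prosodic phrase spans both syntactic segments)
# coexist without any tag-nesting conflict.
merged = sorted(segments(SYNTAX) + segments(PROSODY))
for start, end, level, cat in merged:
    print(f"{TEXT[start:end]!r:15} {level:8} {cat}")
```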

    XML Matchers: approaches and challenges

    Schema Matching, i.e. the process of discovering semantic correspondences between concepts adopted in different data source schemas, has been a key topic in the Database and Artificial Intelligence research areas for many years. In the past, it was investigated largely for classical database models (e.g., E/R schemas, relational databases, etc.). In recent years, however, the widespread adoption of XML in the most disparate application fields has pushed a growing number of researchers to design XML-specific Schema Matching approaches, called XML Matchers, aimed at finding semantic matchings between concepts defined in DTDs and XSDs. XML Matchers do not just take well-known techniques originally designed for other data models and apply them to DTDs/XSDs; they exploit specific XML features (e.g., the hierarchical structure of a DTD/XSD) to improve the performance of the Schema Matching process. The design of XML Matchers is currently a well-established research area. The main goal of this paper is to provide a detailed description and classification of XML Matchers. We first describe to what extent the specificities of DTDs/XSDs impact the Schema Matching task. Then we introduce a template, called the XML Matcher Template, that describes the main components of an XML Matcher, their role and behavior. We illustrate how each of these components has been implemented in some popular XML Matchers. We consider our XML Matcher Template as the baseline for objectively comparing approaches that, at first glance, might appear unrelated. The introduction of this template can be useful in the design of future XML Matchers. Finally, we analyze commercial tools implementing XML Matchers and introduce two challenging issues strictly related to this topic, namely XML source clustering and uncertainty management in XML Matchers.
    Comment: 34 pages, 8 tables, 7 figures
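    As a toy illustration of what one component of an XML Matcher might do, the Python sketch below scores candidate correspondences by combining the name similarity of two elements with that of their parents, a crude stand-in for exploiting the hierarchical structure of a DTD/XSD. The schemas, weights and threshold are assumptions, not any surveyed matcher's actual algorithm.

```python
from difflib import SequenceMatcher
from itertools import product

def name_similarity(a, b):
    """Linguistic matcher component: string similarity of two concept names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_schemas(source, target, threshold=0.6):
    """Return candidate correspondences (source path, target path, score).

    Each schema is a flattened view {element path: parent element}; the score
    combines element-name similarity with parent-name similarity, so structural
    context contributes to the match, not just the leaf label.
    """
    matches = []
    for s, t in product(source, target):
        score = (0.7 * name_similarity(s.split("/")[-1], t.split("/")[-1])
                 + 0.3 * name_similarity(source[s], target[t]))
        if score >= threshold:
            matches.append((s, t, round(score, 2)))
    return sorted(matches, key=lambda m: -m[2])

# Hypothetical flattened views of two XSDs (element path -> parent element).
purchase = {"order/buyer": "order", "order/total": "order"}
invoice  = {"invoice/customer": "invoice", "invoice/amount": "invoice"}
for correspondence in match_schemas(purchase, invoice, threshold=0.3):
    print(correspondence)
```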

    A review of the state of the art in Machine Learning on the Semantic Web: Technical Report CSTR-05-003


    Wikis in scholarly publishing

    Scientific research is a process concerned with the creation, collective accumulation, contextualization, updating and maintenance of knowledge. Wikis provide an environment in which knowledge can be collectively accumulated, contextualized, updated and maintained in a coherent and transparent fashion. Here, we examine the potential of wikis as platforms for scholarly publishing. In the hope of stimulating further discussion, the article itself was drafted on "Species-ID":http://species-id.net/w/index.php?title=Wikis_in_scholarly_publishing&oldid=3815 - a wiki that hosts a prototype for wiki-based scholarly publishing - where it can be updated, expanded or otherwise improved.

    New Methods, Current Trends and Software Infrastructure for NLP

    The increasing use of 'new methods' in NLP, which the NeMLaP conference series exemplifies, occurs in the context of a wider shift in the nature and concerns of the discipline. This paper begins with a short review of this context and of significant trends in the field. The review motivates and leads to a set of requirements for support software of general utility for NLP research and development workers. A freely-available system designed to meet these requirements is described, called GATE - a General Architecture for Text Engineering. Information Extraction (IE), in the sense defined by the Message Understanding Conferences (ARPA \cite{Arp95}), is an NLP application in which many of the new methods have found a home (Hobbs \cite{Hob93}; Jacobs ed. \cite{Jac92}). An IE system based on GATE is also available for research purposes, and this is described. Lastly, we review related work.
    Comment: 12 pages, LaTeX, uses nemlap.sty (included)