95,423 research outputs found

    Non-hierarchical Structures: How to Model and Index Overlaps?

    Overlap is a common phenomenon seen when structural components of a digital object are neither disjoint nor nested inside each other. Overlapping components resist reduction to a structural hierarchy, and tree-based indexing and query processing techniques cannot be used for them. Our solution to this data modeling problem is TGSA (Tree-like Graph for Structural Annotations), a novel extension of the XML data model for non-hierarchical structures. We introduce an algorithm for constructing TGSA from annotated documents; the algorithm can efficiently process non-hierarchical structures and is associated with formal proofs, ensuring that transformation of the document to the data model is valid. To enable high-performance query analysis in large data repositories, we further introduce an extension of XML pre-post indexing for non-hierarchical structures, which can process both reachability and overlapping relationships. Comment: The paper has been accepted at the Balisage 2014 conference.
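
    For readers unfamiliar with the pre-post indexing that the abstract above builds on, the following is a minimal sketch of the standard scheme for ordinary trees only. It is not the TGSA extension described in the paper, and it does not handle overlapping components. The names `Node`, `pre_post_index` and `is_ancestor` are hypothetical, and node labels are assumed to be unique.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A plain tree node; labels are assumed unique for this illustration."""
    label: str
    children: list["Node"] = field(default_factory=list)


def pre_post_index(root: Node) -> dict[str, tuple[int, int]]:
    """Assign each node its (preorder rank, postorder rank) in one traversal."""
    index: dict[str, tuple[int, int]] = {}
    pre_counter = 0
    post_counter = 0

    def visit(node: Node) -> None:
        nonlocal pre_counter, post_counter
        pre = pre_counter          # rank when the node is first entered
        pre_counter += 1
        for child in node.children:
            visit(child)
        index[node.label] = (pre, post_counter)  # rank when the node is left
        post_counter += 1

    visit(root)
    return index


def is_ancestor(index: dict[str, tuple[int, int]], a: str, d: str) -> bool:
    """In a tree, a is a proper ancestor of d iff pre(a) < pre(d) and post(d) < post(a)."""
    pre_a, post_a = index[a]
    pre_d, post_d = index[d]
    return pre_a < pre_d and post_d < post_a
```

    With this index, a reachability (ancestor/descendant) test reduces to two integer comparisons, which is why the scheme is attractive for large repositories. Overlapping annotations break the tree assumption behind this test, which is the gap the paper's extension is meant to address.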

    Principles of Guarded Structural Indexing

    We present a new structural characterization of the expressive power of the acyclic conjunctive queries in terms of guarded simulations, and give a finite preservation theorem for the guarded simulation invariant fragment of first-order logic. We discuss the relevance of these results as a formal basis for constructing so-called guarded structural indexes. Structural indexes were first proposed in the context of semistructured query languages and later successfully applied as an XML indexing mechanism for XPath-like queries on trees and graphs. Guarded structural indexes provide a generalization of structural indexes from graph databases to relational databases.

    A structural and functional comparison of differential A and P indexing

    Indexing P arguments on bivalent predicates is often considered more restricted and less often obligatory than A indexing. However, differential A indexing, i.e., the absence versus the presence of an index referring to the A argument role, is not uncommon either: A indexes that are usually present can be omitted in particular discourse settings. Even so, differential A indexing has been a Cinderella subject in the typological study of differential marking, as opposed to differential P indexing or differential A flagging. This paper scrutinizes various cases of both differential A and P indexing and examines structural and functional differences and similarities. It will be shown that exploring differential indexing helps to understand how indexing in general is linked to referential prominence, which surfaces as factors such as identifiability, animacy or topicality. Cases where indexing is particularly sensitive to referential prominence, and where it is thus employed only if the referent fulfills certain criteria, bring out the fact that A and P indexing have a common purpose, namely tracking referents through discourse. In this context, the paper also points out that differential A indexing is an exception to generalizations concerning the amount of material in coding asymmetries.

    Logical segmentation for article extraction in digitized old newspapers

    Newspapers are documents made up of news items and informative articles. They are not meant to be read sequentially: the reader can pick items in any order. Ignoring this structural property, most digitized newspaper archives only offer access to their content by issue or, at best, by page. We have built a digitization workflow that automatically extracts newspaper articles from images, which allows indexing and retrieval of information at the article level. Our back-end system extracts the logical structure of the page to produce the informative units: the articles. Each image is labelled at the pixel level through a machine-learning-based method, then the logical structure of the page is built up from there by detecting structuring entities such as horizontal and vertical separators, titles and text lines. This logical structure is stored in a METS wrapper associated with the ALTO file produced by the system, which includes the OCRed text. Our front-end system provides high-definition web visualisation of images, textual indexing and retrieval facilities, and searching and reading at the article level. Article transcriptions can be collaboratively corrected, which in turn allows for better indexing. We are currently testing our system on the archives of the Journal de Rouen, one of France's oldest local newspapers. These 250 years of publication amount to 300,000 pages of highly variable image quality and layout complexity. Test year 1808 can be consulted at plair.univ-rouen.fr. Comment: ACM Document Engineering, France (2012)

    Differential indexing and information structure management

    Differential indexing has often been associated with features such as animacy and identifiability, but also with the discourse categories topic and focus. This study presents a cross-linguistic survey of differential indexing phenomena which have been ascribed to topicality or focus, showing that although there are similarities with regard to the discourse-structural effects that the addition or omission of indexes might have in different languages, differential indexing has language-specific, often compositional causes and effects. These are demonstrated by two case studies, on Ruuli (Bantu) and Maltese (Semitic). In both languages, differential indexing of the P argument could be ascribed to the topicality of the referent, but it will be shown that this would not do justice to the multifactorial reality of the phenomenon, as in both languages differential indexing is triggered by a complex interplay of different factors.