Non-hierarchical Structures: How to Model and Index Overlaps?
Overlap is a common phenomenon seen when structural components of a digital
object are neither disjoint nor nested inside each other. Overlapping
components resist reduction to a structural hierarchy, and tree-based indexing
and query processing techniques cannot be used for them. Our solution to this
data modeling problem is TGSA (Tree-like Graph for Structural Annotations), a
novel extension of the XML data model for non-hierarchical structures. We
introduce an algorithm for constructing TGSA from annotated documents; the
algorithm can efficiently process non-hierarchical structures and is associated
with formal proofs, ensuring that transformation of the document to the data
model is valid. To enable high performance query analysis in large data
repositories, we further introduce an extension of XML pre-post indexing for
non-hierarchical structures, which can process both reachability and
overlapping relationships. Comment: The paper has been accepted at the Balisage 2014 conference
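As a rough illustration of the pre-post indexing that the abstract builds on, the sketch below assigns preorder and postorder ranks to a small tree and answers reachability (ancestor) queries by comparing them. The `Node` class, the example tree, and the `spans_overlap` helper are assumptions made for illustration; in particular, the span-based overlap test is a generic interval check and not the TGSA extension described in the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node; names are assumed unique for this sketch."""
    name: str
    children: list["Node"] = field(default_factory=list)


def pre_post_index(root: Node) -> dict[str, tuple[int, int]]:
    """Assign (pre, post) ranks to every node in one depth-first traversal."""
    index: dict[str, tuple[int, int]] = {}
    counters = {"pre": 0, "post": 0}

    def visit(node: Node) -> None:
        pre = counters["pre"]          # rank on first visit (preorder)
        counters["pre"] += 1
        for child in node.children:
            visit(child)
        index[node.name] = (pre, counters["post"])  # rank on exit (postorder)
        counters["post"] += 1

    visit(root)
    return index


def is_ancestor(index: dict[str, tuple[int, int]], a: str, d: str) -> bool:
    """a is a proper ancestor of d iff a starts before d and finishes after it."""
    pre_a, post_a = index[a]
    pre_d, post_d = index[d]
    return pre_a < pre_d and post_d < post_a


def spans_overlap(s1: tuple[int, int], s2: tuple[int, int]) -> bool:
    """Proper overlap of half-open character spans: they intersect,
    but neither one contains the other."""
    (a1, b1), (a2, b2) = s1, s2
    intersect = a1 < b2 and a2 < b1
    nested = (a1 <= a2 and b2 <= b1) or (a2 <= a1 and b1 <= b2)
    return intersect and not nested


# Example document tree (assumed for illustration).
tree = Node("doc", [
    Node("chapter", [Node("para1"), Node("para2")]),
    Node("appendix"),
])
index = pre_post_index(tree)
```

With this index, an ancestor query needs only two integer comparisons per node pair, which is why the scheme suits large repositories; overlapping annotations fall outside the tree and need an additional test such as the span check above.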
Principles of Guarded Structural Indexing
We present a new structural characterization of the expressive power of the acyclic conjunctive queries in terms of guarded simulations, and give a finite preservation theorem for the guarded simulation invariant fragment of first order logic.
We discuss the relevance of these results as a formal basis for constructing so-called guarded structural indexes. Structural indexes were first proposed in the context of semistructured query languages and later successfully applied as an XML indexing mechanism for XPath-like queries on trees and graphs. Guarded structural indexes provide a generalization of structural indexes from graph databases to relational databases.
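To make the notion of a structural index concrete, the following sketch groups the nodes of a small labelled graph into blocks of backward-bisimilar nodes, so that a path query can be evaluated over blocks instead of individual nodes. This is the classical graph-database construction computed by naive partition refinement, not the guarded variant the abstract proposes; the function name, graph, and labels are assumptions made for illustration.

```python
def structural_index(labels: dict[str, str],
                     edges: list[tuple[str, str]]) -> dict[str, int]:
    """Map each node to a block id such that nodes in the same block have
    the same label and parents drawn from the same set of blocks."""
    parents: dict[str, set[str]] = {n: set() for n in labels}
    for src, dst in edges:
        parents[dst].add(src)

    # Initial partition: nodes with the same label share a block.
    block: dict[str, object] = dict(labels)
    while True:
        # A node's signature is its current block plus its parents' blocks.
        signature = {
            n: (block[n], frozenset(block[p] for p in parents[n]))
            for n in labels
        }
        ids: dict[object, int] = {}
        new_block: dict[str, int] = {}
        for n in sorted(labels):
            new_block[n] = ids.setdefault(signature[n], len(ids))
        # Each round can only split blocks, so an unchanged count means stable.
        if len(ids) == len(set(block.values())):
            return new_block
        block = new_block


# Example graph (assumed): b1 and b2 are reached via an "a" node, b3 is not.
labels = {"r": "root", "a1": "a", "a2": "a", "b1": "b", "b2": "b", "b3": "b"}
edges = [("r", "a1"), ("r", "a2"), ("a1", "b1"), ("a2", "b2"), ("r", "b3")]
blocks = structural_index(labels, edges)
```

Here `b1` and `b2` end up in one block while `b3` gets its own, because only `b3` is a direct child of the root; a query such as root/a/b can then be answered from the block for `b1` and `b2` alone.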
A structural and functional comparison of differential A and P indexing
Indexing P arguments on bivalent predicates is often considered more restricted and less often obligatory than A indexing. However, differential A indexing, i.e., the absence versus the presence of an index referring to the A argument role, is not uncommon either: A indexes that are usually present can be omitted in particular discourse settings. Even so, differential A indexing has been a Cinderella subject in the typological study of differential marking, as opposed to differential P indexing or differential A flagging. This paper scrutinizes various cases of both differential A and P indexing and examines structural and functional differences and similarities. It will be shown that exploring differential indexing helps to understand how indexing in general is linked to referential prominence, which surfaces as factors such as identifiability, animacy or topicality. Cases where indexing is particularly sensitive to referential prominence, and where it is thus employed only if the referent fulfills certain criteria, bring out the fact that A and P indexing have a common purpose, namely tracking referents through discourse. In this context, the paper also points out that differential A indexing presents an exception to generalizations concerning the amount of material in coding asymmetries.
Logical segmentation for article extraction in digitized old newspapers
Newspapers are documents made of news items and informative articles. They are
not meant to be read iteratively: the reader can pick his items in any order he
fancies. Ignoring this structural property, most digitized newspaper archives
only offer access by issue or at best by page to their content. We have built a
digitization workflow that automatically extracts newspaper articles from
images, which allows indexing and retrieval of information at the article
level. Our back-end system extracts the logical structure of the page to
produce the informative units: the articles. Each image is labelled at the
pixel level, through a machine learning based method, then the page logical
structure is constructed up from there by the detection of structuring entities
such as horizontal and vertical separators, titles and text lines. This logical
structure is stored in a METS wrapper associated with the ALTO file produced by
the system, which includes the OCRed text. Our front-end system provides
web-based high-definition visualisation of images, textual indexing and
retrieval facilities, and searching and reading at the article level. Article
transcriptions can be
collaboratively corrected, which as a consequence allows for better indexing.
We are currently testing our system on the archives of the Journal de Rouen,
one of France's oldest local newspapers. These 250 years of publication amount to
300,000 pages of very variable image quality and layout complexity. The test year,
1808, can be consulted at plair.univ-rouen.fr. Comment: ACM Document Engineering, France (2012)
Differential indexing and information structure management
Differential indexing has often been associated with features such as animacy and identifiability, but also with the discourse categories topic and focus. This study presents a cross-linguistic survey of differential indexing phenomena which have been ascribed to topicality or focus, showing that although there are similarities with regard to the discourse-structural effects that the addition or omission of indexes might have in different languages, differential indexing has language-specific, often compositional causes and effects. These are demonstrated by two case studies, on Ruuli (Bantu) and Maltese (Semitic). In both languages, differential indexing of the P argument could be ascribed to the topicality of the referent, but it will be shown that this would not do justice to the multifactorial reality of the phenomenon, as in both languages, differential indexing is triggered by a complex interplay of different factors.