Combining Linguistic and Spatial Information for Document Analysis
We present a framework for analyzing color documents with complex layouts;
no assumption is made about the layout. Our framework combines two different
sources of information, textual and spatial, in a content-driven bottom-up
approach. To analyze the text, shallow natural language processing tools,
such as taggers and partial parsers, are used. To infer relations of the
logical layout we resort to a qualitative spatial calculus closely related to
Allen's calculus. We evaluate the system against documents from a color journal
and present the results of extracting the reading order from the journal's
pages. In this case, our analysis is successful as it extracts the intended
reading order from the document.
Comment: Appeared in: J. Mariani and D. Harman (Eds.), Proceedings of RIAO'2000
Content-Based Multimedia Information Access, CID, 2000. pp. 266-27
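The abstract above relies on a qualitative spatial calculus related to Allen's calculus. As a rough illustration only, the sketch below classifies two one-dimensional intervals into the thirteen standard Allen relations and pairs the x-axis and y-axis relations to describe two rectangles. The function names, the tuple layout, and the use of a plain pair of relations for rectangles are assumptions made for this example; they are not taken from the paper, whose calculus is only described as "closely related" to Allen's.

```python
def allen_relation(a, b):
    """Return the Allen relation of interval a to interval b.

    Each interval is a (start, end) tuple with start < end (assumed).
    """
    a1, a2 = a
    b1, b2 = b
    if a2 < b1:
        return "before"
    if b2 < a1:
        return "after"
    if a2 == b1:
        return "meets"
    if b2 == a1:
        return "met-by"
    if a1 == b1 and a2 == b2:
        return "equals"
    if a1 == b1:
        return "starts" if a2 < b2 else "started-by"
    if a2 == b2:
        return "finishes" if a1 > b1 else "finished-by"
    if b1 < a1 and a2 < b2:
        return "during"
    if a1 < b1 and b2 < a2:
        return "contains"
    # Only the two partial-overlap cases remain at this point.
    return "overlaps" if a1 < b1 else "overlapped-by"


def rectangle_relation(r, s):
    """Pair of Allen relations (x-axis, y-axis) for two rectangles.

    Rectangles are (x_min, y_min, x_max, y_max) tuples; this layout is
    an assumption chosen for the example.
    """
    rel_x = allen_relation((r[0], r[2]), (s[0], s[2]))
    rel_y = allen_relation((r[1], r[3]), (s[1], s[3]))
    return rel_x, rel_y


# A block directly above another block of the same width
# (y grows downward, as in image coordinates).
upper = (0, 0, 10, 5)
lower = (0, 5, 10, 12)
print(rectangle_relation(upper, lower))
```

A relation pair such as "same horizontal extent, vertically meets" is the kind of qualitative fact from which a reading-order rule could be stated, although the rules actually used by the authors are not given in the abstract.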
Thick 2D Relations for Document Understanding
We use a propositional language of qualitative rectangle relations to detect the reading order from document images. To this end, we define the notion of a document encoding rule and analyze possible formalisms for expressing such rules, including LaTeX and SGML. Document encoding rules expressed in the propositional language of rectangles are used to build a reading order detector for document images. To achieve robustness and avoid brittleness when applying the system to real-life document images, we introduce the notion of a thick boundary interpretation for a qualitative relation. The framework is tested on a collection of heterogeneous document images, showing recall rates of up to 89%.
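The abstract does not define the thick boundary interpretation, so the following is only one plausible reading, sketched under stated assumptions: interval endpoints that lie within a tolerance `tol` of each other are treated as coinciding, so that small segmentation errors do not flip "meets" into "before" or "overlaps". The function name, the `tol` parameter, and the assumption that every interval is longer than `tol` are all hypothetical and not taken from the paper.

```python
def thick_allen_relation(a, b, tol):
    """Allen relation of interval a to b with tolerant endpoint comparison.

    Endpoints closer than `tol` count as equal (assumed reading of a
    "thick boundary"). Intervals are (start, end) tuples and are assumed
    to be longer than `tol`.
    """
    a1, a2 = a
    b1, b2 = b

    def eq(x, y):
        # Endpoints coincide within the thick boundary.
        return abs(x - y) <= tol

    def lt(x, y):
        # x is clearly smaller than y, by more than the tolerance.
        return y - x > tol

    if lt(a2, b1):
        return "before"
    if lt(b2, a1):
        return "after"
    if eq(a2, b1):
        return "meets"
    if eq(b2, a1):
        return "met-by"
    if eq(a1, b1) and eq(a2, b2):
        return "equals"
    if eq(a1, b1):
        return "starts" if lt(a2, b2) else "started-by"
    if eq(a2, b2):
        return "finishes" if lt(b1, a1) else "finished-by"
    if lt(b1, a1) and lt(a2, b2):
        return "during"
    if lt(a1, b1) and lt(b2, a2):
        return "contains"
    return "overlaps" if lt(a1, b1) else "overlapped-by"


# Two text blocks separated by a one-pixel gap.
print(thick_allen_relation((0, 100), (101, 200), tol=0))
print(thick_allen_relation((0, 100), (101, 200), tol=3))
```

With `tol=0` the function reduces to the crisp Allen relations, which makes it easy to compare crisp and tolerant readings on the same pair of blocks.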
Automated Problem Domain Cognition Process in Information Systems Design
An automated cognitive approach for the design of Information Systems is presented. It is intended for
use at the very beginning of the design process, between the stages of requirements determination and
analysis, including the stage of analysis itself. Either UML or ERD notation may be used for model
representation. The approach allows natural language text documents to be used as a source of knowledge
for automated problem domain model generation. It also simplifies the process of modelling by assisting
the human user throughout the whole period of work on the model (using UML or ERD notation).
Adaptive Methods for Robust Document Image Understanding
A vast amount of digital document material is continuously being produced as part of major digitization efforts around the world. In this context, generic and efficient automatic solutions for document image understanding represent a stringent necessity. We propose a generic framework for document image understanding systems, usable for practically any document type available in digital form. Following the introduced workflow, we shift our attention to each of the following processing stages in turn: quality assurance, image enhancement, color reduction and binarization, skew and orientation detection, page segmentation and logical layout analysis. We review the state of the art in each area, identify current deficiencies, point out promising directions and give specific guidelines for future investigation. We address some of the identified issues by means of novel algorithmic solutions, putting special focus on generality, computational efficiency and the exploitation of all available sources of information. More specifically, we introduce the following original methods: a fully automatic detection of color reference targets in digitized material, accurate foreground extraction from color historical documents, font enhancement for hot metal typeset prints, a theoretically optimal solution for the document binarization problem from the point of view of both computational complexity and threshold selection, a layout-independent skew and orientation detection, a robust and versatile page segmentation method, a semi-automatic front page detection algorithm and a complete framework for article segmentation in periodical publications. The proposed methods are experimentally evaluated on large datasets consisting of real-life heterogeneous document scans. The obtained results show that a document understanding system combining these modules is able to robustly process a wide variety of documents with good overall accuracy.