
    Semantics-Based Content Extraction in Typewritten Historical Documents

    This paper presents a flexible approach to extracting content from scanned historical documents using semantic information. The final electronic document is the result of a "digital historical document lifecycle" process in which the expert knowledge of the historian/archivist user is incorporated at different stages. Results show that such a conversion strategy, aided by (expert) user-specified semantic information and able to process individual parts of the document in a specialised way, produces results superior, in a variety of significant ways, to those of document analysis and understanding techniques devised for contemporary documents.
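    For illustration only (the paper gives no code, and all names below are hypothetical), the idea of processing individual document parts in a specialised way driven by expert-supplied semantic labels can be sketched as a simple dispatch:

        # Minimal sketch, not the authors' implementation: expert-supplied semantic
        # labels decide which specialised routine handles each region of a scanned page.
        from dataclasses import dataclass
        from typing import Callable, Dict, List

        @dataclass
        class Region:
            label: str           # semantic label supplied by the historian/archivist
            image_crop: object   # cropped page image, e.g. a NumPy array

        def ocr_typewritten(region: Region) -> str:
            return "<typewritten text>"        # placeholder for a typewriter-tuned OCR pass

        def transcribe_annotation(region: Region) -> str:
            return "<handwritten annotation>"  # placeholder for handwriting transcription

        def describe_stamp(region: Region) -> str:
            return "<stamp metadata>"          # placeholder for stamp/seal description

        HANDLERS: Dict[str, Callable[[Region], str]] = {
            "body_text": ocr_typewritten,
            "marginal_note": transcribe_annotation,
            "stamp": describe_stamp,
        }

        def extract_content(regions: List[Region]) -> List[str]:
            # Each region is processed by the routine registered for its semantic label.
            return [HANDLERS[r.label](r) for r in regions]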

    A Comprehensive Study of ImageNet Pre-Training for Historical Document Image Analysis

    Automatic analysis of scanned historical documents comprises a wide range of image analysis tasks, which are often challenging for machine learning due to a lack of human-annotated learning samples. With the advent of deep neural networks, a promising way to cope with the lack of training data is to pre-train models on images from a different domain and then fine-tune them on historical documents. In current research, a typical example of such cross-domain transfer learning is the use of neural networks that have been pre-trained on the ImageNet database for object recognition. It remains a largely open question whether or not this pre-training helps to analyse historical documents, which have fundamentally different image properties from those of ImageNet. In this paper, we present a comprehensive empirical survey of the effect of ImageNet pre-training on diverse historical document analysis tasks, including character recognition, style classification, manuscript dating, semantic segmentation, and content-based retrieval. While we obtain mixed results for semantic segmentation at the pixel level, we observe a clear trend across different network architectures that ImageNet pre-training has a positive effect on classification as well as content-based retrieval.
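    As a hedged illustration of the transfer-learning recipe the paper studies (dataset, class count and hyper-parameters below are assumptions; the paper's per-task setups differ), a PyTorch/torchvision sketch:

        # Fine-tuning an ImageNet-pre-trained CNN on a historical-document
        # classification task (e.g. manuscript dating bins). Requires torchvision >= 0.13.
        import torch
        import torch.nn as nn
        from torchvision import models

        num_classes = 10  # assumed number of dating bins / style classes

        # Load ResNet-50 with ImageNet weights and replace the classification head.
        model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        model.fc = nn.Linear(model.fc.in_features, num_classes)

        # Fine-tune the whole network with a small learning rate; alternatively,
        # freeze the backbone and train only the new head.
        optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
        criterion = nn.CrossEntropyLoss()

        def training_step(images: torch.Tensor, labels: torch.Tensor) -> float:
            model.train()
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
            return loss.item()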

    Summary document of the work being done in 1992-1993 regarding the migratory salmonid fisheries of the River Ribble catchment

    This document lists the work on salmonid fisheries in the River Ribble catchment in the north west of England carried out in 1992-93. The document includes the following: analysis of historical catch data from the River Ribble covering 1937-1991; Ribble and Hodder stock assessment surveys 1992-93, including Swanside Beck; Ribble catch data from game angling log books 1993; results of salmon scale sampling 1993; results of the microtagging programme in the Hodder and Ribble 1993; Waddow Weir fish counter data 1993; Winkley Hall (Hodder), Locks Weir and Waddow Hall (Ribble) fish counter data 1993; and numbers of fish and ova from Witcherwell Hatchery 1993.

    Text Line Segmentation of Historical Documents: a Survey

    There are huge numbers of historical documents in libraries and in various national archives that have not been exploited electronically. Although automatic reading of complete pages remains, in most cases, a long-term objective, tasks such as word spotting, text/image alignment, authentication and extraction of specific fields are in use today. For all these tasks, a major step is document segmentation into text lines. Because of the low quality and the complexity of these documents (background noise, artifacts due to aging, interfering lines), automatic text line segmentation remains an open research field. The objective of this paper is to present a survey of existing methods, developed during the last decade, dedicated to documents of historical interest.
    Comment: 25 pages, submitted version. To appear in the International Journal on Document Analysis and Recognition. Online version available at http://www.springerlink.com/content/k2813176280456k3
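    As a hedged illustration of one classic baseline that such surveys cover (not a method proposed by the paper), horizontal projection profiles can split a deskewed, binarised page into line bands; it is exactly the kind of method that breaks down on the touching and interfering lines the paper discusses:

        # Projection-profile text line segmentation sketch.
        # binary_page: 2-D boolean array, True = ink pixel, page roughly deskewed.
        import numpy as np

        def segment_lines(binary_page: np.ndarray, min_ink: int = 5):
            """Return (top, bottom) row intervals of candidate text lines."""
            profile = binary_page.sum(axis=1)   # ink pixels per image row
            is_text_row = profile > min_ink     # rows that plausibly contain text
            lines, start = [], None
            for y, flag in enumerate(is_text_row):
                if flag and start is None:
                    start = y                   # a line band begins
                elif not flag and start is not None:
                    lines.append((start, y))    # the line band ends
                    start = None
            if start is not None:
                lines.append((start, len(is_text_row)))
            return lines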

    The PAGE (Page Analysis and Ground-Truth Elements) format framework

    There is a plethora of established and proposed document representation formats, but none that can adequately support the individual stages within an entire sequence of document image analysis methods (from document image enhancement to layout analysis to OCR) and their evaluation. This paper describes PAGE, a new XML-based page image representation framework that records information on image characteristics (image borders, geometric distortions and corresponding corrections, binarisation etc.) in addition to layout structure and page content. The suitability of the framework for the evaluation of entire workflows as well as individual stages has been extensively validated by its use in high-profile applications such as public contemporary and historical ground-truthed datasets and the ICDAR Page Segmentation competition series.
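    Purely as a schematic illustration (the real PAGE schema is considerably richer, with metadata, border, distortion-correction and reading-order elements; the element and attribute names below are simplified rather than taken from the official XSD), a PAGE-like region/line record could be built as follows:

        # Building a heavily simplified, PAGE-like XML page description.
        import xml.etree.ElementTree as ET

        page = ET.Element("Page", imageFilename="scan_0001.tif",
                          imageWidth="2480", imageHeight="3508")

        region = ET.SubElement(page, "TextRegion", id="r1", type="paragraph")
        ET.SubElement(region, "Coords", points="120,200 1100,200 1100,640 120,640")

        line = ET.SubElement(region, "TextLine", id="r1_l1")
        ET.SubElement(line, "Coords", points="120,200 1100,200 1100,260 120,260")
        text = ET.SubElement(line, "TextEquiv")
        ET.SubElement(text, "Unicode").text = "First transcribed line of the region."

        print(ET.tostring(page, encoding="unicode"))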

    An Introductory Review of a Structural VAR-X Estimation and Applications

    This document shows how to estimate and implement a structural VAR-X model under long-run and impact identification restrictions. Estimation by Bayesian and maximum likelihood methods is presented. Applications of the structural VAR-X to impulse response functions to structural shocks, multiplier analysis of the exogenous variables, forecast error variance decomposition and historical decomposition of the endogenous variables are also described, as well as a method for computing HPD regions in a Bayesian context. Some of the concepts are exemplified with an application to US data.
    Keywords: S-VAR, B-VAR, VAR-X, IRF, FEVD, historical decomposition. JEL classification: C11, C18, C32.
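    As background, in generic notation (ours, not necessarily the paper's), a structural VAR-X relates a vector of endogenous variables y_t to its own lags and to current and lagged exogenous variables x_t:

        A_0 y_t = \sum_{i=1}^{p} A_i y_{t-i} + \sum_{j=0}^{q} B_j x_{t-j} + \varepsilon_t,
        \qquad \operatorname{E}[\varepsilon_t \varepsilon_t'] = I .

    Premultiplying by A_0^{-1} gives the reduced form that is actually estimated,

        y_t = \sum_{i=1}^{p} \Phi_i y_{t-i} + \sum_{j=0}^{q} \Theta_j x_{t-j} + u_t,
        \qquad u_t = A_0^{-1} \varepsilon_t,

    so identification amounts to restricting A_0^{-1} (impact restrictions) or the long-run matrix (I - \sum_{i} \Phi_i)^{-1} A_0^{-1} (long-run restrictions); impulse responses, the forecast error variance decomposition and the historical decomposition then follow from the implied moving-average representation.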

    Open Evaluation Tool for Layout Analysis of Document Images

    This paper presents an open tool for standardizing the evaluation of the layout analysis of document images at the pixel level. We introduce a new evaluation tool that is available both as a standalone Java application and as a RESTful web service. The tool is free and open-source so that it can serve as a common tool that anyone can use and contribute to. It aims to provide as many metrics as possible for investigating layout analysis predictions, and also provides an easy way of visualizing the results. The tool evaluates document segmentation at the pixel level and supports multi-labeled pixel ground truth. Finally, it has been successfully used for the ICDAR2017 competition on Layout Analysis for Challenging Medieval Manuscripts.
    Comment: The 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), HIP: 4th International Workshop on Historical Document Imaging and Processing, Kyoto, Japan, 2017
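    As a hedged sketch of what a pixel-level, multi-label evaluation computes (the actual tool is a Java application / REST service; the function below is an illustrative re-implementation of one typical metric, not the tool's code), per-class intersection-over-union over boolean masks:

        # pred, gt: boolean arrays of shape (num_classes, height, width);
        # a pixel may be True in several class planes (multi-labeled ground truth).
        import numpy as np

        def per_class_iou(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
            """Return one IoU value per class."""
            inter = np.logical_and(pred, gt).sum(axis=(1, 2)).astype(float)
            union = np.logical_or(pred, gt).sum(axis=(1, 2)).astype(float)
            # Classes absent from both prediction and ground truth count as perfect.
            return np.divide(inter, union, out=np.ones_like(inter), where=union > 0)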

    Historical Document Digitization through Layout Analysis and Deep Content Classification

    Document layout segmentation and recognition is an important task in the creation of digitized document collections, especially when dealing with historical documents. This paper presents a hybrid approach to layout segmentation as well as a strategy for classifying document regions, applied to the digitization of a historical encyclopedia. Our layout analysis method merges a classic top-down approach with a bottom-up classification process based on local geometrical features, while regions are classified by means of features extracted from a Convolutional Neural Network and fed into a Random Forest classifier. Experiments are conducted on the first volume of the "Enciclopedia Treccani", a large dataset containing 999 manually annotated pages from the historical Italian encyclopedia.
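    A hedged sketch of the general "CNN features into a Random Forest" recipe (backbone choice, feature dimensionality and labels below are assumptions, not the authors' exact pipeline):

        # Describe candidate page regions with CNN features, classify with a Random Forest.
        # Requires torchvision >= 0.13 and scikit-learn.
        import numpy as np
        import torch
        from torchvision import models
        from sklearn.ensemble import RandomForestClassifier

        # ImageNet-pre-trained ResNet-18 used as a fixed feature extractor (fc head dropped).
        backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
        backbone = torch.nn.Sequential(*list(backbone.children())[:-1])
        backbone.eval()

        def region_features(crops: torch.Tensor) -> np.ndarray:
            """crops: (N, 3, 224, 224) normalised region images -> (N, 512) features."""
            with torch.no_grad():
                return backbone(crops).flatten(1).numpy()

        # Random Forest trained on features of manually labelled regions
        # (labels such as "text", "figure", "table" are placeholders).
        clf = RandomForestClassifier(n_estimators=200, random_state=0)
        # clf.fit(region_features(train_crops), train_labels)
        # predicted_types = clf.predict(region_features(page_crops))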

    Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers

    The massive amounts of digitized historical documents acquired over the last decades naturally lend themselves to automatic processing and exploration. Research efforts seeking to automatically process facsimiles and extract information from them are multiplying, with document layout analysis as a first essential step. While the identification and categorization of segments of interest in document images have seen significant progress over the last years thanks to deep learning techniques, many challenges remain, among them the use of finer-grained segmentation typologies and the handling of complex, heterogeneous documents such as historical newspapers. Moreover, most approaches consider visual features only, ignoring the textual signal. In this context, we introduce a multimodal approach to the semantic segmentation of historical newspapers that combines visual and textual features. Based on a series of experiments on diachronic Swiss and Luxembourgish newspapers, we investigate, among other things, the predictive power of visual and textual features and their capacity to generalize across time and sources. Results show consistent improvement of multimodal models over a strong visual baseline, as well as better robustness to high material variance.
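    A rough sketch of the fusion idea (architecture details are our assumptions, not the paper's model): visual feature maps from a CNN encoder are concatenated channel-wise with a map of text embeddings (e.g. word embeddings of OCR tokens rasterised onto the pixels they cover) before a small segmentation head:

        import torch
        import torch.nn as nn

        class MultimodalSegHead(nn.Module):
            def __init__(self, visual_ch: int, text_dim: int, num_classes: int):
                super().__init__()
                self.head = nn.Sequential(
                    nn.Conv2d(visual_ch + text_dim, 64, kernel_size=3, padding=1),
                    nn.ReLU(inplace=True),
                    nn.Conv2d(64, num_classes, kernel_size=1),
                )

            def forward(self, visual_feats: torch.Tensor, text_map: torch.Tensor):
                # visual_feats: (B, visual_ch, H, W); text_map: (B, text_dim, H, W)
                fused = torch.cat([visual_feats, text_map], dim=1)
                return self.head(fused)   # per-pixel class scores

        # Example shapes: 256 visual channels, 300-d word embeddings, 5 region classes.
        head = MultimodalSegHead(256, 300, 5)
        scores = head(torch.randn(1, 256, 64, 64), torch.randn(1, 300, 64, 64))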