4,291 research outputs found
Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers
The massive amounts of digitized historical documents acquired over the last
decades naturally lend themselves to automatic processing and exploration.
Research work seeking to automatically process facsimiles and extract
information thereby are multiplying with, as a first essential step, document
layout analysis. If the identification and categorization of segments of
interest in document images have seen significant progress over the last years
thanks to deep learning techniques, many challenges remain with, among others,
the use of finer-grained segmentation typologies and the consideration of
complex, heterogeneous documents such as historical newspapers. Besides, most
approaches consider visual features only, ignoring textual signal. In this
context, we introduce a multimodal approach for the semantic segmentation of
historical newspapers that combines visual and textual features. Based on a
series of experiments on diachronic Swiss and Luxembourgish newspapers, we
investigate, among others, the predictive power of visual and textual features
and their capacity to generalize across time and sources. Results show
consistent improvement of multimodal models in comparison to a strong visual
baseline, as well as better robustness to high material variance
Statistical parsing of morphologically rich languages (SPMRL): what, how and whither
The term Morphologically Rich Languages (MRLs) refers to languages in which significant information concerning syntactic units and relations is expressed at word-level. There is ample evidence that the application of readily available statistical parsing models to such languages is susceptible to serious performance degradation. The first workshop on statistical parsing of MRLs hosts a variety of contributions which show that despite language-specific idiosyncrasies, the problems associated with parsing MRLs cut across languages and parsing frameworks. In this paper we review the current state-of-affairs with respect to parsing MRLs and point out central challenges. We synthesize the contributions of researchers working on parsing Arabic, Basque, French, German, Hebrew, Hindi and Korean to point out shared solutions across languages. The overarching analysis suggests itself as a source of directions for future investigations
GROUNDTRUTH GENERATION AND DOCUMENT IMAGE DEGRADATION
The problem of generating synthetic data for the training and evaluation of document analysis systems has been widely addressed in recent years. With the increased interest in processing multilingual sources, however, there is a tremendous need to be able to rapidly generate data in new languages and scripts, without the need to develop specialized systems. We have developed a system, which uses language support of the MS Windows operating system combined with custom print drivers to render tiff images simultaneously with windows Enhanced Metafile directives. The metafile information is parsed to generate zone, line, word, and character ground truth including location, font information and content in any language supported by Windows. The resulting images can be physically or synthetically degraded by our degradation modules, and used for training and evaluating Optical Character Recognition (OCR) systems. Our document image degradation methodology incorporates several often-encountered types of noise at the page and pixel levels. Examples of OCR evaluation and synthetically degraded document images are given to demonstrate the effectiveness
Grid’5000 Based Large Scale OCR Using the DTW Algorithm: Case of the Arabic Cursive Writing
International audienc
- …