A Unified Multilingual Handwriting Recognition System using multigrams sub-lexical units
We address the design of a unified multilingual system for handwriting recognition. Most multilingual systems rest on specialized models that are trained on a single language, one of which is selected at test time. While some recognition systems are based on a unified optical model, dealing with a unified language model remains a major issue, as traditional language models are generally trained on corpora composed of large word lexicons per language. Here, we bring a solution by considering language models based on sub-lexical units, called multigrams. Dealing with multigrams strongly reduces the lexicon size and thus decreases the language model complexity. This makes possible the design of an end-to-end unified multilingual recognition system where both a single optical model and a single language model are trained on all the languages. We discuss the impact of the language unification on each model and show that our system reaches the performance of state-of-the-art methods with a strong reduction in complexity.
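The short Python sketch below, not taken from the paper, illustrates the sub-lexical idea behind multigrams: words are decomposed by greedy longest-match over a small unit inventory, so a single compact vocabulary can cover words from several languages. The MULTIGRAMS inventory and the segment() function are invented for illustration; the paper's actual multigram estimation is not reproduced here.

# Toy illustration (not the paper's algorithm): segmenting words into a small
# inventory of multigram units so that a single sub-lexical vocabulary replaces
# a large per-language word lexicon.
MULTIGRAMS = {"re", "con", "tion", "ing", "er", "un", "ter",
              "a", "e", "i", "o", "u", "s", "t", "n", "r", "c", "d", "g", "l", "m", "p"}

def segment(word, units=MULTIGRAMS, max_len=4):
    """Greedy longest-match segmentation of a word into multigram units."""
    pieces, i = [], 0
    while i < len(word):
        for length in range(min(max_len, len(word) - i), 0, -1):
            candidate = word[i:i + length]
            if candidate in units:
                pieces.append(candidate)
                i += length
                break
        else:
            pieces.append(word[i])  # fall back to the raw character
            i += 1
    return pieces

for w in ["recognition", "consideration", "unter"]:
    print(w, "->", segment(w))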
Logical segmentation for article extraction in digitized old newspapers
Newspapers are documents made of news items and informative articles. They are not meant to be read linearly: the reader can pick items in any order he fancies. Ignoring this structural property, most digitized newspaper archives only offer access to their content by issue or, at best, by page. We have built a digitization workflow that automatically extracts newspaper articles from images, which allows indexing and retrieval of information at the article level. Our back-end system extracts the logical structure of the page to produce the informative units: the articles. Each image is labelled at the pixel level through a machine-learning-based method, then the page logical structure is constructed from there by detecting structuring entities such as horizontal and vertical separators, titles and text lines. This logical structure is stored in a METS wrapper associated with the ALTO file produced by the system, which includes the OCRed text. Our front-end system provides high-definition web visualisation of images, textual indexing and retrieval facilities, and searching and reading at the article level. Article transcriptions can be collaboratively corrected, which in turn allows for better indexing. We are currently testing our system on the archives of the Journal de Rouen, one of France's oldest local newspapers. These 250 years of publication amount to 300,000 pages of very variable image quality and layout complexity. Test year 1808 can be consulted at plair.univ-rouen.fr.
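As a rough, hypothetical sketch of the back-end logic described above (pixel labelling followed by logical-structure construction), the Python fragment below groups detected structuring entities into articles; the Block and Article classes and the grouping rule are illustrative stand-ins, not the actual implementation behind plair.univ-rouen.fr.

# Hypothetical sketch: once separators, titles and text lines have been
# detected, every title opens a new article and the following text lines are
# attached to it until the next title or separator.
from dataclasses import dataclass, field

@dataclass
class Block:
    kind: str    # "separator", "title" or "textline"
    bbox: tuple  # (x0, y0, x1, y1) in page coordinates

@dataclass
class Article:
    title: Block
    lines: list = field(default_factory=list)

def group_into_articles(blocks):
    articles, current = [], None
    for block in sorted(blocks, key=lambda b: (b.bbox[1], b.bbox[0])):
        if block.kind == "title":
            current = Article(title=block)
            articles.append(current)
        elif block.kind == "separator":
            current = None
        elif block.kind == "textline" and current is not None:
            current.lines.append(block)
    return articles

page = [Block("title", (0, 0, 400, 40)),
        Block("textline", (0, 50, 400, 70)),
        Block("separator", (0, 80, 400, 82)),
        Block("title", (0, 90, 400, 130)),
        Block("textline", (0, 140, 400, 160))]
for a in group_into_articles(page):
    print(a.title.bbox, len(a.lines), "line(s)")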
Multiple Document Datasets Pre-training Improves Text Line Detection With Deep Neural Networks
In this paper, we introduce a fully convolutional network for the document layout analysis task. While state-of-the-art methods use models pre-trained on natural scene images, our method, Doc-UFCN, relies on a U-shaped model trained from scratch for detecting objects in historical documents. We consider the line segmentation task, and more generally the layout analysis problem, as a pixel-wise classification task, so our model outputs a pixel labelling of the input images. We show that Doc-UFCN outperforms state-of-the-art methods on various datasets and also demonstrate that parts pre-trained on natural scene images are not required to reach good results. In addition, we show that pre-training on multiple document datasets can improve performance. We evaluate the models using various metrics to allow a fair and complete comparison between the methods.
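To make the "U-shaped model for pixel-wise classification" concrete, here is a minimal PyTorch sketch of such an architecture; channel sizes, depth and the number of classes are illustrative choices, not the published Doc-UFCN configuration.

# Minimal U-shaped fully convolutional network producing one score map per
# pixel class (e.g. background, text line, separator).
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

class TinyUFCN(nn.Module):
    def __init__(self, n_classes=3):
        super().__init__()
        self.enc1, self.enc2 = conv_block(3, 32), conv_block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(64, 128)
        self.up2 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec2 = conv_block(128, 64)
        self.up1 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec1 = conv_block(64, 32)
        self.head = nn.Conv2d(32, n_classes, 1)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.head(d1)  # (N, n_classes, H, W) pixel-wise logits

logits = TinyUFCN()(torch.randn(1, 3, 256, 256))
print(logits.shape)  # torch.Size([1, 3, 256, 256])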
Handwritten Document Analysis for Automatic Writer Recognition
In this paper, we show that both the writer identification and the writer verification tasks can be carried out using local features such as graphemes extracted from the segmentation of cursive handwriting. We thus enlarge the scope of possible uses of these two tasks, which have, up to now, mainly been evaluated on script handwritings. A text-based Information Retrieval model is used for the writer identification stage. This allows the use of a particular feature space based on feature frequencies. Image queries are handwritten documents projected into this feature space. The approach achieves 95% correct identification on the PSI_DataBase and 86% on the IAM_DataBase. The writer hypotheses retrieved are then analysed during a verification phase. We call upon a mutual information criterion to verify whether two documents may have been produced by the same writer. Hypothesis testing is used for this purpose. The proposed method is first scaled on the PSI_DataBase and then evaluated on the IAM_DataBase. On both databases, similar performance of nearly 96% correct verification is reported, making the approach general and very promising for large-scale applications in the domain of handwritten document querying and writer verification.
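A minimal sketch of the retrieval idea, assuming grapheme occurrences have already been counted per document: documents are represented by tf-idf-weighted grapheme frequencies and identification ranks the reference documents by cosine similarity. The counts below are synthetic, and the mutual-information verification step is not reproduced.

# Writer identification as document retrieval over grapheme-frequency vectors.
import numpy as np

def tfidf(counts):
    """counts: (n_docs, n_graphemes) raw grapheme frequencies."""
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    df = (counts > 0).sum(axis=0)
    idf = np.log((1 + counts.shape[0]) / (1 + df)) + 1
    return tf * idf

def identify(query_counts, reference_counts):
    X = tfidf(np.vstack([reference_counts, query_counts]))
    refs, query = X[:-1], X[-1]
    sims = refs @ query / (np.linalg.norm(refs, axis=1) * np.linalg.norm(query) + 1e-12)
    return np.argsort(-sims)  # reference documents ranked by similarity

rng = np.random.default_rng(0)
refs = rng.poisson(2.0, size=(5, 50))        # 5 reference writers, 50 grapheme types
query = refs[2] + rng.poisson(0.5, size=50)  # query document close to writer 2
print(identify(query, refs))                 # writer 2 expected near the top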
DAN: a Segmentation-free Document Attention Network for Handwritten Document Recognition
Unconstrained handwritten text recognition is a challenging computer vision
task. It is traditionally handled by a two-step approach, combining line
segmentation followed by text line recognition. For the first time, we propose
an end-to-end segmentation-free architecture for the task of handwritten
document recognition: the Document Attention Network. In addition to text
recognition, the model is trained to label text parts using begin and end tags
in an XML-like fashion. This model is made up of an FCN encoder for feature
extraction and a stack of transformer decoder layers for a recurrent
token-by-token prediction process. It takes whole text documents as input and
sequentially outputs characters, as well as logical layout tokens. Contrary to existing segmentation-based approaches, the model is trained without using any segmentation label. We achieve competitive results on the READ 2016 dataset at page level as well as double-page level, with CERs of 3.43% and 3.70%, respectively. We also provide results for the RIMES 2009 dataset at page level, reaching a CER of 4.54%.
We provide all source code and pre-trained model weights at
https://github.com/FactoDeepLearning/DAN
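The decoding principle (a transformer decoder attending to flattened visual features and emitting characters plus XML-like layout tags token by token) can be sketched as below; the vocabulary, tag set and dimensions are toy values, and this is not the released DAN implementation.

# Toy greedy token-by-token decoder over a stand-in feature memory.
import torch
import torch.nn as nn

VOCAB = ["<sot>", "<eot>", "<page>", "</page>", "<p>", "</p>"] + list("abcdefghij ")
EOT, D_MODEL = 1, 64

class ToyDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(len(VOCAB), D_MODEL)
        layer = nn.TransformerDecoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, len(VOCAB))

    def forward(self, tokens, memory):
        seq_len = tokens.size(1)
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        out = self.decoder(self.embed(tokens), memory, tgt_mask=causal)
        return self.head(out)

@torch.no_grad()
def greedy_decode(model, memory, max_len=20):
    tokens = torch.zeros(1, 1, dtype=torch.long)  # index 0 = <sot>
    for _ in range(max_len):
        next_id = model(tokens, memory)[0, -1].argmax().item()
        tokens = torch.cat([tokens, torch.tensor([[next_id]])], dim=1)
        if next_id == EOT:                         # stop at <eot>
            break
    return [VOCAB[i] for i in tokens[0].tolist()]

memory = torch.randn(1, 100, D_MODEL)  # stands in for the flattened FCN feature map
print(greedy_decode(ToyDecoder(), memory))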
Stochastic gradient descent with gradient estimator for categorical features
Categorical data are present in key areas such as health or supply chain, and such data require specific treatment. In order to apply recent machine learning models to such data, encoding is needed. In order to build interpretable models, one-hot encoding is still a very good solution, but such encoding creates sparse data. Gradient estimators are not suited to sparse data: the gradient is mostly treated as zero while in fact it simply does not always exist, so a novel gradient estimator is introduced. We show what this estimator minimizes in theory and show its efficiency on different datasets with multiple model architectures. This new estimator performs better than common estimators under similar settings. A real-world retail dataset is also released after anonymization. Overall, the aim of this paper is to thoroughly consider categorical data and adapt models and optimizers to these key features.
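The sketch below illustrates the setting the abstract describes, not the proposed estimator: with one-hot encoded categories, a standard SGD step on a linear model produces a gradient that is zero for every inactive category, which is exactly the sparsity issue the paper addresses.

# Plain SGD on one-hot-encoded categorical data: only the active category's
# weight receives a non-zero gradient at each step.
import numpy as np

def sgd_step(w, x_onehot, y, lr=0.1):
    """One SGD step for squared loss on a linear model w . x."""
    pred = w @ x_onehot
    grad = 2 * (pred - y) * x_onehot  # zero for every inactive category
    return w - lr * grad

categories = ["store_A", "store_B", "store_C"]
w = np.zeros(len(categories))
rng = np.random.default_rng(0)
true_effect = np.array([1.0, -0.5, 2.0])
for _ in range(200):
    c = rng.integers(len(categories))
    x = np.eye(len(categories))[c]    # one-hot encoding of the category
    y = true_effect[c] + rng.normal(scale=0.1)
    w = sgd_step(w, x, y)
print(dict(zip(categories, np.round(w, 2))))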
Adaptation of Hidden Markov Models - Application to Printed Character Recognition
We present in this paper a new algorithm for the adaptation of hidden Markov models (HMMs). The principle of our iterative adaptation algorithm is to alternate an HMM structure adaptation stage with an HMM Gaussian MAP adaptation stage. This algorithm is applied to the recognition of printed characters, in order to adapt the models learned by a polyfont character recognition engine to new character shapes. Comparing the results with those of classic MAP and MLLR adaptations shows a slight increase in the performance of the recognition system.
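A minimal sketch of the Gaussian MAP update used in such an adaptation stage, restricted to the means (the alternating HMM structure-adaptation stage is not shown); the prior weight tau and the toy data are illustrative.

# MAP re-estimation of a Gaussian mean from adaptation frames: the result is a
# compromise between the original model's mean and the adaptation data,
# weighted by the occupation counts.
import numpy as np

def map_adapt_mean(mu_prior, frames, gamma, tau=10.0):
    """mu_prior: (d,) mean of the original (polyfont) model
    frames   : (T, d) adaptation observations
    gamma    : (T,) state/component occupation probabilities
    """
    occ = gamma.sum()
    weighted_sum = gamma @ frames
    return (tau * mu_prior + weighted_sum) / (tau + occ)

rng = np.random.default_rng(1)
mu_prior = np.zeros(2)
frames = rng.normal(loc=[1.0, 2.0], scale=0.3, size=(50, 2))
gamma = np.ones(50)                              # full occupancy for the demo
print(map_adapt_mean(mu_prior, frames, gamma))   # pulled toward [1, 2]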
Prediction of Selection Decision of Document Using Bibliographic Data at the National Library of France (BnF)
The selection of documents is a very important step in mass digitization projects. This is especially true at the BnF, where the digitization should or should not include OCRization depending on the expected OCR results. Consequently, the selection task is very complex and time consuming, due to the number of documents to be processed and the diversity of the selection criteria to consider. In an attempt to improve and simplify this task through automation, we studied the relationship between bibliographic data and the selection decisions on documents. We used two statistical analyses: a factor analysis of correspondence and a multiple correspondence analysis. Our analysis has shown that, for example, the documents in format "4 or GR FOL" and edited "between 1961 and 1990" in Morocco are more likely to be "Selected". However, the documents in format "16 or 8" and edited "between 1871 and 1800" in English or Spanish have a greater chance of being "Not Selected".
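Purely as an illustration of inspecting such associations (the paper uses factor analysis of correspondence and multiple correspondence analysis, which are not reproduced here), the snippet below builds a contingency table between one bibliographic attribute and the selection decision and tests their dependence; the data values are invented.

# Contingency table between document format and selection decision, with a
# chi-square test of independence.
import pandas as pd
from scipy.stats import chi2_contingency

records = pd.DataFrame({
    "format":   ["4 or GR FOL"] * 6 + ["16 or 8"] * 6,
    "decision": ["Selected"] * 5 + ["Not Selected"] +
                ["Not Selected"] * 4 + ["Selected"] * 2,
})
table = pd.crosstab(records["format"], records["decision"])
chi2, p, dof, _ = chi2_contingency(table)
print(table)
print(f"chi2={chi2:.2f}, p={p:.3f}")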
Detecting Articles in a Digitized Finnish Historical Newspaper Collection 1771–1929: Early Results Using the PIVAJ Software
This paper describes the first large-scale article detection and extraction efforts on the Finnish Digi newspaper material of the National Library of Finland (NLF), using data from one newspaper, Uusi Suometar 1869-1898. The historical digital newspaper archive environment of the NLF is based on commercial docWorks software. The software is capable of article detection and extraction, but our material does not seem to behave well in the system in this respect. Therefore, we have been in search of an alternative article segmentation system and have now focused our efforts on the PIVAJ machine-learning-based platform developed at the LITIS laboratory of the University of Rouen Normandy. As training and evaluation data for PIVAJ we chose one newspaper, Uusi Suometar. We established a data set that contains 56 issues of the newspaper from the years 1869-1898 with 4 pages each, i.e. 224 pages in total. Given the selected set of 56 issues, our first data annotation and experiment phase consisted of annotating a subset of 28 issues (112 pages) and conducting preliminary experiments. After the preliminary experiments, we revised the annotation of the first 28 issues accordingly. Subsequently, we annotated the remaining 28 issues. We then divided the annotated set into training and evaluation sets of 168 and 56 pages. We trained PIVAJ successfully and evaluated the results using the layout evaluation software developed by the PRImA research laboratory of the University of Salford. The results of our experiments show that PIVAJ achieves success rates of 67.9, 76.1, and 92.2 for the whole data set of 56 pages with the three different evaluation scenarios introduced in [6]. On the whole, the results seem reasonable considering the varying layouts of the different issues of Uusi Suometar along the time scale of the data.