23 research outputs found

Extraction d’information dans des documents manuscrits anciens

Author: Granet Adeline
Publication venue: HAL CCSD
Publication date: 12/12/2018

Exploring untapped but newly digitized resources to find relevant information is made difficult by the sheer amount of material available. Thanks to the ANR project CIRESFI, the most important resource available on the Italian Comedy (Comédie-Italienne) of the 18th century is a set of accounting registers totaling 28,000 pages. Information extraction is a long and complex process that requires expertise at every step: detection and segmentation into paragraphs, lines, or words; feature extraction; and handwriting recognition. Systems based on deep neural networks dominate these approaches, but they need a large amount of data for training, and the registers of the Italian Comedy have no ground truth. To overcome this lack of data, we explore approaches based on transfer learning: the systems are trained on heterogeneous, already labeled and available data that shares at least one common feature with our data, and are then applied to our data. All of our experiments showed how difficult this task is, since each choice at each stage has a strong impact on the rest of the system. We converge on a solution that separates the optical model from the language model, so that each can be trained independently on different available resources, and joins them through a projection of the information into a common non-latent space.

Thèses en Ligne

Hal-Diderot

Extracting information in old handwritten documents

Author: Granet Adeline
Publication date: 12/12/2018

Exploring untapped but newly digitized resources to find relevant information is made difficult by the sheer amount of material available. Thanks to the ANR project CIRESFI, the most important resource available on the Italian Comedy (Comédie-Italienne) of the 18th century is a set of accounting registers totaling 28,000 pages. Information extraction is a long and complex process that requires expertise at every step: detection and segmentation into paragraphs, lines, or words; feature extraction; and handwriting recognition. Systems based on deep neural networks dominate these approaches, but they need a large amount of data for training, and the registers of the Italian Comedy have no ground truth. To overcome this lack of data, we explore approaches based on transfer learning: the systems are trained on heterogeneous, already labeled and available data that shares at least one common feature with our data, and are then applied to our data. All of our experiments showed how difficult this task is, since each choice at each stage has a strong impact on the rest of the system. We converge on a solution that separates the optical model from the language model, so that each can be trained independently on different available resources, and joins them through a projection of the information into a common non-latent space.

Theses.fr

Transfer Learning for Structures Spotting in Unlabeled Handwritten Documents using Randomly Generated Documents

Author: Granet Adeline
Mouchère Harold
Roman-Jimenez Geoffrey
Viard-Gaudin Christian
Publication venue: Scitepress
Publication date: 16/01/2018

Despite recent achievements in handwritten text recognition due to major advances in deep neural networks, the analysis of historical handwritten documents remains a challenging problem because it requires large annotated training databases. In this context, transferring knowledge from neural networks pre-trained on already available labeled data could allow us to process new collections of documents. In this study, we focus on localizing structures at the word level, distinguishing words from numbers, in unlabeled handwritten documents. We base our approach on a transductive transfer learning paradigm, using a deep convolutional neural network pre-trained on artificial labeled images randomly generated from strokes and from word and number patches. The model is designed to predict a pixel-level mask of the structure positions directly from the pixel values, and was trained on 100,000 generated images. Its classification performance was assessed on randomly generated images built from a different set of word and digit images. At the pixel level, the average accuracy of the proposed structure detection system reaches 96.1%. We evaluated the transfer capability of the model on two datasets of real handwritten documents unseen during training. The results show that the model distinguishes most "digit" structures from "word" structures while ignoring the various other structures present in the documents, which indicates good transferability to real documents.
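As an illustration of the pixel-level evaluation mentioned in this abstract, the sketch below compares a predicted label mask with a reference mask. It is only a minimal sketch under stated assumptions: the label encoding (0 = background, 1 = word, 2 = digits) and the function names are hypothetical, and averaging per image is one possible reading of "average accuracy"; the abstract does not say how the reported 96.1% was averaged.

```python
from typing import List, Sequence, Tuple

# Hypothetical label encoding, not taken from the paper:
# 0 = background, 1 = word structure, 2 = digit structure.
Mask = List[List[int]]


def pixel_accuracy(predicted: Mask, reference: Mask) -> float:
    """Fraction of pixels whose predicted label equals the reference label."""
    if len(predicted) != len(reference):
        raise ValueError("masks must have the same height")
    correct = 0
    total = 0
    for pred_row, ref_row in zip(predicted, reference):
        if len(pred_row) != len(ref_row):
            raise ValueError("masks must have the same width")
        for pred_label, ref_label in zip(pred_row, ref_row):
            correct += int(pred_label == ref_label)
            total += 1
    if total == 0:
        raise ValueError("masks must not be empty")
    return correct / total


def mean_pixel_accuracy(pairs: Sequence[Tuple[Mask, Mask]]) -> float:
    """Average of the per-image accuracies (assumed averaging scheme)."""
    if not pairs:
        raise ValueError("at least one (predicted, reference) pair is required")
    scores = [pixel_accuracy(pred, ref) for pred, ref in pairs]
    return sum(scores) / len(scores)
```

A per-class average would weight rare classes such as digits more heavily than this per-image average does, which matters when most pixels are background.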

Transfer Learning for a Letter-Ngrams to Word Decoder in the Context of Historical Handwriting Recognition with Scarce Resources

Author: Granet Adeline
Morin Emmanuel
Mouchère Harold
Quiniou Solen
Viard-Gaudin Christian
Publication venue: HAL CCSD
Publication date: 20/08/2018

Lack of data can be an issue when beginning a new study on historical handwritten documents. To deal with this, we present the character-based decoder part of a multilingual approach based on transductive transfer learning for a historical handwriting recognition task on the Italian Comedy registers. The decoder must build, from a vector of letter n-grams, a sequence of characters that corresponds to a word. As training data, we created a new dataset from untapped resources that cover the same domain and period as our Italian Comedy data, and added resources sharing a domain, period, or language with it. We obtain a 97.42% Character Recognition Rate and an 86.57% Word Recognition Rate on our Italian Comedy data, despite a lexical coverage of 67% between the Italian Comedy data and the training data. These results show that an efficient system can be obtained by carefully selecting the datasets used for transfer learning.
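The two rates quoted in this abstract can be illustrated with a short sketch. The definitions used here are assumptions, since the abstract does not spell them out: the Character Recognition Rate is taken as one minus the total Levenshtein distance divided by the total reference length, and the Word Recognition Rate as the share of exact word matches. The function names are hypothetical.

```python
from typing import Sequence, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def recognition_rates(predictions: Sequence[str],
                      references: Sequence[str]) -> Tuple[float, float]:
    """Return (character rate, word rate) under the assumed definitions."""
    if len(predictions) != len(references) or not references:
        raise ValueError("need two non-empty lists of equal length")
    total_chars = sum(len(ref) for ref in references)
    if total_chars == 0:
        raise ValueError("references must contain at least one character")
    char_errors = sum(edit_distance(p, r) for p, r in zip(predictions, references))
    exact_words = sum(p == r for p, r in zip(predictions, references))
    return 1.0 - char_errors / total_chars, exact_words / len(references)
```

A single wrong character costs a whole word in the word rate but only one unit in the character rate, which is why the two figures can differ as much as they do in the abstract.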

Hal-Diderot

Transfer Learning for Handwriting Recognition on Historical Documents

Author: Granet Adeline
Morin Emmanuel
Mouchère Harold
Quiniou Solen
Viard-Gaudin Christian
Publication venue: HAL CCSD
Publication date: 16/01/2018

In this work, we investigate handwriting recognition on new historical handwritten documents using transfer learning. Establishing a manual ground truth for a new collection of handwritten documents is time-consuming, but it is needed to train and test recognition systems. We want to build a recognition system without performing this annotation step. Our research deals with transfer learning from heterogeneous datasets that have a ground truth and share common properties with a new dataset that has none. The main difficulties of transfer learning lie in changes in writing style, vocabulary, and named entities across centuries and datasets. In our experiment, we show how a CNN-BLSTM-CTC neural network behaves on the task of transcribing handwritten titles of plays of the Italian Comedy when trained on combinations of various datasets such as RIMES, George Washington, and Los Esposalles. We show that the choice of training datasets and of merging methods is decisive for the results of the transfer learning task.
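To make the CTC part of the CNN-BLSTM-CTC network more concrete, the sketch below shows greedy best-path decoding of per-frame scores: take the best label in each frame, merge consecutive repeats, then drop blanks. This is the simplest standard CTC decoding rule and is given as an assumption; the abstract does not state which decoding strategy the authors used. The blank symbol "-" and the function name are hypothetical.

```python
from itertools import groupby
from typing import Sequence

BLANK = "-"  # hypothetical symbol standing for the CTC blank label


def ctc_best_path(frame_scores: Sequence[Sequence[float]],
                  alphabet: Sequence[str]) -> str:
    """Greedy CTC decoding of a (time steps x labels) score matrix."""
    # 1. Pick the highest-scoring label in every frame.
    best_labels = [
        alphabet[max(range(len(row)), key=row.__getitem__)]
        for row in frame_scores
    ]
    # 2. Merge runs of identical consecutive labels.
    collapsed = [label for label, _ in groupby(best_labels)]
    # 3. Remove blanks; a blank between two equal letters keeps both.
    return "".join(label for label in collapsed if label != BLANK)
```

The blank is what lets the network emit doubled letters: the frame sequence a, a, blank, a decodes to "aa", whereas a, a, a decodes to "a".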

Separating Optical and Language Models through Encoder-Decoder Strategy for Transferable Handwriting Recognition

Author: Granet Adeline
Morin Emmanuel
Mouchère Harold
Quiniou Solen
Viard-Gaudin Christian
Publication venue: HAL CCSD
Publication date: 01/08/2018

Lack of data can be an issue when beginning a new study on historical handwritten documents. To deal with this, we propose a deep-learning-based recognizer that separates the optical and language models so that they can be trained separately on different resources. In this work, we present the optical encoder part of a multilingual transductive transfer learning approach applied to historical handwriting recognition. The optical encoder transforms the input word image into a non-latent space that depends only on letter n-grams, which makes it independent of the language. This transformation avoids embedding a language model and allows transfer learning across languages that use the same alphabet. The language decoder then builds a word, as a sequence of characters, from a vector of letter n-grams. Experiments show that separating the optical and language models can be a solution for multilingual transfer learning.
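The letter n-gram space that links the encoder and the decoder can be pictured with a small sketch. It is a simplified illustration under stated assumptions, not the authors' representation: n-grams of length 1 to 3, binary presence values, and an index built from a word list are all hypothetical choices, as are the function names.

```python
from typing import Dict, Iterable, List


def letter_ngrams(word: str, max_n: int = 3) -> List[str]:
    """All contiguous letter n-grams of length 1..max_n, shortest first."""
    return [word[i:i + n]
            for n in range(1, max_n + 1)
            for i in range(len(word) - n + 1)]


def build_index(words: Iterable[str], max_n: int = 3) -> Dict[str, int]:
    """Assign a fixed vector position to every n-gram seen in the word list."""
    grams = sorted({g for w in words for g in letter_ngrams(w, max_n)})
    return {gram: position for position, gram in enumerate(grams)}


def encode(word: str, index: Dict[str, int], max_n: int = 3) -> List[int]:
    """Binary vector marking which indexed n-grams occur in the word."""
    vector = [0] * len(index)
    for gram in letter_ngrams(word, max_n):
        if gram in index:  # n-grams outside the index are ignored
            vector[index[gram]] = 1
    return vector
```

Because each dimension names a concrete letter sequence, the space is interpretable (non-latent), and words from any language written in the same alphabet map into it in the same way.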

Décodeur neuronal pour la transcription de documents manuscrits anciens

Author: Granet Adeline
Morin Emmanuel
Mouchère Harold
Quiniou Solen
Viard-Gaudin Christian
Publication venue: HAL CCSD
Publication date: 16/05/2018

The absence of annotated data can be a major difficulty when analyzing historical handwritten documents. To work around it, we propose splitting the problem in two so that we can rely on more readily accessible data. In this article, we present the decoder part of a multimodal encoder-decoder that uses transfer learning to transcribe play titles of the Italian Comedy (Comédie-Italienne). The decoder transforms a vector of character-level n-grams into a sequence of characters corresponding to a word. Transfer learning is carried out mainly from a new, untapped resource that is contemporary with the Italian Comedy and thematically close to it, along with other resources covering other domains, different languages, and even different periods. We obtain 97.27% correctly recognized characters on the Italian Comedy data and 86.57% correctly generated words, despite a coverage of only 67.58% between the Italian Comedy data and the training set. The experiments show that such a system can be an effective approach for transfer learning. Keywords: neural model, transfer learning, transcription, Italian Comedy
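The coverage figure quoted in this abstract can be illustrated as follows. This is a sketch under an explicit assumption: coverage is counted here over distinct words (types), that is, the share of the target vocabulary that also appears in the training vocabulary. The abstract does not say whether the authors counted types or tokens, and the function name is hypothetical.

```python
from typing import Iterable


def lexical_coverage(target_words: Iterable[str],
                     training_words: Iterable[str]) -> float:
    """Share of distinct target words that also occur in the training data."""
    target_vocabulary = set(target_words)
    if not target_vocabulary:
        raise ValueError("the target word list must not be empty")
    training_vocabulary = set(training_words)
    shared = target_vocabulary & training_vocabulary
    return len(shared) / len(target_vocabulary)
```

With a word-level recognizer, coverage would cap the achievable word rate; a decoder that works from character n-grams can in principle produce words it never saw in training, which fits a word rate above the coverage figure.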

Hal-Diderot

Étude préliminaire de reconnaissance d'écriture sur des documents historiques

Author: Granet Adeline
Morin Emmanuel
Mouchère Harold
Quiniou Solen
Viard-Gaudin Christian
Publication venue: HAL CCSD
Publication date: 01/03/2017

This work addresses information extraction from the accounting registers of the Italian Comedy (Comédie-Italienne) of the 18th century. These documents contain valuable information for researchers in the humanities and social sciences who study the acculturation of Italian actors during that period. Information extraction from old documents that have never been studied before is a long and difficult process in which each step requires expertise: detection and segmentation into blocks, lines, or words; feature extraction; and handwriting recognition. BLSTM recurrent neural networks with CTC decoding are among the most popular and promising approaches to handwriting recognition, both for aligning a transcription with an input sequence and for producing a recognition result. This paper presents a preliminary study of this kind of network on a first task: recognizing play titles in multilingual (French and Italian) historical documents, using a closed vocabulary that consists mainly of named entities.

Hal-Diderot