Using definite clause grammars to build a global system for analyzing collections of documents
International audience. Collections of documents are sets of heterogeneous documents, such as a specific ancient book series, whose structural and semantic properties link them together. A given collection contains document images with specific physical layouts, such as text pages or full-page illustrations, appearing in a specific order. Its contents, such as journal articles, may span several pages that are not necessarily consecutive, which produces strong dependencies between page interpretations. In order to build an analysis system that can bring contextual information from the collection to the appropriate recognition modules for each page, we propose to express the structural and semantic properties of a collection with a definite clause grammar. This is made possible by representing collections as streams of document descriptors, and by using extensions to the formalism that we present here. We are then able to automatically generate a parser dedicated to a collection. Besides allowing structural variations and complex information flows, we also show that this approach enables the design of analysis stages, on a document or a set of documents. The value of using context is illustrated with several examples and their formalization in this framework.
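To make the idea of parsing a collection as a stream of document descriptors more concrete, here is a minimal sketch in Python. It is not the authors' system or their extended definite clause grammar formalism: the page labels (`cover`, `text`, `illustration`), the two grammar rules, and the function names are all hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical page descriptors: each page of the collection is reduced to a label.
Stream = List[str]


def parse_issue(stream: Stream, pos: int) -> Optional[Tuple[dict, int]]:
    """Rule (assumed): issue --> cover, content_page+
    where content_page is 'text' or 'illustration'."""
    if pos >= len(stream) or stream[pos] != "cover":
        return None
    end = pos + 1
    while end < len(stream) and stream[end] in ("text", "illustration"):
        end += 1
    if end == pos + 1:  # at least one content page is required
        return None
    return {"cover": pos, "pages": list(range(pos + 1, end))}, end


def parse_collection(stream: Stream) -> Optional[List[dict]]:
    """Rule (assumed): collection --> issue+ ; the whole stream must be consumed."""
    issues, pos = [], 0
    while pos < len(stream):
        result = parse_issue(stream, pos)
        if result is None:
            return None
        issue, pos = result
        issues.append(issue)
    return issues or None


pages = ["cover", "text", "illustration", "text", "cover", "text"]
issues = parse_collection(pages)
```

Each returned dictionary records which page indices belong to one issue, which is the kind of collection-level context a page recognizer could then receive. A stream that does not match the rules, such as one starting with a text page, is rejected with `None`.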
Mémoire visuelle pour l'analyse de documents structurés
International audience. Current document analysis methods propose systems that are adapted to recognize sets of documents of the same kind (statistical methods, grammatical analysis), but that are then applied to each page in isolation. However, in a collection context, it is important to make the most of the results obtained on one or several pages in order to improve the later processing of other pages. We therefore propose a framework, the visual memory, which enables existing recognition systems to use contextual information for each page, locally and during analysis. We detail its specifications, which allow it to be implemented in any recognition system. Finally, we present an implementation in a grammatical system and describe the different information flow schemes that the visual memory permits.
Comment introduire simplement et uniformément deux modes d'interaction asynchrones complémentaires dans un système d'analyse de documents existant
National audience. Extracting content from degraded documents, in particular historical ones, is difficult. Existing analysis systems usually rely on manual correction of results during a post-processing stage, and do not use this new information to improve their response. This paper explains how to adapt an existing document analysis system so that it can interact with a human operator or other processes during the analysis stage, and exploit the external information provided at that point. We describe the minimal architecture required, and the two complementary interaction modes we propose: they are suitable for mass document processing and easy to implement. For the transcription of handwritten words in documents dating from the 18th century, our prototype shows a substantial reduction in the amount of manual work required.
Interprétation contextuelle et assistée de fonds d'archives numérisées (application à des registres de ventes du XVIIIe siècle)
Fonds, also called historical document collections, are large sets of digitized documents that are difficult to interpret automatically: usual approaches require a heavy design effort, yet still produce many errors that have to be corrected after processing. To cope with these limitations, our work aims to improve the interpretation process by using contextual information extracted from the fonds or provided by human operators, while keeping page-by-page processing. We propose a targeted extension of the page description that makes it possible to systematically set up information exchanges between the interpretation process and its environment. A global iterative mechanism progressively brings contextual information to this process, which improves interpretation. Using these new tools to process documents from the 18th century showed that our proposals were easy to integrate into an existing system, that its design remained simple, and that the correction effort could be reduced.
RENNES-INSA (352382210) / Sudoc, France
Contextual and assisted interpretation of digitized fonds : application to sales registers from the 18th century
Fonds, also called historical document collections, are large sets of digitized documents that are difficult to interpret automatically: usual approaches require a heavy design effort, yet still produce many errors that have to be corrected after processing. To cope with these limitations, our work aims to improve the interpretation process by using contextual information extracted from the fonds or provided by human operators, while keeping page-by-page processing. We propose a targeted extension of the page description that makes it possible to systematically set up information exchanges between the interpretation process and its environment. A global iterative mechanism progressively brings contextual information to this process, which improves interpretation. Using these new tools to process documents from the 18th century showed that our proposals were easy to integrate into an existing system, that its design remained simple, and that the correction effort could be reduced.
Interprétation contextuelle et assistée de fonds d'archives numérisées : application à des registres de ventes du XVIIIe siècle
Fonds, also called historical document collections, are large sets of digitized documents that are difficult to interpret automatically: usual approaches require a heavy design effort, yet still produce many errors that have to be corrected after processing. To cope with these limitations, our work aims to improve the interpretation process by using contextual information extracted from the fonds or provided by human operators, while keeping page-by-page processing. We propose a targeted extension of the page description that makes it possible to systematically set up information exchanges between the interpretation process and its environment. A global iterative mechanism progressively brings contextual information to this process, which improves interpretation. Using these new tools to process documents from the 18th century showed that our proposals were easy to integrate into an existing system, that its design remained simple, and that the correction effort could be reduced.
Iterative analysis of document collections enables efficient human-initiated interaction
International audience. Document analysis and recognition systems often fail to produce results of sufficient quality when processing sets of old and damaged documents, and require manual corrections to improve results. This paper presents how, using the iterative analysis of document pages we recently proposed, we can implement a spontaneous interaction model suitable for mass document processing. It enables human operators to detect and correct errors made by the automatic system, and reintegrates their corrections into subsequent steps of the iterative analysis process. Thus, a page analyzer can reprocess erroneous parts and those that depend on them, which avoids having to manually fix, during post-processing, all the consequences of errors made by the automatic system. After presenting the global system architecture and a prototype implementation of our proposal, we show that the document model can be simply enriched to enable the spontaneous interaction model we propose. We present how to use it in a practical example to correct under-segmentation issues during the localization of numbers in documents from the 18th century. The evaluations we conducted on this example show, on 50 pages containing 1637 numbers to localize, that the interaction model we propose can reduce human workload (29.8% fewer elements to provide) for a given target quality level when compared to manual post-processing.
Iterative Analysis of Pages in Document Collections for Efficient User Interaction
International audience. The analysis of sets of degraded documents, such as historical ones, is error-prone and requires human help to achieve acceptable quality levels. However, human interaction raises three main issues when processing large numbers of pages: neither the user nor the system should wait for work; information provided by a human operator should not be restricted to local, isolated corrections, but should rather produce durable changes in the system; and the ability to interact with a human operator should neither increase the complexity of document models nor duplicate them between the analysis and human interaction processes. To solve these issues, we propose an iterative approach, based on a special mechanism called visual memory, to reintegrate external information during page analysis. To demonstrate its value for existing systems, we explain how we adapted a rule-based page analysis tool to enable, in this iterative approach, a delayed interaction with a human operator, based on an adaptation of error recovery principles for compilers and of the well-known exception handling mechanism. We validated our iterative approach on sales registers from the 18th century.
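The combination of iterative analysis, a shared memory, and exception-style delayed interaction can be sketched as follows in Python. This is only an illustration under stated assumptions, not the authors' tool: the `NeedsInput` exception, the toy `analyze_page` function, the `"?"` marker for an unreadable field, and the dictionary used as a stand-in for the visual memory are all hypothetical.

```python
class NeedsInput(Exception):
    """Hypothetical exception: the analyzer needs external information."""

    def __init__(self, question):
        super().__init__(question)
        self.question = question


def analyze_page(page_id, fields, memory):
    """Toy analyzer: a field marked '?' can only be read from the memory."""
    result = []
    for index, field in enumerate(fields):
        if field == "?":
            key = (page_id, index)
            if key not in memory:
                raise NeedsInput(key)  # give up on this page for now
            field = memory[key]
        result.append(field)
    return result


def iterate(pages, ask_operator):
    """Re-analyze pending pages until none of them needs external input."""
    memory, results = {}, {}
    pending = list(pages)
    while pending:
        questions, still_pending = [], []
        for page_id in pending:
            try:
                results[page_id] = analyze_page(page_id, pages[page_id], memory)
            except NeedsInput as request:
                # The system moves on to other pages instead of waiting.
                questions.append(request.question)
                still_pending.append(page_id)
        # Answers are collected in a batch and stored durably in the memory.
        for key in questions:
            memory[key] = ask_operator(key)
        pending = still_pending
    return results


pages = {"p1": ["12", "?", "7"], "p2": ["3", "4"]}
results = iterate(pages, ask_operator=lambda key: "42")
```

In this sketch the page model is written once, in `analyze_page`; interaction is added only through the exception and the memory lookup, which mirrors the stated goal of not duplicating document models between analysis and interaction.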
Exploiting Collection Level for Improving Assisted Handwritten Words Transcription of Historical Documents
International audience. Transcription of handwritten words in historical documents is still a difficult task. When processing huge numbers of pages, document-centered approaches are limited by the trade-off between automatic recognition errors and the tedious nature of human annotation work. In this article, we investigate the use of inter-page dependencies to overcome these limitations. To this end, we propose a new architecture that exploits the redundancy of handwritten words across pages by considering documents from a higher point of view, namely the collection level. The experiments we conducted on handwritten word transcription show promising results in terms of reductions in both recognition errors and human work.
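One way to picture how word redundancy across pages could reduce annotation work is the Python sketch below. The abstract does not describe the actual architecture, so everything here is an assumption made for illustration: word images are supposed to have already been grouped into clusters of visually similar words (the `cluster` identifiers), and an operator transcribes one representative per cluster, whose label is then propagated to the other members.

```python
from collections import defaultdict


def propagate(words, ask_operator):
    """words: list of (page, word_id, cluster) tuples; cluster ids are assumed given.

    Returns the transcription of every word and the number of operator requests.
    """
    clusters = defaultdict(list)
    for page, word_id, cluster in words:
        clusters[cluster].append((page, word_id))

    transcription, requests = {}, 0
    for members in clusters.values():
        label = ask_operator(members[0])  # one request per cluster
        requests += 1
        for member in members:
            transcription[member] = label  # reuse the answer across pages
    return transcription, requests


# Hypothetical data: the same word appears on three different pages.
words = [(1, "w1", "A"), (2, "w7", "A"), (3, "w2", "A"), (3, "w5", "B")]
answers = {(1, "w1"): "livres", (3, "w5"): "sols"}
transcription, requests = propagate(words, ask_operator=answers.get)
```

With this toy data, four words are transcribed with two operator requests. How well such propagation works in practice depends entirely on the quality of the grouping, which this sketch takes for granted.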