711 research outputs found

    An iterative multimodal framework for the transcription of handwritten historical documents

    Full text link
    [EN] The transcription of historical documents is one of the most interesting tasks in which Handwritten Text Recognition can be applied, due to its interest in humanities research. One alternative for transcribing the ancient manuscripts is the use of speech dictation by using Automatic Speech Recognition techniques. In the two alternatives similar models (Hidden Markov Models and n-grams) and decoding processes (Viterbi decoding) are employed, which allows a possible combination of the two modalities with little diffi- culties. In this work, we explore the possibility of using recognition results of one modality to restrict the decoding process of the other modality, and apply this process iteratively. Results of these multimodal iterative alternatives are significantly better than the baseline uni-modal systems and better than the non-iterative alternatives. 2012 Elsevier B.V. All rights reserved.Work supported by the EC (FEDER/FSE) and the Spanish MEC/MICINN under the MIPRCV ’’Consolider Ingenio 2010’’ program (CSD2007-00018), iTrans2 (TIN2009–14511) and MITTRAL (TIN2009-14633-C03–01) projects. Also supported by the Spanish MITyC under the erudito.com (TSI-020110-2009-439) project and by the Generalitat Valenciana under grant GV/2010/067, and by the UPV under project PAID-05-11-2779 and grant UPV/2009/2851.Alabau, V.; Martínez Hinarejos, CD.; Romero Gómez, V.; Lagarda Arroyo, AL. (2014). An iterative multimodal framework for the transcription of handwritten historical documents. Pattern Recognition Letters. 35:195-203. https://doi.org/10.1016/j.patrec.2012.11.007S1952033

    Advances on the Transcription of Historical Manuscripts based on Multimodality, Interactivity and Crowdsourcing

    Full text link
    Natural Language Processing (NLP) is an interdisciplinary research field of Computer Science, Linguistics, and Pattern Recognition that studies, among others, the use of human natural languages in Human-Computer Interaction (HCI). Most of NLP research tasks can be applied for solving real-world problems. This is the case of natural language recognition and natural language translation, that can be used for building automatic systems for document transcription and document translation. Regarding digitalised handwritten text documents, transcription is used to obtain an easy digital access to the contents, since simple image digitalisation only provides, in most cases, search by image and not by linguistic contents (keywords, expressions, syntactic or semantic categories). Transcription is even more important in historical manuscripts, since most of these documents are unique and the preservation of their contents is crucial for cultural and historical reasons. The transcription of historical manuscripts is usually done by paleographers, who are experts on ancient script and vocabulary. Recently, Handwritten Text Recognition (HTR) has become a common tool for assisting paleographers in their task, by providing a draft transcription that they may amend with more or less sophisticated methods. This draft transcription is useful when it presents an error rate low enough to make the amending process more comfortable than a complete transcription from scratch. Thus, obtaining a draft transcription with an acceptable low error rate is crucial to have this NLP technology incorporated into the transcription process. The work described in this thesis is focused on the improvement of the draft transcription offered by an HTR system, with the aim of reducing the effort made by paleographers for obtaining the actual transcription on digitalised historical manuscripts. This problem is faced from three different, but complementary, scenarios: · Multimodality: The use of HTR systems allow paleographers to speed up the manual transcription process, since they are able to correct on a draft transcription. Another alternative is to obtain the draft transcription by dictating the contents to an Automatic Speech Recognition (ASR) system. When both sources (image and speech) are available, a multimodal combination is possible and an iterative process can be used in order to refine the final hypothesis. · Interactivity: The use of assistive technologies in the transcription process allows one to reduce the time and human effort required for obtaining the actual transcription, given that the assistive system and the palaeographer cooperate to generate a perfect transcription. Multimodal feedback can be used to provide the assistive system with additional sources of information by using signals that represent the whole same sequence of words to transcribe (e.g. a text image, and the speech of the dictation of the contents of this text image), or that represent just a word or character to correct (e.g. an on-line handwritten word). · Crowdsourcing: Open distributed collaboration emerges as a powerful tool for massive transcription at a relatively low cost, since the paleographer supervision effort may be dramatically reduced. Multimodal combination allows one to use the speech dictation of handwritten text lines in a multimodal crowdsourcing platform, where collaborators may provide their speech by using their own mobile device instead of using desktop or laptop computers, which makes it possible to recruit more collaborators.El Procesamiento del Lenguaje Natural (PLN) es un campo de investigación interdisciplinar de las Ciencias de la Computación, Lingüística y Reconocimiento de Patrones que estudia, entre otros, el uso del lenguaje natural humano en la interacción Hombre-Máquina. La mayoría de las tareas de investigación del PLN se pueden aplicar para resolver problemas del mundo real. Este es el caso del reconocimiento y la traducción del lenguaje natural, que se pueden utilizar para construir sistemas automáticos para la transcripción y traducción de documentos. En cuanto a los documentos manuscritos digitalizados, la transcripción se utiliza para facilitar el acceso digital a los contenidos, ya que la simple digitalización de imágenes sólo proporciona, en la mayoría de los casos, la búsqueda por imagen y no por contenidos lingüísticos. La transcripción es aún más importante en el caso de los manuscritos históricos, ya que la mayoría de estos documentos son únicos y la preservación de su contenido es crucial por razones culturales e históricas. La transcripción de manuscritos históricos suele ser realizada por paleógrafos, que son personas expertas en escritura y vocabulario antiguos. Recientemente, los sistemas de Reconocimiento de Escritura (RES) se han convertido en una herramienta común para ayudar a los paleógrafos en su tarea, la cual proporciona un borrador de la transcripción que los paleógrafos pueden corregir con métodos más o menos sofisticados. Este borrador de transcripción es útil cuando presenta una tasa de error suficientemente reducida para que el proceso de corrección sea más cómodo que una completa transcripción desde cero. Por lo tanto, la obtención de un borrador de transcripción con una baja tasa de error es crucial para que esta tecnología de PLN sea incorporada en el proceso de transcripción. El trabajo descrito en esta tesis se centra en la mejora del borrador de transcripción ofrecido por un sistema RES, con el objetivo de reducir el esfuerzo realizado por los paleógrafos para obtener la transcripción de manuscritos históricos digitalizados. Este problema se enfrenta a partir de tres escenarios diferentes, pero complementarios: · Multimodalidad: El uso de sistemas RES permite a los paleógrafos acelerar el proceso de transcripción manual, ya que son capaces de corregir en un borrador de la transcripción. Otra alternativa es obtener el borrador de la transcripción dictando el contenido a un sistema de Reconocimiento Automático de Habla. Cuando ambas fuentes están disponibles, una combinación multimodal de las mismas es posible y se puede realizar un proceso iterativo para refinar la hipótesis final. · Interactividad: El uso de tecnologías asistenciales en el proceso de transcripción permite reducir el tiempo y el esfuerzo humano requeridos para obtener la transcripción correcta, gracias a la cooperación entre el sistema asistencial y el paleógrafo para obtener la transcripción perfecta. La realimentación multimodal se puede utilizar en el sistema asistencial para proporcionar otras fuentes de información adicionales con señales que representen la misma secuencia de palabras a transcribir (por ejemplo, una imagen de texto, o la señal de habla del dictado del contenido de dicha imagen de texto), o señales que representen sólo una palabra o carácter a corregir (por ejemplo, una palabra manuscrita mediante una pantalla táctil). · Crowdsourcing: La colaboración distribuida y abierta surge como una poderosa herramienta para la transcripción masiva a un costo relativamente bajo, ya que el esfuerzo de supervisión de los paleógrafos puede ser drásticamente reducido. La combinación multimodal permite utilizar el dictado del contenido de líneas de texto manuscrito en una plataforma de crowdsourcing multimodal, donde los colaboradores pueden proporcionar las muestras de habla utilizando su propio dispositivo móvil en lugar de usar ordenadores,El Processament del Llenguatge Natural (PLN) és un camp de recerca interdisciplinar de les Ciències de la Computació, la Lingüística i el Reconeixement de Patrons que estudia, entre d'altres, l'ús del llenguatge natural humà en la interacció Home-Màquina. La majoria de les tasques de recerca del PLN es poden aplicar per resoldre problemes del món real. Aquest és el cas del reconeixement i la traducció del llenguatge natural, que es poden utilitzar per construir sistemes automàtics per a la transcripció i traducció de documents. Quant als documents manuscrits digitalitzats, la transcripció s'utilitza per facilitar l'accés digital als continguts, ja que la simple digitalització d'imatges només proporciona, en la majoria dels casos, la cerca per imatge i no per continguts lingüístics (paraules clau, expressions, categories sintàctiques o semàntiques). La transcripció és encara més important en el cas dels manuscrits històrics, ja que la majoria d'aquests documents són únics i la preservació del seu contingut és crucial per raons culturals i històriques. La transcripció de manuscrits històrics sol ser realitzada per paleògrafs, els quals són persones expertes en escriptura i vocabulari antics. Recentment, els sistemes de Reconeixement d'Escriptura (RES) s'han convertit en una eina comuna per ajudar els paleògrafs en la seua tasca, la qual proporciona un esborrany de la transcripció que els paleògrafs poden esmenar amb mètodes més o menys sofisticats. Aquest esborrany de transcripció és útil quan presenta una taxa d'error prou reduïda perquè el procés de correcció siga més còmode que una completa transcripció des de zero. Per tant, l'obtenció d'un esborrany de transcripció amb un baixa taxa d'error és crucial perquè aquesta tecnologia del PLN siga incorporada en el procés de transcripció. El treball descrit en aquesta tesi se centra en la millora de l'esborrany de la transcripció ofert per un sistema RES, amb l'objectiu de reduir l'esforç realitzat pels paleògrafs per obtenir la transcripció de manuscrits històrics digitalitzats. Aquest problema s'enfronta a partir de tres escenaris diferents, però complementaris: · Multimodalitat: L'ús de sistemes RES permet als paleògrafs accelerar el procés de transcripció manual, ja que són capaços de corregir un esborrany de la transcripció. Una altra alternativa és obtenir l'esborrany de la transcripció dictant el contingut a un sistema de Reconeixement Automàtic de la Parla. Quan les dues fonts (imatge i parla) estan disponibles, una combinació multimodal és possible i es pot realitzar un procés iteratiu per refinar la hipòtesi final. · Interactivitat: L'ús de tecnologies assistencials en el procés de transcripció permet reduir el temps i l'esforç humà requerits per obtenir la transcripció real, gràcies a la cooperació entre el sistema assistencial i el paleògraf per obtenir la transcripció perfecta. La realimentació multimodal es pot utilitzar en el sistema assistencial per proporcionar fonts d'informació addicionals amb senyals que representen la mateixa seqüencia de paraules a transcriure (per exemple, una imatge de text, o el senyal de parla del dictat del contingut d'aquesta imatge de text), o senyals que representen només una paraula o caràcter a corregir (per exemple, una paraula manuscrita mitjançant una pantalla tàctil). · Crowdsourcing: La col·laboració distribuïda i oberta sorgeix com una poderosa eina per a la transcripció massiva a un cost relativament baix, ja que l'esforç de supervisió dels paleògrafs pot ser reduït dràsticament. La combinació multimodal permet utilitzar el dictat del contingut de línies de text manuscrit en una plataforma de crowdsourcing multimodal, on els col·laboradors poden proporcionar les mostres de parla utilitzant el seu propi dispositiu mòbil en lloc d'utilitzar ordinadors d'escriptori o portàtils, la qual cosa permet ampliar el nombrGranell Romero, E. (2017). Advances on the Transcription of Historical Manuscripts based on Multimodality, Interactivity and Crowdsourcing [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/86137TESI

    Interactive handwriting recognition with limited user effort

    Full text link
    The final publication is available at Springer via http://dx.doi.org/10.1007/s10032-013-0204-5[EN] Transcription of handwritten text in (old) documents is an important, time-consuming task for digital libraries. Although post-editing automatic recognition of handwritten text is feasible, it is not clearly better than simply ignoring it and transcribing the document from scratch. A more effective approach is to follow an interactive approach in which both the system is guided by the user, and the user is assisted by the system to complete the transcription task as efficiently as possible. Nevertheless, in some applications, the user effort available to transcribe documents is limited and fully supervision of the system output is not realistic. To circumvent these problems, we propose a novel interactive approach which efficiently employs user effort to transcribe a document by improving three different aspects. Firstly, the system employs a limited amount of effort to solely supervise recognised words that are likely to be incorrect. Thus, user effort is efficiently focused on the supervision of words for which the system is not confident enough. Secondly, it refines the initial transcription provided to the user by recomputing it constrained to user supervisions. In this way, incorrect words in unsupervised parts can be automatically amended without user supervision. Finally, it improves the underlying system models by retraining the system from partially supervised transcriptions. In order to prove these statements, empirical results are presented on two real databases showing that the proposed approach can notably reduce user effort in the transcription of handwritten text in (old) documents.The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under Grant Agreement No 287755 (transLectures). Also supported by the Spanish Government (MICINN, MITyC, "Plan E", under Grants MIPRCV "Consolider Ingenio 2010", MITTRAL (TIN2009-14633-C03-01), erudito.com (TSI-020110-2009-439), iTrans2 (TIN2009-14511), and FPU (AP2007-02867), and the Generalitat Valenciana (Grants Prometeo/2009/014 and GV/2010/067).Serrano Martinez Santos, N.; Giménez Pastor, A.; Civera Saiz, J.; Sanchis Navarro, JA.; Juan Císcar, A. (2014). Interactive handwriting recognition with limited user effort. International Journal on Document Analysis and Recognition. 17(1):47-59. https://doi.org/10.1007/s10032-013-0204-5S4759171Agua, M., Serrano, N., Civera, J., Juan, A.: Character-based handwritten text recognition of multilingual documents. In: Proceedings of Advances in Speech and Language Technologies for Iberian Languages (IBERSPEECH 2012), Madrid (Spain), pp. 187–196 (2012)Ahn, L.V., Maurer, B., Mcmillen, C., Abraham, D., Blum, M.: reCAPTCHA: human-based character recognition via web security measures. Science 321, 1465–1468 (2008)Barrachina, S., Bender, O., Casacuberta, F., Civera, J., Cubel, E., Khadivi, S., Lagarda, A.L., Ney, H., Tomás, J., Vidal, E.: Statistical approaches to computer-assisted translation. Comput. Linguist. 35(1), 3–28 (2009)Bertolami, R., Bunke, H.: Hidden markov model-based ensemble methods for offline handwritten text line recognition. Pattern Recognit. 41, 3452–3460 (2008)Bunke, H., Bengio, S., Vinciarelli, A.: Offline recognition of unconstrained handwritten texts using HMMs and statistical language models. IEEE Trans. Pattern Anal. Mach. Intell. 26(6), 709–720 (2004)Dreuw, P., Jonas, S., Ney, H.: White-space models for offline Arabic handwriting recognition. In: Proceedings of the 19th International Conference on, Pattern Recognition, pp. 1–4 (2008)Efron, B., Tibshirani, R.J.: An introduction to bootstrap. Chapman and Hall/CRC, London (1994)Fischer, A., Wuthrich, M., Liwicki, M., Frinken, V., Bunke, H., Viehhauser, G., Stolz, M.: Automatic transcription of handwritten medieval documents. In: Proceedings of the 15th International Conference on Virtual Systems and Multimedia, pp. 137–142 (2009)Frinken, V., Bunke, H.: Evaluating retraining rules for semi-supervised learning in neural network based cursive word recognition. In: Proceedings of the 10th International Conference on Document Analysis and Recognition, Barcelona (Spain), pp. 31–35 (2009)Graves, A., Liwicki, M., Fernandez, S., Bertolami, R., Bunke, H., Schmidhuber, J.: A novel connectionist system for unconstrained handwriting recognition. IEEE Trans. Pattern Anal. Mach. Intell. 31(5), 855–868 (2009)Hakkani-Tür, D., Riccardi, G., Tur, G.: An active approach to spoken language processing. ACM Trans. Speech Lang. Process. 3, 1–31 (2006)Kristjannson, T., Culotta, A., Viola, P., McCallum, A.: Interactive information extraction with constrained conditional random fields. In: Proceedings of the 19th Natural Conference on Artificial Intelligence, San Jose, CA (USA), pp. 412–418 (2004)Laurence Likforman-Sulem, A.Z., Taconet, B.: Text line segmentation of historical documents: a survey. Int. J. Doc. Anal. Recognit. 9, 123–138 (2007)Le Bourgeois, F., Emptoz, H.: Debora: digital access to books of the renaissance. Int. J. Doc. Anal. Recognit. 9, 193–221 (2007)Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. Sov. Phys. Dokl. 10(8), 707–710 (1966)Neal, R.M., Hinton, G.E.: Learning in graphical models. In: A View of the EM Algorithm That Justifies Incremental, Sparse, and Other Variants, Chap. MIT Press, Cambridge, MA, USA, pp. 355–368 (1999)Pérez, D., Tarazón, L., Serrano, N., Ramos-Terrades, O., Juan, A.: The GERMANA database. In: Proceedings of the 10th International Conference on Document Analysis and Recognition, Barcelona (Spain), pp. 301–305 (2009)Plötz, T., Fink, G.A.: Markov models for offline handwriting recognition: a survey. Int. J. Doc. Anal. Recognit. 12(4), 269–298 (2009)Quiniou, S., Cheriet, M., Anquetil, E.: Error handling approach using characterization and correction steps for handwritten document analysis. Int. J. Doc. Anal. Recognit. 15(2), 125–141 (2012)Rodríguez, L., García-Varea, I., Vidal, E.: Multi-modal computer assisted speech transcription. In: International Conference on Multimodal Interfaces and the Workshop on Machine Learning for Multimodal Interaction, ACM, New York, NY, USA, pp. 30:1–30:7 (2010)Serrano, N., Pérez, D., Sanchis, A., Juan, A.: Adaptation from partially supervised handwritten text transcriptions. In: Proceedings of the 11th International Conference on Multimodal Interfaces and the 6th Workshop on Machine Learning for Multimodal Interaction, Cambridge, MA (USA), pp. 289–292 (2009)Serrano, N., Castro, F., Juan, A.: The RODRIGO database. In: Proceedings of the 7th International Conference on Language Resources and Evaluation, Valleta (Malta), pp. 2709–2712 (2010)Serrano, N., Giménez, A., Sanchis, A., Juan, A.: Active learning strategies for handwritten text transcription. In: Proceedings of the 12th International Conference on Multimodal Interfaces and the 7th Workshop on Machine Learning for Multimodal, Interaction, Beijing (China) (2010)Serrano, N., Sanchis, A., Juan, A.: Balancing error and supervision effort in interactive-predictive handwriting recognition. In: Proceedings of the 15th International Conference on Intelligent User Interfaces, Hong Kong (China), pp. 373–376 (2010)Serrano, N., Tarazón, L., Pérez, D., Ramos-Terrades, O., Juan, A.: The GIDOC prototype. In: Proceedings of the 10th International Workshop on Pattern Recognition in Information Systems, Funchal (Portugal), pp. 82–89 (2010)Settles, B.: Active Learning Literature Survey. Computer Sciences Technical Report 1648, University of Wisconsin-Madison (2009)Tarazón, L., Pérez, D., Serrano, N., Alabau, V., Ramos-Terrades, O., Sanchis, A., Juan, A.: Confidence measures for error correction in interactive transcription of handwritten text. In: Proceedings of the 15th International Conference on Image Analysis, Processing, Vietri sul Mare (Italy) (2009)Toselli, A., Juan, A., Keysers, D., González, J., Salvador, I., Ney, H., Vidal, E., Casacuberta, F.: Integrated handwriting recognition and interpretation using finite-state models. Int. J. Pattern Recognit. Artif. Intell. 18(4), 519–539 (2004)Toselli, A., Romero, V., Rodríguez, L., Vidal, E.: Computer assisted transcription of handwritten text. In: Proceedings of the 9th International Conference on Document Analysis and Recognition, Curitiba (Brazil), pp. 944–948 (2007)Valor, J., Pérez, A., Civera, J., Juan, A.: Integrating a state-of-the-art ASR system into the opencast Matterhorn platform. In: Proceedings of the Advances in Speech and Language Technologies for Iberian Languages (IBERSPEECH 2012), Madrid (Spain), pp. 237–246 (2012)Wessel, F., Ney, H.: Unsupervised training of acoustic models for large vocabulary continuous speech recognition. IEEE Trans Speech Audio Process 13(1), 23–31 (2005

    Interactive Transcription of Old Text Documents

    Full text link
    Nowadays, there are huge collections of handwritten text documents in libraries all over the world. The high demand for these resources has led to the creation of digital libraries in order to facilitate the preservation and provide electronic access to these documents. However text transcription of these documents im- ages are not always available to allow users to quickly search information, or computers to process the information, search patterns or draw out statistics. The problem is that manual transcription of these documents is an expensive task from both economical and time viewpoints. This thesis presents a novel ap- proach for e cient Computer Assisted Transcription (CAT) of handwritten text documents using state-of-the-art Handwriting Text Recognition (HTR) systems. The objective of CAT approaches is to e ciently complete a transcription task through human-machine collaboration, as the e ort required to generate a manual transcription is high, and automatically generated transcriptions from state-of-the-art systems still do not reach the accuracy required. This thesis is centered on a special application of CAT, that is, the transcription of old text document when the quantity of user e ort available is limited, and thus, the entire document cannot be revised. In this approach, the objective is to generate the best possible transcription by means of the user e ort available. This thesis provides a comprehensive view of the CAT process from feature extraction to user interaction. First, a statistical approach to generalise interactive transcription is pro- posed. As its direct application is unfeasible, some assumptions are made to apply it to two di erent tasks. First, on the interactive transcription of hand- written text documents, and next, on the interactive detection of the document layout. Next, the digitisation and annotation process of two real old text documents is described. This process was carried out because of the scarcity of similar resources and the need of annotated data to thoroughly test all the developed tools and techniques in this thesis. These two documents were carefully selected to represent the general di culties that are encountered when dealing with HTR. Baseline results are presented on these two documents to settle down a benchmark with a standard HTR system. Finally, these annotated documents were made freely available to the community. It must be noted that, all the techniques and methods developed in this thesis have been assessed on these two real old text documents. Then, a CAT approach for HTR when user e ort is limited is studied and extensively tested. The ultimate goal of applying CAT is achieved by putting together three processes. Given a recognised transcription from an HTR system. The rst process consists in locating (possibly) incorrect words and employs the user e ort available to supervise them (if necessary). As most words are not expected to be supervised due to the limited user e ort available, only a few are selected to be revised. The system presents to the user a small subset of these words according to an estimation of their correctness, or to be more precise, according to their con dence level. Next, the second process starts once these low con dence words have been supervised. This process updates the recogni- tion of the document taking user corrections into consideration, which improves the quality of those words that were not revised by the user. Finally, the last process adapts the system from the partially revised (and possibly not perfect) transcription obtained so far. In this adaptation, the system intelligently selects the correct words of the transcription. As results, the adapted system will bet- ter recognise future transcriptions. Transcription experiments using this CAT approach show that this approach is mostly e ective when user e ort is low. The last contribution of this thesis is a method for balancing the nal tran- scription quality and the supervision e ort applied using our previously de- scribed CAT approach. In other words, this method allows the user to control the amount of errors in the transcriptions obtained from a CAT approach. The motivation of this method is to let users decide on the nal quality of the desired documents, as partially erroneous transcriptions can be su cient to convey the meaning, and the user e ort required to transcribe them might be signi cantly lower when compared to obtaining a totally manual transcription. Consequently, the system estimates the minimum user e ort required to reach the amount of error de ned by the user. Error estimation is performed by computing sepa- rately the error produced by each recognised word, and thus, asking the user to only revise the ones in which most errors occur. Additionally, an interactive prototype is presented, which integrates most of the interactive techniques presented in this thesis. This prototype has been developed to be used by palaeographic expert, who do not have any background in HTR technologies. After a slight ne tuning by a HTR expert, the prototype lets the transcribers to manually annotate the document or employ the CAT ap- proach presented. All automatic operations, such as recognition, are performed in background, detaching the transcriber from the details of the system. The prototype was assessed by an expert transcriber and showed to be adequate and e cient for its purpose. The prototype is freely available under a GNU Public Licence (GPL).Serrano Martínez-Santos, N. (2014). Interactive Transcription of Old Text Documents [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/37979TESI

    Advances in Document Layout Analysis

    Full text link
    [EN] Handwritten Text Segmentation (HTS) is a task within the Document Layout Analysis field that aims to detect and extract the different page regions of interest found in handwritten documents. HTS remains an active topic, that has gained importance with the years, due to the increasing demand to provide textual access to the myriads of handwritten document collections held by archives and libraries. This thesis considers HTS as a task that must be tackled in two specialized phases: detection and extraction. We see the detection phase fundamentally as a recognition problem that yields the vertical positions of each region of interest as a by-product. The extraction phase consists in calculating the best contour coordinates of the region using the position information provided by the detection phase. Our proposed detection approach allows us to attack both higher level regions: paragraphs, diagrams, etc., and lower level regions like text lines. In the case of text line detection we model the problem to ensure that the system's yielded vertical position approximates the fictitious line that connects the lower part of the grapheme bodies in a text line, commonly known as the baseline. One of the main contributions of this thesis, is that the proposed modelling approach allows us to include prior information regarding the layout of the documents being processed. This is performed via a Vertical Layout Model (VLM). We develop a Hidden Markov Model (HMM) based framework to tackle both region detection and classification as an integrated task and study the performance and ease of use of the proposed approach in many corpora. We review the modelling simplicity of our approach to process regions at different levels of information: text lines, paragraphs, titles, etc. We study the impact of adding deterministic and/or probabilistic prior information and restrictions via the VLM that our approach provides. Having a separate phase that accurately yields the detection position (base- lines in the case of text lines) of each region greatly simplifies the problem that must be tackled during the extraction phase. In this thesis we propose to use a distance map that takes into consideration the grey-scale information in the image. This allows us to yield extraction frontiers which are equidistant to the adjacent text regions. We study how our approach escalates its accuracy proportionally to the quality of the provided detection vertical position. Our extraction approach gives near perfect results when human reviewed baselines are provided.[ES] La Segmentación de Texto Manuscrito (STM) es una tarea dentro del campo de investigación de Análisis de Estructura de Documentos (AED) que tiene como objetivo detectar y extraer las diferentes regiones de interés de las páginas que se encuentran en documentos manuscritos. La STM es un tema de investigación activo que ha ganado importancia con los años debido a la creciente demanda de proporcionar acceso textual a las miles de colecciones de documentos manuscritos que se conservan en archivos y bibliotecas. Esta tesis entiende la STM como una tarea que debe ser abordada en dos fases especializadas: detección y extracción. Consideramos que la fase de detección es, fundamentalmente, un problema de clasificación cuyo subproducto son las posiciones verticales de cada región de interés. Por su parte, la fase de extracción consiste en calcular las mejores coordenadas de contorno de la región utilizando la información de posición proporcionada por la fase de detección. Nuestro enfoque de detección nos permite atacar tanto regiones de alto nivel (párrafos, diagramas¿) como regiones de nivel bajo (líneas de texto principalmente). En el caso de la detección de líneas de texto, modelamos el problema para asegurar que la posición vertical estimada por el sistema se aproxime a la línea ficticia que conecta la parte inferior de los cuerpos de los grafemas en una línea de texto, comúnmente conocida como línea base. Una de las principales aportaciones de esta tesis es que el enfoque de modelización propuesto nos permite incluir información conocida a priori sobre la disposición de los documentos que se están procesando. Esto se realiza mediante un Modelo de Estructura Vertical (MEV). Desarrollamos un marco de trabajo basado en los Modelos Ocultos de Markov (MOM) para abordar tanto la detección de regiones como su clasificación de forma integrada, así como para estudiar el rendimiento y la facilidad de uso del enfoque propuesto en numerosos corpus. Así mismo, revisamos la simplicidad del modelado de nuestro enfoque para procesar regiones en diferentes niveles de información: líneas de texto, párrafos, títulos, etc. Finalmente, estudiamos el impacto de añadir información y restricciones previas deterministas o probabilistas a través de el MEV propuesto que nuestro enfoque proporciona. Disponer de un método independiente que obtiene con precisión la posición de cada región detectada (líneas base en el caso de las líneas de texto) simplifica enormemente el problema que debe abordarse durante la fase de extracción. En esta tesis proponemos utilizar un mapa de distancias que tiene en cuenta la información de escala de grises de la imagen. Esto nos permite obtener fronteras de extracción que son equidistantes a las regiones de texto adyacentes. Estudiamos como nuestro enfoque aumenta su precisión de manera proporcional a la calidad de la detección y descubrimos que da resultados casi perfectos cuando se le proporcionan líneas de base revisadas por humanos.[CA] La Segmentació de Text Manuscrit (STM) és una tasca dins del camp d'investigació d'Anàlisi d'Estructura de Documents (AED) que té com a objectiu detectar I extraure les diferents regions d'interès de les pàgines que es troben en documents manuscrits. La STM és un tema d'investigació actiu que ha guanyat importància amb els anys a causa de la creixent demanda per proporcionar accés textual als milers de col·leccions de documents manuscrits que es conserven en arxius i biblioteques. Aquesta tesi entén la STM com una tasca que ha de ser abordada en dues fases especialitzades: detecció i extracció. Considerem que la fase de detecció és, fonamentalment, un problema de classificació el subproducte de la qual són les posicions verticals de cada regió d'interès. Per la seva part, la fase d'extracció consisteix a calcular les millors coordenades de contorn de la regió utilitzant la informació de posició proporcionada per la fase de detecció. El nostre enfocament de detecció ens permet atacar tant regions d'alt nivell (paràgrafs, diagrames ...) com regions de nivell baix (línies de text principalment). En el cas de la detecció de línies de text, modelem el problema per a assegurar que la posició vertical estimada pel sistema s'aproximi a la línia fictícia que connecta la part inferior dels cossos dels grafemes en una línia de text, comunament coneguda com a línia base. Una de les principals aportacions d'aquesta tesi és que l'enfocament de modelització proposat ens permet incloure informació coneguda a priori sobre la disposició dels documents que s'estan processant. Això es realitza mitjançant un Model d'Estructura Vertical (MEV). Desenvolupem un marc de treball basat en els Models Ocults de Markov (MOM) per a abordar tant la detecció de regions com la seva classificació de forma integrada, així com per a estudiar el rendiment i la facilitat d'ús de l'enfocament proposat en nombrosos corpus. Així mateix, revisem la simplicitat del modelatge del nostre enfocament per a processar regions en diferents nivells d'informació: línies de text, paràgrafs, títols, etc. Finalment, estudiem l'impacte d'afegir informació i restriccions prèvies deterministes o probabilistes a través del MEV que el nostre mètode proporciona. Disposar d'un mètode independent que obté amb precisió la posició de cada regió detectada (línies base en el cas de les línies de text) simplifica enormement el problema que ha d'abordar-se durant la fase d'extracció. En aquesta tesi proposem utilitzar un mapa de distàncies que té en compte la informació d'escala de grisos de la imatge. Això ens permet obtenir fronteres d'extracció que són equidistants de les regions de text adjacents. Estudiem com el nostre enfocament augmenta la seva precisió de manera proporcional a la qualitat de la detecció i descobrim que dona resultats quasi perfectes quan se li proporcionen línies de base revisades per humans.Bosch Campos, V. (2020). Advances in Document Layout Analysis [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/138397TESI

    Improving on-line handwritten recognition in interactive machine translation

    Full text link
    [EN] On-line handwriting text recognition (HTR) could be used as a more natural way of interaction in many interactive applications. However, current HTR technology is far from developing error-free systems and, consequently, its use in many applications is limited. Despite this, there are many scenarios, as in the correction of the errors of fully-automatic systems using HTR in a post-editing step, in which the information from the specific task allows to constrain the search and therefore to improve the HTR accuracy. For example, in machine translation (MT), the on-line HTR system can also be used to correct translation errors. The HTR can take advantage of information from the translation problem such as the source sentence that is translated, the portion of the translated sentence that has been supervised by the human, or the translation error to be amended. Empirical experimentation suggests that this is a valuable information to improve the robustness of the on-line HTR system achieving remarkable results.The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under Grant agreement no. 287576 (CasMaCat), from the EC (FEDER/FSE), and from the Spanish MEC/MICINN under the Active2Trans (TIN2012-31723) project. It is also supported by the Generalitat Valenciana under Grant ALMPR (Prometeo/2009/01) and GV/2010/067.Alabau Gonzalvo, V.; Sanchis Navarro, JA.; Casacuberta Nolla, F. (2014). Improving on-line handwritten recognition in interactive machine translation. Pattern Recognition. 47(3):1217-1228. https://doi.org/10.1016/j.patcog.2013.09.035S1217122847

    Multimodal output combination for transcribing historical handwritten documents

    Full text link
    The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-319-23192-1_21Transcription of digitalised historical documents is an interesting task in the document analysis area. This transcription can be achieved by using Handwritten Text Recognition (HTR) on digitalised pages or by using Automatic Speech Recognition (ASR) on the dictation of contents. Moreover, another option is using both systems in a multimodal combination to obtain a draft transcription, given that combining the outputs of different recognition systems will generally improve the recognition accuracy. In this work, we present a new combination method based on Confusion Network. We check its effectiveness for transcribing a Spanish historical book. Results on both unimodal combination with different optical (for HTR) and acoustic (for ASR) models, and multimodal combination, show a relative reduction of Word and Character Error Rate of 14.3% and 16.6%, respectively, over the HTR baseline.Work partially supported by European Union -7th FP, under grant 600707 (tranScriptorium), and by the Spanish MEC under projects STraDA (TIN2012-37475-C02-01), Active2Trans (TIN2012-31723), and SmartWays (RTC-2014-1466-4).Granell Romero, E.; Martínez-Hinarejos, C. (2015). Multimodal output combination for transcribing historical handwritten documents. En Computer Analysis of Images and Patterns. Springer. 246-260. https://doi.org/10.1007/978-3-319-23192-1_21S246260Alabau, V., Martínez-Hinarejos, C.D., Romero, V., Lagarda, A.L.: An iterative multimodal framework for the transcription of handwritten historical documents. Pattern Recognition Letters 35, 195–203 (2014)Bertolami, R., Halter, B., Bunke, H.: Combination of multiple handwritten text line recognition systems with a recursive approach. In: Proc. Int. Conf. Frontiers Handwriting Recognition, pp. 61–65 (2006)Bisani, M., Ney, H.: Bootstrap estimates for confidence intervals in ASR performance evaluation. In: Proc. of Int. Conf. on Acoustics, Speech and Signal Processing, vol. 1, pp. 409–412 (2004)Collobert, R., Bengio, S., Mariéthoz, J.: Torch: a modular machine learning software library. Tech. rep., IDIAP-RR 02–46, IDIAP (2002)Dreuw, P., Jonas, S., Ney, H.: White-space models for offline Arabic handwriting recognition. In: Proc. of Int. Conf. on Pattern Recognition, pp. 1–4 (2008)Hermansky, H., Ellis, D.P., Sharma, S.: Tandem connectionist feature extraction for conventional HMM systems. In: Proc. of Int. Conf. Acoustics, Speech and Signal Processing, vol. 3, pp. 1635–1638 (2000)Ishimaru, S., Nishizaki, H., Sekiguchi, Y.: Effect of confusion network combination on speech recognition system for editing. In: Proc. of APSIPA Annual Summit and Conf., vol. 4, pp. 1–4 (2011)Johnson, D.: ICSI Quicknet soft package (2004). http://www1.icsi.berkeley.edu/Speech/qn.htmlKneser, R., Ney, H.: Improved backing-off for m-gram language modeling. In: Proc. of Int. Conf. Acoustics, Speech and Signal Processing, vol. 1, pp. 181–184 (1995)Krishnamurthy, H.K.: Study of algorithms to combine multiple automatic speech recognition (ASR) system outputs. Master’s thesis, Department of Electrical and Computer Engineering (2009). http://hdl.handle.net/2047/d10019273Luján-Mares, M., Tamarit, V., Alabau, V., Martínez-Hinarejos, C.D., Pastor i Gadea, M., Sanchis, A., Toselli, A.H.: iATROS: a speech and handwritting recognition system. In: V Jornadas en Tecnologías del Habla (VJTH2008), pp. 75–78 (2008)Moreno, A., Poch, D., Bonafonte, A., Lleida, E., Llisterri, J., Mariño, J.B., Nadeu, C.: Albayzin speech database: design of the phonetic corpus. In: Proc. of EuroSpeech 1993, pp. 175–178 (1993)Netpbm home page. http://netpbm.sourceforge.net/Plamondon, R., Srihari, S.N.: On-Line and Off-Line Handwriting Recognition: A Comprehensive Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(1), 63–84 (2000)Romero, V., Leiva, L.A., Toselli, A.H., Vidal, E.: Interactive multimodal transcription of text images using a web-based demo system. In: Proc. of Conf. on Intelligent User Interfaces, pp. 477–478 (2009)Serrano, N., Castro, F., Juan, A.: The RODRIGO Database. In: Proc. of Language Resources and Evaluation Conference, pp. 2709–2712 (2010)Stolcke, A.: SRILM - an extensible language modeling toolkit. In: Proc. Interspeech, pp. 901–904 (2002)Woodruff, P., Dupont, S.: Bimodal combination of speech and handwriting for improved word recognition. In: Proc. of EUSIPCO 2005, pp. 1918–1921 (2005)Xue, J., Zhao, Y.: Improved confusion network algorithm and shortest path search from word lattice. In: Proc. of Int. Conf. in Acoustics, Speech and Signal Processing, vol. 1, pp. 853–856 (2005)Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Liu, X., Moore, G., Odell, J., Ollason, D., Povey, D., et al.: The HTK book (for HTK version 3.4). Cambridge university Eng. Dept. (2006)Zhai, C., Lafferty, J.: A study of smoothing methods for language models applied to information retrieval. Transactions on Information Systems 22(2), 179–214 (2004

    Multimodal Interactive Transcription of Handwritten Text Images

    Full text link
    En esta tesis se presenta un nuevo marco interactivo y multimodal para la transcripción de Documentos manuscritos. Esta aproximación, lejos de proporcionar la transcripción completa pretende asistir al experto en la dura tarea de transcribir. Hasta la fecha, los sistemas de reconocimiento de texto manuscrito disponibles no proporcionan transcripciones aceptables por los usuarios y, generalmente, se requiere la intervención del humano para corregir las transcripciones obtenidas. Estos sistemas han demostrado ser realmente útiles en aplicaciones restringidas y con vocabularios limitados (como es el caso del reconocimiento de direcciones postales o de cantidades numéricas en cheques bancarios), consiguiendo en este tipo de tareas resultados aceptables. Sin embargo, cuando se trabaja con documentos manuscritos sin ningún tipo de restricción (como documentos manuscritos antiguos o texto espontáneo), la tecnología actual solo consigue resultados inaceptables. El escenario interactivo estudiado en esta tesis permite una solución más efectiva. En este escenario, el sistema de reconocimiento y el usuario cooperan para generar la transcripción final de la imagen de texto. El sistema utiliza la imagen de texto y una parte de la transcripción previamente validada (prefijo) para proponer una posible continuación. Despues, el usuario encuentra y corrige el siguente error producido por el sistema, generando así un nuevo prefijo mas largo. Este nuevo prefijo, es utilizado por el sistema para sugerir una nueva hipótesis. La tecnología utilizada se basa en modelos ocultos de Markov y n-gramas. Estos modelos son utilizados aquí de la misma manera que en el reconocimiento automático del habla. Algunas modificaciones en la definición convencional de los n-gramas han sido necesarias para tener en cuenta la retroalimentación del usuario en este sistema.Romero Gómez, V. (2010). Multimodal Interactive Transcription of Handwritten Text Images [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/8541Palanci
    corecore