39 research outputs found

    Learning to Read by Spelling: Towards Unsupervised Text Recognition

    Full text link
    This work presents a method for visual text recognition without using any paired supervisory data. We formulate the text recognition task as one of aligning the conditional distribution of strings predicted from given text images, with lexically valid strings sampled from target corpora. This enables fully automated, and unsupervised learning from just line-level text-images, and unpaired text-string samples, obviating the need for large aligned datasets. We present detailed analysis for various aspects of the proposed method, namely - (1) impact of the length of training sequences on convergence, (2) relation between character frequencies and the order in which they are learnt, (3) generalisation ability of our recognition network to inputs of arbitrary lengths, and (4) impact of varying the text corpus on recognition accuracy. Finally, we demonstrate excellent text recognition accuracy on both synthetically generated text images, and scanned images of real printed books, using no labelled training examples

    Learning to Read Bushman: Automatic Handwriting Recognition for Bushman Languages

    Get PDF
    The Bleek and Lloyd Collection contains notebooks that document the tradition, language and culture of the Bushman people who lived in South Africa in the late 19th century. Transcriptions of these notebooks would allow for the provision of services such as text-based search and text-to-speech. However, these notebooks are currently only available in the form of digital scans and the manual creation of transcriptions is a costly and time-consuming process. Thus, automatic methods could serve as an alternative approach to creating transcriptions of the text in the notebooks. In order to evaluate the use of automatic methods, a corpus of Bushman texts and their associated transcriptions was created. The creation of this corpus involved: the development of a custom method for encoding the Bushman script, which contains complex diacritics; the creation of a tool for creating and transcribing the texts in the notebooks; and the running of a series of workshops in which the tool was used to create the corpus. The corpus was used to evaluate the use of various techniques for automatically transcribing the texts in the corpus in order to determine which approaches were best suited to the complex Bushman script. These techniques included the use of Support Vector Machines, Artificial Neural Networks and Hidden Markov Models as machine learning algorithms, which were coupled with different descriptive features. The effect of the texts used for training the machine learning algorithms was also investigated as well as the use of a statistical language model. It was found that, for Bushman word recognition, the use of a Support Vector Machine with Histograms of Oriented Gradient features resulted in the best performance and, for Bushman text line recognition, Marti & Bunke features resulted in the best performance when used with Hidden Markov Models. The automatic transcription of the Bushman texts proved to be difficult and the performance of the different recognition systems was largely affected by the complexities of the Bushman script. It was also found that, besides having an influence on determining which techniques may be the most appropriate for automatic handwriting recognition, the texts used in a automatic handwriting recognition system also play a large role in determining whether or not automatic recognition should be attempted at all

    Proceedings of the 4th International Workshop on Reading Music Systems

    Full text link
    The International Workshop on Reading Music Systems (WoRMS) is a workshop that tries to connect researchers who develop systems for reading music, such as in the field of Optical Music Recognition, with other researchers and practitioners that could benefit from such systems, like librarians or musicologists. The relevant topics of interest for the workshop include, but are not limited to: Music reading systems; Optical music recognition; Datasets and performance evaluation; Image processing on music scores; Writer identification; Authoring, editing, storing and presentation systems for music scores; Multi-modal systems; Novel input-methods for music to produce written music; Web-based Music Information Retrieval services; Applications and projects; Use-cases related to written music. These are the proceedings of the 4th International Workshop on Reading Music Systems, held online on Nov. 18th 2022.Comment: Proceedings edited by Jorge Calvo-Zaragoza, Alexander Pacha and Elona Shatr

    Advances on the Transcription of Historical Manuscripts based on Multimodality, Interactivity and Crowdsourcing

    Full text link
    Natural Language Processing (NLP) is an interdisciplinary research field of Computer Science, Linguistics, and Pattern Recognition that studies, among others, the use of human natural languages in Human-Computer Interaction (HCI). Most of NLP research tasks can be applied for solving real-world problems. This is the case of natural language recognition and natural language translation, that can be used for building automatic systems for document transcription and document translation. Regarding digitalised handwritten text documents, transcription is used to obtain an easy digital access to the contents, since simple image digitalisation only provides, in most cases, search by image and not by linguistic contents (keywords, expressions, syntactic or semantic categories). Transcription is even more important in historical manuscripts, since most of these documents are unique and the preservation of their contents is crucial for cultural and historical reasons. The transcription of historical manuscripts is usually done by paleographers, who are experts on ancient script and vocabulary. Recently, Handwritten Text Recognition (HTR) has become a common tool for assisting paleographers in their task, by providing a draft transcription that they may amend with more or less sophisticated methods. This draft transcription is useful when it presents an error rate low enough to make the amending process more comfortable than a complete transcription from scratch. Thus, obtaining a draft transcription with an acceptable low error rate is crucial to have this NLP technology incorporated into the transcription process. The work described in this thesis is focused on the improvement of the draft transcription offered by an HTR system, with the aim of reducing the effort made by paleographers for obtaining the actual transcription on digitalised historical manuscripts. This problem is faced from three different, but complementary, scenarios: · Multimodality: The use of HTR systems allow paleographers to speed up the manual transcription process, since they are able to correct on a draft transcription. Another alternative is to obtain the draft transcription by dictating the contents to an Automatic Speech Recognition (ASR) system. When both sources (image and speech) are available, a multimodal combination is possible and an iterative process can be used in order to refine the final hypothesis. · Interactivity: The use of assistive technologies in the transcription process allows one to reduce the time and human effort required for obtaining the actual transcription, given that the assistive system and the palaeographer cooperate to generate a perfect transcription. Multimodal feedback can be used to provide the assistive system with additional sources of information by using signals that represent the whole same sequence of words to transcribe (e.g. a text image, and the speech of the dictation of the contents of this text image), or that represent just a word or character to correct (e.g. an on-line handwritten word). · Crowdsourcing: Open distributed collaboration emerges as a powerful tool for massive transcription at a relatively low cost, since the paleographer supervision effort may be dramatically reduced. Multimodal combination allows one to use the speech dictation of handwritten text lines in a multimodal crowdsourcing platform, where collaborators may provide their speech by using their own mobile device instead of using desktop or laptop computers, which makes it possible to recruit more collaborators.El Procesamiento del Lenguaje Natural (PLN) es un campo de investigación interdisciplinar de las Ciencias de la Computación, Lingüística y Reconocimiento de Patrones que estudia, entre otros, el uso del lenguaje natural humano en la interacción Hombre-Máquina. La mayoría de las tareas de investigación del PLN se pueden aplicar para resolver problemas del mundo real. Este es el caso del reconocimiento y la traducción del lenguaje natural, que se pueden utilizar para construir sistemas automáticos para la transcripción y traducción de documentos. En cuanto a los documentos manuscritos digitalizados, la transcripción se utiliza para facilitar el acceso digital a los contenidos, ya que la simple digitalización de imágenes sólo proporciona, en la mayoría de los casos, la búsqueda por imagen y no por contenidos lingüísticos. La transcripción es aún más importante en el caso de los manuscritos históricos, ya que la mayoría de estos documentos son únicos y la preservación de su contenido es crucial por razones culturales e históricas. La transcripción de manuscritos históricos suele ser realizada por paleógrafos, que son personas expertas en escritura y vocabulario antiguos. Recientemente, los sistemas de Reconocimiento de Escritura (RES) se han convertido en una herramienta común para ayudar a los paleógrafos en su tarea, la cual proporciona un borrador de la transcripción que los paleógrafos pueden corregir con métodos más o menos sofisticados. Este borrador de transcripción es útil cuando presenta una tasa de error suficientemente reducida para que el proceso de corrección sea más cómodo que una completa transcripción desde cero. Por lo tanto, la obtención de un borrador de transcripción con una baja tasa de error es crucial para que esta tecnología de PLN sea incorporada en el proceso de transcripción. El trabajo descrito en esta tesis se centra en la mejora del borrador de transcripción ofrecido por un sistema RES, con el objetivo de reducir el esfuerzo realizado por los paleógrafos para obtener la transcripción de manuscritos históricos digitalizados. Este problema se enfrenta a partir de tres escenarios diferentes, pero complementarios: · Multimodalidad: El uso de sistemas RES permite a los paleógrafos acelerar el proceso de transcripción manual, ya que son capaces de corregir en un borrador de la transcripción. Otra alternativa es obtener el borrador de la transcripción dictando el contenido a un sistema de Reconocimiento Automático de Habla. Cuando ambas fuentes están disponibles, una combinación multimodal de las mismas es posible y se puede realizar un proceso iterativo para refinar la hipótesis final. · Interactividad: El uso de tecnologías asistenciales en el proceso de transcripción permite reducir el tiempo y el esfuerzo humano requeridos para obtener la transcripción correcta, gracias a la cooperación entre el sistema asistencial y el paleógrafo para obtener la transcripción perfecta. La realimentación multimodal se puede utilizar en el sistema asistencial para proporcionar otras fuentes de información adicionales con señales que representen la misma secuencia de palabras a transcribir (por ejemplo, una imagen de texto, o la señal de habla del dictado del contenido de dicha imagen de texto), o señales que representen sólo una palabra o carácter a corregir (por ejemplo, una palabra manuscrita mediante una pantalla táctil). · Crowdsourcing: La colaboración distribuida y abierta surge como una poderosa herramienta para la transcripción masiva a un costo relativamente bajo, ya que el esfuerzo de supervisión de los paleógrafos puede ser drásticamente reducido. La combinación multimodal permite utilizar el dictado del contenido de líneas de texto manuscrito en una plataforma de crowdsourcing multimodal, donde los colaboradores pueden proporcionar las muestras de habla utilizando su propio dispositivo móvil en lugar de usar ordenadores,El Processament del Llenguatge Natural (PLN) és un camp de recerca interdisciplinar de les Ciències de la Computació, la Lingüística i el Reconeixement de Patrons que estudia, entre d'altres, l'ús del llenguatge natural humà en la interacció Home-Màquina. La majoria de les tasques de recerca del PLN es poden aplicar per resoldre problemes del món real. Aquest és el cas del reconeixement i la traducció del llenguatge natural, que es poden utilitzar per construir sistemes automàtics per a la transcripció i traducció de documents. Quant als documents manuscrits digitalitzats, la transcripció s'utilitza per facilitar l'accés digital als continguts, ja que la simple digitalització d'imatges només proporciona, en la majoria dels casos, la cerca per imatge i no per continguts lingüístics (paraules clau, expressions, categories sintàctiques o semàntiques). La transcripció és encara més important en el cas dels manuscrits històrics, ja que la majoria d'aquests documents són únics i la preservació del seu contingut és crucial per raons culturals i històriques. La transcripció de manuscrits històrics sol ser realitzada per paleògrafs, els quals són persones expertes en escriptura i vocabulari antics. Recentment, els sistemes de Reconeixement d'Escriptura (RES) s'han convertit en una eina comuna per ajudar els paleògrafs en la seua tasca, la qual proporciona un esborrany de la transcripció que els paleògrafs poden esmenar amb mètodes més o menys sofisticats. Aquest esborrany de transcripció és útil quan presenta una taxa d'error prou reduïda perquè el procés de correcció siga més còmode que una completa transcripció des de zero. Per tant, l'obtenció d'un esborrany de transcripció amb un baixa taxa d'error és crucial perquè aquesta tecnologia del PLN siga incorporada en el procés de transcripció. El treball descrit en aquesta tesi se centra en la millora de l'esborrany de la transcripció ofert per un sistema RES, amb l'objectiu de reduir l'esforç realitzat pels paleògrafs per obtenir la transcripció de manuscrits històrics digitalitzats. Aquest problema s'enfronta a partir de tres escenaris diferents, però complementaris: · Multimodalitat: L'ús de sistemes RES permet als paleògrafs accelerar el procés de transcripció manual, ja que són capaços de corregir un esborrany de la transcripció. Una altra alternativa és obtenir l'esborrany de la transcripció dictant el contingut a un sistema de Reconeixement Automàtic de la Parla. Quan les dues fonts (imatge i parla) estan disponibles, una combinació multimodal és possible i es pot realitzar un procés iteratiu per refinar la hipòtesi final. · Interactivitat: L'ús de tecnologies assistencials en el procés de transcripció permet reduir el temps i l'esforç humà requerits per obtenir la transcripció real, gràcies a la cooperació entre el sistema assistencial i el paleògraf per obtenir la transcripció perfecta. La realimentació multimodal es pot utilitzar en el sistema assistencial per proporcionar fonts d'informació addicionals amb senyals que representen la mateixa seqüencia de paraules a transcriure (per exemple, una imatge de text, o el senyal de parla del dictat del contingut d'aquesta imatge de text), o senyals que representen només una paraula o caràcter a corregir (per exemple, una paraula manuscrita mitjançant una pantalla tàctil). · Crowdsourcing: La col·laboració distribuïda i oberta sorgeix com una poderosa eina per a la transcripció massiva a un cost relativament baix, ja que l'esforç de supervisió dels paleògrafs pot ser reduït dràsticament. La combinació multimodal permet utilitzar el dictat del contingut de línies de text manuscrit en una plataforma de crowdsourcing multimodal, on els col·laboradors poden proporcionar les mostres de parla utilitzant el seu propi dispositiu mòbil en lloc d'utilitzar ordinadors d'escriptori o portàtils, la qual cosa permet ampliar el nombrGranell Romero, E. (2017). Advances on the Transcription of Historical Manuscripts based on Multimodality, Interactivity and Crowdsourcing [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/86137TESI

    Mining Spatio-Temporal Datasets: Relevance, Challenges and Current Research Directions

    Get PDF
    Spatio-temporal data usually records the states over time of an object, an event or a position in space. Spatio-temporal data can be found in several application fields, such as traffic management, environment monitoring, weather forerast, etc. In the past, huge effort was devoted to spatial data representation and manipulation with particular focus on its visualisation. More recently, the interest of many users has shifted from static views of geospatial phenomena, which capture its “spatiality” only, to more advanced means of discovering dynamic relationships among the patterns and events contained in the data as well as understanding the changes occurring in spatial data over time

    A Computational Theory of Contextual Knowledge in Machine Reading

    Get PDF
    Machine recognition of off–line handwriting can be achieved by either recognising words as individual symbols (word level recognition) or by segmenting a word into parts, usually letters, and classifying those parts (letter level recognition). Whichever method is used, current handwriting recognition systems cannot overcome the inherent ambiguity in writingwithout recourse to contextual information. This thesis presents a set of experiments that use Hidden Markov Models of language to resolve ambiguity in the classification process. It goes on to describe an algorithm designed to recognise a document written by a single–author and to improve recognition by adaptingto the writing style and learning new words. Learning and adaptation is achieved by reading the document over several iterations. The algorithm is designed to incorporate contextual processing, adaptation to modify the shape of known words and learning of new words within a constrained dictionary. Adaptation occurs when a word that has previously been trained in the classifier is recognised at either the word or letter level and the word image is used to modify the classifier. Learning occurs when a new word that has not been in the training set is recognised at the letter level and is subsequently added to the classifier. Words and letters are recognised using a nearest neighbour classifier and used features based on the two–dimensional Fourier transform. By incorporating a measure of confidence based on the distribution of training points around an exemplar, adaptation and learning is constrained to only occur when a word is confidently classified. The algorithm was implemented and tested with a dictionary of 1000 words. Results show that adaptation of the letter classifier improved recognition on average by 3.9% with only 1.6% at the whole word level. Two experiments were carried out to evaluate the learning in the system. It was found that learning accounted for little improvement in the classification results and also that learning new words was prone to misclassifications being propagated
    corecore