46 research outputs found

    Advances on the Transcription of Historical Manuscripts based on Multimodality, Interactivity and Crowdsourcing

    Full text link
    Natural Language Processing (NLP) is an interdisciplinary research field of Computer Science, Linguistics, and Pattern Recognition that studies, among other things, the use of human natural languages in Human-Computer Interaction (HCI). Many NLP research tasks can be applied to solving real-world problems. This is the case for natural language recognition and natural language translation, which can be used to build automatic systems for document transcription and document translation. For digitalised handwritten text documents, transcription is used to obtain easy digital access to the contents, since simple image digitalisation only provides, in most cases, search by image and not by linguistic contents (keywords, expressions, syntactic or semantic categories). Transcription is even more important for historical manuscripts, since most of these documents are unique and the preservation of their contents is crucial for cultural and historical reasons. The transcription of historical manuscripts is usually done by paleographers, who are experts in ancient script and vocabulary. Recently, Handwritten Text Recognition (HTR) has become a common tool for assisting paleographers in their task by providing a draft transcription that they may amend with more or less sophisticated methods. This draft transcription is useful when its error rate is low enough to make the amending process more comfortable than a complete transcription from scratch. Thus, obtaining a draft transcription with an acceptably low error rate is crucial for incorporating this NLP technology into the transcription process. The work described in this thesis focuses on improving the draft transcription offered by an HTR system, with the aim of reducing the effort made by paleographers to obtain the actual transcription of digitalised historical manuscripts. This problem is approached from three different, but complementary, scenarios:
    · Multimodality: The use of HTR systems allows paleographers to speed up the manual transcription process, since they are able to correct a draft transcription. Another alternative is to obtain the draft transcription by dictating the contents to an Automatic Speech Recognition (ASR) system. When both sources (image and speech) are available, a multimodal combination is possible and an iterative process can be used to refine the final hypothesis (see the sketch following this list).
    · Interactivity: The use of assistive technologies in the transcription process reduces the time and human effort required to obtain the actual transcription, given that the assistive system and the paleographer cooperate to generate a perfect transcription. Multimodal feedback can provide the assistive system with additional sources of information, using signals that represent the same whole sequence of words to transcribe (e.g. a text image and the speech of a dictation of the contents of that text image), or that represent just a word or character to correct (e.g. an on-line handwritten word).
    · Crowdsourcing: Open distributed collaboration emerges as a powerful tool for massive transcription at a relatively low cost, since the paleographer's supervision effort may be dramatically reduced. Multimodal combination allows the speech dictation of handwritten text lines to be used in a multimodal crowdsourcing platform, where collaborators may provide their speech using their own mobile devices instead of desktop or laptop computers, which makes it possible to recruit more collaborators.
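    As a hedged illustration of the multimodal combination idea above, the sketch below rescores HTR and ASR n-best hypotheses log-linearly; the function names, scores, and interpolation scheme are illustrative assumptions, not the method used in the thesis.

```python
# A minimal sketch of log-linear combination of HTR and ASR hypotheses.
# Names, scores, and the interpolation scheme are illustrative assumptions.

def combine_nbest(htr_nbest, asr_nbest, alpha=0.6):
    """Rescore every hypothesis seen by either system.

    htr_nbest / asr_nbest: dicts mapping a transcription string to its
    log-probability under the HTR and ASR systems, respectively.
    alpha: interpolation weight given to the HTR score.
    """
    floor = -1e9  # penalty for hypotheses missing from one list
    combined = {}
    for hyp in set(htr_nbest) | set(asr_nbest):
        combined[hyp] = (alpha * htr_nbest.get(hyp, floor)
                         + (1.0 - alpha) * asr_nbest.get(hyp, floor))
    return max(combined, key=combined.get)

htr = {"the quick brown fox": -4.1, "the quiet brown fox": -3.9}
asr = {"the quick brown fox": -3.8, "the quick brown box": -3.7}
print(combine_nbest(htr, asr))  # "the quick brown fox": supported by both
```

    In an iterative refinement loop, the winning hypothesis would be fed back to constrain the next recognition pass of each system.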
    Granell Romero, E. (2017). Advances on the Transcription of Historical Manuscripts based on Multimodality, Interactivity and Crowdsourcing [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/86137

    Enforcing constraints for multi-lingual and cross-lingual speech-to-text systems

    Get PDF
    The recent development of neural network-based automatic speech recognition (ASR) systems has greatly reduced the state-of-the-art phone error rates in several languages. However, when an ASR system trained on one language tries to recognize speech from another language, it usually fails, even when the two languages come from the same language family. This scenario poses a problem for low-resource languages, which usually do not have enough paired data for training a moderately sized ASR model and thus require either cross-lingual adaptation or zero-shot recognition. Due to the increasing interest in bringing ASR technology to low-resource languages, the cross-lingual adaptation of end-to-end speech recognition systems has recently received more attention. However, little analysis has been done to understand how the model learns a shared representation across languages and how language-dependent representations can be fine-tuned to improve the system's performance. We compare a bi-lingual CTC model with language-specific tuning at earlier LSTM layers to one without such tuning, to understand whether having language-independent pathways in the model helps with multi-lingual learning and why. We first train the network on Dutch and then transfer the system to English under the bi-lingual CTC loss. After that, the representations from the two networks are visualized. Results showed that the consonants of the two languages are learned very well under a shared mapping, but that vowels could benefit significantly from further language-dependent transformations applied before the last classification layer. These results can be used as a guide for designing multi-lingual and cross-lingual end-to-end systems in the future. However, creating specialized processing units in the neural network for each training language could yield increasingly large networks as the number of training languages increases, and it is unclear how to adapt such a system to zero-shot recognition.
    The remaining work adapts two existing constraints to the realm of multi-lingual and cross-lingual ASR. The first constraint is cycle-consistent training. This method defines a shared codebook of phonetic tokens for all training languages. Input speech first passes through the speech encoder of the ASR system and is quantized into discrete representations from the codebook. The discrete sequence representation is then passed through an auxiliary speech decoder to reconstruct the input speech, and the framework constrains the reconstructed speech to be close to the original input. The second constraint is regret minimization training. It separates an ASR encoder into two parts: a feature extractor and a predictor. Regret minimization defines an additional regret term for each training sample as the difference between the losses of an auxiliary language-specific predictor with the real language ID and with a fake language ID (see the sketch below). This constraint enables the feature extractor to learn an invariant speech-to-phone mapping across all languages and could potentially improve the model's generalization ability to new languages.
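    The regret term lends itself to a compact rendering. Below is a hedged PyTorch sketch under the assumption of frame-level phone targets; the exact formulation in the thesis may differ, and all names are illustrative.

```python
# A hedged PyTorch sketch of the regret term described above.
import torch
import torch.nn.functional as F

def regret_term(feats, phone_targets, predictors, real_lang, fake_lang):
    """feats: (N, D) frame features from the shared feature extractor.
    phone_targets: (N,) frame-level phone labels.
    predictors: dict mapping a language ID to its auxiliary phone classifier.
    """
    loss_real = F.cross_entropy(predictors[real_lang](feats), phone_targets)
    loss_fake = F.cross_entropy(predictors[fake_lang](feats), phone_targets)
    # Driving this difference to zero means the real-language predictor has
    # no advantage, i.e. the features carry little language-specific detail.
    return loss_real - loss_fake

# Assumed overall objective (lam is a tuning weight):
# loss = asr_loss + lam * regret_term(feats, targets, predictors, "nl", "en")
```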

    Streaming Automatic Speech Recognition with Hybrid Architectures and Deep Neural Network Models

    Full text link
    Thesis by compendium. Over the last decade, the media have experienced a revolution, turning away from conventional TV in favor of on-demand platforms. In addition, this media revolution has changed not only the way entertainment is conceived but also how learning is conducted. Indeed, on-demand educational platforms have proliferated and are now providing educational resources on diverse topics. These new ways to distribute content have come along with requirements to improve accessibility, particularly relating to hearing difficulties and language barriers. Here lies the opportunity for automatic speech recognition (ASR) to comply with these requirements by providing high-quality automatic captioning. Automatic captioning provides a sound basis for diminishing the accessibility gap, especially for live or streaming content. To this end, streaming ASR must work under strict real-time conditions, providing captions as fast as possible while working with limited context. However, this limited context usually leads to quality degradation compared to systems for pre-recorded or offline content. This thesis is aimed at developing low-latency streaming ASR with a quality similar to offline ASR. More precisely, it describes the path followed from an initial hybrid offline system to an efficient streaming-adapted system. The first step is to perform a single recognition pass using a state-of-the-art neural network-based language model; in conventional multi-pass systems, this model is often deferred to the second or later pass due to its computational complexity. As with the language model, the neural-based acoustic model is also properly adapted to work with limited context (a minimal sketch of such chunked, limited-context decoding is given below). The adaptation and integration of these models are thoroughly described and assessed using fully-fledged streaming systems on well-known academic and challenging real-world benchmarks. In brief, it is shown that the proposed adaptation of the language and acoustic models allows the streaming-adapted system to reach the accuracy of the initial offline system with low latency.
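    The following sketch illustrates the streaming constraint described above: decode in chunks with a bounded look-ahead window so that latency stays fixed. acoustic_model, decoder, and all sizes are assumed interfaces, not the thesis's actual implementation.

```python
# A minimal sketch of chunked streaming recognition with bounded look-ahead.

def stream_decode(frames, acoustic_model, decoder,
                  chunk_frames=64, right_context=16):
    """Yield partial captions; latency is bounded by right_context frames."""
    buffer = []
    for frame in frames:
        buffer.append(frame)
        if len(buffer) == chunk_frames + right_context:
            posteriors = acoustic_model(buffer)           # limited-context AM
            words = decoder.advance(posteriors[:chunk_frames])  # single pass,
            if words:                                     # neural LM inside
                yield words
            buffer = buffer[chunk_frames:]  # keep look-ahead for next chunk
    if buffer:                              # flush the tail at end of stream
        yield decoder.finalize(acoustic_model(buffer))
```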
    Jorge Cano, J. (2022). Streaming Automatic Speech Recognition with Hybrid Architectures and Deep Neural Network Models [Doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/191001

    Proceedings of the ACM SIGIR Workshop "Searching Spontaneous Conversational Speech"

    Get PDF

    Speech Recognition

    Get PDF
    Chapters in the first part of the book cover all the essential speech processing techniques for building robust automatic speech recognition systems: the representation of speech signals, methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition: speaker identification and tracking, prosody modeling in emotion-detection systems, and applications that operate in real-world environments, such as mobile communication services and smart homes.
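    As a concrete illustration of the speech-feature extraction step mentioned above, the snippet below computes MFCCs with librosa, one common choice among many; the file name and parameter values are placeholders.

```python
# A short example of MFCC feature extraction for ASR front-ends.
import librosa

audio, sr = librosa.load("utterance.wav", sr=16000)        # 16 kHz mono
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)     # 25 ms / 10 ms
print(mfcc.shape)  # (13, n_frames): one 13-dim vector per 10 ms frame
```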

    Mispronunciation Detection and Diagnosis in Mandarin-Accented English Speech

    Get PDF
    This work presents the development, implementation, and evaluation of a Mispronunciation Detection and Diagnosis (MDD) system, with application to pronunciation evaluation of Mandarin-accented English speech. A comprehensive detection and diagnosis of errors in the Electromagnetic Articulography corpus of Mandarin-Accented English (EMA-MAE) was performed using expert phonetic transcripts and an Automatic Speech Recognition (ASR) system. Articulatory features derived from the parallel kinematic data available in the EMA-MAE corpus were used to identify the most significant articulatory error patterns seen in L2 speakers during common mispronunciations. Using both acoustic and articulatory information, an ASR-based MDD system was built and evaluated across different feature combinations and Deep Neural Network (DNN) architectures. The MDD system captured mispronunciation errors with a detection accuracy of 82.4%, a diagnostic accuracy of 75.8%, and a false rejection rate of 17.2%. The results demonstrate the advantage of using articulatory features in revealing the significant contributors to mispronunciation as well as improving the performance of MDD systems.
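    The three figures above follow the standard MDD evaluation conventions. A hedged sketch of how such metrics are typically computed from the four detection outcomes is given below; the exact definitions used in the thesis may differ slightly.

```python
# The usual MDD metric definitions, computed from counts obtained by
# comparing the system's accept/reject decisions with expert transcripts.

def mdd_metrics(TA, FR, FA, TR, correct_diagnoses):
    """TA: correct phones accepted;   FR: correct phones wrongly flagged;
    FA: mispronunciations missed;     TR: mispronunciations detected;
    correct_diagnoses: detected errors whose error type was also correct."""
    detection_accuracy = (TA + TR) / (TA + FR + FA + TR)
    false_rejection_rate = FR / (TA + FR)
    diagnostic_accuracy = correct_diagnoses / TR
    return detection_accuracy, false_rejection_rate, diagnostic_accuracy
```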

    Robust learning of acoustic representations from diverse speech data

    Get PDF
    Automatic speech recognition is increasingly applied to new domains. A key challenge is to robustly learn, update and maintain representations to cope with transient acoustic conditions. A typical example is broadcast media, for which speakers and environments may change rapidly, and available supervision may be poor. The concern of this thesis is to build and investigate methods for acoustic modelling that are robust to the characteristics and transient conditions as embodied by such media.
    The first contribution of the thesis is a technique to make use of inaccurate transcriptions as supervision for acoustic model training. There is an abundance of audio with approximate labels, but training methods can be sensitive to label errors, and their use is therefore not trivial. State-of-the-art semi-supervised training makes effective use of a lattice of supervision, inherently encoding uncertainty in the labels to avoid overfitting to poor supervision, but does not make use of the transcriptions. Existing approaches that do aim to make use of the transcriptions typically employ an algorithm to filter or combine the transcriptions with the recognition output from a seed model, but the final result does not encode uncertainty. We propose a method to combine the lattice output from a biased recognition pass with the transcripts, crucially preserving uncertainty in the lattice where appropriate. This substantially reduces the word error rate on a broadcast task.
    The second contribution is a method to factorise representations for speakers and environments so that they may be combined in novel combinations. In realistic scenarios, the speaker or environment transform at test time might be unknown, or there may be insufficient data to learn a joint transform. We show that in such cases factorised, or independent, representations are required to avoid deteriorating performance. Using i-vectors, we factorise speaker or environment information using multi-condition training with neural networks. Specifically, we extract bottleneck features from networks trained to classify either speakers or environments. The resulting factorised representations prove beneficial when one factor is missing at test time, or when all factors are seen but not in the desired combination.
    The third contribution is an investigation of model adaptation in a longitudinal setting, in which we repeatedly adapt a model to new data under the constraint that previous data becomes unavailable. We first demonstrate the effect of such a constraint and show that using a cyclical learning rate may help. We then observe that these successive models lend themselves well to ensembling. Finally, we show that the impact of this constraint in an active learning setting may be detrimental to performance, and suggest combining active learning with semi-supervised training to avoid biasing the model.
    The fourth contribution is a method to adapt low-level features in a parameter-efficient and interpretable manner. We propose to adapt the filters in a neural feature extractor, known as SincNet. In contrast to traditional techniques that warp the filterbank frequencies in standard feature extraction, adapting SincNet parameters is more flexible and more readily optimised, whilst maintaining interpretability (a sketch of the filter parameterisation follows below). On a task adapting from adult to child speech, we show that this layer is well suited for adaptation and is very effective with respect to the small number of adapted parameters.
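    The following minimal PyTorch sketch shows the SincNet idea behind the fourth contribution: each filter is a band-pass fully determined by two cutoff frequencies, so adapting the feature extractor means updating just two learnable parameters per filter. It is illustrative only, not the SincNet reference implementation.

```python
# Windowed sinc band-pass filter, the building block of a SincNet layer.
import math
import torch

def sinc_filter(f_low, f_high, kernel_size=251, sr=16000):
    """Band-pass impulse response between f_low and f_high (Hz).
    In a real SincNet layer, f_low and f_high are nn.Parameters."""
    t = (torch.arange(kernel_size, dtype=torch.float32) - kernel_size // 2) / sr

    def lowpass(f):  # ideal low-pass response: 2f * sinc(2 f t)
        x = 2 * math.pi * f * t
        return 2 * f * torch.where(t == 0, torch.ones_like(t),
                                   torch.sin(x) / x)

    band = lowpass(f_high) - lowpass(f_low)
    return band * torch.hamming_window(kernel_size)  # smooth the band edges
```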

    Learning cognitive maps: Finding useful structure in an uncertain world

    Get PDF
    In this chapter we will describe the central mechanisms that influence how people learn about large-scale space. We will focus particularly on how these mechanisms enable people to effectively cope with both the uncertainty inherent in a constantly changing world and the high information content of natural environments. The major lessons are that humans get by with a "less is more" approach to building structure, and that they are able to quickly adapt to environmental changes thanks to a range of general-purpose mechanisms. By looking at abstract principles, instead of concrete implementation details, it is shown that the study of human learning can provide valuable lessons for robotics. Finally, these issues are discussed in the context of an implementation on a mobile robot. © 2007 Springer-Verlag Berlin Heidelberg.

    Towards Automatic Speech-Language Assessment for Aphasia Rehabilitation

    Full text link
    Speech-based technology has the potential to reinforce traditional aphasia therapy through the development of automatic speech-language assessment systems. Such systems can provide clinicians with supplementary information to assist with progress monitoring and treatment planning, and can provide support for on-demand auxiliary treatment. However, current technology cannot support this type of application due to the difficulties associated with aphasic speech processing. The focus of this dissertation is on the development of computational methods that can accurately assess aphasic speech across a range of clinically-relevant dimensions. The first part of the dissertation focuses on novel techniques for assessing aphasic speech intelligibility in constrained contexts. The second part investigates acoustic modeling methods that lead to significant improvement in aphasic speech recognition and allow the system to work with unconstrained speech samples. The final part demonstrates the efficacy of speech recognition-based analysis in automatic paraphasia detection, extraction of clinically-motivated quantitative measures, and estimation of aphasia severity. The methods and results presented in this work will enable robust technologies for accurately recognizing and assessing aphasic speech, and will provide insights into the link between computational methods and clinical understanding of aphasia. PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/140840/1/ducle_1.pd
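    To make the recognition-based paraphasia detection concrete, here is a minimal sketch that flags candidate paraphasias by aligning the ASR hypothesis against the target script; difflib stands in for the dissertation's actual alignment procedure, and all names are illustrative.

```python
# Flag substituted words (possible paraphasias) via sequence alignment.
import difflib

def candidate_paraphasias(target_words, hyp_words):
    """Return (target span, produced span) pairs where the words differ."""
    sm = difflib.SequenceMatcher(a=target_words, b=hyp_words)
    flags = []
    for op, a0, a1, b0, b1 in sm.get_opcodes():
        if op == "replace":  # substitution: a possible paraphasia
            flags.append((target_words[a0:a1], hyp_words[b0:b1]))
    return flags

print(candidate_paraphasias("the cat sat on the mat".split(),
                            "the hat sat on the mat".split()))
# [(['cat'], ['hat'])]
```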