32 research outputs found

    Fast N-Gram Language Model Look-Ahead for Decoders With Static Pronunciation Prefix Trees

    Decoders that make use of token passing restrict their search space by various types of token pruning. With the use of the Language Model Look-Ahead (LMLA) technique it is possible to increase the number of tokens that can be pruned without loss of decoding precision. Unfortunately, for token-passing decoders that use a single static pronunciation prefix tree, full n-gram LMLA considerably increases the number of language model probability calculations needed. In this paper a method for applying full n-gram LMLA in a decoder with a single static pronunciation tree is introduced. The experiments show that this method improves the speed of the decoder without an increase in search errors.
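    Language model look-ahead of this kind attaches to every node of the pronunciation prefix tree the best LM score among the words still reachable below that node, so tokens on unpromising branches can be pruned earlier. The sketch below is a minimal, hypothetical illustration with a unigram LM and an invented toy lexicon, not the paper's full n-gram method.

from dataclasses import dataclass, field
from typing import Dict, List, Optional
import math

@dataclass
class Node:
    children: Dict[str, "Node"] = field(default_factory=dict)
    word: Optional[str] = None          # set on nodes that end a word
    lookahead: float = -math.inf        # best LM log-probability reachable below this node

def build_prefix_tree(lexicon: Dict[str, List[str]]) -> Node:
    """Insert every pronunciation (a phone sequence) into a shared prefix tree."""
    root = Node()
    for word, phones in lexicon.items():
        node = root
        for ph in phones:
            node = node.children.setdefault(ph, Node())
        node.word = word
    return root

def set_lookahead(node: Node, lm_logprob: Dict[str, float]) -> float:
    """Propagate the best word LM log-probability up the tree (unigram look-ahead)."""
    best = lm_logprob.get(node.word, -math.inf) if node.word else -math.inf
    for child in node.children.values():
        best = max(best, set_lookahead(child, lm_logprob))
    node.lookahead = best
    return best

# Toy example (all values invented): during token passing, a token entering a node
# can add node.lookahead to its score before pruning, so unlikely branches die early.
lexicon = {"cat": ["k", "ae", "t"], "cab": ["k", "ae", "b"], "dog": ["d", "ao", "g"]}
lm = {"cat": math.log(0.5), "cab": math.log(0.1), "dog": math.log(0.4)}
root = build_prefix_tree(lexicon)
set_lookahead(root, lm)
print(root.children["k"].lookahead)   # log(0.5): best word below the "k" branch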

    Segmentation, Diarization and Speech Transcription: Surprise Data Unraveled

    In this thesis, research on large vocabulary continuous speech recognition for unknown audio conditions is presented. For automatic speech recognition systems based on statistical methods, it is important that the conditions of the audio used for training the statistical models match the conditions of the audio to be processed. Any mismatch will decrease the accuracy of the recognition. If it is unpredictable what kind of data can be expected, or in other words if the conditions of the audio to be processed are unknown, it is impossible to tune the models. If the material consists of `surprise data', the output of the system is likely to be poor. In this thesis methods are presented for which no external training data is required for training models. These novel methods have been implemented in a large vocabulary continuous speech recognition system called SHoUT. This system consists of three subsystems: speech/non-speech classification, speaker diarization and automatic speech recognition.

    The speech/non-speech classification subsystem separates speech from silence and from unknown audible non-speech events. The type of non-speech present in audio recordings can vary from paper shuffling in recordings of meetings to sound effects in television shows. Because it is unknown what type of non-speech needs to be detected, it is not possible to train high-quality statistical models for each type of non-speech sound. The speech/non-speech classification subsystem, also called the speech activity detection subsystem, therefore does not attempt to classify all audible non-speech in a single run. Instead, a bootstrap speech/silence classification is first obtained using a standard speech activity component. Next, the models for speech, silence and audible non-speech are trained on the target audio using the bootstrap classification. This approach makes it possible to classify speech and non-speech with high accuracy, without the need to know what kinds of sound are present in the audio recording.

    Once all non-speech is filtered out of the audio, it is the task of the speaker diarization subsystem to determine how many speakers occur in the recording and exactly when they are speaking. The speaker diarization subsystem applies agglomerative clustering to create clusters of speech fragments for each speaker in the recording. First, statistical speaker models are created on random chunks of the recording; by iteratively realigning the data, retraining the models and merging models that represent the same speaker, accurate speaker models are obtained for speaker clustering. This method does not require any statistical models developed on a training set, which makes the diarization subsystem insensitive to variation in audio conditions. Unfortunately, because the algorithm has complexity O(n^3), this clustering method is slow for long recordings. Two variations of the subsystem are presented that reduce the computational effort needed, so that the subsystem is applicable to long audio recordings as well.

    The automatic speech recognition subsystem developed for this research is based on Viterbi decoding on a fixed pronunciation prefix tree. Using the fixed tree, a flexible modular decoder could be developed, but it was not straightforward to apply full language model look-ahead efficiently. In this thesis a novel method is discussed that makes it possible to apply language model look-ahead effectively on the fixed tree. Also, to obtain higher speech recognition accuracy on audio with unknown acoustic conditions, a selection from the numerous known methods for robust automatic speech recognition is applied and evaluated in this thesis.

    The three individual subsystems as well as the entire system have been successfully evaluated on three international benchmarks. The diarization subsystem was evaluated at the NIST RT06s benchmark and the speech activity detection subsystem was tested at RT07s. The entire system was evaluated at N-Best, the first automatic speech recognition benchmark for Dutch.
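    The diarization subsystem described above clusters speech fragments bottom-up. As a rough, self-contained illustration (not the SHoUT implementation, which iteratively realigns and retrains speaker models), the sketch below merges single-Gaussian clusters with a BIC-style criterion; recomputing all pair scores for every merge is what gives naive agglomerative clustering its O(n^3) behaviour in the number of initial segments.

import numpy as np

def gauss_loglik(x: np.ndarray) -> float:
    """Log-likelihood of frames x under a single full-covariance Gaussian fit to x."""
    n, d = x.shape
    cov = np.cov(x, rowvar=False) + 1e-6 * np.eye(d)   # regularise
    _, logdet = np.linalg.slogdet(cov)
    # For an ML-fitted Gaussian the data log-likelihood reduces to this closed form.
    return -0.5 * n * (logdet + d * (1.0 + np.log(2.0 * np.pi)))

def bic_merge_score(a: np.ndarray, b: np.ndarray, lam: float = 1.0) -> float:
    """BIC-style merge score: > 0 suggests the two clusters are the same speaker."""
    d = a.shape[1]
    n = len(a) + len(b)
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return gauss_loglik(np.vstack([a, b])) - gauss_loglik(a) - gauss_loglik(b) + penalty

def agglomerative_diarization(segments):
    """Greedily merge the best-scoring pair of clusters until no pair scores > 0."""
    clusters = [np.asarray(s) for s in segments]   # start: one cluster per segment
    while len(clusters) > 1:
        pairs = [(bic_merge_score(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        score, i, j = max(pairs)
        if score <= 0:
            break
        clusters[i] = np.vstack([clusters[i], clusters[j]])
        del clusters[j]
    return clusters

# Toy data (invented): two "speakers" as feature clouds, each split into two chunks.
rng = np.random.default_rng(0)
spk1 = rng.normal(0.0, 1.0, size=(200, 4))
spk2 = rng.normal(3.0, 1.0, size=(200, 4))
chunks = [spk1[:100], spk1[100:], spk2[:100], spk2[100:]]
print(len(agglomerative_diarization(chunks)))   # expected: 2 clusters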

    Streaming Automatic Speech Recognition with Hybrid Architectures and Deep Neural Network Models

    Thesis by compendium. Over the last decade, the media have experienced a revolution, turning away from conventional TV in favor of on-demand platforms. This media revolution has not only changed the way entertainment is conceived but also how learning is conducted: on-demand educational platforms have proliferated and now provide educational resources on diverse topics. These new ways to distribute content have come along with requirements to improve accessibility, particularly related to hearing difficulties and language barriers. Here lies the opportunity for automatic speech recognition (ASR) to meet these requirements by providing high-quality automatic captioning, a sound basis for diminishing the accessibility gap, especially for live or streaming content. To this end, streaming ASR must work under strict real-time conditions, providing captions as fast as possible while working with limited context. However, this limited context usually leads to a quality degradation compared to pre-recorded or offline content.

    This thesis is aimed at developing low-latency streaming ASR with a quality similar to offline ASR. More precisely, it describes the path followed from an initial hybrid offline system to an efficient streaming-adapted system. The first step is to perform a single recognition pass using a state-of-the-art neural network-based language model; in conventional multi-pass systems, this model is often deferred to the second or later pass due to its computational complexity. As with the language model, the neural-based acoustic model is also adapted to work with limited context. The adaptation and integration of these models are thoroughly described and assessed using fully-fledged streaming systems on well-known academic and challenging real-world benchmarks. In brief, it is shown that the proposed adaptation of the language and acoustic models allows the streaming-adapted system to reach the accuracy of the initial offline system with low latency.

    Jorge Cano, J. (2022). Streaming Automatic Speech Recognition with Hybrid Architectures and Deep Neural Network Models [Doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/191001
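    The abstract describes adapting the acoustic model to limited context so captions can be produced while the audio is still arriving. As a rough, hypothetical sketch (not the thesis system), the generator below scores incoming feature frames chunk by chunk with a small right-context window, which is the basic mechanism that bounds latency in streaming hybrid ASR; acoustic_model, the chunk sizes and the dummy data are all invented for the example.

import numpy as np

def stream_decode(feature_stream, acoustic_model, chunk_frames=16, right_context=4):
    """Run a (hypothetical) acoustic model over an audio stream in limited-context chunks.

    Instead of waiting for the full utterance, frames are scored as soon as a chunk
    plus a small look-ahead window is available, bounding the latency to roughly
    (chunk_frames + right_context) * frame_shift.
    """
    buffer = []
    emitted = 0
    for frame in feature_stream:                     # frames arrive one at a time
        buffer.append(frame)
        # Score a chunk only once its right context has arrived.
        while len(buffer) >= emitted + chunk_frames + right_context:
            chunk = np.stack(buffer[emitted : emitted + chunk_frames + right_context])
            posteriors = acoustic_model(chunk)[:chunk_frames]   # drop look-ahead frames
            yield posteriors                          # feed these into the decoder
            emitted += chunk_frames
    # Flush whatever is left at the end of the stream (less right context available).
    if emitted < len(buffer):
        yield acoustic_model(np.stack(buffer[emitted:]))

# Toy usage with a dummy "acoustic model" (softmax over 3 fake states).
def dummy_model(x):
    logits = x @ np.ones((x.shape[1], 3))
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

frames = (np.random.default_rng(0).normal(size=4) for _ in range(50))
total = sum(len(p) for p in stream_decode(frames, dummy_model))
print(total)   # all 50 frames scored, in low-latency chunks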

    Error Correction based on Error Signatures applied to automatic speech recognition


    24th Nordic Conference on Computational Linguistics (NoDaLiDa)


    Negotiation of form: Analysis of Feedback and student response in two different contexts.

    In the first part of the dissertation, which corresponds to the theoretical part, a series of cognitive theories is analysed; we also deal with the study of learning strategies, analyse several studies on interaction in the field of discourse analysis, describe different studies relating discourse to the acquisition of a new language, and finally analyse classroom discourse from different perspectives. Once these aspects have been dealt with, we offer a bibliographic review of the topic studied in relation to feedback. This review begins with works written in the 1960s and extends to the present day; the works are described in terms of their main objective, the method used and the results offered. Finally, we deal with characteristics such as the importance of age in the learning process, since we are researching two groups of different ages, and we also describe different learning styles and affective factors.

    The research part is divided into hypotheses, subjects participating in the two contexts, method, results, discussion and conclusions. Some of the hypotheses we offer are the following: 1. In the native teacher's class, the corrective exchanges will contain more moves. 2. There will be more confirmation in the class with a lower level of English. 3. There will be more correction in the group with a lower level of English. 4. The more experienced teacher will encourage more self-correction. With respect to the description of subjects in both contexts, fifteen lessons were audio-taped in two different schools, at two different levels and with teachers who had different characteristics. In this part we also explain the social context, which is necessary to understand the characteristics of the two classes.

    In relation to the results obtained, we observed that the E.S.O. class is much more interactive than the Bachillerato class, which is shown in the fact that exchanges are much longer in the former, even though the teacher in that class was not a native speaker. We consider that recast, the corrective technique used mostly for phonological errors, is a very adequate technique and, in fact, the technique which students accepted most. We could also observe that it was not always the student who chose to reject the correction: in other cases it was a different student, or the teacher, who did not accept it. In this sense, we would advise teachers to reflect upon their attitude in class and to consider giving students the opportunity to repeat and accept their corrections.

    As a conclusion, we must assert that our study of the incidence of error and correction leads us to adopt a positive attitude towards students' mistakes. According to the communicative teaching methods used today, errors are not considered a lack of learning; rather, they are proof that learning is taking place. This is a fact broadly accepted today that can be applied to the learning of both a first language and a second language. We learn through a process of trial and error, constructing and testing hypotheses, and continually revising them in the light of direct correction and the new data we receive from it. We learn a language through using it, rather than learning it first and then using it. Errors must be considered as visible proof of the invisible process of learning.