
    King's speech: pronounce a foreign language with style

    Computer-assisted pronunciation training requires strategies that capture learners' attention and guide them along the learning pathway. In this paper, we introduce an immersive storytelling scenario for creating appropriate learning conditions. The proposed learning interaction is orchestrated by a spoken karaoke. We motivate the concept of the spoken karaoke and describe our design. Driven by the requirements of the proposed scenario, we suggest a modular architecture designed for immersive learning applications. We present our prototype system and our approach to processing the spoken and visual interaction modalities. Finally, we discuss how technological challenges can be addressed to enable the learner's self-evaluation.

    Analysis of the sensitivity of the End-Of-Turn Detection task to errors generated by the Automatic Speech Recognition process.

    An End-Of-Turn Detection Module (EOTD-M) is an essential component of automatic Spoken Dialogue Systems. The capability of correctly detecting whether a user's utterance has ended improves the accuracy of interpreting the meaning of the message and decreases the latency of the answer. Usually, in dialogue systems, an EOTD-M is coupled with an Automatic Speech Recognition Module (ASR-M) to transmit complete utterances to the Natural Language Understanding unit. Mistakes in the ASR-M transcription can have a strong effect on the performance of the EOTD-M. The actual extent of this effect depends on the particular combination of ASR-M transcription errors and the sentence featurization techniques implemented as part of the EOTD-M. In this paper we investigate this important relationship for an EOTD-M based on semantic information and particular characteristics of the speakers (speech profiles). We introduce an Automatic Speech Recognition Simulator (ASR-SIM) that models different types of semantic mistakes in the ASR-M transcription as well as different speech profiles. We use the simulator to evaluate the sensitivity to ASR-M mistakes of a Long Short-Term Memory network classifier trained for EOTD with different featurization techniques. Our experiments reveal the different ways in which the performance of the model is influenced by ASR-M errors. We corroborate that the ASR-SIM is not only useful for estimating the performance of an EOTD-M in customized noisy scenarios, but can also be used to generate training datasets with the expected error rates of real working conditions, which leads to better performance.
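
    The ASR-SIM itself is not reproduced here, but its core idea, corrupting clean transcripts at configurable error rates before they reach the EOTD-M, can be sketched in a few lines. The toy simulator below injects simple word-level substitutions, deletions and insertions; all names and rates are illustrative, and it stands in for, rather than implements, the paper's semantic mistakes and speech profiles:

        import random

        def simulate_asr_errors(words, sub_rate=0.1, del_rate=0.05, ins_rate=0.05,
                                vocabulary=("yes", "no", "okay", "well"), seed=None):
            """Inject word-level ASR-style errors into a clean transcript.

            Each word may be replaced by a random vocabulary word (substitution)
            or dropped (deletion), and spurious words may appear between tokens
            (insertion). Rates are per-word probabilities.
            """
            rng = random.Random(seed)
            noisy = []
            for word in words:
                r = rng.random()
                if r < del_rate:
                    continue  # deletion: the recognizer missed this word
                elif r < del_rate + sub_rate:
                    noisy.append(rng.choice(vocabulary))  # substitution
                else:
                    noisy.append(word)  # word survives intact
                if rng.random() < ins_rate:
                    noisy.append(rng.choice(vocabulary))  # spurious insertion
            return noisy

        clean = "i would like to book a table for two".split()
        print(simulate_asr_errors(clean, seed=42))

    Transcripts corrupted this way can serve both purposes the abstract mentions: stress-testing a trained end-of-turn classifier and generating training data whose error rate matches expected working conditions.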

    Audio-Visual Speech Processing for Multimedia Localisation

    For many years, film and television have dominated the entertainment industry. Recently, with the introduction of a range of digital formats and mobile devices, multimedia's ubiquity as the dominant form of entertainment has increased dramatically. This, in turn, has increased demand on the entertainment industry, with production companies looking to increase their revenue by providing entertainment media to a growing international market. This brings with it challenges in the form of multimedia localisation - the process of preparing content for international distribution. The industry is now looking to modernise production processes - moving what were once wholly manual practices to semi-automated workflows. A key aspect of the localisation process is the alignment of content, such as subtitles or audio, when adapting content from one region to another. One method of automating this is to use audio content as a guide, providing a solution via audio-to-text alignment. While many approaches for audio-to-text alignment currently exist, they all require language models - meaning that dozens of language models would be required for these approaches to be reliably implemented in large production companies. To address this, this thesis explores the development of audio-to-text alignment procedures which do not rely on language models, instead providing a language-independent method for aligning multimedia content. To achieve this, the project explores both audio and visual speech processing, with a focus on voice activity detection, as a means of segmenting and aligning audio and text data. The thesis first presents a novel method for detecting speech activity in entertainment media. This method is compared with the current state of the art and demonstrates significant improvement over baseline methods. Secondly, the thesis explores a novel set of features for detecting voice activity in visual speech data. Here, we show that the combination of landmark and appearance-based features outperforms recent methods for visual voice activity detection, and specifically that the incorporation of landmark features is particularly crucial when presented with challenging natural speech data. Lastly, a speech activity-based alignment framework is presented which demonstrates encouraging results. Here, we show that Dynamic Time Warping (DTW) can be used for segment matching and alignment of audio and subtitle data, and we also present a novel method for aligning scene-level content which outperforms DTW for sequence alignment of finer-level data. To conclude, we demonstrate that combining global and local alignment approaches achieves strong alignment estimates, but that the resulting output is not sufficient for wholly automated subtitle alignment. We therefore propose that this be used as a platform for the development of lexical-discovery-based alignment techniques, as the general alignment provided by our system would improve symbolic sequence discovery for sparse dictionary-based systems.
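
    The alignment step named in the abstract, Dynamic Time Warping, is a standard algorithm, so a generic version can be shown without guessing at the thesis's specifics. The sketch below aligns two 1-D sequences (here, hypothetical segment durations from speech-activity detection and from subtitle timings) and recovers the warping path:

        import numpy as np

        def dtw_alignment(a, b):
            """Classic dynamic time warping between two 1-D sequences.

            Returns the cumulative alignment cost and the warping path as a
            list of (i, j) index pairs, using absolute difference as the
            local cost.
            """
            n, m = len(a), len(b)
            D = np.full((n + 1, m + 1), np.inf)
            D[0, 0] = 0.0
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    cost = abs(a[i - 1] - b[j - 1])
                    D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
            # Trace back from the end corner to recover the path.
            path, i, j = [], n, m
            while i > 0 and j > 0:
                path.append((i - 1, j - 1))
                step = np.argmin((D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]))
                if step == 0:
                    i, j = i - 1, j - 1
                elif step == 1:
                    i -= 1
                else:
                    j -= 1
            return D[n, m], path[::-1]

        audio_segments = [1.2, 0.8, 2.5, 1.0]         # speech-segment durations (s)
        subtitle_segments = [1.1, 0.9, 2.4, 0.3, 1.0]
        cost, path = dtw_alignment(audio_segments, subtitle_segments)
        print(cost, path)

    The thesis pairs this kind of local matching with a global, scene-level alignment method; the sketch covers only the generic DTW component.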

    Segmentation, Diarization and Speech Transcription: Surprise Data Unraveled

    In this thesis, research on large vocabulary continuous speech recognition for unknown audio conditions is presented. For automatic speech recognition systems based on statistical methods, it is important that the conditions of the audio used for training the statistical models match the conditions of the audio to be processed. Any mismatch will decrease the accuracy of the recognition. If it is unpredictable what kind of data can be expected, or in other words if the conditions of the audio to be processed are unknown, it is impossible to tune the models. If the material consists of 'surprise data', the output of the system is likely to be poor. In this thesis, methods are presented for which no external training data is required for training models. These novel methods have been implemented in a large vocabulary continuous speech recognition system called SHoUT. This system consists of three subsystems: speech/non-speech classification, speaker diarization and automatic speech recognition. The speech/non-speech classification subsystem separates speech from silence and unknown audible non-speech events. The type of non-speech present in audio recordings can vary from paper shuffling in recordings of meetings to sound effects in television shows. Because it is unknown what type of non-speech needs to be detected, it is not possible to train high quality statistical models for each type of non-speech sound. The speech/non-speech classification subsystem, also called the speech activity detection subsystem, does not attempt to classify all audible non-speech in a single run. Instead, first a bootstrap speech/silence classification is obtained using a standard speech activity component. Next, the models for speech, silence and audible non-speech are trained on the target audio using the bootstrap classification. This approach makes it possible to classify speech and non-speech with high accuracy, without the need to know what kinds of sound are present in the audio recording. Once all non-speech is filtered out of the audio, it is the task of the speaker diarization subsystem to determine how many speakers occur in the recording and exactly when they are speaking. The speaker diarization subsystem applies agglomerative clustering to create clusters of speech fragments for each speaker in the recording. First, statistical speaker models are created on random chunks of the recording; by iteratively realigning the data, retraining the models and merging models that represent the same speaker, accurate speaker models are obtained for speaker clustering. This method does not require any statistical models developed on a training set, which makes the diarization subsystem insensitive to variation in audio conditions. Unfortunately, because the algorithm is of complexity O(n^3), this clustering method is slow for long recordings. Two variations of the subsystem are presented that reduce the needed computational effort, so that the subsystem is applicable to long audio recordings as well. The automatic speech recognition subsystem developed for this research is based on Viterbi decoding on a fixed pronunciation prefix tree. Using the fixed tree, a flexible modular decoder could be developed, but it was not straightforward to apply full language model look-ahead efficiently. In this thesis a novel method is discussed that makes it possible to apply language model look-ahead effectively on the fixed tree. Also, to obtain higher speech recognition accuracy on audio with unknown acoustic conditions, a selection of the numerous known methods for robust automatic speech recognition is applied and evaluated in this thesis. The three individual subsystems as well as the entire system have been successfully evaluated on three international benchmarks. The diarization subsystem has been evaluated at the NIST RT06s benchmark and the speech activity detection subsystem has been tested at RT07s. The entire system was evaluated at N-Best, the first automatic speech recognition benchmark for Dutch.
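
    The diarization loop is described above only at a high level, but its shape, agglomerative clustering with a model-selection criterion deciding when to stop merging, is well known. The following heavily simplified sketch uses diagonal-Gaussian cluster models and a BIC-style merge score; all parameters are illustrative, and it omits the realignment and retraining steps of the real subsystem:

        import numpy as np

        def bic_merge_score(x, y, penalty=2.0):
            """Delta-BIC for merging two clusters of feature vectors, modelled
            as diagonal Gaussians. Negative scores mean the merged model
            explains the data better, i.e. likely the same speaker."""
            def log_det(z):
                return np.sum(np.log(np.var(z, axis=0) + 1e-8))
            nx, ny, d = len(x), len(y), x.shape[1]
            merged = np.vstack([x, y])
            delta = (nx + ny) * log_det(merged) - nx * log_det(x) - ny * log_det(y)
            return 0.5 * delta - 0.5 * penalty * d * np.log(nx + ny)

        def agglomerative_diarization(chunks):
            """Greedily merge the chunk pair with the lowest delta-BIC until no
            merge scores below zero; the survivors are speaker clusters."""
            clusters = [np.asarray(c) for c in chunks]
            while len(clusters) > 1:
                scores = [(bic_merge_score(clusters[i], clusters[j]), i, j)
                          for i in range(len(clusters))
                          for j in range(i + 1, len(clusters))]
                best, i, j = min(scores)
                if best >= 0:
                    break  # no merge improves the model: stop
                merged = np.vstack([clusters[i], clusters[j]])
                clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
                clusters.append(merged)
            return clusters

        rng = np.random.default_rng(0)
        # Two synthetic "speakers" in feature space, split into four chunks.
        spk_a = rng.normal(0.0, 1.0, size=(200, 12))
        spk_b = rng.normal(3.0, 1.0, size=(200, 12))
        chunks = [spk_a[:100], spk_a[100:], spk_b[:100], spk_b[100:]]
        print(len(agglomerative_diarization(chunks)), "speaker clusters found")

    Re-scoring all cluster pairs on every iteration is what drives the O(n^3) behaviour noted above, and it is this cost that the two faster variants of the subsystem are designed to reduce.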

    Recent Advances in Steganography

    Steganography is the art and science of communicating in a way that hides the existence of the communication. Steganographic technologies are an important part of the future of security and privacy on open systems such as the Internet. This book focuses on a relatively new field of study in steganography, introducing readers to the concepts of steganography and steganalysis. The book gives a brief history of steganography and surveys steganalysis methods with respect to their modeling techniques. Some new steganography techniques for hiding secret data in images are presented. Furthermore, steganography in speech is reviewed, and a new approach for hiding data in speech is introduced.
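
    As a concrete, if didactic, illustration of image steganography of the kind the book surveys, the classic least-significant-bit (LSB) scheme hides message bits in the lowest bit of each pixel byte. This sketch is generic textbook material, not a technique from the book itself:

        def embed_lsb(pixels, message):
            """Hide a UTF-8 message in the least significant bits of a flat
            sequence of 0-255 pixel values, prefixed with a 16-bit length."""
            payload = message.encode("utf-8")
            data = len(payload).to_bytes(2, "big") + payload
            bits = []
            for byte in data:
                bits.extend((byte >> k) & 1 for k in range(7, -1, -1))
            if len(bits) > len(pixels):
                raise ValueError("cover image too small for message")
            stego = list(pixels)
            for i, bit in enumerate(bits):
                stego[i] = (stego[i] & ~1) | bit  # overwrite the LSB only
            return stego

        def extract_lsb(pixels):
            """Recover a message embedded by embed_lsb."""
            bits = [p & 1 for p in pixels]
            def to_bytes(bs):
                return bytes(sum(b << (7 - k) for k, b in enumerate(bs[i:i + 8]))
                             for i in range(0, len(bs), 8))
            length = int.from_bytes(to_bytes(bits[:16]), "big")
            return to_bytes(bits[16:16 + 8 * length]).decode("utf-8")

        cover = list(range(256)) * 8  # stand-in for grayscale pixel data
        print(extract_lsb(embed_lsb(cover, "hidden")))

    Flipping only the lowest bit changes each pixel value by at most one, which is why LSB embedding is visually imperceptible yet statistically detectable, the kind of trace that steganalysis methods exploit.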

    Bayesian Approaches to Uncertainty in Speech Processing

    Speech Recognition

    Chapters in the first part of the book cover all the essential speech processing techniques for building robust automatic speech recognition systems: the representation of speech signals and methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition: speaker identification and tracking, prosody modeling in emotion-detection systems, and applications that operate in real-world environments, such as mobile communication services and smart homes.
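
    The "efficient algorithms for searching the hypothesis space" that such chapters build up to are typically variants of Viterbi decoding over a hidden Markov model. A bare-bones version, with a toy two-state model and illustrative numbers, might look like this:

        import numpy as np

        def viterbi(obs, log_init, log_trans, log_emit):
            """Most likely HMM state sequence for a discrete observation
            sequence. All parameters are log-probabilities: log_init[s],
            log_trans[s, t] and log_emit[s, o]."""
            n_states, T = len(log_init), len(obs)
            score = np.empty((T, n_states))
            back = np.zeros((T, n_states), dtype=int)
            score[0] = log_init + log_emit[:, obs[0]]
            for t in range(1, T):
                for s in range(n_states):
                    cand = score[t - 1] + log_trans[:, s]
                    back[t, s] = np.argmax(cand)
                    score[t, s] = cand[back[t, s]] + log_emit[s, obs[t]]
            # Backtrace from the best final state.
            path = [int(np.argmax(score[-1]))]
            for t in range(T - 1, 0, -1):
                path.append(int(back[t, path[-1]]))
            return path[::-1], float(np.max(score[-1]))

        log_init = np.log([0.6, 0.4])
        log_trans = np.log([[0.7, 0.3], [0.4, 0.6]])
        log_emit = np.log([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
        print(viterbi([0, 1, 2, 2], log_init, log_trans, log_emit))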