37 research outputs found

    Robust speech recognition with spectrogram factorisation

    Communication by speech is intrinsic to humans. Since the breakthrough of mobile devices and wireless communication, digital transmission of speech has become ubiquitous. Similarly, the distribution and storage of audio and video data have increased rapidly. However, despite being technically capable of recording and processing audio signals, only a fraction of digital systems and services are actually able to work with spoken input, that is, to operate on the lexical content of speech. One persistent obstacle to the practical deployment of automatic speech recognition systems is inadequate robustness against noise and other interference, which regularly corrupt signals recorded in real-world environments. Speech and diverse noises are both complex signals, which are not trivially separable. Despite decades of research and a multitude of different approaches, the problem has not been solved to a sufficient extent. In particular, the mathematically ill-posed problem of separating multiple sources from a single-channel input requires advanced models and algorithms to be solvable. One promising path is using a composite model of long-context atoms to represent a mixture of non-stationary sources based on their spectro-temporal behaviour. Algorithms derived from the family of non-negative matrix factorisations have been applied to such problems to separate and recognise individual sources like speech. This thesis describes a set of tools developed for non-negative modelling of audio spectrograms, especially involving speech and real-world noise sources. An overview of the complete framework is provided, starting from model and feature definitions, advancing to factorisation algorithms, and finally describing different routes for separation, enhancement, and recognition tasks. Current issues and their potential solutions are discussed both theoretically and from a practical point of view.
The included publications describe factorisation-based recognition systems, which have been evaluated on publicly available speech corpora in order to determine the efficiency of various separation and recognition algorithms. Several variants and system combinations that have been proposed in the literature are also discussed. The work covers a broad span of factorisation-based system components, which together aim at providing a practically viable solution to robust processing and recognition of speech in everyday situations.
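For readers unfamiliar with the technique named in the abstract, the following is a minimal sketch of basic non-negative matrix factorisation using the classic multiplicative updates for the squared-error cost. It is not the thesis's own algorithm: the abstract refers to long-context atoms and a family of factorisation variants, whereas this sketch uses single-frame atoms, and the toy "spectrogram" `V` and all function names are assumptions made up for illustration.

```python
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]

def frobenius_error(V, W, H):
    """Squared reconstruction error ||V - WH||^2."""
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx))

def nmf(V, rank, iterations=200, seed=0, eps=1e-9):
    """Factorise non-negative V (bins x frames) into W (atoms) and H (activations)."""
    rng = random.Random(seed)
    n_bins, n_frames = len(V), len(V[0])
    W = [[rng.random() + eps for _ in range(rank)] for _ in range(n_bins)]
    H = [[rng.random() + eps for _ in range(n_frames)] for _ in range(rank)]
    for _ in range(iterations):
        # Activation update: H <- H * (W^T V) / (W^T W H)
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * n / (d + eps) for h, n, d in zip(rh, rn, rd)]
             for rh, rn, rd in zip(H, num, den)]
        # Atom update: W <- W * (V H^T) / (W H H^T)
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * n / (d + eps) for w, n, d in zip(rw, rn, rd)]
             for rw, rn, rd in zip(W, num, den)]
    return W, H

# Toy magnitude spectrogram (4 frequency bins x 6 frames) mixing two hypothetical sources:
# one occupying the low bins ("speech") and one occupying the high bins ("noise").
V = [
    [1.0, 2.0, 0.0, 1.0, 0.0, 2.0],
    [0.5, 1.0, 0.0, 0.5, 0.0, 1.0],
    [0.0, 1.0, 1.0, 2.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 2.0, 1.0, 0.0],
]
W, H = nmf(V, rank=2)
```

In a separation setting, each column of `W` would act as a spectral atom and each row of `H` as its activation over time; a source estimate is then obtained by multiplying only the atoms and activations assigned to that source.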

    Fast Speech in Unit Selection Speech Synthesis

    Moers-Prinz D. Fast Speech in Unit Selection Speech Synthesis. Bielefeld: Universität Bielefeld; 2020. Speech synthesis is part of the everyday life of many people with severe visual disabilities. For those who rely on assistive speech technology, the possibility of choosing a fast speaking rate is reported to be essential. Expressive speech synthesis and other spoken language interfaces may also require the integration of fast speech. Architectures like formant or diphone synthesis are able to produce synthetic speech at fast speech rates, but the generated speech does not sound very natural. Unit selection synthesis systems, however, are capable of delivering more natural output. Nevertheless, fast speech has not been adequately implemented in such systems to date. Thus, the goal of the work presented here was to determine an optimal strategy for modeling fast speech in unit selection speech synthesis, to provide potential users with a more natural-sounding alternative for fast speech output.

    Bayesian Approaches to Uncertainty in Speech Processing


    Synthesis of listener vocalizations : towards interactive speech synthesis

    Spoken and multi-modal dialogue systems are starting to use listener vocalizations, such as uh-huh and mm-hm, for natural interaction. Generation of listener vocalizations is one of the major objectives of emotionally colored conversational speech synthesis. Success in this endeavor depends on the answers to three questions: Where to synthesize a listener vocalization? What meaning should be conveyed through the synthesized vocalization? And how to realize an appropriate listener vocalization with the intended meaning? This thesis addresses the last of these questions. The investigation starts by proposing a three-stage approach: (i) data collection, (ii) annotation, and (iii) realization. The first stage presents a method to collect natural listener vocalizations from German and British English professional actors in a recording studio. The second stage explores a methodology for annotating listener vocalizations that covers both meaning and behavior (form). The third stage proposes a realization strategy that uses unit selection and signal modification techniques to generate appropriate listener vocalizations upon user requests. Finally, the naturalness and appropriateness of the synthesized vocalizations are evaluated using perception studies. The work is implemented in the open source MARY text-to-speech framework, and it is integrated into the SEMAINE project's Sensitive Artificial Listener (SAL) demonstrator.

    Interim research assessment 2003-2005 - Computer Science

    This report primarily serves as a source of information for the 2007 Interim Research Assessment Committee for Computer Science at the three technical universities in the Netherlands. The report also provides information for others interested in our research activities.

    Essential Speech and Language Technology for Dutch: Results by the STEVIN-programme

    Computational Linguistics; Germanic Languages; Artificial Intelligence (incl. Robotics); Computing Methodologies

    Utilización de la fase armónica en la detección de voz sintética [Use of harmonic phase in synthetic speech detection]

    156 pp. Speaker verification (SV) systems must contend with the possibility of being attacked through spoofing techniques. Voice conversion and speaker-adapted speech synthesis technologies have now advanced far enough to create voices capable of fooling an SV system. This thesis proposes a synthetic speech detection (SSD) module that can be used as a complement to an SV system but can also operate independently. It consists of a GMM-based classifier equipped with models of human and synthetic speech. Each input is compared against both models, and if the difference in likelihoods exceeds a given threshold, the input is accepted as human; otherwise it is rejected. The system developed is speaker-independent. RPS parameters are used to build the models. A technique is proposed to reduce the complexity of the training process, avoiding the need to build an adapted TTS system or a voice converter for each speaker. To that end, since most modern adaptation and synthesis systems rely on vocoders, the thesis proposes transcoding the human signals through vocoders to obtain their synthetic versions, which are then used to build the classifier's synthetic models. It is shown that synthetic signals can be detected by detecting that they were created with a vocoder. The system's performance is tested under different conditions: with the transcoded signals themselves and with TTS attacks. Finally, strategies for training models for SSD systems are proposed.
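The decision rule described in this abstract (score an input against a human model and a synthetic model, then accept it as human if the likelihood difference exceeds a threshold) can be sketched as follows. This is only an illustration under stated assumptions: the diagonal-covariance GMMs, their hand-picked parameters, the two-dimensional feature vectors, the zero threshold, and all function names are hypothetical, and the thesis's actual RPS features and trained models are not reproduced here.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return -0.5 * sum(
        math.log(2 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(frames, gmm):
    """Average per-frame log-likelihood under a GMM given as (weight, mean, var) tuples."""
    total = 0.0
    for x in frames:
        comps = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
        peak = max(comps)  # log-sum-exp for numerical stability
        total += peak + math.log(sum(math.exp(c - peak) for c in comps))
    return total / len(frames)

def is_human(frames, human_gmm, synthetic_gmm, threshold=0.0):
    """Accept as human if the log-likelihood difference exceeds the threshold."""
    score = gmm_log_likelihood(frames, human_gmm) - gmm_log_likelihood(frames, synthetic_gmm)
    return score > threshold

# Hypothetical toy models: in practice these would be trained on real feature vectors.
human_gmm = [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])]
synthetic_gmm = [(1.0, [4.0, 4.0], [1.0, 1.0])]

human_like = [[0.1, -0.2], [0.9, 1.1]]
synthetic_like = [[3.9, 4.2], [4.1, 3.8]]
accepted = is_human(human_like, human_gmm, synthetic_gmm)
rejected = not is_human(synthetic_like, human_gmm, synthetic_gmm)
```

Raising the threshold makes the detector stricter about what it accepts as human, at the cost of rejecting more genuine speech; the operating point would normally be tuned on held-out data.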

    Speech wave-form driven motion synthesis for embodied agents

    The main objective of this thesis is to synthesise motion from speech, especially in conversation. Although previous research has investigated different acoustic features and combinations of them, no one has investigated estimating head motion directly from the waveform, which is the raw form of the speech signal. Thus, we study the direct use of the speech waveform to generate head motion. We claim that creating a task-specific feature from the waveform leads to better overall performance in generating head motion than using standard acoustic features. At the same time, we completely abandon the handcrafted feature extraction process, which makes the approach more effective. However, applying the speech waveform raises two problems: 1) high dimensionality, since the dimension of the waveform data is much higher than that of common acoustic features, which makes training the model more difficult, and 2) irrelevant information, since the full information in the original waveform may encumber neural network training. To resolve these problems, we applied a deep canonical-correlation-constrained auto-encoder (DCCCAE) to compress the waveform into low-dimensional embedded features that are highly correlated with head motion. The estimated head motion was evaluated both objectively and subjectively. In the objective evaluation, the results confirmed that the DCCCAE creates a feature that is more correlated with head motion than those of a standard AE and of other popular spectral features such as MFCC and FBank, and that it can be used to achieve state-of-the-art results for predicting natural head motion. Besides investigating the representation learning of the feature, we also explored an LSTM-based regression model for the proposed feature. The LSTM-based models were able to boost the overall performance in the objective evaluation and adapted better to the proposed feature than to MFCC.
MUSHRA-like subjective evaluation results suggest that participants rated the animations generated by models with the proposed feature as better than those of the other models. An A/B test further indicated that the LSTM-based regression model adapts better to the proposed feature. Furthermore, we extended the architecture to estimate upper-body motion as well. We submitted our results to GENEA2020, and our model achieved a higher score than BA in both aspects (human-likeness and appropriateness) according to the participants' preferences, suggesting that the highly correlated feature pair and the sequential estimation helped improve model generalisation.

    Speech assessment and characterization for law enforcement applications

    Speech signals acquired, transmitted or stored in non-ideal conditions are often degraded by one or more effects including, for example, additive noise. These degradations alter the signal properties in a manner that deteriorates the intelligibility or quality of the speech signal. In the law enforcement context such degradations are commonplace due to limitations in the audio collection methodology, which is often required to be covert. In severe degradation conditions, the acquired signal may become unintelligible, losing its value in an investigation; in less severe conditions, a loss in signal quality may be encountered, which can lead to higher transcription time and cost. This thesis proposes a non-intrusive speech assessment framework from which algorithms for speech quality and intelligibility assessment are derived, to guide the collection and transcription of law enforcement audio. These methods are trained on a large database labelled using intrusive techniques (whose performance is verified with subjective scores) and are shown to perform favorably when compared with existing non-intrusive techniques. Additionally, a non-intrusive CODEC identification and verification algorithm is developed, which can identify a CODEC with an accuracy of 96.8 % and detect the presence of a CODEC with an accuracy higher than 97 % in the presence of additive noise. Finally, the speech description taxonomy framework is developed, with the aim of characterizing various aspects of a degraded speech signal, including the mechanism that results in a signal with particular characteristics, the vocabulary that can be used to describe those degradations, and the measurable signal properties that can characterize the degradations. The taxonomy is implemented as a relational database that facilitates the modeling of the relationships between various attributes of a signal and promises to be a useful tool for training and guiding audio analysts.