130 research outputs found

    Nonparallel Emotional Speech Conversion

    We propose a nonparallel data-driven emotional speech conversion method. It enables the transfer of emotion-related characteristics of a speech signal while preserving the speaker's identity and linguistic content. Most existing approaches require parallel data and time alignment, which are not available in most real applications. We achieve nonparallel training based on an unsupervised style transfer technique, which learns a translation model between two distributions instead of a deterministic one-to-one mapping between paired examples. The conversion model consists of an encoder and a decoder for each emotion domain. We assume that the speech signal can be decomposed in latent space into an emotion-invariant content code and an emotion-related style code. Emotion conversion is performed by extracting and recombining the content code of the source speech and the style code of the target emotion. We tested our method on a nonparallel corpus with four emotions. Both subjective and objective evaluations show the effectiveness of our approach. Comment: Published in INTERSPEECH 2019, 5 pages, 6 figures. Simulation available at http://www.jian-gao.org/emoga
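    The decompose-and-recombine idea in this abstract can be sketched in a few lines. Everything below is a hypothetical stand-in, not the paper's actual networks: "encoding" is a fixed linear split of a feature vector into a content half and a style half, where the real system learns these codes with an encoder-decoder per emotion domain.

```python
# Toy illustration of content/style decomposition and recombination.
# The split point and all feature values are invented for illustration.

def encode(features):
    """Split a feature vector into (content_code, style_code).

    Hypothetical scheme: the first half is treated as emotion-invariant
    content, the second half as emotion-related style.
    """
    mid = len(features) // 2
    return features[:mid], features[mid:]

def decode(content_code, style_code):
    """Recombine a content code with a style code into a feature vector."""
    return content_code + style_code

def convert_emotion(source_features, target_features):
    """Transfer the target's style onto the source's content."""
    content, _ = encode(source_features)
    _, target_style = encode(target_features)
    return decode(content, target_style)

# Made-up "neutral" and "happy" utterance feature vectors.
neutral = [0.1, 0.2, 0.3, 0.4, 0.9, 0.8]
happy = [0.5, 0.5, 0.5, 0.5, 0.1, 0.2]

converted = convert_emotion(neutral, happy)
print(converted)  # content of `neutral` combined with style of `happy`
```

    The learned version replaces the fixed split with trained encoders and the concatenation with a trained decoder, but the conversion step has exactly this shape.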

    Perceptual optimization of unit-selection text-to-speech synthesis systems by means of active interactive genetic algorithms

    The tuning of a Unit Selection TTS (US-TTS) system is usually performed by an expert who weights the cost function by hand. However, hand tuning is costly in terms of the required training time, and inaccurate and ambiguous in terms of methodology. To ease the task of properly tuning the weights of the cost function, this thesis makes its contribution from a perceptual-based approach using active interactive Genetic Algorithms (aiGAs). The thesis pursues four major guidelines: i) accuracy when tuning the weights, ii) robustness of the obtained weights, iii) real-world applicability of the methodology to any cost function design, and iv) finding consensus among different users when tuning the weights. The experimentation is carried out on a small-to-medium-sized corpus (1.9 h) applied to different configurations (types of features) of the US-TTS cost function. The thesis concludes that aiGAs are highly competitive in comparison to other weight-tuning techniques from the state of the art.
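    A minimal sketch of evolving cost-function weights with a genetic algorithm. In the thesis the fitness signal comes from interactive user judgements; here a synthetic preference function stands in for the listener, and the weight count, population size, and all constants are illustrative assumptions.

```python
import random

random.seed(0)
N_WEIGHTS = 3          # e.g. target cost, join cost, prosody cost (assumed)
POP_SIZE = 20
GENERATIONS = 40

def simulated_user_preference(weights):
    """Stand-in for a listener's rating: prefers weights near (0.5, 0.3, 0.2)."""
    ideal = (0.5, 0.3, 0.2)
    return -sum((w - i) ** 2 for w, i in zip(weights, ideal))

def mutate(weights, scale=0.1):
    # Gaussian perturbation, clipped to the valid weight range [0, 1].
    return [min(1.0, max(0.0, w + random.gauss(0, scale))) for w in weights]

def crossover(a, b):
    # Uniform crossover: each gene comes from either parent.
    return [random.choice(pair) for pair in zip(a, b)]

population = [[random.random() for _ in range(N_WEIGHTS)] for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    population.sort(key=simulated_user_preference, reverse=True)
    parents = population[: POP_SIZE // 2]          # elitist selection
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP_SIZE - len(parents))]
    population = parents + children

best = max(population, key=simulated_user_preference)
print([round(w, 2) for w in best])
```

    The interactive variant swaps `simulated_user_preference` for an actual listening test per candidate, which is why the thesis focuses on keeping the number of evaluations (and thus user fatigue) low.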

    Experiments on Detection of Voiced Hesitations in Russian Spontaneous Speech

    The development and popularity of voice-user interfaces have made spontaneous speech processing an important research field. One of the main focus areas in this field is automatic speech recognition (ASR), which enables computers to recognize spoken language and transcribe it into text. However, ASR systems often work less efficiently for spontaneous than for read speech, since the former differs from any other type of speech in many ways, with the presence of speech disfluencies as its most prominent characteristic. These phenomena are an important feature in human-human communication, and at the same time they are a challenging obstacle for speech processing tasks. In this paper we address the detection of voiced hesitations (filled pauses and sound lengthenings) in Russian spontaneous speech using different machine learning techniques, from grid search and gradient descent in rule-based approaches to data-driven ones such as ELM and SVM based on automatically extracted acoustic features. Experimental results on a mixed, quality-diverse corpus of spontaneous Russian speech indicate the efficiency of these techniques for the task in question, with SVM outperforming the other methods.
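    A toy sketch of the SVM route to voiced-hesitation detection. The two features below (segment duration and a spectral-stability score) and the six data points are invented for illustration, not taken from the paper's feature set; a linear SVM is trained with plain sub-gradient descent on the hinge loss.

```python
import random

random.seed(1)

# (duration_sec, spectral_stability) -> +1 hesitation, -1 regular speech.
# Filled pauses tend to be long and spectrally stable.
data = [((0.40, 0.90), 1), ((0.55, 0.80), 1), ((0.35, 0.95), 1),
        ((0.10, 0.30), -1), ((0.15, 0.50), -1), ((0.08, 0.20), -1)]

w = [0.0, 0.0]
b = 0.0
lr, reg = 0.1, 0.01   # learning rate and L2 regularisation strength

for epoch in range(200):
    for (x, y) in data:
        margin = y * (w[0] * x[0] + w[1] * x[1] + b)
        if margin < 1:   # point inside the margin: hinge loss is active
            w = [wi + lr * (y * xi - reg * wi) for wi, xi in zip(w, x)]
            b += lr * y
        else:            # correctly classified with margin: only shrink w
            w = [wi - lr * reg * wi for wi in w]

def predict(x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else -1

print([predict(x) for x, _ in data])
```

    A kernel SVM on real acoustic features replaces the hand-made data and the linear score, but the decision rule has the same form.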

    Speech Activity and Speaker Change Point Detection for Online Streams

    The main focus of this thesis lies on two closely interrelated tasks, speech activity detection and speaker change point detection, and their applications in online processing. These tasks commonly play the crucial role of speech preprocessors in speech-processing applications, such as automatic speech recognition or speaker diarization. While their use in offline systems is extensively covered in the literature, the number of published works focusing on online use is limited. This is unfortunate, as many speech-processing applications (e.g., monitoring systems) are required to run in real time. The thesis begins with a three-chapter opening part, where the first introductory chapter explains the basic concepts and outlines the practical use of both tasks. It is followed by a chapter that reviews the current state of the art and lists the existing toolkits. That part is concluded by a chapter explaining the motivation behind this work and the practical use in monitoring systems; ultimately, this chapter sets the main goals of the thesis. The next two chapters cover the theoretical background of both tasks. They present selected approaches that are either relevant to this work (e.g., used for result comparisons) or focused on online processing. The following chapter proposes the final speech activity detection approach for online use. It gives a detailed description of the development of this approach as well as a thorough experimental evaluation. The approach yields state-of-the-art results under low- and medium-noise conditions on the standardized QUT-NOISE-TIMIT corpus. It is also integrated into a monitoring system, where it supplements a speech recognition system. The final speaker change point detection approach is proposed in the following chapter. It was designed in a series of consecutive experiments, which are extensively detailed there. An experimental evaluation on the COST278 database shows performance approaching that of the offline reference system while operating in online mode with low latency. Finally, the last chapter summarizes the results of the thesis.
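    The online constraint described in this abstract means every frame must be classified as it arrives, using only past context. A minimal sketch of that idea is an energy-based detector with a running noise estimate; the thesis approach is far more elaborate, and the threshold factor and adaptation rate here are invented for illustration.

```python
# Online, low-latency speech activity detection: one decision per frame,
# no lookahead. Constants are illustrative assumptions.

class OnlineVAD:
    def __init__(self, threshold_factor=3.0, noise_adapt=0.05):
        self.noise_energy = None              # running background-energy estimate
        self.threshold_factor = threshold_factor
        self.noise_adapt = noise_adapt

    def process_frame(self, frame):
        """Return True if the incoming frame looks like speech."""
        energy = sum(s * s for s in frame) / len(frame)
        if self.noise_energy is None:
            self.noise_energy = energy        # bootstrap from the first frame
            return False
        is_speech = energy > self.threshold_factor * self.noise_energy
        if not is_speech:                     # adapt the noise floor on non-speech only
            self.noise_energy += self.noise_adapt * (energy - self.noise_energy)
        return is_speech

vad = OnlineVAD()
quiet = [0.01] * 160   # low-energy "noise" frame
loud = [0.5] * 160     # high-energy "speech" frame
decisions = [vad.process_frame(f) for f in [quiet, quiet, loud, quiet]]
print(decisions)       # the first frame only bootstraps the noise estimate
```

    The latency of such a detector is exactly one frame, which is the property the online systems in the thesis are evaluated on.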

    Activity Report 2004


    Synthetic speech detection using phase information

    Taking advantage of the fact that most speech processing techniques neglect the phase information, we seek to detect phase perturbations in order to prevent synthetic impostors from attacking Speaker Verification systems. Two Synthetic Speech Detection (SSD) systems that use spectral phase related information are reviewed and evaluated in this work: one based on the Modified Group Delay (MGD), and the other based on the Relative Phase Shift (RPS). A classical module-based MFCC system is also used as baseline. Different training strategies are proposed and evaluated using both real spoofing samples and copy-synthesized signals from the natural ones, aiming to alleviate the issue of obtaining real data to train the systems. The recently published ASVSpoof2015 database is used for training and evaluation. Performance with completely unrelated data is also checked using synthetic speech from the Blizzard Challenge as evaluation material. The results prove that phase information can be successfully used for the SSD task even with unknown attacks. This work has been partially supported by the Basque Government (ElkarOla Project, KK-2015/00,098) and the Spanish Ministry of Economy and Competitiveness (Restore project, TEC2015-67,163-C2-1-R)
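    The phase information these detectors exploit can be illustrated with the plain group delay of a frame, computed via the standard identity gd(k) = (Xr*Yr + Xi*Yi)/|X|^2, where X is the DFT of x[n] and Y is the DFT of n*x[n]. The Modified Group Delay used in the paper additionally smooths the denominator spectrum; that refinement is omitted in this sketch.

```python
import cmath
import math

def dft(signal):
    """Naive DFT, adequate for a short illustrative frame."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n)) for k in range(n)]

def group_delay(frame, eps=1e-8):
    """Group delay per frequency bin: -d(phase)/d(omega), via the n*x[n] trick."""
    x = dft(frame)
    y = dft([t * s for t, s in enumerate(frame)])
    return [(xk.real * yk.real + xk.imag * yk.imag) / (abs(xk) ** 2 + eps)
            for xk, yk in zip(x, y)]

# Sanity check: a pure delayed impulse has constant group delay equal
# to its delay, here 3 samples, at every frequency bin.
frame = [0.0] * 16
frame[3] = 1.0
gd = group_delay(frame)
print([round(v, 3) for v in gd])
```

    Vocoded speech tends to show flatter, less natural phase structure than real speech, which is what MGD- and RPS-based features pick up on.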