2,730 research outputs found

Nonparallel Emotional Speech Conversion

Author: Chakraborty Deep
Gao Jian
Olaleye Olaitan
Tembine Hamidou
Publication venue: 'International Speech Communication Association'
Publication date: 13/04/2020

We propose a nonparallel data-driven emotional speech conversion method. It enables the transfer of emotion-related characteristics of a speech signal while preserving the speaker's identity and linguistic content. Most existing approaches require parallel data and time alignment, which are not available in most real applications. We achieve nonparallel training based on an unsupervised style transfer technique, which learns a translation model between two distributions instead of a deterministic one-to-one mapping between paired examples. The conversion model consists of an encoder and a decoder for each emotion domain. We assume that the speech signal can be decomposed into an emotion-invariant content code and an emotion-related style code in latent space. Emotion conversion is performed by extracting and recombining the content code of the source speech and the style code of the target emotion. We tested our method on a nonparallel corpus with four emotions. Both subjective and objective evaluations show the effectiveness of our approach.

Comment: Published in INTERSPEECH 2019, 5 pages, 6 figures. Simulation available at http://www.jian-gao.org/emoga

arXiv.org e-Print Archive

Crossref

Recommended from our members

Analyzing Distributional Learning of Phonemic Categories in Unsupervised Deep Neural Networks

Author: Mesgarani Nima
Nagamine Tasha
Räsänen Okko
Publication venue: eScholarship, University of California
Publication date: 01/01/2016

Infants’ speech perception adapts to the phonemic categories of their native language, a process assumed to be driven by the distributional properties of speech. This study investigates whether deep neural networks (DNNs), the current state-of-the-art in distributional feature learning, are capable of learning phoneme-like representations of speech in an unsupervised manner. We trained DNNs with unlabeled and labeled speech and analyzed the activations of each layer with respect to the phones in the input segments. The analyses reveal that the emergence of phonemic invariance in DNNs is dependent on the availability of phonemic labeling of the input during the training. No increased phonemic selectivity of the hidden layers was observed in the purely unsupervised networks despite successful learning of low-dimensional representations for speech. This suggests that additional learning constraints or more sophisticated models are needed to account for the emergence of phone-like categories in distributional learning operating on natural speech.

eScholarship - University of California

Examining the acquisition of phonological word-forms with computational experiments

Author: Storkel Holly L.
Vitevitch Michael S.
Publication venue: 'SAGE Publications'
Publication date: 01/12/2013

This is the author's accepted manuscript. The original publication is available at http://las.sagepub.com/content/early/2012/10/21/0023830912460513.full.pdf

It has been hypothesized that known words in the lexicon strengthen newly formed representations of novel words, resulting in words with dense neighborhoods being learned more quickly than words with sparse neighborhoods. Tests of this hypothesis in a connectionist network showed that words with dense neighborhoods were learned better than words with sparse neighborhoods when the network was exposed to the words all at once (Experiment 1), or gradually over time, like human word-learners (Experiment 2). This pattern was also observed despite variation in the availability of processing resources in the networks (Experiment 3). A learning advantage for words with sparse neighborhoods was observed only when the network was initially exposed to words with sparse neighborhoods and exposed to dense neighborhoods later in training (Experiment 4). The benefits of computational experiments for increasing our understanding of language processes and for the treatment of language processing disorders are discussed.
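The abstract above does not define neighborhood density. A common convention in this literature treats two words as neighbors when they differ by a single phoneme substitution, addition, or deletion. The following is a minimal Python sketch under that assumption; letters stand in for phonemes, the toy lexicon is made up, and the function names `is_neighbor` and `neighborhood_density` are hypothetical, not taken from the paper.

```python
def is_neighbor(a: str, b: str) -> bool:
    """Return True if a and b differ by exactly one substitution,
    addition, or deletion (assumed neighbor definition)."""
    if a == b:
        return False
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        # Same length: exactly one position may differ (substitution).
        return sum(x != y for x, y in zip(a, b)) == 1
    # Lengths differ by one: removing one symbol from the longer
    # word must yield the shorter word (addition or deletion).
    short, long_ = (a, b) if len(a) < len(b) else (b, a)
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))


def neighborhood_density(word: str, lexicon: list[str]) -> int:
    """Count the words in the lexicon that are neighbors of `word`."""
    return sum(is_neighbor(word, other) for other in lexicon)


# Toy lexicon (illustrative only): "cat" is dense, "dog" is sparse.
lexicon = ["cat", "bat", "hat", "cot", "at", "cart", "dog"]
dense = neighborhood_density("cat", lexicon)
sparse = neighborhood_density("dog", lexicon)
```

Under this definition a word with many one-symbol neighbors counts as having a dense neighborhood, and a word with few counts as sparse.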

KU ScholarWorks

PubMed Central

An on-line VAD based on Multi-Normalisation Scoring (MNS) of observation likelihoods

Author: Hernaez Rioja Inmaculada Concepción
Navas Cordón Eva
Odriozola Sustaeta Igor
Publication venue: 'Elsevier BV'
Publication date: 31/05/2018

Preprint of the article published online on 31 May 2018.

Voice activity detection (VAD) is an essential task in expert systems that rely on oral interfaces. The VAD module detects the presence of human speech and separates speech segments from silences and non-speech noises. The most popular current on-line VAD systems are based on adaptive parameters which seek to cope with varying channel and noise conditions. The main disadvantages of this approach are the need for some initialisation time to properly adjust the parameters to the incoming signal and uncertain performance in the case of poor estimation of the initial parameters. In this paper we propose a novel on-line VAD based only on previous training which does not introduce any delay. The technique is based on a strategy that we have called Multi-Normalisation Scoring (MNS). It consists of obtaining a vector of multiple observation likelihood scores from normalised mel-cepstral coefficients previously computed from different databases. A classifier is then used to label the incoming observation likelihood vector. Encouraging results have been obtained with a Multi-Layer Perceptron (MLP). This technique can generalise for unseen noise levels and types. A validation experiment with two current standard ITU-T VAD algorithms demonstrates the good performance of the method. Indeed, lower classification error rates are obtained for non-speech frames, while results for speech frames are similar.

This work was partially supported by the EU (ERDF) under grant TEC2015-67163-C2-1-R (RESTORE) (MINECO/ERDF, EU) and by the Basque Government under grant KK-2017/00043 (BerbaOla).

Crossref

Archivo Digital para la Docencia y la Investigación

Speech and neural network dynamics

Author: Renals Stephen John
Publication venue: The University of Edinburgh
Publication date: 01/01/1990

Edinburgh Research Archive

The Microsoft 2017 Conversational Speech Recognition System

Author: Alleva F.
Droppo J.
Huang X.
Stolcke A.
Wu L.
Xiong W.
Publication date: 24/08/2017

We describe the 2017 version of Microsoft's conversational speech recognition system, in which we update our 2016 system with recent developments in neural-network-based acoustic and language modeling to further advance the state of the art on the Switchboard speech recognition task. The system adds a CNN-BLSTM acoustic model to the set of model architectures we combined previously, and includes character-based and dialog session aware LSTM language models in rescoring. For system combination we adopt a two-stage approach, whereby subsets of acoustic models are first combined at the senone/frame level, followed by a word-level voting via confusion networks. We also added a confusion network rescoring step after system combination. The resulting system yields a 5.1% word error rate on the 2000 Switchboard evaluation set.
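For readers unfamiliar with the metric quoted above, word error rate is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the recognizer's hypothesis, divided by the number of reference words. The sketch below is a minimal Python illustration of that standard definition. It is not the scoring tool used for the Switchboard evaluation, the function name `word_error_rate` is hypothetical, and it assumes plain whitespace tokenization with no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Because insertions count as errors while the denominator is the reference length, this ratio can exceed 1.0 for very poor hypotheses.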

arXiv.org e-Print Archive

Crossref