2,730 research outputs found
Nonparallel Emotional Speech Conversion
We propose a nonparallel data-driven emotional speech conversion method. It
enables the transfer of emotion-related characteristics of a speech signal
while preserving the speaker's identity and linguistic content. Most existing
approaches require parallel data and time alignment, which are not available in
most real applications. We achieve nonparallel training based on an
unsupervised style transfer technique, which learns a translation model between
two distributions instead of a deterministic one-to-one mapping between paired
examples. The conversion model consists of an encoder and a decoder for each
emotion domain. We assume that the speech signal can be decomposed into an
emotion-invariant content code and an emotion-related style code in latent
space. Emotion conversion is performed by extracting and recombining the
content code of the source speech and the style code of the target emotion. We
tested our method on a nonparallel corpus with four emotions. Both subjective
and objective evaluations show the effectiveness of our approach.
Comment: Published in INTERSPEECH 2019, 5 pages, 6 figures. Simulation
available at http://www.jian-gao.org/emoga
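The recombination step described in this abstract can be illustrated with a toy sketch. It is not the authors' model: the class `ToyDomainModel`, the feature and code dimensions, and the use of a random orthogonal matrix in place of trained encoder and decoder networks are all assumptions made for illustration. Taking the style code from a reference utterance in the target emotion is also an assumption. The sketch only shows the data flow: encode the source into a content code, obtain a style code from the target domain, and decode the pair with the target domain's decoder.

```python
import numpy as np

FEAT_DIM, CONTENT_DIM = 8, 6  # hypothetical sizes; the style code takes the remaining 2 dims


class ToyDomainModel:
    """Hypothetical stand-in for one emotion domain's encoder/decoder pair."""

    def __init__(self, rng):
        # A random orthogonal basis replaces trained networks, so that
        # decode(encode(x)) reproduces x exactly.
        q, _ = np.linalg.qr(rng.standard_normal((FEAT_DIM, FEAT_DIM)))
        self.basis = q

    def encode(self, x):
        """Split a feature vector into (content code, style code)."""
        z = self.basis.T @ x
        return z[:CONTENT_DIM], z[CONTENT_DIM:]

    def decode(self, content, style):
        """Rebuild a feature vector from a content code and a style code."""
        return self.basis @ np.concatenate([content, style])


def convert(source_model, target_model, x_source, x_target_ref):
    """Recombine the source content code with a target-domain style code."""
    content, _ = source_model.encode(x_source)
    _, style = target_model.encode(x_target_ref)
    return target_model.decode(content, style)


rng = np.random.default_rng(0)
neutral, angry = ToyDomainModel(rng), ToyDomainModel(rng)
x_src = rng.standard_normal(FEAT_DIM)  # stand-in for source speech features
x_ref = rng.standard_normal(FEAT_DIM)  # stand-in for a target-emotion reference
converted = convert(neutral, angry, x_src, x_ref)
```

In a trained system the content space would be shared across domains, which is what makes the swapped code meaningful. Here the two bases are unrelated, so the output has the right structure but no acoustic interpretation.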
Analyzing Distributional Learning of Phonemic Categories in Unsupervised Deep Neural Networks
Infants’ speech perception adapts to the phonemic categories of their native language, a process assumed to be driven by the distributional properties of speech. This study investigates whether deep neural networks (DNNs), the current state-of-the-art in distributional feature learning, are capable of learning phoneme-like representations of speech in an unsupervised manner. We trained DNNs with unlabeled and labeled speech and analyzed the activations of each layer with respect to the phones in the input segments. The analyses reveal that the emergence of phonemic invariance in DNNs is dependent on the availability of phonemic labeling of the input during the training. No increased phonemic selectivity of the hidden layers was observed in the purely unsupervised networks despite successful learning of low-dimensional representations for speech. This suggests that additional learning constraints or more sophisticated models are needed to account for the emergence of phone-like categories in distributional learning operating on natural speech.
Examining the acquisition of phonological word-forms with computational experiments
This is the author's accepted manuscript. The original publication is available at http://las.sagepub.com/content/early/2012/10/21/0023830912460513.full.pdf
It has been hypothesized that known words in the lexicon strengthen newly formed representations of novel words, resulting in words with dense neighborhoods being learned more quickly than words with sparse neighborhoods. Tests of this hypothesis in a connectionist network showed that words with dense neighborhoods were learned better than words with sparse neighborhoods when the network was exposed to the words all at once (Experiment 1), or gradually over time, like human word-learners (Experiment 2). This pattern was also observed despite variation in the availability of processing resources in the networks (Experiment 3). A learning advantage for words with sparse neighborhoods was observed only when the network was initially exposed to words with sparse neighborhoods and exposed to dense neighborhoods later in training (Experiment 4). The benefits of computational experiments for increasing our understanding of language processes and for the treatment of language processing disorders are discussed.
An on-line VAD based on Multi-Normalisation Scoring (MNS) of observation likelihoods
Preprint of the article published online on 31 May 2018.
Voice activity detection (VAD) is an essential task in expert systems that rely on oral interfaces. The VAD module detects the presence of human speech and separates speech segments from silences and non-speech noises. The most popular current on-line VAD systems are based on adaptive parameters which seek to cope with varying channel and noise conditions. The main disadvantages of this approach are the need for some initialisation time to properly adjust the parameters to the incoming signal and uncertain performance in the case of poor estimation of the initial parameters. In this paper we propose a novel on-line VAD based only on previous training which does not introduce any delay. The technique is based on a strategy that we have called Multi-Normalisation Scoring (MNS). It consists of obtaining a vector of multiple observation likelihood scores from normalised mel-cepstral coefficients previously computed from different databases. A classifier is then used to label the incoming observation likelihood vector. Encouraging results have been obtained with a Multi-Layer Perceptron (MLP). This technique can generalise for unseen noise levels and types. A validation experiment with two current standard ITU-T VAD algorithms demonstrates the good performance of the method. Indeed, lower classification error rates are obtained for non-speech frames, while results for speech frames are similar.
This work was partially supported by the EU (ERDF) under grant TEC2015-67163-C2-1-R (RESTORE) (MINECO/ERDF, EU) and by the Basque Government under grant KK-2017/00043 (BerbaOla).
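One possible reading of the Multi-Normalisation Scoring idea can be sketched as follows. The abstract does not give the details, so this is an interpretation and not the authors' implementation. It assumes that each database contributes a mean and standard deviation used to normalise the incoming mel-cepstral frame, and that each normalised frame is scored against a speech model and a non-speech model. The diagonal Gaussian models, the function names, and all numeric values are hypothetical. The resulting score vector is what a classifier such as the MLP mentioned in the abstract would label; the classifier itself is left out.

```python
import numpy as np


def gaussian_loglik(x, mean, var):
    """Log-likelihood of x under a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)


def mns_score_vector(frame, norm_stats, speech_model, nonspeech_model):
    """Score one frame under every normalisation (assumed interpretation).

    norm_stats: list of (mean, std) pairs, one per training database.
    speech_model, nonspeech_model: (mean, var) of hypothetical Gaussian models.
    Returns a vector with two log-likelihood scores per normalisation.
    """
    scores = []
    for mu, sigma in norm_stats:
        z = (frame - mu) / sigma  # normalise with this database's statistics
        scores.append(gaussian_loglik(z, *speech_model))
        scores.append(gaussian_loglik(z, *nonspeech_model))
    return np.array(scores)


# Toy values only: a 3-dimensional "cepstral" frame and two databases.
norm_stats = [
    (np.zeros(3), np.ones(3)),
    (np.full(3, 0.5), np.full(3, 2.0)),
]
speech_model = (np.full(3, 1.0), np.ones(3))
nonspeech_model = (np.full(3, -1.0), np.ones(3))
frame = np.array([0.8, 1.2, 0.9])

scores = mns_score_vector(frame, norm_stats, speech_model, nonspeech_model)
# 'scores' would be the input to the speech/non-speech classifier.
```

Because every normalisation uses statistics fixed before deployment, nothing has to adapt to the incoming signal, which matches the abstract's claim that the method needs no initialisation time.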
The Microsoft 2017 Conversational Speech Recognition System
We describe the 2017 version of Microsoft's conversational speech recognition
system, in which we update our 2016 system with recent developments in
neural-network-based acoustic and language modeling to further advance the
state of the art on the Switchboard speech recognition task. The system adds a
CNN-BLSTM acoustic model to the set of model architectures we combined
previously, and includes character-based and dialog session aware LSTM language
models in rescoring. For system combination we adopt a two-stage approach,
whereby subsets of acoustic models are first combined at the senone/frame
level, followed by a word-level voting via confusion networks. We also added a
confusion network rescoring step after system combination. The resulting system
yields a 5.1% word error rate on the 2000 Switchboard evaluation set.