Speaker adaptation and the evaluation of speaker similarity in the EMIME speech-to-speech translation project
This paper provides an overview of speaker adaptation research carried out in the EMIME speech-to-speech translation (S2ST) project. We focus on how speaker adaptation transforms can be learned from speech in one language and applied to the acoustic models of another language. The adaptation is transferred across languages and/or from recognition models to synthesis models. The various approaches investigated can all be viewed as a process in which a mapping is defined in terms of either acoustic model states or linguistic units. The mapping is used to transfer either speech data or adaptation transforms between the two models. Because the success of speaker adaptation in text-to-speech synthesis is measured by judging speaker similarity, we also discuss issues concerning the evaluation of speaker similarity in an S2ST scenario.
Probabilistic Amplitude Demodulation features in Speech Synthesis for Improving Prosody
Abstract Amplitude demodulation (AM) is a signal decomposition technique by which a signal can be decomposed into a product of two signals, i.e., a quickly varying carrier and a slowly varying modulator. In this work, probabilistic amplitude demodulation (PAD) features are used to improve prosody in speech synthesis. PAD is applied iteratively to generate syllable and stress amplitude modulations in a cascaded manner. The PAD features are used as a secondary input scheme alongside the standard text-based input features in statistical parametric speech synthesis. Specifically, deep neural network (DNN)-based speech synthesis is used to evaluate the importance of these features. Objective evaluation has shown that the proposed system using the PAD features mainly improves prosody modelling; it outperforms the baseline system by approximately 5% in terms of relative reduction in root mean square error (RMSE) of the fundamental frequency (F0). The significance of this improvement is validated by subjective evaluation of overall speech quality, achieving a preference score of 38.6% versus 19.5% for the baseline system in an ABX test.
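The carrier/modulator decomposition the abstract describes can be illustrated with a simple deterministic version in NumPy; this is only a stand-in for the probabilistic (PAD) formulation, and the window length is an arbitrary assumption:

```python
import numpy as np

def simple_amplitude_demodulation(x, win=64):
    """Split x into a slowly varying modulator and a quickly varying carrier,
    so that x == modulator * carrier. A deterministic stand-in for PAD."""
    kernel = np.ones(win) / win
    # Slowly varying modulator: moving average of the rectified signal
    modulator = np.maximum(np.convolve(np.abs(x), kernel, mode="same"), 1e-8)
    carrier = x / modulator  # quickly varying carrier
    return modulator, carrier

# Example: a 200 Hz tone amplitude-modulated at 3 Hz
t = np.linspace(0.0, 1.0, 8000)
x = (1.0 + 0.5 * np.sin(2 * np.pi * 3 * t)) * np.sin(2 * np.pi * 200 * t)
modulator, carrier = simple_amplitude_demodulation(x)
```

Applying such a decomposition iteratively, with progressively longer windows, would give the cascade of syllable- and stress-rate modulators the abstract refers to.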
Stress and Accent Transmission In HMM-Based Syllable-Context Very Low Bit Rate Speech Coding
Abstract In this paper, we propose a solution for reconstructing stress and accent contextual factors at the receiver of a very low bit-rate speech codec built on a recognition/synthesis architecture. In speech synthesis, accent and stress symbols are predicted from text, which is not available at the receiver side of the speech codec. Therefore, speech signal-based symbols, generated as syllable-level log-average F0 and energy acoustic measures and quantized using scalar quantization, are used instead of accent and stress symbols for HMM-based speech synthesis. Results from incremental real-time speech synthesis confirmed that a combination of F0 and energy signal-based symbols can replace their counterparts, the text-based binary accent and stress symbols developed for text-to-speech systems. The estimated transmission bit-rate overhead is about 14 bits/second per acoustic measure.
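The signal-based symbol generation can be sketched as follows; the 3-bit quantizer and the log-F0 range are hypothetical choices for illustration, not the paper's actual design:

```python
import numpy as np

def syllable_symbols(f0, syllable_bounds, n_bits=3, lo=3.9, hi=6.2):
    """Per-syllable log-mean F0, scalar-quantized to 2**n_bits levels.
    lo/hi are assumed log-Hz bounds of the quantizer range."""
    levels = 2 ** n_bits
    symbols = []
    for start, end in syllable_bounds:
        seg = f0[start:end]
        voiced = seg[seg > 0]  # ignore unvoiced (zero-F0) frames
        log_f0 = np.log(voiced).mean() if voiced.size else lo
        # Uniform scalar quantization of the log-mean F0
        q = int(np.clip((log_f0 - lo) / (hi - lo) * levels, 0, levels - 1))
        symbols.append(q)
    return symbols

# Example: two syllables of steady 120 Hz voicing
f0 = np.full(100, 120.0)
symbols = syllable_symbols(f0, [(0, 50), (50, 100)])
```

At a typical rate of four to five syllables per second, 3 bits per syllable gives roughly 12-15 bits/s, of the same order as the ~14 bits/s overhead per acoustic measure reported above.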
Speech vocoding for laboratory phonology
Using phonological speech vocoding, we propose a platform for exploring
relations between phonology and speech processing, and in broader terms, for
exploring relations between the abstract and physical structures of a speech
signal. Our goal is to make a step towards bridging phonology and speech
processing and to contribute to the program of Laboratory Phonology. We show
three application examples for laboratory phonology: compositional phonological
speech modelling, a comparison of phonological systems and an experimental
phonological parametric text-to-speech (TTS) system. The featural
representations of the following three phonological systems are considered in
this work: (i) Government Phonology (GP), (ii) the Sound Pattern of English
(SPE), and (iii) the extended SPE (eSPE). Comparing GP- and eSPE-based vocoded
speech, we conclude that the latter achieves slightly better results than the
former. However, GP, the most compact phonological speech representation,
performs comparably to the systems with a higher number of phonological
features. The parametric TTS system based on the phonological speech
representation, trained from an unlabelled audiobook in an unsupervised manner,
achieves 85% of the intelligibility of state-of-the-art parametric speech
synthesis. We
envision that the presented approach paves the way for researchers in both
fields to form meaningful hypotheses that are explicitly testable using the
concepts developed and exemplified in this paper. On the one hand, laboratory
phonologists might test the applied concepts of their theoretical models, and
on the other hand, the speech processing community may utilize the concepts
developed for the theoretical phonological models for improvements of the
current state-of-the-art applications.
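The featural representations mentioned above amount to binary feature vectors per phone. The following toy table is purely illustrative; the feature names and values are placeholders and do not reproduce GP, SPE, or eSPE:

```python
# Toy SPE-style binary feature table (values illustrative, not from the paper)
#             voiced nasal continuant high back
SPE_FEATURES = {
    "p": (0, 0, 0, 0, 0),
    "b": (1, 0, 0, 0, 0),
    "m": (1, 1, 0, 0, 0),
    "i": (1, 0, 1, 1, 0),
    "u": (1, 0, 1, 1, 1),
}

def featural_encoding(phones):
    """Map a phone sequence to its sequence of binary feature vectors."""
    return [SPE_FEATURES[p] for p in phones]

encoded = featural_encoding(["b", "i"])
```

A phonological vocoder in this spirit operates on trajectories of such feature values rather than on phone identities directly.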
Progress report of a project in very low bit-rate speech coding
Background work on various levels of speech coding is reviewed, including unconstrained coding and recognition-synthesis approaches that assume the signal is speech. A pilot project in HMM-TTS-based speech coding is then described, which also includes a comparison with harmonic-plus-noise modelling. Results of the demonstration project, including samples of speech under various transmission situations, are presented on an accompanying web page. The report concludes by describing and enumerating the shortcomings of the demonstration system that define directions for future work. This work is a deliverable for the armasuisse-funded project "RECOD - Low bit-rate speech coding".
The perception of voice quality by bilingual Brazilians
Studies show that bilingual speakers may alter their voice when speaking an L2 compared with their voice in the L1. It remains to be seen whether listeners perceive these differences. The present study therefore addresses the perception of voice quality by bilingual Brazilian listeners in utterances in Brazilian Portuguese (BP) and in English (EN) produced by speakers who are likewise Brazilian and bilingual. This goal was pursued through a discrimination test on the BP and EN voices of the same speaker. Listeners judged whether the voices were the same or different, describing their characteristics when they were different. The results indicated some variability in the judgments, but also showed that listeners are able to identify pitch and intensity differences between the languages, as well as to attribute personality and emotional characteristics to the speech in both languages.
Composition of Deep and Spiking Neural Networks for Very Low Bit Rate Speech Coding
Most current very low bit rate (VLBR) speech coding systems use hidden Markov
model (HMM) based speech recognition/synthesis techniques. This allows
transmission of information (such as phonemes) segment by segment that
decreases the bit rate. However, an encoder based on phoneme speech
recognition may create bursts of segmental errors. These segmental errors
further propagate to optional suprasegmental (such as syllable) information coding.
Together with the errors of voicing detection in pitch parametrization,
HMM-based speech coding creates speech discontinuities and unnatural speech
sound artefacts.
In this paper, we propose a novel VLBR speech coding framework based on
neural networks (NNs) for end-to-end speech analysis and synthesis without
HMMs. The speech coding framework relies on phonological (sub-phonetic)
representation of speech, and it is designed as a composition of deep and
spiking NNs: a bank of phonological analysers at the transmitter, and a
phonological synthesizer at the receiver, both realised as deep NNs, and a
spiking NN as an incremental and robust encoder of syllable boundaries for
coding of continuous fundamental frequency (F0). A combination of phonological
features defines many more sound patterns than the phonetic features used by
HMM-based speech coders, and this finer analysis/synthesis code contributes to
smoother encoded speech. Listeners significantly prefer the NN-based approach
due to fewer discontinuities and speech artefacts of the encoded speech. A
single forward pass is required during the speech encoding and decoding. The
proposed VLBR speech coding operates at a bit rate of approximately 360 bits/s.
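A single forward pass through such an analyser bank can be sketched with one linear layer followed by sigmoids; the real system uses deep NNs, so the shapes and weights below are placeholder assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def phonological_analyser_bank(frames, weights, biases):
    """Map acoustic frames (T, D) to per-class posteriors (T, K) in (0, 1).
    A one-layer stand-in for the deep NN analysers at the transmitter."""
    return sigmoid(frames @ weights + biases)

# Placeholder dimensions: 5 frames, 4 acoustic dims, 3 phonological classes
rng = np.random.default_rng(0)
frames = rng.normal(size=(5, 4))
weights = rng.normal(size=(4, 3))
posteriors = phonological_analyser_bank(frames, weights, np.zeros(3))
```

Each column of the output corresponds to one phonological class; the receiver-side synthesizer would map these posterior trajectories back to speech parameters in a single pass, as the abstract notes.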
Current trends in multilingual speech processing
In this paper, we describe recent work at Idiap Research Institute in the domain of multilingual speech processing and provide some insights into emerging challenges for the research community. Multilingual speech processing has been a topic of ongoing interest to the research community for many years, and the field is now receiving renewed interest owing to two strong driving forces. Firstly, technical advances in speech recognition and synthesis are posing new challenges and opportunities to researchers. For example, discriminative features are seeing wide application by the speech recognition community, but additional issues arise when using such features in a multilingual setting. Another example is the apparent convergence of speech recognition and speech synthesis technologies in the form of statistical parametric methodologies. This convergence enables the investigation of new approaches to unified modelling for automatic speech recognition and text-to-speech synthesis (TTS), as well as cross-lingual speaker adaptation for TTS. The second driving force is the impetus being provided by both government and industry for technologies to help break down domestic and international language barriers, these also being barriers to the expansion of policy and commerce. Speech-to-speech and speech-to-text translation are thus emerging as key technologies, at the heart of which lies multilingual speech processing.