Factorized context modelling for Text-to-Speech synthesis
Because speech units are highly context-dependent, HMM-based Text-to-Speech (TTS) synthesis systems generally use a large number of linguistic context features via context-dependent models. Since it is impossible to train separate models for every context, decision trees are used to discover the most important combinations of features to model. The decision tree's task is very hard (generalizing from a very small observed part of the context feature space to the rest), and trees have a major weakness: they cannot directly exploit factorial properties, because they subdivide the model space one feature at a time. We propose a Dynamic Bayesian Network (DBN) based Mixed Memory Markov Model (MMMM) to provide factorization of the context space. The results of a listening test are provided as evidence that the model successfully learns the factorial nature of this space.
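The motivation for factorizing the context space can be seen in a parameter-count comparison. This is an illustrative sketch only (the factor names and cardinalities are invented, and the paper's MMMM is a probabilistic model, not a table count): a full joint over categorical context factors grows multiplicatively, while per-factor modelling grows additively.

```python
# Illustrative sketch (not the paper's model): compare the size of a full
# joint over categorical context factors with a factorized, per-factor
# representation -- the kind of saving a factorized context model exploits.
# Factor names and cardinalities below are assumptions for illustration.

from math import prod

factors = {"phone": 45, "stress": 3, "syllable_pos": 4, "pos_tag": 12}

full_joint = prod(factors.values())  # one entry per context combination
factorized = sum(factors.values())   # one small model per factor

print(full_joint)  # 6480 combinations in the full joint
print(factorized)  # 64 per-factor entries in the factorized view
```

A decision tree must carve up the 6480-way joint one feature split at a time, whereas a factorized model can share statistics across factors directly.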
A language space representation for speech recognition
© 2015 IEEE. The number of languages for which speech recognition systems have become available is growing each year. This paper proposes to view languages as points in some rich space, termed a language space, where the bases are eigen-languages and a particular selection of the projection determines points. Such an approach could not only reduce development costs for each new language but also provide automatic means for language analysis. For an initial proof of concept, this paper adopts cluster adaptive training (CAT), known for inducing similar spaces for speaker adaptation. The CAT approach used in this paper builds on previous work on language adaptation in speech synthesis and extends it to Gaussian mixture modelling, which is more appropriate for speech recognition. Experiments conducted on IARPA Babel program languages show that such language space representations can outperform language-independent models and discover closely related languages in an automatic way.
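The core of a CAT-style language space is linear: each model parameter vector is a weighted combination of eigen-language cluster parameters, and the weight vector itself is the language's "point" in the space. The following is a toy sketch with assumed dimensions, not the paper's actual model configuration.

```python
import numpy as np

# Toy sketch of a CAT-style language space (dimensions are assumptions):
# a Gaussian mean is formed as a weighted combination of eigen-language
# cluster means; the low-dimensional weight vector represents the language.

rng = np.random.default_rng(0)
D, K = 5, 3                        # feature dimension, number of eigen-languages
M = rng.standard_normal((D, K))    # columns: cluster (eigen-language) means

w_lang = np.array([0.7, 0.2, 0.1])  # a language as a point in the K-dim space
mu = M @ w_lang                     # language-specific Gaussian mean
```

Estimating only `w_lang` for a new language is far cheaper than training a full acoustic model, which is where the claimed reduction in development cost comes from.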
Composition of Deep and Spiking Neural Networks for Very Low Bit Rate Speech Coding
Most current very low bit rate (VLBR) speech coding systems use hidden Markov model (HMM) based speech recognition/synthesis techniques. This allows transmission of information (such as phonemes) segment by segment, which decreases the bit rate. However, an encoder based on phoneme speech recognition may create bursts of segmental errors, which are further propagated to optional suprasegmental (such as syllable) information coding. Together with voicing-detection errors in pitch parametrization, HMM-based speech coding creates speech discontinuities and unnatural sound artefacts.
In this paper, we propose a novel VLBR speech coding framework based on neural networks (NNs) for end-to-end speech analysis and synthesis without HMMs. The framework relies on a phonological (sub-phonetic) representation of speech and is designed as a composition of deep and spiking NNs: a bank of phonological analysers at the transmitter and a phonological synthesizer at the receiver, both realised as deep NNs, and a spiking NN as an incremental and robust encoder of syllable boundaries for coding of continuous fundamental frequency (F0). A combination of phonological features defines many more sound patterns than the phonetic features used by HMM-based speech coders, and this finer analysis/synthesis code contributes to smoother encoded speech. Listeners significantly prefer the NN-based approach because the encoded speech has fewer discontinuities and artefacts. A single forward pass is required during encoding and decoding. The proposed VLBR speech coding operates at a bit rate of approximately 360 bits/s - …
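To make the bit-rate figure concrete, here is a back-of-the-envelope sketch. All three numbers below are assumptions chosen for illustration, not the paper's reported configuration: a small set of binary phonological features transmitted at a low segmental rate, plus a modest budget for continuous F0, lands in the hundreds of bits per second.

```python
# Back-of-the-envelope bit-rate sketch (all numbers are assumptions,
# not the paper's exact configuration): binary phonological features
# at a segmental rate plus a small budget for continuous F0 coding.

phonological_bits = 13   # assumed binary phonological classes per segment
segments_per_sec = 20    # assumed segmental update rate
f0_bits_per_sec = 100    # assumed budget for F0 / syllable-boundary coding

bitrate = phonological_bits * segments_per_sec + f0_bits_per_sec
print(bitrate)  # 360 bits/s under these assumed numbers
```

This kind of arithmetic shows why segment-level symbolic coding reaches rates two orders of magnitude below conventional waveform or parametric coders.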