Speech vocoding for laboratory phonology
Using phonological speech vocoding, we propose a platform for exploring
relations between phonology and speech processing, and in broader terms, for
exploring relations between the abstract and physical structures of a speech
signal. Our goal is to make a step towards bridging phonology and speech
processing and to contribute to the program of Laboratory Phonology. We show
three application examples for laboratory phonology: compositional phonological
speech modelling, a comparison of phonological systems and an experimental
phonological parametric text-to-speech (TTS) system. The featural
representations of the following three phonological systems are considered in
this work: (i) Government Phonology (GP), (ii) the Sound Pattern of English
(SPE), and (iii) the extended SPE (eSPE). Comparing GP- and eSPE-based vocoded
speech, we conclude that the latter achieves slightly better results than the
former. However, GP - the most compact phonological speech representation -
performs comparably to the systems with a higher number of phonological
features. The parametric TTS based on phonological speech representation, and
trained from an unlabelled audiobook in an unsupervised manner, achieves
intelligibility of 85% of the state-of-the-art parametric speech synthesis. We
envision that the presented approach paves the way for researchers in both
fields to form meaningful hypotheses that are explicitly testable using the
concepts developed and exemplified in this paper. On the one hand, laboratory
phonologists might test the applied concepts of their theoretical models, and
on the other hand, the speech processing community may utilize the concepts
developed for the theoretical phonological models for improvements of the
current state-of-the-art applications.
Probabilistic Amplitude Demodulation features in Speech Synthesis for Improving Prosody
Amplitude demodulation (AM) is a signal decomposition technique by which a signal is decomposed into a product of two signals, i.e., a quickly varying carrier and a slowly varying modulator. In this work, probabilistic amplitude demodulation (PAD) features are used to improve prosody in speech synthesis. PAD is applied iteratively to generate syllable and stress amplitude modulations in a cascade manner. The PAD features are used as a secondary input scheme along with the standard text-based input features in statistical parametric speech synthesis. Specifically, deep neural network (DNN)-based speech synthesis is used to evaluate the importance of these features. Objective evaluation has shown that the proposed system using the PAD features mainly improves prosody modelling; it outperforms the baseline system by approximately 5% in terms of relative reduction in root mean square error (RMSE) of the fundamental frequency (F0). The significance of this improvement is validated by subjective evaluation of the overall speech quality in an ABX test, achieving a 38.6% preference score against 19.5% for the baseline system.
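The carrier/modulator decomposition described in the abstract above can be illustrated with a minimal sketch. Note that this is not PAD itself: it assumes a much simpler rectify-and-smooth envelope detector, and the function names (`synth_am`, `envelope`, `demodulate`), the test signal, and the window length are all hypothetical choices made for illustration.

```python
import math

def synth_am(n, fs, f_carrier, f_mod, depth=0.5):
    """Build a toy signal x = modulator * carrier (illustrative values only)."""
    mod = [1.0 + depth * math.sin(2 * math.pi * f_mod * i / fs) for i in range(n)]
    car = [math.sin(2 * math.pi * f_carrier * i / fs) for i in range(n)]
    return [m * c for m, c in zip(mod, car)], mod

def envelope(signal, window):
    """Estimate the slow modulator: rectify, then take a moving average.

    The factor pi/2 undoes the mean of |sin|, which is 2/pi, so the
    estimate is on the same scale as the modulator of a sinusoidal carrier.
    """
    n = len(signal)
    half = window // 2
    rectified = [abs(s) for s in signal]
    env = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i - half + window)
        env.append(sum(rectified[lo:hi]) / (hi - lo) * math.pi / 2.0)
    return env

def demodulate(signal, window, floor=1e-8):
    """Split a signal into (modulator, carrier) with signal = modulator * carrier."""
    env = envelope(signal, window)
    carrier = [s / max(e, floor) for s, e in zip(signal, env)]
    return env, carrier

# Usage: a 200 Hz carrier modulated at 4 Hz, window of one carrier period.
x, true_mod = synth_am(n=8000, fs=8000, f_carrier=200.0, f_mod=4.0)
est_mod, est_carrier = demodulate(x, window=40)
```

A cascade in the spirit of the abstract would apply `demodulate` again to `est_mod` with a longer window to obtain a slower modulation tier; how PAD actually assigns tiers to syllable and stress is not specified here.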
Composition of Deep and Spiking Neural Networks for Very Low Bit Rate Speech Coding
Most current very low bit rate (VLBR) speech coding systems use hidden Markov
model (HMM) based speech recognition/synthesis techniques. This allows
transmission of information (such as phonemes) segment by segment, which
decreases the bit rate. However, an encoder based on phoneme speech
recognition may create bursts of segmental errors. Segmental errors are further
propagated to optional suprasegmental (such as syllable) information coding.
Together with the errors of voicing detection in pitch parametrization,
HMM-based speech coding creates speech discontinuities and unnatural speech
sound artefacts.
In this paper, we propose a novel VLBR speech coding framework based on
neural networks (NNs) for end-to-end speech analysis and synthesis without
HMMs. The speech coding framework relies on phonological (sub-phonetic)
representation of speech, and it is designed as a composition of deep and
spiking NNs: a bank of phonological analysers at the transmitter, and a
phonological synthesizer at the receiver, both realised as deep NNs, and a
spiking NN as an incremental and robust encoder of syllable boundaries for
coding of continuous fundamental frequency (F0). A combination of phonological
features defines many more sound patterns than the phonetic features defined by
HMM-based speech coders, and the finer analysis/synthesis code contributes to
smoother encoded speech. Listeners significantly prefer the NN-based approach
due to fewer discontinuities and speech artefacts of the encoded speech. A
single forward pass is required during the speech encoding and decoding. The
proposed VLBR speech coding operates at a bit rate of approximately 360 bits/s.
Stress and Accent Transmission In HMM-Based Syllable-Context Very Low Bit Rate Speech Coding
In this paper, we propose a solution to reconstruct stress and accent contextual factors at the receiver of a very low bit rate speech codec built on a recognition/synthesis architecture. In speech synthesis, accent and stress symbols are predicted from the text, which is not available at the receiver side of the speech codec. Therefore, speech signal-based symbols, generated as syllable-level log average F0 and energy acoustic measures and quantized using scalar quantization, are used instead of accentual and stress symbols for HMM-based speech synthesis. Results from incremental real-time speech synthesis confirmed that a combination of F0 and energy signal-based symbols can replace their text-based binary accent and stress counterparts developed for text-to-speech systems. The estimated transmission bit-rate overhead is about 14 bits/second per acoustic measure.
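The signal-based symbols described in the abstract above can be sketched roughly as follows: average the log F0 over a syllable, map it to a symbol with a uniform scalar quantizer, and estimate the overhead as bits per symbol times syllable rate. Everything specific here is an assumption for illustration: the function names, the uniform quantizer design, the F0 range, the bit depth, and the syllable rate are hypothetical and are not taken from the paper.

```python
import math

def syllable_log_f0(frame_f0_hz):
    """Mean log F0 over the voiced frames (F0 > 0) of one syllable.

    Returns None when the syllable has no voiced frames.
    """
    voiced = [math.log(f) for f in frame_f0_hz if f > 0]
    return sum(voiced) / len(voiced) if voiced else None

def quantize(value, lo, hi, bits):
    """Uniform scalar quantizer: map value in [lo, hi] to a symbol 0..2**bits - 1."""
    levels = 2 ** bits
    clipped = min(max(value, lo), hi)
    idx = int((clipped - lo) / (hi - lo) * levels)
    return min(idx, levels - 1)  # value == hi falls into the top cell

def overhead_bits_per_second(bits_per_symbol, syllables_per_second):
    """Transmission overhead of one acoustic measure sent once per syllable."""
    return bits_per_symbol * syllables_per_second

# Usage with hypothetical settings: 3-bit symbols over an assumed 60-400 Hz range.
lo, hi = math.log(60.0), math.log(400.0)
frames = [0.0, 118.0, 121.0, 125.0, 0.0]  # 0.0 marks an unvoiced frame
symbol = quantize(syllable_log_f0(frames), lo, hi, bits=3)
rate = overhead_bits_per_second(bits_per_symbol=3, syllables_per_second=4.0)
```

The same quantizer could be applied to a syllable-level energy measure, giving one symbol stream per acoustic measure.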
The SIWIS French Speech Synthesis Database - Design and recording of a high quality French database for speech synthesis
We describe the design and recording of a high quality French speech corpus, aimed at building TTS systems and investigating multiple styles and emphasis. The data was recorded by a French voice talent and contains about ten hours of speech, including emphasised words in many different contexts. The database is freely available.
Prosody in Swiss French Accents: Investigation using Analysis by Synthesis
It is very common for a language to have different dialects or accents. Different pronunciations of the same words are one of the reasons for different accents within the same language. Swiss French accents have similar pronunciation to standard French, but noticeable differences in prosody. In this paper we investigate the use of standard French synthetic acoustic parameters combined with Swiss French prosody in order to evaluate the importance of prosody in modelling Swiss French accents. We use speech synthesis techniques to produce standard French pronunciation with Swiss French duration and intonation. A subjective evaluation rating the degree of Swiss accent showed that prosody modification alone reduces the perceived difference between original Swiss-accented speech and standard French coupled with original duration and intonation by 29%.