319 research outputs found
Speech vocoding for laboratory phonology
Using phonological speech vocoding, we propose a platform for exploring
relations between phonology and speech processing, and in broader terms, for
exploring relations between the abstract and physical structures of a speech
signal. Our goal is to make a step towards bridging phonology and speech
processing and to contribute to the program of Laboratory Phonology. We show
three application examples for laboratory phonology: compositional phonological
speech modelling, a comparison of phonological systems and an experimental
phonological parametric text-to-speech (TTS) system. The featural
representations of the following three phonological systems are considered in
this work: (i) Government Phonology (GP), (ii) the Sound Pattern of English
(SPE), and (iii) the extended SPE (eSPE). Comparing GP- and eSPE-based vocoded
speech, we conclude that the latter achieves slightly better results than the
former. However, GP - the most compact phonological speech representation -
performs comparably to the systems with a higher number of phonological
features. The parametric TTS based on phonological speech representation, and
trained from an unlabelled audiobook in an unsupervised manner, achieves 85% of
the intelligibility of state-of-the-art parametric speech synthesis. We
envision that the presented approach paves the way for researchers in both
fields to form meaningful hypotheses that are explicitly testable using the
concepts developed and exemplified in this paper. On the one hand, laboratory
phonologists might test the applied concepts of their theoretical models, and
on the other hand, the speech processing community may utilize the concepts
developed for the theoretical phonological models for improvements of the
current state-of-the-art applications.
Automatic Pronunciation Assessment -- A Review
Pronunciation assessment and its application in computer-aided pronunciation
training (CAPT) have seen impressive progress in recent years. With the rapid
growth in language processing and deep learning over the past few years, there
is a need for an updated review. In this paper, we review methods employed in
pronunciation assessment for both phonemic and prosodic aspects. We categorize
the main challenges observed in prominent research trends and highlight existing
limitations and available resources. This is followed by a discussion of the
remaining challenges and possible directions for future work.
Comment: 9 pages, accepted to EMNLP Findings
Model-based Parametric Prosody Synthesis with Deep Neural Network
Conventional statistical parametric speech synthesis (SPSS) captures only frame-wise acoustic observations and computes probability densities at the HMM state level to obtain statistical acoustic models combined with decision trees. It is therefore a purely statistical, data-driven approach that does not explicitly integrate any of the articulatory mechanisms found in speech production research. The present study explores an alternative paradigm, namely model-based parametric prosody synthesis (MPPS), which integrates dynamic mechanisms of human speech production as a core component of F0 generation. In this paradigm, contextual variations in prosody are processed in two separate yet integrated stages: linguistic to motor, and motor to acoustic. Here the motor model is target approximation (TA), which generates syllable-sized F0 contours with only three motor parameters that are associated with linguistic functions. In this study, we simulate this two-stage process by linking the TA model to a deep neural network (DNN), which learns the "linguistic-motor" mapping given the "motor-acoustic" mapping provided by TA-based syllable-wise F0 production. The proposed prosody modeling system outperforms the HMM-based baseline system in both objective and subjective evaluations.
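As an illustration of how a syllable-sized F0 contour can follow from three motor parameters, here is a minimal sketch. It assumes the commonly published quantitative form of target approximation (a third-order critically damped system approaching a linear pitch target); the abstract itself does not state the equations, so the formula, the function names `ta_f0` and `syllable_contour`, and the parameter names `m` (target slope), `b` (target height) and `lam` (approach strength) are assumptions made for this example, not the authors' implementation.

```python
import math


def ta_f0(t, m, b, lam, f0_0, df0_0=0.0, ddf0_0=0.0):
    """F0 at time t (seconds from syllable onset).

    Assumed model: f0(t) = (m*t + b) + (c1 + c2*t + c3*t^2) * exp(-lam*t),
    where m*t + b is the linear pitch target and the transient term is
    fixed by the F0 state (value, velocity, acceleration) at syllable onset.
    """
    c1 = f0_0 - b
    c2 = df0_0 + c1 * lam - m
    c3 = (ddf0_0 + 2.0 * c2 * lam - c1 * lam ** 2) / 2.0
    transient = (c1 + c2 * t + c3 * t * t) * math.exp(-lam * t)
    return (m * t + b) + transient


def syllable_contour(m, b, lam, f0_0, duration, n=50):
    """Sample n F0 values over one syllable of the given duration (seconds)."""
    step = duration / (n - 1)
    return [ta_f0(i * step, m, b, lam, f0_0) for i in range(n)]


# Hypothetical values: a static target at 100 (Hz or semitones, caller's
# choice), approached from an onset value of 90 over a 200 ms syllable.
contour = syllable_contour(m=0.0, b=100.0, lam=40.0, f0_0=90.0, duration=0.2)
```

In a pipeline like the one the abstract describes, a network would predict `m`, `b` and `lam` per syllable from linguistic features, and a function of this kind would turn them into the F0 trajectory; the final F0 state of one syllable would normally be passed on as the onset state of the next.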
Composition of Deep and Spiking Neural Networks for Very Low Bit Rate Speech Coding
Most current very low bit rate (VLBR) speech coding systems use hidden Markov
model (HMM) based speech recognition/synthesis techniques. This allows
transmission of information (such as phonemes) segment by segment, which
decreases the bit rate. However, an encoder based on phoneme speech
recognition may create bursts of segmental errors. Segmental errors are further
propagated to optional suprasegmental (such as syllable) information coding.
Together with the errors of voicing detection in pitch parametrization,
HMM-based speech coding creates speech discontinuities and unnatural speech
sound artefacts.
In this paper, we propose a novel VLBR speech coding framework based on
neural networks (NNs) for end-to-end speech analysis and synthesis without
HMMs. The speech coding framework relies on phonological (sub-phonetic)
representation of speech, and it is designed as a composition of deep and
spiking NNs: a bank of phonological analysers at the transmitter, and a
phonological synthesizer at the receiver, both realised as deep NNs, and a
spiking NN as an incremental and robust encoder of syllable boundaries for
coding of continuous fundamental frequency (F0). A combination of phonological
features defines many more sound patterns than the phonetic features defined by
HMM-based speech coders, and the finer analysis/synthesis code contributes to
smoother encoded speech. Listeners significantly prefer the NN-based approach
due to fewer discontinuities and speech artefacts of the encoded speech. A
single forward pass is required during speech encoding and decoding. The
proposed VLBR speech coding operates at a bit rate of approximately 360 bits/s.
Automatic sound law induction (Open problems in computational diversity linguistics 3)
This is the fourth of a series of 12 blog posts published in 2019, discussing open problems in computational diversity linguistics. It discusses the problem of automatic sound law induction.
Rhythmic unit extraction and modelling for automatic language identification
This paper deals with an approach to automatic language identification based on rhythmic modelling. Besides phonetics and phonotactics, rhythm is one of the most promising features to be considered for language identification, even if its extraction and modelling are not straightforward. One of the main problems to address is what to model. In this paper, a rhythm extraction algorithm is described: using a vowel detection algorithm, rhythmic units related to syllables are segmented. Several parameters are extracted (consonantal and vowel duration, cluster complexity) and modelled with a Gaussian mixture. Experiments are performed on read speech for 7 languages (English, French, German, Italian, Japanese, Mandarin and Spanish) and results reach up to 86 ± 6% correct discrimination between stress-timed, mora-timed and syllable-timed classes of languages, and 67 ± 8% correct language identification on average for the 7 languages with utterances of 21 seconds. These results are discussed and compared with those obtained with a standard acoustic Gaussian mixture modelling approach (88 ± 5% correct identification for the 7-language identification task).
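The parameter extraction step named above (consonantal duration, vowel duration, cluster complexity per rhythmic unit) can be sketched roughly as follows. This is only an illustration under stated assumptions: the input is assumed to be an already segmented sequence of `("C" | "V", duration)` pairs, each rhythmic unit is assumed to be a run of consonants closed by one vowel, and the names `pseudo_syllables`, `dc`, `dv` and `nc` are hypothetical. The abstract does not specify the exact unit definition or the vowel detector.

```python
VOWEL = "V"
CONSONANT = "C"


def pseudo_syllables(segments):
    """Group (label, duration) segments into consonant-cluster + vowel units.

    Returns one dict per unit with:
      dc: total consonantal duration (seconds)
      dv: vowel duration (seconds, 0.0 if the unit has no vowel)
      nc: number of consonant segments (cluster complexity)
    """
    units = []
    cluster = []  # durations of consonants waiting for their vowel
    for label, duration in segments:
        if label == CONSONANT:
            cluster.append(duration)
        elif label == VOWEL:
            units.append({"dc": sum(cluster), "dv": duration, "nc": len(cluster)})
            cluster = []
        else:
            raise ValueError(f"unknown segment label: {label!r}")
    if cluster:
        # Trailing consonants with no following vowel form a final unit.
        units.append({"dc": sum(cluster), "dv": 0.0, "nc": len(cluster)})
    return units


# Hypothetical segmentation of a short utterance: CCV, CV, then a final C.
segments = [("C", 0.05), ("C", 0.04), ("V", 0.10),
            ("C", 0.06), ("V", 0.12), ("C", 0.03)]
features = [(u["dc"], u["dv"], u["nc"]) for u in pseudo_syllables(segments)]
```

Each `(dc, dv, nc)` triple is one observation; per-language Gaussian mixtures could then be fitted to such observations and a test utterance assigned to the language whose mixture gives the highest likelihood.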