A comparative study of estimating articulatory movements from phoneme sequences and acoustic features
Unlike phoneme sequences, movements of speech articulators (lips, tongue,
jaw, velum) and the resultant acoustic signal are known to encode not only the
linguistic message but also carry para-linguistic information. While several
works exist for estimating articulatory movement from acoustic signals, little
is known about the extent to which articulatory movements can be predicted from
linguistic information alone, i.e., the phoneme sequence. In this work, we estimate
articulatory movements from three different input representations: R1) acoustic
signal, R2) phoneme sequence, R3) phoneme sequence with timing information.
While an attention network is used for estimating articulatory movement in the
case of R2, a BLSTM network is used for R1 and R3. Experiments with ten subjects'
acoustic-articulatory data reveal that the estimation techniques achieve an
average correlation coefficient of 0.85, 0.81, and 0.81 in the case of R1, R2,
and R3 respectively. This indicates that the attention network, although it uses
only the phoneme sequence (R2) without any timing information, yields an estimation
performance similar to that obtained with the rich acoustic signal (R1), suggesting
that articulatory motion is primarily driven by the linguistic message. The
correlation coefficient is further improved to 0.88 when R1 and R3 are used
together for estimating articulatory movements.
Comment: 5 pages, 5 figures, accepted in ICASSP 202
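The abstract above reports results as an average correlation coefficient between estimated and measured articulatory trajectories. The following is a minimal sketch of how such a score could be computed, assuming a Pearson correlation per articulator channel averaged across channels. The function names `pearson_cc` and `average_cc` and the toy trajectories are illustrative assumptions, not taken from the paper, which may average over subjects and utterances differently.

```python
import math


def pearson_cc(pred, true):
    """Pearson correlation between one predicted and one measured trajectory."""
    n = len(true)
    mean_pred = sum(pred) / n
    mean_true = sum(true) / n
    cov = sum((p - mean_pred) * (t - mean_true) for p, t in zip(pred, true))
    var_pred = sum((p - mean_pred) ** 2 for p in pred)
    var_true = sum((t - mean_true) ** 2 for t in true)
    return cov / math.sqrt(var_pred * var_true)


def average_cc(pred_channels, true_channels):
    """Mean of the per-channel correlations (assumed averaging scheme)."""
    scores = [pearson_cc(p, t) for p, t in zip(pred_channels, true_channels)]
    return sum(scores) / len(scores)


# Toy data (hypothetical): two articulator channels, four frames each.
true_channels = [[0.0, 1.0, 2.0, 3.0], [1.0, 0.0, 1.0, 0.0]]
pred_channels = [[0.1, 0.9, 2.2, 2.8], [0.8, 0.2, 0.9, 0.1]]

print(average_cc(pred_channels, true_channels))
```

A score near 1 means the estimated trajectory follows the shape of the measured one closely. Correlation ignores scale and offset, which is one reason it is often reported alongside RMSE.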
Articulatory-WaveNet: Autoregressive Model For Acoustic-to-Articulatory Inversion
This paper presents Articulatory-WaveNet, a new approach for
acoustic-to-articulatory inversion. The proposed system uses the WaveNet speech
synthesis architecture, with dilated causal convolutional layers using previous
values of the predicted articulatory trajectories conditioned on acoustic
features. The system was trained and evaluated on the ElectroMagnetic
Articulography corpus of Mandarin Accented English (EMA-MAE), consisting of 39
speakers including both native English speakers and native Mandarin speakers
speaking English. Results show significant improvement in both correlation and
RMSE between the generated and true articulatory trajectories for the new
method, with an average correlation of 0.83, representing a 36% relative
improvement over the 0.61 correlation obtained with a baseline Hidden Markov
Model (HMM)-Gaussian Mixture Model (GMM) inversion framework. To the best of
our knowledge, this paper presents the first application of a point-by-point
waveform synthesis approach to the problem of acoustic-to-articulatory
inversion, and the results show improved performance compared to previous
methods for speaker-dependent acoustic-to-articulatory inversion.
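The dilated causal convolution mentioned in the abstract above can be illustrated with a small sketch. It shows only the basic operation, in which each output sample depends on the current and earlier input samples spaced `dilation` steps apart. It is not the Articulatory-WaveNet model itself, which also includes the autoregressive feedback and acoustic conditioning the abstract describes. The function names, kernel weights, and dilation schedule are assumptions chosen for illustration.

```python
def dilated_causal_conv(x, weights, dilation):
    """Compute y[t] = sum_k weights[k] * x[t - k * dilation].

    Samples before the start of the sequence are treated as zero,
    so the output never depends on future inputs.
    """
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        y.append(acc)
    return y


def receptive_field(kernel_size, dilations):
    """Number of input samples seen by a stack of dilated causal layers."""
    return 1 + (kernel_size - 1) * sum(dilations)


# Hypothetical example: kernel of size 2 applied with dilation 2.
signal = [1.0, 2.0, 3.0, 4.0]
print(dilated_causal_conv(signal, [1.0, 1.0], dilation=2))

# Doubling the dilation per layer grows the receptive field exponentially.
print(receptive_field(kernel_size=2, dilations=[1, 2, 4, 8]))
```

Stacking layers with doubling dilations lets a network cover a long context with few layers, which suits slowly varying signals such as articulatory trajectories.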