Towards Automatic Speech Identification from Vocal Tract Shape Dynamics in Real-time MRI
Vocal tract configurations play a vital role in generating distinguishable
speech sounds, by modulating the airflow and creating different resonant
cavities in speech production. They contain abundant information that can be
utilized to better understand the underlying speech production mechanism. As a
step towards automatic mapping of vocal tract shape geometry to acoustics, this
paper employs effective video action recognition techniques, such as Long-term
Recurrent Convolutional Network (LRCN) models, to identify different
vowel-consonant-vowel (VCV) sequences from the dynamic shaping of the vocal tract.
Such a model typically combines a CNN-based deep hierarchical visual feature
extractor with recurrent networks, which ideally makes the network
spatio-temporally deep enough to learn the sequential dynamics of a short video
clip for video classification tasks. We use a database consisting of 2D
real-time MRI of vocal tract shaping during VCV utterances by 17 speakers. The
comparative performances of this class of algorithms under various parameter
settings and for various classification tasks are discussed. Interestingly, the
results show a marked difference in the model performance in the context of
speech classification with respect to generic sequence or video classification
tasks. Comment: To appear in the INTERSPEECH 2018 Proceedings
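
The CNN-plus-recurrent structure described in this abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a single convolution layer with global average pooling as the per-frame feature extractor, and it swaps in a plain tanh recurrent cell where LRCN models normally use an LSTM. All names (`frame_features`, `lrcn_forward`), the layer sizes, and the random stand-in clip are hypothetical and chosen only for illustration.

```python
import numpy as np

def conv2d_valid(frame, kernel):
    """Naive 'valid' 2D cross-correlation of one frame with one kernel."""
    kh, kw = kernel.shape
    h, w = frame.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(frame[i:i + kh, j:j + kw] * kernel)
    return out

def frame_features(frame, kernels):
    """Per-frame visual features: convolution, ReLU, global average pooling."""
    return np.array([np.maximum(conv2d_valid(frame, k), 0.0).mean() for k in kernels])

def lrcn_forward(clip, kernels, W_xh, W_hh, W_hy):
    """Classify a clip of shape (T, H, W).

    Each frame goes through the convolutional extractor; a simple tanh
    recurrent cell (an assumption, standing in for an LSTM) carries state
    across frames; the final state is mapped to class probabilities.
    """
    h = np.zeros(W_hh.shape[0])
    for frame in clip:
        x = frame_features(frame, kernels)
        h = np.tanh(W_xh @ x + W_hh @ h)
    logits = W_hy @ h
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

# Hypothetical sizes and random stand-in data (not real MRI frames).
rng = np.random.default_rng(0)
n_kernels, hidden, n_classes = 4, 8, 5
kernels = rng.normal(size=(n_kernels, 3, 3))
W_xh = rng.normal(scale=0.5, size=(hidden, n_kernels))
W_hh = rng.normal(scale=0.5, size=(hidden, hidden))
W_hy = rng.normal(scale=0.5, size=(n_classes, hidden))
clip = rng.random(size=(6, 16, 16))  # 6 frames of 16x16 pixels

probs = lrcn_forward(clip, kernels, W_xh, W_hh, W_hy)
```

With untrained random weights the probabilities carry no meaning; the sketch only shows how spatial features per frame feed a recurrence over time before a single clip-level prediction.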
Structured Sequence Modeling with Graph Convolutional Recurrent Networks
This paper introduces the Graph Convolutional Recurrent Network (GCRN), a deep
learning model able to predict structured sequences of data. Precisely, GCRN is
a generalization of classical recurrent neural networks (RNN) to data
structured by an arbitrary graph. Such structured sequences can represent
series of frames in videos, spatio-temporal measurements on a network of
sensors, or random walks on a vocabulary graph for natural language modeling.
The proposed model combines convolutional neural networks (CNN) on graphs to
identify spatial structures and RNN to find dynamic patterns. We study two
possible architectures of GCRN, and apply the models to two practical problems:
predicting moving MNIST data, and modeling natural language with the Penn
Treebank dataset. Experiments show that simultaneously exploiting graph spatial
and dynamic information about data can improve both precision and learning
speed.
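
The idea of pairing a graph convolution with a recurrent update can be sketched as follows. This is a simplified illustration, not either of the two architectures studied in the paper. It assumes a first-order neighbourhood aggregation with a symmetrically normalized adjacency matrix as the graph convolution, and a plain tanh recurrence instead of a gated cell. The function names, the ring graph, and all sizes are hypothetical.

```python
import numpy as np

def normalized_adjacency(A):
    """Return D^{-1/2} (A + I) D^{-1/2}, a common graph-convolution operator."""
    A_tilde = A + np.eye(A.shape[0])           # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcrn_step(A_hat, x_t, h_prev, W_x, W_h):
    """One recurrent step: graph-convolve the input and the previous state."""
    return np.tanh(A_hat @ x_t @ W_x + A_hat @ h_prev @ W_h)

def gcrn_forward(A, xs, W_x, W_h):
    """Run the recurrence over a sequence xs of shape (T, n_nodes, n_features)."""
    A_hat = normalized_adjacency(A)
    h = np.zeros((A.shape[0], W_h.shape[0]))
    for x_t in xs:
        h = gcrn_step(A_hat, x_t, h, W_x, W_h)
    return h

# Hypothetical example: a 5-node ring graph with random node signals.
n_nodes, n_features, hidden, T = 5, 3, 6, 4
eye = np.eye(n_nodes)
A = np.roll(eye, 1, axis=1) + np.roll(eye, -1, axis=1)  # each node linked to 2 neighbours

rng = np.random.default_rng(0)
xs = rng.normal(size=(T, n_nodes, n_features))
W_x = rng.normal(scale=0.5, size=(n_features, hidden))
W_h = rng.normal(scale=0.5, size=(hidden, hidden))

h_final = gcrn_forward(A, xs, W_x, W_h)  # one hidden vector per node
```

The spatial structure enters only through `A_hat`, and the temporal structure only through the loop over `xs`, which mirrors the abstract's split between finding spatial structures and finding dynamic patterns.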