7 research outputs found
The Mirrornet : Learning Audio Synthesizer Controls Inspired by Sensorimotor Interaction
Experiments to understand the sensorimotor neural interactions in the human
cortical speech system support the existence of a bidirectional flow of
interactions between the auditory and motor regions. Their key function is to
enable the brain to "learn" how to control the vocal tract for speech
production. This idea is the impetus for the recently proposed "MirrorNet", a
constrained autoencoder architecture. In this paper, the MirrorNet is applied
to learn, in an unsupervised manner, the controls of a specific audio
synthesizer (DIVA) to produce melodies only from their auditory spectrograms.
The results demonstrate how the MirrorNet discovers the synthesizer parameters
needed to generate melodies that closely resemble the originals, including
unseen melodies, and even determines the best set of parameters to approximate
renditions of complex piano melodies generated by a different synthesizer. This
generalizability of the MirrorNet illustrates its potential to discover, from
sensory data, the controls of arbitrary motor plants.
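For readers who want a concrete picture of the constrained-autoencoder idea, the following is a minimal PyTorch sketch, not the authors' implementation: an encoder maps an auditory spectrogram to a vector of synthesizer controls, and a learned forward model stands in for the (non-differentiable) DIVA synthesizer so the loop can be trained end to end. All layer sizes and the control dimensionality are assumptions.

```python
# Minimal sketch of a MirrorNet-style constrained autoencoder (assumed shapes/sizes).
# The encoder maps an auditory spectrogram to synthesizer controls; a learned
# forward model approximates the synthesizer so the loop is differentiable.
import torch
import torch.nn as nn

N_MELS, N_FRAMES, N_PARAMS = 128, 256, 32  # hypothetical dimensions

class Encoder(nn.Module):  # spectrogram -> synthesizer controls
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, N_PARAMS)

    def forward(self, spec):                       # spec: (B, 1, N_MELS, N_FRAMES)
        return torch.sigmoid(self.fc(self.conv(spec).flatten(1)))  # controls in [0, 1]

class ForwardModel(nn.Module):  # controls -> reconstructed spectrogram (stand-in for DIVA)
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_PARAMS, 512), nn.ReLU(),
            nn.Linear(512, N_MELS * N_FRAMES),
        )

    def forward(self, params):
        return self.net(params).view(-1, 1, N_MELS, N_FRAMES)

encoder, forward_model = Encoder(), ForwardModel()
spec = torch.randn(8, 1, N_MELS, N_FRAMES)          # batch of auditory spectrograms
recon = forward_model(encoder(spec))
loss = nn.functional.mse_loss(recon, spec)          # train both ends to close the loop
loss.backward()
```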
Improving Speech Inversion Through Self-Supervised Embeddings and Enhanced Tract Variables
The performance of deep learning models depends significantly on their
capacity to encode input features efficiently and decode them into meaningful
outputs. Better input and output representation has the potential to boost
models' performance and generalization. In the context of
acoustic-to-articulatory speech inversion (SI) systems, we study the impact of
utilizing speech representations acquired via self-supervised learning (SSL)
models, such as HuBERT, compared to conventional acoustic features.
Additionally, we investigate the incorporation of novel tract variables (TVs)
through an improved geometric transformation model. By combining these two
approaches, we improve the Pearson product-moment correlation (PPMC) score,
which evaluates the accuracy of the SI system's TV estimation, from 0.7452 to
0.8141, an absolute improvement of 0.0689 (6.9%). Our findings underscore the
strong influence of rich feature representations from SSL models and improved
geometric transformations for the target TVs on the performance of SI systems.
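As a reference for the metric quoted above, here is a short numpy sketch of how a PPMC score per tract variable can be computed and averaged; the array names, shapes, and random data are illustrative, not taken from the paper's evaluation code.

```python
# Sketch of the PPMC evaluation described above: Pearson correlation per tract
# variable (TV) between estimated and ground-truth trajectories, averaged over TVs.
import numpy as np

def ppmc_per_tv(estimated: np.ndarray, target: np.ndarray) -> np.ndarray:
    """estimated, target: (n_frames, n_tvs) arrays of TV trajectories."""
    est_c = estimated - estimated.mean(axis=0)
    tgt_c = target - target.mean(axis=0)
    num = (est_c * tgt_c).sum(axis=0)
    den = np.sqrt((est_c ** 2).sum(axis=0) * (tgt_c ** 2).sum(axis=0))
    return num / den

rng = np.random.default_rng(0)
target = rng.standard_normal((500, 9))              # e.g. nine TV trajectories
estimated = target + 0.5 * rng.standard_normal((500, 9))
scores = ppmc_per_tv(estimated, target)
print(scores.round(3), "mean PPMC:", scores.mean().round(4))
```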
Multimodal Approach for Assessing Neuromotor Coordination in Schizophrenia Using Convolutional Neural Networks
This study investigates the speech articulatory coordination in schizophrenia
subjects exhibiting strong positive symptoms (e.g. hallucinations and
delusions), using two distinct channel-delay correlation methods. We show that
schizophrenic subjects with strong positive symptoms who are markedly ill
exhibit a more complex articulatory coordination pattern in facial and speech
gestures than is observed in healthy subjects. This distinction in speech
coordination pattern is used to train a multimodal convolutional neural network
(CNN) which uses video and audio data during speech to distinguish
schizophrenic patients with strong positive symptoms from healthy subjects. We
also show that the vocal tract variables (TVs) which correspond to place of
articulation and glottal source outperform the Mel-frequency Cepstral
Coefficients (MFCCs) when fused with Facial Action Units (FAUs) in the proposed
multimodal network. For the clinical dataset we collected, our best-performing
multimodal network improves the mean F1 score for detecting schizophrenia by
around 18% relative to the full vocal tract coordination (FVTC) baseline
method implemented by fusing FAUs and MFCCs.
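The following is a hypothetical sketch of the kind of two-branch fusion network the abstract describes: one convolutional branch over TV-derived feature sequences, one over FAU sequences, concatenated before a binary classifier. Channel counts, the number of FAUs, and sequence lengths are assumptions.

```python
# Hypothetical two-branch multimodal CNN: one branch over vocal-tract-variable (TV)
# feature sequences, one over Facial Action Unit (FAU) sequences, fused before a
# binary classifier. Sizes are illustrative only.
import torch
import torch.nn as nn

class Branch(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_ch, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        )

    def forward(self, x):                 # x: (B, in_ch, T)
        return self.net(x)

class MultimodalCNN(nn.Module):
    def __init__(self, n_tv=6, n_fau=17):
        super().__init__()
        self.audio_branch = Branch(n_tv)
        self.video_branch = Branch(n_fau)
        self.classifier = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, tvs, faus):
        fused = torch.cat([self.audio_branch(tvs), self.video_branch(faus)], dim=1)
        return self.classifier(fused)     # logits: patient vs. healthy control

model = MultimodalCNN()
logits = model(torch.randn(4, 6, 200), torch.randn(4, 17, 200))
print(logits.shape)                        # torch.Size([4, 2])
```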
Audio Data Augmentation for Acoustic-to-articulatory Speech Inversion using Bidirectional Gated RNNs
Data augmentation has proven to be a promising prospect in improving the
performance of deep learning models by adding variability to training data. In
previous work on developing a noise-robust acoustic-to-articulatory speech
inversion system, we showed the importance of noise augmentation for improving
the performance of speech inversion in noisy speech. In this work, we compare
and contrast different ways of doing data augmentation and show how this
technique improves the performance of articulatory speech inversion not only on
noisy speech, but also on clean speech data. We also propose a Bidirectional
Gated Recurrent Neural Network as the speech inversion system instead of the
previously used feed forward neural network. The inversion system uses
mel-frequency cepstral coefficients (MFCCs) as the input acoustic features and
six vocal tract variables (TVs) as the output articulatory features. The
performance of the system was measured by computing the correlation between
estimated and actual TVs on the U. Wisc. X-ray Microbeam database. The proposed
speech inversion system shows a 5% relative improvement in correlation over the
baseline noise robust system for clean speech data. The pre-trained model, when
adapted to each unseen speaker in the test set, improves the average
correlation by another 6%.
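A minimal sketch of the described setup, assuming typical hyperparameters rather than the paper's: a bidirectional GRU maps MFCC frames to six TV trajectories, and a toy additive-noise step illustrates where acoustic augmentation enters the training loop.

```python
# Sketch of a bidirectional GRU speech inversion model (MFCCs in, six TVs out)
# with a simple additive-noise augmentation step. Hyperparameters are assumed.
import torch
import torch.nn as nn

class BiGRUInversion(nn.Module):
    def __init__(self, n_mfcc=13, hidden=128, n_tvs=6):
        super().__init__()
        self.gru = nn.GRU(n_mfcc, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, n_tvs)

    def forward(self, mfcc):              # mfcc: (B, T, n_mfcc)
        out, _ = self.gru(mfcc)
        return self.proj(out)             # (B, T, n_tvs)

def augment_with_noise(mfcc, noise_scale=0.1):
    """Toy augmentation: add Gaussian noise scaled to the feature magnitude."""
    return mfcc + noise_scale * mfcc.std() * torch.randn_like(mfcc)

model = BiGRUInversion()
mfcc = torch.randn(8, 300, 13)            # batch of MFCC sequences
tvs_true = torch.randn(8, 300, 6)         # corresponding TV trajectories
loss = nn.functional.mse_loss(model(augment_with_noise(mfcc)), tvs_true)
loss.backward()
```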
Speaker-independent Speech Inversion for Estimation of Nasalance
The velopharyngeal (VP) valve regulates the opening between the nasal and
oral cavities. This valve opens and closes through a coordinated motion of the
velum and pharyngeal walls. Nasalance is an objective measure derived from the
oral and nasal acoustic signals and correlates with nasality. In this work, we
evaluate the degree to which the nasalance measure reflects fine-grained
patterns of VP movement by comparison with simultaneously collected direct
measures of VP opening using high-speed nasopharyngoscopy (HSN). We show that
nasalance is significantly correlated with the HSN signal, and that both match
expected patterns of nasality. We then train a temporal convolution-based
speech inversion system in a speaker-independent fashion to estimate VP
movement for nasality, using nasalance as the ground truth. In further
experiments, we also show the importance of incorporating source features (from
glottal activity) to improve nasality prediction.
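Nasalance is commonly computed as the ratio of nasal acoustic energy to the combined nasal and oral energy; the sketch below implements that standard frame-wise definition, with the caveat that the paper's exact filtering and smoothing chain may differ.

```python
# Frame-wise nasalance: nasal acoustic energy divided by the sum of nasal and
# oral energy (the common definition; preprocessing details may differ here).
import numpy as np

def frame_energy(signal: np.ndarray, frame_len: int, hop: int) -> np.ndarray:
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return np.sqrt((signal[idx] ** 2).mean(axis=1))  # RMS per frame

def nasalance(nasal: np.ndarray, oral: np.ndarray, sr=16000,
              frame_ms=20, hop_ms=10) -> np.ndarray:
    frame_len, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n = frame_energy(nasal, frame_len, hop)
    o = frame_energy(oral, frame_len, hop)
    return 100.0 * n / (n + o + 1e-8)                # percent nasalance per frame

rng = np.random.default_rng(1)
oral = rng.standard_normal(16000)                    # 1 s of simulated oral-channel audio
nasal = 0.3 * rng.standard_normal(16000)             # weaker simulated nasal channel
print(nasalance(nasal, oral)[:5].round(1))
```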
Acoustic-to-Articulatory Speech Inversion Features for Mispronunciation Detection of /r/ in Child Speech Sound Disorders
Acoustic-to-articulatory speech inversion could enhance automated clinical
mispronunciation detection to provide detailed articulatory feedback
unattainable by formant-based mispronunciation detection algorithms; however,
it is unclear to what extent a speech inversion system trained on adult
speech performs on (1) child and (2) clinical speech. In the
absence of an articulatory dataset in children with rhotic speech sound
disorders, we show that classifiers trained on tract variables from
acoustic-to-articulatory speech inversion meet or exceed the performance of
state-of-the-art features when predicting clinician judgment of rhoticity.
Index Terms: rhotic, speech sound disorder, mispronunciation detection
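To make the classification setup concrete, here is an illustrative scikit-learn pipeline that predicts a binary clinician judgment of rhoticity from TV-derived features; the random features and labels are placeholders for real speech-inversion outputs.

```python
# Illustrative classifier predicting a binary clinician judgment of rhoticity
# from tract-variable-derived features. The random features and labels are
# placeholders, not real speech-inversion data.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
X = rng.standard_normal((200, 12))        # e.g. summary statistics of 6 TVs per /r/ token
y = rng.integers(0, 2, size=200)          # 1 = fully rhotic, 0 = derhotacized (toy labels)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(clf, X, y, cv=5, scoring="f1")
print("cross-validated F1:", scores.mean().round(3))
```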
Acoustic-to-articulatory Speech Inversion with Multi-task Learning
Multi-task learning (MTL) frameworks have proven to be effective in diverse
speech-related tasks such as automatic speech recognition (ASR) and speech emotion
recognition. This paper proposes an MTL framework to perform
acoustic-to-articulatory speech inversion by simultaneously learning an
acoustic to phoneme mapping as a shared task. We use the Haskins Production
Rate Comparison (HPRC) database which has both the electromagnetic
articulography (EMA) data and the corresponding phonetic transcriptions.
Performance of the system was measured by computing the correlation between
estimated and actual tract variables (TVs) from the acoustic to articulatory
speech inversion task. The proposed MTL based Bidirectional Gated Recurrent
Neural Network (RNN) model learns to map the input acoustic features to nine
TVs while outperforming the baseline model trained to perform only
acoustic-to-articulatory inversion.
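A minimal sketch of the multi-task idea, with assumed dimensions, phoneme inventory size, and loss weighting: a shared bidirectional GRU feeds one head that regresses the nine TVs and a second head that classifies frame-level phonemes.

```python
# Sketch of a multi-task BiGRU: a shared recurrent encoder with one head for
# TV regression and one for frame-level phoneme classification. Dimensions,
# phoneme inventory size, and the loss weighting are assumptions.
import torch
import torch.nn as nn

class MTLInversion(nn.Module):
    def __init__(self, n_feats=13, hidden=128, n_tvs=9, n_phones=40):
        super().__init__()
        self.shared = nn.GRU(n_feats, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.tv_head = nn.Linear(2 * hidden, n_tvs)        # main regression task
        self.phone_head = nn.Linear(2 * hidden, n_phones)  # auxiliary classification task

    def forward(self, feats):             # feats: (B, T, n_feats)
        h, _ = self.shared(feats)
        return self.tv_head(h), self.phone_head(h)

model = MTLInversion()
feats = torch.randn(4, 200, 13)
tvs_true = torch.randn(4, 200, 9)
phones_true = torch.randint(0, 40, (4, 200))

tv_pred, phone_logits = model(feats)
loss = nn.functional.mse_loss(tv_pred, tvs_true) + \
       0.5 * nn.functional.cross_entropy(phone_logits.transpose(1, 2), phones_true)
loss.backward()
```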