Tongue contour extraction from ultrasound images based on deep neural network
Studying tongue motion during speech using ultrasound is a standard
procedure, but automatic ultrasound image labelling remains a challenge, as
standard tongue shape extraction methods typically require human intervention.
This article presents a method based on deep neural networks to automatically
extract tongue contour from ultrasound images on a speech dataset. We use a
deep autoencoder trained to learn the relationship between an image and its
related contour, so that the model is able to automatically reconstruct
contours from the ultrasound image alone. In this paper, we use an automatic
labelling algorithm instead of time-consuming hand-labelling during the
training process, and evaluate the performance of both automatic labelling and
contour extraction against hand-labelling. The observed results show quality
scores comparable to the state of the art.

Comment: 5 pages, 3 figures, published in The International Congress of
Phonetic Sciences, 201
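As a rough sketch of the idea above (not the authors' exact architecture), the
model below maps a 64x64 ultrasound frame to one vertical contour coordinate
per image column; the layer sizes and this contour representation are
illustrative assumptions.

```python
# Minimal sketch: ultrasound frame in, tongue contour out, where the
# contour is encoded as one vertical position per image column.
import torch
import torch.nn as nn

class ImageToContour(nn.Module):
    def __init__(self, img_size=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1),   # 64 -> 32
            nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1),  # 32 -> 16
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, 256),
            nn.ReLU(),
        )
        # Decode the latent code into one contour height per column.
        self.decoder = nn.Linear(256, img_size)

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = ImageToContour()
frames = torch.rand(8, 1, 64, 64)        # batch of ultrasound frames (stand-in)
contours = model(frames)                 # (8, 64) vertical positions
loss = nn.functional.mse_loss(contours, torch.rand(8, 64))  # vs. labels
```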
Updating the silent speech challenge benchmark with deep learning
The 2010 Silent Speech Challenge benchmark is updated with new results
obtained with a deep learning strategy, using the same input features and
decoding strategy as in the original article. A Word Error Rate of 6.4% is
obtained, compared to the published value of 17.4%. Additional results
comparing new auto-encoder-based features with the original features at reduced
dimensionality, as well as decoding scenarios on two different language models,
are also presented. The Silent Speech Challenge archive has been updated to
contain both the original and the new auto-encoder features, in addition to the
original raw data.

Comment: 25 pages, 6 page
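For reference, the Word Error Rate quoted above is the word-level edit
distance (substitutions, insertions, and deletions) between hypothesis and
reference, divided by the number of words in the reference; a minimal
self-contained implementation:

```python
# Word Error Rate via dynamic-programming edit distance over words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("silent speech challenge", "silent speech challenges"))  # ~0.33
```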
Real-time Ultrasound-enhanced Multimodal Imaging of Tongue using 3D Printable Stabilizer System: A Deep Learning Approach
Despite renewed awareness of the importance of articulation, it remains a
challenge for instructors to handle the pronunciation needs of language
learners. There are relatively scarce pedagogical tools for pronunciation
teaching and learning. Unlike inefficient traditional pronunciation
instruction such as listen-and-repeat, electronic visual feedback (EVF)
systems such as ultrasound technology have been employed in new approaches.
Recently, an ultrasound-enhanced multimodal method has been developed for
visualizing tongue movements of a language learner overlaid on the face-side of
the speaker's head. That system was evaluated in several university-level
language courses within a blended learning paradigm. The results indicated
that visualizing the articulatory system as biofeedback significantly
improves the efficiency of articulation learning. Despite the successful use
of multimodal techniques for pronunciation training, such systems still
require manual work and human intervention. In this article, we contribute to
this growing body of research by addressing the difficulties of previous
approaches and proposing a new comprehensive, automatic, real-time multimodal
pronunciation training system that benefits from powerful artificial
intelligence techniques. The main objective of this research was to combine
the advantages of ultrasound technology, three-dimensional printing, and deep
learning algorithms to enhance the performance of previous systems. Our
preliminary pedagogical evaluation of the proposed system revealed a
significant improvement in flexibility, control, robustness, and autonomy.

Comment: 12 figures, 1 table
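The paper's pipeline is far richer than this, but the core visual-feedback
idea of blending an ultrasound tongue frame onto a side-profile camera frame
can be sketched in a few lines of OpenCV; the file names, placement offsets,
and blend weights below are all hypothetical.

```python
# Illustrative overlay only, not the paper's system: alpha-blend an
# ultrasound frame into a region of the speaker's side-profile image.
import cv2

face = cv2.imread("face_side.png")           # camera frame (hypothetical file)
tongue = cv2.imread("ultrasound_frame.png")  # ultrasound frame (hypothetical file)
tongue = cv2.resize(tongue, (200, 200))

x, y = 80, 120                               # assumed vocal-tract position
roi = face[y:y + 200, x:x + 200]
face[y:y + 200, x:x + 200] = cv2.addWeighted(roi, 0.5, tongue, 0.5, 0)
cv2.imwrite("overlay.png", face)
```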
DNN-based Acoustic-to-Articulatory Inversion using Ultrasound Tongue Imaging
Speech sounds are produced as the coordinated movement of the speaking
organs. Several methods are available to model the relation between
articulatory movements and the resulting speech signal; the reverse problem
is often called acoustic-to-articulatory inversion (AAI). In this paper we
implement several Deep Neural Networks (DNNs) to estimate articulatory
information from the acoustic signal. Most previous work on this task uses
ElectroMagnetic Articulography (EMA) to track articulatory movement.
Compared to EMA, Ultrasound Tongue Imaging (UTI) offers a better cost-benefit
trade-off when considering equipment cost, portability, safety, and the
structures it visualizes. Our goal is therefore to train a DNN that produces
UTI images from speech input. We also test two approaches to represent
the articulatory information: 1) the EigenTongue space and 2) the raw
ultrasound image. As an objective quality measure for the reconstructed UT
images, we use MSE, Structural Similarity Index (SSIM) and Complex-Wavelet SSIM
(CW-SSIM). Our experimental results show that CW-SSIM is the most useful error
measure in the UTI context. We tested three different system configurations:
a) a simple DNN composed of 2 hidden layers with a 64x64-pixel UTI image as
the target; b) the same simple DNN but with ultrasound images projected to
the EigenTongue space as the target; and c) a more complex DNN composed of 5
hidden layers with UTI images projected to the EigenTongue space. In a subjective
experiment the subjects found that the neural networks with two hidden layers
were more suitable for this inversion task.

Comment: 8 pages, 5 figures, Accepted to IJCNN 201
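Configuration (b) can be sketched concretely: project the ultrasound frames
onto an "EigenTongue" space (the leading principal components of the frames)
and train a two-hidden-layer network to predict those coefficients from
acoustic features. The data shapes, component count, and layer sizes below
are illustrative assumptions, and the arrays are random stand-ins.

```python
# Sketch of acoustic-to-articulatory inversion via EigenTongue targets.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor

frames = np.random.rand(500, 64 * 64)   # flattened 64x64 UTI frames (stand-in)
acoustic = np.random.rand(500, 40)      # e.g. 40 spectral features per frame

# EigenTongue space: first principal components of the ultrasound frames.
eigentongue = PCA(n_components=30).fit(frames)
targets = eigentongue.transform(frames)  # per-frame EigenTongue coefficients

dnn = MLPRegressor(hidden_layer_sizes=(1000, 1000), max_iter=50)
dnn.fit(acoustic, targets)

# Reconstruct predicted ultrasound frames from predicted coefficients.
predicted_frames = eigentongue.inverse_transform(dnn.predict(acoustic))
```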
Autoencoder-Based Articulatory-to-Acoustic Mapping for Ultrasound Silent Speech Interfaces
When using ultrasound video as input, Deep Neural Network-based Silent Speech
Interfaces usually rely on the whole image to estimate the spectral parameters
required for the speech synthesis step. Although this approach is quite
straightforward and permits the synthesis of intelligible speech, it has
several disadvantages as well. Besides the inability to capture the relations
between close regions (i.e. pixels) of the image, this pixel-by-pixel
representation of the image is also quite uneconomical. It is easy to see that
a significant part of the image is irrelevant for the spectral parameter
estimation task as the information stored by the neighbouring pixels is
redundant, and the neural network is quite large due to the large number of
input features. To resolve these issues, in this study we train an autoencoder
neural network on the ultrasound image; the estimation of the spectral speech
parameters is done by a second DNN, using the activations of the bottleneck
layer of the autoencoder network as features. In our experiments, the proposed
method proved to be more efficient than the standard approach: the measured
normalized mean squared error scores were lower, while the correlation values
were higher in each case. Based on the result of a listening test, the
synthesized utterances also sounded more natural to native speakers. A further
advantage of our proposed approach is that, due to the (relatively) small size
of the bottleneck layer, we can utilize several consecutive ultrasound images
during estimation without a significant increase in the network size, while
significantly increasing the accuracy of parameter estimation.

Comment: 8 pages, 6 figures, Accepted to IJCNN 201
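A minimal sketch of the two-stage idea follows: (1) an autoencoder compresses
each ultrasound frame into a small bottleneck vector; (2) a second network
maps stacked bottleneck vectors to spectral speech parameters. All layer and
feature sizes are assumptions, not the paper's exact values.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n_pixels=64 * 64, bottleneck=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_pixels, 512), nn.ReLU(),
            nn.Linear(512, bottleneck), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck, 512), nn.ReLU(),
            nn.Linear(512, n_pixels),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

ae = Autoencoder()
frames = torch.rand(100, 64 * 64)                        # stand-in frame sequence
recon_loss = nn.functional.mse_loss(ae(frames), frames)  # stage 1: train on images

# Stage 2: bottleneck activations are compact, so several consecutive
# frames can be stacked as input without blowing up the network size.
codes = ae.encoder(frames).detach()        # (100, 128)
windows = codes.unfold(0, 5, 1)            # (96, 128, 5): 5-frame stacks
regressor = nn.Sequential(nn.Linear(128 * 5, 512), nn.ReLU(),
                          nn.Linear(512, 25))  # 25 spectral params (assumed)
spectral = regressor(windows.reshape(len(windows), -1))
```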
IrisNet: Deep Learning for Automatic and Real-time Tongue Contour Tracking in Ultrasound Video Data using Peripheral Vision
The progress of deep convolutional neural networks has been successfully
exploited in various real-time computer vision tasks such as image
classification and segmentation. Owing to the development of computational
units, availability of digital datasets, and improved performance of deep
learning models, fully automatic and accurate tracking of tongue contours in
real-time ultrasound data has become practical only in recent years. Recent
studies have shown that deep learning techniques perform well in tracking
ultrasound tongue contours in real-time applications such as pronunciation
training using multimodal ultrasound-enhanced approaches. Due to the high
correlation between ultrasound tongue datasets, it is feasible to build a
general model that accomplishes automatic tongue tracking for almost all
datasets. In this paper, we propose a deep learning model comprising a
convolutional module that mimics the peripheral vision ability of the human
eye to handle real-time, accurate, and fully automatic tongue contour
tracking, applicable to almost all primary ultrasound tongue datasets.
Qualitative and quantitative assessment of IrisNet on different ultrasound
tongue datasets and on PASCAL VOC2012 revealed its outstanding generalization
compared with similar techniques.
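The abstract does not specify IrisNet's layers. Dilated convolutions are one
common way to give a network a wide, low-resolution "peripheral" view of the
frame alongside a sharp local view, so the block below uses them purely as an
illustrative stand-in, with assumed channel counts.

```python
import torch
import torch.nn as nn

class PeripheralBlock(nn.Module):
    def __init__(self, ch=16):
        super().__init__()
        self.local = nn.Conv2d(ch, ch, 3, padding=1)                  # fine detail
        self.periphery = nn.Conv2d(ch, ch, 3, padding=4, dilation=4)  # wide context
        self.merge = nn.Conv2d(2 * ch, ch, 1)

    def forward(self, x):
        both = torch.cat([self.local(x), self.periphery(x)], dim=1)
        return torch.relu(self.merge(both))

x = torch.rand(1, 16, 128, 128)
print(PeripheralBlock()(x).shape)  # torch.Size([1, 16, 128, 128])
```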
Deep Learning for Automatic Tracking of Tongue Surface in Real-time Ultrasound Videos, Landmarks instead of Contours
One use of medical ultrasound imaging is to visualize and characterize human
tongue shape and motion in real time during speech, to study healthy or
impaired speech production. Due to the low-contrast and noisy nature of
ultrasound images, non-expert users may struggle to recognize tongue gestures
in applications such as visual training of a second language. Moreover,
quantitative analysis of tongue motion needs the tongue
dorsum contour to be extracted, tracked, and visualized. Manual tongue contour
extraction is a cumbersome, subjective, and error-prone task. Furthermore, it
is not a feasible solution for real-time applications. The growth of deep
learning has been vigorously exploited in various computer vision tasks,
including ultrasound tongue contour tracking. In the current methods, the
process of tongue contour extraction comprises two steps of image segmentation
and post-processing. This paper presents a novel approach to automatic,
real-time tongue contour tracking using deep neural networks. In the proposed
method, instead of the two-step procedure, landmarks of the tongue surface
are tracked. This idea enables researchers in this field to benefit from
previously annotated databases and achieve highly accurate results. Our
experiments revealed the outstanding results of the proposed technique in
terms of generalization, performance, and accuracy.

Comment: 8 pages, 5 figures
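The landmark idea can be sketched as direct coordinate regression: instead of
segmenting the whole frame, a network predicts K (x, y) points along the
tongue surface. K and the backbone below are assumptions, not the paper's
exact design.

```python
import torch
import torch.nn as nn

K = 10  # number of tongue-surface landmarks (assumed)

net = nn.Sequential(
    nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(4),
    nn.Flatten(),
    nn.Linear(32 * 4 * 4, 2 * K),  # (x, y) for each landmark
)

frames = torch.rand(8, 1, 128, 128)            # stand-in ultrasound frames
landmarks = net(frames).view(8, K, 2)          # normalized coordinates
loss = nn.functional.mse_loss(landmarks, torch.rand(8, K, 2))  # vs. labels
```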
A CNN-based tool for automatic tongue contour tracking in ultrasound images
For speech research, ultrasound tongue imaging provides a non-invasive means
for visualizing tongue position and movement during articulation. Extracting
tongue contours from ultrasound images is a basic step in analyzing ultrasound
data but this task often requires non-trivial manual annotation. This study
presents an open source tool for fully automatic tracking of tongue contours in
ultrasound frames using neural network based methods. We have implemented and
systematically compared two convolutional neural networks, U-Net and
Dense U-Net, under different conditions. Though both models can perform
automatic contour tracking with comparable accuracy, the Dense U-Net
architecture seems more generalizable across test datasets, while U-Net has a
faster extraction speed. Our comparison also shows that the choice of loss
function and data augmentation has a greater effect on tracking performance
in this task. This publicly available segmentation tool shows considerable
promise for the automated tongue contour annotation of ultrasound images in
speech research.
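Since the loss function is flagged above as a deciding factor, here is one
typical choice for this kind of contour-segmentation training, a soft Dice
loss; it is shown only as an example, not as the tool's actual default.

```python
import torch

def dice_loss(pred, target, eps=1e-6):
    """pred: sigmoid probabilities, target: binary mask, both (N, 1, H, W)."""
    inter = (pred * target).sum(dim=(1, 2, 3))
    union = pred.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    return 1 - ((2 * inter + eps) / (union + eps)).mean()

pred = torch.rand(4, 1, 64, 64)                     # stand-in predictions
mask = (torch.rand(4, 1, 64, 64) > 0.5).float()     # stand-in ground truth
print(dice_loss(pred, mask))
```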
Ultra2Speech -- A Deep Learning Framework for Formant Frequency Estimation and Tracking from Ultrasound Tongue Images
Thousands of individuals need surgical removal of their larynx due to
critical diseases every year and therefore, require an alternative form of
communication to articulate speech sounds after the loss of their voice box.
This work addresses the articulatory-to-acoustic mapping problem based on
ultrasound (US) tongue images for the development of a silent-speech interface
(SSI) that can assist them in their daily interactions. Our approach
automatically extracts tongue movement information by selecting an optimal
feature set from US images and mapping these features to the acoustic space.
We use a novel deep learning architecture, which we call Ultrasound2Formant
(U2F) Net, to map US tongue images from a probe placed beneath the subject's
chin to formants. It uses hybrid spatio-temporal 3D convolutions followed by
feature shuffling to estimate and track vowel formants from US images. The
formant values are then used to synthesize continuous time-varying vowel
trajectories via the Klatt synthesizer. Our best model achieves an R-squared
(R^2) value of 99.96% on the regression
task. Our network lays the foundation for an SSI as it successfully tracks the
tongue contour automatically as an internal representation without any explicit
annotation.

Comment: Accepted for publication in MICCAI 202
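The two building blocks named in the abstract can be sketched as follows: a
spatio-temporal 3D convolution over a short window of consecutive ultrasound
frames, followed by a channel ("feature") shuffle that interleaves feature
maps across groups. Channel counts and the window length are assumptions.

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups):
    n, c, t, h, w = x.shape
    # Interleave channels across groups: reshape, transpose, flatten back.
    return x.view(n, groups, c // groups, t, h, w).transpose(1, 2).reshape(n, c, t, h, w)

conv3d = nn.Conv3d(1, 8, kernel_size=(3, 3, 3), padding=1)  # (time, H, W)

clip = torch.rand(2, 1, 5, 64, 64)   # 5 consecutive ultrasound frames (stand-in)
features = channel_shuffle(torch.relu(conv3d(clip)), groups=2)
print(features.shape)                # torch.Size([2, 8, 5, 64, 64])
```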
Transfer Learning for Ultrasound Tongue Contour Extraction with Different Domains
Medical ultrasound technology is widely used in routine clinical applications
such as disease diagnosis and treatment as well as other applications like
real-time monitoring of human tongue shapes and motions as visual feedback in
second language training. Due to the low-contrast and noisy nature of
ultrasound images, non-expert users may struggle to recognize tongue
gestures. Manual tongue segmentation is a cumbersome,
subjective, and error-prone task. Furthermore, it is not a feasible solution
for real-time applications. In the last few years, deep learning methods have
been used for delineating and tracking the tongue dorsum. Deep convolutional
neural networks (DCNNs), which have been shown to be successful in medical
image analysis tasks, are typically weak at the same task on different
domains. In many cases, DCNNs trained on data acquired with one ultrasound
device do not perform well on data from a different ultrasound device or
acquisition protocol. Domain adaptation addresses this difficulty by
transferring the weights from a model trained on a large annotated legacy
dataset to a new model, which is then fine-tuned on a different dataset. In
this study, after conducting extensive experiments, we address the problem of
domain adaptation on small ultrasound datasets for tongue contour extraction.
We trained a U-Net comprising an encoder-decoder path from scratch, and then,
in several surrogate scenarios, fine-tuned parts of the trained network on
another dataset to obtain domain-adapted networks. We repeated the scenarios
from target to source domains to find a balance point for knowledge transfer
from source to target and vice versa. The performance of the new fine-tuned
networks was evaluated on the same task with images from different domains.

Comment: 3 figures, 9 pages, 1 table, 16 references
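One surrogate fine-tuning scenario of the kind described above can be
sketched as freezing the encoder trained on the source domain and updating
only the decoder on the target domain. The model below is a stand-in, not the
paper's exact U-Net, and the weight file name is hypothetical.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
decoder = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(16, 1, 3, padding=1))

# encoder.load_state_dict(torch.load("source_domain_weights.pt"))  # hypothetical

for p in encoder.parameters():        # keep source-domain features fixed
    p.requires_grad = False

opt = torch.optim.Adam(decoder.parameters(), lr=1e-4)
frame = torch.rand(4, 1, 64, 64)      # target-domain frames (stand-in)
mask = torch.rand(4, 1, 64, 64)       # target-domain labels (stand-in)
loss = nn.functional.mse_loss(decoder(encoder(frame)), mask)
loss.backward()
opt.step()
```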