9,590 research outputs found
Articulatory and bottleneck features for speaker-independent ASR of dysarthric speech
The rapid population aging has stimulated the development of assistive
devices that provide personalized medical support to the needies suffering from
various etiologies. One prominent clinical application is a computer-assisted
speech training system which enables personalized speech therapy to patients
impaired by communicative disorders in the patient's home environment. Such a
system relies on the robust automatic speech recognition (ASR) technology to be
able to provide accurate articulation feedback. With the long-term aim of
developing off-the-shelf ASR systems that can be incorporated in clinical
context without prior speaker information, we compare the ASR performance of
speaker-independent bottleneck and articulatory features on dysarthric speech
used in conjunction with dedicated neural network-based acoustic models that
have been shown to be robust against spectrotemporal deviations. We report ASR
performance of these systems on two dysarthric speech datasets of different
characteristics to quantify the achieved performance gains. Despite the
remaining performance gap between the dysarthric and normal speech, significant
improvements have been reported on both datasets using speaker-independent ASR
architectures.Comment: to appear in Computer Speech & Language -
https://doi.org/10.1016/j.csl.2019.05.002 - arXiv admin note: substantial
text overlap with arXiv:1807.1094
Deep HMResNet Model for Human Activity-Aware Robotic Systems
Endowing the robotic systems with cognitive capabilities for recognizing
daily activities of humans is an important challenge, which requires
sophisticated and novel approaches. Most of the proposed approaches explore
pattern recognition techniques which are generally based on hand-crafted
features or learned features. In this paper, a novel Hierarchal Multichannel
Deep Residual Network (HMResNet) model is proposed for robotic systems to
recognize daily human activities in the ambient environments. The introduced
model is comprised of multilevel fusion layers. The proposed Multichannel 1D
Deep Residual Network model is, at the features level, combined with a
Bottleneck MLP neural network to automatically extract robust features
regardless of the hardware configuration and, at the decision level, is fully
connected with an MLP neural network to recognize daily human activities.
Empirical experiments on real-world datasets and an online demonstration are
used for validating the proposed model. Results demonstrated that the proposed
model outperforms the baseline models in daily human activity recognition.Comment: Presented at AI-HRI AAAI-FSS, 2018 (arXiv:1809.06606
Analyzing Input and Output Representations for Speech-Driven Gesture Generation
This paper presents a novel framework for automatic speech-driven gesture
generation, applicable to human-agent interaction including both virtual agents
and robots. Specifically, we extend recent deep-learning-based, data-driven
methods for speech-driven gesture generation by incorporating representation
learning. Our model takes speech as input and produces gestures as output, in
the form of a sequence of 3D coordinates. Our approach consists of two steps.
First, we learn a lower-dimensional representation of human motion using a
denoising autoencoder neural network, consisting of a motion encoder MotionE
and a motion decoder MotionD. The learned representation preserves the most
important aspects of the human pose variation while removing less relevant
variation. Second, we train a novel encoder network SpeechE to map from speech
to a corresponding motion representation with reduced dimensionality. At test
time, the speech encoder and the motion decoder networks are combined: SpeechE
predicts motion representations based on a given speech signal and MotionD then
decodes these representations to produce motion sequences. We evaluate
different representation sizes in order to find the most effective
dimensionality for the representation. We also evaluate the effects of using
different speech features as input to the model. We find that mel-frequency
cepstral coefficients (MFCCs), alone or combined with prosodic features,
perform the best. The results of a subsequent user study confirm the benefits
of the representation learning.Comment: Accepted at IVA '19. Shorter version published at AAMAS '19. The code
is available at
https://github.com/GestureGeneration/Speech_driven_gesture_generation_with_autoencode
The information bottleneck method
We define the relevant information in a signal as being the
information that this signal provides about another signal y\in \Y. Examples
include the information that face images provide about the names of the people
portrayed, or the information that speech sounds provide about the words
spoken. Understanding the signal requires more than just predicting , it
also requires specifying which features of \X play a role in the prediction.
We formalize this problem as that of finding a short code for \X that
preserves the maximum information about \Y. That is, we squeeze the
information that \X provides about \Y through a `bottleneck' formed by a
limited set of codewords \tX. This constrained optimization problem can be
seen as a generalization of rate distortion theory in which the distortion
measure d(x,\x) emerges from the joint statistics of \X and \Y. This
approach yields an exact set of self consistent equations for the coding rules
X \to \tX and \tX \to \Y. Solutions to these equations can be found by a
convergent re-estimation method that generalizes the Blahut-Arimoto algorithm.
Our variational principle provides a surprisingly rich framework for discussing
a variety of problems in signal processing and learning, as will be described
in detail elsewhere
- …