Articulatory and bottleneck features for speaker-independent ASR of dysarthric speech
The rapid population aging has stimulated the development of assistive
devices that provide personalized medical support to people in need suffering from
various etiologies. One prominent clinical application is a computer-assisted
speech training system which enables personalized speech therapy to patients
impaired by communicative disorders in the patient's home environment. Such a
system relies on the robust automatic speech recognition (ASR) technology to be
able to provide accurate articulation feedback. With the long-term aim of
developing off-the-shelf ASR systems that can be incorporated in clinical
context without prior speaker information, we compare the ASR performance of
speaker-independent bottleneck and articulatory features on dysarthric speech
used in conjunction with dedicated neural network-based acoustic models that
have been shown to be robust against spectrotemporal deviations. We report ASR
performance of these systems on two dysarthric speech datasets of different
characteristics to quantify the achieved performance gains. Despite the
remaining performance gap between the dysarthric and normal speech, significant
improvements have been reported on both datasets using speaker-independent ASR
architectures. Comment: to appear in Computer Speech & Language -
https://doi.org/10.1016/j.csl.2019.05.002 - arXiv admin note: substantial
text overlap with arXiv:1807.1094
Combining phonological and acoustic ASR-free features for pathological speech intelligibility assessment
Intelligibility is widely used to measure the severity of articulatory problems in pathological speech. Recently, a number of automatic intelligibility assessment tools have been developed. Most of them use automatic speech recognizers (ASR) to compare the patient's utterance with the target text. These methods are bound to one language and tend to be less accurate when speakers hesitate or make reading errors. To circumvent these problems, two different ASR-free methods were developed over the last few years, making use only of the acoustic or phonological properties of the utterance. In this paper, we demonstrate that these ASR-free techniques are also able to predict intelligibility in other languages. Moreover, they prove to be complementary, resulting in even better intelligibility predictions when both methods are combined
Learning to detect dysarthria from raw speech
Speech classifiers of paralinguistic traits traditionally learn from diverse
hand-crafted low-level features, by selecting the relevant information for the
task at hand. We explore an alternative to this selection, by learning jointly
the classifier, and the feature extraction. Recent work on speech recognition
has shown improved performance over speech features by learning from the
waveform. We extend this approach to paralinguistic classification and propose
a neural network that can learn a filterbank, a normalization factor and a
compression power from the raw speech, jointly with the rest of the
architecture. We apply this model to dysarthria detection from sentence-level
audio recordings. Starting from a strong attention-based baseline on which
mel-filterbanks outperform standard low-level descriptors, we show that
learning the filters or the normalization and compression improves over fixed
features by 10% absolute accuracy. We also observe a gain over OpenSmile
features by learning jointly the feature extraction, the normalization, and the
compression factor with the architecture. This constitutes a first attempt at
learning jointly all these operations from raw audio for a speech
classification task. Comment: 5 pages, 3 figures, submitted to ICASS
Recognizing Speech in a Novel Accent: The Motor Theory of Speech Perception Reframed
The motor theory of speech perception holds that we perceive the speech of
another in terms of a motor representation of that speech. However, when we
have learned to recognize a foreign accent, it seems plausible that recognition
of a word rarely involves reconstruction of the speech gestures of the speaker
rather than the listener. To better assess the motor theory and this
observation, we proceed in three stages. Part 1 places the motor theory of
speech perception in a larger framework based on our earlier models of the
adaptive formation of mirror neurons for grasping, and for viewing extensions
of that mirror system as part of a larger system for neuro-linguistic
processing, augmented by the present consideration of recognizing speech in a
novel accent. Part 2 then offers a novel computational model of how a listener
comes to understand the speech of someone speaking the listener's native
language with a foreign accent. The core tenet of the model is that the
listener uses hypotheses about the word the speaker is currently uttering to
update probabilities linking the sound produced by the speaker to phonemes in
the native language repertoire of the listener. This, on average, improves the
recognition of later words. This model is neutral regarding the nature of the
representations it uses (motor vs. auditory). It serves as a reference point for
the discussion in Part 3, which proposes a dual-stream neuro-linguistic
architecture to revisit claims for and against the motor theory of speech
perception and the relevance of mirror neurons, and extracts some implications
for the reframing of the motor theory
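The update rule described in Part 2 of this abstract can be illustrated with a minimal sketch. It is not the authors' model: the class name `AccentListener`, the count-based probability table, the uniform prior, and the one-to-one alignment between heard sounds and hypothesized phonemes are all assumptions made for illustration.

```python
from collections import defaultdict


class AccentListener:
    """Toy listener that links accented sounds to native phonemes (hypothetical)."""

    def __init__(self, phonemes, prior=1.0):
        self.phonemes = list(phonemes)
        # counts[sound][phoneme]: pseudo-counts, starting from a uniform prior (assumption)
        self.counts = defaultdict(lambda: {p: prior for p in self.phonemes})

    def prob(self, sound, phoneme):
        """Current estimate of P(phoneme | sound)."""
        row = self.counts[sound]
        return row[phoneme] / sum(row.values())

    def update(self, heard_sounds, hypothesized_phonemes, weight=1.0):
        """Reinforce sound-to-phoneme links implied by a word hypothesis.

        Assumes the two sequences are already aligned one to one.
        """
        for sound, phoneme in zip(heard_sounds, hypothesized_phonemes):
            self.counts[sound][phoneme] += weight


listener = AccentListener(["th", "z", "s"])
before = listener.prob("z_accented", "th")
# The listener hypothesizes that the accented sound stood for "th" in the current word.
listener.update(["z_accented"], ["th"])
after = listener.prob("z_accented", "th")
```

In this sketch, each accepted word hypothesis raises the probability of the sound-to-phoneme links it implies, which is one simple way later words could become easier to recognize.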
Speech Enhancement Guided by Contextual Articulatory Information
Previous studies have confirmed the effectiveness of leveraging articulatory
information to attain improved speech enhancement (SE) performance. By
augmenting the original acoustic features with the place/manner of articulatory
features, the SE process can be guided to consider the articulatory properties
of the input speech when performing enhancement. Hence, we believe that the
contextual information of articulatory attributes should include useful
information and can further benefit SE in different languages. In this study,
we propose an SE system that improves its performance through optimizing the
contextual articulatory information in enhanced speech for both English and
Mandarin. We optimize the contextual articulatory information by
jointly training the SE model with an end-to-end automatic speech recognition (E2E
ASR) model, predicting the sequence of broad phone classes (BPC) instead of the
word sequences. Meanwhile, two training strategies are developed to train the
SE system based on the BPC-based ASR: multitask-learning and deep-feature
training strategies. Experimental results on the TIMIT and TMHINT datasets
confirm that the contextual articulatory information facilitates an SE system
in achieving better results than the traditional acoustic model (AM). Moreover,
in contrast to another SE system that is trained with monophonic ASR, the
BPC-based ASR (providing contextual articulatory information) can improve the
SE performance more effectively under different signal-to-noise ratios (SNR). Comment: Will be submitted to TASL
Boosting End-to-End Multilingual Phoneme Recognition through Exploiting Universal Speech Attributes Constraints
We propose a first step toward multilingual end-to-end automatic speech
recognition (ASR) by integrating knowledge about speech articulators. The key
idea is to leverage a rich set of fundamental units that can be defined
"universally" across all spoken languages, referred to as speech attributes,
namely manner and place of articulation. Specifically, several deterministic
attribute-to-phoneme mapping matrices are constructed based on a predefined
universal attribute inventory; these project the knowledge-rich
articulatory attribute logits into output phoneme logits. The mapping puts
knowledge-based constraints to limit inconsistency with acoustic-phonetic
evidence in the integrated prediction. Combined with phoneme recognition, our
phone recognizer is able to infer from both attribute and phoneme information.
The proposed joint multilingual model is evaluated through phoneme recognition.
In multilingual experiments over 6 languages on benchmark datasets LibriSpeech
and CommonVoice, we find that our proposed solution outperforms conventional
multilingual approaches with a relative improvement of 6.85% on average, and it
also performs much better than a monolingual model.
Further analysis conclusively demonstrates that the proposed solution
eliminates phoneme predictions that are inconsistent with attributes
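The idea of projecting attribute logits into phoneme logits through a deterministic mapping matrix can be sketched as follows. This is only an illustration under stated assumptions: the four-attribute, four-phoneme inventory, the binary matrix `MAPPING`, and the additive combination in `combine` are hypothetical and are not taken from the paper.

```python
# Hypothetical toy inventory (not the paper's attribute set).
ATTRIBUTES = ["stop", "nasal", "bilabial", "alveolar"]
PHONEMES = ["p", "m", "t", "n"]

# MAPPING[i][j] is 1 when attribute i is part of phoneme j, else 0.
MAPPING = [
    [1, 0, 1, 0],  # stop:     p, t
    [0, 1, 0, 1],  # nasal:    m, n
    [1, 1, 0, 0],  # bilabial: p, m
    [0, 0, 1, 1],  # alveolar: t, n
]


def project(attribute_logits):
    """Map attribute logits to phoneme logits with the fixed binary matrix."""
    return [
        sum(attribute_logits[i] * MAPPING[i][j] for i in range(len(ATTRIBUTES)))
        for j in range(len(PHONEMES))
    ]


def combine(attribute_logits, phoneme_logits):
    """Add projected attribute evidence to direct phoneme logits (assumed fusion rule)."""
    return [a + b for a, b in zip(project(attribute_logits), phoneme_logits)]


attribute_logits = [2.0, -1.0, 1.5, -0.5]  # strong "stop" and "bilabial" evidence
projected = project(attribute_logits)
fused = combine(attribute_logits, [0.0, 1.0, 0.0, 0.0])
best = PHONEMES[fused.index(max(fused))]
```

Here the attribute evidence outweighs a direct phoneme score that favours "m", so the fused prediction is "p", the only phoneme consistent with both the stop and bilabial attributes.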
The listening talker: A review of human and algorithmic context-induced modifications of speech
Speech output technology is finding widespread application, including in scenarios where intelligibility might be compromised - at least for some listeners - by adverse conditions. Unlike most current algorithms, talkers continually adapt their speech patterns as a response to the immediate context of spoken communication, where the type of interlocutor and the environment are the dominant situational factors influencing speech production. Observations of talker behaviour can motivate the design of more robust speech output algorithms. Starting with a listener-oriented categorisation of possible goals for speech modification, this review article summarises the extensive set of behavioural findings related to human speech modification, identifies which factors appear to be beneficial, and goes on to examine previous computational attempts to improve intelligibility in noise. The review concludes by tabulating 46 speech modifications, many of which have yet to be perceptually or algorithmically evaluated. Consequently, the review provides a roadmap for future work in improving the robustness of speech output