2,290 research outputs found
Articulatory and bottleneck features for speaker-independent ASR of dysarthric speech
The rapid population aging has stimulated the development of assistive
devices that provide personalized medical support to the needies suffering from
various etiologies. One prominent clinical application is a computer-assisted
speech training system which enables personalized speech therapy to patients
impaired by communicative disorders in the patient's home environment. Such a
system relies on the robust automatic speech recognition (ASR) technology to be
able to provide accurate articulation feedback. With the long-term aim of
developing off-the-shelf ASR systems that can be incorporated in clinical
context without prior speaker information, we compare the ASR performance of
speaker-independent bottleneck and articulatory features on dysarthric speech
used in conjunction with dedicated neural network-based acoustic models that
have been shown to be robust against spectrotemporal deviations. We report ASR
performance of these systems on two dysarthric speech datasets of different
characteristics to quantify the achieved performance gains. Despite the
remaining performance gap between the dysarthric and normal speech, significant
improvements have been reported on both datasets using speaker-independent ASR
architectures.Comment: to appear in Computer Speech & Language -
https://doi.org/10.1016/j.csl.2019.05.002 - arXiv admin note: substantial
text overlap with arXiv:1807.1094
ARTICULATORY INFORMATION FOR ROBUST SPEECH RECOGNITION
Current Automatic Speech Recognition (ASR) systems fail to perform nearly as good as human speech recognition performance due to their lack of robustness against speech variability and noise contamination. The goal of this dissertation is to investigate these critical robustness issues, put forth different ways to address them and finally present an ASR architecture based upon these robustness criteria.
Acoustic variations adversely affect the performance of current phone-based ASR systems, in which speech is modeled as `beads-on-a-string', where the beads are the individual phone units. While phone units are distinctive in cognitive domain, they are varying in the physical domain and their variation occurs due to a combination of factors including speech style, speaking rate etc.; a phenomenon commonly known as `coarticulation'. Traditional ASR systems address such coarticulatory variations by using contextualized phone-units such as triphones. Articulatory phonology accounts for coarticulatory variations by modeling speech as a constellation of constricting actions known as articulatory gestures. In such a framework, speech variations such as coarticulation and lenition are accounted for by gestural overlap in time and gestural reduction in space. To realize a gesture-based ASR system, articulatory gestures have to be inferred from the acoustic signal. At the initial stage of this research an initial study was performed using synthetically generated speech to obtain a proof-of-concept that articulatory gestures can indeed be recognized from the speech signal. It was observed that having vocal tract constriction trajectories (TVs) as intermediate representation facilitated the gesture recognition task from the speech signal.
Presently no natural speech database contains articulatory gesture annotation; hence an automated iterative time-warping architecture is proposed that can annotate any natural speech database with articulatory gestures and TVs. Two natural speech databases: X-ray microbeam and Aurora-2 were annotated, where the former was used to train a TV-estimator and the latter was used to train a Dynamic Bayesian Network (DBN) based ASR architecture. The DBN architecture used two sets of observation: (a) acoustic features in the form of mel-frequency cepstral coefficients (MFCCs) and (b) TVs (estimated from the acoustic speech signal). In this setup the articulatory gestures were modeled as hidden random variables, hence eliminating the necessity for explicit gesture recognition. Word recognition results using the DBN architecture indicate that articulatory representations not only can help to account for coarticulatory variations but can also significantly improve the noise robustness of ASR system
Practical Hidden Voice Attacks against Speech and Speaker Recognition Systems
Voice Processing Systems (VPSes), now widely deployed, have been made
significantly more accurate through the application of recent advances in
machine learning. However, adversarial machine learning has similarly advanced
and has been used to demonstrate that VPSes are vulnerable to the injection of
hidden commands - audio obscured by noise that is correctly recognized by a VPS
but not by human beings. Such attacks, though, are often highly dependent on
white-box knowledge of a specific machine learning model and limited to
specific microphones and speakers, making their use across different acoustic
hardware platforms (and thus their practicality) limited. In this paper, we
break these dependencies and make hidden command attacks more practical through
model-agnostic (blackbox) attacks, which exploit knowledge of the signal
processing algorithms commonly used by VPSes to generate the data fed into
machine learning systems. Specifically, we exploit the fact that multiple
source audio samples have similar feature vectors when transformed by acoustic
feature extraction algorithms (e.g., FFTs). We develop four classes of
perturbations that create unintelligible audio and test them against 12 machine
learning models, including 7 proprietary models (e.g., Google Speech API, Bing
Speech API, IBM Speech API, Azure Speaker API, etc), and demonstrate successful
attacks against all targets. Moreover, we successfully use our maliciously
generated audio samples in multiple hardware configurations, demonstrating
effectiveness across both models and real systems. In so doing, we demonstrate
that domain-specific knowledge of audio signal processing represents a
practical means of generating successful hidden voice command attacks
- …