Practical Hidden Voice Attacks against Speech and Speaker Recognition Systems
Voice Processing Systems (VPSes), now widely deployed, have been made
significantly more accurate through the application of recent advances in
machine learning. However, adversarial machine learning has similarly advanced
and has been used to demonstrate that VPSes are vulnerable to the injection of
hidden commands - audio obscured by noise that is correctly recognized by a VPS
but not by human beings. Such attacks, though, are often highly dependent on
white-box knowledge of a specific machine learning model and limited to
specific microphones and speakers, making their use across different acoustic
hardware platforms (and thus their practicality) limited. In this paper, we
break these dependencies and make hidden command attacks more practical through
model-agnostic (black-box) attacks, which exploit knowledge of the signal
processing algorithms commonly used by VPSes to generate the data fed into
machine learning systems. Specifically, we exploit the fact that multiple
source audio samples have similar feature vectors when transformed by acoustic
feature extraction algorithms (e.g., FFTs). We develop four classes of
perturbations that create unintelligible audio and test them against 12 machine
learning models, including 7 proprietary models (e.g., the Google Speech API, Bing Speech API, IBM Speech API, and Azure Speaker API), and demonstrate successful
attacks against all targets. Moreover, we successfully use our maliciously
generated audio samples in multiple hardware configurations, demonstrating
effectiveness across both models and real systems. In so doing, we demonstrate
that domain-specific knowledge of audio signal processing represents a
practical means of generating successful hidden voice command attacks.
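A minimal NumPy sketch of the underlying observation, assuming a toy magnitude-FFT front end: the per-frame phase scrambling below is only an illustrative stand-in for the paper's perturbation classes, but it shows how a waveform can be made to sound very different while the magnitude-spectrum features many recognizers consume stay essentially unchanged.

```python
import numpy as np

FRAME = 512  # non-overlapping frames keep the round trip exact

def magnitude_features(signal):
    """Per-frame magnitude spectra: a toy stand-in for an acoustic front end."""
    n = len(signal) // FRAME * FRAME
    frames = signal[:n].reshape(-1, FRAME)
    return np.abs(np.fft.rfft(frames, axis=1))

def phase_scramble(signal, rng):
    """Randomize each frame's phase while preserving its magnitude spectrum."""
    n = len(signal) // FRAME * FRAME
    frames = signal[:n].reshape(-1, FRAME)
    spec = np.fft.rfft(frames, axis=1)
    phase = rng.uniform(-np.pi, np.pi, spec.shape)
    phase[:, 0] = np.angle(spec[:, 0])    # keep the DC bin ...
    phase[:, -1] = np.angle(spec[:, -1])  # ... and the Nyquist bin unchanged (they must stay real)
    scrambled = np.abs(spec) * np.exp(1j * phase)
    return np.fft.irfft(scrambled, n=FRAME, axis=1).reshape(-1)

rng = np.random.default_rng(0)
audio = rng.standard_normal(16000)        # synthetic stand-in for a 1-second command at 16 kHz
perturbed = phase_scramble(audio, rng)

# The two waveforms differ substantially ...
print(np.max(np.abs(audio[:len(perturbed)] - perturbed)))
# ... yet their magnitude-spectrum features are numerically almost identical.
print(np.max(np.abs(magnitude_features(audio) - magnitude_features(perturbed))))
```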
NeuroHeed: Neuro-Steered Speaker Extraction using EEG Signals
Humans possess the remarkable ability to selectively attend to a single
speaker amidst competing voices and background noise, known as selective
auditory attention. Recent studies in auditory neuroscience indicate a strong
correlation between the attended speech signal and the neuronal activity it elicits in the listener's brain; the latter can be measured using affordable and non-intrusive electroencephalography (EEG) devices. In this study, we
present NeuroHeed, a speaker extraction model that leverages EEG signals to
establish a neuronal attractor which is temporally associated with the speech
stimulus, facilitating the extraction of the attended speech signal in a
cocktail party scenario. We propose both an offline and an online NeuroHeed,
with the latter designed for real-time inference. In the online NeuroHeed, we
additionally propose an autoregressive speaker encoder that accumulates previously extracted speech signals to self-enroll the attended speaker's information into an auditory attractor, which retains the attentional momentum over time. Online NeuroHeed then extracts the current window of the speech signal
with guidance from both attractors. Experimental results demonstrate that
NeuroHeed effectively extracts brain-attended speech signals, achieving high
signal quality, excellent perceptual quality, and intelligibility in a
two-speaker scenario.
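How such a dual-attractor, window-by-window scheme can be wired together is sketched below in PyTorch; the module names, layer sizes, and the simple mask-based extraction are illustrative assumptions, not the NeuroHeed architecture itself.

```python
import torch
import torch.nn as nn

class NeuroSteeredSketch(nn.Module):
    """Hypothetical sketch of the online scheme described in the abstract:
    a neuronal attractor from the current EEG window and an auditory attractor
    accumulated from previously extracted speech jointly condition a mask
    estimator. All dimensions and layers are illustrative only."""

    def __init__(self, n_freq=129, eeg_ch=64, dim=128):
        super().__init__()
        self.eeg_enc = nn.GRU(eeg_ch, dim, batch_first=True)   # -> neuronal attractor
        self.spk_enc = nn.GRU(n_freq, dim, batch_first=True)   # -> auditory attractor (self-enrollment)
        self.mask_net = nn.Sequential(
            nn.Linear(n_freq + 2 * dim, dim), nn.ReLU(),
            nn.Linear(dim, n_freq), nn.Sigmoid())

    def forward(self, mix_spec, eeg, past_spec):
        """mix_spec, past_spec: (B, T, n_freq) magnitude spectra; eeg: (B, T, eeg_ch)."""
        _, h_eeg = self.eeg_enc(eeg)
        _, h_spk = self.spk_enc(past_spec)
        attractors = torch.cat([h_eeg[-1], h_spk[-1]], dim=-1)     # (B, 2*dim)
        cond = attractors.unsqueeze(1).expand(-1, mix_spec.size(1), -1)
        mask = self.mask_net(torch.cat([mix_spec, cond], dim=-1))  # time-frequency mask
        return mask * mix_spec                                     # extracted-speech estimate

# Window-by-window inference: each extracted window is fed back as self-enrollment.
model = NeuroSteeredSketch()
past = torch.zeros(1, 1, 129)                                      # empty enrollment at the start
for mix_win, eeg_win in zip(torch.rand(5, 1, 50, 129), torch.rand(5, 1, 50, 64)):
    est = model(mix_win, eeg_win, past)
    past = torch.cat([past, est.detach()], dim=1)                  # accumulate past output
```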
Talking Nets: A Multi-Agent Connectionist Approach to Communication and Trust between Individuals
A multi-agent connectionist model is proposed that consists of a collection of individual recurrent networks that communicate with each other, and as such is a network of networks. The individual recurrent networks simulate the process of information uptake, integration, and memorization within individual agents, while the communication of beliefs and opinions between agents is propagated along connections between the individual networks. A crucial aspect of belief updating based on information from other agents is trust in the information provided. In the model, trust is determined by the consistency of incoming information with the receiving agent's existing beliefs, and it results in changes to the connections between individual networks, called trust weights. Activation spreading and weight change between individual networks are thus analogous to standard connectionist processes, although trust weights take on a specific function. Specifically, they lead to a selective propagation, and thus a filtering out, of less reliable information, and they implement Grice's (1975) maxims of quality and quantity in communication. The unique contribution of communicative mechanisms beyond the intra-personal processing of individual networks was explored in simulations of key phenomena involving persuasive communication and polarization, lexical acquisition, the spreading of stereotypes and rumors, and the failure to share unique information in group decisions.
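The trust-weighted filtering described above can be illustrated with a toy NumPy simulation; the consistency measure, learning rates, and update rules below are simplified stand-ins for the paper's recurrent connectionist dynamics, not a reproduction of them.

```python
import numpy as np

def consistency(belief, message):
    """Consistency of an incoming message with the receiver's current beliefs,
    here a simple similarity in [0, 1] rather than recurrent network dynamics."""
    return 1.0 - np.mean(np.abs(belief - message))

def communicate(beliefs, trust, lr_belief=0.3, lr_trust=0.2):
    """One round of communication between all agent pairs.
    beliefs: (n_agents, n_features) activations in [0, 1]
    trust:   (n_agents, n_agents) trust weights, trust[r, s] = r's trust in s."""
    new_beliefs = beliefs.copy()
    for r in range(len(beliefs)):            # receiver
        for s in range(len(beliefs)):        # sender
            if r == s:
                continue
            msg = beliefs[s]
            c = consistency(beliefs[r], msg)
            # Trust grows when the message agrees with existing beliefs, shrinks otherwise.
            trust[r, s] += lr_trust * (c - trust[r, s])
            # Trust weights gate how strongly the message moves the receiver's beliefs,
            # so information from low-trust sources is selectively filtered out.
            new_beliefs[r] += lr_belief * trust[r, s] * (msg - beliefs[r])
    return np.clip(new_beliefs, 0.0, 1.0), trust

rng = np.random.default_rng(1)
beliefs = rng.uniform(size=(4, 6))           # four agents, six belief features
trust = np.full((4, 4), 0.5)
for _ in range(20):
    beliefs, trust = communicate(beliefs, trust)
```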
An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation
Speech enhancement and speech separation are two related tasks whose purpose is to extract a single target speech signal or several target speech signals, respectively, from a mixture of sounds generated by multiple sources. Traditionally, these tasks have
been tackled using signal processing and machine learning techniques applied to
the available acoustic signals. Since the visual aspect of speech is
essentially unaffected by the acoustic environment, visual information from the
target speakers, such as lip movements and facial expressions, has also been
used for speech enhancement and speech separation systems. In order to
efficiently fuse acoustic and visual information, researchers have exploited
the flexibility of data-driven approaches, specifically deep learning,
achieving strong performance. The steady stream of proposed techniques for extracting features and fusing multimodal information has highlighted
the need for an overview that comprehensively describes and discusses
audio-visual speech enhancement and separation based on deep learning. In this
paper, we provide a systematic survey of this research topic, focusing on the
main elements that characterise the systems in the literature: acoustic
features; visual features; deep learning methods; fusion techniques; training
targets and objective functions. In addition, we review deep-learning-based
methods for speech reconstruction from silent videos and audio-visual sound
source separation for non-speech signals, since these methods can be more or
less directly applied to audio-visual speech enhancement and separation.
Finally, we survey commonly employed audio-visual speech datasets, given their
central role in the development of data-driven approaches, and evaluation
methods, because they are generally used to compare different systems and
determine their performance.
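A minimal PyTorch skeleton of the generic pipeline the survey organizes its discussion around (acoustic features, visual features, fusion, a training target, and an objective function); the specific layers, dimensions, and concatenation-based fusion are illustrative choices, not any particular surveyed system.

```python
import torch
import torch.nn as nn

class AVEnhancerSketch(nn.Module):
    """Generic audio-visual enhancement skeleton: one acoustic stream, one visual
    stream, a simple fusion step, and a time-frequency mask as the training target.
    Dimensions and layers are illustrative assumptions only."""

    def __init__(self, n_freq=257, lip_dim=512, dim=256):
        super().__init__()
        self.audio_enc = nn.LSTM(n_freq, dim, batch_first=True)   # acoustic features
        self.video_enc = nn.LSTM(lip_dim, dim, batch_first=True)  # visual features (e.g., lip embeddings)
        self.fusion = nn.Linear(2 * dim, dim)                     # concatenation fusion
        self.mask_head = nn.Sequential(nn.Linear(dim, n_freq), nn.Sigmoid())

    def forward(self, noisy_spec, lip_feats):
        a, _ = self.audio_enc(noisy_spec)     # (B, T, dim)
        v, _ = self.video_enc(lip_feats)      # (B, T, dim), assumed time-aligned to audio frames
        fused = torch.relu(self.fusion(torch.cat([a, v], dim=-1)))
        mask = self.mask_head(fused)          # time-frequency mask in [0, 1]
        return mask * noisy_spec              # enhanced magnitude spectrogram

# One hypothetical training target and objective: a masked magnitude scored against
# the clean magnitude with MSE; the survey discusses many alternatives.
model = AVEnhancerSketch()
noisy, lips, clean = torch.rand(2, 100, 257), torch.rand(2, 100, 512), torch.rand(2, 100, 257)
loss = nn.functional.mse_loss(model(noisy, lips), clean)
loss.backward()
```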
Relating EEG to continuous speech using deep neural networks: a review
Objective. When a person listens to continuous speech, a corresponding
response is elicited in the brain and can be recorded using
electroencephalography (EEG). Linear models are presently used to relate the
EEG recording to the corresponding speech signal. The ability of linear models
to find a mapping between these two signals is used as a measure of neural
tracking of speech. Such models are limited as they assume linearity in the
EEG-speech relationship, which omits the nonlinear dynamics of the brain. As an
alternative, deep learning models have recently been used to relate EEG to
continuous speech, especially in auditory attention decoding (AAD) and
single-speech-source paradigms. Approach. This paper reviews and comments on
deep-learning-based studies that relate EEG to continuous speech in AAD and
single-speech-source paradigms. We point out recurrent methodological pitfalls
and the need for a standard benchmark of model analysis. Main results. We
gathered 29 studies. The main methodological issues we found are biased cross-validation, data leakage leading to over-fitted models, and data sizes disproportionate to model complexity. In addition, we
address requirements for a standard benchmark model analysis, such as public
datasets, common evaluation metrics, and good practices for the match-mismatch
task. Significance. We are the first to present a review paper summarizing the
main deep-learning-based studies that relate EEG to speech while addressing
methodological pitfalls and important considerations for this newly expanding
field. Our study is particularly relevant given the growing application of deep
learning in EEG-speech decoding.
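The linear baseline that the review contrasts with deep models, together with the flavor of the match-mismatch evaluation, can be sketched with a toy backward (decoding) model in NumPy; the synthetic data, lag range, and ridge regularization below are illustrative assumptions rather than settings from any reviewed study.

```python
import numpy as np

def lagged(eeg, max_lag):
    """Stack time-lagged copies of each EEG channel (a standard linear-decoder design matrix)."""
    cols = [np.roll(eeg, lag, axis=0) for lag in range(max_lag)]
    X = np.concatenate(cols, axis=1)
    X[:max_lag] = 0.0                         # zero out samples wrapped around by np.roll
    return X

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression: w = (X'X + aI)^(-1) X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

def pearson(a, b):
    a, b = a - a.mean(), b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Toy data: EEG (T samples x C channels) that weakly tracks a speech envelope.
rng = np.random.default_rng(0)
T, C, max_lag = 5000, 16, 32
envelope = rng.standard_normal(T)
eeg = 0.1 * envelope[:, None] + rng.standard_normal((T, C))

# Backward model: reconstruct the envelope from lagged EEG, evaluate on held-out data.
split = T // 2
X = lagged(eeg, max_lag)
w = ridge_fit(X[:split], envelope[:split])
rec = X[split:] @ w
print("neural-tracking correlation:", pearson(rec, envelope[split:]))

# Match-mismatch flavor: the reconstruction should correlate better with the
# matched envelope segment than with an unrelated (mismatched) segment.
mismatch = envelope[:split]
print("mismatched correlation:", pearson(rec, mismatch))
```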