Audio-Visual Target Speaker Enhancement on Multi-Talker Environment using Event-Driven Cameras
We propose a method for audio-visual target speaker enhancement in
multi-talker environments using event-driven cameras. State-of-the-art
audio-visual speech separation methods show that the movement of the facial
landmarks related to speech production carries crucial information. However,
all approaches proposed so far work offline on frame-based video input, making
it difficult to process an audio-visual signal with low latency for online
applications. To overcome this limitation, we propose the use of event-driven
cameras and exploit their data compression, high temporal resolution, and low
latency for low-cost, low-latency motion feature extraction, going towards
online embedded audio-visual speech processing. We use the event-driven
optical flow estimation of the facial landmarks as input to a stacked
Bidirectional LSTM trained to predict an Ideal Amplitude Mask, which is then
used to filter the noisy audio and obtain the audio signal of the target
speaker. The presented approach performs almost on par with the frame-based
approach, with very low latency and computational cost.
Comment: Accepted at ISCAS 202
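As a rough illustration of the mask-and-filter pipeline this abstract describes, the sketch below maps a landmark-motion feature sequence to a per-frame amplitude mask with a stacked bidirectional LSTM and applies it to the noisy magnitude spectrogram. All layer sizes, the 68-landmark feature dimension, and the STFT resolution are assumptions, not the authors' configuration.

```python
# Minimal sketch (not the authors' implementation): a stacked BiLSTM that maps
# visual motion features to an amplitude mask, applied to a noisy magnitude STFT.
import torch
import torch.nn as nn

class MaskEstimator(nn.Module):
    def __init__(self, feat_dim=136, hidden=256, n_freq=257, layers=3):
        super().__init__()
        # Stacked bidirectional LSTM over the motion-feature sequence.
        self.blstm = nn.LSTM(feat_dim, hidden, num_layers=layers,
                             batch_first=True, bidirectional=True)
        # One mask value per time-frequency bin, bounded to [0, 1].
        self.proj = nn.Sequential(nn.Linear(2 * hidden, n_freq), nn.Sigmoid())

    def forward(self, motion_feats):            # (batch, time, feat_dim)
        h, _ = self.blstm(motion_feats)
        return self.proj(h)                     # (batch, time, n_freq) mask

# Usage: multiply the predicted mask with the noisy magnitude spectrogram;
# the mixture phase is assumed to be reused for resynthesis.
model = MaskEstimator()
motion = torch.randn(1, 100, 136)               # hypothetical landmark-motion input
noisy_mag = torch.rand(1, 100, 257)              # |STFT| of the noisy mixture
enhanced_mag = model(motion) * noisy_mag
```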
Practical Hidden Voice Attacks against Speech and Speaker Recognition Systems
Voice Processing Systems (VPSes), now widely deployed, have been made
significantly more accurate through the application of recent advances in
machine learning. However, adversarial machine learning has similarly advanced
and has been used to demonstrate that VPSes are vulnerable to the injection of
hidden commands - audio obscured by noise that is correctly recognized by a VPS
but not by human beings. Such attacks, though, are often highly dependent on
white-box knowledge of a specific machine learning model and are limited to
specific microphones and speakers, which restricts their use across different
acoustic hardware platforms and thus their practicality. In this paper, we
break these dependencies and make hidden command attacks more practical through
model-agnostic (black-box) attacks, which exploit knowledge of the signal
processing algorithms commonly used by VPSes to generate the data fed into
machine learning systems. Specifically, we exploit the fact that multiple
source audio samples have similar feature vectors when transformed by acoustic
feature extraction algorithms (e.g., FFTs). We develop four classes of
perturbations that create unintelligible audio and test them against 12 machine
learning models, including 7 proprietary models (e.g., Google Speech API, Bing
Speech API, IBM Speech API, Azure Speaker API, etc.), and demonstrate successful
attacks against all targets. Moreover, we successfully use our maliciously
generated audio samples in multiple hardware configurations, demonstrating
effectiveness across both models and real systems. In so doing, we demonstrate
that domain-specific knowledge of audio signal processing represents a
practical means of generating successful hidden voice command attacks.
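One concrete way to see the property this attack relies on: reversing a window of audio in the time domain changes how it sounds to a listener but leaves its magnitude-FFT features untouched, so an FFT-based acoustic front end cannot tell the two versions apart. The snippet below is a minimal, assumption-laden illustration of this many-to-one mapping, not the paper's attack tooling or its specific perturbation classes.

```python
# Minimal sketch: two signals that sound very different can share the same
# magnitude-FFT features, because the front end discards phase/ordering detail.
import numpy as np

rng = np.random.default_rng(0)
frame = rng.standard_normal(512)                 # stand-in for one audio frame

perturbed = frame[::-1].copy()                   # reverse the frame in time

# The magnitude spectra are identical, so a feature extractor built on the FFT
# produces (near-)identical feature vectors for both signals.
same = np.allclose(np.abs(np.fft.rfft(frame)), np.abs(np.fft.rfft(perturbed)))
print(same)                                      # True
```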
Audio-Visual Target Speaker Extraction on Multi-Talker Environment using Event-Driven Cameras
In this work, we propose a new method for audio-visual target speaker extraction in multi-talker environments using event-driven cameras. Existing audio-visual speech separation approaches rely on frame-based video to extract visual features, but frame-based cameras usually operate at 30 frames per second, which makes it difficult to process an audio-visual signal with low latency. To overcome this limitation, we propose using event-driven cameras, owing to their high temporal resolution and low latency. Recent work has shown that landmark motion features are very important for obtaining good audio-visual speech separation results. We therefore use event-driven vision sensors, from which motion can be extracted at lower latency and computational cost. A stacked Bidirectional LSTM is trained to predict an Ideal Amplitude Mask, which is then post-processed to obtain a clean audio signal. The performance of our model is close to that of the frame-based approach.
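This abstract hinges on extracting landmark motion directly from the event stream at low latency. As a heavily simplified stand-in for the authors' event-driven motion extraction (the data layout, patch size, and polarity-sum feature are all assumptions), the sketch below accumulates event polarities in small patches around tracked facial landmark positions to form one motion feature per landmark per time slice; sequences of such features would feed the BiLSTM.

```python
# Minimal sketch, under assumed data layouts: turn a stream of (x, y, t, polarity)
# events into per-landmark motion features by accumulating event polarities in a
# small patch around each tracked facial landmark.
import numpy as np

def landmark_motion_features(events, landmarks, patch=5):
    """events: (N, 4) array of x, y, timestamp, polarity for one time slice.
    landmarks: (L, 2) array of landmark pixel positions.
    Returns an (L,)-dim feature of signed event activity per landmark."""
    feats = np.zeros(len(landmarks))
    for i, (lx, ly) in enumerate(landmarks):
        near = (np.abs(events[:, 0] - lx) <= patch) & (np.abs(events[:, 1] - ly) <= patch)
        # Signed polarity sum approximates local brightness-change (motion) strength.
        feats[i] = events[near, 3].sum()
    return feats

# Example: 68 hypothetical landmarks, synthetic events for one 10 ms slice.
rng = np.random.default_rng(0)
events = np.column_stack([rng.integers(0, 304, (1000, 2)),
                          rng.random((1000, 1)) * 0.01,
                          rng.choice([-1, 1], (1000, 1))])
landmarks = rng.integers(0, 304, (68, 2))
print(landmark_motion_features(events, landmarks).shape)   # (68,)
```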
Scenario-Aware Audio-Visual TF-GridNet for Target Speech Extraction
Target speech extraction aims to extract, based on a given conditioning cue,
a target speech signal that is corrupted by interfering sources, such as noise
or competing speakers. Building upon the achievements of the state-of-the-art
(SOTA) time-frequency speaker separation model TF-GridNet, we propose
AV-GridNet, a visually grounded variant that incorporates the face recording of a
target speaker as a conditioning factor during the extraction process.
Recognizing the inherent dissimilarities between speech and noise signals as
interfering sources, we also propose SAV-GridNet, a scenario-aware model that
identifies the type of interfering scenario first and then applies a dedicated
expert model trained specifically for that scenario. Our proposed model
achieves SOTA results on the second COG-MHEAR Audio-Visual Speech Enhancement
Challenge, outperforming other models by a significant margin, objectively and
in a listening test. We also perform an extensive analysis of the results under
the two scenarios.
Comment: Accepted by ASRU 202
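A minimal sketch of the scenario-aware routing idea behind SAV-GridNet follows; the module names, shapes, and two-way speech/noise split are assumptions for illustration, not the published implementation. A lightweight classifier first labels the interference type, then the mixture and the target speaker's visual cue are passed to the expert extractor trained for that scenario.

```python
# Minimal sketch: classify the interfering scenario, then route the mixture to
# a dedicated expert extractor (stand-ins for scenario-specific AV-GridNets).
import torch
import torch.nn as nn

class ScenarioClassifier(nn.Module):
    def __init__(self, n_feat=257, hidden=128, n_scenarios=2):
        super().__init__()
        self.rnn = nn.GRU(n_feat, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_scenarios)

    def forward(self, mix_spec):                 # (batch, time, n_feat)
        _, h = self.rnn(mix_spec)
        return self.head(h[-1])                  # (batch, n_scenarios) logits

def extract(mix_spec, face_cue, classifier, experts):
    """Route each mixture to the expert for its predicted interference type."""
    scenario = classifier(mix_spec).argmax(dim=-1)   # 0: competing speaker, 1: noise
    # For simplicity, assume batch size 1; a real system would batch per scenario.
    expert = experts[int(scenario.item())]
    return expert(mix_spec, face_cue)

# Dummy usage with placeholder experts standing in for trained extractors.
clf = ScenarioClassifier()
experts = {0: lambda m, f: m * 0.9, 1: lambda m, f: m * 0.8}   # placeholders
mix = torch.rand(1, 200, 257)
face = torch.rand(1, 512)
out = extract(mix, face, clf, experts)
```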