Practical Hidden Voice Attacks against Speech and Speaker Recognition Systems
Voice Processing Systems (VPSes), now widely deployed, have been made
significantly more accurate through the application of recent advances in
machine learning. However, adversarial machine learning has similarly advanced
and has been used to demonstrate that VPSes are vulnerable to the injection of
hidden commands - audio obscured by noise that is correctly recognized by a VPS
but not by human beings. Such attacks, though, are often highly dependent on
white-box knowledge of a specific machine learning model and limited to
specific microphones and speakers, making their use across different acoustic
hardware platforms (and thus their practicality) limited. In this paper, we
break these dependencies and make hidden command attacks more practical through
model-agnostic (blackbox) attacks, which exploit knowledge of the signal
processing algorithms commonly used by VPSes to generate the data fed into
machine learning systems. Specifically, we exploit the fact that multiple
source audio samples have similar feature vectors when transformed by acoustic
feature extraction algorithms (e.g., FFTs). We develop four classes of
perturbations that create unintelligible audio and test them against 12 machine
learning models, including 7 proprietary models (e.g., Google Speech API, Bing
Speech API, IBM Speech API, and Azure Speaker API), and demonstrate successful
attacks against all targets. Moreover, we successfully use our maliciously
generated audio samples in multiple hardware configurations, demonstrating
effectiveness across both models and real systems. In so doing, we demonstrate
that domain-specific knowledge of audio signal processing represents a
practical means of generating successful hidden voice command attacks.
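As a minimal sketch of the feature-collision idea (with an illustrative frame size, not the authors' code), the Python snippet below randomizes each frame's FFT phase while keeping magnitudes intact; a feature extractor that consumes only magnitude spectra sees essentially the same input even though the waveform itself has changed.

```python
# Sketch: phase randomization preserves magnitude features while
# altering the waveform. Frame size is an illustrative assumption.
import numpy as np

def random_phase_perturb(audio, frame_len=512, seed=0):
    rng = np.random.default_rng(seed)
    out = audio.astype(np.float64).copy()
    for s in range(0, len(audio) - frame_len + 1, frame_len):
        spec = np.fft.rfft(out[s:s + frame_len])
        mag = np.abs(spec)
        phase = rng.uniform(-np.pi, np.pi, size=mag.shape)
        phase[0] = phase[-1] = 0.0  # keep DC/Nyquist real for an exact round trip
        out[s:s + frame_len] = np.fft.irfft(mag * np.exp(1j * phase), n=frame_len)
    return out

audio = np.random.default_rng(1).standard_normal(4096)
perturbed = random_phase_perturb(audio)
# Per-frame magnitude spectra match, so magnitude-based features are preserved.
print(np.allclose(np.abs(np.fft.rfft(audio[:512])),
                  np.abs(np.fft.rfft(perturbed[:512]))))
```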
Speaker Recognition using Supra-segmental Level Excitation Information
Speaker-specific information present in the excitation signal is mostly viewed from the sub-segmental, segmental, and supra-segmental levels. In this work, the supra-segmental level information is explored for recognizing speakers. An earlier study has shown that the combined use of pitch and epoch strength vectors provides useful supra-segmental information. However, the speaker recognition accuracy achieved by supra-segmental level features is relatively poor compared to source information from the other levels, possibly because the modulation information present at the supra-segmental level of the excitation signal is not manifested properly in pitch and epoch strength vectors. We propose a method to model the supra-segmental level modulation information from residual mel frequency cepstral coefficient (R-MFCC) trajectories. The evidence from R-MFCC trajectories, combined with pitch and epoch strength vectors, is proposed to represent supra-segmental information. Experimental results show that, compared to pitch and epoch strength vectors alone, the proposed approach provides relatively improved performance. Further, the proposed supra-segmental level information is more complementary to the other levels' information.
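To make the R-MFCC idea concrete, here is a minimal sketch assuming librosa and scipy, a stand-in signal, and an illustrative LP order: inverse-filter the waveform with its linear prediction coefficients to obtain the excitation residual, then take MFCCs of the residual; the frame-to-frame trajectories of those coefficients carry the supra-segmental modulation information.

```python
# Sketch of residual-MFCC extraction; frame sizes and LP order are
# illustrative assumptions, not the paper's settings.
import numpy as np
import librosa
import scipy.signal

sr = 16000
y = np.random.default_rng(0).standard_normal(sr)  # stand-in for a speech signal

# LP analysis gives a = [1, a1, ..., ap]; filtering y with a yields the residual.
a = librosa.lpc(y, order=12)
residual = scipy.signal.lfilter(a, [1.0], y)

# MFCCs of the residual (R-MFCC); pitch and epoch strength features
# would be combined with these trajectories downstream.
rmfcc = librosa.feature.mfcc(y=residual, sr=sr, n_mfcc=13,
                             n_fft=512, hop_length=160)
print(rmfcc.shape)  # (13, n_frames); trajectories are modeled across frames
```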
Speaker-following Video Subtitles
We propose a new method for improving the presentation of subtitles in video
(e.g. TV and movies). With conventional subtitles, the viewer has to constantly
look away from the main viewing area to read the subtitles at the bottom of the
screen, which disrupts the viewing experience and causes unnecessary eyestrain.
Our method places on-screen subtitles next to the respective speakers to allow
the viewer to follow the visual content while simultaneously reading the
subtitles. We use novel identification algorithms to detect the speakers based
on audio and visual information. Then the placement of the subtitles is
determined using global optimization. A comprehensive usability study indicated
that our subtitle placement method outperformed both conventional
fixed-position subtitling and another previous dynamic subtitling method in
terms of enhancing the overall viewing experience and reducing eyestrain.
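As a toy illustration of the placement step (not the paper's optimizer), the sketch below picks, for each subtitle, the candidate position that minimizes distance to the speaker while penalizing occlusion of the face; the candidate grid, penalty weight, and coordinates are invented for illustration.

```python
# Toy placement: choose the candidate position closest to the speaker
# that does not cover the speaker's face. All values are illustrative.
import itertools
import math

def place_subtitle(speaker_xy, face_box, candidates, face_penalty=1000.0):
    def cost(pos):
        x0, y0, x1, y1 = face_box
        covers_face = x0 <= pos[0] <= x1 and y0 <= pos[1] <= y1
        return math.dist(pos, speaker_xy) + (face_penalty if covers_face else 0.0)
    return min(candidates, key=cost)

# Candidate anchor points on a 1920x1080 frame.
grid = list(itertools.product(range(160, 1920, 320), range(120, 1080, 240)))
best = place_subtitle(speaker_xy=(700, 400),
                      face_box=(640, 320, 760, 480),
                      candidates=grid)
print(best)
```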
Tune-In: Training Under Negative Environments with Interference for Attention Networks Simulating Cocktail Party Effect
We study the cocktail party problem and propose a novel attention network
called Tune-In, short for training under negative environments with
interference. It first learns two separate spaces of speaker-knowledge and
speech-stimuli based on a shared feature space, where a new block structure is
designed as the building block for all spaces, and then cooperatively solves
different tasks. Between the two spaces, information is exchanged via a novel
cross- and dual-attention mechanism, mimicking the bottom-up and
top-down processes of a human's cocktail party effect. It turns out that
substantially discriminative and generalizable speaker representations can be
learnt in severely interfered conditions via our self-supervised training. The
experimental results verify this seeming paradox. The learnt speaker embedding
has greater discriminative power than a standard speaker verification method;
meanwhile, Tune-In consistently achieves markedly better speech separation
performance, in terms of SI-SNRi and SDRi, than state-of-the-art benchmark
systems in all test modes, and at lower memory and computational cost.
Comment: Accepted in AAAI 2021.
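A minimal sketch of the cross- and dual-attention idea, assuming PyTorch's nn.MultiheadAttention and illustrative dimensions (the paper's actual block structure differs): each space queries the other, mirroring the bottom-up and top-down flows described above.

```python
# Sketch of dual cross-attention between a speaker-knowledge space and a
# speech-stimuli space; dimensions and layers are illustrative assumptions.
import torch
import torch.nn as nn

class DualCrossAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.spk_from_speech = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.speech_from_spk = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, spk, speech):
        # Speaker space queries the speech space (top-down) and vice versa.
        spk2, _ = self.spk_from_speech(spk, speech, speech)
        speech2, _ = self.speech_from_spk(speech, spk, spk)
        return spk + spk2, speech + speech2

spk = torch.randn(2, 10, 256)     # (batch, speaker tokens, dim)
speech = torch.randn(2, 50, 256)  # (batch, speech frames, dim)
out_spk, out_speech = DualCrossAttention()(spk, speech)
print(out_spk.shape, out_speech.shape)
```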
Speech and crosstalk detection in multichannel audio
The analysis of scenarios in which a number of microphones record the activity of speakers, such as in a round-table meeting, presents a number of computational challenges. For example, if each participant wears a microphone, speech from both the microphone's wearer (local speech) and from other participants (crosstalk) is received. The recorded audio can be broadly classified in four ways: local speech, crosstalk plus local speech, crosstalk alone and silence. We describe two experiments related to the automatic classification of audio into these four classes. The first experiment attempted to optimize a set of acoustic features for use with a Gaussian mixture model (GMM) classifier. A large set of potential acoustic features were considered, some of which have been employed in previous studies. The best-performing features were found to be kurtosis, "fundamentalness," and cross-correlation metrics. The second experiment used these features to train an ergodic hidden Markov model classifier. Tests performed on a large corpus of recorded meetings show classification accuracies of up to 96%, and automatic speech recognition performance close to that obtained using ground truth segmentation
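The feature side of the first experiment can be sketched as follows, with illustrative frame sizes and synthetic signals standing in for real meeting audio: per-frame kurtosis of the local channel plus its normalized peak cross-correlation with another participant's channel, scored against one GMM per class.

```python
# Sketch of kurtosis + cross-correlation features with per-class GMM scoring.
import numpy as np
from scipy.stats import kurtosis
from sklearn.mixture import GaussianMixture

def frame_features(local, other, frame=400, hop=160):
    feats = []
    for s in range(0, min(len(local), len(other)) - frame, hop):
        a, b = local[s:s + frame], other[s:s + frame]
        a, b = a - a.mean(), b - b.mean()
        xcorr = np.correlate(a, b, mode="full")
        denom = np.sqrt((a ** 2).sum() * (b ** 2).sum()) + 1e-12
        feats.append([kurtosis(a), np.abs(xcorr).max() / denom])
    return np.array(feats)

rng = np.random.default_rng(0)
local, other = rng.standard_normal(16000), rng.standard_normal(16000)
X = frame_features(local, other)

# One GMM per class; a frame is assigned to the class whose GMM scores it
# highest. For a runnable sketch all GMMs are fit on the same unlabeled
# features; in practice each is trained on frames labeled with its class.
classes = ["local", "crosstalk+local", "crosstalk", "silence"]
gmms = {c: GaussianMixture(n_components=4, random_state=0).fit(X) for c in classes}
print(max(classes, key=lambda c: gmms[c].score(X[:1])))
```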
Selective Kernel Attention for Robust Speaker Verification
Recent state-of-the-art speaker verification architectures adopt multi-scale
processing and frequency-channel attention techniques. However, their full
potential may not have been exploited, because these techniques' receptive
fields are fixed: most convolutional layers operate with preset kernel
sizes such as 1, 3 or 5. We aim to further improve this line of research by
introducing a selective kernel attention (SKA) mechanism. The SKA mechanism
allows each convolutional layer to adaptively select the kernel size in a
data-driven fashion based on an attention mechanism that exploits both
frequency and channel domain using the previous layer's output. We propose
three module variants using the SKA mechanism whereby two modules are applied
in front of an ECAPA-TDNN model, and the other is combined with the Res2Net
backbone block. Experimental results demonstrate that our proposed model
consistently outperforms the conventional counterpart on the three different
evaluation protocols in terms of both equal error rate and minimum detection
cost function. In addition, we present a detailed analysis that helps
understand how the SKA module works.
Comment: Submitted to INTERSPEECH 2022. 5 pages, 3 figures, 1 table.
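A minimal SKNet-style sketch of the SKA idea for 1-D speech features, with assumed layer sizes rather than the paper's exact modules: parallel convolutions with kernel sizes 1, 3, and 5, gated per channel by a softmax computed from globally pooled statistics.

```python
# Sketch of selective kernel attention over parallel 1-D convolutions;
# channel counts and the gating MLP are illustrative assumptions.
import torch
import torch.nn as nn

class SKAttention1d(nn.Module):
    def __init__(self, channels=64, kernels=(1, 3, 5), reduction=4):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv1d(channels, channels, k, padding=k // 2) for k in kernels])
        hidden = max(channels // reduction, 8)
        self.fc = nn.Sequential(nn.Linear(channels, hidden), nn.ReLU(),
                                nn.Linear(hidden, channels * len(kernels)))
        self.k = len(kernels)

    def forward(self, x):                                      # x: (B, C, T)
        feats = torch.stack([b(x) for b in self.branches], 1)  # (B, K, C, T)
        gate = self.fc(feats.sum(dim=1).mean(dim=-1))          # (B, K*C)
        gate = gate.view(-1, self.k, x.size(1), 1).softmax(dim=1)  # over kernels
        return (gate * feats).sum(dim=1)                       # (B, C, T)

x = torch.randn(2, 64, 200)
print(SKAttention1d()(x).shape)  # torch.Size([2, 64, 200])
```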
Contrastive Separative Coding for Self-supervised Representation Learning
To extract robust deep representations from long sequential modeling of
speech data, we propose a self-supervised learning approach, namely Contrastive
Separative Coding (CSC). Our key finding is to learn such representations by
separating the target signal from contrastive interfering signals. First, a
multi-task separative encoder is built to extract shared separable and
discriminative embeddings; second, we propose a powerful cross-attention
mechanism performed over speaker representations across various interfering
conditions, allowing the model to focus on and globally aggregate the most
critical information to answer the "query" (current bottom-up embedding) while
paying less attention to interfering, noisy, or irrelevant parts; finally, we
form a new probabilistic contrastive loss which estimates and maximizes the
mutual information between the representations and the global speaker vector.
While most prior unsupervised methods have focused on predicting the future,
neighboring, or missing samples, we take a different perspective of predicting
the interfered samples. Moreover, our contrastive separative loss is free from
negative sampling. The experiment demonstrates that our approach can learn
useful representations achieving a strong speaker verification performance in
adverse conditions.
Comment: Accepted in ICASSP 2021.
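A heavily hedged sketch of the loss described above, with an assumed bilinear scorer and shapes: the global speaker vector is scored against the same utterance's embeddings under different interfering conditions, and the target condition's probability is maximized, so the contrast comes from interference rather than from sampled negatives.

```python
# Sketch of a separative contrastive objective; the bilinear scorer,
# shapes, and target indexing are assumptions, not the paper's exact loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SeparativeLoss(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.W = nn.Parameter(torch.eye(dim))  # bilinear scoring matrix

    def forward(self, global_spk, cond_emb, target_idx=0):
        # global_spk: (B, D); cond_emb: (B, C, D) over C interfering conditions.
        scores = torch.einsum("bd,de,bce->bc", global_spk, self.W, cond_emb)
        target = torch.full((cond_emb.size(0),), target_idx, dtype=torch.long)
        return F.cross_entropy(scores, target)

g = torch.randn(4, 128)     # global speaker vectors
z = torch.randn(4, 3, 128)  # embeddings under 3 conditions; index 0 = target
print(SeparativeLoss()(g, z).item())
```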