Practical Hidden Voice Attacks against Speech and Speaker Recognition Systems
Voice Processing Systems (VPSes), now widely deployed, have been made
significantly more accurate through the application of recent advances in
machine learning. However, adversarial machine learning has similarly advanced
and has been used to demonstrate that VPSes are vulnerable to the injection of
hidden commands - audio obscured by noise that is correctly recognized by a VPS
but not by human beings. Such attacks, though, are often highly dependent on
white-box knowledge of a specific machine learning model and limited to
specific microphones and speakers, making their use across different acoustic
hardware platforms (and thus their practicality) limited. In this paper, we
break these dependencies and make hidden command attacks more practical through
model-agnostic (blackbox) attacks, which exploit knowledge of the signal
processing algorithms commonly used by VPSes to generate the data fed into
machine learning systems. Specifically, we exploit the fact that multiple
source audio samples have similar feature vectors when transformed by acoustic
feature extraction algorithms (e.g., FFTs). We develop four classes of
perturbations that create unintelligible audio and test them against 12 machine
learning models, including 7 proprietary models (e.g., Google Speech API, Bing
Speech API, IBM Speech API, Azure Speaker API, etc.), and demonstrate successful
attacks against all targets. Moreover, we successfully use our maliciously
generated audio samples in multiple hardware configurations, demonstrating
effectiveness across both models and real systems. In so doing, we demonstrate
that domain-specific knowledge of audio signal processing represents a
practical means of generating successful hidden voice command attacks.
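The core observation above — that distinct audio samples can collapse to near-identical feature vectors under acoustic feature extraction — can be sketched with a toy example (the signal, noise level, and similarity metric below are illustrative choices, not from the paper): a small time-domain perturbation barely changes the FFT magnitude features a recognizer consumes.

```python
import numpy as np

def fft_features(signal, n_fft=512):
    """Magnitude spectrum: a stand-in for the FFT-based acoustic
    feature extraction step described in the abstract."""
    return np.abs(np.fft.rfft(signal, n=n_fft))

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 512, endpoint=False)
clean = np.sin(2 * np.pi * 50 * t)  # toy "command" audio: a pure tone
# A time-domain perturbation changes how the audio sounds while
# leaving the magnitude spectrum largely intact.
perturbed = clean + 0.05 * rng.standard_normal(clean.shape)

sim = cosine_similarity(fft_features(clean), fft_features(perturbed))
print(round(sim, 3))  # close to 1.0: near-identical feature vectors
```

An attacker exploiting this many-to-one mapping can search for perturbations that sound unintelligible to humans while remaining close, in feature space, to the original command.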
Speaker segmentation and clustering
This survey focuses on two challenging speech processing topics, namely: speaker segmentation and speaker clustering. Speaker segmentation aims at finding speaker change points in an audio stream, whereas speaker clustering aims at grouping speech segments based on speaker characteristics. Model-based, metric-based, and hybrid speaker segmentation algorithms are reviewed. Concerning speaker clustering, deterministic and probabilistic algorithms are examined. A comparative assessment of the reviewed algorithms is undertaken, the advantages and disadvantages of each algorithm are indicated, insight into the algorithms is offered, and deductions as well as recommendations are given. Rich transcription and movie analysis are candidate applications that benefit from combined speaker segmentation and clustering. © 2007 Elsevier B.V. All rights reserved.
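As a toy illustration of the metric-based segmentation family surveyed here (the one-dimensional features, window size, and divergence below are illustrative choices, not taken from the survey), a speaker change point can be located by fitting simple Gaussians to adjacent sliding windows and picking the position where the divergence between them peaks:

```python
import numpy as np

def window_distance(x, y):
    """Symmetric KL divergence between 1-D Gaussians fitted to two
    adjacent windows -- the core idea of metric-based segmentation."""
    m1, v1 = x.mean(), x.var() + 1e-8
    m2, v2 = y.mean(), y.var() + 1e-8
    return 0.5 * ((v1 / v2 + v2 / v1)
                  + (m1 - m2) ** 2 * (1 / v1 + 1 / v2)) - 1.0

def change_point(frames, win=50):
    """Return the frame index with the largest adjacent-window distance."""
    scores = [window_distance(frames[i - win:i], frames[i:i + win])
              for i in range(win, len(frames) - win)]
    return win + int(np.argmax(scores))

rng = np.random.default_rng(1)
# Toy feature stream: "speaker A" then "speaker B" with a shifted mean
frames = np.concatenate([rng.normal(0.0, 1.0, 200),
                         rng.normal(2.0, 1.0, 200)])
cp = change_point(frames)
print(cp)  # near the true boundary at frame 200
```

Real systems apply the same adjacent-window comparison to multivariate cepstral features, often with BIC or generalized likelihood ratio criteria instead of this simple divergence.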
Multilevel and session variability compensated language recognition: ATVS-UAM systems at NIST LRE 2009
J. Gonzalez-Dominguez, I. Lopez-Moreno, J. Franco-Pedroso, D. Ramos, D. T. Toledano, and J. Gonzalez-Rodriguez, "Multilevel and Session Variability Compensated Language Recognition: ATVS-UAM Systems at NIST LRE 2009," IEEE Journal of Selected Topics in Signal Processing, vol. 4, no. 6, pp. 1084–1093, December 2010.
This work presents the systems submitted by the
ATVS Biometric Recognition Group to the 2009 Language Recognition
Evaluation (LRE’09), organized by NIST. New challenges
included in this LRE edition can be summarized by three main
differences with respect to past evaluations. Firstly, the number
of languages to be recognized expanded to 23 languages from 14
in 2007 and 7 in 2005. Secondly, data variability was
increased by including telephone speech excerpts extracted from
Voice of America (VOA) radio broadcasts over the Internet, in
addition to Conversational Telephone Speech (CTS). The third
difference was the volume of data, involving in this evaluation
up to 2 terabytes of speech data for development, which is an
order of magnitude greater than past evaluations. LRE’09 thus
required participants to develop robust systems able not only to
successfully face the session variability problem but also to do
it with reasonable computational resources. ATVS participation
consisted of state-of-the-art acoustic and high-level systems focusing
on these issues. Furthermore, the problem of finding a
proper combination and calibration of the information obtained
at different levels of the speech signal was extensively explored in this
submission. In this work, two original contributions were developed.
The first contribution was applying a session variability
compensation scheme based on Factor Analysis (FA) within the
statistics domain into a SVM-supervector (SVM-SV) approach.
The second contribution was the employment of a novel backend
based on anchor models in order to fuse individual systems
prior to one-vs-all calibration via logistic regression. Results both
in development and evaluation corpora show the robustness and
excellent performance of the submitted systems, exemplified by
our system ranked 2nd in the 30 second open-set condition, with
remarkably scarce computational resources.
This work has been supported by the Spanish Ministry of Education under project TEC2006-13170-C02-01. Javier
Gonzalez-Dominguez also thanks the Spanish Ministry of Education for supporting his doctoral research under project
TEC2006-13141-C03-03. Special thanks are given to Dr. David Van Leeuwen from TNO Human Factors (Utrecht, The
Netherlands) for his strong collaboration, valuable discussions, and ideas. The authors also thank Dr. Patrick Lucey for his
final support on the (non-target) Australian English review of the manuscript.
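The back-end described in the abstract fuses individual systems and then applies one-vs-all calibration via logistic regression. A minimal sketch of the calibration step follows (the synthetic scores, the two-parameter affine form, and the training details are assumptions for illustration, not the authors' exact recipe):

```python
import numpy as np

def fit_calibration(scores, labels, lr=0.1, steps=500):
    """Fit scale a and offset b so that sigmoid(a*score + b) matches
    one-vs-all target/non-target labels -- a minimal stand-in for
    logistic-regression score calibration."""
    a, b = 1.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(a * scores + b)))
        grad = p - labels                 # gradient of the logistic loss
        a -= lr * np.mean(grad * scores)
        b -= lr * np.mean(grad)
    return a, b

rng = np.random.default_rng(2)
# Toy detector scores: target trials score higher than non-target trials
target = rng.normal(2.0, 1.0, 300)
nontarget = rng.normal(-2.0, 1.0, 300)
scores = np.concatenate([target, nontarget])
labels = np.concatenate([np.ones(300), np.zeros(300)])

a, b = fit_calibration(scores, labels)
p_target = 1.0 / (1.0 + np.exp(-(a * 3.0 + b)))      # strong target score
p_nontarget = 1.0 / (1.0 + np.exp(-(a * -3.0 + b)))  # strong non-target score
print(round(p_target, 2), round(p_nontarget, 2))
```

In a one-vs-all language recognition setting, such a mapping is fit per language so that raw detector scores become comparable, well-calibrated posterior-like probabilities before a decision threshold is applied.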