136 research outputs found
Anonymizing Speech: Evaluating and Designing Speaker Anonymization Techniques
The growing use of voice user interfaces has led to a surge in the collection
and storage of speech data. While data collection allows for the development of
efficient tools powering most speech services, it also poses serious privacy
issues for users as centralized storage makes private personal speech data
vulnerable to cyber threats. With the increasing use of voice-based digital
assistants like Amazon's Alexa, Google's Home, and Apple's Siri, and with the
increasing ease with which personal speech data can be collected, the risk of
malicious use of voice-cloning and speaker/gender/pathological/etc. recognition
has increased.
This thesis proposes solutions for anonymizing speech and evaluating the
degree of the anonymization. In this work, anonymization refers to making
personal speech data unlinkable to an identity while maintaining the usefulness
(utility) of the speech signal (e.g., access to linguistic content). We start
by identifying several challenges that evaluation protocols need to consider to
evaluate the degree of privacy protection properly. We clarify how
anonymization systems must be configured for evaluation purposes and highlight
that many practical deployment configurations do not permit privacy evaluation.
Furthermore, we study and examine the most common voice conversion-based
anonymization system and identify its weak points before suggesting new methods
to overcome some limitations. We isolate all components of the anonymization
system to evaluate the degree of speaker PPI associated with each of them.
Then, we propose several transformation methods for each component to reduce as
much as possible speaker PPI while maintaining utility. We promote
anonymization algorithms based on quantization-based transformation as an
alternative to the most-used and well-known noise-based approach. Finally, we
endeavor a new attack method to invert anonymization.Comment: PhD Thesis Pierre Champion | Universit\'e de Lorraine - INRIA Nancy |
for associated source code, see https://github.com/deep-privacy/SA-toolki
Security and Privacy Problems in Voice Assistant Applications: A Survey
Voice assistant applications have become omniscient nowadays. Two models that
provide the two most important functions for real-life applications (i.e.,
Google Home, Amazon Alexa, Siri, etc.) are Automatic Speech Recognition (ASR)
models and Speaker Identification (SI) models. According to recent studies,
security and privacy threats have also emerged with the rapid development of
the Internet of Things (IoT). The security issues researched include attack
techniques toward machine learning models and other hardware components widely
used in voice assistant applications. The privacy issues include technical-wise
information stealing and policy-wise privacy breaches. The voice assistant
application takes a steadily growing market share every year, but their privacy
and security issues never stopped causing huge economic losses and endangering
users' personal sensitive information. Thus, it is important to have a
comprehensive survey to outline the categorization of the current research
regarding the security and privacy problems of voice assistant applications.
This paper concludes and assesses five kinds of security attacks and three
types of privacy threats in the papers published in the top-tier conferences of
cyber security and voice domain.Comment: 5 figure
Normal-to-Lombard Adaptation of Speech Synthesis Using Long Short-Term Memory Recurrent Neural Networks
In this article, three adaptation methods are compared based on how well they change the speaking style of a neural network based text-to-speech (TTS) voice. The speaking style conversion adopted here is from normal to Lombard speech. The selected adaptation methods are: auxiliary features (AF), learning hidden unit contribution (LHUC), and fine-tuning (FT). Furthermore, four state-of-the-art TTS vocoders are compared in the same context. The evaluated vocoders are: GlottHMM, GlottDNN, STRAIGHT, and pulse model in log-domain (PML). Objective and subjective evaluations were conducted to study the performance of both the adaptation methods and the vocoders. In the subjective evaluations, speaking style similarity and speech intelligibility were assessed. In addition to acoustic model adaptation, phoneme durations were also adapted from normal to Lombard with the FT adaptation method. In objective evaluations and speaking style similarity tests, we found that the FT method outperformed the other two adaptation methods. In speech intelligibility tests, we found that there were no significant differences between vocoders although the PML vocoder showed slightly better performance compared to the three other vocoders.Peer reviewe
Deep Learning for Audio Signal Processing
Given the recent surge in developments of deep learning, this article
provides a review of the state-of-the-art deep learning techniques for audio
signal processing. Speech, music, and environmental sound processing are
considered side-by-side, in order to point out similarities and differences
between the domains, highlighting general methods, problems, key references,
and potential for cross-fertilization between areas. The dominant feature
representations (in particular, log-mel spectra and raw waveform) and deep
learning models are reviewed, including convolutional neural networks, variants
of the long short-term memory architecture, as well as more audio-specific
neural network models. Subsequently, prominent deep learning application areas
are covered, i.e. audio recognition (automatic speech recognition, music
information retrieval, environmental sound detection, localization and
tracking) and synthesis and transformation (source separation, audio
enhancement, generative models for speech, sound, and music synthesis).
Finally, key issues and future questions regarding deep learning applied to
audio signal processing are identified.Comment: 15 pages, 2 pdf figure
- …