Scaling and Bias Codes for Modeling Speaker-Adaptive DNN-based Speech Synthesis Systems
Most neural-network based speaker-adaptive acoustic models for speech
synthesis can be categorized into either layer-based or input-code approaches.
Although both approaches have their own pros and cons, most existing works on
speaker adaptation focus on improving one or the other. In this paper, we
first systematically review the common principles of neural-network-based
speaker-adaptive models, and then show that these approaches can be represented
in a unified framework and generalized further. More specifically, we
introduce the use of scaling and bias codes as generalized means for
speaker-adaptive transformation. By utilizing these codes, we can create a more
efficient factorized speaker-adaptive model and capture advantages of both
approaches while reducing their disadvantages. The experiments show that the
proposed method can improve the performance of speaker adaptation compared with
speaker adaptation based on the conventional input code.
Comment: Accepted for the 2018 IEEE Workshop on Spoken Language Technology (SLT), Athens, Greece
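As a concrete illustration of the idea, here is a minimal PyTorch sketch of a hidden layer modulated by speaker-dependent scaling and bias codes; the module, dimensions, and the sigmoid bound on the scale are illustrative assumptions rather than the paper's exact formulation:

```python
import torch
import torch.nn as nn

class ScaleBiasAdaptation(nn.Module):
    """Sketch: modulate a hidden layer with per-speaker scaling/bias codes."""

    def __init__(self, n_speakers: int, hidden_dim: int, code_dim: int = 32):
        super().__init__()
        self.codes = nn.Embedding(n_speakers, code_dim)
        # Project the compact speaker code to a full scaling vector
        # and a full bias vector for the hidden layer.
        self.to_scale = nn.Linear(code_dim, hidden_dim)
        self.to_bias = nn.Linear(code_dim, hidden_dim)

    def forward(self, h: torch.Tensor, speaker_id: torch.Tensor) -> torch.Tensor:
        # h: (batch, hidden_dim) activations of some shared layer
        c = self.codes(speaker_id)                 # (batch, code_dim)
        scale = torch.sigmoid(self.to_scale(c))    # bounded elementwise scale
        bias = self.to_bias(c)
        return scale * h + bias                    # speaker-adaptive transform
```

Under such a factorization, adapting to an unseen speaker amounts to estimating only the small code vector while the shared projections stay fixed.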
Losses Can Be Blessings: Routing Self-Supervised Speech Representations Towards Efficient Multilingual and Multitask Speech Processing
Self-supervised learning (SSL) for rich speech representations has achieved
empirical success in low-resource Automatic Speech Recognition (ASR) and other
speech processing tasks, as it mitigates the need for large amounts of
transcribed speech and has thus driven a growing demand for on-device ASR and
other speech processing. However, advanced speech SSL models have become
increasingly large, which conflicts with limited on-device resources. This gap
can be even more severe in multilingual/multitask scenarios that require
simultaneously recognizing multiple languages or executing multiple speech
processing tasks. Additionally, strongly overparameterized speech SSL models
tend to suffer from overfitting when finetuned on low-resource speech corpora.
This work aims to enhance the practical usage of speech SSL models, achieving
both improved efficiency and reduced overfitting, via our proposed S³-Router
framework, which discovers for the first time that simply discarding no more
than 10% of model weights, by finetuning only the model connections of speech
SSL models, can achieve better accuracy than standard weight finetuning on
downstream speech processing tasks. More importantly,
S³-Router can serve as an all-in-one technique to enable (1) a new
finetuning scheme, (2) an efficient multilingual/multitask solution, (3) a
state-of-the-art ASR pruning technique, and (4) a new tool to quantitatively
analyze the learned speech representations. We believe S³-Router provides
a new perspective on the practical deployment of speech SSL models. Our code is
available at: https://github.com/GATECH-EIC/S3-Router
Comment: Accepted at NeurIPS 2022
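As a rough illustration of finetuning connections instead of weights, the generic supermask-style sketch below learns a binary mask over a frozen linear layer with a straight-through estimator; it reflects our reading of the abstract, not the repository's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedLinear(nn.Module):
    """Sketch: freeze pretrained weights and finetune only a binary
    connection mask, discarding the lowest-scored 10% of weights."""

    def __init__(self, pretrained: nn.Linear, discard: float = 0.10):
        super().__init__()
        self.weight = nn.Parameter(pretrained.weight.detach(), requires_grad=False)
        self.bias = nn.Parameter(pretrained.bias.detach(), requires_grad=False)
        # Trainable real-valued scores decide which connections survive.
        self.scores = nn.Parameter(0.01 * torch.randn_like(self.weight))
        self.discard = discard

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        k = max(1, int(self.scores.numel() * self.discard))
        threshold = self.scores.detach().flatten().kthvalue(k).values
        hard = (self.scores > threshold).float()
        # Straight-through estimator: hard mask forward, identity backward.
        mask = hard + self.scores - self.scores.detach()
        return F.linear(x, self.weight * mask, self.bias)
```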
Conditioning Text-to-Speech synthesis on dialect accent: a case study
Modern text-to-speech systems are modular in many different ways. In recent years, end-users have gained the ability to control speech attributes such as degree of emotion, rhythm and timbre, along with other suprasegmental features. More ambitious objectives involve modelling combinations of speakers and languages, e.g. to enable cross-speaker language transfer. However, no prior work has addressed the more fine-grained analysis of regional accents. To fill this gap, in this thesis we present practical end-to-end solutions for synthesising speech while controlling within-country variations of the same language, and we do so for 6 different dialects of the British Isles. In particular, we first conduct an extensive study of the speaker verification field and tweak state-of-the-art embedding models to work with dialect accents. We then adapt standard acoustic models and voice conversion systems by conditioning them on dialect accent representations, and finally compare our custom pipelines with a cutting-edge end-to-end architecture from the multilingual world. Results show that the adopted models are suitable and have enough capacity to accomplish the task of regional accent conversion. Indeed, we are able to produce speech closely resembling the selected speaker and dialect accent, with the most accurate synthesis obtained via careful fine-tuning of the multilingual model to the multi-dialect case. Finally, we delineate the limitations of our multi-stage approach and propose practical mitigations to be explored in future work.
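As one plausible way to realize the conditioning described above, the PyTorch sketch below concatenates a learned dialect-accent embedding to a text encoder's hidden states before the acoustic decoder; the architecture, names, and dimensions are our own assumptions, not the thesis pipeline:

```python
import torch
import torch.nn as nn

class AccentConditionedEncoder(nn.Module):
    """Sketch: inject a dialect-accent embedding into encoder states."""

    def __init__(self, text_encoder: nn.Module, enc_dim: int = 256,
                 n_accents: int = 6, accent_dim: int = 64):
        super().__init__()
        self.text_encoder = text_encoder
        self.accent_emb = nn.Embedding(n_accents, accent_dim)
        self.proj = nn.Linear(enc_dim + accent_dim, enc_dim)

    def forward(self, phonemes: torch.Tensor, accent_id: torch.Tensor):
        h = self.text_encoder(phonemes)              # (batch, T, enc_dim)
        a = self.accent_emb(accent_id)               # (batch, accent_dim)
        a = a[:, None, :].expand(-1, h.size(1), -1)  # broadcast over time
        return self.proj(torch.cat([h, a], dim=-1))  # accent-aware states
```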
Anonymizing Speech: Evaluating and Designing Speaker Anonymization Techniques
The growing use of voice user interfaces has led to a surge in the collection
and storage of speech data. While data collection allows for the development of
efficient tools powering most speech services, it also poses serious privacy
issues for users as centralized storage makes private personal speech data
vulnerable to cyber threats. With the growing use of voice-based digital
assistants like Amazon's Alexa, Google Home, and Apple's Siri, and the
increasing ease with which personal speech data can be collected, the risk of
malicious voice cloning and of recognition of speaker identity, gender,
pathology, and other traits has increased.
This thesis proposes solutions for anonymizing speech and for evaluating the
degree of anonymization. In this work, anonymization refers to making
personal speech data unlinkable to an identity while maintaining the usefulness
(utility) of the speech signal (e.g., access to linguistic content). We start
by identifying several challenges that evaluation protocols need to consider to
evaluate the degree of privacy protection properly. We clarify how
anonymization systems must be configured for evaluation purposes and highlight
that many practical deployment configurations do not permit privacy evaluation.
Furthermore, we examine the most common voice conversion-based
anonymization system and identify its weak points before suggesting new methods
to overcome some limitations. We isolate all components of the anonymization
system to evaluate the degree of speaker PPI (personally identifiable
information) associated with each of them.
Then, we propose several transformation methods for each component to reduce
speaker PPI as much as possible while maintaining utility. We promote
anonymization algorithms based on quantization-based transformation as an
alternative to the most-used and well-known noise-based approach. Finally, we
devise a new attack method to invert anonymization.
Comment: PhD Thesis, Pierre Champion | Université de Lorraine - INRIA Nancy |
for associated source code, see https://github.com/deep-privacy/SA-toolkit
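A toy NumPy sketch of a quantization-based transformation in this spirit (the feature shapes and codebook are illustrative; the thesis applies such transformations to specific components of the anonymization pipeline):

```python
import numpy as np

def quantize(features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each feature vector to its nearest codebook entry, discarding
    fine-grained, potentially speaker-identifying detail while keeping
    the coarse structure that carries linguistic content.

    features: (n_frames, dim); codebook: (n_codes, dim)
    """
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return codebook[dists.argmin(axis=1)]

# Toy usage: 100 frames of 16-d features against a 64-entry codebook.
rng = np.random.default_rng(0)
anonymized = quantize(rng.normal(size=(100, 16)), rng.normal(size=(64, 16)))
```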
Deep Transfer Learning for Automatic Speech Recognition: Towards Better Generalization
Automatic speech recognition (ASR) based on deep learning (DL) has recently
become an important challenge: it requires large-scale training datasets and
high computational and storage resources. Moreover, DL techniques, and machine
learning (ML) approaches in general, assume that training and testing data come
from the same domain, with the same input feature space and data distribution
characteristics. This assumption, however, does not hold in some real-world
artificial intelligence (AI) applications. In addition, there are situations
where gathering real data is challenging, expensive, or rare, so that the data
requirements of DL models cannot be met. Deep transfer learning (DTL) has been
introduced to overcome these issues; it helps develop high-performing models
using real datasets that are small or slightly different from, but related to,
the original training data. This paper presents a comprehensive survey of
DTL-based ASR frameworks to shed light on the latest developments and to help
academics and professionals understand current challenges. Specifically, after
presenting the DTL background, a well-designed taxonomy is adopted to organize
the state of the art. A critical analysis is then conducted to identify the
limitations and advantages of each framework. A comparative study then
highlights the current challenges before deriving opportunities for future
research.
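The basic recipe underlying most of the surveyed frameworks can be sketched in a few lines of PyTorch; the encoder, dimensions, and freezing policy below are placeholders, since the survey covers many DTL variants:

```python
import torch.nn as nn

def build_transfer_model(encoder: nn.Module, encoder_dim: int,
                         n_targets: int, freeze: bool = True) -> nn.Module:
    """Sketch: reuse a source-domain encoder and train a small
    task-specific head on the limited target-domain data."""
    if freeze:
        for p in encoder.parameters():
            p.requires_grad = False  # retain source-domain knowledge
    # Only the new head (and optionally the encoder) is finetuned.
    return nn.Sequential(encoder, nn.Linear(encoder_dim, n_targets))
```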
Deep learning-based automatic analysis of social interactions from wearable data for healthcare applications
PhD Thesis
Social interactions of people with Late Life Depression (LLD) could be an objective measure
of social functioning, given the association between LLD and poor social functioning. The
utilisation of wearable computing technologies is a relatively new approach within the healthcare
and well-being application sectors. Recently, the design and development of wearable
technologies and systems for health and well-being monitoring have attracted the attention
of both the clinical and scientific communities, mainly because current clinical practice relies
on typically rather sporadic behaviour assessments administered in artificial
settings. As a result, such assessments do not provide a realistic impression of a patient’s condition
and thus do not lead to sufficient diagnosis and care. However, wearable behaviour
monitors have the potential for continuous, objective assessment of behaviour and wider
social interactions, thereby capturing naturalistic data without any constraints
on the place of recording or the typical limitations of lab-setting research. Such data from
naturalistic ambient environments would facilitate automated transmission and analysis,
allowing for a more timely and accurate assessment
of depressive symptoms. In response to this artificial-setting issue, this thesis focuses on
the analysis and assessment of the different aspects of social interactions in naturalistic
environments using deep learning algorithms. That could lead to improvements in both
diagnosis and treatment.
The advantages of using deep learning are that it requires no hand-crafted feature
engineering, allowing raw data to be used with minimal pre-processing compared to
classical machine learning approaches, and that it offers scalability and the ability to generalise. The
main dataset used in this thesis was recorded by a wrist-worn device designed at Newcastle
University. This device has multiple sensors, including a microphone, a tri-axial accelerometer,
a light sensor and a proximity sensor. In this thesis, only the microphone and tri-axial accelerometer
are used for the social interaction analysis. The other sensors are not used since they require
more calibration from the user, who in this case would be elderly people with depression;
hence, their use was not feasible in this scenario. Novel deep learning models are proposed to
automatically analyse two aspects of social interactions (verbal interactions/acoustic
communications and physical activities/movement patterns). Verbal interaction analysis covers
the total quantity of speech, who is talking to whom and when, and how much engagement
the wearer contributed to the conversations. The physical activity analysis includes activity
recognition, the quantity of each activity, and sleep patterns.
This thesis is composed of three main stages: two of them discuss the acoustic analysis,
and the third describes the movement pattern analysis. The acoustic analysis starts
with speech detection, in which each segment of the recording is categorised as speech or
non-speech. This segment classification is achieved by a novel deep learning model that
leverages a bi-directional Long Short-Term Memory network with gated activation units, combined
with Maxout networks and a combination of two optimisers. After detecting speech
segments in the audio data, the next stage detects how much engagement the wearer has
in any conversation throughout these speech events. This is based on detecting the wearer of the
device using a variant of the previous model that combines a convolutional autoencoder
with a bi-directional Long Short-Term Memory network. Following this, the system detects the
spoken parts of the main speaker/wearer and thereby the conversational turn-taking, although it
only covers turn-taking between the wearer and other speakers, not between every pair of speakers
in the conversation. This stage did not take the semantics of the speech into account, because
the ethical constraints of the main dataset (the Depression dataset) made it impossible
to listen to the data by any means or to have any information about its contents;
semantic analysis is therefore left for future work.
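For concreteness, here is a minimal PyTorch sketch of a BiLSTM-plus-Maxout speech/non-speech classifier in the spirit of the stage 1 model; the dimensions and the exact arrangement of the gated and Maxout layers are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SpeechDetector(nn.Module):
    """Sketch: BiLSTM over frame-level features with a Maxout output."""

    def __init__(self, n_feats: int = 40, hidden: int = 128, pieces: int = 2):
        super().__init__()
        self.blstm = nn.LSTM(n_feats, hidden, batch_first=True,
                             bidirectional=True)
        # Maxout head: project to `pieces` linear maps per class,
        # then take their elementwise maximum.
        self.head = nn.Linear(2 * hidden, 2 * pieces)
        self.pieces = pieces

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, n_feats)
        h, _ = self.blstm(x)                  # (batch, frames, 2*hidden)
        z = self.head(h)                      # (batch, frames, 2*pieces)
        z = z.view(*z.shape[:2], 2, self.pieces)
        return z.max(dim=-1).values           # speech/non-speech logits
```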
Stage 3 involves the physical activity analysis, that is, inferring the elementary physical
activities and movement patterns. These elementary patterns include sedentary actions,
walking, mixed activities, cycling and using vehicles, as well as sleep patterns. The predictive
model used is based on Random Forests and Hidden Markov Models. In all stages, the
methods presented in this thesis have been compared to the state of the art in processing
audio and accelerometer data, respectively, to thoroughly assess their contribution. These
stages are followed by a thorough analysis of the interplay between acoustic interactions, physical
movement patterns and the key clinical variables of depression, building on the outcomes of
the previous stages. The main reason for not using deep learning in this stage, unlike the
previous stages, is that the main dataset (the Depression dataset) did not have any annotations
for speech or activity, due to the ethical constraints mentioned above. Furthermore,
the training dataset (the Discussion dataset) did not have any annotations for the accelerometer
data, since the data was recorded freely and no camera was attached to the device that would
make annotation possible afterwards.
Funding: Newton-Mosharafa Fund and the Mission Sector and Cultural Affairs, Ministry of Higher Education in Egypt.
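The stage 3 pattern, Random Forest posteriors smoothed by a Hidden Markov Model, can be sketched as follows; the toy data, class count, and transition matrix are our assumptions, not the thesis's settings:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def viterbi_smooth(probs: np.ndarray, trans: np.ndarray, init: np.ndarray):
    """Viterbi decoding over per-window class posteriors to enforce
    temporally consistent activity labels."""
    T, S = probs.shape
    log_p = np.log(probs + 1e-12)
    log_d = np.log(init) + log_p[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = log_d[:, None] + np.log(trans)   # [prev, next]
        back[t] = scores.argmax(axis=0)
        log_d = scores.max(axis=0) + log_p[t]
    path = [int(log_d.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy run: random windows stand in for accelerometer features.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 12)), rng.integers(0, 3, size=200)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
probs = rf.predict_proba(rng.normal(size=(50, 12)))
trans = np.full((3, 3), 0.05) + 0.85 * np.eye(3)   # "sticky" transitions
labels = viterbi_smooth(probs, trans, np.full(3, 1 / 3))
```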
Principled methods for mixtures processing
This document is my thesis for getting the habilitation à diriger des recherches, which is the french diploma that is required to fully supervise Ph.D. students. It summarizes the research I did in the last 15 years and also provides the shortÂterm research directions and applications I want to investigate. Regarding my past research, I first describe the work I did on probabilistic audio modeling, including the separation of Gaussian and αÂstable stochastic processes. Then, I mention my work on deep learning applied to audio, which rapidly turned into a large effort for community service. Finally, I present my contributions in machine learning, with some works on hardware compressed sensing and probabilistic generative models.My research programme involves a theoretical part that revolves around probabilistic machine learning, and an applied part that concerns the processing of time series arising in both audio and life sciences