1,327 research outputs found
Human and Machine Speaker Recognition Based on Short Trivial Events
Trivial events are ubiquitous in human-to-human conversations, e.g., coughs,
laughs and sniffs. Compared to regular speech, these trivial events are usually
short and unclear; they are therefore generally regarded as not speaker
discriminative and are largely ignored by present speaker recognition research.
However, these trivial events are highly valuable in some particular
circumstances such as forensic examination: they are less subject to
intentional change, so they can be used to identify the genuine speaker behind
disguised speech. In this paper,
we collect a trivial event speech database that involves 75 speakers and 6
types of events, and report preliminary speaker recognition results on this
database, by both human listeners and machines. In particular, the deep feature
learning technique recently proposed by our group is utilized to analyze and
recognize the trivial events, which leads to acceptable equal error rates
(EERs) despite the extremely short durations (0.2-0.5 seconds) of these events.
Comparing different types of events, 'hmm' seems the most speaker discriminative.
Comment: ICASSP 201
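For reference, the equal error rate is the operating point at which the false-acceptance and false-rejection rates coincide. A minimal sketch of computing it from verification scores follows; the score arrays are randomly generated placeholders, not results from this database:

    import numpy as np

    def compute_eer(target_scores, nontarget_scores):
        # Equal error rate: the threshold where false-accept and false-reject rates meet.
        scores = np.concatenate([target_scores, nontarget_scores])
        labels = np.concatenate([np.ones(len(target_scores)), np.zeros(len(nontarget_scores))])
        labels = labels[np.argsort(scores)]             # sweep the threshold over sorted scores
        frr = np.cumsum(labels) / labels.sum()          # targets rejected below the threshold
        far = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()  # nontargets accepted above it
        idx = np.argmin(np.abs(frr - far))
        return (frr[idx] + far[idx]) / 2.0

    # hypothetical, randomly generated scores for illustration only
    rng = np.random.default_rng(0)
    print(compute_eer(rng.normal(1.0, 1.0, 1000), rng.normal(-1.0, 1.0, 1000)))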
Deep factorization for speech signal
Various informative factors are mixed in speech signals, which leads to great
difficulty when decoding any of them. An intuitive idea is to factorize each
speech frame into individual informative factors, though this turns out to be
highly difficult. Recently, we found that speaker traits, which were assumed
to be long-term distributional properties, are actually short-time patterns,
and can be learned by a carefully designed deep neural network (DNN). This
discovery motivated a cascade deep factorization (CDF) framework that will be
presented in this paper. The proposed framework infers speech factors in a
sequential way, where factors previously inferred are used as conditional
variables when inferring other factors. We will show that this approach can
effectively factorize speech signals, and using these factors, the original
speech spectrum can be recovered with high accuracy. This factorization and
reconstruction approach provides potential value for many speech processing
tasks, e.g., speaker recognition and emotion recognition, as will be
demonstrated in the paper.
Comment: Accepted by ICASSP 2018. arXiv admin note: substantial text overlap with arXiv:1706.0177
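A minimal sketch of the cascade idea, under stated assumptions: the factor extractors, their dimensions, and the linear decoder below are illustrative placeholders, not the paper's exact architecture. Each factor net receives the frame plus all previously inferred factors, and a decoder maps the concatenated factors back to the spectrum:

    import torch
    import torch.nn as nn

    class CascadeDeepFactorization(nn.Module):
        # Sketch: infer factors sequentially, conditioning each on earlier ones.
        def __init__(self, dim_spec=257, dim_spk=64, dim_emo=32):
            super().__init__()
            self.speaker_net = nn.Sequential(nn.Linear(dim_spec, 128), nn.ReLU(),
                                             nn.Linear(128, dim_spk))
            # the emotion net conditions on the previously inferred speaker factor
            self.emotion_net = nn.Sequential(nn.Linear(dim_spec + dim_spk, 128), nn.ReLU(),
                                             nn.Linear(128, dim_emo))
            # the decoder reconstructs the spectrum from the inferred factors
            self.decoder = nn.Linear(dim_spk + dim_emo, dim_spec)

        def forward(self, frame):
            spk = self.speaker_net(frame)
            emo = self.emotion_net(torch.cat([frame, spk], dim=-1))
            recon = self.decoder(torch.cat([spk, emo], dim=-1))
            return spk, emo, recon

    model = CascadeDeepFactorization()
    frames = torch.randn(8, 257)                      # a batch of spectral frames
    spk, emo, recon = model(frames)
    loss = nn.functional.mse_loss(recon, frames)      # reconstruction objective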
Full-info Training for Deep Speaker Feature Learning
Recent studies have shown that speaker patterns can be learned from
very short speech segments (e.g., 0.3 seconds) by a carefully designed
convolutional & time-delay deep neural network (CT-DNN) model. By enforcing the
model to discriminate the speakers in the training data, frame-level speaker
features can be derived from the last hidden layer. In spite of its good
performance, a potential problem of the present model is that it involves a
parametric classifier, i.e., the last affine layer, which may consume some
discriminative knowledge, thus leading to `information leak' for the feature
learning. This paper presents a full-info training approach that discards the
parametric classifier and forces all the discriminative knowledge to be learned
by the feature net. Our experiments on the Fisher database demonstrate that this
new training scheme produces more coherent features, leading to consistent
and notable performance improvements on the speaker verification task.
Comment: Accepted by ICASSP 201
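The abstract does not spell out the exact classifier-free objective; one way such a scheme could look (an illustrative assumption, not the authors' method) is to score features against non-parametric class means computed on the fly, so that no trainable classifier weights can absorb discriminative knowledge:

    import torch
    import torch.nn.functional as F

    def nonparametric_loss(features, labels, num_classes):
        # Cross-entropy against class means computed from the batch itself, so there
        # are no persistent classifier parameters to absorb discriminative knowledge.
        # Assumes every class appears at least once in the batch.
        feats = F.normalize(features, dim=-1)
        means = torch.stack([feats[labels == c].mean(0) for c in range(num_classes)])
        logits = feats @ F.normalize(means, dim=-1).t() / 0.1   # 0.1 = temperature
        return F.cross_entropy(logits, labels)

    feats = torch.randn(32, 64, requires_grad=True)   # frame features from the feature net
    labels = torch.arange(32) % 4                     # 4 speakers, all present in the batch
    loss = nonparametric_loss(feats, labels, num_classes=4)
    loss.backward()                                   # gradients flow only into the features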
Laugh Betrays You? Learning Robust Speaker Representation From Speech Containing Non-Verbal Fragments
The success of automatic speaker verification shows that discriminative
speaker representations can be extracted from neutral speech. However, as a
kind of non-verbal voice, laughter should intuitively also carry speaker
information. Thus, this paper explores speaker verification on utterances
containing non-verbal laughter segments. We collect a set of clips with
laughter components by running a laughter detection script on VoxCeleb and part
of the CN-Celeb dataset. To further filter out untrusted clips, probability
scores are calculated by our binary laughter detection classifier, which is
pre-trained on pure laughter and neutral speech. After that, based on
the clips whose scores are over the threshold, we construct trials under two
different evaluation scenarios: Laughter-Laughter (LL) and Speech-Laughter
(SL). Then a novel method called Laughter-Splicing based Network (LSN) is
proposed, which can significantly boost performance in both scenarios and
maintain performance on neutral speech, such as on the VoxCeleb1 test set.
Specifically, our system achieves relative improvements of 20% and 22% on the
Laughter-Laughter and Speech-Laughter trials, respectively. The metadata and
sample clips have been released at https://github.com/nevermoreLin/Laugh_LSN.
Comment: Submitted to ICASSP202
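A minimal sketch of the splicing idea at the waveform level; the insertion position and energy matching are illustrative assumptions, not the released recipe:

    import numpy as np

    def splice_laughter(speech, laughter, rng=None):
        # Insert a laughter clip at a random position inside a speech waveform,
        # scaled to roughly match the speech energy.
        if rng is None:
            rng = np.random.default_rng()
        pos = rng.integers(0, len(speech))
        gain = np.sqrt(np.mean(speech ** 2) / (np.mean(laughter ** 2) + 1e-8))
        return np.concatenate([speech[:pos], gain * laughter, speech[pos:]])

    # hypothetical 16 kHz waveforms for illustration
    speech = np.random.randn(16000 * 3).astype(np.float32)
    laugh = np.random.randn(16000).astype(np.float32)
    augmented = splice_laughter(speech, laugh)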
TB or not TB? Acoustic cough analysis for tuberculosis classification
In this work, we explore recurrent neural network architectures for
tuberculosis (TB) cough classification. In contrast to previous unsuccessful
attempts to implement deep architectures in this domain, we show that a basic
bidirectional long short-term memory network (BiLSTM) can achieve improved
performance. In addition, we show that by performing greedy feature selection
in conjunction with a newly-proposed attention-based architecture that learns
patient invariant features, substantially better generalisation can be achieved
compared to a baseline and other considered architectures. Furthermore, this
attention mechanism allows inspection of the temporal regions of the audio
signal that are considered important for classification. Finally, we develop a
neural style transfer technique to infer idealised inputs which can
subsequently be analysed. We find distinct differences between the idealised
power spectra of TB and non-TB coughs, which provide clues about the origin of
the features in the audio signal.
Comment: Accepted for publication at Interspeech 202
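A minimal sketch of a BiLSTM with attention pooling for binary cough classification; layer sizes and the attention form are assumptions, and the paper's exact architecture may differ:

    import torch
    import torch.nn as nn

    class AttentiveBiLSTM(nn.Module):
        # BiLSTM over frame features; attention pools frames into one vector, and
        # the attention weights indicate which temporal regions mattered.
        def __init__(self, n_feats=40, hidden=64):
            super().__init__()
            self.lstm = nn.LSTM(n_feats, hidden, batch_first=True, bidirectional=True)
            self.att = nn.Linear(2 * hidden, 1)    # additive attention scorer
            self.out = nn.Linear(2 * hidden, 1)    # TB / non-TB logit

        def forward(self, x):                      # x: (batch, time, n_feats)
            h, _ = self.lstm(x)                    # (batch, time, 2*hidden)
            w = torch.softmax(self.att(h), dim=1)  # (batch, time, 1)
            pooled = (w * h).sum(dim=1)
            return self.out(pooled).squeeze(-1), w.squeeze(-1)

    model = AttentiveBiLSTM()
    logit, weights = model(torch.randn(4, 200, 40))  # 4 coughs, 200 frames of 40-dim features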
Mental Health Monitoring from Speech and Language
Concern for mental health has increased in recent years due to its impact on people's quality of life and its consequential effect on healthcare systems. Automatic systems that can help with diagnosis, symptom monitoring, alarm generation, etc. are an emerging technology that has posed several challenges to the scientific community. The goal of this work is to design a system capable of distinguishing between healthy and depressed and/or anxious subjects, in a realistic environment, using their speech. The system is based on efficient representations of acoustic signals and on text representations extracted within the self-supervised paradigm. Given the good results achieved using acoustic signals, another set of experiments was carried out to detect the specific illness. An analysis of the emotional information and its impact on the presented task is also included as an additional contribution.
This work was partially funded by the European Commission, grant number 823907, and the Spanish Ministry of Science under grant TIN2017-85854-C4-3-R.
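A minimal sketch of the overall recipe: a pre-trained self-supervised encoder pools each utterance into a vector, and a light classifier separates healthy from depressed and/or anxious subjects. The wav2vec 2.0 checkpoint and logistic regression below are illustrative assumptions, not necessarily the authors' choices:

    import torch
    import torchaudio
    from sklearn.linear_model import LogisticRegression

    # illustrative choice of self-supervised encoder (downloads a checkpoint on first use)
    bundle = torchaudio.pipelines.WAV2VEC2_BASE
    encoder = bundle.get_model().eval()

    def utterance_embedding(waveform):
        # Mean-pool the self-supervised frame representations into one vector.
        with torch.inference_mode():
            feats, _ = encoder(waveform)           # (1, frames, 768)
        return feats.mean(dim=1).squeeze(0).numpy()

    # hypothetical data: one embedding per subject, with a binary diagnosis label
    X = [utterance_embedding(torch.randn(1, bundle.sample_rate * 5)) for _ in range(8)]
    y = [0, 1, 0, 1, 0, 1, 0, 1]                   # 0 = healthy, 1 = depressed/anxious
    clf = LogisticRegression(max_iter=1000).fit(X, y)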