Human and Machine Speaker Recognition Based on Short Trivial Events
Trivial events are ubiquitous in human to human conversations, e.g., cough,
laugh and sniff. Compared to regular speech, these trivial events are usually
short and unclear, thus generally regarded as not speaker discriminative and so
are largely ignored by present speaker recognition research. However, these
trivial events are highly valuable in some particular circumstances such as
forensic examination, as they are less subject to intentional change and so can
be used to discover the genuine speaker from disguised speech. In this paper,
we collect a trivial event speech database that involves 75 speakers and 6
types of events, and report preliminary speaker recognition results on this
database, by both human listeners and machines. Particularly, the deep feature
learning technique recently proposed by our group is utilized to analyze and
recognize the trivial events, which leads to acceptable equal error rates
(EERs) despite the extremely short durations (0.2-0.5 seconds) of these events.
Comparing different types of events, 'hmm' seems more speaker discriminative
than the others.
Comment: ICASSP 201
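As a rough illustration of the equal error rate (EER) metric mentioned in the abstract above, the following minimal sketch sweeps every observed score as a decision threshold and reports the point where the false acceptance and false rejection rates are closest. The function name `equal_error_rate` and the toy scores are assumptions made for illustration; they are not taken from the paper, and real evaluations typically interpolate along the error curve.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by trying every observed score as a threshold.

    target_scores: scores of genuine (same-speaker) trials.
    nontarget_scores: scores of impostor (different-speaker) trials.
    Higher scores are assumed to mean "more likely the same speaker".
    """
    best_gap, eer = float("inf"), None
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        # False rejection: a genuine trial scored below the threshold.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        # False acceptance: an impostor trial scored at or above the threshold.
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (hypothetical, for illustration only).
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.6, 0.3, 0.2, 0.1]
toy_eer = equal_error_rate(genuine, impostor)
```

With these toy scores, one genuine trial falls below the best threshold and one impostor trial falls at or above it, so both error rates are 25%.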
Full-info Training for Deep Speaker Feature Learning
In recent studies, it has been shown that speaker patterns can be learned from
very short speech segments (e.g., 0.3 seconds) by a carefully designed
convolutional & time-delay deep neural network (CT-DNN) model. By forcing the
model to discriminate the speakers in the training data, frame-level speaker
features can be derived from the last hidden layer. In spite of its good
performance, a potential problem of the present model is that it involves a
parametric classifier, i.e., the last affine layer, which may consume some
discriminative knowledge, thus leading to an 'information leak' in the feature
learning. This paper presents a full-info training approach that discards the
parametric classifier and forces all the discriminative knowledge to be learned
by the feature net. Our experiments on the Fisher database demonstrate that this
new training scheme can produce more coherent features, leading to consistent
and notable performance improvement on the speaker verification task.
Comment: Accepted by ICASSP 201
Deep factorization for speech signal
Various informative factors are mixed in speech signals, leading to great
difficulty when decoding any of the factors. An intuitive idea is to factorize
each speech frame into individual informative factors, though it turns out to
be highly difficult. Recently, we found that speaker traits, which were assumed
to be long-term distributional properties, are actually short-time patterns,
and can be learned by a carefully designed deep neural network (DNN). This
discovery motivated a cascade deep factorization (CDF) framework that will be
presented in this paper. The proposed framework infers speech factors in a
sequential way, where factors previously inferred are used as conditional
variables when inferring other factors. We will show that this approach can
effectively factorize speech signals, and using these factors, the original
speech spectrum can be recovered with high accuracy. This factorization and
reconstruction approach offers potential value for many speech processing
tasks, e.g., speaker recognition and emotion recognition, as will be
demonstrated in the paper.
Comment: Accepted by ICASSP 2018. arXiv admin note: substantial text overlap
with arXiv:1706.0177
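The sequential inference described in the abstract above, where previously inferred factors serve as conditional variables for later ones, can be sketched in a few lines. This is only a toy outline of the control flow: the function `cascade_infer`, the factor names, and the arithmetic stand-ins are all hypothetical and do not reproduce the DNN models or the factor ordering used in the paper.

```python
def cascade_infer(frame, stages):
    """Infer factors one at a time, conditioning each stage on earlier results.

    frame: a list of numbers standing in for one speech frame.
    stages: ordered list of (factor_name, infer_fn) pairs, where
            infer_fn(frame, given) returns the inferred factor and
            `given` holds the factors inferred by earlier stages.
    """
    inferred = {}
    for name, infer in stages:
        # Each stage sees the frame plus a copy of everything inferred so far.
        inferred[name] = infer(frame, dict(inferred))
    return inferred


# Toy stand-ins for trained models (hypothetical, for illustration only).
toy_stages = [
    ("factor_1", lambda frame, given: sum(frame) / len(frame)),
    ("factor_2", lambda frame, given: max(frame) - given["factor_1"]),
    ("factor_3", lambda frame, given: min(frame) - given["factor_1"] - given["factor_2"]),
]

toy_factors = cascade_infer([1.0, 2.0, 3.0], toy_stages)
```

Because each stage receives only the factors inferred before it, the order of `toy_stages` determines which factors can condition which.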