9,403 research outputs found
Speaker Identification for Swiss German with Spectral and Rhythm Features
We present results of speech rhythm analysis for automatic speaker identification. We expand previous experiments using similar methods for language identification. Features describing the rhythmic properties of salient changes in signal components are extracted and used in an speaker identification task to determine to which extent they are descriptive of speaker variability. We also test the performance of state-of-the-art but simple-to-extract frame-based features. The paper focus is the evaluation on one corpus (swiss german, TEVOID) using support vector machines. Results suggest that the general spectral features can provide very good performance on this dataset, whereas the rhythm features are not as successful in the task, indicating either the lack of suitability for this task or the dataset specificity
Deep Multimodal Speaker Naming
Automatic speaker naming is the problem of localizing as well as identifying
each speaking character in a TV/movie/live show video. This is a challenging
problem mainly attributes to its multimodal nature, namely face cue alone is
insufficient to achieve good performance. Previous multimodal approaches to
this problem usually process the data of different modalities individually and
merge them using handcrafted heuristics. Such approaches work well for simple
scenes, but fail to achieve high performance for speakers with large appearance
variations. In this paper, we propose a novel convolutional neural networks
(CNN) based learning framework to automatically learn the fusion function of
both face and audio cues. We show that without using face tracking, facial
landmark localization or subtitle/transcript, our system with robust multimodal
feature extraction is able to achieve state-of-the-art speaker naming
performance evaluated on two diverse TV series. The dataset and implementation
of our algorithm are publicly available online
Using the beat histogram for speech rhythm description and language identification
In this paper we present a novel approach for the description of speech rhythm and the extraction of rhythm-related features for automatic language identification (LID). Previous methods have extracted speech rhythm through the calculation of features based on salient elements of speech such as consonants, vowels and syllables. We present how an automatic rhythm extraction method borrowed from music information retrieval, the beat histogram, can be adapted for the analysis of speech rhythm by defining the most relevant novelty functions in the speech signal and extracting features describing their periodicities. We have evaluated those features in a rhythm-based LID task for two multilingual speech corpora using support vector machines, including feature selection methods to identify the most informative descriptors. Results suggest that the method is successful in describing speech rhythm and provides LID classification accuracy comparable to or better than that of other approaches, without the need for a preceding segmentation or annotation of the speech signal. Concerning rhythm typology, the rhythm class hypothesis in its original form seems to be only partly confirmed by our results
Human and Machine Speaker Recognition Based on Short Trivial Events
Trivial events are ubiquitous in human to human conversations, e.g., cough,
laugh and sniff. Compared to regular speech, these trivial events are usually
short and unclear, thus generally regarded as not speaker discriminative and so
are largely ignored by present speaker recognition research. However, these
trivial events are highly valuable in some particular circumstances such as
forensic examination, as they are less subjected to intentional change, so can
be used to discover the genuine speaker from disguised speech. In this paper,
we collect a trivial event speech database that involves 75 speakers and 6
types of events, and report preliminary speaker recognition results on this
database, by both human listeners and machines. Particularly, the deep feature
learning technique recently proposed by our group is utilized to analyze and
recognize the trivial events, which leads to acceptable equal error rates
(EERs) despite the extremely short durations (0.2-0.5 seconds) of these events.
Comparing different types of events, 'hmm' seems more speaker discriminative.Comment: ICASSP 201
- …