Multimodal Diarization Systems by Training Enrollment Models as Identity Representations
This paper describes a post-evaluation analysis of the system developed by the ViVoLAB research group for the IberSPEECH-RTVE 2020 Multimodal Diarization (MD) Challenge. This challenge focuses on the study of multimodal systems for the diarization of audiovisual files and the assignment of an identity to each segment where a person is detected. In this work, we implemented two different subsystems to address this task using the audio and the video from audiovisual files separately. To develop our subsystems, we used state-of-the-art speaker and face verification embeddings extracted from publicly available deep neural networks (DNN). Different clustering techniques were also employed in combination with the tracking and identity assignment process. Furthermore, we included a novel back-end approach in the face verification subsystem to train an enrollment model for each identity, which we have previously shown to improve the results compared to the average of the enrollment data. Using this approach, we trained a learnable vector to represent each enrollment character. The loss function employed to train this vector was an approximated version of the detection cost function (aDCF), which is inspired by the DCF, a metric widely used to measure performance in verification tasks. In this paper, we also focused on exploring and analyzing the effect of training this vector with several configurations of this objective loss function. This analysis allows us to assess the impact of the configuration parameters of the loss on the number and type of errors produced by the system.
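As a rough illustration of the idea in the abstract above, the following is a minimal sketch of a detection-cost-style loss in which the hard miss and false-alarm counts are replaced by sigmoids so that the cost becomes differentiable. It is not the authors' implementation: the cosine scoring, the function names, and the parameters `threshold`, `alpha`, `w_miss` and `w_fa` are all assumptions made for illustration.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def approx_dcf(model, embeddings, is_target,
               threshold=0.5, alpha=10.0, w_miss=0.5, w_fa=0.5):
    """Smooth detection cost for one enrollment vector (hypothetical names).

    model      -- the learnable vector standing in for one identity
    embeddings -- face embeddings scored against the model
    is_target  -- True where the embedding belongs to that identity
    alpha      -- sigmoid sharpness; larger values approach hard counting
    """
    scores = [cosine(model, e) for e in embeddings]
    target = [s for s, t in zip(scores, is_target) if t]
    nontarget = [s for s, t in zip(scores, is_target) if not t]
    # Soft miss rate: target scores that fall below the threshold.
    p_miss = sum(sigmoid(alpha * (threshold - s)) for s in target) / len(target)
    # Soft false-alarm rate: non-target scores that rise above the threshold.
    p_fa = sum(sigmoid(alpha * (s - threshold)) for s in nontarget) / len(nontarget)
    return w_miss * p_miss + w_fa * p_fa
```

Changing the relative size of `w_miss` and `w_fa` shifts the balance between the two error types, which is the kind of configuration effect the abstract says the paper analyzes. Training the vector would additionally require a gradient-based optimizer, which this sketch leaves out.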
Physiologically-Motivated Feature Extraction Methods for Speaker Recognition
Speaker recognition has received a great deal of attention from the speech community, and significant gains in robustness and accuracy have been obtained over the past decade. However, the features used for identification are still primarily representations of overall spectral characteristics, and thus the models are primarily phonetic in nature, differentiating speakers based on overall pronunciation patterns. This creates difficulties in terms of the amount of enrollment data and complexity of the models required to cover the phonetic space, especially in tasks such as identification where enrollment and testing data may not have similar phonetic coverage. This dissertation introduces new features based on vocal source characteristics intended to capture physiological information related to the laryngeal excitation energy of a speaker. These features, including RPCC, GLFCC and TPCC, represent the unique characteristics of speech production not represented in current state-of-the-art speaker identification systems. The proposed features are evaluated through three experimental paradigms: cross-lingual speaker identification, cross song-type avian speaker identification and mono-lingual speaker identification. The experimental results show that the proposed features provide information about speaker characteristics that is significantly different in nature from the phonetically-focused information present in traditional spectral features. The incorporation of the proposed glottal source features offers significant overall improvement to the robustness and accuracy of speaker identification tasks.
Microphone smart device fingerprinting from video recordings
This report summarizes the ongoing research activity carried out by DG-JRC in the framework of the institutional project Authors and Victims Identification of Child Abuse on-line, concerning the use of microphone fingerprinting for source device classification. Starting from an exhaustive study of the state of the art on the matter, this report describes a feasibility study on the adoption of microphone fingerprinting for source identification of video recordings. A set of operational scenarios has been established in collaboration with EUROPOL law enforcers, according to investigators' needs. A critical analysis of the obtained results has demonstrated the feasibility of microphone fingerprinting and has suggested a set of recommendations, both in terms of usability and future research in the field.
JRC.E.3-Cyber and Digital Citizens' Securit
MobiFace: A Novel Dataset for Mobile Face Tracking in the Wild
Face tracking serves as the crucial initial step in mobile applications
trying to analyse target faces over time in mobile settings. However, this
problem has received little attention, mainly due to the scarcity of dedicated
face tracking benchmarks. In this work, we introduce MobiFace, the first
dataset for single face tracking in mobile situations. It consists of 80
unedited live-streaming mobile videos captured by 70 different smartphone users
in fully unconstrained environments. Over bounding boxes are manually
labelled. The videos are carefully selected to cover typical smartphone usage.
The videos are also annotated with 14 attributes, including 6 newly proposed
attributes and 8 commonly seen in object tracking. 36 state-of-the-art
trackers, including facial landmark trackers, generic object trackers and
trackers that we have fine-tuned or improved, are evaluated. The results
suggest that mobile face tracking cannot be solved through existing approaches.
In addition, we show that fine-tuning on the MobiFace training data
significantly boosts the performance of deep learning-based trackers,
suggesting that MobiFace captures the unique characteristics of mobile face
tracking. Our goal is to offer the community a diverse dataset to enable the
design and evaluation of mobile face trackers. The dataset, annotations and the
evaluation server will be available at https://mobiface.github.io/.
Comment: To appear at the 14th IEEE International Conference on Automatic Face
and Gesture Recognition (FG 2019)
Deep learning methods in speaker recognition: a review
This paper summarizes the applied deep learning practices in the field of
speaker recognition, both verification and identification. Speaker recognition
has been a widely studied topic in speech technology. Much research has been
carried out, yet little progress has been achieved in the past 5-6 years.
However, as deep learning techniques advance in most machine learning fields,
they are replacing the former state-of-the-art methods in speaker recognition
too. DL now appears to be the state-of-the-art solution for both speaker
verification and identification. Standard x-vectors, in addition to i-vectors,
are used as baselines in most of the novel works. The increasing amount of
gathered data opens up the territory to DL methods, where they are the most
effective.