
2 research outputs found

Joint gender and age estimation based on speech signals using x-vectors and transfer learning

Authors: Hemmerling, Daria; Kwasny, Damian
Publication date: 02/12/2020

In this paper we extend the x-vector framework to the tasks of speaker age estimation and gender classification. In particular, we replace the baseline multilayer TDNN architecture with QuartzNet, a convolutional architecture that has proven successful in speech recognition. We further propose a two-stage transfer learning scheme that uses two large-scale speech datasets, VoxCeleb and Common Voice, together with multitask learning, so that a single system performs joint age estimation and gender classification. We train and evaluate on the TIMIT dataset. The proposed transfer learning scheme yields successive improvements in both age estimation error and gender classification accuracy. The best-performing system achieves new state-of-the-art age estimation results on the TIMIT TEST set, with an MAE of 5.12 and 5.29 years and an RMSE of 7.24 and 8.12 years for male and female speakers respectively, while maintaining a gender classification accuracy of 99.6%.

arXiv.org e-Print Archive
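
The first abstract above reports age estimation error as MAE and RMSE separately for male and female speakers, alongside gender classification accuracy. A minimal sketch of how such per-gender metrics can be computed is given below. The function names and the toy values are illustrative assumptions; they are not taken from the paper or its code.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error between paired sequences of numbers."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error between paired sequences of numbers."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def accuracy(labels, predictions):
    """Fraction of predictions that match the reference labels."""
    return sum(l == p for l, p in zip(labels, predictions)) / len(labels)


def age_errors_by_gender(ages, predicted_ages, genders):
    """Return {gender: (MAE, RMSE)}, computed separately per reference gender."""
    results = {}
    for g in sorted(set(genders)):
        idx = [i for i, x in enumerate(genders) if x == g]
        true_subset = [ages[i] for i in idx]
        pred_subset = [predicted_ages[i] for i in idx]
        results[g] = (mae(true_subset, pred_subset), rmse(true_subset, pred_subset))
    return results


# Made-up toy values for illustration only (not data from the paper).
ages = [30, 40, 25, 50]
predicted = [33, 36, 25, 45]
genders = ["m", "m", "f", "f"]
predicted_genders = ["m", "m", "f", "m"]

per_gender = age_errors_by_gender(ages, predicted, genders)
gender_accuracy = accuracy(genders, predicted_genders)
```

RMSE penalizes large individual errors more heavily than MAE, which is one reason the two are commonly reported together.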

A Case Study of Deep Learning Based Multi-Modal Methods for Predicting the Age-Suitability Rating of Movie Trailers

Authors: Kakadiaris, Ioannis A.; Shafaei, Mahsa; Smailis, Christos; Solorio, Thamar
Publication date: 26/01/2021

In this work, we explore different approaches to combining modalities for automated age-suitability rating of movie trailers. First, we introduce a new dataset of English-language movie trailers downloaded from IMDB and YouTube, along with their corresponding age-suitability rating labels. Second, we propose a multi-modal deep learning pipeline for the movie trailer age-suitability rating problem. This is the first attempt to combine video, audio, and speech information for this problem, and our experimental results show that multi-modal approaches significantly outperform the best monomodal and bimodal models on this task.

arXiv.org e-Print Archive