
    VoxCeleb2: Deep Speaker Recognition

    The objective of this paper is speaker recognition under noisy and unconstrained conditions. We make two key contributions. First, we introduce a very large-scale audio-visual speaker recognition dataset collected from open-source media. Using a fully automated pipeline, we curate VoxCeleb2, which contains over a million utterances from over 6,000 speakers. This is several times larger than any publicly available speaker recognition dataset. Second, we develop and compare Convolutional Neural Network (CNN) models and training strategies that can effectively recognise identities from voice under various conditions. The models trained on the VoxCeleb2 dataset surpass the performance of previous works on a benchmark dataset by a significant margin.
    Comment: To appear in Interspeech 2018. The audio-visual dataset can be downloaded from http://www.robots.ox.ac.uk/~vgg/data/voxceleb2 . 1806.05622v2: minor fixes; 5 pages.
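Speaker-embedding systems like the one described are typically evaluated on verification trials: two utterances are mapped to fixed-dimensional embeddings and scored by cosine similarity, with a tuned threshold deciding "same speaker". The sketch below shows only that scoring step, not the paper's CNN or training pipeline; the embeddings and the threshold value are illustrative assumptions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two speaker embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_speaker(emb1, emb2, threshold=0.7):
    """Accept a verification trial if the embeddings are close enough.

    The 0.7 threshold is hypothetical; in practice it is tuned on a
    held-out trial list, e.g. to the equal error rate operating point.
    """
    return cosine_similarity(emb1, emb2) >= threshold

# Usage with toy embeddings (a real system would extract these with a CNN):
e1 = [0.2, 0.9, 0.1]
e2 = [0.21, 0.88, 0.12]
print(same_speaker(e1, e2))
```

Cosine scoring is attractive here because it is scale-invariant, so only the direction of the embedding carries speaker identity.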

    Speech Emotion Recognition Using Multi-hop Attention Mechanism

    In this paper, we are interested in exploiting the textual and acoustic data of an utterance for the speech emotion classification task. The baseline approach models the information from audio and text independently using two deep neural networks (DNNs), whose outputs are then fused for classification. As opposed to using knowledge from the two modalities separately, we propose a framework that exploits acoustic information in tandem with lexical data. The proposed framework uses two bi-directional long short-term memory (BLSTM) networks to obtain hidden representations of the utterance. Furthermore, we propose an attention mechanism, referred to as multi-hop, which is trained to automatically infer the correlation between the modalities. The multi-hop attention first computes the segments of the textual data relevant to the audio signal. The relevant textual data is then applied to attend to parts of the audio signal. To evaluate the performance of the proposed system, experiments are performed on the IEMOCAP dataset. Experimental results show that the proposed technique outperforms the state-of-the-art system by a 6.5% relative improvement in weighted accuracy.
    Comment: 5 pages, Accepted as a conference paper at ICASSP 2019 (oral presentation).
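The two hops described above can be sketched with plain dot-product attention: an audio summary first attends over the text states, and the resulting text context then attends back over the audio states. This is a minimal NumPy illustration of the mechanism's shape, not the paper's implementation; using the final BLSTM state as the initial query and giving both modalities the same hidden size are simplifying assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(query, keys):
    """Dot-product attention: weight each key by its match with the query."""
    weights = softmax(keys @ query)   # one weight per time step, summing to 1
    context = weights @ keys          # weighted sum of the key vectors
    return context, weights

def multi_hop(audio_states, text_states):
    """Two hops: audio -> text, then text context -> audio.

    audio_states: (T_a, d) BLSTM outputs over the audio signal
    text_states:  (T_t, d) BLSTM outputs over the transcript
    Both use hidden size d here for simplicity (an assumption).
    """
    audio_summary = audio_states[-1]                      # hop-1 query (assumption)
    text_ctx, text_w = attend(audio_summary, text_states) # hop 1: relevant text
    audio_ctx, audio_w = attend(text_ctx, audio_states)   # hop 2: relevant audio
    return np.concatenate([audio_ctx, text_ctx]), text_w, audio_w
```

The concatenated context vector would then feed a small classifier over the emotion categories.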

    Leveraging a Hybrid Deep Learning Architecture for Efficient Emotion Recognition in Audio Processing

    This paper presents a novel hybrid deep learning architecture for emotion recognition from speech signals, a task that has garnered significant interest in recent years due to its potential applications in fields such as healthcare, psychology, and entertainment. The proposed architecture combines modified ResNet-34 and RoBERTa models to extract meaningful features from speech signals and classify them into different emotion categories. The model is evaluated on five standard emotion recognition datasets, namely RAVDESS, EmoDB, SAVEE, CREMA-D, and TESS, and achieves state-of-the-art performance on all of them. The experimental results show that the proposed hybrid architecture outperforms existing emotion recognition models, achieving high accuracy and F1 scores for emotion classification. The proposed architecture is promising for real-time emotion recognition applications and can be applied in domains such as speech-based emotion recognition systems, human-computer interaction, and virtual assistants.
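A common way to combine an acoustic backbone (such as ResNet-34) with a text encoder (such as RoBERTa) is late fusion: concatenate the two feature vectors and pass them through a classification head. The abstract does not specify the fusion strategy, so the sketch below is a generic illustration of that pattern; the feature dimensions and the plain linear-softmax head are assumptions, not the paper's design.

```python
import numpy as np

def fuse_and_classify(audio_feat, text_feat, W, b):
    """Late fusion of modality features followed by a linear softmax head.

    audio_feat: (d_a,) vector, e.g. pooled ResNet-34 activations (assumed)
    text_feat:  (d_t,) vector, e.g. pooled RoBERTa activations (assumed)
    W, b:       classifier weights (n_classes, d_a + d_t) and bias (n_classes,)
    Returns a probability distribution over the emotion classes.
    """
    z = np.concatenate([audio_feat, text_feat])  # joint representation
    logits = W @ z + b
    e = np.exp(logits - logits.max())            # numerically stable softmax
    return e / e.sum()
```

In a trained system, `W` and `b` would be learned jointly with (or on top of) the two backbones; here they are just placeholders for the fusion shape.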