Contextual Phonetic Pretraining for End-to-end Utterance-level Language and Speaker Recognition
Pretrained contextual word representations in NLP have greatly improved
performance on various downstream tasks. For speech, we propose contextual
frame representations that capture phonetic information at the acoustic frame
level and can be used for utterance-level language, speaker, and speech
recognition. These representations come from the frame-wise intermediate
representations of an end-to-end, self-attentive ASR model (SAN-CTC) on spoken
utterances. We first train the model on the Fisher English corpus with
context-independent phoneme labels, then use its representations at inference
time as features for task-specific models on the NIST LRE07 closed-set language
recognition task and a Fisher speaker recognition task, giving significant
improvements over the state of the art on both (e.g., a language EER of 4.68% on
3-second utterances and a 23% relative reduction in speaker EER). Results remain
competitive when using a novel dilated convolutional model for language
recognition, or when ASR pretraining is done with character labels only.
Comment: submitted to INTERSPEECH 201
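As a rough illustration of the transfer recipe this abstract describes, the PyTorch sketch below freezes a pretrained self-attentive encoder, taps an intermediate layer for frame-wise contextual features, and mean-pools them for an utterance-level classifier. The encoder class, layer sizes, tapped layer index, and class count are illustrative assumptions, not the paper's SAN-CTC implementation.

```python
import torch
import torch.nn as nn

class SelfAttentiveEncoder(nn.Module):
    """Stand-in for a pretrained self-attentive ASR encoder; in the paper
    this role is played by SAN-CTC trained on Fisher with phoneme labels."""
    def __init__(self, n_mels: int = 40, d_model: int = 256, n_layers: int = 8):
        super().__init__()
        self.proj = nn.Linear(n_mels, d_model)
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(n_layers))

    def forward(self, frames: torch.Tensor, tap_layer: int) -> torch.Tensor:
        # frames: (batch, time, n_mels); return the activations of `tap_layer`
        h = self.proj(frames)
        for i, layer in enumerate(self.layers):
            h = layer(h)
            if i == tap_layer:
                break
        return h  # frame-wise contextual representations

class UtteranceClassifier(nn.Module):
    """Task-specific head: mean-pool frozen frame features, then classify
    (e.g., closed-set language ID or speaker ID)."""
    def __init__(self, d_model: int = 256, n_classes: int = 14):
        super().__init__()
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        return self.head(frame_feats.mean(dim=1))  # utterance-level pooling

encoder = SelfAttentiveEncoder().eval()
with torch.no_grad():                        # encoder stays frozen
    feats = encoder(torch.randn(2, 300, 40), tap_layer=5)
logits = UtteranceClassifier()(feats)        # (2, n_classes)
```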
A Deep Learning Approach for Low-Latency Packet Loss Concealment of Audio Signals in Networked Music Performance Applications
Networked Music Performance (NMP) is envisioned as a potential game changer
among Internet applications: it aims at revolutionizing the traditional concept
of musical interaction by enabling remote musicians to interact and perform
together through a telecommunication network. Ensuring realistic conditions for
music performance, however, constitutes a significant engineering challenge due
to extremely strict requirements in terms of audio quality and, most
importantly, network delay. To minimize the end-to-end delay experienced by the
musicians, typical implementations of NMP applications use uncompressed,
bidirectional audio streams and leverage UDP as the transport protocol. Since
UDP is connectionless and unreliable, audio packets lost in transit are not
retransmitted and thus cause glitches in the receiver's audio playout. This
article describes a technique for predicting lost packet
content in real time using a deep learning approach. The ability to conceal
errors in real time can help mitigate audio impairments caused by packet
losses, thus improving the quality of audio playout in real-world scenarios.
Comment: 8 pages, 2 figures
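As a hedged sketch of the general idea (not the article's model: the packet size, context window, and feed-forward architecture below are assumptions), a small network predicts the samples of a lost packet from recent history, and the receiver substitutes that prediction at playout time:

```python
from typing import Optional

import torch
import torch.nn as nn

FRAME = 128  # assumed samples per audio packet

class PLCNet(nn.Module):
    """Predicts the waveform of the next packet from recent context."""
    def __init__(self, context_frames: int = 8, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(context_frames * FRAME, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, FRAME))  # samples for the missing packet

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (batch, context_frames * FRAME) most recent samples
        return self.net(history)

def conceal(model: PLCNet, history: torch.Tensor,
            received: Optional[torch.Tensor]) -> torch.Tensor:
    """Pass a packet through if it arrived; otherwise play the prediction."""
    if received is not None:
        return received
    with torch.no_grad():
        return model(history.unsqueeze(0)).squeeze(0)
```

At inference, only one forward pass is needed per lost packet, which is what makes this kind of concealment plausible under NMP's tight delay budget.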
Self-Supervised Representation Learning for Vocal Music Context
In music and speech, meaning is derived at multiple levels of context.
Affect, for example, can be inferred both by a short sound token and by sonic
patterns over a longer temporal window such as an entire recording. In this
paper we focus on inferring meaning from this dichotomy of contexts. We show
how contextual representations of short sung vocal lines can be implicitly
learned from fundamental frequency (F0) and thus be used as a meaningful
feature space for downstream Music Information Retrieval (MIR) tasks. We
propose three self-supervised deep learning paradigms which leverage pseudo-task
learning of these two levels of context to produce latent representation
spaces. We evaluate the usefulness of these representations by embedding unseen
vocal contours into each space and conducting downstream classification tasks.
Our results show that contextual representations can enhance downstream
classification by as much as 15% compared to using traditional statistical
contour features.
Comment: Working on a more updated version
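As one hedged illustration of this recipe (the paper proposes three paradigms; none is reproduced here, and the segment length and latent size are assumptions), an autoencoder over short F0 segments yields a bottleneck that can serve as the latent space into which unseen contours are embedded for classification:

```python
import torch
import torch.nn as nn

SEG_LEN = 200  # assumed F0 frames per short vocal-line segment
LATENT = 64    # assumed embedding size

class ContourAutoencoder(nn.Module):
    """Pseudo-task: reconstruct an F0 segment; the bottleneck `z` becomes
    the representation used for downstream MIR classification."""
    def __init__(self):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Linear(SEG_LEN, 256), nn.ReLU(), nn.Linear(256, LATENT))
        self.decode = nn.Sequential(
            nn.Linear(LATENT, 256), nn.ReLU(), nn.Linear(256, SEG_LEN))

    def forward(self, f0: torch.Tensor):
        z = self.encode(f0)           # latent representation
        return self.decode(z), z

model = ContourAutoencoder()
contour = torch.randn(1, SEG_LEN)     # placeholder F0 contour (e.g., semitones)
_, embedding = model(contour)         # embed an unseen contour, then classify
```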
Acoustic Word Embeddings for Zero-Resource Languages Using Self-Supervised Contrastive Learning and Multilingual Adaptation
Acoustic word embeddings (AWEs) are fixed-dimensional representations of
variable-length speech segments. For zero-resource languages where labelled
data is not available, one AWE approach is to use unsupervised
autoencoder-based recurrent models. Another recent approach is to use
multilingual transfer: a supervised AWE model is trained on several
well-resourced languages and then applied to an unseen zero-resource language.
We consider how a recent contrastive learning loss can be used in both the
purely unsupervised and multilingual transfer settings. Firstly, we show that
terms from an unsupervised term discovery system can be used for contrastive
self-supervision, resulting in improvements over previous unsupervised
monolingual AWE models. Secondly, we consider how multilingual AWE models can
be adapted to a specific zero-resource language using discovered terms. We find
that self-supervised contrastive adaptation outperforms adapted multilingual
correspondence autoencoder and Siamese AWE models, giving the best overall
results in a word discrimination task on six zero-resource languages.
Comment: Accepted to SLT 202
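To make the contrastive setup concrete, here is a hedged PyTorch sketch (the encoder choice, pair construction, and temperature are illustrative assumptions): a GRU maps variable-length segments to fixed-dimensional embeddings, and an NT-Xent-style loss treats pairs of segments from the same word type, e.g. terms from unsupervised term discovery, as positives and the rest of the batch as negatives:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AWEEncoder(nn.Module):
    """Maps a variable-length speech segment to a fixed-dimensional AWE."""
    def __init__(self, n_mels: int = 40, dim: int = 128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, dim, batch_first=True)

    def forward(self, segments: torch.Tensor) -> torch.Tensor:
        # segments: (batch, time, n_mels); final hidden state = embedding
        _, h = self.rnn(segments)
        return F.normalize(h[-1], dim=-1)

def nt_xent(anchors: torch.Tensor, positives: torch.Tensor,
            temperature: float = 0.1) -> torch.Tensor:
    """anchors[i] and positives[i] come from the same (discovered) word
    type; every other segment in the batch serves as a negative."""
    sims = anchors @ positives.t() / temperature   # (batch, batch) similarities
    targets = torch.arange(anchors.size(0))        # match index i with i
    return F.cross_entropy(sims, targets)
```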