Contextual Phonetic Pretraining for End-to-end Utterance-level Language and Speaker Recognition
Pretrained contextual word representations in NLP have greatly improved
performance on various downstream tasks. For speech, we propose contextual
frame representations that capture phonetic information at the acoustic frame
level and can be used for utterance-level language, speaker, and speech
recognition. These representations come from the frame-wise intermediate
representations of an end-to-end, self-attentive ASR model (SAN-CTC) on spoken
utterances. We first train the model on the Fisher English corpus with
context-independent phoneme labels, then use its representations at inference
time as features for task-specific models on the NIST LRE07 closed-set language
recognition task and a Fisher speaker recognition task, giving significant
improvements over the state of the art on both (e.g., a language EER of 4.68% on
3-second utterances and a 23% relative reduction in speaker EER). Results remain
competitive when using a novel dilated convolutional model for language
recognition, or when ASR pretraining is done with character labels only.
Comment: submitted to INTERSPEECH 201
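As a rough illustration of the transfer recipe this abstract describes, the PyTorch sketch below freezes a pretrained self-attentive encoder, taps an intermediate layer for frame-wise contextual features, and mean-pools them for an utterance-level classifier. The encoder class, layer sizes, tapped layer index, and class count are illustrative assumptions, not the paper's SAN-CTC implementation.

```python
import torch
import torch.nn as nn

class SelfAttentiveEncoder(nn.Module):
    """Stand-in for a pretrained self-attentive ASR encoder; in the paper
    this role is played by SAN-CTC trained on Fisher with phoneme labels."""
    def __init__(self, n_mels: int = 40, d_model: int = 256, n_layers: int = 8):
        super().__init__()
        self.proj = nn.Linear(n_mels, d_model)
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(n_layers))

    def forward(self, frames: torch.Tensor, tap_layer: int) -> torch.Tensor:
        # frames: (batch, time, n_mels); return the activations of `tap_layer`
        h = self.proj(frames)
        for i, layer in enumerate(self.layers):
            h = layer(h)
            if i == tap_layer:
                break
        return h  # frame-wise contextual representations

class UtteranceClassifier(nn.Module):
    """Task-specific head: mean-pool frozen frame features, then classify
    (e.g., closed-set language ID or speaker ID)."""
    def __init__(self, d_model: int = 256, n_classes: int = 14):
        super().__init__()
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        return self.head(frame_feats.mean(dim=1))  # utterance-level pooling

encoder = SelfAttentiveEncoder().eval()
with torch.no_grad():                        # encoder stays frozen
    feats = encoder(torch.randn(2, 300, 40), tap_layer=5)
logits = UtteranceClassifier()(feats)        # (2, n_classes)
```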
A Deep Learning Approach for Low-Latency Packet Loss Concealment of Audio Signals in Networked Music Performance Applications
Networked Music Performance (NMP) is envisioned as a potential game changer
among Internet applications: it aims at revolutionizing the traditional concept
of musical interaction by enabling remote musicians to interact and perform
together through a telecommunication network. Ensuring realistic conditions for
music performance, however, constitutes a significant engineering challenge due
to extremely strict requirements in terms of audio quality and, most
importantly, network delay. To minimize the end-to-end delay experienced by the
musicians, typical implementations of NMP applications use uncompressed,
bidirectional audio streams and leverage UDP as the transport protocol. Since
UDP is connectionless and unreliable, audio packets lost in transit are not
retransmitted and thus cause glitches in the receiver's audio playout. This
article describes a technique for predicting lost packet
content in real time using a deep learning approach. The ability to conceal
errors in real time can help mitigate audio impairments caused by packet
losses, thus improving the quality of audio playout in real-world scenarios.
Comment: 8 pages, 2 figures
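As a hedged sketch of the general idea (not the article's model: the packet size, context window, and feed-forward architecture below are assumptions), a small network predicts the samples of a lost packet from recent history, and the receiver substitutes that prediction at playout time:

```python
from typing import Optional

import torch
import torch.nn as nn

FRAME = 128  # assumed samples per audio packet

class PLCNet(nn.Module):
    """Predicts the waveform of the next packet from recent context."""
    def __init__(self, context_frames: int = 8, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(context_frames * FRAME, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, FRAME))  # samples for the missing packet

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (batch, context_frames * FRAME) most recent samples
        return self.net(history)

def conceal(model: PLCNet, history: torch.Tensor,
            received: Optional[torch.Tensor]) -> torch.Tensor:
    """Pass a packet through if it arrived; otherwise play the prediction."""
    if received is not None:
        return received
    with torch.no_grad():
        return model(history.unsqueeze(0)).squeeze(0)
```

At inference, only one forward pass is needed per lost packet, which is what makes this kind of concealment plausible under NMP's tight delay budget.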
Self-Supervised Representation Learning for Vocal Music Context
In music and speech, meaning is derived at multiple levels of context.
Affect, for example, can be inferred both by a short sound token and by sonic
patterns over a longer temporal window such as an entire recording. In this
paper we focus on inferring meaning from this dichotomy of contexts. We show
how contextual representations of short sung vocal lines can be implicitly
learned from fundamental frequency (F0) and thus be used as a meaningful
feature space for downstream Music Information Retrieval (MIR) tasks. We
propose three self-supervised deep learning paradigms which leverage pseudo-task
learning of these two levels of context to produce latent representation
spaces. We evaluate the usefulness of these representations by embedding unseen
vocal contours into each space and conducting downstream classification tasks.
Our results show that contextual representations can enhance downstream
classification by as much as 15% compared to using traditional statistical
contour features.
Comment: Working on a more updated version
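As one hedged illustration of this recipe (the paper proposes three paradigms; none is reproduced here, and the segment length and latent size are assumptions), an autoencoder over short F0 segments yields a bottleneck that can serve as the latent space into which unseen contours are embedded for classification:

```python
import torch
import torch.nn as nn

SEG_LEN = 200  # assumed F0 frames per short vocal-line segment
LATENT = 64    # assumed embedding size

class ContourAutoencoder(nn.Module):
    """Pseudo-task: reconstruct an F0 segment; the bottleneck `z` becomes
    the representation used for downstream MIR classification."""
    def __init__(self):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Linear(SEG_LEN, 256), nn.ReLU(), nn.Linear(256, LATENT))
        self.decode = nn.Sequential(
            nn.Linear(LATENT, 256), nn.ReLU(), nn.Linear(256, SEG_LEN))

    def forward(self, f0: torch.Tensor):
        z = self.encode(f0)           # latent representation
        return self.decode(z), z

model = ContourAutoencoder()
contour = torch.randn(1, SEG_LEN)     # placeholder F0 contour (e.g., semitones)
_, embedding = model(contour)         # embed an unseen contour, then classify
```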
Acoustic Word Embeddings for Zero-Resource Languages Using Self-Supervised Contrastive Learning and Multilingual Adaptation
Acoustic word embeddings (AWEs) are fixed-dimensional representations of
variable-length speech segments. For zero-resource languages where labelled
data is not available, one AWE approach is to use unsupervised
autoencoder-based recurrent models. Another recent approach is to use
multilingual transfer: a supervised AWE model is trained on several
well-resourced languages and then applied to an unseen zero-resource language.
We consider how a recent contrastive learning loss can be used in both the
purely unsupervised and multilingual transfer settings. Firstly, we show that
terms from an unsupervised term discovery system can be used for contrastive
self-supervision, resulting in improvements over previous unsupervised
monolingual AWE models. Secondly, we consider how multilingual AWE models can
be adapted to a specific zero-resource language using discovered terms. We find
that self-supervised contrastive adaptation outperforms adapted multilingual
correspondence autoencoder and Siamese AWE models, giving the best overall
results in a word discrimination task on six zero-resource languages.
Comment: Accepted to SLT 202
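To make the contrastive setup concrete, here is a hedged PyTorch sketch (the encoder choice, pair construction, and temperature are illustrative assumptions): a GRU maps variable-length segments to fixed-dimensional embeddings, and an NT-Xent-style loss treats pairs of segments from the same word type, e.g. terms from unsupervised term discovery, as positives and the rest of the batch as negatives:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AWEEncoder(nn.Module):
    """Maps a variable-length speech segment to a fixed-dimensional AWE."""
    def __init__(self, n_mels: int = 40, dim: int = 128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, dim, batch_first=True)

    def forward(self, segments: torch.Tensor) -> torch.Tensor:
        # segments: (batch, time, n_mels); final hidden state = embedding
        _, h = self.rnn(segments)
        return F.normalize(h[-1], dim=-1)

def nt_xent(anchors: torch.Tensor, positives: torch.Tensor,
            temperature: float = 0.1) -> torch.Tensor:
    """anchors[i] and positives[i] come from the same (discovered) word
    type; every other segment in the batch serves as a negative."""
    sims = anchors @ positives.t() / temperature   # (batch, batch) similarities
    targets = torch.arange(anchors.size(0))        # match index i with i
    return F.cross_entropy(sims, targets)
```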