UnFuSeD: UNsupervised Finetuning Using SElf supervised Distillation
In this paper, we introduce UnFuSeD, a novel approach to leverage
self-supervised learning and reduce the need for large amounts of labeled data
for audio classification. Unlike prior works, which directly fine-tune a
self-supervised pre-trained encoder on a target dataset, we use the encoder to
generate pseudo-labels for unsupervised fine-tuning before the actual
fine-tuning step. We first train an encoder using a novel self-supervised
learning algorithm (SSL) on an unlabeled audio dataset. Then, we use that
encoder to generate pseudo-labels on our target task dataset via clustering the
extracted representations. These pseudo-labels are then used to guide
self-distillation on a randomly initialized model, which we call unsupervised
fine-tuning. Finally, the resultant encoder is then fine-tuned on our target
task dataset. Through UnFuSeD, we propose the first system that moves away from
generic SSL paradigms in literature, which pre-train and fine-tune the same
encoder, and present a novel self-distillation-based system to leverage SSL
pre-training for low-resource audio classification. In practice, UnFuSeD
achieves state-of-the-art results on the LAPE Benchmark, significantly
outperforming all our baselines. Additionally, UnFuSeD allows us to achieve
this at a 40% reduction in the number of parameters over the previous
state-of-the-art system. We make all our code publicly available.
Comment: Under review at the ICASSP 2023 SASB Workshop
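The pseudo-labeling step described above, clustering representations extracted by the SSL-pretrained encoder and treating cluster ids as targets for unsupervised fine-tuning, can be sketched as follows. This is a minimal NumPy k-means; the encoder, feature dimensionality, and number of clusters are illustrative assumptions, not details from the paper:

```python
import numpy as np

def kmeans_pseudo_labels(features, k, iters=20, seed=0):
    """Cluster extracted representations; return cluster ids as pseudo-labels."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from k randomly chosen samples.
    centers = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        # Assign each sample to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(features[:, None] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned samples.
        for j in range(k):
            if (labels == j).any():
                centers[j] = features[labels == j].mean(axis=0)
    return labels

# Stand-in for representations produced by the SSL-pretrained audio encoder.
feats = np.random.default_rng(1).normal(size=(100, 16))
pseudo = kmeans_pseudo_labels(feats, k=5)
```

In the full pipeline these pseudo-labels would then supervise self-distillation of a randomly initialized model before the final fine-tuning step.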
Multi-learner based recursive supervised training
In this paper, we propose the Multi-Learner Based Recursive Supervised Training (MLRT) algorithm, which uses the existing framework of recursive task decomposition: train on the entire dataset, set aside the best-learnt patterns, and repeat the process on the remaining patterns. Instead of having a single learner classify all the data during each recursion, an appropriate learner is chosen from a set of three, based on the subset of data being trained, thereby avoiding the time overhead associated with the genetic-algorithm learner used in previous approaches. In this way, MLRT seeks to identify the inherent characteristics of the dataset and exploit them to train on the data accurately and efficiently. Empirically, MLRT performs considerably well compared to RPHP and other systems on benchmark data, with an 11% improvement in accuracy on the SPAM dataset and comparable performance on the VOWEL and TWO-SPIRAL problems. In addition, for most datasets, the time taken by MLRT is considerably lower than that of other systems with comparable accuracy. Two heuristic versions, MLRT-2 and MLRT-3, are also introduced to improve efficiency and make the system more scalable for future updates; their performance is similar to that of the original MLRT system.
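The recursive decomposition loop described in the abstract — train on the current subset, set aside correctly learnt patterns, recurse on the remainder — can be sketched as below. This is illustrative only: it uses a single stand-in nearest-centroid learner and a naive accuracy-based selection rule, not the paper's three learners or its actual selection criterion:

```python
import numpy as np

def nearest_centroid_fit(X, y):
    """A toy learner: one centroid per class."""
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def nearest_centroid_predict(model, X):
    classes, centroids = model
    dists = np.linalg.norm(X[:, None] - centroids[None], axis=-1)
    return classes[dists.argmin(axis=1)]

def recursive_train(X, y, learners, max_depth=3):
    """Train on all data, set aside learnt patterns, recurse on the rest."""
    stages = []
    for _ in range(max_depth):
        if len(X) == 0:
            break
        # Choose the learner that fits this subset best (stand-in selection rule).
        fit, predict = max(
            learners, key=lambda L: (L[1](L[0](X, y), X) == y).mean()
        )
        model = fit(X, y)
        correct = nearest_correct = predict(model, X) == y
        stages.append((model, predict))
        X, y = X[~correct], y[~correct]  # recurse on the unlearnt remainder
    return stages
```

Each stage in `stages` handles the patterns its predecessors failed to learn, mirroring the recursive task decomposition the abstract describes.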
VoxCeleb2: Deep Speaker Recognition
The objective of this paper is speaker recognition under noisy and
unconstrained conditions.
We make two key contributions. First, we introduce a very large-scale
audio-visual speaker recognition dataset collected from open-source media.
Using a fully automated pipeline, we curate VoxCeleb2 which contains over a
million utterances from over 6,000 speakers. This is several times larger than
any publicly available speaker recognition dataset.
Second, we develop and compare Convolutional Neural Network (CNN) models and
training strategies that can effectively recognise identities from voice under
various conditions. The models trained on the VoxCeleb2 dataset surpass the
performance of previous works on a benchmark dataset by a significant margin.
Comment: To appear in Interspeech 2018. The audio-visual dataset can be
downloaded from http://www.robots.ox.ac.uk/~vgg/data/voxceleb2 .
1806.05622v2: minor fixes; 5 pages
Identify, locate and separate: Audio-visual object extraction in large video collections using weak supervision
We tackle the problem of audiovisual scene analysis for weakly-labeled data.
To this end, we build upon our previous audiovisual representation learning
framework to perform object classification in noisy acoustic environments and
integrate audio source enhancement capability. This is made possible by a novel
use of non-negative matrix factorization for the audio modality. Our approach
is founded on the multiple instance learning paradigm. Its effectiveness is
established through experiments over a challenging dataset of music instrument
performance videos. We also show encouraging visual object localization
results
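The "novel use of non-negative matrix factorization for the audio modality" mentioned above refers to decomposing a non-negative audio representation (e.g. a magnitude spectrogram) into spectral templates and temporal activations. A generic sketch of standard multiplicative-update NMF — not the paper's specific variant — could look like this:

```python
import numpy as np

def nmf(V, rank, iters=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V ≈ W @ H via multiplicative updates."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    for _ in range(iters):
        # Multiplicative updates keep W and H non-negative throughout.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Stand-in "spectrogram": frequency bins x time frames, non-negative.
V = np.abs(np.random.default_rng(1).normal(size=(64, 100)))
W, H = nmf(V, rank=8)  # W: spectral templates, H: temporal activations
```

Separating a source then amounts to reconstructing `W[:, idx] @ H[idx, :]` for the subset of components attributed to that source; how components are attributed to objects is where the weak supervision in the paper comes in.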