TempCLR: Temporal Alignment Representation with Contrastive Learning
Video representation learning has been successful in video-text pre-training
for zero-shot transfer, where each sentence is trained to be close to the
paired video clips in a common feature space. For long videos, given a
paragraph of description where the sentences describe different segments of the
video, by matching all sentence-clip pairs, the paragraph and the full video
are aligned implicitly. However, such a unit-level similarity measure may ignore
the global temporal context over a long time span, which inevitably limits the
generalization ability. In this paper, we propose a contrastive learning
framework TempCLR to compare the full video and the paragraph explicitly. As
the video/paragraph is formulated as a sequence of clips/sentences, under the
constraint of their temporal order, we use dynamic time warping to compute the
minimum cumulative cost over sentence-clip pairs as the sequence-level
distance. To explore the temporal dynamics, we break the consistency of
temporal order by shuffling the video clips or sentences according to the
temporal granularity. In this way, we obtain the representations for
clips/sentences, which perceive the temporal information and thus facilitate
the sequence alignment. In addition to pre-training on the video and paragraph,
our approach can also generalize to matching between different video
instances. We evaluate our approach on video retrieval, action step
localization, and few-shot action recognition, and achieve consistent
performance gains across all three tasks. Detailed ablation studies are provided
to justify the approach design.
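The sequence-level distance described in this abstract can be illustrated with a minimal dynamic time warping sketch. This is not the authors' implementation: the cosine pairwise cost, the function names, and the toy two-dimensional embeddings are all assumptions made for illustration. It only shows how a minimum cumulative cost over order-preserving sentence-clip pairs can be computed, and how shuffling the clips raises that cost.

```python
import math


def cosine_distance(u, v):
    """Pairwise cost between two embeddings (assumed choice: 1 - cosine similarity)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def dtw_distance(sentences, clips):
    """Minimum cumulative cost of a monotonic alignment between two sequences."""
    n, m = len(sentences), len(clips)
    inf = float("inf")
    # acc[i][j] holds the best cost of aligning the first i sentences
    # with the first j clips; row 0 and column 0 are the boundary.
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = cosine_distance(sentences[i - 1], clips[j - 1])
            acc[i][j] = cost + min(
                acc[i - 1][j],      # the same clip also matches the previous sentence
                acc[i][j - 1],      # the same sentence also matches the previous clip
                acc[i - 1][j - 1],  # advance both sequences
            )
    return acc[n][m]


# Toy data (hypothetical): two sentences, three clips in temporal order.
sentences = [[1.0, 0.0], [0.0, 1.0]]
clips_in_order = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
clips_shuffled = [[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]

aligned = dtw_distance(sentences, clips_in_order)
shuffled = dtw_distance(sentences, clips_shuffled)
print(aligned, shuffled)
```

In a contrastive setup of the kind the abstract describes, the correctly ordered sequence would serve as the positive and the shuffled one as a negative; the pure-Python loops here are chosen for readability rather than speed.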
Audio self-supervised learning: a survey
Inspired by humans' cognitive ability to generalise knowledge and skills,
Self-Supervised Learning (SSL) aims at discovering general representations
from large-scale data without requiring human annotation, which is an
expensive and time-consuming task. Its success in the fields of computer vision
and natural language processing has prompted its recent adoption into the
field of audio and speech processing. Comprehensive reviews summarising the
knowledge in audio SSL are currently missing. To fill this gap, in the present
work, we provide an overview of the SSL methods used for audio and speech
processing applications. Herein, we also summarise the empirical works that
exploit the audio modality in multi-modal SSL frameworks, and the existing
suitable benchmarks to evaluate the power of SSL in the computer audition
domain. Finally, we discuss some open problems and point out future
directions for the development of audio SSL.
Seamless Multimodal Biometrics for Continuous Personalised Wellbeing Monitoring
Artificially intelligent perception is increasingly present in the lives of
every one of us. Vehicles are no exception, (...) In the near future, pattern
recognition will have an even stronger role in vehicles, as self-driving cars
will require automated ways to understand what is happening around (and within)
them and act accordingly. (...) This doctoral work focused on advancing
in-vehicle sensing through the research of novel computer vision and pattern
recognition methodologies for both biometrics and wellbeing monitoring. The
main focus has been on electrocardiogram (ECG) biometrics, a trait well-known
for its potential for seamless driver monitoring. Major efforts were devoted to
achieving improved performance in identification and identity verification in
off-the-person scenarios, well-known for increased noise and variability. Here,
end-to-end deep learning ECG biometric solutions were proposed and important
topics were addressed such as cross-database and long-term performance,
waveform relevance through explainability, and interlead conversion. Face
biometrics, a natural complement to the ECG in seamless unconstrained
scenarios, was also studied in this work. The open challenges of masked face
recognition and interpretability in biometrics were tackled in an effort to
evolve towards algorithms that are more transparent, trustworthy, and robust to
significant occlusions. Within the topic of wellbeing monitoring, improved
solutions to multimodal emotion recognition in groups of people and
activity/violence recognition in in-vehicle scenarios were proposed. Lastly,
we also proposed a novel way to learn template security within end-to-end
models, dispensing with additional separate encryption processes, and a
self-supervised learning approach tailored to sequential data, in order to
ensure data security and optimal performance. (...)
Comment: Doctoral thesis presented and approved on the 21st of December 2022
to the University of Port
Self-Supervised Training of Speaker Encoder with Multi-Modal Diverse Positive Pairs
We study a novel neural architecture for a speaker encoder, together with its
training strategies, for speaker recognition without using any identity labels.
The speaker encoder is trained to extract a fixed-size speaker embedding from a
spoken utterance of variable length. Contrastive learning is a typical self-supervised
learning technique. However, the quality of the speaker encoder depends very
much on the sampling strategy of positive and negative pairs. It is common that
we sample a positive pair of segments from the same utterance. Unfortunately,
such poor-man's positive pairs (PPP) lack necessary diversity for the training
of a robust encoder. In this work, we propose a multi-modal contrastive
learning technique with novel sampling strategies. By cross-referencing between
speech and face data, we study a method that finds diverse positive pairs (DPP)
for contrastive learning, thus improving the robustness of the speaker encoder.
We train the speaker encoder on the VoxCeleb2 dataset without any speaker
labels, and achieve an equal error rate (EER) of 2.89%, 3.17% and 6.27%
under the proposed progressive clustering strategy, and an EER of 1.44%,
1.77% and 3.27% under the two-stage learning strategy with pseudo labels, on
the three test sets of VoxCeleb1. This novel solution outperforms the
state-of-the-art self-supervised learning methods by a large margin while
achieving results comparable with its supervised learning
counterpart. We also evaluate our self-supervised learning technique on LRS2
and LRW datasets, where the speaker information is unknown. All experiments
suggest that the proposed neural architecture and sampling strategies are
robust across datasets.
Comment: 13 page
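The role of positive and negative pairs in contrastive training, as discussed in this abstract, can be illustrated with a generic InfoNCE-style loss. This is a sketch under stated assumptions, not the paper's objective or sampling strategy: the function name `info_nce`, the temperature value, and the toy embeddings are hypothetical. Each anchor embedding is paired with one positive, and the positives of the other anchors in the batch act as negatives.

```python
import math


def _normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss; positives[i] is the positive for anchors[i]."""
    a = [_normalize(v) for v in anchors]
    p = [_normalize(v) for v in positives]
    total = 0.0
    for i, anchor in enumerate(a):
        # Similarity of this anchor to every candidate in the batch.
        logits = [
            sum(x * y for x, y in zip(anchor, cand)) / temperature for cand in p
        ]
        # Numerically stable log-sum-exp for the softmax denominator.
        top = max(logits)
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)


# Toy embeddings (hypothetical): two speakers, one segment pair each.
anchors = [[1.0, 0.0], [0.0, 1.0]]
good_positives = [[0.9, 0.1], [0.1, 0.9]]  # each positive matches its anchor
bad_positives = [[0.1, 0.9], [0.9, 0.1]]   # positives swapped between speakers

print(info_nce(anchors, good_positives), info_nce(anchors, bad_positives))
```

The loss is low when each anchor is closest to its own positive and high when the pairs are mismatched, which is why the quality of the sampled pairs matters so much for the resulting encoder.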