To Compress or Not to Compress -- Self-Supervised Learning and Information Theory: A Review
Deep neural networks have demonstrated remarkable performance in supervised
learning tasks but require large amounts of labeled data. Self-supervised
learning offers an alternative paradigm, enabling the model to learn from data
without explicit labels. Information theory has been instrumental in
understanding and optimizing deep neural networks. Specifically, the
information bottleneck principle has been applied to optimize the trade-off
between compression and relevant information preservation in supervised
settings. However, the optimal information objective in self-supervised
learning remains unclear. In this paper, we review various approaches to
self-supervised learning from an information-theoretic standpoint and present a
unified framework that formalizes the \textit{self-supervised
information-theoretic learning problem}. We integrate existing research into a
coherent framework, examine recent self-supervised methods, and identify
research opportunities and challenges. Moreover, we discuss empirical
measurement of information-theoretic quantities and their estimators. This
paper offers a comprehensive review of the intersection between information
theory, self-supervised learning, and deep neural networks.
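As background for the compression trade-off discussed above, the classical information bottleneck objective (given here in its standard form, not as a formula taken from this review) is:

```latex
\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y)
```

where $Z$ is the compressed representation of input $X$, $Y$ is the supervised target, $I(\cdot;\cdot)$ denotes mutual information, and $\beta$ controls the trade-off between compression ($I(X;Z)$) and preservation of task-relevant information ($I(Z;Y)$). The review's central question is what replaces $I(Z;Y)$ when no labels $Y$ are available.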
Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework
We propose a self-supervised method to learn feature representations from
videos. A standard approach in traditional self-supervised methods uses
positive-negative data pairs to train with a contrastive learning strategy. In
such a case, different modalities of the same video are treated as positives
and video clips from a different video are treated as negatives. Because the
spatio-temporal information is important for video representation, we extend
the negative samples by introducing intra-negative samples, which are
transformed from the same anchor video by breaking temporal relations in video
clips. With the proposed Inter-Intra Contrastive (IIC) framework, we can train
spatio-temporal convolutional networks to learn video representations. There
are many flexible options in our IIC framework and we conduct experiments by
using several different configurations. Evaluations are conducted on video
retrieval and video recognition tasks using the learned video representation.
Our proposed IIC outperforms current state-of-the-art results by a large
margin, e.g., improvements of 16.7 and 9.5 percentage points in top-1 video
retrieval accuracy on the UCF101 and HMDB51 datasets, respectively. For video recognition,
improvements can also be obtained on these two benchmark datasets. Code is
available at
https://github.com/BestJuly/Inter-intra-video-contrastive-learning.
Comment: Accepted by ACMMM 2020. Our project page is at
https://bestjuly.github.io/Inter-intra-video-contrastive-learning
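The inter-/intra-negative idea above can be sketched as an InfoNCE-style loss whose denominator pools both clips from other videos (inter-negatives) and temporally shuffled clips of the anchor video (intra-negatives). This is a minimal numpy sketch, not the authors' implementation; all function and variable names are hypothetical.

```python
import numpy as np

def l2n(x):
    """L2-normalize feature vectors along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def infonce_with_intra_negatives(anchor, positive, inter_negs, intra_negs, tau=0.07):
    """Toy InfoNCE loss where negatives include both clips from other videos
    (inter) and temporally shuffled clips of the same video (intra).
    anchor/positive: (d,) normalized features; *_negs: (k, d) normalized rows."""
    pos_sim = anchor @ positive / tau
    neg_sims = np.concatenate([inter_negs @ anchor, intra_negs @ anchor]) / tau
    logits = np.concatenate([[pos_sim], neg_sims])
    # cross-entropy with the positive pair at index 0
    return -pos_sim + np.log(np.exp(logits).sum())

rng = np.random.default_rng(0)
anchor = l2n(rng.normal(size=128))
positive = l2n(anchor + 0.1 * rng.normal(size=128))  # another modality, same video
inter = l2n(rng.normal(size=(8, 128)))               # clips from other videos
intra = l2n(rng.normal(size=(4, 128)))               # frame-shuffled anchor clips
loss = infonce_with_intra_negatives(anchor, positive, inter, intra)
```

Breaking temporal order to manufacture intra-negatives forces the encoder to attend to temporal structure, not just appearance, which is the core motivation of the IIC framework.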
Robust Multiview Multimodal Driver Monitoring System Using Masked Multi-Head Self-Attention
Driver Monitoring Systems (DMSs) are crucial for safe hand-over actions in
Level-2+ self-driving vehicles. State-of-the-art DMSs leverage multiple sensors
mounted at different locations to monitor the driver and the vehicle's interior
scene and employ decision-level fusion to integrate these heterogeneous data.
However, this fusion method may not fully utilize the complementarity of
different data sources and may overlook their relative importance. To address
these limitations, we propose a novel multiview multimodal driver monitoring
system based on feature-level fusion through multi-head self-attention (MHSA).
We demonstrate its effectiveness by comparing it against four alternative
fusion strategies (Sum, Conv, SE, and AFF). We also present a novel
GPU-friendly supervised contrastive learning framework, SuMoCo, to learn better
representations. Furthermore, we provide fine-grained annotations for the test
split of the DAD dataset to enable multi-class recognition of drivers'
activities. Experiments on
this enhanced database demonstrate that 1) the proposed MHSA-based fusion
method (AUC-ROC: 97.0\%) outperforms all baselines and previous approaches, and
2) training MHSA with patch masking can improve its robustness against
modality/view collapses. The code and annotations are publicly available.
Comment: 9 pages (1 for reference); accepted by the 6th Multimodal Learning
and Applications Workshop (MULA) at CVPR 202
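Feature-level fusion via self-attention, as proposed above, can be illustrated by treating each camera view or modality as one token and letting attention weight the views before pooling. The sketch below uses a single attention head for brevity (the paper uses multi-head attention); the projections are randomly initialized stand-ins for learned weights, and all names are hypothetical.

```python
import numpy as np

def attention_fuse(view_feats, d_k=None, seed=1):
    """Fuse per-view/per-modality feature tokens with single-head scaled
    dot-product self-attention, then average the attended tokens.
    view_feats: (num_views, d) array, one feature vector per camera/modality."""
    n, d = view_feats.shape
    d_k = d_k or d
    rng = np.random.default_rng(seed)
    # stand-ins for learned query/key/value projections
    Wq, Wk, Wv = (rng.normal(scale=d ** -0.5, size=(d, d_k)) for _ in range(3))
    Q, K, V = view_feats @ Wq, view_feats @ Wk, view_feats @ Wv
    scores = Q @ K.T / np.sqrt(d_k)                      # (n, n) view-to-view scores
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)             # row-wise softmax
    return (attn @ V).mean(axis=0)                       # fused representation, (d_k,)

fused = attention_fuse(np.random.default_rng(2).normal(size=(4, 64)))
```

Unlike decision-level fusion (averaging per-view predictions), the attention weights here depend on the content of all views jointly, so an occluded or uninformative view can be down-weighted; masking random view tokens during training is what the paper reports as improving robustness to view collapse.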
Local Manifold Augmentation for Multiview Semantic Consistency
Multiview self-supervised representation learning is rooted in exploring
semantic consistency across data with complex intra-class variation. Such
variation is not
directly accessible and therefore simulated by data augmentations. However,
commonly adopted augmentations are handcrafted and limited to simple
geometrical and color changes, which are unable to cover the abundant
intra-class variation. In this paper, we propose to extract the underlying data
variation from datasets and construct a novel augmentation operator, named
local manifold augmentation (LMA). LMA is achieved by training an
instance-conditioned generator to fit the distribution on the local manifold of
data and sampling multiview data using it. LMA shows the ability to create an
infinite number of data views, preserve semantics, and simulate complicated
variations in object pose, viewpoint, lighting conditions, background, etc.
Experiments show that with LMA integrated, self-supervised learning methods
such as MoCov2 and SimSiam gain consistent improvements on prevalent benchmarks
including CIFAR10, CIFAR100, STL10, ImageNet100, and ImageNet. Furthermore, LMA
leads to representations with greater invariance to viewpoint, object pose, and
illumination changes, and stronger robustness to various real distribution
shifts reflected by ImageNet-V2, ImageNet-R, ImageNet-Sketch, etc.
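The LMA recipe, sample views by perturbing an instance-conditioned latent code and decoding, can be reduced to a toy sketch. Here the learned encoder and generator are replaced by random linear maps purely for illustration; this is an assumed simplification of the method, and every name below is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_img, d_lat = 32, 8
enc = rng.normal(scale=d_img ** -0.5, size=(d_img, d_lat))  # stand-in encoder
dec = rng.normal(scale=d_lat ** -0.5, size=(d_lat, d_img))  # stand-in generator

def lma_views(x, num_views=2, sigma=0.1):
    """Sample views near x on the (linear, toy) data manifold: perturb the
    instance's latent code with small noise and decode each perturbed code.
    Small sigma keeps samples on the local manifold, preserving semantics."""
    z = x @ enc
    return np.stack([(z + sigma * rng.normal(size=d_lat)) @ dec
                     for _ in range(num_views)])

x = rng.normal(size=d_img)
views = lma_views(x, num_views=3)
```

Because views are drawn from a generative model rather than a fixed transform list, the number of distinct views is unbounded, which is what allows LMA to cover pose, viewpoint, and lighting variation that crop-and-jitter augmentations cannot.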
Constructive Assimilation: Boosting Contrastive Learning Performance through View Generation Strategies
Transformations based on domain expertise (expert transformations), such as
random-resized-crop and color-jitter, have proven critical to the success of
contrastive learning techniques such as SimCLR. Recently, several attempts have
been made to replace such domain-specific, human-designed transformations with
generated views that are learned. However, for image data, none of these
view-generation methods has so far been able to outperform expert
transformations. In this work, we tackle a different question: instead of
replacing expert transformations with generated views, can we constructively
assimilate generated views with expert transformations? We answer this question
in the affirmative and propose a view generation method and a simple, effective
assimilation method that together improve the state-of-the-art by up to ~3.6%
on three different datasets. Importantly, we conduct a detailed empirical study
that systematically analyzes a range of view generation and assimilation
methods and provides a holistic picture of the efficacy of learned views in
contrastive representation learning.
Comment: Accepted at Generative Models for Computer Vision Workshop 202
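One natural reading of "constructive assimilation" is that generated views are composed with, rather than substituted for, expert transformations when building contrastive training pairs. The sketch below illustrates that composition with stand-in transforms; the abstract does not specify the exact scheme, so treat this as an assumed illustration with hypothetical names throughout.

```python
import random

def expert_transform(x):
    """Stand-in for expert transformations such as random-resized-crop
    and color-jitter: a random per-sample intensity scaling."""
    s = random.uniform(0.8, 1.2)
    return [v * s for v in x]

def generated_view(x):
    """Stand-in for a learned view generator: small additive noise."""
    return [v + random.gauss(0.0, 0.05) for v in x]

def assimilated_pair(x, p_gen=0.5):
    """Return two views for contrastive training: one always comes from the
    expert pipeline; the other is, with probability p_gen, a generated view
    that is then further expert-transformed (assimilation, not replacement)."""
    v1 = expert_transform(x)
    if random.random() < p_gen:
        v2 = expert_transform(generated_view(x))
    else:
        v2 = expert_transform(x)
    return v1, v2

random.seed(0)
v1, v2 = assimilated_pair([1.0, 2.0, 3.0])
```

The key design point is that expert transformations are never removed from the pipeline, so the method can only add variation on top of a strategy already known to work well for imagery.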
Augmentation is AUtO-Net: Augmentation-Driven Contrastive Multiview Learning for Medical Image Segmentation
Deep learning segmentation algorithms that learn complex organ and tissue
patterns and extract essential regions of interest from noisy backgrounds have
achieved impressive results in Medical Image Computing (MIC), improving visual
support for medical image diagnosis. This thesis
focuses on retinal blood vessel segmentation tasks, providing an extensive
literature review of deep learning-based medical image segmentation approaches
while comparing the methodologies and empirical performances. The work also
examines the limitations of current state-of-the-art methods by pointing out
the two significant existing limitations: data size constraints and the
dependency on high computational resources. To address these problems, this
work proposes a simple, efficient multiview learning framework that
contrastively learns invariant vessel feature representations by comparing
multiple augmented views produced by various transformations, thereby
mitigating the data shortage and improving generalisation ability. Moreover,
the hybrid network architecture
integrates the attention mechanism into a Convolutional Neural Network to
further capture complex, continuous curvilinear vessel structures. The proposed
method is validated on the CHASE-DB1 dataset, attaining the highest F1 score of
83.46% and the highest Intersection over Union (IoU) score of 71.62% with a
UNet structure, surpassing existing benchmark UNet-based methods by 1.95% and
2.8%, respectively. The combination of the metrics
indicates that the model detects vessels accurately, with predicted locations
closely matching the ground truth. Moreover, the proposed approach can be
trained within 30 minutes while consuming less than 3 GB of GPU RAM, and these
characteristics support efficient implementation for real-world applications
and deployments.
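The abstract mentions integrating an attention mechanism into a CNN but does not specify the block; one common and lightweight choice is squeeze-and-excitation style channel attention, sketched below as an assumed illustration (the thesis may use a different mechanism; weights here are random stand-ins).

```python
import numpy as np

def se_channel_attention(feat, W1, W2):
    """Squeeze-and-excitation style channel attention over a CNN feature map.
    feat: (C, H, W) feature map; W1: (C, C//r); W2: (C//r, C) with reduction r."""
    squeeze = feat.mean(axis=(1, 2))                # global average pool -> (C,)
    hidden = np.maximum(squeeze @ W1, 0.0)          # bottleneck + ReLU
    gate = 1.0 / (1.0 + np.exp(-(hidden @ W2)))     # per-channel sigmoid gate
    return feat * gate[:, None, None]               # rescale channels

rng = np.random.default_rng(0)
C, r = 16, 4
feat = rng.normal(size=(C, 8, 8))
out = se_channel_attention(feat,
                           rng.normal(size=(C, C // r)),
                           rng.normal(size=(C // r, C)))
```

Channel-gating blocks like this add very few parameters and negligible compute, which is consistent with the abstract's emphasis on training in under 30 minutes with less than 3 GB of GPU RAM.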