LaCViT: A Label-aware Contrastive Training Framework for Vision Transformers
Vision Transformers have been incredibly effective when tackling computer
vision tasks due to their ability to model long feature dependencies. By using
large-scale training data and various self-supervised signals (e.g., masked
random patches), vision transformers provide state-of-the-art performance on
several benchmarking datasets, such as ImageNet-1k and CIFAR-10. However, these
vision transformers pretrained over general large-scale image corpora could
only produce an anisotropic representation space, limiting their
generalizability and transferability to the target downstream tasks. In this
paper, we propose a simple and effective Label-aware Contrastive Training
framework LaCViT, which improves the isotropy of the pretrained representation
space for vision transformers, thereby enabling more effective transfer
learning amongst a wide range of image classification tasks. Through
experimentation over five standard image classification datasets, we
demonstrate that LaCViT-trained models outperform the original pretrained
baselines by around 9% absolute Accuracy@1, and consistent improvements can be
observed when applying LaCViT to our three evaluated vision transformers.
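As an illustration of the label-aware contrastive idea (not the authors' code; the function name, batch shapes, and temperature below are assumptions), a supervised contrastive loss over ViT embeddings can be sketched as:

```python
import torch
import torch.nn.functional as F

def label_aware_contrastive_loss(embeddings: torch.Tensor,
                                 labels: torch.Tensor,
                                 temperature: float = 0.1) -> torch.Tensor:
    """Supervised contrastive loss: samples sharing a class label are pulled
    together, all other samples in the batch are pushed apart.
    embeddings: (B, D) ViT features, labels: (B,) integer class ids."""
    z = F.normalize(embeddings, dim=1)                        # unit-norm features
    sim = z @ z.t() / temperature                             # (B, B) similarity logits
    B = z.size(0)
    self_mask = torch.eye(B, dtype=torch.bool, device=z.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask

    sim = sim.masked_fill(self_mask, float('-inf'))           # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    pos_counts = pos_mask.sum(dim=1)
    valid = pos_counts > 0                                    # anchors with >= 1 positive
    loss = -log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1)[valid] / pos_counts[valid]
    return loss.mean()
```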
Enhancing Visually-Rich Document Understanding via Layout Structure Modeling
In recent years, the use of multi-modal pre-trained Transformers has led to
significant advancements in visually-rich document understanding. However,
existing models have mainly focused on features such as text and vision while
neglecting the importance of the layout relationships between text nodes. In this
paper, we propose GraphLayoutLM, a novel document understanding model that
leverages the modeling of layout structure graph to inject document layout
knowledge into the model. GraphLayoutLM utilizes a graph reordering algorithm
to adjust the text sequence based on the graph structure. Additionally, our
model uses a layout-aware multi-head self-attention layer to learn document
layout knowledge. The proposed model enables the understanding of the spatial
arrangement of text elements, improving document comprehension. We evaluate our
model on various benchmarks, including FUNSD, XFUND, and CORD, achieving
state-of-the-art results on these datasets. Our experimental results
demonstrate that our proposed method provides a significant improvement over
existing approaches and showcases the importance of incorporating layout
information into document understanding models. We also conduct an ablation
study to investigate the contribution of each component of our model. The
results show that both the graph reordering algorithm and the layout-aware
multi-head self-attention layer play a crucial role in achieving the best
performance.
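The abstract does not give the exact form of the layout-aware multi-head self-attention; one illustrative reading, assuming the layout structure is encoded as pairwise relation ids (e.g. above/below/left-of/right-of/overlap) that contribute a learned per-head additive bias to the attention scores (the class name, relation set, and shapes are assumptions), is sketched below:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayoutAwareSelfAttention(nn.Module):
    """Multi-head self-attention with an additive bias derived from pairwise
    layout relations between text nodes."""
    def __init__(self, dim: int, num_heads: int, num_relations: int = 5):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # one learned scalar bias per (relation type, head)
        self.rel_bias = nn.Embedding(num_relations, num_heads)

    def forward(self, x: torch.Tensor, relations: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) text-node features; relations: (B, N, N) relation ids
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        scores = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5   # (B, H, N, N)
        bias = self.rel_bias(relations).permute(0, 3, 1, 2)         # (B, H, N, N)
        attn = F.softmax(scores + bias, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        return self.proj(out)
```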
Efficient Zero-shot Visual Search via Target and Context-aware Transformer
Visual search is a ubiquitous challenge in natural vision, including daily
tasks such as finding a friend in a crowd or searching for a car in a parking
lot. Humans rely heavily on relevant target features to perform goal-directed
visual search. Meanwhile, context is of critical importance for locating a
target object in complex scenes as it helps narrow down the search area and
makes the search process more efficient. However, few works have combined both
target and context information in visual search computational models. Here we
propose a zero-shot deep learning architecture, TCT (Target and Context-aware
Transformer), that modulates self-attention in the Vision Transformer with
target and contextual relevant information to enable human-like zero-shot
visual search performance. Target modulation is computed as patch-wise local
relevance between the target and search images, whereas contextual modulation
is applied in a global fashion. We conduct visual search experiments on TCT and
other competitive visual search models on three natural scene datasets with
varying levels of difficulty. TCT demonstrates human-like performance in terms
of search efficiency and beats the SOTA models in challenging visual search
tasks. Importantly, TCT generalizes well across datasets with novel objects
without retraining or fine-tuning. Furthermore, we also introduce a new dataset
to benchmark models for invariant visual search under incongruent contexts. TCT
manages to search flexibly via target and context modulation, even under
incongruent contexts.
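As a rough illustration of the target-modulation idea (patch-wise local relevance between the target and search images), the sketch below computes a cosine-similarity relevance map from ViT patch embeddings; the function name and the max over target patches are assumptions, not the authors' formulation:

```python
import torch
import torch.nn.functional as F

def target_modulation(search_patches: torch.Tensor,
                      target_patches: torch.Tensor) -> torch.Tensor:
    """Patch-wise local relevance between target and search images.
    search_patches: (B, N, D) ViT patch embeddings of the search image.
    target_patches: (B, M, D) patch embeddings of the target exemplar.
    Returns a (B, N) relevance map: each search patch's best match to any
    target patch."""
    s = F.normalize(search_patches, dim=-1)
    t = F.normalize(target_patches, dim=-1)
    sim = s @ t.transpose(1, 2)          # (B, N, M) cosine similarities
    relevance, _ = sim.max(dim=-1)       # best-matching target patch per location
    return relevance
```

In the full model, such a patch-wise map would modulate the Vision Transformer's self-attention, while the contextual modulation described in the abstract is applied in a global fashion.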
Integrating kNN with Foundation Models for Adaptable and Privacy-Aware Image Classification
Traditional deep learning models implicitly encode knowledge, limiting their
transparency and ability to adapt to data changes. Yet, this adaptability is
vital for addressing user data privacy concerns. We address this limitation by
storing embeddings of the underlying training data independently of the model
weights, enabling dynamic data modifications without retraining. Specifically,
our approach integrates the k-Nearest Neighbor (k-NN) classifier with a
vision-based foundation model, pre-trained self-supervised on natural images,
enhancing interpretability and adaptability. We share open-source
implementations of a previously unpublished baseline method as well as our
performance-improving contributions. Quantitative experiments confirm improved
classification across established benchmark datasets and the method's
applicability to distinct medical image classification tasks. Additionally, we
assess the method's robustness in continual learning and data removal
scenarios. The approach exhibits great promise for bridging the gap between
foundation models' performance and challenges tied to data privacy. The source
code is available at
https://github.com/TobArc/privacy-aware-image-classification-with-kNN. Comment: Accepted at the 21st IEEE International Symposium on Biomedical Imaging (IEEE ISBI 2024).
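The core mechanism, a k-NN classifier over embeddings stored separately from the frozen foundation model, can be sketched as follows (a minimal illustration; the class name and cosine-similarity majority vote are assumptions rather than the released implementation):

```python
import torch
import torch.nn.functional as F

class KNNClassifier:
    """k-NN classifier over frozen foundation-model embeddings. The embedding
    store lives outside the model weights, so samples can be added or removed
    (e.g. on a data-removal request) without any retraining."""
    def __init__(self, k: int = 5):
        self.k = k
        self.embeddings = None
        self.labels = None

    def add(self, emb: torch.Tensor, labels: torch.Tensor) -> None:
        if self.embeddings is None:
            self.embeddings, self.labels = emb, labels
        else:
            self.embeddings = torch.cat([self.embeddings, emb], dim=0)
            self.labels = torch.cat([self.labels, labels], dim=0)

    def remove(self, keep_mask: torch.Tensor) -> None:
        # drop stored samples (e.g. one user's data) -- no retraining needed
        self.embeddings = self.embeddings[keep_mask]
        self.labels = self.labels[keep_mask]

    def predict(self, query: torch.Tensor) -> torch.Tensor:
        q = F.normalize(query, dim=-1)
        db = F.normalize(self.embeddings, dim=-1)
        sim = q @ db.t()                               # cosine similarity to the store
        _, idx = sim.topk(self.k, dim=-1)              # k nearest neighbours
        neighbour_labels = self.labels[idx]            # (Q, k)
        return neighbour_labels.mode(dim=-1).values    # majority vote
```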
GasMono: Geometry-Aided Self-Supervised Monocular Depth Estimation for Indoor Scenes
This paper tackles the challenges of self-supervised monocular depth
estimation in indoor scenes caused by large rotation between frames and low
texture. We ease the learning process by obtaining coarse camera poses from
monocular sequences through multi-view geometry to deal with the former.
However, we found that, limited by the scale ambiguity across different scenes
in the training dataset, a naïve introduction of geometric coarse poses does
not improve performance, which is counter-intuitive. To address this problem,
we propose to refine those poses
during training through rotation and translation/scale optimization. To soften
the effect of the low texture, we combine the global reasoning of vision
transformers with an overfitting-aware, iterative self-distillation mechanism,
providing more accurate depth guidance coming from the network itself.
Experiments on NYUv2, ScanNet, 7scenes, and KITTI datasets support the
effectiveness of each component in our framework, which sets a new
state-of-the-art for indoor self-supervised monocular depth estimation, as well
as outstanding generalization ability. Code and models are available at
https://github.com/zxcqlf/GasMono. Comment: ICCV 2023.
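The abstract describes refining the geometric coarse poses through rotation and translation/scale optimization; a hedged sketch of such a refinement module is given below (the axis-angle residual rotation and log-scale on the translation are assumed parametrisations, not the released code):

```python
import torch
import torch.nn as nn

class PoseRefiner(nn.Module):
    """Refines a coarse relative pose from multi-view geometry with a learned
    residual rotation and a translation scale, addressing the scale ambiguity
    across scenes that makes naive use of coarse poses ineffective."""
    def __init__(self):
        super().__init__()
        self.delta_rot = nn.Parameter(torch.zeros(3))   # residual rotation (axis-angle)
        self.log_scale = nn.Parameter(torch.zeros(1))   # translation scale, log-space

    @staticmethod
    def axis_angle_to_matrix(v: torch.Tensor) -> torch.Tensor:
        # Rodrigues' formula, kept differentiable for end-to-end training
        theta = v.norm() + 1e-8
        k = v / theta
        zero = torch.zeros((), dtype=v.dtype, device=v.device)
        K = torch.stack([torch.stack([zero, -k[2], k[1]]),
                         torch.stack([k[2], zero, -k[0]]),
                         torch.stack([-k[1], k[0], zero])])
        I = torch.eye(3, dtype=v.dtype, device=v.device)
        return I + torch.sin(theta) * K + (1 - torch.cos(theta)) * (K @ K)

    def forward(self, R_coarse: torch.Tensor, t_coarse: torch.Tensor):
        R = self.axis_angle_to_matrix(self.delta_rot) @ R_coarse  # refined rotation
        t = torch.exp(self.log_scale) * t_coarse                  # rescaled translation
        return R, t
```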
HINT: High-quality INpainting Transformer with Mask-Aware Encoding and Enhanced Attention
Existing image inpainting methods leverage convolution-based downsampling approaches to reduce spatial dimensions. This may result in information loss from corrupted images, in which the available information is inherently sparse, especially in the scenario of large missing regions. Recent advances in self-attention mechanisms within transformers have led to significant improvements in many computer vision tasks, including inpainting. However, limited by computational costs, existing methods cannot fully exploit the long-range modelling capabilities of such models. In this paper, we propose an end-to-end High-quality INpainting Transformer, abbreviated as HINT, which consists of a novel mask-aware pixel-shuffle downsampling module (MPD) to preserve the visible information extracted from the corrupted image while maintaining the integrity of the information available for high-level inferences made within the model. Moreover, we propose a Spatially-activated Channel Attention Layer (SCAL), an efficient self-attention mechanism that incorporates spatial awareness to model the corrupted image at multiple scales. To further enhance the effectiveness of SCAL, motivated by recent advances in speech recognition, we introduce a sandwich structure that places feed-forward networks before and after the SCAL module. We demonstrate the superior performance of HINT compared to contemporary state-of-the-art models on four datasets: CelebA, CelebA-HQ, Places2, and Dunhuang.
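How a mask-aware pixel-shuffle downsampling might preserve all visible pixels, by rearranging them into channels instead of discarding them with strided convolution, can be illustrated with the hedged sketch below (the class name, mask handling, and 1x1 projection are assumptions about the MPD module, not the authors' implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskAwarePixelShuffleDown(nn.Module):
    """Downsamples by rearranging pixels into channels (pixel-unshuffle) rather
    than strided convolution, so no visible pixels from the corrupted image are
    discarded; the inpainting mask is downsampled the same way and concatenated."""
    def __init__(self, in_ch: int, out_ch: int, factor: int = 2):
        super().__init__()
        self.factor = factor
        # (image + mask) channels are expanded by factor**2 by pixel-unshuffle
        self.proj = nn.Conv2d((in_ch + 1) * factor ** 2, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) corrupted-image features; mask: (B, 1, H, W), 1 = missing
        x = F.pixel_unshuffle(torch.cat([x, mask], dim=1), self.factor)
        return self.proj(x)
```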
A-JEPA: Joint-Embedding Predictive Architecture Can Listen
This paper shows that the masked-modeling principle driving the success of
large foundational vision models can be effectively applied to audio by making
predictions in a latent space. We introduce Audio-based Joint-Embedding
Predictive Architecture (A-JEPA), a simple extension method for self-supervised
learning from the audio spectrum. Following the design of I-JEPA, our A-JEPA
encodes visible audio spectrogram patches with a curriculum masking strategy
via a context encoder and predicts the representations of regions sampled at
well-designed locations. The target representations of those regions are
extracted by an exponential moving average of the context encoder, i.e., the
target encoder, on the whole spectrogram. We find it beneficial to transition
from random block masking to time-frequency-aware masking in a curriculum
manner, considering that audio spectrograms are highly correlated in local
time and frequency. To enhance contextual semantic understanding and
robustness, we fine-tune the encoder with regularized masking on target
datasets, instead of dropping or zeroing out inputs. Empirically, when built
on the Vision Transformer architecture, we find A-JEPA to be highly scalable; it sets
new state-of-the-art performance on multiple audio and speech classification
tasks, outperforming other recent models that use externally supervised
pre-training. Comment: arXiv admin note: text overlap with arXiv:2207.06405 by other authors.
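The target-encoder update the abstract refers to, an exponential moving average of the context encoder as in I-JEPA, can be sketched as follows (the momentum value and function name are illustrative assumptions):

```python
import torch

@torch.no_grad()
def ema_update(target_encoder: torch.nn.Module,
               context_encoder: torch.nn.Module,
               momentum: float = 0.996) -> None:
    """Keep the target encoder as an exponential moving average of the context
    encoder; a predictor is then trained to regress the target encoder's latent
    representations of the masked spectrogram regions."""
    for t_param, c_param in zip(target_encoder.parameters(),
                                context_encoder.parameters()):
        t_param.mul_(momentum).add_(c_param, alpha=1.0 - momentum)
```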
Detection-aware multi-object tracking evaluation
Master's degree in Deep Learning for Audio and Video Signal Processing. Multi-Object Tracking (MOT) is a hot topic in the computer vision field. It is a
complex task that requires a detector, to identify objects, and a tracker, to follow
them. It is useful for self-driving, surveillance, and robot vision, among others, where
research teams and companies are trying to improve their models. In order to determine
which model performs better, they are scored using tracking metrics.
In this thesis we experiment with detection-aware MOT metrics by using correlation matrices. By analyzing the results, we realize that tracking metrics suffer from
certain issues that prevent them from correctly reflecting tracking performance. The
performance of the detector is relevant when scoring tracking models. The problem
observed is that tracking metrics weigh differently the elements that evaluate detection
performance. Thus, improving an aspect of the detector with a high weight in the MOT
metric will significantly improve the tracker's score, without necessarily indicating the
amount of effort done by the tracker. That is, trackers are not evaluated in a balanced
way.
In order to solve this issue with tracker scoring, we present a new multi-object
tracking metric based on the effort done by the tracker given a certain set of detections.
This effort is calculated from the improvement of the bounding boxes over those
given by the detector and the precision in keeping the trace of the objects in a sequence.
The metric has been tested on two widely employed datasets and proves reliable
for scoring trackers. Also, it does not incur the problem presented above.
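One ingredient of the proposed effort-based metric, the improvement of the tracker's bounding boxes over those provided by the detector, could be computed along the following lines (a simplified sketch that assumes per-object matched boxes and omits the trace-keeping term; it is not the thesis' exact formulation):

```python
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def box_refinement_effort(det_boxes, trk_boxes, gt_boxes) -> float:
    """Mean per-object IoU gain of the tracker's boxes over the detector's
    boxes, measured against ground truth; positive values mean the tracker
    improved the detections rather than merely relaying them."""
    gains = [iou(t, g) - iou(d, g)
             for d, t, g in zip(det_boxes, trk_boxes, gt_boxes)]
    return float(np.mean(gains))
```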
TiMix: Text-aware Image Mixing for Effective Vision-Language Pre-training
Self-supervised Multi-modal Contrastive Learning (SMCL) remarkably advances
modern Vision-Language Pre-training (VLP) models by aligning visual and
linguistic modalities. Due to noises in web-harvested text-image pairs,
however, scaling up training data volume in SMCL presents considerable
obstacles in terms of computational cost and data inefficiency. To improve data
efficiency in VLP, we propose Text-aware Image Mixing (TiMix), which integrates
mix-based data augmentation techniques into SMCL, yielding significant
performance improvements without significantly increasing computational
overhead. We provide a theoretical analysis of TiMixfrom a mutual information
(MI) perspective, showing that mixed data samples for cross-modal contrastive
learning implicitly serve as a regularizer for the contrastive loss. The
experimental results demonstrate that TiMix exhibits a comparable performance
on downstream tasks, even with a reduced amount of training data and shorter
training time, when benchmarked against existing methods. This work empirically
and theoretically demonstrates the potential of data mixing for data-efficient
and computationally viable VLP, benefiting broader VLP model adoption in
practical scenarios. Comment: Accepted at AAAI202
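A hedged sketch of text-aware image mixing in a cross-modal contrastive setup, mixing two images and weighting the contrastive targets of both captions by the mixing coefficient, is given below (the function name, CLIP-style cross-entropy loss, and fixed lam are illustrative assumptions, not the paper's exact objective):

```python
import torch
import torch.nn.functional as F

def mixed_image_contrastive_loss(img_a, img_b, text_emb_a, text_emb_b,
                                 image_encoder, lam: float = 0.5,
                                 temperature: float = 0.07) -> torch.Tensor:
    """Mix two images, encode the mixture, and supervise it against both
    captions' text embeddings weighted by the mixing coefficient lam, so the
    mixed sample acts as an implicit regularizer on the contrastive loss."""
    mixed = lam * img_a + (1.0 - lam) * img_b
    v = F.normalize(image_encoder(mixed), dim=-1)          # (B, D) image features
    t_a = F.normalize(text_emb_a, dim=-1)
    t_b = F.normalize(text_emb_b, dim=-1)
    logits_a = v @ t_a.t() / temperature                   # (B, B) image-to-text logits
    logits_b = v @ t_b.t() / temperature
    targets = torch.arange(v.size(0), device=v.device)     # matched pairs on the diagonal
    return lam * F.cross_entropy(logits_a, targets) + \
           (1.0 - lam) * F.cross_entropy(logits_b, targets)
```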