High-dimensional approximate nearest neighbor: k-d Generalized Randomized Forests
We propose a new data structure, the generalized randomized k-d forest, or
k-d GeRaF, for approximate nearest neighbor searching in high dimensions. In
particular, we introduce new randomization techniques to specify a set of
independently constructed trees where search is performed simultaneously, hence
increasing accuracy. We omit backtracking, and we optimize distance
computations, thus accelerating queries. We release the public domain software
GeRaF and compare it to existing implementations of state-of-the-art methods
including BBD-trees, Locality Sensitive Hashing, randomized kd forests, and
product quantization. Experimental results indicate that our method would be
the method of choice in dimensions around 1,000, and probably up to 10,000, and
for pointsets of cardinality up to a few hundred thousand or even one million;
this range of inputs is encountered in many critical applications today. For
instance, we handle a real dataset of images represented in 960 dimensions with
a query time of less than sec on average and 90% of responses being true
nearest neighbors.
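As a rough illustration of the underlying technique (not the released GeRaF implementation), a randomized k-d forest can be sketched as follows: each tree splits on a dimension drawn at random among the highest-variance coordinates, queries descend every tree once without backtracking, and the pooled leaf candidates are ranked by exact distance. All class and parameter names below are hypothetical.

```python
# A rough sketch of a randomized k-d forest for approximate NN search.
# Illustrative only, not the released GeRaF library; all names hypothetical.
import numpy as np

class RandomizedKDTree:
    def __init__(self, data, ids, leaf_size=16, top_dims=8, rng=None):
        self.rng = rng or np.random.default_rng()
        self.data, self.leaf_size, self.top_dims = data, leaf_size, top_dims
        self.root = self._build(ids)

    def _build(self, ids):
        if len(ids) <= self.leaf_size:
            return ("leaf", ids)
        pts = self.data[ids]
        # Randomization: split on a dimension drawn among the highest-variance ones.
        dim = self.rng.choice(np.argsort(pts.var(axis=0))[-self.top_dims:])
        threshold = np.median(pts[:, dim])
        mask = pts[:, dim] <= threshold
        if mask.all() or not mask.any():          # degenerate split: stop here
            return ("leaf", ids)
        return ("node", dim, threshold,
                self._build(ids[mask]), self._build(ids[~mask]))

    def candidates(self, q):
        # Single root-to-leaf descent; backtracking is omitted entirely.
        node = self.root
        while node[0] == "node":
            _, dim, threshold, left, right = node
            node = left if q[dim] <= threshold else right
        return node[1]

class RandomizedKDForest:
    def __init__(self, data, n_trees=8, **kw):
        self.data = data
        ids = np.arange(len(data))
        self.trees = [RandomizedKDTree(data, ids, **kw) for _ in range(n_trees)]

    def query(self, q, k=1):
        # Search all trees "simultaneously": pool their leaves, rank by exact distance.
        cand = np.unique(np.concatenate([t.candidates(q) for t in self.trees]))
        dists = np.linalg.norm(self.data[cand] - q, axis=1)
        return cand[np.argsort(dists)[:k]]

points = np.random.rand(10000, 960).astype(np.float32)
forest = RandomizedKDForest(points)
print(forest.query(points[0], k=5))               # points[0] should rank first
```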
Multi-Target Unsupervised Domain Adaptation for Semantic Segmentation without External Data
Multi-target unsupervised domain adaptation (UDA) aims to learn a unified
model to address the domain shift between multiple target domains. Due to the
difficulty of obtaining annotations for dense predictions, it has recently been
introduced into cross-domain semantic segmentation. However, most existing
solutions require labeled data from the source domain and unlabeled data from
multiple target domains concurrently during training. Collectively, we refer to
this data as "external". When faced with new unlabeled data from an unseen
target domain, these solutions either do not generalize well or require
retraining from scratch on all data. To address these challenges, we introduce
a new strategy called "multi-target UDA without external data" for semantic
segmentation. Specifically, the segmentation model is initially trained on the
external data. Then, it is adapted to a new unseen target domain without
accessing any external data. This approach is thus more scalable than existing
solutions and remains applicable when external data is inaccessible. We
demonstrate this strategy using a simple method that incorporates
self-distillation and adversarial learning, where knowledge acquired from the
external data is preserved during adaptation through "one-way" adversarial
learning. Extensive experiments in several synthetic-to-real and real-to-real
adaptation settings on four benchmark urban driving datasets show that our
method significantly outperforms current state-of-the-art solutions, even in
the absence of external data. Our source code is available online
(https://github.com/YonghaoXu/UT-KD).
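A hedged sketch of one adaptation step consistent with this description (a frozen teacher trained on the external data, self-distillation, and "one-way" adversarial feature alignment) follows; all function and variable names are assumptions, not the UT-KD API, and the paper's exact losses may differ.

```python
# Hypothetical adaptation step: a frozen teacher (trained on external data)
# supervises a student on unlabeled images from the new target domain.
import torch
import torch.nn.functional as F

def adaptation_step(student, teacher, discriminator, images,
                    opt_student, opt_disc, lambda_adv=0.01):
    with torch.no_grad():                          # teacher stays frozen
        t_feat, t_logits = teacher(images)
    s_feat, s_logits = student(images)

    # Self-distillation: match the frozen teacher's soft predictions,
    # preserving the knowledge acquired from the external data.
    loss_kd = F.kl_div(F.log_softmax(s_logits, dim=1),
                       F.softmax(t_logits, dim=1), reduction="batchmean")

    # "One-way" adversarial term: the student is pushed to produce features
    # the discriminator cannot tell apart from the teacher's; the teacher is
    # never updated, so alignment flows in one direction only.
    d_s = discriminator(s_feat)
    loss_adv = F.binary_cross_entropy_with_logits(d_s, torch.ones_like(d_s))

    opt_student.zero_grad()
    (loss_kd + lambda_adv * loss_adv).backward()
    opt_student.step()

    # Discriminator update: teacher features real (1), student features fake (0).
    d_t, d_s = discriminator(t_feat), discriminator(s_feat.detach())
    loss_disc = 0.5 * (F.binary_cross_entropy_with_logits(d_t, torch.ones_like(d_t))
                       + F.binary_cross_entropy_with_logits(d_s, torch.zeros_like(d_s)))
    opt_disc.zero_grad()
    loss_disc.backward()
    opt_disc.step()
```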
PowMix: A Versatile Regularizer for Multimodal Sentiment Analysis
Multimodal sentiment analysis (MSA) leverages heterogeneous data sources to
interpret the complex nature of human sentiments. Despite significant progress
in multimodal architecture design, the field lacks comprehensive regularization
methods. This paper introduces PowMix, a versatile embedding space regularizer
that builds upon the strengths of unimodal mixing-based regularization
approaches and introduces novel algorithmic components that are specifically
tailored to multimodal tasks. PowMix is integrated before the fusion stage of
multimodal architectures and facilitates intra-modal mixing, such as mixing
text with text, to act as a regularizer. PowMix consists of five components: 1)
a varying number of generated mixed examples, 2) mixing factor reweighting, 3)
anisotropic mixing, 4) dynamic mixing, and 5) cross-modal label mixing.
Extensive experimentation across benchmark MSA datasets and a broad spectrum of
diverse architectural designs demonstrates the efficacy of PowMix, as evidenced
by consistent performance improvements over baselines and existing mixing
methods. An in-depth ablation study highlights the critical contribution of
each PowMix component and how they synergistically enhance performance.
Furthermore, algorithmic analysis demonstrates how PowMix behaves in different
scenarios, particularly comparing early versus late fusion architectures.
Notably, PowMix enhances overall performance without sacrificing model
robustness or magnifying text dominance. It also retains its strong performance
in situations of limited data. Our findings position PowMix as a promising
versatile regularization strategy for MSA. Code will be made available.
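Since PowMix comprises five components, a hedged sketch of just its backbone idea, intra-modal mixup applied to unimodal embeddings before the fusion stage, may help; the tensor names and shapes below are hypothetical, and the five components listed above are not reproduced.

```python
# Simplified intra-modal, mixup-style regularization before fusion,
# in the spirit of PowMix; not the full method.
import torch

def intra_modal_mix(text_emb, audio_emb, labels, alpha=1.0):
    """Mix each modality only with itself (text with text, audio with audio)."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(text_emb.size(0))        # one shared pairing per batch
    mixed_text = lam * text_emb + (1 - lam) * text_emb[perm]
    mixed_audio = lam * audio_emb + (1 - lam) * audio_emb[perm]
    mixed_labels = lam * labels + (1 - lam) * labels[perm]
    return mixed_text, mixed_audio, mixed_labels   # then passed to the fusion stage

# Usage: regularize by also training the fusion model on the mixed batch.
text, audio = torch.randn(32, 768), torch.randn(32, 128)
labels = torch.rand(32, 1)                         # e.g. continuous sentiment scores
mixed = intra_modal_mix(text, audio, labels)
```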
Keep It SimPool: Who Said Supervised Transformers Suffer from Attention Deficit?
Convolutional networks and vision transformers have different forms of
pairwise interactions, pooling across layers and pooling at the end of the
network. Does the latter really need to be different? As a by-product of
pooling, vision transformers provide spatial attention for free, but this is
most often of low quality unless self-supervised, which is not well studied. Is
supervision really the problem?
In this work, we develop a generic pooling framework and then we formulate a
number of existing methods as instantiations. By discussing the properties of
each group of methods, we derive SimPool, a simple attention-based pooling
mechanism as a replacement for the default one in both convolutional and
transformer encoders. We find that, whether supervised or self-supervised, this
improves performance on pre-training and downstream tasks and provides
attention maps delineating object boundaries in all cases. One could thus call
SimPool universal. To our knowledge, we are the first to obtain attention maps
in supervised transformers of at least as good quality as self-supervised,
without explicit losses or modifying the architecture. Code and models:
https://github.com/billpsomas/simpool (ICCV 2023).
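While the actual SimPool layer has its own specifics (see the repository above), the general pattern it instantiates, cross-attention pooling where the global average vector acts as the query over the patch tokens, can be sketched as follows; this is a minimal sketch under that assumption, not the authors' code.

```python
# Minimal attention-based pooling: one cross-attention step in which the
# global average vector queries the patch tokens.
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)
        self.k = nn.Linear(dim, dim, bias=False)
        self.scale = dim ** -0.5

    def forward(self, tokens):                     # tokens: (B, N, D) features
        query = self.q(tokens.mean(dim=1, keepdim=True))      # GAP as the query
        attn = (query @ self.k(tokens).transpose(1, 2)) * self.scale
        attn = attn.softmax(dim=-1)                # (B, 1, N): an attention map for free
        return (attn @ tokens).squeeze(1)          # pooled (B, D) representation

pooled = AttentionPool(384)(torch.randn(8, 196, 384))
print(pooled.shape)                                # torch.Size([8, 384])
```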
Fusing MPEG-7 visual descriptors for image classification
This paper proposes three content-based image classification techniques based on fusing various low-level MPEG-7 visual descriptors. Fusion is necessary as the descriptors would otherwise be incompatible and inappropriate to combine directly, e.g. in a Euclidean distance. Three approaches are described: a “merging” fusion combined with an SVM classifier, a back-propagation fusion combined with a KNN classifier, and a Fuzzy-ART neurofuzzy network. In the latter case, fuzzy rules can be extracted in an effort to bridge the “semantic gap” between the low-level descriptors and the high-level semantics of an image. All networks were evaluated using content from the repository of the aceMedia project, specifically in a beach/urban scene classification problem.
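A minimal sketch of the “merging” fusion variant, assuming pre-extracted descriptors (the array names are hypothetical stand-ins, and modern scikit-learn replaces the original tooling):

```python
# "Merging" fusion: standardize the concatenated MPEG-7 descriptors so their
# heterogeneous ranges become comparable, then train an SVM classifier.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def merge_descriptors(descriptor_list):
    """Concatenate per-image descriptors (e.g. color layout, edge histogram)."""
    return np.concatenate(descriptor_list, axis=1)

color_layout = np.random.rand(200, 12)             # placeholder descriptor matrices
edge_hist = np.random.rand(200, 80)
labels = np.random.randint(0, 2, 200)              # 0 = beach, 1 = urban

X = merge_descriptors([color_layout, edge_hist])
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, labels)
print(clf.score(X, labels))
```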
Adaptive Anchor Label Propagation for Transductive Few-Shot Learning
Few-shot learning addresses the issue of classifying images using limited
labeled data. Exploiting unlabeled data through the use of transductive
inference methods such as label propagation has been shown to improve the
performance of few-shot learning significantly. Label propagation infers
pseudo-labels for unlabeled data by utilizing a constructed graph that exploits
the underlying manifold structure of the data. However, a limitation of
existing label propagation approaches is that the positions of all data points
are fixed and might be sub-optimal, so the algorithm is not as effective as it
could be. In this work, we propose a novel algorithm that adapts the feature
embeddings of the labeled data by minimizing a differentiable loss function,
optimizing their positions in the manifold in the process. Our novel algorithm,
Adaptive Anchor Label Propagation, outperforms the standard label propagation
algorithm by as much as 7% and 2% in the 1-shot and 5-shot settings,
respectively. We provide experimental results highlighting the merits of our
algorithm on four widely used few-shot benchmark datasets, namely miniImageNet,
tieredImageNet, CUB and CIFAR-FS and two commonly used backbones, ResNet12 and
WideResNet-28-10. The source code can be found at
https://github.com/MichalisLazarou/A2LP (published in ICIP).
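For context, the standard label propagation baseline that A2LP builds on can be sketched as below; the anchor-adaptation step that A2LP adds on top is omitted, and the function signature is an assumption.

```python
# Transductive label propagation: build a kNN affinity graph over all
# embeddings, normalize it symmetrically, and solve the closed form
# F = (I - alpha * S)^(-1) Y.
import numpy as np

def label_propagation(X, y_labeled, n_classes, k=20, alpha=0.99):
    """X: (N, D) embeddings with the labeled points first; y_labeled: their labels."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    W = Xn @ Xn.T                                  # cosine affinities
    np.fill_diagonal(W, 0)
    W = np.clip(W, 0, None)                        # keep non-negative weights only
    kth = np.sort(W, axis=1)[:, -k][:, None]
    W = np.where(W >= kth, W, 0)                   # keep ~k neighbors per node
    W = np.maximum(W, W.T)                         # symmetrize the graph

    d = W.sum(axis=1)
    S = W / (np.sqrt(np.outer(d, d)) + 1e-12)      # D^{-1/2} W D^{-1/2}

    Y = np.zeros((len(X), n_classes))
    Y[np.arange(len(y_labeled)), y_labeled] = 1    # one-hot seed labels
    F = np.linalg.solve(np.eye(len(X)) - alpha * S, Y)
    return F.argmax(axis=1)                        # pseudo-labels for all points
```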
Opti-CAM: Optimizing saliency maps for interpretability
Methods based on class activation maps (CAM) provide a simple mechanism to
interpret predictions of convolutional neural networks by using linear
combinations of feature maps as saliency maps. By contrast, masking-based
methods optimize a saliency map directly in the image space or learn it by
training another network on additional data.
In this work we introduce Opti-CAM, combining ideas from CAM-based and
masking-based approaches. Our saliency map is a linear combination of feature
maps, where weights are optimized per image such that the logit of the masked
image for a given class is maximized. We also fix a fundamental flaw in two of
the most common evaluation metrics of attribution methods. On several datasets,
Opti-CAM largely outperforms other CAM-based approaches according to the most
relevant classification metrics. We provide empirical evidence supporting that
localization and classifier interpretability are not necessarily aligned.
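A hedged sketch of the core mechanism described above, per-image weights over feature maps optimized so that the masked image maximizes the class logit; the paper's full procedure and normalization details may differ, and the function name is hypothetical.

```python
# Simplified Opti-CAM-style loop: the saliency map is a convex combination of
# feature maps whose weights are optimized, per image, to maximize the
# classifier's logit on the masked image.
import torch
import torch.nn.functional as F

def opti_cam(model, features, image, target_class, steps=50, lr=0.1):
    """features: (1, K, h, w) pre-extracted activations from a chosen conv layer."""
    features = features.detach()
    w = torch.zeros(features.size(1), requires_grad=True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        weights = torch.softmax(w, dim=0)          # convex combination weights
        sal = (weights[None, :, None, None] * features).sum(1, keepdim=True)
        sal = F.interpolate(sal, size=image.shape[-2:], mode="bilinear",
                            align_corners=False)
        sal = (sal - sal.min()) / (sal.max() - sal.min() + 1e-8)  # to [0, 1]
        logit = model(image * sal)[0, target_class]  # masked-image class logit
        opt.zero_grad()
        (-logit).backward()                        # gradient ascent on the logit
        opt.step()
    return sal.detach()                            # the optimized saliency map
```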
CA-Stream: Attention-based pooling for interpretable image recognition
Explanations obtained from transformer-based architectures in the form of raw
attention can be seen as class-agnostic saliency maps. Additionally,
attention-based pooling serves as a form of masking in feature space.
Motivated by this observation, we design an attention-based pooling mechanism
intended to replace Global Average Pooling (GAP) at inference. This mechanism,
called Cross-Attention Stream (CA-Stream), comprises a stream of cross
attention blocks interacting with features at different network depths.
CA-Stream enhances interpretability in models, while preserving recognition
performance. (CVPR XAI4CV Workshop)
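A minimal sketch of the stream pattern described above, a learnable token refined by cross-attention against the backbone's features at successive depths, whose final state replaces GAP; dimensions and block internals are simplified assumptions rather than the paper's exact design.

```python
# Hypothetical cross-attention stream: one token attends to the features of
# each backbone stage in turn; its final state is used as the pooled vector.
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, token, feat):                # feat: (B, N, D) stage features
        out, _ = self.attn(token, feat, feat)      # the token queries the features
        return token + out                         # residual update of the token

class CAStream(nn.Module):
    def __init__(self, dim, depths=4):
        super().__init__()
        self.token = nn.Parameter(torch.zeros(1, 1, dim))
        self.blocks = nn.ModuleList(CrossAttentionBlock(dim) for _ in range(depths))

    def forward(self, stage_features):             # list of (B, N, D) tensors
        token = self.token.expand(stage_features[0].size(0), -1, -1)
        for block, feat in zip(self.blocks, stage_features):
            token = block(token, feat)
        return token.squeeze(1)                    # replaces GAP at inference

feats = [torch.randn(2, 196, 256) for _ in range(4)]
print(CAStream(256)(feats).shape)                  # torch.Size([2, 256])
```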