Independent Prototype Propagation for Zero-Shot Compositionality
Humans are good at compositional zero-shot reasoning; someone who has never
seen a zebra before could nevertheless recognize one when we tell them it looks
like a horse with black and white stripes. Machine learning systems, on the
other hand, usually leverage spurious correlations in the training data, and
while such correlations can help recognize objects in context, they hurt
generalization. To be able to deal with underspecified datasets while still
leveraging contextual clues during classification, we propose ProtoProp, a
novel prototype propagation graph method. First we learn prototypical
representations of objects (e.g., zebra) that are conditionally independent
w.r.t. their attribute labels (e.g., stripes) and vice versa. Next we propagate
the independent prototypes through a compositional graph, to learn
compositional prototypes of novel attribute-object combinations that reflect
the dependencies of the target distribution. The method does not rely on any
external data, such as class hierarchy graphs or pretrained word embeddings. We
evaluate our approach on AO-Clevr, a synthetic and strongly visual dataset
with clean labels, and UT-Zappos, a noisy real-world dataset of fine-grained
shoe types. We show that in the generalized compositional zero-shot setting we
outperform state-of-the-art results, and through ablations we show the
importance of each part of the method and its contribution to the final
results.
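
To make the propagation step concrete, here is a minimal Python sketch (assuming PyTorch; all names are hypothetical and the paper's actual architecture is richer): independently learned attribute and object prototypes are combined along the edges of the compositional graph to produce prototypes for unseen attribute-object pairs, which then act as classifiers.

    import torch
    import torch.nn as nn

    class PrototypePropagation(nn.Module):
        # Compose attribute and object prototypes (assumed to be learned with
        # independence constraints elsewhere) into prototypes for unseen pairs.
        def __init__(self, dim):
            super().__init__()
            self.combine = nn.Sequential(
                nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

        def forward(self, attr_protos, obj_protos, pairs):
            # pairs: list of (attribute_index, object_index) compositions
            a = attr_protos[[p[0] for p in pairs]]          # (P, dim)
            o = obj_protos[[p[1] for p in pairs]]           # (P, dim)
            return self.combine(torch.cat([a, o], dim=-1))  # (P, dim)

    # Classify an image embedding against the composed prototypes.
    def classify(image_feat, comp_protos):
        return (image_feat @ comp_protos.T).argmax(dim=-1)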
Language-Based Augmentation to Address Shortcut Learning in Object Goal Navigation
Deep Reinforcement Learning (DRL) has shown great potential in enabling
robots to find certain objects (e.g., `find a fridge') in environments like
homes or schools. This task is known as Object-Goal Navigation (ObjectNav). DRL
methods are predominantly trained and evaluated using environment simulators.
Although DRL has shown impressive results, the simulators may be biased or
limited. This creates a risk of shortcut learning, i.e., learning a policy
tailored to specific visual details of training environments. We aim to deepen
our understanding of shortcut learning in ObjectNav, its implications and
propose a solution. We design an experiment for inserting a shortcut bias in
the appearance of training environments. As a proof-of-concept, we associate
room types to specific wall colors (e.g., bedrooms with green walls), and
observe poor generalization of a state-of-the-art (SOTA) ObjectNav method to
environments where this is not the case (e.g., bedrooms with blue walls). We
find that shortcut learning is the root cause: the agent learns to navigate to
target objects, by simply searching for the associated wall color of the target
object's room. To solve this, we propose Language-Based (L-B) augmentation. Our
key insight is that we can leverage the multimodal feature space of a
Vision-Language Model (VLM) to augment visual representations directly at the
feature-level, requiring no changes to the simulator, and only an addition of
one layer to the model. Where the SOTA ObjectNav method's success rate drops
69%, our proposal has only a drop of 23%.
Comment: 8 pages, 6 figures, to be published in IEEE IRC 2023.
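
A minimal sketch of the core idea, assuming OpenAI's CLIP package as the VLM (the paper's actual augmentation layer and prompts may differ): visual features are shifted along a direction defined by two text prompts, so at the feature level the agent sees appearance variations that never occur in the simulator.

    import torch
    import clip  # assumes OpenAI's CLIP; any shared image-text space works

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, _ = clip.load("ViT-B/32", device=device)

    def language_based_augment(visual_feat, source_text, target_text, alpha=1.0):
        # Shift a visual feature along a text-defined direction, e.g. from
        # "a room with green walls" to "a room with blue walls".
        with torch.no_grad():
            tokens = clip.tokenize([source_text, target_text]).to(device)
            t = model.encode_text(tokens).float()
            t = t / t.norm(dim=-1, keepdim=True)
        direction = t[1] - t[0]                # edit direction in feature space
        aug = visual_feat + alpha * direction  # augmentation at the feature level
        return aug / aug.norm(dim=-1, keepdim=True)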
Recurrently Predicting Hypergraphs
This work considers predicting the relational structure of a hypergraph for a
given set of vertices, as common for applications in particle physics,
biological systems and other complex combinatorial problems. A problem arises
from the number of possible multi-way relationships, or hyperedges, scaling in
$\mathcal{O}(2^n)$ for a set of $n$ elements. Simply storing an indicator
tensor for all relationships is already intractable for moderately sized $n$,
prompting previous approaches to restrict the number of vertices a hyperedge
connects. Instead, we propose a recurrent hypergraph neural network that
predicts the incidence matrix by iteratively refining an initial guess of the
solution. We leverage the property that most hypergraphs of interest are
sparsely connected and reduce the memory requirement to $\mathcal{O}(nk)$,
where $k$ is the maximum number of positive edges, i.e., edges that actually
exist. In order to counteract the linearly growing memory cost from training a
lengthening sequence of refinement steps, we further propose an algorithm that
applies backpropagation through time on randomly sampled subsequences. We
empirically show that our method can match an increase in the intrinsic
complexity without a performance decrease and demonstrate superior performance
compared to state-of-the-art models.
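
A rough sketch of the training trick, with an assumed GRU-style cell standing in for the paper's refinement update: the refinement runs for a fixed number of steps, but gradients are taken only over a randomly sampled subsequence, keeping the training memory cost constant in the number of refinement steps.

    import torch
    import torch.nn as nn

    class IncidenceRefiner(nn.Module):
        # Sketch: recurrently refine a guess of the n x k incidence matrix,
        # where k bounds the number of positive hyperedges.
        def __init__(self, dim):
            super().__init__()
            self.cell = nn.GRUCell(dim, dim)  # stand-in for the paper's update

        def forward(self, state, inputs):
            return self.cell(inputs, state)

    def train_on_subsequence(model, state, inputs, target, loss_fn,
                             total_steps=16, bptt_len=4):
        # Backpropagation through time on a random subsequence only.
        start = torch.randint(0, total_steps - bptt_len + 1, (1,)).item()
        for t in range(total_steps):
            if t == start:
                state = state.detach()   # cut the graph before the window
            state = model(state, inputs)
            if t == start + bptt_len - 1:
                loss_fn(state, target).backward()  # grads only in the window
                state = state.detach()   # later steps carry no gradient
        return state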
Diffusing More Objects for Semi-Supervised Domain Adaptation with Less Labeling
For object detection, it is possible to view the prediction of bounding boxes
as a reverse diffusion process. Using a diffusion model, the random bounding
boxes are iteratively refined in a denoising step, conditioned on the image. We
propose a stochastic accumulator function that starts each run with random
bounding boxes and combines the slightly different predictions. We empirically
verify that this improves detection performance. The improved detections are
leveraged on unlabelled images as weighted pseudo-labels for semi-supervised
learning. We evaluate the method on a challenging out-of-domain test set. Our
method brings significant improvements and is on par with human-selected
pseudo-labels, while not requiring any human involvement.
Comment: 4 pages, Workshop on Diffusion Models, NeurIPS 2023.
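
A simplified sketch of a stochastic accumulator of this kind (the matching and weighting details here are assumptions, not the paper's exact procedure): boxes from several runs, each started from different random boxes, are merged by score-weighted averaging whenever they overlap sufficiently.

    import numpy as np

    def iou(a, b):
        # a, b: [x1, y1, x2, y2]
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter + 1e-9)

    def accumulate_runs(runs, iou_thr=0.5):
        # Merge detections from multiple diffusion runs: boxes that agree
        # across runs (IoU above threshold) are averaged, weighted by score.
        merged = []
        for boxes, scores in runs:  # each run: (N, 4) boxes, (N,) scores
            for box, score in zip(boxes, scores):
                for m in merged:
                    if iou(m["box"], box) > iou_thr:
                        w = m["weight"]
                        m["box"] = (w * m["box"] + score * np.asarray(box)) / (w + score)
                        m["weight"] += score
                        break
                else:
                    merged.append({"box": np.asarray(box, float),
                                   "weight": float(score)})
        return merged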
Self-Guided Diffusion Models
Diffusion models have demonstrated remarkable progress in image generation
quality, especially when guidance is used to control the generative process.
However, guidance requires a large amount of image-annotation pairs for
training and is thus dependent on their availability, correctness and
unbiasedness. In this paper, we eliminate the need for such annotation by
instead leveraging the flexibility of self-supervision signals to design a
framework for self-guided diffusion models. By leveraging a feature extraction
function and a self-annotation function, our method provides guidance signals
at various image granularities: from the level of holistic images to object
boxes and even segmentation masks. Our experiments on single-label and
multi-label image datasets demonstrate that self-labeled guidance always
outperforms diffusion models without guidance and may even surpass guidance
based on ground-truth labels, especially on unbalanced data. When equipped with
self-supervised box or mask proposals, our method further generates visually
diverse yet semantically consistent images, without the need for any class,
box, or segment label annotation. Self-guided diffusion is simple, flexible and
expected to profit from deployment at scale.
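
A compact sketch of self-guidance at the holistic-image level, under assumed components: pseudo-labels come from clustering self-supervised features (e.g., a DINO-style encoder), and sampling applies classifier-free-style guidance on those labels. The paper's feature-extraction and self-annotation functions are more general than this.

    from sklearn.cluster import KMeans

    def self_annotate(features, n_clusters=100):
        # Self-annotation: cluster self-supervised features into pseudo-labels;
        # no human-provided class, box, or mask annotations are involved.
        return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(features)

    def guided_eps(model, x_t, t, pseudo_label, null_label, scale=3.0):
        # Classifier-free-style guidance, conditioned on the pseudo-label.
        eps_cond = model(x_t, t, pseudo_label)  # conditional noise estimate
        eps_uncond = model(x_t, t, null_label)  # unconditional estimate
        return eps_uncond + scale * (eps_cond - eps_uncond)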
Incremental concept learning with few training examples and hierarchical classification
Object recognition and localization are important to automatically interpret video and allow better querying
on its content. We propose a method for object localization that learns incrementally and addresses four key
aspects. Firstly, we show that for certain applications, recognition is feasible with only a few training samples.
Secondly, we show that novel objects can be added incrementally without retraining existing objects, which is
important for fast interaction. Thirdly, we show that an unbalanced number of positive training samples leads
to biased classifier scores that can be corrected by modifying weights. Fourthly, we show that the detector
performance can deteriorate due to hard-negative mining for similar or closely related classes (e.g., for Barbie
and dress, because the doll is wearing a dress). This can be solved by our hierarchical classification. We introduce
a new dataset, which we call TOSO, and use it to demonstrate the effectiveness of the proposed method for the
localization and recognition of multiple objects in images.
This research was performed in the GOOSE project, which is jointly funded by the enabling technology program
Adaptive Multi Sensor Networks (AMSN) and the MIST research program of the Dutch Ministry of Defense.
This publication was supported by the research program Making Sense of Big Data (MSoBD).
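
As an illustration of the second and third points (the exact correction used in the paper may differ), a hypothetical per-class weight adjustment that compensates for unbalanced positive training samples, alongside incremental addition of a novel class:

    import numpy as np

    def correct_class_bias(scores, n_pos):
        # Illustrative correction: rescale scores of classes whose values are
        # inflated by many positive training samples. scores is
        # (num_windows, num_classes); n_pos is positives per class.
        n_pos = np.asarray(n_pos, dtype=float)
        weights = n_pos.mean() / n_pos  # fewer positives -> larger weight
        return scores * weights         # rebalanced scores

    def add_class(classifiers, new_classifier):
        # Novel classes are appended without retraining existing ones.
        classifiers.append(new_classifier)
        return classifiers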
Data Augmentations in Deep Weight Spaces
Learning in weight spaces, where neural networks process the weights of other
deep neural networks, has emerged as a promising research direction with
applications in various fields, from analyzing and editing neural fields and
implicit neural representations, to network pruning and quantization. Recent
works designed architectures for effective learning in that space, which takes
into account its unique, permutation-equivariant, structure. Unfortunately, so
far these architectures suffer from severe overfitting and were shown to
benefit from large datasets. This poses a significant challenge because
generating data for this learning setup is laborious and time-consuming since
each data sample is a full set of network weights that has to be trained. In
this paper, we address this difficulty by investigating data augmentations for
weight spaces, a set of techniques that enable generating new data examples on
the fly without having to train additional input weight space elements. We
first review several recently proposed data augmentation schemes and divide
them into categories. We then introduce a novel
augmentation scheme based on the Mixup method. We evaluate the performance of
these techniques on existing benchmarks as well as new benchmarks we generate,
which can be valuable for future studies.
Comment: Accepted to NeurIPS 2023 Workshop on Symmetry and Geometry in Neural Representations.
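
A minimal sketch of a Mixup-style augmentation in weight space, assuming the two input networks have already been permutation-aligned (real schemes must account for the space's permutation symmetry):

    import copy
    import numpy as np

    def weight_space_mixup(state_a, state_b, alpha=0.2):
        # Mixup for weight-space data: convexly combine two (assumed aligned)
        # networks' state dicts with lam ~ Beta(alpha, alpha).
        lam = float(np.random.beta(alpha, alpha))
        mixed = copy.deepcopy(state_a)
        for k in mixed:
            mixed[k] = lam * state_a[k] + (1.0 - lam) * state_b[k]
        return mixed, lam  # lam also mixes the two samples' targets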
Recognition and localization of relevant human behavior in videos (SPIE)
Ground surveillance is normally performed by human assets, since it requires visual intelligence. However, especially for military operations, this can be dangerous and is very resource intensive. Therefore, unmanned autonomous visual-intelligence systems are desired. In this paper, we present an improved system that can recognize actions of a human and interactions between multiple humans. Central to the new system is our agent-based architecture. The system is trained on thousands of videos and evaluated on realistic persistent surveillance data in the DARPA Mind's Eye program, with hours of videos of challenging scenes. The results show that our system is able to track the people, detect and localize events, and discriminate between different behaviors, and it performs 3.4 times better than our previous system.