Building Machines That Learn and Think Like People
Recent progress in artificial intelligence (AI) has renewed interest in
building systems that learn and think like people. Many advances have come from
using deep neural networks trained end-to-end in tasks such as object
recognition, video games, and board games, achieving performance that equals or
even beats humans in some respects. Despite their biological inspiration and
performance achievements, these systems differ from human intelligence in
crucial ways. We review progress in cognitive science suggesting that truly
human-like learning and thinking machines will have to reach beyond current
engineering trends in both what they learn, and how they learn it.
Specifically, we argue that these machines should (a) build causal models of
the world that support explanation and understanding, rather than merely
solving pattern recognition problems; (b) ground learning in intuitive theories
of physics and psychology, to support and enrich the knowledge that is learned;
and (c) harness compositionality and learning-to-learn to rapidly acquire and
generalize knowledge to new tasks and situations. We suggest concrete
challenges and promising routes towards these goals that can combine the
strengths of recent neural network advances with more structured cognitive
models.
Independent Prototype Propagation for Zero-Shot Compositionality
Humans are good at compositional zero-shot reasoning; someone who has never
seen a zebra before could nevertheless recognize one when we tell them it looks
like a horse with black and white stripes. Machine learning systems, on the
other hand, usually leverage spurious correlations in the training data, and
while such correlations can help recognize objects in context, they hurt
generalization. To be able to deal with underspecified datasets while still
leveraging contextual clues during classification, we propose ProtoProp, a
novel prototype propagation graph method. First we learn prototypical
representations of objects (e.g., zebra) that are conditionally independent
w.r.t. their attribute labels (e.g., stripes) and vice versa. Next we propagate
the independent prototypes through a compositional graph, to learn
compositional prototypes of novel attribute-object combinations that reflect
the dependencies of the target distribution. The method does not rely on any
external data, such as class hierarchy graphs or pretrained word embeddings. We
evaluate our approach on AO-CLEVr, a synthetic and strongly visual dataset
with clean labels, and UT-Zappos, a noisy real-world dataset of fine-grained
shoe types. We show that in the generalized compositional zero-shot setting we
outperform state-of-the-art results, and through ablations we show the
importance of each part of the method and its contribution to the final
results.
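
To make the propagation idea concrete, here is a minimal PyTorch sketch of
composing independently learned attribute and object prototypes into a
prototype for a novel pair; the class name, dimensions, and the MLP used for
propagation are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    class ProtoPropSketch(nn.Module):
        def __init__(self, n_objects, n_attrs, dim=128):
            super().__init__()
            # Prototypes learned to be conditionally independent:
            # objects (e.g., zebra) and attributes (e.g., striped).
            self.obj_protos = nn.Parameter(torch.randn(n_objects, dim))
            self.attr_protos = nn.Parameter(torch.randn(n_attrs, dim))
            # Propagation step: map an (attribute, object) prototype pair
            # to a compositional prototype for that combination.
            self.propagate = nn.Sequential(
                nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim)
            )

        def compose(self, attr_idx, obj_idx):
            # Build prototypes for (possibly unseen) attribute-object pairs.
            pair = torch.cat([self.attr_protos[attr_idx],
                              self.obj_protos[obj_idx]], dim=-1)
            return self.propagate(pair)

        def forward(self, img_feat, attr_idx, obj_idx):
            # Classify by similarity to each candidate composed prototype.
            return img_feat @ self.compose(attr_idx, obj_idx).t()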
Deconfounding Causal Inference for Zero-shot Action Recognition
Zero-shot action recognition (ZSAR) aims to recognize unseen action categories
in the test set without corresponding training examples. Most existing
zero-shot methods follow the feature generation framework to transfer knowledge
from seen action categories to model the feature distribution of unseen
categories. However, due to the complexity and diversity of actions, it remains
challenging to generate unseen feature distributions, especially in the
cross-dataset scenario, where there is potentially a larger domain shift. This
paper proposes a Deconfounding Causal GAN (DeCalGAN) for generating unseen
action video features, with the following technical contributions: 1) our model
unifies compositional ZSAR with traditional visual-semantic models to
incorporate local object information with global semantic information for
feature generation; 2) a GAN-based architecture is proposed for causal
inference and unseen distribution discovery; 3) a deconfounding module is
proposed to refine representations of the local object and global semantic
information that act as confounders in the training data. Action descriptions
and random object features after causal inference are then used to discover
unseen distributions of novel actions in different datasets. Our extensive
experiments on Cross-Dataset Zero-Shot Action Recognition (CD-ZSAR) demonstrate
substantial improvements on the UCF101 and HMDB51 standard benchmarks for this
problem.
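
As a rough illustration of the feature-generation step (contribution 2), the
sketch below conditions a generator on a semantic action description and
deconfounded object features to synthesize unseen video features; the
discriminator, the causal-inference machinery, and all names and dimensions are
omitted or assumed and do not reflect the paper's exact architecture.

    import torch
    import torch.nn as nn

    class UnseenFeatureGenerator(nn.Module):
        """Semantic description + object features + noise -> video feature."""
        def __init__(self, sem_dim=300, obj_dim=128, noise_dim=64,
                     feat_dim=512):
            super().__init__()
            self.noise_dim = noise_dim
            self.net = nn.Sequential(
                nn.Linear(sem_dim + obj_dim + noise_dim, 512),
                nn.LeakyReLU(0.2),
                nn.Linear(512, feat_dim),
            )

        def forward(self, sem, obj):
            # Random noise lets the generator model a distribution of
            # features rather than a single point per action.
            z = torch.randn(sem.size(0), self.noise_dim, device=sem.device)
            return self.net(torch.cat([sem, obj, z], dim=-1))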
ComCLIP: Training-Free Compositional Image and Text Matching
Contrastive Language-Image Pretraining (CLIP) has demonstrated great
zero-shot performance for image-text matching because of its holistic use of
natural language supervision that covers large-scale, open-world visual
concepts. However, it is still challenging to adapt CLIP to compositional image
and text matching -- a more challenging image and text matching task requiring
the model to understand compositional word concepts and visual components.
Towards better compositional generalization in zero-shot image and text
matching, in this paper, we study the problem from a causal perspective: the
erroneous semantics of individual entities are essentially confounders that
cause the matching failure. Therefore, we propose a novel training-free
compositional CLIP model (ComCLIP). ComCLIP disentangles input images into
subjects, objects, and action sub-images and composes CLIP's vision encoder and
text encoder to perform evolving matching over compositional text embedding and
sub-image embeddings. In this way, ComCLIP can mitigate spurious correlations
introduced by the pretrained CLIP models and dynamically assess the
contribution of each entity when performing image and text matching.
Experiments on compositional image-text matching on SVO and ComVG and general
image-text retrieval on Flickr8K demonstrate the effectiveness of our
plug-and-play method, which boosts the zero-shot inference ability of CLIP even
without further training or fine-tuning of CLIP.
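
A minimal sketch of the matching step this describes: given CLIP embeddings for
the full image, its subject/object/action sub-images, and the compositional
text, score the pair by blending the global match with dynamically weighted
per-entity matches. The softmax weighting below is an illustrative assumption,
not the paper's exact rule.

    import torch
    import torch.nn.functional as F

    def comclip_score(text_emb, image_emb, entity_embs):
        # text_emb: (d,), image_emb: (d,), entity_embs: (k, d) for the
        # subject, object, and action sub-images (all CLIP features).
        text_emb = F.normalize(text_emb, dim=-1)
        image_emb = F.normalize(image_emb, dim=-1)
        entity_embs = F.normalize(entity_embs, dim=-1)
        ent_sims = entity_embs @ text_emb          # per-entity similarity
        weights = torch.softmax(ent_sims, dim=0)   # emphasize matched parts
        # Global image-text match plus weighted entity contributions.
        return image_emb @ text_emb + (weights * ent_sims).sum()

Because the score uses only frozen CLIP embeddings, nothing here is trained,
consistent with the training-free setting.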
Learning Conditional Attributes for Compositional Zero-Shot Learning
Compositional Zero-Shot Learning (CZSL) aims to train models to recognize
novel compositional concepts based on learned concepts such as attribute-object
combinations. One of the challenges is to model how attributes interact with
different objects, e.g., the attribute "wet" in "wet apple" and "wet cat" is
different. As a solution, we provide analysis and argue that attributes are
conditioned on the recognized object and input image and explore learning
conditional attribute embeddings by a proposed attribute learning framework
containing an attribute hyper learner and an attribute base learner. By
encoding conditional attributes, our model can generate flexible
attribute embeddings for generalization from seen to unseen compositions.
Experiments on CZSL benchmarks, including the more challenging C-GQA dataset,
demonstrate better performance compared with other state-of-the-art approaches
and validate the importance of learning conditional attributes. Code is
available at https://github.com/wqshmzh/CANet-CZSL.
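
A minimal sketch of the conditional-attribute idea, assuming a hyper learner
that conditions a base attribute embedding on object and image features; the
names and dimensions here are hypothetical and are not taken from the released
CANet code.

    import torch
    import torch.nn as nn

    class ConditionalAttribute(nn.Module):
        def __init__(self, n_attrs, obj_dim=128, img_dim=128, dim=128):
            super().__init__()
            self.base = nn.Embedding(n_attrs, dim)          # base learner
            self.hyper = nn.Linear(obj_dim + img_dim, dim)  # hyper learner

        def forward(self, attr_idx, obj_feat, img_feat):
            # Condition on the recognized object and the input image, so
            # "wet" in "wet apple" differs from "wet" in "wet cat".
            feats = torch.cat([obj_feat, img_feat], dim=-1)
            cond = torch.sigmoid(self.hyper(feats))
            return self.base(attr_idx) * cond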
A Short Survey of Systematic Generalization
This survey covers systematic generalization and the history of how machine
learning has addressed it. We aim to summarize and organize the related work on
both conventional approaches and recent improvements. We first look at the
definition of systematic generalization, then introduce the Classicist and
Connectionist positions. We then discuss different types of Connectionist
approaches and how they tackle systematic generalization. Two crucial problems,
variable binding and causality, are discussed. We look into systematic
generalization in the language, vision, and VQA fields, and discuss recent
improvements from different aspects. Systematic generalization has a long
history in artificial intelligence, and we could cover only a small portion of
the many contributions. We hope this paper provides background and is
beneficial for discoveries in future work.
DRPT: Disentangled and Recurrent Prompt Tuning for Compositional Zero-Shot Learning
Compositional Zero-shot Learning (CZSL) aims to recognize novel concepts
composed of known knowledge without training samples. Standard CZSL either
identifies visual primitives or enhances unseen composed entities, and as a
result, entanglement between state and object primitives cannot be fully
utilized. Admittedly, vision-language models (VLMs) could naturally cope with
CZSL through tuning prompts, while uneven entanglement leads prompts to be
dragged into a local optimum. In this paper, we take a further step to introduce
a novel Disentangled and Recurrent Prompt Tuning framework termed DRPT to
better tap the potential of VLMs in CZSL. Specifically, the state and object
primitives are treated as learnable vocabulary tokens embedded in prompts and
tuned on seen compositions. Instead of jointly tuning state and object, we
devise a disentangled and recurrent tuning strategy to suppress the traction
force caused by entanglement and gradually optimize the token parameters,
leading to a better prompt space. Notably, we develop a progressive fine-tuning
procedure that allows for incremental updates to the prompts, optimizing the
object first, then the state, and vice versa. Meanwhile, the optimization of
state and object is independent, thus clearer features can be learned to
further alleviate the issue of misleading optimization caused by entanglement.
Moreover, we quantify and analyze the entanglement in CZSL and supplement it
with entanglement-rebalancing optimization schemes. DRPT surpasses
representative state-of-the-art methods on extensive benchmark datasets,
demonstrating superiority in both accuracy and efficiency.
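
The disentangled, recurrent schedule can be pictured as alternately freezing
one primitive group while tuning the other; the loop below is a schematic
sketch under that assumption, with train_step standing in for one prompt
optimization step (a hypothetical callback, not the paper's code).

    import torch.nn as nn

    def recurrent_prompt_tuning(state_tokens: nn.Parameter,
                                object_tokens: nn.Parameter,
                                train_step, rounds=3, steps=100):
        for _ in range(rounds):
            # Tune object tokens with state tokens frozen, then the
            # reverse, suppressing gradient traction from entanglement.
            for active, frozen in ((object_tokens, state_tokens),
                                   (state_tokens, object_tokens)):
                active.requires_grad_(True)
                frozen.requires_grad_(False)
                for _ in range(steps):
                    train_step()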
Simple Primitives with Feasibility- and Contextuality-Dependence for Open-World Compositional Zero-shot Learning
The task of Compositional Zero-Shot Learning (CZSL) is to recognize images of
novel state-object compositions that are absent during the training stage.
Previous methods of learning compositional embedding have shown effectiveness
in closed-world CZSL. However, in Open-World CZSL (OW-CZSL), their performance
tends to degrade significantly due to the large cardinality of possible
compositions. Some recent works separately predict simple primitives (i.e.,
states and objects) to reduce cardinality. However, they consider simple
primitives as independent probability distributions, ignoring the heavy
dependence between states, objects, and compositions. In this paper, we model
the dependence of compositions via feasibility and contextuality.
Feasibility-dependence refers to the unequal feasibility relations between
simple primitives, e.g., "hairy" is more feasible with "cat" than
with "building" in the real world. Contextuality-dependence represents
the contextual variance in images, e.g., "cat" shows diverse appearances
under the states of "dry" and "wet". We design Semantic Attention
(SA) and generative Knowledge Disentanglement (KD) to learn the dependence of
feasibility and contextuality, respectively. SA captures semantics in
compositions to alleviate impossible predictions, driven by the visual
similarity between simple primitives. KD disentangles images into unbiased
feature representations, easing contextual bias in predictions. Moreover, we
complement the current compositional probability model with feasibility and
contextuality in a compatible format. Finally, we conduct comprehensive
experiments to analyze and validate the superior or competitive performance of
our model, Semantic Attention and knowledge Disentanglement guided Simple
Primitives (SAD-SP), on three widely-used OW-CZSL benchmark datasets.
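
As a small sketch of the complemented compositional probability model:
primitive predictions are combined into a joint score and re-weighted by a
per-pair feasibility term, so implausible compositions (e.g., "hairy building")
are suppressed. The multiplicative form below is an illustrative assumption.

    import torch

    def compose_scores(state_logits, object_logits, feasibility):
        # state_logits: (B, S), object_logits: (B, O),
        # feasibility: (S, O) scores in [0, 1].
        p_state = state_logits.softmax(-1)
        p_obj = object_logits.softmax(-1)
        # Outer product = independent composition probability;
        # feasibility re-weights it to model the dependence.
        joint = p_state.unsqueeze(2) * p_obj.unsqueeze(1)   # (B, S, O)
        return joint * feasibility.unsqueeze(0)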