Visual Concept-Metaconcept Learning
Humans reason with concepts and metaconcepts: we recognize red and green from
visual input; we also understand that they describe the same property of
objects (i.e., the color). In this paper, we propose the visual
concept-metaconcept learner (VCML) for joint learning of concepts and
metaconcepts from images and associated question-answer pairs. The key is to
exploit the bidirectional connection between visual concepts and metaconcepts.
Visual representations provide grounding cues for predicting relations between
unseen pairs of concepts. Knowing that red and green describe the same property
of objects, we generalize to the fact that cube and sphere also describe the
same property of objects, since they both categorize the shape of objects.
Meanwhile, knowledge about metaconcepts empowers visual concept learning from
limited, noisy, and even biased data. From just a few examples of purple cubes
we can understand a new color purple, which resembles the hue of the cubes
instead of the shape of them. Evaluation on both synthetic and real-world
datasets validates our claims.
Comment: NeurIPS 2019. First two authors contributed equally. Project page:
http://vcml.csail.mit.edu
Benchmarking and Enhancing Disentanglement in Concept-Residual Models
Concept bottleneck models (CBMs) are interpretable models that first predict
a set of semantically meaningful features, i.e., concepts, from observations
that are subsequently used to condition a downstream task. However, the model's
performance strongly depends on the engineered features and can severely suffer
from incomplete sets of concepts. Prior works have proposed a side channel -- a
residual -- that allows for unconstrained information flow to the downstream
task, thus improving model performance but simultaneously introducing
information leakage, which is undesirable for interpretability. This work
proposes three novel approaches to mitigate information leakage by
disentangling concepts and residuals, investigating the critical balance
between model performance and interpretability. Through extensive empirical
analysis on the CUB, OAI, and CIFAR 100 datasets, we assess the performance of
each disentanglement method and provide insights into when they work best.
Further, we show how each method impacts the ability to intervene over the
concepts and their subsequent impact on task performance.
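The concept-residual structure described in this abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the toy linear layers, the weight values, and the names `forward` and `concept_override` are all assumptions made for illustration. It shows only the data flow (input to concepts plus an unconstrained residual, then to the task output) and what intervening on a concept means.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, x):
    # Plain matrix-vector product on nested lists.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def forward(x, W_concept, W_residual, W_task, concept_override=None):
    """Toy concept-residual forward pass: x -> (concepts, residual) -> logits."""
    # Concept head: one probability per (hypothetical) concept.
    concepts = [sigmoid(z) for z in matvec(W_concept, x)]
    if concept_override is not None:
        # Intervention: replace selected predicted concepts with given values;
        # None means "keep the model's prediction".
        concepts = [c if o is None else o
                    for c, o in zip(concepts, concept_override)]
    # Residual side channel: unconstrained, so it may also carry
    # concept information (the leakage the abstract refers to).
    residual = matvec(W_residual, x)
    # Task head reads the concatenation of concepts and residual.
    logits = matvec(W_task, concepts + residual)
    return concepts, residual, logits

# Hypothetical toy weights: 2 inputs, 2 concepts, 1 residual unit, 1 output.
W_concept = [[1.0, 0.0], [0.0, 1.0]]
W_residual = [[0.5, -0.5]]
W_task = [[1.0, -1.0, 0.2]]

x = [2.0, -2.0]
concepts, residual, logits = forward(x, W_concept, W_residual, W_task)
_, _, intervened = forward(x, W_concept, W_residual, W_task,
                           concept_override=[0.0, None])
```

In this toy setup, setting the first concept to 0 lowers the task output, while the residual term is unaffected by the intervention.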
Bongard-LOGO: A New Benchmark for Human-Level Concept Learning and Reasoning
Humans have an inherent ability to learn novel concepts from only a few
samples and generalize these concepts to different situations. Even though
today's machine learning models excel with a plethora of training data on
standard recognition tasks, a considerable gap exists between machine-level
pattern recognition and human-level concept learning. To narrow this gap, the
Bongard problems (BPs) were introduced as an inspirational challenge for visual
cognition in intelligent systems. Despite new advances in representation
learning and learning to learn, BPs remain a daunting challenge for modern AI.
Inspired by the original one hundred BPs, we propose a new benchmark
Bongard-LOGO for human-level concept learning and reasoning. We develop a
program-guided generation technique to produce a large set of
human-interpretable visual cognition problems in action-oriented LOGO language.
Our benchmark captures three core properties of human cognition: 1)
context-dependent perception, in which the same object may have disparate
interpretations given different contexts; 2) analogy-making perception, in
which some meaningful concepts are traded off for other meaningful concepts;
and 3) perception with a few samples but infinite vocabulary. In experiments,
we show that the state-of-the-art deep learning methods perform substantially
worse than human subjects, implying that they fail to capture core human
cognition properties. Finally, we discuss research directions towards a general
architecture for visual reasoning to tackle this benchmark.
Comment: 22 pages, NeurIPS 202
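As a rough illustration of what an action-oriented, LOGO-style program means, the sketch below turns a list of turtle-style actions into the vertices of a shape. The `(turn_degrees, length)` action format and the function name `run_program` are assumptions for illustration only; they are not the benchmark's actual program format or generation code.

```python
import math

def run_program(actions, start=(0.0, 0.0), heading_deg=0.0):
    """Execute turtle-style actions and return the visited points.

    Each action is a hypothetical (turn_degrees, length) pair:
    turn counter-clockwise by turn_degrees, then move forward by length.
    """
    x, y = start
    heading = heading_deg
    points = [(x, y)]
    for turn_deg, length in actions:
        heading += turn_deg
        x += length * math.cos(math.radians(heading))
        y += length * math.sin(math.radians(heading))
        points.append((x, y))
    return points

# A unit square: move forward, then turn left 90 degrees before each new side.
square = [(0, 1.0), (90, 1.0), (90, 1.0), (90, 1.0)]
points = run_program(square)
```

Because the shape is defined by the program rather than by pixels, the same concept (here, a square) can be rendered at different sizes or starting headings by changing only the action parameters.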
Neurosymbolic Spike Concept Learner towards Neuromorphic General Intelligence
Current research in concept learning uses deep learning and ensemble methods to learn concepts. Concept learning allows us to combine heterogeneous entities in data that collectively identify as individual concepts. Heterogeneity and compositionality are crucial areas to explore in machine learning, as they have the potential to contribute profoundly to artificial general intelligence. We investigate the use of spiking neural networks for concept learning. Spiking neurones inclusively model the temporal properties observed in biological neurones. A benefit of spike-based neurones is that they allow localised learning rules that adapt only the connections between relevant neurones. In this position paper, we propose a technique allowing the dynamic formation of synapses (connections) in spiking neural networks, the basis of structural plasticity. Achieving dynamic synapse formation allows for a unique approach to concept learning with a malleable neural structure. We call this technique the Neurosymbolic Spike-Concept Learner (NS-SCL). The limitations of NS-SCL can be overcome with the neuromorphic computing paradigm. Furthermore, introducing NS-SCL as a technique on neuromorphic platforms should motivate a new direction of research towards Neuromorphic General Intelligence (NGI), a term we define to some extent.
From Vision-Language Multimodal Learning Towards Embodied Agents
To build machine agents with intelligent capabilities that mimic human perception and cognition, vision and language stand out as two essential modalities, fostering computer vision and natural language processing. Advances in these fields stimulate research in vision-language multimodal learning, which handles visual and linguistic inputs and outputs. Due to the innate difference between the two modalities and the lack of large-scale fine-grained annotations, multimodal agents tend to inherit unimodal shortcuts. In this thesis, we develop various solutions to intervene on unimodal shortcuts for multimodal generation and reasoning. For visual shortcuts, we introduce a linguistic prior and devise a syntax-aware action targeting module for dynamic description to rectify the correlation between subject and object in a sentence. We apply a concept hierarchy and propose a visual superordinate abstraction framework for unbiased concept learning to reduce the correlation among different attributes of an object. For linguistic shortcuts, we disentangle topic and syntax to reduce repetition in generated paragraph descriptions of a given image. With the ubiquity of large-scale pre-trained models, we leverage self-supervised learning in the fine-tuning process to increase the robustness of multimodal reasoning.
The rapid development of multimodal learning promises embodied agents capable of interacting with physical environments. This thesis studies a typical embodied task, vision-and-language navigation in discrete scenarios, and proposes an episodic scene memory (ESceme) mechanism to balance generalization and efficiency. We identify one desirable instantiation of the mechanism, namely candidate enhancing, and validate its superiority in various settings. Without extra time and computational cost before inference, ESceme improves performance in unseen environments by a large margin. We hope our findings can inspire more practical explorations of episodic memory in embodied AI.