What Can Help Pedestrian Detection?
Aggregating extra features has been considered an effective approach to boosting traditional pedestrian detection methods. However, there is still a lack of studies on whether and how CNN-based pedestrian detectors can benefit from these extra features. The first contribution of this paper is to explore this issue by aggregating extra features into a CNN-based pedestrian detection framework. Through extensive experiments, we evaluate the effects of different
kinds of extra features quantitatively. Moreover, we propose a novel network
architecture, namely HyperLearner, to jointly learn pedestrian detection as
well as the given extra feature. Through multi-task training, HyperLearner is able to utilize the information in the given features and improve detection performance without requiring extra inputs at inference. The experimental results on multiple pedestrian benchmarks validate the effectiveness of the proposed HyperLearner.
Comment: Accepted to IEEE International Conference on Computer Vision and Pattern Recognition (CVPR) 201
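As a rough illustration of the multi-task training idea described in this abstract, the sketch below attaches an auxiliary head that regresses a given extra feature (for example, a coarse segmentation-like map) alongside a detection head on a shared backbone, so the extra feature serves only as a training signal and is not required at inference. All module names, shapes, and the loss weighting are illustrative assumptions, not the authors' actual HyperLearner architecture.

```python
# Minimal sketch (not the authors' code): a detector whose auxiliary branch
# predicts an "extra feature" map during training only.
import torch
import torch.nn as nn

class MultiTaskDetector(nn.Module):
    def __init__(self, num_anchors=9, aux_channels=1):
        super().__init__()
        # Shared backbone (illustrative; a real detector would use a deeper CNN).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Detection head: per-location objectness score for each anchor.
        self.det_head = nn.Conv2d(128, num_anchors, 1)
        # Auxiliary head: reconstructs the given extra feature (training-time only).
        self.aux_head = nn.Conv2d(128, aux_channels, 1)

    def forward(self, images):
        feats = self.backbone(images)
        return self.det_head(feats), self.aux_head(feats)

def training_step(model, images, det_targets, extra_feature, aux_weight=0.5):
    det_pred, aux_pred = model(images)
    det_loss = nn.functional.binary_cross_entropy_with_logits(det_pred, det_targets)
    # The extra feature acts only as a supervision signal here.
    aux_loss = nn.functional.mse_loss(aux_pred, extra_feature)
    return det_loss + aux_weight * aux_loss

# Usage with dummy tensors (targets are downsampled by the stride-2 backbone):
model = MultiTaskDetector()
images = torch.randn(2, 3, 64, 64)
det_targets = torch.rand(2, 9, 32, 32)       # per-anchor objectness labels
extra_feature = torch.rand(2, 1, 32, 32)     # the "given extra feature"
loss = training_step(model, images, det_targets, extra_feature)
```
At inference, only `det_head` is evaluated, which is how the extra feature improves detection without adding inputs at test time.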
The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision
We propose the Neuro-Symbolic Concept Learner (NS-CL), a model that learns
visual concepts, words, and semantic parsing of sentences without explicit
supervision on any of them; instead, our model learns by simply looking at
images and reading paired questions and answers. Our model builds an
object-based scene representation and translates sentences into executable,
symbolic programs. To bridge the learning of two modules, we use a
neuro-symbolic reasoning module that executes these programs on the latent
scene representation. Analogous to human concept learning, the perception
module learns visual concepts based on the language description of the object
being referred to. Meanwhile, the learned visual concepts facilitate learning
new words and parsing new sentences. We use curriculum learning to guide the
search over the large compositional space of images and language. Extensive
experiments demonstrate the accuracy and efficiency of our model on learning
visual concepts, word representations, and semantic parsing of sentences.
Further, our method allows easy generalization to new object attributes,
compositions, language concepts, scenes and questions, and even new program
domains. It also empowers applications including visual question answering and
bidirectional image-text retrieval.
Comment: ICLR 2019 (Oral). Project page: http://nscl.csail.mit.edu
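As a concrete, hypothetical illustration of the neuro-symbolic execution described above, the sketch below runs a small symbolic program (scene/filter/count) over per-object latent features, with each visual concept grounded as a learned classifier whose soft scores keep the execution differentiable. The program syntax, concept names, and module shapes are assumptions for illustration, not NS-CL's actual implementation.

```python
# Minimal sketch (illustrative, not the NS-CL codebase): soft execution of a
# small symbolic program over latent object features.
import torch
import torch.nn as nn

class ConceptBank(nn.Module):
    """Each concept is a learned linear classifier over object features."""
    def __init__(self, feat_dim, concepts):
        super().__init__()
        self.classifiers = nn.ModuleDict({c: nn.Linear(feat_dim, 1) for c in concepts})

    def score(self, concept, obj_feats):
        # Probability that each object has the concept, shape [num_objects].
        return torch.sigmoid(self.classifiers[concept](obj_feats)).squeeze(-1)

def execute(program, obj_feats, concepts):
    """Recursively execute a nested-tuple program such as
    ("count", ("filter", "red", ("scene",)))."""
    op = program[0]
    if op == "scene":                     # all objects, with weight 1
        return torch.ones(obj_feats.shape[0])
    if op == "filter":                    # soft-intersect a concept mask
        _, concept, sub = program
        return execute(sub, obj_feats, concepts) * concepts.score(concept, obj_feats)
    if op == "count":                     # differentiable count
        return execute(program[1], obj_feats, concepts).sum()
    raise ValueError(f"unknown op: {op}")

# Usage: 5 objects with 16-dim features; softly count the red cubes.
feats = torch.randn(5, 16)
bank = ConceptBank(16, ["red", "cube"])
prog = ("count", ("filter", "red", ("filter", "cube", ("scene",))))
print(execute(prog, feats, bank))
```
Because every step returns soft scores, a question-answering loss on the final output can backpropagate into the concept classifiers, which is the sense in which language supervises perception here.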
HandMeThat: Human-Robot Communication in Physical and Social Environments
We introduce HandMeThat, a benchmark for a holistic evaluation of instruction
understanding and following in physical and social environments. While previous
datasets primarily focused on language grounding and planning, HandMeThat considers the resolution of ambiguous human instructions based on physical (object states and relations) and social (human actions and goals) information. HandMeThat contains 10,000 episodes of human-robot interactions.
In each episode, the robot first observes a trajectory of human actions towards
her internal goal. Next, the robot receives a human instruction and should take actions to accomplish the subgoal set by the instruction. In this paper,
we present a textual interface for our benchmark, where the robot interacts
with a virtual environment through textual commands. We evaluate several
baseline models on HandMeThat, and show that both offline and online
reinforcement learning algorithms perform poorly on HandMeThat, suggesting
significant room for future work on physical and social human-robot communication and interaction.
Comment: NeurIPS 2022 (Dataset and Benchmark Track). First two authors contributed equally. Project page: http://handmethat.csail.mit.edu
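To give a flavor of how an agent might interact with a text-based benchmark of this kind, here is a toy, self-contained interaction loop: observations and actions are plain text, the instruction is deliberately ambiguous, and a trivial random policy stands in for a baseline. The environment, observation strings, and action format are invented for illustration and are not HandMeThat's actual interface.

```python
# Toy sketch of a text-command interaction loop (not HandMeThat's API).
import random

class ToyTextEnv:
    """A stand-in text environment: the robot must hand over the right object."""
    def __init__(self):
        self.objects = ["red mug", "blue mug", "plate"]
        self.target = random.choice(self.objects)
        self.instruction = "Hand me that mug."   # ambiguous: which mug?

    def reset(self):
        return f"You see: {', '.join(self.objects)}. Human says: {self.instruction}"

    def step(self, action):
        # Actions are plain text commands, e.g. "bring red mug".
        done = action == f"bring {self.target}"
        reward = 1.0 if done else 0.0
        obs = "The human looks satisfied." if done else "The human shakes their head."
        return obs, reward, done

env = ToyTextEnv()
obs = env.reset()
for _ in range(3):                       # a trivial random policy as a baseline
    action = f"bring {random.choice(env.objects)}"
    obs, reward, done = env.step(action)
    print(action, "->", obs)
    if done:
        break
```
In the actual benchmark, resolving the ambiguity requires reasoning over the observed human trajectory and inferred goals rather than guessing, which is what makes the task hard for the reported baselines.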
What's Left? Concept Grounding with Logic-Enhanced Foundation Models
Recent works such as VisProg and ViperGPT have smartly composed foundation
models for visual reasoning, using large language models (LLMs) to produce
programs that can be executed by pre-trained vision-language models. However,
they operate in limited domains, such as 2D images, not fully exploiting the
generalization of language: abstract concepts like "left" can also be grounded
in 3D, temporal, and action data, as in moving to your left. This limited
generalization stems from these inference-only methods' inability to learn or
adapt pre-trained models to a new domain. We propose the Logic-Enhanced
Foundation Model (LEFT), a unified framework that learns to ground and reason
with concepts across domains with a differentiable, domain-independent,
first-order logic-based program executor. LEFT has an LLM interpreter that
outputs a program represented in a general, logic-based reasoning language,
which is shared across all domains and tasks. LEFT's executor then executes the
program with trainable domain-specific grounding modules. We show that LEFT
flexibly learns concepts in four domains: 2D images, 3D scenes, human motions,
and robotic manipulation. It exhibits strong reasoning ability in a wide
variety of tasks, including those that are complex and not seen during
training, and can be easily applied to new domains.
Comment: NeurIPS 2023. First two authors contributed equally. Project page: https://web.stanford.edu/~joycj/projects/left_neurips_202
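As a rough sketch of the kind of differentiable, first-order-logic execution described above, the snippet below grounds concepts with a hypothetical domain-specific module that outputs per-object probabilities and evaluates logical operators and quantifiers softly so everything remains trainable. The expression syntax, concept names, and soft min/max quantifiers are illustrative assumptions, not LEFT's actual executor.

```python
# Minimal sketch (illustrative assumptions, not LEFT's executor): a soft
# first-order-logic evaluator with a trainable, domain-specific grounding module.
import torch
import torch.nn as nn

class GroundingModule(nn.Module):
    """Domain-specific: maps object features to a probability per concept."""
    def __init__(self, feat_dim, concepts):
        super().__init__()
        self.heads = nn.ModuleDict({c: nn.Linear(feat_dim, 1) for c in concepts})

    def forward(self, concept, obj_feats):
        return torch.sigmoid(self.heads[concept](obj_feats)).squeeze(-1)

def evaluate(expr, feats, ground):
    """Evaluate a nested first-order expression, e.g.
    ("exists", ("and", ("concept", "red"), ("concept", "left")))."""
    op = expr[0]
    if op == "concept":
        return ground(expr[1], feats)                   # per-object truth values
    if op == "and":
        return evaluate(expr[1], feats, ground) * evaluate(expr[2], feats, ground)
    if op == "not":
        return 1.0 - evaluate(expr[1], feats, ground)
    if op == "exists":                                  # soft existential quantifier
        return evaluate(expr[1], feats, ground).max()
    if op == "forall":                                  # soft universal quantifier
        return evaluate(expr[1], feats, ground).min()
    raise ValueError(f"unknown op: {op}")

# Usage: is there an object that is red and on the left?
feats = torch.randn(4, 32)
ground = GroundingModule(32, ["red", "left"])
query = ("exists", ("and", ("concept", "red"), ("concept", "left")))
print(evaluate(query, feats, ground))
```
Only the grounding module changes across domains (2D images, 3D scenes, motions, manipulation in the paper's setting); the logical language and evaluator stay shared, which is what allows cross-domain reuse of concepts like "left".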