Image Manipulation via Multi-Hop Instructions -- A New Dataset and Weakly-Supervised Neuro-Symbolic Approach
We are interested in image manipulation via natural language text -- a task
that is useful for multiple AI applications but requires complex reasoning over
multi-modal spaces. We extend the recently proposed Neuro-Symbolic Concept
Learning (NSCL) framework, which has been effective for Visual Question
Answering (VQA), to the task of image manipulation. Our system, referred to as
NeuroSIM, can perform complex multi-hop reasoning over multi-object scenes and
requires only weak supervision in the form of annotated VQA data. NeuroSIM
parses an instruction into a symbolic program that guides its execution, based
on a Domain-Specific Language (DSL) comprising object attributes and
manipulation operations. We create a new dataset for the task, and extensive
experiments demonstrate that NeuroSIM is highly competitive with or beats SOTA
baselines that make use of supervised data for manipulation.
Comment: EMNLP 2023 (long paper, main conference).
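The following is a minimal sketch of the DSL idea described above: an instruction is reduced to a symbolic program of attribute filters and manipulation operations, which is then executed step by step over the objects in a scene. All names (Obj, Scene, OPS) are illustrative placeholders, not NeuroSIM's actual implementation.

```python
# Hypothetical sketch of a DSL-driven manipulation pipeline in the spirit
# of NeuroSIM; the real system learns the parser and executor neurally.
from dataclasses import dataclass

@dataclass
class Obj:
    color: str
    shape: str

@dataclass
class Scene:
    objects: list

# DSL primitives: attribute filters and manipulation operations.
def filter_color(objs, color):
    return [o for o in objs if o.color == color]

def change_color(objs, color):
    for o in objs:
        o.color = color  # mutate the selected objects in place
    return objs

OPS = {"filter_color": filter_color, "change_color": change_color}

def execute(program, scene):
    """Run a symbolic program (list of (op, arg) steps) over the scene."""
    objs = scene.objects
    for op, arg in program:
        objs = OPS[op](objs, arg)
    return scene

# "Change the red cube to blue" -> a two-step symbolic program.
scene = Scene([Obj("red", "cube"), Obj("green", "sphere")])
program = [("filter_color", "red"), ("change_color", "blue")]
execute(program, scene)
print([(o.color, o.shape) for o in scene.objects])
# [('blue', 'cube'), ('green', 'sphere')]
```

A multi-hop instruction would simply compile to a longer program, chaining several filters before the final manipulation step.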
Language-guided Robot Grasping: CLIP-based Referring Grasp Synthesis in Clutter
Robots operating in human-centric environments require the integration of
visual grounding and grasping capabilities to effectively manipulate objects
based on user instructions. This work focuses on the task of referring grasp
synthesis, which predicts a grasp pose for an object referred through natural
language in cluttered scenes. Existing approaches often employ multi-stage
pipelines that first segment the referred object and then propose a suitable
grasp, and are evaluated in private datasets or simulators that do not capture
the complexity of natural indoor scenes. To address these limitations, we
develop a challenging benchmark based on cluttered indoor scenes from the OCID
dataset, for which we generate referring expressions and pair them with
4-DoF grasp poses. Further, we propose a novel end-to-end model (CROG) that
leverages the visual grounding capabilities of CLIP to learn grasp synthesis
directly from image-text pairs. Our results show that vanilla integration of
CLIP with pretrained models transfers poorly in our challenging benchmark,
while CROG achieves significant improvements both in terms of grounding and
grasping. Extensive robot experiments in both simulation and hardware
demonstrate the effectiveness of our approach in challenging interactive object
grasping scenarios that include clutter.
Comment: Poster, CoRL 2023. Dataset and code available here:
https://github.com/gtziafas/OCID-VL
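As a concrete reference point, the sketch below shows the kind of "vanilla CLIP" grounding the abstract contrasts with the end-to-end CROG model: candidate object crops are scored against the referring expression, and the winner would be handed to a separate grasp predictor. It uses the public OpenAI CLIP API; predict_grasp_4dof is a hypothetical placeholder for the downstream 4-DoF grasp head.

```python
# Multi-stage baseline sketch: CLIP grounds the referred object, then a
# separate (hypothetical) head proposes a grasp. Not CROG's architecture.
import torch
import clip  # OpenAI CLIP: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def ground_referred_object(crops, expression):
    """Return the index of the crop (PIL image) best matching the text."""
    images = torch.stack([preprocess(c) for c in crops]).to(device)
    text = clip.tokenize([expression]).to(device)
    with torch.no_grad():
        img_feats = model.encode_image(images)
        txt_feats = model.encode_text(text)
        img_feats /= img_feats.norm(dim=-1, keepdim=True)
        txt_feats /= txt_feats.norm(dim=-1, keepdim=True)
        sims = (img_feats @ txt_feats.T).squeeze(1)  # cosine similarity
    return sims.argmax().item()

# crops: object candidates from an upstream detector/segmenter.
# idx = ground_referred_object(crops, "the blue mug behind the bowl")
# grasp = predict_grasp_4dof(crops[idx])  # hypothetical: (x, y, theta, width)
```

The abstract's finding is that this kind of naive composition transfers poorly to cluttered scenes, motivating training grounding and grasping jointly from image-text pairs.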
Contextualizing Multiple Tasks via Learning to Decompose
A single instance can possess multiple portraits and reveal diverse
relationships with others depending on the context. These ambiguities make it
harder to learn a generalizable model when a task involves either a single
concept or a mixture of concepts. We propose a general approach, the Learning
to Decompose Network (LeadNet), for both cases, which contextualizes a model
by meta-learning multiple maps for concept discovery -- the representations of
instances are decomposed and adapted conditioned on the context. By taking a
holistic view over the multiple latent components of instances in a sampled
pseudo task, LeadNet learns to automatically select the right concept,
incorporating the rich semantics within and between objects. LeadNet
demonstrates its superiority in various applications, including exploring
multiple views of confusing tasks, out-of-distribution recognition, and
few-shot image classification.
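A minimal sketch of the multiple-concept-maps idea, assuming within-task labels are re-indexed to 0..C-1: instance embeddings are projected through several candidate maps, and the map under which the sampled pseudo task is most self-consistent is selected. The module below is an illustration only; LeadNet's actual architecture differs in detail.

```python
# Illustrative "learning to decompose" module: several linear concept maps
# compete to explain a sampled pseudo task; the best-fitting one embeds the
# query. Hypothetical simplification, not LeadNet's real implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptMaps(nn.Module):
    def __init__(self, dim, n_concepts=4):
        super().__init__()
        self.maps = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_concepts))

    def forward(self, support, support_y, query):
        """support: [N, dim], support_y: [N] long in 0..C-1, query: [M, dim]."""
        scores, embedded = [], []
        for m in self.maps:
            s, q = m(support), m(query)
            # Class prototypes under this map (unique() is sorted, so
            # prototype rows align with labels 0..C-1).
            protos = torch.stack([s[support_y == c].mean(0)
                                  for c in support_y.unique()])
            logits = -torch.cdist(s, protos)  # nearer prototype = higher score
            scores.append(F.cross_entropy(logits, support_y))
            embedded.append(q)
        best = torch.stack(scores).argmin().item()  # lowest support loss wins
        return embedded[best], best
```

The selection signal here is the support-set loss of each map; a learned selector over richer task statistics would play the same role.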
Open Vocabulary Object Detection with Proposal Mining and Prediction Equalization
Open-vocabulary object detection (OVD) aims to scale up vocabulary size to
detect objects of novel categories beyond the training vocabulary. Recent work
resorts to the rich knowledge in pre-trained vision-language models. However,
existing methods are ineffective in proposal-level vision-language alignment.
Meanwhile, the models usually suffer from confidence bias toward base
categories and perform worse on novel ones. To overcome the challenges, we
present MEDet, a novel and effective OVD framework with proposal mining and
prediction equalization. First, we design an online proposal mining strategy to refine
the inherited vision-semantic knowledge from coarse to fine, allowing for
proposal-level detection-oriented feature alignment. Second, based on causal
inference theory, we introduce a class-wise backdoor adjustment to reinforce
the predictions on novel categories to improve the overall OVD performance.
Extensive experiments on COCO and LVIS benchmarks verify the superiority of
MEDet over the competing approaches in detecting objects of novel categories,
e.g., 32.6% AP50 on COCO and 22.4% mask mAP on LVIS.
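The prediction-equalization idea can be illustrated with a generic logit-adjustment-style correction, where class priors estimated from training data damp the confidence bias toward frequently seen base categories. This is a hedged rendering of the general technique, not MEDet's exact backdoor-adjustment formulation.

```python
# Generic class-wise equalization sketch: subtracting a scaled log-prior
# from the logits boosts rare (novel-leaning) classes at inference time.
import torch

def equalize_logits(logits, class_prior, tau=1.0):
    """logits: [N, C] raw scores; class_prior: [C] empirical frequencies."""
    return logits - tau * torch.log(class_prior + 1e-8)

# Example: two base classes seen ~10x more often than two novel classes.
prior = torch.tensor([0.45, 0.45, 0.05, 0.05])
logits = torch.tensor([[2.0, 1.9, 1.8, 1.7]])
print(equalize_logits(logits, prior).softmax(-1))
# Probability mass shifts from the base classes toward the novel ones.
```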
Segment Anything
We introduce the Segment Anything (SA) project: a new task, model, and
dataset for image segmentation. Using our efficient model in a data collection
loop, we built the largest segmentation dataset to date (by far), with over 1
billion masks on 11M licensed and privacy-respecting images. The model is
designed and trained to be promptable, so it can transfer zero-shot to new
image distributions and tasks. We evaluate its capabilities on numerous tasks
and find that its zero-shot performance is impressive -- often competitive with
or even superior to prior fully supervised results. We are releasing the
Segment Anything Model (SAM) and corresponding dataset (SA-1B) of 1B masks and
11M images at https://segment-anything.com to foster research into foundation
models for computer vision.
Comment: Project web page: https://segment-anything.com
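For reference, zero-shot promptable segmentation with the released model follows the predictor interface documented in the segment-anything repository; the checkpoint file below is assumed to have been downloaded locally beforehand.

```python
# Prompt SAM with a single foreground point; it returns candidate masks
# with confidence scores, no task-specific fine-tuning required.
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("scene.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)  # precompute the image embedding once

masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),  # (x, y) pixel prompt
    point_labels=np.array([1]),           # 1 = foreground point
    multimask_output=True,
)
best = masks[scores.argmax()]  # boolean HxW mask for the top proposal
```

Because the image embedding is computed once in set_image, additional point or box prompts on the same image are nearly free, which is what makes the promptable design practical for interactive use.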