Image Manipulation via Multi-Hop Instructions -- A New Dataset and Weakly-Supervised Neuro-Symbolic Approach

Authors: Garg Dinesh; Garg Poorva; Goswami Ashish; Gupta Mohit; Khandelwal Dinesh; Modi Satyam; Mondal Arnab Kumar; Shah Kevin; Singh Harman; Singla Parag
Publication date: 24/10/2023

We are interested in image manipulation via natural language text -- a task that is useful for multiple AI applications but requires complex reasoning over multi-modal spaces. We extend the recently proposed Neuro Symbolic Concept Learning (NSCL), which has been quite effective for the task of Visual Question Answering (VQA), to the task of image manipulation. Our system, referred to as NeuroSIM, can perform complex multi-hop reasoning over multi-object scenes and only requires weak supervision in the form of annotated data for VQA. NeuroSIM parses an instruction into a symbolic program, based on a Domain Specific Language (DSL) comprising object attributes and manipulation operations, that guides its execution. We create a new dataset for the task, and extensive experiments demonstrate that NeuroSIM is highly competitive with or beats SOTA baselines that make use of supervised data for manipulation. Comment: EMNLP 2023 (long paper, main conference)
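To make the idea of "parsing an instruction into a symbolic program" concrete, here is a toy sketch. It is not NeuroSIM's DSL or executor: the `Obj` class, the `filter` and `change` operations, and the `run` function are hypothetical names chosen for illustration, and the scene is a plain list of attribute records rather than an image.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Obj:
    """Hypothetical symbolic description of one object in a scene."""
    shape: str
    color: str
    size: str


def run(program, scene):
    """Execute a list of (op, attribute, value) steps over a scene.

    'filter' narrows the current selection; 'change' edits the selected
    objects. Both operations are illustrative assumptions, not the paper's DSL.
    """
    scene = list(scene)                   # copy so the input is left untouched
    selected = list(range(len(scene)))    # start with every object selected
    for op, attr, value in program:
        if op == "filter":
            selected = [i for i in selected if getattr(scene[i], attr) == value]
        elif op == "change":
            for i in selected:
                scene[i] = replace(scene[i], **{attr: value})
        else:
            raise ValueError(f"unknown op: {op}")
    return scene


# "Change the color of the small cube to red", written as a symbolic program.
scene = [
    Obj("cube", "blue", "small"),
    Obj("cube", "green", "large"),
    Obj("sphere", "blue", "small"),
]
program = [
    ("filter", "size", "small"),
    ("filter", "shape", "cube"),
    ("change", "color", "red"),
]
result = run(program, scene)
```

In this sketch only the first object is edited; the other two, and the original `scene` list, are unchanged.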

arXiv.org e-Print Archive

Language-guided Robot Grasping: CLIP-based Referring Grasp Synthesis in Clutter

Authors: Goel Arushi; Kasaei Hamidreza; Kasaei Mohammadreza; Li Alex; Tziafas Georgios; Xu Yucheng
Publication date: 30/08/2023

Robots operating in human-centric environments require the integration of visual grounding and grasping capabilities to effectively manipulate objects based on user instructions. This work focuses on the task of referring grasp synthesis, which predicts a grasp pose for an object referred to through natural language in cluttered scenes. Existing approaches often employ multi-stage pipelines that first segment the referred object and then propose a suitable grasp, and are evaluated on private datasets or in simulators that do not capture the complexity of natural indoor scenes. To address these limitations, we develop a challenging benchmark based on cluttered indoor scenes from the OCID dataset, for which we generate referring expressions and connect them with 4-DoF grasp poses. Further, we propose a novel end-to-end model (CROG) that leverages the visual grounding capabilities of CLIP to learn grasp synthesis directly from image-text pairs. Our results show that vanilla integration of CLIP with pretrained models transfers poorly to our challenging benchmark, while CROG achieves significant improvements both in terms of grounding and grasping. Extensive robot experiments in both simulation and hardware demonstrate the effectiveness of our approach in challenging interactive object grasping scenarios that include clutter.

Edinburgh Research Explorer

Language-guided Robot Grasping: CLIP-based Referring Grasp Synthesis in Clutter

Authors: Goel Arushi; Kasaei Hamidreza; Kasaei Mohammadreza; Li Zhibin; Tziafas Georgios; Xu Yucheng
Publication date: 09/11/2023

arXiv.org e-Print Archive

Contextualizing Multiple Tasks via Learning to Decompose

Authors: Hong Lanqing; Li Zhenguo; Wei Xiu-Shen; Ye Han-Jia; Zhan De-Chuan; Zhou Da-Wei
Publication date: 15/06/2021

A single instance can possess multiple portraits and reveal diverse relationships with others according to different contexts. Those ambiguities increase the difficulty of learning a generalizable model when there exists one concept or mixed concepts in a task. We propose a general approach, Learning to Decompose Network (LeadNet), for both cases, which contextualizes a model through meta-learning multiple maps for concept discovery -- the representations of instances are decomposed and adapted conditioned on the contexts. By taking a holistic view of multiple latent components over instances in a sampled pseudo task, LeadNet learns to automatically select the right concept by incorporating the rich semantics inside and between objects. LeadNet demonstrates its superiority in various applications, including exploring multiple views of confusing tasks, out-of-distribution recognition, and few-shot image classification.

arXiv.org e-Print Archive

Open Vocabulary Object Detection with Proposal Mining and Prediction Equalization

Authors: Chen Peixian; Li Ke; Lin Mingbao; Lin Shaohui; Ren Bo; Shen Yunhang; Sheng Kekai; Zhang Mengdan
Publication date: 24/11/2022

Open-vocabulary object detection (OVD) aims to scale up vocabulary size to detect objects of novel categories beyond the training vocabulary. Recent work resorts to the rich knowledge in pre-trained vision-language models. However, existing methods are ineffective in proposal-level vision-language alignment. Meanwhile, the models usually suffer from confidence bias toward base categories and perform worse on novel ones. To overcome these challenges, we present MEDet, a novel and effective OVD framework with proposal mining and prediction equalization. First, we design an online proposal mining step to refine the inherited vision-semantic knowledge from coarse to fine, allowing for proposal-level detection-oriented feature alignment. Second, based on causal inference theory, we introduce a class-wise backdoor adjustment to reinforce the predictions on novel categories and improve the overall OVD performance. Extensive experiments on the COCO and LVIS benchmarks verify the superiority of MEDet over the competing approaches in detecting objects of novel categories, e.g., 32.6% AP50 on COCO and 22.4% mask mAP on LVIS.
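For readers unfamiliar with the term, the general backdoor adjustment formula from causal inference is given below. This is the textbook form only; the symbols X (input), Y (prediction), and z (a confounder that satisfies the backdoor criterion) are generic, and the abstract does not specify how MEDet instantiates the adjustment class-wise.

```latex
P\bigl(Y \mid \mathrm{do}(X)\bigr) \;=\; \sum_{z} P\bigl(Y \mid X, z\bigr)\, P(z)
```

The sum replaces the observed conditional P(Y | X) with an average over the confounder's marginal distribution, which removes the bias that the confounder introduces.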

arXiv.org e-Print Archive

Segment Anything

Authors: Berg Alexander C.; Dollár Piotr; Girshick Ross; Gustafson Laura; Kirillov Alexander; Lo Wan-Yen; Mao Hanzi; Mintun Eric; Ravi Nikhila; Rolland Chloe; Whitehead Spencer; Xiao Tete
Publication date: 05/04/2023

We introduce the Segment Anything (SA) project: a new task, model, and dataset for image segmentation. Using our efficient model in a data collection loop, we built the largest segmentation dataset to date (by far), with over 1 billion masks on 11M licensed and privacy-respecting images. The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks. We evaluate its capabilities on numerous tasks and find that its zero-shot performance is impressive -- often competitive with or even superior to prior fully supervised results. We are releasing the Segment Anything Model (SAM) and corresponding dataset (SA-1B) of 1B masks and 11M images at https://segment-anything.com to foster research into foundation models for computer vision. Comment: Project web-page: https://segment-anything.co

arXiv.org e-Print Archive