A Vision-free Baseline for Multimodal Grammar Induction
Past work has shown that paired vision-language signals substantially improve
grammar induction in multimodal datasets such as MSCOCO. We investigate whether
large language models (LLMs) trained only on text can provide a comparably
strong signal for grammar induction in multimodal settings.
We find that our text-only approach, an LLM-based C-PCFG (LC-PCFG), outperforms
previous multimodal methods and achieves state-of-the-art grammar induction
performance for various multimodal datasets. Compared to image-aided grammar
induction, LC-PCFG outperforms the prior state-of-the-art by 7.9 Corpus-F1
points, with an 85% reduction in parameter count and 1.7x faster training
speed. Across three video-assisted grammar induction benchmarks, LC-PCFG
outperforms prior state-of-the-art by up to 7.7 Corpus-F1, with 8.8x faster
training. These results suggest that text-only language models may encode
visually grounded cues that aid grammar induction in multimodal contexts.
Moreover, they underscore the importance of establishing a robust vision-free
baseline when evaluating the benefit of multimodal approaches.
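A minimal sketch of the idea the abstract names, under assumed details: a
compound-PCFG rule parameterization conditioned on a frozen, text-only LLM
embedding of the sentence. The class and parameter names (LCPCFGRules,
llm_dim, z_dim, nt, pt) and dimensions are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn

class LCPCFGRules(nn.Module):
    """Hypothetical LC-PCFG-style rule scorer conditioned on an LLM feature."""

    def __init__(self, llm_dim=4096, z_dim=64, nt=30, pt=60):
        super().__init__()
        self.nt, self.n_sym = nt, nt + pt      # nonterminals; all symbols
        self.proj = nn.Linear(llm_dim, z_dim)  # compress LLM feature to latent z
        # score every binary rule A -> B C as a function of z
        self.rule_mlp = nn.Sequential(
            nn.Linear(z_dim, 256),
            nn.ReLU(),
            nn.Linear(256, nt * self.n_sym * self.n_sym),
        )

    def forward(self, llm_embedding):
        # llm_embedding: (batch, llm_dim), taken from a frozen text-only LLM
        z = self.proj(llm_embedding)
        logits = self.rule_mlp(z).view(-1, self.nt, self.n_sym * self.n_sym)
        # normalize over right-hand sides; the resulting log-probabilities
        # would feed the inside algorithm for marginal-likelihood training
        return logits.log_softmax(dim=-1).view(
            -1, self.nt, self.n_sym, self.n_sym)

rules = LCPCFGRules()
log_probs = rules(torch.randn(2, 4096))   # -> (2, 30, 90, 90)
```

If only the small projection and rule network are trained while the LLM stays
frozen, that would be consistent with the parameter-count and training-speed
savings the abstract reports, though the exact architecture is an assumption here.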
Synthetic Document Generator for Annotation-free Layout Recognition
Analyzing the layout of a document to identify headers, sections, tables,
figures, etc., is critical to understanding its content. Deep-learning-based
approaches for detecting the layout structure of document images have been
promising. However, these methods require a large number of annotated examples
during training, which are both expensive and time-consuming to obtain. We
describe here a synthetic document generator that automatically produces
realistic documents with labels for spatial positions, extents and categories
of the layout elements. The proposed generative process treats every physical
component of a document as a random variable and models their intrinsic
dependencies using a Bayesian Network graph. Our hierarchical formulation using
stochastic templates allows parameter sharing between documents to retain
broad themes, while the distributional characteristics produce visually
unique samples, thereby capturing complex and diverse layouts. We empirically
illustrate that a deep layout detection model trained purely on the synthetic
documents can match the performance of a model that uses real documents.
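An illustrative sketch, not the authors' generator: each layout component is a
random variable sampled conditionally on its parents in a small directed graph,
mirroring the Bayesian-network formulation described in the abstract. All
variable names and distributions below are invented for exposition.

```python
import random

def sample_document(rng=random):
    """Sample one synthetic document layout from a toy Bayesian network."""
    doc = {"elements": []}
    # root of the network: a broad template shared across documents
    doc["n_columns"] = rng.choice([1, 2])
    # children condition on the template (edges of the Bayesian network)
    doc["has_header"] = rng.random() < (0.9 if doc["n_columns"] == 1 else 0.7)
    y = 0.12 if doc["has_header"] else 0.02        # vertical cursor, page fraction
    for _ in range(rng.randint(2, 5)):             # number of body elements
        # category depends on the template; extent depends on the category
        cat = rng.choices(["paragraph", "table", "figure"],
                          weights=[0.7, 0.15, 0.15])[0]
        h = rng.uniform(0.05, 0.15) if cat == "paragraph" else rng.uniform(0.10, 0.25)
        width = 0.90 / doc["n_columns"]
        # position, extent, and category are recorded at sampling time,
        # so every element carries its label with no manual annotation
        doc["elements"].append({"category": cat,
                                "bbox": (0.05, y, 0.05 + width, min(y + h, 0.98))})
        y = min(y + h + 0.02, 0.98)
    return doc

print(sample_document())
```

Because the generator itself draws every position, extent, and category, the
spatial labels come for free, which is the annotation-free property the
abstract highlights.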