A Vision-free Baseline for Multimodal Grammar Induction
Past work has shown that paired vision-language signals substantially improve
grammar induction in multimodal datasets such as MSCOCO. We investigate whether
large language models (LLMs) trained only on text can provide a comparably
strong signal for grammar induction in multimodal settings.
We find that our text-only approach, an LLM-based C-PCFG (LC-PCFG), outperforms
previous multimodal methods and achieves state-of-the-art grammar induction
performance for various multimodal datasets. Compared to image-aided grammar
induction, LC-PCFG outperforms the prior state-of-the-art by 7.9 Corpus-F1
points, with an 85% reduction in parameter count and 1.7x faster training
speed. Across three video-assisted grammar induction benchmarks, LC-PCFG
outperforms prior state-of-the-art by up to 7.7 Corpus-F1, with 8.8x faster
training. These results suggest that text-only language models may encode
visually grounded cues that aid grammar induction in multimodal contexts.
Moreover, they underscore the importance of establishing a robust vision-free
baseline when evaluating the benefit of multimodal approaches.
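A minimal sketch of the idea the abstract names, under assumed details: a
compound-PCFG rule parameterization conditioned on a frozen, text-only LLM
embedding of the sentence. The class and parameter names (LCPCFGRules,
llm_dim, z_dim, nt, pt) and dimensions are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn

class LCPCFGRules(nn.Module):
    """Hypothetical LC-PCFG-style rule scorer conditioned on an LLM feature."""

    def __init__(self, llm_dim=4096, z_dim=64, nt=30, pt=60):
        super().__init__()
        self.nt, self.n_sym = nt, nt + pt      # nonterminals; all symbols
        self.proj = nn.Linear(llm_dim, z_dim)  # compress LLM feature to latent z
        # score every binary rule A -> B C as a function of z
        self.rule_mlp = nn.Sequential(
            nn.Linear(z_dim, 256),
            nn.ReLU(),
            nn.Linear(256, nt * self.n_sym * self.n_sym),
        )

    def forward(self, llm_embedding):
        # llm_embedding: (batch, llm_dim), taken from a frozen text-only LLM
        z = self.proj(llm_embedding)
        logits = self.rule_mlp(z).view(-1, self.nt, self.n_sym * self.n_sym)
        # normalize over right-hand sides; the resulting log-probabilities
        # would feed the inside algorithm for marginal-likelihood training
        return logits.log_softmax(dim=-1).view(
            -1, self.nt, self.n_sym, self.n_sym)

rules = LCPCFGRules()
log_probs = rules(torch.randn(2, 4096))   # -> (2, 30, 90, 90)
```

If only the small projection and rule network are trained while the LLM stays
frozen, that would be consistent with the parameter-count and training-speed
savings the abstract reports, though the exact architecture is an assumption here.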
Synthetic Document Generator for Annotation-free Layout Recognition
Analyzing the layout of a document to identify headers, sections, tables,
figures, etc., is critical to understanding its content. Deep-learning-based
approaches for detecting the layout structure of document images have been
promising. However, these methods require a large number of annotated examples
during training, which are both expensive and time-consuming to obtain. We
describe here a synthetic document generator that automatically produces
realistic documents with labels for spatial positions, extents and categories
of the layout elements. The proposed generative process treats every physical
component of a document as a random variable and models their intrinsic
dependencies using a Bayesian Network graph. Our hierarchical formulation using
stochastic templates allows parameter sharing between documents to retain
broad themes, while the distributional characteristics produce visually
unique samples, thereby capturing complex and diverse layouts. We empirically
illustrate that a deep layout detection model trained purely on the synthetic
documents can match the performance of a model that uses real documents.
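An illustrative sketch, not the authors' generator: each layout component is a
random variable sampled conditionally on its parents in a small directed graph,
mirroring the Bayesian-network formulation described in the abstract. All
variable names and distributions below are invented for exposition.

```python
import random

def sample_document(rng=random):
    """Sample one synthetic document layout from a toy Bayesian network."""
    doc = {"elements": []}
    # root of the network: a broad template shared across documents
    doc["n_columns"] = rng.choice([1, 2])
    # children condition on the template (edges of the Bayesian network)
    doc["has_header"] = rng.random() < (0.9 if doc["n_columns"] == 1 else 0.7)
    y = 0.12 if doc["has_header"] else 0.02        # vertical cursor, page fraction
    for _ in range(rng.randint(2, 5)):             # number of body elements
        # category depends on the template; extent depends on the category
        cat = rng.choices(["paragraph", "table", "figure"],
                          weights=[0.7, 0.15, 0.15])[0]
        h = rng.uniform(0.05, 0.15) if cat == "paragraph" else rng.uniform(0.10, 0.25)
        width = 0.90 / doc["n_columns"]
        # position, extent, and category are recorded at sampling time,
        # so every element carries its label with no manual annotation
        doc["elements"].append({"category": cat,
                                "bbox": (0.05, y, 0.05 + width, min(y + h, 0.98))})
        y = min(y + h + 0.02, 0.98)
    return doc

print(sample_document())
```

Because the generator itself draws every position, extent, and category, the
spatial labels come for free, which is the annotation-free property the
abstract highlights.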