DocumentCLIP: Linking Figures and Main Body Text in Reflowed Documents
Vision-language pretraining models have achieved great success in supporting
multimedia applications by understanding the alignments between images and
text. However, existing vision-language pretraining models primarily focus on
understanding a single image paired with a single piece of text, and they
often ignore alignment at the intra-document level, where a document consists
of multiple sentences and multiple images. In this work, we propose
DocumentCLIP, a salience-aware contrastive learning framework that trains
vision-language pretraining models to comprehend the interaction between
images and longer text within documents. Our model benefits real-world
multimodal document understanding, such as news articles, magazines, and
product descriptions, which contain linguistically and visually richer
content. To the best of our knowledge, we are the first to explore multimodal
intra-document links via contrastive learning. In addition, we collect a large
Wikipedia dataset for pretraining, which provides a variety of topics and
structures. Experiments show that DocumentCLIP not only outperforms
state-of-the-art baselines in the supervised setting, but also achieves the
best zero-shot performance in the wild under human evaluation. Our code is
available at https://github.com/FuxiaoLiu/DocumentCLIP.
Comment: 8 pages, 5 figures. In submission.
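The abstract names a salience-aware contrastive objective over intra-document image-sentence pairs but does not spell out its form. As a rough illustration only, a minimal InfoNCE-style loss over one document might look like the sketch below; the precomputed embeddings, the linking targets, and every name here are our assumptions, not the paper's released code, and the salience weighting is omitted.

# Minimal sketch of an intra-document image-sentence contrastive loss
# (InfoNCE-style). Assumes figure and sentence embeddings are precomputed
# by some encoders; all names are illustrative, not DocumentCLIP's code.
import torch
import torch.nn.functional as F

def intra_document_contrastive_loss(image_emb, sent_emb, link_targets,
                                    temperature=0.07):
    """image_emb: (I, d) figure embeddings for one document.
    sent_emb: (S, d) sentence embeddings for the same document.
    link_targets: (I,) index of the sentence each figure is linked to.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    sent_emb = F.normalize(sent_emb, dim=-1)
    logits = image_emb @ sent_emb.t() / temperature  # (I, S) similarities
    # Each figure's linked sentence is the positive; all other sentences
    # in the same document serve as in-document negatives.
    return F.cross_entropy(logits, link_targets)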
Improving Policy Learning via Language Dynamics Distillation
Recent work has shown that augmenting environments with language descriptions improves policy learning. However, for environments with complex language abstractions, learning how to ground language to observations is difficult due to sparse, delayed rewards. We propose Language Dynamics Distillation (LDD), which pretrains a model to predict environment dynamics given demonstrations with language descriptions, and then fine-tunes these language-aware pretrained representations via reinforcement learning (RL). In this way, the model is trained to both maximize expected reward and retain knowledge about how language relates to environment dynamics. On SILG, a benchmark of five tasks with language descriptions that evaluate distinct generalization challenges on unseen environments (NetHack, ALFWorld, RTFM, Messenger, and Touchdown), LDD outperforms tabula-rasa RL, VAE pretraining, and methods that learn from unlabeled demonstrations in inverse RL and reward shaping with pretrained experts. In our analyses, we show that language descriptions in demonstrations improve sample efficiency and generalization across environments, and that dynamics modeling with expert demonstrations is more effective than with non-experts.
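As a hedged sketch of the first stage described above (pretraining a dynamics model on demonstrations paired with language), the core update might look like the following. The architecture, feature shapes, and MSE objective are illustrative assumptions, not the authors' implementation; the subsequent RL fine-tuning stage is omitted.

# Toy stand-in for LDD's dynamics pretraining: predict next-observation
# features from (observation, language, action) demonstration tuples.
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    def __init__(self, obs_dim, lang_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + lang_dim + act_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, obs_dim),  # predicted next-observation features
        )

    def forward(self, obs, lang, act):
        return self.net(torch.cat([obs, lang, act], dim=-1))

def pretrain_step(model, optimizer, obs, lang, act, next_obs):
    # One dynamics-distillation step on a batch of demonstration transitions.
    pred = model(obs, lang, act)
    loss = nn.functional.mse_loss(pred, next_obs)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()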
Universal Language Model Fine-tuning for Text Classification
Inductive transfer learning has greatly impacted computer vision, but
existing approaches in NLP still require task-specific modifications and
training from scratch. We propose Universal Language Model Fine-tuning
(ULMFiT), an effective transfer learning method that can be applied to any task
in NLP, and introduce techniques that are key for fine-tuning a language model.
Our method significantly outperforms the state-of-the-art on six text
classification tasks, reducing the error by 18-24% on the majority of datasets.
Furthermore, with only 100 labeled examples, it matches the performance of
training from scratch on 100x more data. We open-source our pretrained models
and code.
Comment: ACL 2018, fixed denominator in Equation 3, line
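Among ULMFiT's key fine-tuning techniques is the slanted triangular learning-rate schedule, which is likely the Equation 3 referenced in the comment above. A minimal sketch from the published formula, with our own variable names and the paper's stated default hyperparameters:

# Slanted triangular learning rate: rise linearly for the first cut_frac
# of training, then decay linearly back toward lr_max / ratio.
def slanted_triangular_lr(t, T, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at step t out of T total training steps."""
    cut = max(1, int(T * cut_frac))  # step at which the rate peaks
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return lr_max * (1 + p * (ratio - 1)) / ratio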