LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking
Self-supervised pre-training techniques have achieved remarkable progress in
Document AI. Most multimodal pre-trained models use a masked language modeling
objective to learn bidirectional representations on the text modality, but they
differ in pre-training objectives for the image modality. This discrepancy adds
difficulty to multimodal representation learning. In this paper, we propose
LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified
text and image masking. Additionally, LayoutLMv3 is pre-trained with a
word-patch alignment objective to learn cross-modal alignment by predicting
whether the corresponding image patch of a text word is masked. The simple
unified architecture and training objectives make LayoutLMv3 a general-purpose
pre-trained model for both text-centric and image-centric Document AI tasks.
Experimental results show that LayoutLMv3 achieves state-of-the-art performance
not only in text-centric tasks, including form understanding, receipt
understanding, and document visual question answering, but also in
image-centric tasks such as document image classification and document layout
analysis. The code and models are publicly available at
https://aka.ms/layoutlmv3.
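As a rough illustration of the word-patch alignment idea described above, the sketch below predicts, for each text token, whether its corresponding image patch was masked. This is a minimal sketch under assumed names and shapes, not the released LayoutLMv3 implementation.

```python
# Minimal sketch of a word-patch alignment (WPA) objective: a binary
# classifier over text-token representations predicts whether the image
# patch aligned with each word was masked. All names/shapes are assumptions.
import torch
import torch.nn as nn

batch, seq_len, hidden = 2, 16, 768

# Hypothetical multimodal encoder outputs for the text tokens.
text_hidden = torch.randn(batch, seq_len, hidden)

# Hypothetical targets: 1 if the token's aligned image patch was masked,
# 0 otherwise (derived from the image masking pattern).
patch_is_masked = torch.randint(0, 2, (batch, seq_len)).float()

# Binary classification head on top of each text token representation.
wpa_head = nn.Linear(hidden, 1)
logits = wpa_head(text_hidden).squeeze(-1)  # (batch, seq_len)

# Tokens masked by the text-masking objective are typically excluded;
# here a boolean mask marks which tokens participate in WPA.
wpa_token_mask = torch.ones(batch, seq_len, dtype=torch.bool)

loss = nn.functional.binary_cross_entropy_with_logits(
    logits[wpa_token_mask], patch_is_masked[wpa_token_mask]
)
print(float(loss))
```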
TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering
Diffusion models have proven to be powerful generative models in recent
years, yet they still struggle to generate visual text. Several methods have
alleviated this issue by incorporating explicit text position and content as
guidance on where and what text to render. However, these methods still suffer
from several drawbacks, such as limited flexibility and automation, constrained
capability of layout prediction, and restricted style diversity. In this paper,
we present TextDiffuser-2, aiming to unleash the power of language models for
text rendering. Firstly, we fine-tune a large language model for layout
planning. The large language model is capable of automatically generating
keywords for text rendering and also supports layout modification through
chatting. Secondly, we utilize the language model within the diffusion model to
encode positions and text at the line level. Unlike previous methods that
employed tight character-level guidance, this approach generates more diverse
text images. We conduct extensive experiments and incorporate user studies
involving human participants as well as GPT-4V, validating TextDiffuser-2's
capacity to achieve a more rational text layout and generation with enhanced
diversity. The code and model will be available at
https://aka.ms/textdiffuser-2
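To make the line-level encoding concrete, the sketch below serializes each text line together with a coarse position tag into a single prompt for a text encoder, rather than supplying character-level guidance. The coordinate format and tag names are illustrative assumptions, not the released TextDiffuser-2 interface.

```python
# Minimal sketch: serialize line-level text and coarse positions as plain
# tokens for a text encoder. Tag names and the coordinate scheme are assumed.
def build_layout_prompt(caption, lines):
    """lines: list of (text, x, y) with x, y as coarse grid coordinates."""
    parts = [caption]
    for text, x, y in lines:
        # Each line contributes its content plus a coarse position tag.
        parts.append(f"<line> {text} <pos_{x}_{y}>")
    return " ".join(parts)

prompt = build_layout_prompt(
    "a poster of a music festival",
    [("SUMMER BEATS", 12, 5), ("June 21, Riverside Park", 10, 40)],
)
print(prompt)
```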
Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models
Large language models exhibit enhanced zero-shot performance on various tasks
when fine-tuned with instruction-following data. Multimodal
instruction-following models extend these capabilities by integrating both text
and images. However, existing models such as MiniGPT-4 face challenges in
maintaining dialogue coherence in scenarios involving multiple images. A
primary reason is the lack of a specialized dataset for this critical
application. To bridge these gaps, we present SparklesChat, a multimodal
instruction-following model for open-ended dialogues across multiple images. To
support the training, we introduce SparklesDialogue, the first
machine-generated dialogue dataset tailored for word-level interleaved
multi-image and text interactions. Furthermore, we construct SparklesEval, a
GPT-assisted benchmark for quantitatively assessing a model's conversational
competence across multiple images and dialogue turns. Our experiments validate
the effectiveness of SparklesChat in understanding and reasoning across
multiple images and dialogue turns. Specifically, SparklesChat outperformed
MiniGPT-4 on established vision-and-language benchmarks, including the BISON
binary image selection task and the NLVR2 visual reasoning task. Moreover,
SparklesChat scored 8.56 out of 10 on SparklesEval, substantially exceeding
MiniGPT-4's score of 3.91 and nearing GPT-4's score of 9.26. Qualitative
evaluations further demonstrate SparklesChat's generality in handling
real-world applications. All resources will be available at
https://github.com/HYPJUDY/Sparkles
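For intuition about what word-level interleaved multi-image dialogue data looks like, the record below interleaves image references with words inside each message. The field names and placeholder syntax are assumptions for illustration, not the actual SparklesDialogue schema.

```python
# Minimal sketch of an interleaved multi-image dialogue record; the schema
# shown here is hypothetical.
import json

record = {
    "dialogue": [
        {
            "role": "user",
            # Image references are interleaved with words rather than
            # placed only at the start of the message.
            "content": "Compare the mood of IMAGE#1 <img1> with IMAGE#2 <img2>.",
            "images": {"<img1>": "images/0001.jpg", "<img2>": "images/0002.jpg"},
        },
        {
            "role": "assistant",
            "content": "IMAGE#1 <img1> feels calm, while IMAGE#2 <img2> is energetic.",
            "images": {"<img1>": "images/0001.jpg", "<img2>": "images/0002.jpg"},
        },
    ]
}
print(json.dumps(record, indent=2))
```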
Kosmos-2.5: A Multimodal Literate Model
We present Kosmos-2.5, a multimodal literate model for machine reading of
text-intensive images. Pre-trained on large-scale text-intensive images,
Kosmos-2.5 excels in two distinct yet cooperative transcription tasks: (1)
generating spatially-aware text blocks, where each block of text is assigned
its spatial coordinates within the image, and (2) producing structured text
output that captures styles and structures into the markdown format. This
unified multimodal literate capability is achieved through a shared Transformer
architecture, task-specific prompts, and flexible text representations. We
evaluate Kosmos-2.5 on end-to-end document-level text recognition and
image-to-markdown text generation. Furthermore, the model can be readily
adapted for any text-intensive image understanding task with different prompts
through supervised fine-tuning, making it a general-purpose tool for real-world
applications involving text-rich images. This work also paves the way for the
future scaling of multimodal large language models.
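The sketch below illustrates how a single model could be steered between the two transcription tasks by task-specific prompts: one producing spatially-aware text blocks with coordinates, the other producing markdown. The prompt strings and output formats are assumptions for illustration, not Kosmos-2.5's actual interface.

```python
# Minimal sketch of selecting one of two transcription tasks via a
# task-specific prompt; prompts and output formats are hypothetical.
def transcribe(image_path, task):
    # The image is not actually loaded in this sketch; image_path is a placeholder.
    if task == "ocr":
        prompt = "<ocr>"  # hypothetical prompt: spatially-aware text blocks
        # Expected output style: one block per line with bounding-box coordinates.
        expected = "<bbox x0=12 y0=30 x1=480 y1=58> Quarterly Report </bbox>"
    elif task == "markdown":
        prompt = "<md>"   # hypothetical prompt: structured markdown output
        expected = "# Quarterly Report\n\n| Region | Revenue |\n|---|---|"
    else:
        raise ValueError(f"unknown task: {task}")
    return prompt, expected

for task in ("ocr", "markdown"):
    prompt, expected = transcribe("report_page.png", task)
    print(task, "->", prompt)
    print(expected)
```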