LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking
Self-supervised pre-training techniques have achieved remarkable progress in
Document AI. Most multimodal pre-trained models use a masked language modeling
objective to learn bidirectional representations on the text modality, but they
differ in pre-training objectives for the image modality. This discrepancy adds
difficulty to multimodal representation learning. In this paper, we propose
LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified
text and image masking. Additionally, LayoutLMv3 is pre-trained with a
word-patch alignment objective to learn cross-modal alignment by predicting
whether the corresponding image patch of a text word is masked. The simple
unified architecture and training objectives make LayoutLMv3 a general-purpose
pre-trained model for both text-centric and image-centric Document AI tasks.
Experimental results show that LayoutLMv3 achieves state-of-the-art performance
not only in text-centric tasks, including form understanding, receipt
understanding, and document visual question answering, but also in
image-centric tasks such as document image classification and document layout
analysis. The code and models are publicly available at
https://aka.ms/layoutlmv3.
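As a rough illustration of the word-patch alignment idea described above, the sketch below predicts, for each text token, whether its corresponding image patch was masked. This is a minimal sketch under assumed names and shapes, not the released LayoutLMv3 implementation.

```python
# Minimal sketch of a word-patch alignment (WPA) objective: a binary
# classifier over text-token representations predicts whether the image
# patch aligned with each word was masked. All names/shapes are assumptions.
import torch
import torch.nn as nn

batch, seq_len, hidden = 2, 16, 768

# Hypothetical multimodal encoder outputs for the text tokens.
text_hidden = torch.randn(batch, seq_len, hidden)

# Hypothetical targets: 1 if the token's aligned image patch was masked,
# 0 otherwise (derived from the image masking pattern).
patch_is_masked = torch.randint(0, 2, (batch, seq_len)).float()

# Binary classification head on top of each text token representation.
wpa_head = nn.Linear(hidden, 1)
logits = wpa_head(text_hidden).squeeze(-1)  # (batch, seq_len)

# Tokens masked by the text-masking objective are typically excluded;
# here a boolean mask marks which tokens participate in WPA.
wpa_token_mask = torch.ones(batch, seq_len, dtype=torch.bool)

loss = nn.functional.binary_cross_entropy_with_logits(
    logits[wpa_token_mask], patch_is_masked[wpa_token_mask]
)
print(float(loss))
```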
TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering
Diffusion models have proven to be powerful generative models in recent
years, yet they still struggle to generate visual text. Several methods have
alleviated this issue by incorporating explicit text position and content as
guidance on where and what text to render. However, these methods still suffer
from several drawbacks, such as limited flexibility and automation, constrained
capability of layout prediction, and restricted style diversity. In this paper,
we present TextDiffuser-2, aiming to unleash the power of language models for
text rendering. Firstly, we fine-tune a large language model for layout
planning. The large language model is capable of automatically generating
keywords for text rendering and also supports layout modification through
chatting. Secondly, we utilize the language model within the diffusion model to
encode positions and text at the line level. Unlike previous methods that
employed tight character-level guidance, this approach generates more diverse
text images. We conduct extensive experiments and incorporate user studies
involving human participants as well as GPT-4V, validating TextDiffuser-2's
capacity to achieve a more rational text layout and generation with enhanced
diversity. The code and model will be available at
https://aka.ms/textdiffuser-2
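To make the line-level encoding concrete, the sketch below serializes each text line together with a coarse position tag into a single prompt for a text encoder, rather than supplying character-level guidance. The coordinate format and tag names are illustrative assumptions, not the released TextDiffuser-2 interface.

```python
# Minimal sketch: serialize line-level text and coarse positions as plain
# tokens for a text encoder. Tag names and the coordinate scheme are assumed.
def build_layout_prompt(caption, lines):
    """lines: list of (text, x, y) with x, y as coarse grid coordinates."""
    parts = [caption]
    for text, x, y in lines:
        # Each line contributes its content plus a coarse position tag.
        parts.append(f"<line> {text} <pos_{x}_{y}>")
    return " ".join(parts)

prompt = build_layout_prompt(
    "a poster of a music festival",
    [("SUMMER BEATS", 12, 5), ("June 21, Riverside Park", 10, 40)],
)
print(prompt)
```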
Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models
Large language models exhibit enhanced zero-shot performance on various tasks
when fine-tuned with instruction-following data. Multimodal
instruction-following models extend these capabilities by integrating both text
and images. However, existing models such as MiniGPT-4 face challenges in
maintaining dialogue coherence in scenarios involving multiple images. A
primary reason is the lack of a specialized dataset for this critical
application. To bridge these gaps, we present SparklesChat, a multimodal
instruction-following model for open-ended dialogues across multiple images. To
support the training, we introduce SparklesDialogue, the first
machine-generated dialogue dataset tailored for word-level interleaved
multi-image and text interactions. Furthermore, we construct SparklesEval, a
GPT-assisted benchmark for quantitatively assessing a model's conversational
competence across multiple images and dialogue turns. Our experiments validate
the effectiveness of SparklesChat in understanding and reasoning across
multiple images and dialogue turns. Specifically, SparklesChat outperformed
MiniGPT-4 on established vision-and-language benchmarks, including the BISON
binary image selection task and the NLVR2 visual reasoning task. Moreover,
SparklesChat scored 8.56 out of 10 on SparklesEval, substantially exceeding
MiniGPT-4's score of 3.91 and nearing GPT-4's score of 9.26. Qualitative
evaluations further demonstrate SparklesChat's generality in handling
real-world applications. All resources will be available at
https://github.com/HYPJUDY/Sparkles
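For intuition about what word-level interleaved multi-image dialogue data looks like, the record below interleaves image references with words inside each message. The field names and placeholder syntax are assumptions for illustration, not the actual SparklesDialogue schema.

```python
# Minimal sketch of an interleaved multi-image dialogue record; the schema
# shown here is hypothetical.
import json

record = {
    "dialogue": [
        {
            "role": "user",
            # Image references are interleaved with words rather than
            # placed only at the start of the message.
            "content": "Compare the mood of IMAGE#1 <img1> with IMAGE#2 <img2>.",
            "images": {"<img1>": "images/0001.jpg", "<img2>": "images/0002.jpg"},
        },
        {
            "role": "assistant",
            "content": "IMAGE#1 <img1> feels calm, while IMAGE#2 <img2> is energetic.",
            "images": {"<img1>": "images/0001.jpg", "<img2>": "images/0002.jpg"},
        },
    ]
}
print(json.dumps(record, indent=2))
```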
Kosmos-2.5: A Multimodal Literate Model
We present Kosmos-2.5, a multimodal literate model for machine reading of
text-intensive images. Pre-trained on large-scale text-intensive images,
Kosmos-2.5 excels in two distinct yet cooperative transcription tasks: (1)
generating spatially-aware text blocks, where each block of text is assigned
its spatial coordinates within the image, and (2) producing structured text
output that captures styles and structures into the markdown format. This
unified multimodal literate capability is achieved through a shared Transformer
architecture, task-specific prompts, and flexible text representations. We
evaluate Kosmos-2.5 on end-to-end document-level text recognition and
image-to-markdown text generation. Furthermore, the model can be readily
adapted for any text-intensive image understanding task with different prompts
through supervised fine-tuning, making it a general-purpose tool for real-world
applications involving text-rich images. This work also paves the way for the
future scaling of multimodal large language models.
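The sketch below illustrates how a single model could be steered between the two transcription tasks by task-specific prompts: one producing spatially-aware text blocks with coordinates, the other producing markdown. The prompt strings and output formats are assumptions for illustration, not Kosmos-2.5's actual interface.

```python
# Minimal sketch of selecting one of two transcription tasks via a
# task-specific prompt; prompts and output formats are hypothetical.
def transcribe(image_path, task):
    # The image is not actually loaded in this sketch; image_path is a placeholder.
    if task == "ocr":
        prompt = "<ocr>"  # hypothetical prompt: spatially-aware text blocks
        # Expected output style: one block per line with bounding-box coordinates.
        expected = "<bbox x0=12 y0=30 x1=480 y1=58> Quarterly Report </bbox>"
    elif task == "markdown":
        prompt = "<md>"   # hypothetical prompt: structured markdown output
        expected = "# Quarterly Report\n\n| Region | Revenue |\n|---|---|"
    else:
        raise ValueError(f"unknown task: {task}")
    return prompt, expected

for task in ("ocr", "markdown"):
    prompt, expected = transcribe("report_page.png", task)
    print(task, "->", prompt)
    print(expected)
```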