80 research outputs found

Staying afloat amidst extreme uncertainty: A case study of digital transformation in Higher Education

Author: Antonopoulou Katerina
Begkos Christos
Zhu Zichen
Publication venue: 'Elsevier BV'
Publication date: 01/07/2023

University of Liverpool Repository

The University of Manchester - Institutional Repository

Hierarchical Multimodal Pre-training for Visually Rich Webpage Understanding

Author: Cao Ruisheng
Chen Lu
Ma Da
Xu Hongshen
Yu Kai
Zhao Zihan
Zhu Zichen
Publication date: 28/02/2024

The growing prevalence of visually rich documents, such as webpages and scanned/digital-born documents (images, PDFs, etc.), has led to increased interest in automatic document understanding and information extraction across academia and industry. Although various document modalities, including image, text, layout, and structure, facilitate human information retrieval, the interconnected nature of these modalities presents challenges for neural networks. In this paper, we introduce WebLM, a multimodal pre-training network designed to address the limitations of solely modeling text and structure modalities of HTML in webpages. Instead of processing document images as unified natural images, WebLM integrates the hierarchical structure of document images to enhance the understanding of markup-language-based documents. Additionally, we propose several pre-training tasks to model the interaction among text, structure, and image modalities effectively. Empirical results demonstrate that the pre-trained WebLM significantly surpasses previous state-of-the-art pre-trained models across several webpage understanding tasks. The pre-trained models and code are available at https://github.com/X-LANCE/weblm

arXiv.org e-Print Archive

ChemDFM: Dialogue Foundation Model for Chemistry

Author: Chen Lu
Chen Xin
Fan Shuai
Li Zihao
Ma Da
Shen Guodong
Sun Liangtai
Xu Hongshen
Yu Kai
Zhao Zihan
Zhu Su
Zhu Zichen
Publication date: 26/01/2024

Large language models (LLMs) have established great success in the general domain of natural language processing. Their emerging task generalization and free-form dialogue capabilities can greatly help to design Chemical General Intelligence (CGI) to assist real-world research in chemistry. However, the existence of specialized language and knowledge in the field of chemistry, such as the highly informative SMILES notation, hinders the performance of general-domain LLMs in chemistry. To this end, we develop ChemDFM, the first LLM towards CGI. ChemDFM-13B is trained on 34B tokens from chemical literature, textbooks, and instructions as well as various data from the general domain. Therefore, it can store, understand, and reason over chemical knowledge and languages while still possessing advanced free-form language comprehension capabilities. Extensive quantitative evaluation shows that ChemDFM can significantly outperform the representative open-sourced LLMs. Moreover, ChemDFM can also surpass GPT-4 on a great portion of chemical tasks, despite the significant size difference. Further qualitative evaluations demonstrate the efficiency and effectiveness of ChemDFM in real-world research scenarios. We will open-source the ChemDFM model soon.

Comment: 10 pages, 12 figures, 13 tables. Under Review

arXiv.org e-Print Archive

VIMA: General Robot Manipulation with Multimodal Prompts

Author: Anandkumar Anima
Chen Yanjun
Dou Yongqiang
Fan Linxi
Fei-Fei Li
Gupta Agrim
Jiang Yunfan
Wang Guanzhi
Zhang Zichen
Zhu Yuke
Publication date: 28/05/2023

Prompt-based learning has emerged as a successful paradigm in natural language processing, where a single general-purpose language model can be instructed to perform any task specified by input prompts. Yet task specification in robotics comes in various forms, such as imitating one-shot demonstrations, following language instructions, and reaching visual goals. They are often considered different tasks and tackled by specialized models. We show that a wide spectrum of robot manipulation tasks can be expressed with multimodal prompts, interleaving textual and visual tokens. Accordingly, we develop a new simulation benchmark that consists of thousands of procedurally-generated tabletop tasks with multimodal prompts, 600K+ expert trajectories for imitation learning, and a four-level evaluation protocol for systematic generalization. We design a transformer-based robot agent, VIMA, that processes these prompts and outputs motor actions autoregressively. VIMA features a recipe that achieves strong model scalability and data efficiency. It outperforms alternative designs in the hardest zero-shot generalization setting by up to $2.9\times$ task success rate given the same training data. With $10\times$ less training data, VIMA still performs $2.7\times$ better than the best competing variant. Code and video demos are available at https://vimalabs.github.io/

Comment: ICML 2023 Camera-ready version. Project website: https://vimalabs.github.io

arXiv.org e-Print Archive

PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts

Author: Chen Hao
Gong Neil Zhenqiang
Wang Jindong
Wang Yidong
Wang Zichen
Xie Xing
Yang Linyi
Ye Wei
Zhang Yue
Zhou Jiaheng
Zhu Kaijie
Publication date: 13/06/2023

The increasing reliance on Large Language Models (LLMs) across academia and industry necessitates a comprehensive understanding of their robustness to prompts. In response to this vital need, we introduce PromptBench, a robustness benchmark designed to measure LLMs' resilience to adversarial prompts. This study uses a plethora of adversarial textual attacks targeting prompts across multiple levels: character, word, sentence, and semantic. These prompts are then employed in diverse tasks, such as sentiment analysis, natural language inference, reading comprehension, machine translation, and math problem-solving. Our study generates 4,032 adversarial prompts, meticulously evaluated over 8 tasks and 13 datasets, with 567,084 test samples in total. Our findings demonstrate that contemporary LLMs are vulnerable to adversarial prompts. Furthermore, we present comprehensive analysis to understand the mystery behind prompt robustness and its transferability. We then offer insightful robustness analysis and pragmatic recommendations for prompt composition, beneficial to both researchers and everyday users. We make our code, prompts, and methodologies to generate adversarial prompts publicly accessible, thereby enabling and encouraging collaborative exploration in this pivotal field: https://github.com/microsoft/promptbench.

Comment: Technical report; 23 pages; code is at: https://github.com/microsoft/promptbenc

arXiv.org e-Print Archive

MULTI: Multimodal Understanding Leaderboard with Text and Images

Author: Cai Jinyu
Chen Lu
Liu Jiaqi
Ma Yichuan
Ma Yingzi
Sun Liangtai
Sun Yiming
Wen Hailin
Xu Yang
Yang Jingkai
Yu Kai
Zhang Situo
Zhao Zihan
Zhu Zichen
Publication date: 20/02/2024

Rapid progress in multimodal large language models (MLLMs) highlights the need to introduce challenging yet realistic benchmarks to the academic community, while existing benchmarks primarily focus on understanding simple natural images and short context. In this paper, we present MULTI as a cutting-edge benchmark for evaluating MLLMs on understanding complex tables and images, and reasoning with long context. MULTI provides multimodal inputs and requires responses that are either precise or open-ended, reflecting real-life examination styles. MULTI includes over 18,000 questions and challenges MLLMs with a variety of tasks, ranging from formula derivation to image detail analysis and cross-modality reasoning. We also introduce MULTI-Elite, a 500-question selected hard subset, and MULTI-Extend, with more than 4,500 external knowledge context pieces. Our evaluation indicates significant potential for MLLM advancement, with GPT-4V achieving a 63.7% accuracy rate on MULTI, in contrast to other MLLMs scoring between 28.5% and 55.3%. MULTI serves not only as a robust evaluation platform but also paves the way for the development of expert-level AI.

Comment: 16 pages, 9 figures, 10 tables. Details and access are available at: https://OpenDFM.github.io/MULTI-Benchmark

arXiv.org e-Print Archive