
    COCO is "ALL" You Need for Visual Instruction Fine-tuning

    Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence. Visual instruction fine-tuning (IFT) is a vital process for aligning MLLMs' output with users' intentions. High-quality and diversified instruction-following data are the key to this fine-tuning process. Recent studies propose constructing visual IFT datasets through a multifaceted approach: transforming existing datasets with rule-based templates, employing GPT-4 to rewrite annotations, and utilizing GPT-4V for visual dataset pseudo-labeling. LLaVA-1.5 adopted a similar approach to construct LLaVA-mix-665k, one of the simplest, most widely used, and most effective IFT datasets today. Notably, when properly fine-tuned with this dataset, MLLMs can achieve state-of-the-art performance on several benchmarks. However, we observed that models trained with this dataset often struggle to follow user instructions properly in multi-round dialog. In addition, traditional captioning and VQA evaluation benchmarks, with their closed-form evaluation structure, are not fully equipped to assess the capabilities of modern open-ended generative MLLMs. This problem is not unique to LLaVA-mix-665k but is a potential issue in all IFT datasets constructed from image-captioning or VQA sources, though its extent may vary. We argue that datasets with diverse and high-quality detailed instruction-following annotations are both essential and sufficient for MLLM IFT. In this work, we establish a new IFT dataset, with images sourced from the COCO dataset and more diverse instructions. Our experiments show that, when fine-tuned with our proposed dataset, MLLMs achieve better performance on open-ended evaluation benchmarks in both single-round and multi-round dialog settings.
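    As a concrete illustration of the rule-based template approach mentioned in the abstract, the sketch below wraps a single COCO caption annotation in a chat-style instruction-following record. The template list, the function name build_ift_sample, and the file-path layout are hypothetical; they only indicate the general shape of such a pipeline, not the paper's or LLaVA's actual construction code.

        import json
        import random

        # Illustrative rule-based templates for turning a COCO caption into an
        # instruction; real IFT pipelines typically use larger, curated template sets.
        CAPTION_TEMPLATES = [
            "Describe the image in detail.",
            "What is happening in this picture?",
            "Give a short description of the scene.",
        ]

        def build_ift_sample(image_id: str, caption: str) -> dict:
            """Wrap one COCO caption annotation in a chat-style IFT record."""
            instruction = random.choice(CAPTION_TEMPLATES)
            return {
                "image": f"coco/train2017/{image_id}.jpg",  # hypothetical path layout
                "conversations": [
                    {"from": "human", "value": f"<image>\n{instruction}"},
                    {"from": "gpt", "value": caption},
                ],
            }

        if __name__ == "__main__":
            sample = build_ift_sample("000000000139", "A living room with a television and two chairs.")
            print(json.dumps(sample, indent=2))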

    HallE-Control: Controlling Object Hallucination in Large Multimodal Models

    Current Large Multimodal Models (LMMs) achieve remarkable progress, yet there remains significant uncertainty regarding their ability to accurately apprehend visual details, that is, to perform detailed captioning. To address this, we introduce CCEval, a GPT-4 assisted evaluation method for detailed captioning. Interestingly, while LMMs demonstrate minimal object existence hallucination in existing VQA benchmarks, our proposed evaluation reveals continued susceptibility to such hallucinations. In this paper, we make the first attempt to investigate such hallucination from different aspects, including image resolution, language decoder size, and instruction data amount, quality, and granularity. Our findings underscore the unwarranted inference that arises when the language description includes details at a finer object granularity than what the vision module can ground or verify, thus inducing hallucination. To control such hallucinations, we further attribute the reliability of captioning to contextual knowledge (involving only contextually grounded objects) and parametric knowledge (containing objects inferred by the model). Thus, we introduce HallE-Control, a controllable LMM in terms of Hallucination in object Existence. HallE-Control can condition the captioning to shift between (i) exclusively depicting contextual knowledge for grounded objects and (ii) blending it with parametric knowledge to imagine inferred objects. Our method reduces hallucination by 44% compared to LLaVA-7B and maintains the object coverage. Comment: Our code is publicly available at https://github.com/bronyayang/HallE_Contro
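    The core check behind a CCEval-style evaluation can be pictured as comparing the objects mentioned in a generated detailed caption against a reference list of grounded objects (e.g., from COCO instance annotations). The sketch below is a deliberately simplified stand-in: CCEval itself uses GPT-4 to extract and judge objects, whereas the naive substring matcher and the function names here are assumptions for illustration only.

        from typing import Iterable, Set

        def extract_mentioned_objects(caption: str, vocabulary: Iterable[str]) -> Set[str]:
            """Naive extractor: flags vocabulary words that appear in the caption.
            CCEval relies on GPT-4 for this step; this matcher is only a stand-in."""
            text = caption.lower()
            return {obj for obj in vocabulary if obj.lower() in text}

        def hallucination_rate(caption: str, grounded: Set[str], vocabulary: Iterable[str]) -> float:
            """Fraction of mentioned objects that are absent from the grounded (annotated) set."""
            mentioned = extract_mentioned_objects(caption, vocabulary)
            if not mentioned:
                return 0.0
            return len(mentioned - grounded) / len(mentioned)

        if __name__ == "__main__":
            vocab = ["dog", "frisbee", "car", "tree"]
            caption = "A dog jumps to catch a frisbee next to a parked car."
            grounded = {"dog", "frisbee"}  # e.g., from COCO instance annotations
            print(f"hallucination rate: {hallucination_rate(caption, grounded, vocab):.2f}")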

    Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning

    Strong Artificial Intelligence (Strong AI) or Artificial General Intelligence (AGI) with abstract reasoning ability is the goal of next-generation AI. Recent advancements in Large Language Models (LLMs), along with the emerging field of Multimodal Large Language Models (MLLMs), have demonstrated impressive capabilities across a wide range of multimodal tasks and applications. In particular, various MLLMs, each with distinct model architectures, training data, and training stages, have been evaluated across a broad range of MLLM benchmarks. These studies have, to varying degrees, revealed different aspects of the current capabilities of MLLMs. However, the reasoning abilities of MLLMs have not been systematically investigated. In this survey, we comprehensively review the existing evaluation protocols of multimodal reasoning, categorize and illustrate the frontiers of MLLMs, introduce recent trends in applying MLLMs to reasoning-intensive tasks, and finally discuss current practices and future directions. We believe our survey establishes a solid base and sheds light on this important topic of multimodal reasoning.

    Comprehensive characterization of ERV-K (HML-8) in the chimpanzee genome revealed less genomic activity than humans

    Endogenous retroviruses (ERVs) originate from ancestral germline infections caused by exogenous retroviruses. Throughout evolution, they have become fixed within the genome of the animals into which they were integrated. As ERV elements coevolve with the host, they are normally epigenetically silenced and can become upregulated in a series of physiological and pathological processes. Generally, a detailed ERV profile in the host genome is critical for understanding the evolutionary history and functional performance of the host genome. We previously characterized and cataloged all the ERV-K subtype HML-8 loci in the human genome; however, this has not been done for the chimpanzee, the nearest living relative of humans. In this study, we aimed to catalog and characterize the integration of HML-8 in the chimpanzee genome and compare it with that in the human genome. We analyzed the integration of HML-8 and found that it pervasively invaded the chimpanzee genome. A total of 76 proviral elements were characterized on 23 of 24 chromosomes, including their detailed distribution, structure, phylogeny, integration time, and potential to regulate adjacent genes. The incomplete structure of HML-8 proviral LTRs will undoubtedly affect their activity. Moreover, the results indicated that HML-8 integration occurred before the divergence between humans and chimpanzees. Furthermore, chimpanzees harbor more HML-8 proviral elements (76 vs. 40) and fewer solo long terminal repeats (LTRs) (0 vs. 5) than humans. These results suggest that HML-8 activity in the chimpanzee genome is lower than in the human genome and that humans may have a better ability to shape and screen integrated proviral elements. Our work is informative in both an evolutionary and a functional context for ERVs.

    NTIRE 2024 Quality Assessment of AI-Generated Content Challenge

    This paper reports on the NTIRE 2024 Quality Assessment of AI-Generated Content Challenge, held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2024. The challenge addresses a major problem in the field of image and video processing, namely Image Quality Assessment (IQA) and Video Quality Assessment (VQA) for AI-Generated Content (AIGC). The challenge is divided into an image track and a video track. The image track uses the AIGIQA-20K dataset, which contains 20,000 AI-Generated Images (AIGIs) produced by 15 popular generative models. The image track had a total of 318 registered participants; 1,646 submissions were received in the development phase and 221 in the test phase, and 16 participating teams ultimately submitted their models and fact sheets. The video track uses the T2VQA-DB dataset, which contains 10,000 AI-Generated Videos (AIGVs) produced by 9 popular Text-to-Video (T2V) models. A total of 196 participants registered in the video track; 991 submissions were received in the development phase and 185 in the test phase, and 12 participating teams ultimately submitted their models and fact sheets. Some methods achieved better results than the baseline methods, and the winning methods in both tracks demonstrated superior prediction performance on AIGC.
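    Quality-assessment challenges of this kind typically rank submissions by how well predicted scores correlate with human mean opinion scores (MOS). The sketch below computes the two correlations most commonly reported in IQA/VQA work, PLCC and SRCC, using SciPy; it is a generic illustration and is not taken from the challenge's official evaluation code, whose exact metric combination may differ.

        import numpy as np
        from scipy.stats import pearsonr, spearmanr

        def iqa_correlations(predicted: np.ndarray, mos: np.ndarray) -> dict:
            """Pearson (PLCC) and Spearman (SRCC) correlations between
            predicted quality scores and human mean opinion scores."""
            plcc, _ = pearsonr(predicted, mos)
            srcc, _ = spearmanr(predicted, mos)
            return {"PLCC": plcc, "SRCC": srcc}

        if __name__ == "__main__":
            mos = np.array([3.1, 4.5, 2.0, 3.8, 4.9])   # ground-truth opinion scores
            pred = np.array([3.0, 4.2, 2.4, 3.5, 4.7])  # a model's predicted scores
            print(iqa_correlations(pred, mos))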

    RCM: A Remote Cache Management Framework for Spark

    With the rapid growth of Internet data, the performance of big data processing platforms is attracting more and more attention. In Spark, cached data are evicted by the Least Recently Used (LRU) algorithm. LRU cannot identify the cost of cached data, which leads to the replacement of some important cache entries. In addition, the placement of cache data is random, with no mechanism for finding efficient cache servers. To address these problems, a remote cache management framework (RCM) for the Spark platform is proposed, consisting of a cache weight generation module (CWG), a cache replacement module (CREP), and a cache placement module (CPL). CWG establishes initial weights from three main factors: the response time of the query database, the number of queries, and the data size; it then reduces the weight of old data through a time-loss function. CREP ensures that the sum of cache data weights is maximized using a greedy strategy. CPL allocates the best cache server for each dataset based on the Kuhn-Munkres matching algorithm to improve cooperation efficiency. To verify the effectiveness of RCM, it was implemented on Redis and deployed on eight computing nodes and four cache servers. Three groups of benchmark jobs (PageRank, K-means, and WordCount) were tested. The experimental results confirm that, compared with MCM, SACM, and DMAOM, RCM reduces execution time by up to 42.1%.
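    Two of the mechanisms described above lend themselves to a short sketch: a time-decayed cache weight combining query response time, query count, and data size, and cache placement cast as an assignment problem solved with the Kuhn-Munkres (Hungarian) algorithm, here via SciPy's linear_sum_assignment. The weight formula, the half-life decay, and the cost matrix are assumptions for illustration and are not RCM's actual equations.

        import math
        import numpy as np
        from scipy.optimize import linear_sum_assignment

        def cache_weight(resp_time: float, query_count: int, size_mb: float,
                         age_s: float, half_life_s: float = 3600.0) -> float:
            """Illustrative weight: recomputation cost times popularity per MB,
            decayed over time so that stale cache data loses priority."""
            decay = math.exp(-math.log(2) * age_s / half_life_s)  # assumed time-loss function
            return (resp_time * query_count / size_mb) * decay

        def place_on_servers(cost: np.ndarray) -> list:
            """Assign each cached dataset (row) to a cache server (column) so the
            total cost is minimized, using the Kuhn-Munkres / Hungarian algorithm."""
            rows, cols = linear_sum_assignment(cost)
            return list(zip(rows.tolist(), cols.tolist()))

        if __name__ == "__main__":
            print(cache_weight(resp_time=2.5, query_count=40, size_mb=128, age_s=1800))
            # cost[i][j]: e.g., expected access latency of dataset i if placed on server j
            cost = np.array([[4.0, 2.0, 7.0],
                             [3.0, 6.0, 1.0],
                             [5.0, 8.0, 3.0]])
            print(place_on_servers(cost))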
