COCO is "ALL'' You Need for Visual Instruction Fine-tuning
Multi-modal Large Language Models (MLLMs) are increasingly prominent in the
field of artificial intelligence. Visual instruction fine-tuning (IFT) is a
vital process for aligning MLLMs' outputs with users' intentions. High-quality
and diversified instruction following data is the key to this fine-tuning
process. Recent studies propose to construct visual IFT datasets through a
multifaceted approach: transforming existing datasets with rule-based
templates, employing GPT-4 for rewriting annotations, and utilizing GPT-4V for
visual dataset pseudo-labeling. LLaVA-1.5 adopted a similar approach to
construct LLaVA-mix-665k, which is one of the simplest, most widely used, yet
most effective IFT datasets today. Notably, when properly fine-tuned with this
dataset, MLLMs can achieve state-of-the-art performance on several benchmarks.
However, we noticed that models trained with this dataset often struggle to
follow user instructions properly in multi-round dialogs. In addition,
traditional caption and VQA evaluation benchmarks, with their closed-form evaluation
structure, are not fully equipped to assess the capabilities of modern
open-ended generative MLLMs. This problem is not unique to the LLaVA-mix-665k
dataset, but may be a potential issue in all IFT datasets constructed from
image captioning or VQA sources, though the extent of this issue may vary. We
argue that datasets with diverse and high-quality detailed instruction
following annotations are essential and sufficient for MLLM IFT. In this work,
we establish a new IFT dataset, with images sourced from the COCO dataset along
with more diverse instructions. Our experiments show that when fine-tuned with
our proposed dataset, MLLMs achieve better performance on open-ended evaluation
benchmarks in both single-round and multi-round dialog settings.
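As a concrete illustration of the rule-based template transformation described above, the following Python sketch converts plain caption and VQA records into the single-round conversation format commonly used for visual IFT. It is a minimal sketch: the field names, templates, and record layout are illustrative assumptions, not the authors' actual pipeline.

```python
import random

# Illustrative instruction templates; the actual rule-based templates used to
# build visual IFT datasets such as LLaVA-mix-665k are not specified here.
CAPTION_TEMPLATES = [
    "Describe this image in detail.",
    "What is happening in this picture?",
]

def caption_to_ift(sample: dict) -> dict:
    """Turn a plain image-caption pair into a single-round IFT record."""
    return {
        "image": sample["image"],  # path or id of the source image
        "conversations": [
            {"from": "human", "value": random.choice(CAPTION_TEMPLATES)},
            {"from": "gpt", "value": sample["caption"]},
        ],
    }

def vqa_to_ift(sample: dict) -> dict:
    """Turn a closed-form VQA triple into the same conversation format."""
    return {
        "image": sample["image"],
        "conversations": [
            {"from": "human", "value": sample["question"]},
            {"from": "gpt", "value": sample["answer"]},
        ],
    }

# Example: one COCO-style caption record becomes one IFT conversation.
record = caption_to_ift({"image": "coco/000000000001.jpg",
                         "caption": "A dog chasing a frisbee on a beach."})
print(record)
```

The multi-round dialogs emphasized in this work would extend the "conversations" list with further human/gpt turns rather than stopping after a single exchange.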
HallE-Control: Controlling Object Hallucination in Large Multimodal Models
Current Large Multimodal Models (LMMs) achieve remarkable progress, yet there
remains significant uncertainty regarding their ability to accurately apprehend
visual details, that is, in performing detailed captioning. To address this, we
introduce CCEval, a GPT-4 assisted evaluation method for detailed
captioning. Interestingly, while LMMs demonstrate minimal object existence
hallucination in existing VQA benchmarks, our proposed evaluation reveals
continued susceptibility to such hallucinations. In this paper, we make the
first attempt to investigate such hallucination from different aspects,
including image resolution, the language decoder size, and instruction data
amount, quality, and granularity. Our findings underscore that unwarranted
inference arises when the language description includes details at a finer
object granularity than the vision module can ground or verify, thus inducing
hallucination.
To control such hallucinations, we further attribute the reliability of
captioning to contextual knowledge (involving only contextually grounded
objects) and parametric knowledge (containing objects inferred by the model).
Thus, we introduce HallE-Control, a controllable LMM in terms of
hallucination in object existence. HallE-Control can
condition the captioning to shift between (i) exclusively depicting contextual
knowledge for grounded objects and (ii) blending it with parametric knowledge
to imagine inferred objects. Our method reduces hallucination by 44% compared
to LLaVA and maintains the object coverage.
Comment: Our code is publicly available at https://github.com/bronyayang/HallE_Control
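To make the controllability claim concrete, below is a hedged Python sketch of what a caption-control interface could look like. The real HallE-Control conditions generation internally through learned parameters; the scalar control value in [-1, 1] and the instruction text here are illustrative assumptions only.

```python
# Hypothetical prompt-level illustration of HallE-Control-style conditioning:
# -1 selects grounded-only captioning, +1 permits imagined (inferred) objects.

def build_caption_prompt(control: float) -> str:
    """control in [-1, 1]: -1 = contextual knowledge only (grounded objects),
    +1 = blend in parametric knowledge (objects the model infers)."""
    assert -1.0 <= control <= 1.0, "control must lie in [-1, 1]"
    if control <= 0:
        style = ("Describe only objects that are clearly visible in the image. "
                 "Do not mention anything you cannot verify.")
    else:
        style = ("Describe the scene, and you may also mention likely objects "
                 "beyond those directly visible, marking them as inferred.")
    return f"[control={control:+.1f}] {style}"

print(build_caption_prompt(-1.0))  # grounded-only captioning
print(build_caption_prompt(+1.0))  # imaginative captioning
```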
Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning
Strong Artificial Intelligence (Strong AI) or Artificial General Intelligence
(AGI) with abstract reasoning ability is the goal of next-generation AI. Recent
advancements in Large Language Models (LLMs), along with the emerging field of
Multimodal Large Language Models (MLLMs), have demonstrated impressive
capabilities across a wide range of multimodal tasks and applications.
Particularly, various MLLMs, each with distinct model architectures, training
data, and training stages, have been evaluated across a broad range of MLLM
benchmarks. These studies have, to varying degrees, revealed different aspects
of the current capabilities of MLLMs. However, the reasoning abilities of MLLMs
have not been systematically investigated. In this survey, we comprehensively
review the existing evaluation protocols of multimodal reasoning, categorize
and illustrate the frontiers of MLLMs, introduce recent trends in applications
of MLLMs on reasoning-intensive tasks, and finally discuss current practices
and future directions. We believe our survey establishes a solid base and sheds
light on this important topic of multimodal reasoning.
Comprehensive characterization of ERV-K (HML-8) in the chimpanzee genome revealed less genomic activity than humans
Endogenous retroviruses (ERVs) originate from ancestral germline infections caused by exogenous retroviruses. Throughout evolution, they have become fixed within the genomes of the animals into which they were integrated. As ERV elements coevolve with the host, they are normally epigenetically silenced and can become upregulated in a range of physiological and pathological processes. Generally, a detailed ERV profile in the host genome is critical for understanding the evolutionary history and functional performance of the host genome. We previously characterized and cataloged all the ERV-K subtype HML-8 loci in the human genome; however, this has not been done for the chimpanzee, the nearest living relative of humans. In this study, we aimed to catalog and characterize the integration of HML-8 in the chimpanzee genome and compare it with that in the human genome. We analyzed the integration of HML-8 and found that HML-8 pervasively invaded the chimpanzee genome. A total of 76 proviral elements were characterized on 23 of 24 chromosomes, including their detailed distribution, structure, phylogeny, integration time, and potential to regulate adjacent genes. The incomplete structure of the HML-8 proviral LTRs will undoubtedly affect their activity. Moreover, the results indicated that HML-8 integration occurred before the divergence between humans and chimpanzees. Furthermore, the chimpanzee genome contains more HML-8 proviral elements (76 vs. 40) and fewer solo long terminal repeats (LTRs) (0 vs. 5) than the human genome. These results suggest that HML-8 activity in the chimpanzee genome is lower than in the human genome and that humans may have a better ability to shape and screen integrated proviral elements. Our work is informative in both an evolutionary and a functional context for ERVs.
NTIRE 2024 Quality Assessment of AI-Generated Content Challenge
This paper reports on the NTIRE 2024 Quality Assessment of AI-Generated
Content Challenge, which was held in conjunction with the New Trends in
Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2024. This challenge
addresses a major problem in the field of image and video processing,
namely, Image Quality Assessment (IQA) and Video Quality Assessment (VQA) for
AI-Generated Content (AIGC). The challenge is divided into the image track and
the video track. The image track uses the AIGIQA-20K dataset, which contains 20,000
AI-Generated Images (AIGIs) generated by 15 popular generative models. The
image track had a total of 318 registered participants. A total of 1,646
submissions were received in the development phase, and 221 submissions were
received in the test phase. Finally, 16 participating teams submitted their
models and fact sheets. The video track uses the T2VQA-DB dataset, which contains
10,000 AI-Generated Videos (AIGVs) generated by 9 popular Text-to-Video (T2V)
models. A total of 196 participants registered in the video track. A total
of 991 submissions were received in the development phase, and 185 submissions
were received in the test phase. Finally, 12 participating teams submitted their
models and fact sheets. Some methods achieved better results than the baseline
methods, and the winning methods in both tracks demonstrated superior
prediction performance on AIGC quality assessment.
RCM: A Remote Cache Management Framework for Spark
With the rapid growth of Internet data, the performance of big data processing platforms is attracting more and more attention. In Spark, cached data are evicted by the Least Recently Used (LRU) algorithm. LRU cannot identify the cost of cached data, which leads to the eviction of some important cached data. In addition, the placement of cached data is random, and there is no measure for finding efficient cache servers. To address these problems, we propose a remote cache management framework (RCM) for the Spark platform, comprising a cache weight generation module (CWG), a cache replacement module (CREP), and a cache placement module (CPL). CWG establishes initial weights from three main factors: the response time of the query database, the number of queries, and the data size. CWG then reduces the weight of old data through a time-loss function. CREP maximizes the sum of cache data weights with a greedy strategy. CPL allocates the best cache server for each data item based on the Kuhn-Munkres matching algorithm to improve cooperation efficiency. To verify the effectiveness of RCM, we implemented it on Redis and deployed it on eight computing nodes and four cache servers. Three groups of benchmark jobs, PageRank, K-means, and WordCount, were tested. The experimental results confirmed that, compared with MCM, SACM, and DMAOM, RCM reduces execution time by up to 42.1%.
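The abstract names the inputs to CWG and the objectives of CREP and CPL but not their exact formulas. The Python sketch below shows one plausible reading: a multiplicative initial weight, an exponential time-loss decay, and a greedy weight-per-size replacement policy. All three choices are assumptions for illustration, not the published design.

```python
import math

def initial_weight(response_time_s: float, num_queries: int, size_mb: float) -> float:
    """CWG: data that are costly to recompute, frequently queried, and large
    receive a higher initial weight (multiplicative combination assumed)."""
    return response_time_s * num_queries * math.log1p(size_mb)

def decayed_weight(weight: float, age_s: float, half_life_s: float = 600.0) -> float:
    """CWG time-loss function: older data lose weight (exponential decay assumed)."""
    return weight * 0.5 ** (age_s / half_life_s)

def greedy_keep(items: list[dict], capacity_mb: float) -> list[dict]:
    """CREP: greedily keep the items with the best weight-per-megabyte ratio
    until the cache capacity budget is exhausted."""
    ranked = sorted(items, key=lambda it: it["weight"] / it["size_mb"], reverse=True)
    kept, used = [], 0.0
    for it in ranked:
        if used + it["size_mb"] <= capacity_mb:
            kept.append(it)
            used += it["size_mb"]
    return kept

items = [
    {"name": "rdd_a", "size_mb": 100.0,
     "weight": decayed_weight(initial_weight(2.0, 50, 100.0), age_s=300.0)},
    {"name": "rdd_b", "size_mb": 10.0,
     "weight": decayed_weight(initial_weight(0.1, 500, 10.0), age_s=60.0)},
    {"name": "rdd_c", "size_mb": 400.0,
     "weight": decayed_weight(initial_weight(5.0, 5, 400.0), age_s=3600.0)},
]
print([it["name"] for it in greedy_keep(items, capacity_mb=300.0)])
```

The placement step (CPL) could likewise be prototyped with scipy.optimize.linear_sum_assignment, which solves the same assignment problem that the Kuhn-Munkres algorithm addresses.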