Volcano: Mitigating Multimodal Hallucination through Self-Feedback Guided Revision
Large multimodal models (LMMs) suffer from multimodal hallucination, where
they provide incorrect responses misaligned with the given visual information.
Recent works have conjectured that one of the reasons behind multimodal
hallucination might be due to the vision encoder failing to ground on the image
properly. To mitigate this issue, we propose a novel approach that leverages
self-feedback as visual cues. Building on this approach, we introduce Volcano,
a multimodal self-feedback guided revision model. Volcano generates natural
language feedback to its initial response based on the provided visual
information and utilizes this feedback to self-revise its initial response.
Volcano effectively reduces multimodal hallucination and achieves
state-of-the-art on MMHal-Bench, POPE, and GAVIE. It also improves on general
multimodal abilities and outperforms previous models on MM-Vet and MMBench.
Through a qualitative analysis, we show that Volcano's feedback is more properly
grounded on the image than the initial response. This indicates that Volcano
can provide itself with richer visual information, helping alleviate multimodal
hallucination. We publicly release Volcano models of 7B and 13B sizes along
with the data and code at https://github.com/kaistAI/Volcano
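The answer, feedback, and revision steps described in this abstract can be sketched roughly as follows. This is a minimal illustration, not Volcano's actual code or API: the `generate` callable, its `(image, prompt)` signature, and the prompt wording are all assumptions made for the example, and the released models may add further steps that the abstract does not mention.

```python
from typing import Callable

# Hypothetical model interface (an assumption, not Volcano's API):
# takes an image reference and a text prompt, returns generated text.
Generate = Callable[[str, str], str]


def self_feedback_revise(generate: Generate, image: str, question: str) -> str:
    """Answer a question, critique the answer against the image, then revise."""
    # Step 1: initial response to the question.
    initial = generate(image, question)

    # Step 2: natural language feedback on the initial response,
    # produced with the image as context.
    feedback_prompt = (
        f"Question: {question}\n"
        f"Answer: {initial}\n"
        "Give feedback on this answer based on the image."
    )
    feedback = generate(image, feedback_prompt)

    # Step 3: revise the initial response using the feedback.
    revise_prompt = (
        f"Question: {question}\n"
        f"Answer: {initial}\n"
        f"Feedback: {feedback}\n"
        "Revise the answer using the feedback."
    )
    return generate(image, revise_prompt)
```

Any function that maps an image and a prompt to text can be passed as `generate`, so the loop can be tried with a stub before wiring in a real model.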
AgentBench: Evaluating LLMs as Agents
Large Language Models (LLMs) are becoming increasingly smart and autonomous,
targeting real-world pragmatic missions beyond traditional NLP tasks. As a
result, there has been an urgent need to evaluate LLMs as agents on challenging
tasks in interactive environments. We present AgentBench, a multi-dimensional
evolving benchmark that currently consists of 8 distinct environments to assess
LLM-as-Agent's reasoning and decision-making abilities in a multi-turn
open-ended generation setting. Our extensive test over 27 API-based and
open-sourced (OSS) LLMs shows that, while top commercial LLMs present a strong
ability of acting as agents in complex environments, there is a significant
disparity in performance between them and OSS competitors. We identify the
typical reasons for failures in environments and LLMs, showing that poor
long-term reasoning, decision-making, and instruction following abilities are
the main obstacles for developing usable LLM agents. Training on code and high
quality multi-turn alignment data could improve agent performance. Datasets,
environments, and an integrated evaluation package for AgentBench are released
at https://github.com/THUDM/AgentBench.
Comment: 55 pages
Natural Language Commanding via Program Synthesis
We present Semantic Interpreter, a natural language-friendly AI system for
productivity software such as Microsoft Office that leverages large language
models (LLMs) to execute user intent across application features. While LLMs
are excellent at understanding user intent expressed as natural language, they
are not sufficient for fulfilling application-specific user intent that
requires more than text-to-text transformations. We therefore introduce the
Office Domain Specific Language (ODSL), a concise, high-level language
specialized for performing actions in and interacting with entities in Office
applications. Semantic Interpreter leverages an Analysis-Retrieval prompt
construction method with LLMs for program synthesis, translating natural
language user utterances to ODSL programs that can be transpiled to application
APIs and then executed. We focus our discussion primarily on a research
exploration for Microsoft PowerPoint.
Reasoning about Actions over Visual and Linguistic Modalities: A Survey
'Actions' play a vital role in how humans interact with the world and enable
them to achieve desired goals. As a result, most common sense (CS) knowledge
for humans revolves around actions. While 'Reasoning about Actions & Change'
(RAC) has been widely studied in the Knowledge Representation community, it has
recently piqued the interest of NLP and computer vision researchers. This paper
surveys existing tasks, benchmark datasets, various techniques and models, and
their respective performance concerning advancements in RAC in the vision and
language domain. Towards the end, we summarize our key takeaways, discuss the
present challenges facing this research area, and outline potential directions
for future research.
Comment: 7 pages, 3 figures; this survey will be periodically updated with the
latest works in this area