A Hierarchical Framework for Relation Extraction with Reinforcement Learning
Most existing methods determine relation types only after all the entities
have been recognized, so the interaction between relation types and entity
mentions is not fully modeled. This paper presents a novel paradigm for
relation extraction that treats the related entities as the arguments of
a relation. We apply a hierarchical reinforcement learning (HRL) framework in
this paradigm to enhance the interaction between entity mentions and relation
types. The whole extraction process is decomposed into a hierarchy of two-level
RL policies for relation detection and entity extraction respectively, so that
it is more feasible and natural to deal with overlapping relations. Our model
was evaluated on public datasets collected via distant supervision, and results
show that it achieves better performance than existing methods and is more
effective at extracting overlapping relations.
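The decomposition is easiest to see as control flow: a high-level policy scans the sentence and, whenever it detects a relation, hands control to a low-level policy that extracts that relation's entity arguments before the scan resumes. Below is a minimal Python sketch of that two-level loop. The relation inventory, tag set, and random policy stubs are illustrative stand-ins, not the authors' implementation; in the actual model both policies are learned with reinforcement learning.

import random

RELATIONS = ["NONE", "founder_of", "capital_of"]   # toy relation inventory
ENTITY_TAGS = ["O", "HEAD", "TAIL"]                # argument tags for the low level

class HighLevelPolicy:
    """Scans the sentence and decides, token by token, whether a relation
    is triggered. A random stub stands in for the learned policy."""
    def act(self, token):
        return random.choice(RELATIONS)

class LowLevelPolicy:
    """Once a relation is triggered, tags tokens as the head/tail
    arguments of that relation. Again a random stub."""
    def act(self, token, relation):
        return random.choice(ENTITY_TAGS)

def extract(sentence):
    """One two-level episode: each relation the high-level policy detects
    launches a low-level sub-episode for entity extraction. Because every
    detection gets its own sub-episode, two triples can share entity
    mentions, which is how overlapping relations are accommodated."""
    high, low = HighLevelPolicy(), LowLevelPolicy()
    triples = []
    for token in sentence:
        relation = high.act(token)
        if relation == "NONE":
            continue
        head, tail = [], []
        for t in sentence:                     # low-level sub-episode
            tag = low.act(t, relation)
            if tag == "HEAD":
                head.append(t)
            elif tag == "TAIL":
                tail.append(t)
        if head and tail:
            triples.append((" ".join(head), relation, " ".join(tail)))
    return triples

if __name__ == "__main__":
    random.seed(0)
    print(extract("Paris is the capital of France".split()))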
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
Large language models have demonstrated impressive universal capabilities
across a wide range of open-ended tasks and have extended their utility to
encompass multimodal conversations. However, existing methods encounter
challenges in effectively handling both image and video understanding,
particularly with limited visual tokens. In this work, we introduce Chat-UniVi,
a unified vision-language model capable of comprehending and engaging in
conversations involving images and videos through a unified visual
representation. Specifically, we employ a set of dynamic visual tokens to
uniformly represent images and videos. This representation framework empowers
the model to efficiently utilize a limited number of visual tokens to
simultaneously capture the spatial details necessary for images and the
comprehensive temporal relationships required for videos. Moreover, we leverage
a multi-scale representation, enabling the model to perceive both high-level
semantic concepts and low-level visual details. Notably, Chat-UniVi is trained
on a mixed dataset containing both images and videos, allowing direct
application to tasks involving both media without requiring any
modifications. Extensive experimental results demonstrate that Chat-UniVi, as a
unified model, consistently outperforms even existing methods designed
exclusively for either images or videos.
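One way to picture the dynamic visual tokens is as a token-merging step: patch tokens with similar features are fused until a fixed budget remains, so a single image and a multi-frame video both reduce to the same small token set. The sketch below illustrates the principle with a greedy cosine-similarity merge in NumPy; the function name, the averaging rule, and the budget are assumptions made for illustration, and the paper's actual clustering procedure may differ.

import numpy as np

def merge_visual_tokens(tokens, budget):
    """Greedily fuse the two most similar tokens (cosine similarity)
    until only `budget` tokens remain; each merge averages the pair,
    so one token comes to stand for a larger visual region."""
    tokens = [np.asarray(t, dtype=np.float64) for t in tokens]
    while len(tokens) > budget:
        X = np.stack(tokens)
        Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
        sim = Xn @ Xn.T
        np.fill_diagonal(sim, -np.inf)      # never merge a token with itself
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        merged = (tokens[i] + tokens[j]) / 2.0
        tokens = [t for k, t in enumerate(tokens) if k not in (i, j)]
        tokens.append(merged)
    return np.stack(tokens)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image_patches = rng.normal(size=(64, 32))       # one frame, 8x8 patch grid
    video_patches = rng.normal(size=(4 * 64, 32))   # 4 frames, same patch grid
    # The same routine gives images and videos an identical token budget.
    print(merge_visual_tokens(image_patches, budget=16).shape)   # (16, 32)
    print(merge_visual_tokens(video_patches, budget=16).shape)   # (16, 32)

Because merging operates on token features rather than on a fixed spatial or temporal grid, the same routine applies unchanged whether the input patches come from one image or from many video frames, which is the property the abstract highlights.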