MPMQA: Multimodal Question Answering on Product Manuals
Visual contents, such as illustrations and images, play an important role in product manual understanding. Existing Product Manual Question Answering (PMQA) datasets tend to ignore visual contents and retain only the textual parts. In this work, to emphasize the importance of multimodal contents, we propose a Multimodal Product Manual Question Answering (MPMQA) task. For each question, MPMQA requires the model not only to process multimodal contents but also to provide multimodal answers. To support MPMQA, we construct a large-scale dataset, PM209, with human annotations; it contains 209 product manuals from 27 well-known consumer electronics brands. The annotations include 6 types of semantic regions for manual contents and 22,021 question-answer pairs. Notably, each answer consists of a textual sentence and related visual regions from the manuals. Given the length of product manuals and the fact that a question typically relates to only a few pages, MPMQA naturally splits into two subtasks: retrieving the most relevant pages and then generating multimodal answers. We further propose a unified model that performs both subtasks jointly and achieves performance comparable to multiple task-specific models. The PM209 dataset is available at https://github.com/AIM3-RUC/MPMQA
Explore and Tell: Embodied Visual Captioning in 3D Environments
While current visual captioning models have achieved impressive performance, they often assume that the image is well-captured and provides a complete view of the scene. In real-world scenarios, however, a single image may not offer a good viewpoint, hindering fine-grained scene understanding. To overcome this limitation, we propose a novel task called Embodied Captioning, which equips visual captioning models with navigation capabilities, enabling them to actively explore the scene and reduce the visual ambiguity caused by suboptimal viewpoints. Specifically, starting from a random viewpoint, an agent must navigate the environment to gather information from different viewpoints and generate a comprehensive paragraph describing all objects in the scene. To support this task, we build the ET-Cap dataset with the Kubric simulator, consisting of 10K 3D scenes with cluttered objects and three annotated paragraphs per scene. We propose a Cascade Embodied Captioning model (CaBOT), which comprises a navigator and a captioner, to tackle this task. The navigator predicts which actions to take in the environment, while the captioner generates a paragraph description based on the whole navigation trajectory. Extensive experiments demonstrate that our model outperforms other carefully designed baselines. Our dataset, code, and models are available at https://aim3-ruc.github.io/ExploreAndTell.
Comment: 12 pages; 10 figures; ICCV 2023
Movie101: A New Movie Understanding Benchmark
To help the visually impaired enjoy movies, automatic movie narrating systems are expected to generate accurate, coherent, and role-aware narrations of the plot when no actors are speaking. Existing works benchmark this challenge as a standard video captioning task via simplifications, such as removing role names and evaluating narrations with n-gram-based metrics, which makes it difficult for automatic systems to meet the needs of real application scenarios. To narrow this gap, we construct a large-scale Chinese movie benchmark named Movie101. Closer to real scenarios, the Movie Clip Narrating (MCN) task in our benchmark asks models to generate role-aware narration paragraphs for complete movie clips in which no actors are speaking. External knowledge, such as role information and movie genres, is also provided for better movie understanding. In addition, we propose a new metric, Movie Narration Score (MNScore), for movie narration evaluation, which achieves the best correlation with human evaluation. Our benchmark also supports the Temporal Narration Grounding (TNG) task, which investigates clip localization given text descriptions. For both tasks, our proposed methods effectively leverage external knowledge and outperform carefully designed baselines. The dataset and code are released at https://github.com/yuezih/Movie101.
Comment: Accepted to ACL 2023
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
Multi-modal Large Language Models (MLLMs) have demonstrated impressive instruction abilities across various open-ended tasks. However, previous methods have primarily focused on enhancing multi-modal capabilities. In this work, we introduce a versatile multi-modal large language model, mPLUG-Owl2, which effectively leverages modality collaboration to improve performance in both text and multi-modal tasks. mPLUG-Owl2 utilizes a modularized network design, with the language decoder acting as a universal interface for managing different modalities. Specifically, mPLUG-Owl2 incorporates shared functional modules to facilitate modality collaboration and introduces a modality-adaptive module that preserves modality-specific features. Extensive experiments reveal that mPLUG-Owl2 is capable of generalizing to both text and multi-modal tasks while achieving state-of-the-art performance with a single generic model. Notably, mPLUG-Owl2 is the first MLLM to demonstrate the modality collaboration phenomenon in both pure-text and multi-modal scenarios, setting a pioneering path for the development of future multi-modal foundation models.
Understanding health and social challenges for aging and long-term care in China
The second King’s College London Symposium on Ageing and Long-term Care in China was convened on 4-5 July 2019 at King’s College London. The aim of the symposium was to gain a better understanding of the health and social challenges for aging and long-term care in China. The symposium drew research insights from a wide range of disciplines, including economics, public policy, demography, gerontology, public health, and sociology. A total of 20 participants from eight countries sought to identify the key issues and research priorities in the area of aging and long-term care in China. The results published here are a synthesis of the top four research areas, representing the perspectives of some of the leading researchers in the field. © The Author(s) 2020
Mechanisms and Therapeutic Targets of Depression After Intracerebral Hemorrhage
The relationship between depression and intracerebral hemorrhage (ICH) is complicated. Post-ICH depression is one of the most common neuropsychiatric comorbidities of hemorrhagic stroke. Depression, as a neuropsychiatric symptom, also negatively impacts the outcome of ICH by increasing morbidity, disability, and mortality. However, the outcome of ICH can be improved by antidepressants such as the frequently used selective serotonin reuptake inhibitors. This review therefore presents the mechanisms of post-ICH depression, grouped into inflammation, oxidative stress (OS), apoptosis, and autophagy, and explains them through their associated signaling pathways. Inflammation is mainly related to Toll-like receptors (TLRs), the NF-κB-mediated signaling pathway, the PPAR-γ-dependent pathway, and other signaling pathways. OS is associated with nuclear factor erythroid 2-related factor 2 (Nrf2), the PI3K/Akt pathway, and the MAPK/p38 pathway. Moreover, autophagy is associated with the mTOR signaling cascade and the NF-κB-mediated signaling pathway, while apoptosis is correlated with the death receptor-mediated apoptosis pathway, the mitochondrial apoptosis pathway, caspase-independent pathways, and others. Furthermore, we find that neuroinflammation, oxidative stress, autophagy, and apoptosis interact with one another. This review may also provide several potential therapeutic targets for patients who suffer from depression after ICH.
The Role of lncRNAs in the Distant Metastasis of Breast Cancer
Breast cancer (BC) remains the most frequently diagnosed cancer worldwide. Among breast cancer patients, distant metastasis and invasion are the leading causes of BC-related death. Recently, long non-coding RNAs (lncRNAs), which used to be considered a genetic byproduct (owing to their unknown biological function), have been reported to be highly implicated in the development and progression of BC. In this review, we summarize the functions and mechanisms of lncRNAs implicated in the different distant metastases of BC. The functions of lncRNAs fall into two types: oncogenic and tumor-suppressive. Furthermore, the majority of them exert their roles through the regulation of invasion, migration, epithelial-mesenchymal transition (EMT), and the metastasis process. In the final part, we briefly address future research prospects for lncRNAs, especially methods for detecting lncRNAs in clinical work, and introduce several tools for detecting lncRNAs more conveniently. Although lncRNA research is still in its initial stages, lncRNAs are promising prognosticators and novel therapeutic targets for BC metastasis, and they require more research in the future.
InfoMetIC: An Informative Metric for Reference-free Image Caption Evaluation
Automatic image captioning evaluation is critical for benchmarking and promoting advances in image captioning research. Existing metrics only provide a single score to measure caption quality, which is less explainable and informative. In contrast, humans can easily identify the problems of a caption in detail, e.g., which words are inaccurate and which salient objects are not described, and then rate the caption quality. To support such informative feedback, we propose an Informative Metric for Reference-free Image Caption evaluation (InfoMetIC). Given an image and a caption, InfoMetIC is able to report incorrect words and unmentioned image regions at a fine-grained level, and also provide a text precision score, a vision recall score, and an overall quality score at a coarse-grained level. The coarse-grained score of InfoMetIC achieves significantly better correlation with human judgements than existing metrics on multiple benchmarks. We also construct a token-level evaluation dataset and demonstrate the effectiveness of InfoMetIC in fine-grained evaluation. Our code and datasets are publicly available at https://github.com/HAWLYQ/InfoMetIC.
Comment: Accepted by ACL 2023 main conference