39 research outputs found

    MPMQA: Multimodal Question Answering on Product Manuals

    Visual contents, such as illustrations and images, play an important role in product manual understanding. Existing Product Manual Question Answering (PMQA) datasets tend to ignore visual contents and retain only the textual parts. In this work, to emphasize the importance of multimodal contents, we propose a Multimodal Product Manual Question Answering (MPMQA) task. For each question, MPMQA requires the model not only to process multimodal contents but also to provide multimodal answers. To support MPMQA, a large-scale dataset, PM209, is constructed with human annotations; it contains 209 product manuals from 27 well-known consumer electronics brands. The human annotations include 6 types of semantic regions for manual contents and 22,021 question-answer pairs. Notably, each answer consists of a textual sentence and related visual regions from the manuals. Given the length of product manuals and the fact that a question is always related to a small number of pages, MPMQA can be naturally split into two subtasks: retrieving the most relevant pages and then generating multimodal answers. We further propose a unified model that performs both subtasks jointly and achieves performance comparable to multiple task-specific models. The PM209 dataset is available at https://github.com/AIM3-RUC/MPMQA
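
    The abstract describes a two-subtask pipeline: retrieve the most relevant manual pages, then generate a multimodal answer. The sketch below illustrates that decomposition; the PageRetriever and AnswerGenerator interfaces and the Page fields are hypothetical placeholders, not the authors' unified model or the PM209 schema.

```python
# Minimal sketch of the two-subtask pipeline (page retrieval, then multimodal
# answer generation). PageRetriever, AnswerGenerator, and the Page fields are
# hypothetical placeholders, not the authors' model or the PM209 schema.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Page:
    page_id: int
    text: str                                   # extracted text of the manual page
    regions: List[Tuple[int, int, int, int]]    # candidate visual regions (x, y, w, h)


class PageRetriever:
    def score(self, question: str, page: Page) -> float:
        """Relevance score for (question, page); to be implemented by a real model."""
        raise NotImplementedError


class AnswerGenerator:
    def generate(self, question: str, pages: List[Page]):
        """Return (answer_sentence, supporting_regions); to be implemented."""
        raise NotImplementedError


def answer_question(question: str, manual: List[Page],
                    retriever: PageRetriever, generator: AnswerGenerator,
                    top_k: int = 2):
    # Subtask 1: retrieve the pages most relevant to the question.
    ranked = sorted(manual, key=lambda p: retriever.score(question, p), reverse=True)
    # Subtask 2: generate a textual answer plus its supporting visual regions.
    return generator.generate(question, ranked[:top_k])
```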

    Explore and Tell: Embodied Visual Captioning in 3D Environments

    While current visual captioning models have achieved impressive performance, they often assume that the image is well captured and provides a complete view of the scene. In real-world scenarios, however, a single image may not offer a good viewpoint, hindering fine-grained scene understanding. To overcome this limitation, we propose a novel task called Embodied Captioning, which equips visual captioning models with navigation capabilities, enabling them to actively explore the scene and reduce the visual ambiguity of suboptimal viewpoints. Specifically, starting at a random viewpoint, an agent must navigate the environment to gather information from different viewpoints and generate a comprehensive paragraph describing all objects in the scene. To support this task, we build the ET-Cap dataset with the Kubric simulator, consisting of 10K 3D scenes with cluttered objects and three annotated paragraphs per scene. We propose a Cascade Embodied Captioning model (CaBOT), which comprises a navigator and a captioner, to tackle this task. The navigator predicts which actions to take in the environment, while the captioner generates a paragraph description based on the whole navigation trajectory. Extensive experiments demonstrate that our model outperforms other carefully designed baselines. Our dataset, code, and models are available at https://aim3-ruc.github.io/ExploreAndTell. Comment: 12 pages; 10 figures; ICCV 2023.
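
    Below is a minimal sketch of the navigator-then-captioner cascade described above, assuming a hypothetical environment interface with reset() and step(); it is illustrative only, not the CaBOT implementation.

```python
# Minimal sketch of a navigator -> captioner cascade. The Navigator, Captioner,
# and env interfaces (reset/step) are hypothetical placeholders, not CaBOT itself.
from typing import List

ACTIONS = ["move_forward", "turn_left", "turn_right", "look_up", "look_down", "stop"]


class Navigator:
    def next_action(self, observation, history: List[str]) -> str:
        """Choose the next action given the current view and the action history."""
        raise NotImplementedError


class Captioner:
    def describe(self, trajectory) -> str:
        """Generate a paragraph describing the scene from the whole trajectory."""
        raise NotImplementedError


def embodied_captioning(env, navigator: Navigator, captioner: Captioner,
                        max_steps: int = 20) -> str:
    obs = env.reset()                      # start from a random viewpoint
    history, trajectory = [], [obs]
    for _ in range(max_steps):
        action = navigator.next_action(obs, history)
        if action == "stop":               # the navigator decides when to stop exploring
            break
        obs = env.step(action)             # move to gather a new viewpoint
        history.append(action)
        trajectory.append(obs)
    # The captioner conditions on observations from the whole navigation trajectory.
    return captioner.describe(trajectory)
```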

    Movie101: A New Movie Understanding Benchmark

    To help the visually impaired enjoy movies, automatic movie narrating systems are expected to narrate accurate, coherent, and role-aware plots when no actors are speaking. Existing works benchmark this challenge as a normal video captioning task via simplifications such as removing role names and evaluating narrations with n-gram-based metrics, which makes it difficult for automatic systems to meet the needs of real application scenarios. To narrow this gap, we construct a large-scale Chinese movie benchmark named Movie101. Closer to real scenarios, the Movie Clip Narrating (MCN) task in our benchmark asks models to generate role-aware narration paragraphs for complete movie clips in which no actors are speaking. External knowledge, such as role information and movie genres, is also provided for better movie understanding. Besides, we propose a new metric called Movie Narration Score (MNScore) for movie narration evaluation, which achieves the best correlation with human evaluation. Our benchmark also supports the Temporal Narration Grounding (TNG) task, which investigates clip localization given text descriptions. For both tasks, our proposed methods leverage external knowledge well and outperform carefully designed baselines. The dataset and codes are released at https://github.com/yuezih/Movie101. Comment: Accepted to ACL 2023.
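
    As a rough illustration of the two benchmark tasks, the sketch below shows hypothetical interfaces for Movie Clip Narrating (generate a role-aware narration for a clip) and Temporal Narration Grounding (localize a clip from a narration). The field names and model methods are assumptions, not the Movie101 schema or the authors' models.

```python
# Hypothetical sketch of the two benchmark task interfaces. Field names and the
# model methods (generate/localize) are illustrative assumptions, not the
# Movie101 schema or the authors' models.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MovieClip:
    video_path: str
    start: float                 # clip boundaries (seconds) where no actor is speaking
    end: float
    roles: List[str]             # external knowledge: role names in the movie
    genres: List[str]            # external knowledge: movie genres


def narrate_clip(model, clip: MovieClip) -> str:
    """MCN: generate a role-aware narration paragraph for a complete clip."""
    return model.generate(clip.video_path, (clip.start, clip.end),
                          roles=clip.roles, genres=clip.genres)


def ground_narration(model, video_path: str, narration: str) -> Tuple[float, float]:
    """TNG: localize the (start, end) segment described by the narration."""
    return model.localize(video_path, narration)
```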

    mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

    Multi-modal Large Language Models (MLLMs) have demonstrated impressive instruction abilities across various open-ended tasks. However, previous methods have primarily focused on enhancing multi-modal capabilities. In this work, we introduce a versatile multi-modal large language model, mPLUG-Owl2, which effectively leverages modality collaboration to improve performance on both text and multi-modal tasks. mPLUG-Owl2 utilizes a modularized network design, with the language decoder acting as a universal interface for managing different modalities. Specifically, mPLUG-Owl2 incorporates shared functional modules to facilitate modality collaboration and introduces a modality-adaptive module that preserves modality-specific features. Extensive experiments reveal that mPLUG-Owl2 generalizes to both text tasks and multi-modal tasks, achieving state-of-the-art performance with a single generic model. Notably, mPLUG-Owl2 is the first MLLM to demonstrate the modality collaboration phenomenon in both pure-text and multi-modal scenarios, setting a pioneering path in the development of future multi-modal foundation models.
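
    The following PyTorch sketch illustrates the general idea of combining shared modules with modality-specific components, in the spirit of a modality-adaptive module: each modality keeps its own normalization while a shared projection supports collaboration in a common space. It is a simplified approximation for illustration, not the mPLUG-Owl2 implementation.

```python
# Simplified PyTorch illustration of shared modules plus a modality-adaptive
# component: each modality keeps its own LayerNorm while a shared projection
# operates in a common space. This is an approximation for illustration, not
# the mPLUG-Owl2 implementation.
import torch
import torch.nn as nn


class ModalityAdaptiveLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.shared_proj = nn.Linear(dim, dim)   # shared across modalities
        self.norm = nn.ModuleDict({              # modality-specific normalization
            "text": nn.LayerNorm(dim),
            "image": nn.LayerNorm(dim),
        })

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        # Normalize with the modality's own parameters, then apply the shared
        # transformation so both modalities collaborate in one space.
        return self.shared_proj(self.norm[modality](x))


layer = ModalityAdaptiveLayer(dim=64)
text_tokens = torch.randn(2, 16, 64)             # (batch, text length, dim)
image_tokens = torch.randn(2, 49, 64)            # (batch, visual tokens, dim)
fused = torch.cat([layer(text_tokens, "text"), layer(image_tokens, "image")], dim=1)
```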

    Understanding health and social challenges for aging and long-term care in China

    The second King’s College London Symposium on Ageing and Long-term Care in China was convened on 4-5 July 2019 at King’s College London. The aim of the symposium was to develop a better understanding of the health and social challenges for aging and long-term care in China. The symposium drew research insights from a wide range of disciplines, including economics, public policy, demography, gerontology, public health, and sociology. A total of 20 participants from eight countries sought to identify the key issues and research priorities in the area of aging and long-term care in China. The results published here are a synthesis of the top four research areas, representing the perspectives of some of the leading researchers in the field. © The Author(s) 2020

    Mechanisms and Therapeutic Targets of Depression After Intracerebral Hemorrhage

    The relationship between depression and intracerebral hemorrhage (ICH) is complicated. Post-ICH depression is one of the most common neuropsychiatric comorbidities of hemorrhagic stroke. Depression, as a neuropsychiatric symptom, also negatively impacts the outcome of ICH by increasing morbidity, disability, and mortality. However, the ICH outcome can be improved by antidepressants such as the frequently used selective serotonin reuptake inhibitors. This review therefore presents the mechanisms of post-ICH depression, grouped into inflammation, oxidative stress (OS), apoptosis, and autophagy, and explains them through their associated signaling pathways. Inflammation is mainly related to Toll-like receptors (TLRs), the NF-kB-mediated signaling pathway, the PPAR-γ-dependent pathway, and other signaling pathways. OS is associated with nuclear factor erythroid 2-related factor 2 (Nrf2), the PI3K/Akt pathway, and the MAPK/P38 pathway. Moreover, autophagy is associated with the mTOR signaling cascade and the NF-kB-mediated signaling pathway, while apoptosis is correlated with the death receptor-mediated apoptosis pathway, the mitochondrial apoptosis pathway, caspase-independent pathways, and others. Furthermore, we find that neuroinflammation, oxidative stress, autophagy, and apoptosis interact with one another. These mechanisms may provide several potential therapeutic targets for patients who suffer from depression after ICH.

    The Role of lncRNAs in the Distant Metastasis of Breast Cancer

    Breast cancer (BC) remains the most frequently diagnosed cancer worldwide. Among breast cancer patients, distant metastasis and invasion are the leading causes of BC-related death. Recently, long non-coding RNAs (lncRNAs), which were once considered genetic byproducts (owing to their unknown biological functions), have been reported to be highly implicated in the development and progression of BC. In this review, we summarize the functions and mechanisms of lncRNAs implicated in the different distant metastases of BC. The functions of lncRNAs are divided into two types: oncogenic and tumor-suppressive. The majority of them exert their roles through the regulation of invasion, migration, epithelial-mesenchymal transition (EMT), and the metastasis process. In the final part, we briefly address future research prospects for lncRNAs, especially methods for detecting lncRNAs in clinical work, and introduce several tools for detecting lncRNAs more conveniently. Although lncRNA research is still in its initial stages, lncRNAs are promising prognosticators and novel therapeutic targets for BC metastasis, which requires more research in the future.

    InfoMetIC: An Informative Metric for Reference-free Image Caption Evaluation

    Automatic image captioning evaluation is critical for benchmarking and promoting advances in image captioning research. Existing metrics only provide a single score to measure caption quality, which is less explainable and informative. In contrast, humans can easily identify the problems of a caption in detail, e.g., which words are inaccurate and which salient objects are not described, and then rate the caption quality. To support such informative feedback, we propose an Informative Metric for Reference-free Image Caption evaluation (InfoMetIC). Given an image and a caption, InfoMetIC is able to report incorrect words and unmentioned image regions at a fine-grained level, and also provides a text precision score, a vision recall score, and an overall quality score at a coarse-grained level. The coarse-grained score of InfoMetIC achieves significantly better correlation with human judgements than existing metrics on multiple benchmarks. We also construct a token-level evaluation dataset and demonstrate the effectiveness of InfoMetIC in fine-grained evaluation. Our code and datasets are publicly available at https://github.com/HAWLYQ/InfoMetIC. Comment: Accepted by ACL 2023 main conference.
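
    To make the coarse-grained scores concrete, the sketch below shows one simple way fine-grained judgements (per-word correctness, per-region coverage) could be aggregated into a text precision score, a vision recall score, and an overall score. The plain averaging here is purely illustrative and is not InfoMetIC's actual scoring function.

```python
# Illustrative aggregation of fine-grained judgements into coarse-grained scores
# (text precision, vision recall, overall). The simple averaging below is an
# assumption for illustration, not InfoMetIC's actual scoring function.
from typing import Dict, List


def aggregate_scores(word_correct: List[bool],
                     region_mentioned: List[bool]) -> Dict[str, float]:
    # Text precision: fraction of caption words judged correct.
    precision = sum(word_correct) / max(len(word_correct), 1)
    # Vision recall: fraction of salient image regions covered by the caption.
    recall = sum(region_mentioned) / max(len(region_mentioned), 1)
    # Overall quality: here simply the mean of the two coarse-grained scores.
    return {"text_precision": precision,
            "vision_recall": recall,
            "overall": (precision + recall) / 2}


print(aggregate_scores([True, True, False, True], [True, False, True]))
```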