
    TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding

    Large-scale video-language pre-training has made remarkable strides in advancing video-language understanding tasks. However, the heavy computational burden of video encoding remains a formidable efficiency bottleneck, particularly for long-form videos. These videos contain massive visual tokens due to their inherent 3D properties and spatiotemporal redundancy, making it challenging to capture complex temporal and spatial relationships. To tackle this issue, we propose an efficient method called TEmporal-Spatial Token Aggregation (TESTA). TESTA condenses video semantics by adaptively aggregating similar frames, as well as similar patches within each frame. TESTA can reduce the number of visual tokens by 75% and thus accelerate video encoding. Building upon TESTA, we introduce a pre-trained video-language model equipped with a divided space-time token aggregation module in each video encoder block. We evaluate our model on five datasets for paragraph-to-video retrieval and long-form VideoQA tasks. Experimental results show that TESTA improves computing efficiency by 1.7 times, and achieves significant performance gains from its scalability in processing longer input frames, e.g., +13.7 R@1 on QuerYD and +6.5 R@1 on Condensed Movie.
    Comment: 16 pages, 9 figures, code is available at https://github.com/RenShuhuai-Andy/TEST
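    The aggregation step described above amounts to repeatedly merging the most redundant visual tokens. The sketch below is a generic, greedy similarity-based merger written for illustration only; the function name `aggregate_tokens`, the greedy pairing loop, and the averaging rule are a simplification, not the divided space-time module released in the authors' repository.

```python
# Minimal sketch of similarity-based token aggregation (not the TESTA implementation).
import torch
import torch.nn.functional as F

def aggregate_tokens(tokens: torch.Tensor, keep_ratio: float = 0.25) -> torch.Tensor:
    """tokens: [N, D] visual tokens of one frame (or one patch position across frames)."""
    n, _ = tokens.shape
    n_keep = max(1, int(n * keep_ratio))
    x = F.normalize(tokens, dim=-1)
    sim = x @ x.t()                                  # pairwise cosine similarity
    sim.fill_diagonal_(-float("inf"))                # never merge a token with itself
    merged = tokens.clone()
    alive = torch.ones(n, dtype=torch.bool)
    while alive.sum() > n_keep:
        masked = sim.clone()
        masked[~alive, :] = -float("inf")            # ignore tokens already merged away
        masked[:, ~alive] = -float("inf")
        flat = int(masked.argmax())
        i, j = flat // n, flat % n                   # most similar surviving pair
        merged[i] = (merged[i] + merged[j]) / 2      # average the pair into one token
        alive[j] = False                             # (similarities are not refreshed: a simplification)
    return merged[alive]                             # [n_keep, D]

frame_tokens = torch.randn(196, 768)                 # e.g. 14x14 patch tokens of one frame
reduced = aggregate_tokens(frame_tokens)             # 196 -> 49 tokens, i.e. a 75% reduction
```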

    DCA: Diversified Co-Attention towards Informative Live Video Commenting

    We focus on the task of Automatic Live Video Commenting (ALVC), which aims to generate real-time video comments with both video frames and other viewers' comments as inputs. A major challenge in this task is how to properly leverage the rich and diverse information carried by video and text. In this paper, we aim to collect diversified information from video and text for informative comment generation. To achieve this, we propose a Diversified Co-Attention (DCA) model for this task. Our model builds bidirectional interactions between video frames and surrounding comments from multiple perspectives via metric learning, to collect a diversified and informative context for comment generation. We also propose an effective parameter orthogonalization technique to avoid excessive overlap of information learned from different perspectives. Results show that our approach outperforms existing methods in the ALVC task, achieving new state-of-the-art results.
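    The "parameter orthogonalization technique" can be pictured as a soft penalty that discourages the projection matrices of different perspectives from learning overlapping directions. The snippet below shows one common way to express such a penalty; the exact formulation in DCA may differ, and `orthogonality_penalty`, `lambda_orth`, and the matrices `W1..W3` are illustrative names only.

```python
# Hypothetical soft parameter-orthogonalization penalty (not necessarily DCA's formulation).
import torch

def orthogonality_penalty(weights: list[torch.Tensor]) -> torch.Tensor:
    """weights: projection matrices [D_in, D_out], one per perspective; all share D_in."""
    W = torch.cat([w / w.norm(dim=0, keepdim=True) for w in weights], dim=1)  # [D_in, K*D_out]
    gram = W.t() @ W                                   # pairwise column correlations
    off_diag = gram - torch.diag(torch.diag(gram))     # drop self-correlations
    return off_diag.pow(2).sum()                       # push cross-perspective columns apart

# total_loss = generation_loss + lambda_orth * orthogonality_penalty([W1, W2, W3])
```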

    Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition

    This work proposes POMP, a prompt pre-training method for vision-language models. Being memory and computation efficient, POMP enables the learned prompt to condense semantic information for a rich set of visual concepts with over twenty-thousand classes. Once pre-trained, the prompt with a strong transferable ability can be directly plugged into a variety of visual recognition tasks including image classification, semantic segmentation, and object detection, to boost recognition performances in a zero-shot manner. Empirical evaluation shows that POMP achieves state-of-the-art performances on 21 downstream datasets, e.g., 67.0% average accuracy on 10 classification datasets (+3.1% compared to CoOp) and 84.4 hIoU on open-vocabulary Pascal VOC segmentation (+6.9 compared to ZSSeg).
    Comment: Code is available at https://github.com/amazon-science/prompt-pretrainin
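    Pre-training a prompt over twenty-thousand classes is only tractable if the softmax is not computed over the full class vocabulary at every step. The sketch below illustrates that idea with a sampled negative set and a heavily simplified prompt (a single learnable vector added to frozen class-name features); POMP's actual context tokens and local-contrast details are in the linked repository, so treat every name here as an assumption.

```python
# Illustrative sketch of prompt learning with sampled classes (not the POMP implementation).
import torch
import torch.nn.functional as F

num_classes, embed_dim, sample_size = 20_000, 512, 1_000
class_name_emb = torch.randn(num_classes, embed_dim)          # frozen text features, one per class name
prompt = torch.nn.Parameter(torch.zeros(embed_dim))           # learnable prompt, simplified to a shared bias

def prompt_loss(image_feat: torch.Tensor, label: int) -> torch.Tensor:
    """image_feat: [D] image feature from a frozen encoder; label: ground-truth class index."""
    negatives = torch.randint(0, num_classes, (sample_size,))  # sampled negative classes (may collide with label; simplification)
    idx = torch.cat([torch.tensor([label]), negatives])        # true class placed at position 0
    text_feat = F.normalize(class_name_emb[idx] + prompt, dim=-1)
    logits = text_feat @ F.normalize(image_feat, dim=-1) * 100.0   # temperature-scaled similarities
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```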

    VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models

    The ability to perceive how objects change over time is a crucial ingredient in human intelligence. However, current benchmarks cannot faithfully reflect the temporal understanding abilities of video-language models (VidLMs) due to the existence of static visual shortcuts. To remedy this issue, we present VITATECS, a diagnostic VIdeo-Text dAtaset for the evaluation of TEmporal Concept underStanding. Specifically, we first introduce a fine-grained taxonomy of temporal concepts in natural language in order to diagnose the capability of VidLMs to comprehend different temporal aspects. Furthermore, to disentangle the correlation between static and temporal information, we generate counterfactual video descriptions that differ from the original ones only in the specified temporal aspect. We employ a semi-automatic data collection framework using large language models and human-in-the-loop annotation to obtain high-quality counterfactual descriptions efficiently. Evaluation of representative video-language understanding models confirms their deficiency in temporal understanding, revealing the need for greater emphasis on the temporal elements in video-language research.
    Comment: 23 pages, 6 figures, 18 tables, data is available at https://github.com/lscpku/VITATEC
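    Because each counterfactual description differs from its original only in one temporal aspect, a VidLM can be probed by whether it ranks the original caption above the counterfactual for the same video. The helper below sketches that protocol; `video_text_score` is an assumed interface standing in for whatever matching head a given model exposes, not part of the VITATECS release.

```python
# Sketch of a pairwise original-vs-counterfactual evaluation protocol.
from typing import Callable, Sequence

def temporal_accuracy(
    examples: Sequence[tuple[str, str, str]],            # (video_path, original_caption, counterfactual_caption)
    video_text_score: Callable[[str, str], float],       # higher score = better video-text match (assumed interface)
) -> float:
    correct = sum(
        video_text_score(video, original) > video_text_score(video, counterfactual)
        for video, original, counterfactual in examples
    )
    return correct / len(examples)                        # chance level is 0.5
```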

    Towards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and Beyond

    In this study, we explore the potential of Multimodal Large Language Models (MLLMs) in improving embodied decision-making processes for agents. While Large Language Models (LLMs) have been widely used due to their advanced reasoning skills and vast world knowledge, MLLMs like GPT4-Vision offer enhanced visual understanding and reasoning capabilities. We investigate whether state-of-the-art MLLMs can handle embodied decision-making in an end-to-end manner and whether collaborations between LLMs and MLLMs can enhance decision-making. To address these questions, we introduce a new benchmark called PCA-EVAL, which evaluates embodied decision-making from the perspectives of Perception, Cognition, and Action. Additionally, we propose HOLMES, a multi-agent cooperation framework that allows LLMs to leverage MLLMs and APIs to gather multimodal information for informed decision-making. We compare end-to-end embodied decision-making and HOLMES on our benchmark and find that the GPT4-Vision model demonstrates strong end-to-end embodied decision-making abilities, outperforming GPT4-HOLMES in terms of average decision accuracy (+3%). However, this performance is exclusive to the latest GPT4-Vision model, surpassing the open-source state-of-the-art MLLM by 26%. Our results indicate that powerful MLLMs like GPT4-Vision hold promise for decision-making in embodied agents, offering new avenues for MLLM research. Code and data are open at https://github.com/pkunlp-icler/PCA-EVAL/.
    Comment: FMDM@NeurIPS2023, Code and data: https://github.com/pkunlp-icler/PCA-EVAL
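    HOLMES-style cooperation boils down to a text-only LLM deciding, turn by turn, whether to query a visual tool or to commit to an action. The loop below is a minimal sketch of that pattern; the tool names, the prompt format, and the `llm` callable are assumptions for illustration and do not mirror the released PCA-EVAL code.

```python
# Minimal sketch of an LLM-drives-tools decision loop (illustrative, not the HOLMES framework).
from typing import Callable, Dict

def embodied_decision(
    question: str,
    image_path: str,
    llm: Callable[[str], str],                    # text-only LLM, returns its next message
    tools: Dict[str, Callable[[str], str]],       # e.g. {"caption_image": ..., "detect_objects": ...} (assumed tools)
    max_turns: int = 4,
) -> str:
    context = f"Question: {question}\nAvailable tools: {', '.join(tools)}\n"
    for _ in range(max_turns):
        reply = llm(context + "Reply with 'CALL <tool>' to gather information or 'ACTION <answer>' to decide.")
        if reply.startswith("ACTION"):
            return reply.removeprefix("ACTION").strip()          # final decision
        tool_name = reply.removeprefix("CALL").strip()
        observation = tools.get(tool_name, lambda _: "unknown tool")(image_path)
        context += f"Tool {tool_name} returned: {observation}\n"  # feed the observation back to the LLM
    return "no decision"                                          # fall back after max_turns
```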

    Development and external validation of dual online tools for prognostic assessment in elderly patients with high-grade glioma: a comprehensive study using SEER and Chinese cohorts

    Background: Elderly individuals diagnosed with high-grade gliomas frequently experience unfavorable outcomes. We aimed to design two web-based prognostic instruments to predict overall survival (OS) and cancer-specific survival (CSS), assisting clinical decision-making.
    Methods: We scrutinized data from the SEER database on 5,245 elderly patients diagnosed with high-grade glioma between 2000 and 2020, segmenting them into training (3,672) and validation (1,573) subsets. An additional external validation cohort was obtained from our institution. Prognostic determinants were pinpointed using Cox regression analyses, which facilitated the construction of the nomogram. The nomogram's predictive precision for OS and CSS was gauged using calibration and ROC curves, the C-index, and decision curve analysis (DCA). Based on risk scores, patients were stratified into high- or low-risk categories, and survival disparities were explored.
    Results: Using multivariate Cox regression, we identified several prognostic factors for OS and CSS in elderly patients with high-grade gliomas, including age, tumor location, size, surgical technique, and therapies. Two digital nomograms were formulated anchored on these determinants. For OS, the C-index values in the training, internal, and external validation cohorts were 0.734, 0.729, and 0.701, respectively. We also derived AUC values for 3-, 6-, and 12-month periods. For CSS, the C-index values for the training and validation groups were 0.733 and 0.727, with analogous AUC metrics. The efficacy and clinical relevance of the nomograms were corroborated via ROC curves, calibration plots, and DCA for both cohorts.
    Conclusion: Our investigation pinpointed pivotal risk factors in elderly glioma patients, leading to the development of an instrumental prognostic nomogram for OS and CSS. This instrument offers invaluable insights to optimize treatment strategies.
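    For readers who want to reproduce the statistical core of such a study, the snippet below sketches fitting a Cox proportional-hazards model and reporting a concordance index with the lifelines library; the column names are placeholders rather than SEER field names, and the nomogram construction and web tools are outside its scope.

```python
# Sketch of Cox regression plus C-index reporting (placeholder data columns, not the study's code).
import pandas as pd
from lifelines import CoxPHFitter
from lifelines.utils import concordance_index

train = pd.read_csv("train_cohort.csv")       # columns: survival_months, event, age, tumor_size, ...
valid = pd.read_csv("validation_cohort.csv")  # same columns, held-out cohort

cph = CoxPHFitter()
cph.fit(train, duration_col="survival_months", event_col="event")
cph.print_summary()                            # hazard ratios for each candidate prognostic factor

# C-index on the validation cohort: a larger partial hazard should mean an earlier event
c_index = concordance_index(
    valid["survival_months"],
    -cph.predict_partial_hazard(valid),        # negate so higher risk ranks as shorter survival
    valid["event"],
)
print(f"validation C-index: {c_index:.3f}")
```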