TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding
Large-scale video-language pre-training has made remarkable strides in
advancing video-language understanding tasks. However, the heavy computational
burden of video encoding remains a formidable efficiency bottleneck,
particularly for long-form videos. These videos contain massive visual tokens
due to their inherent 3D properties and spatiotemporal redundancy, making it
challenging to capture complex temporal and spatial relationships. To tackle
this issue, we propose an efficient method called TEmporal-Spatial Token
Aggregation (TESTA). TESTA condenses video semantics by adaptively aggregating
similar frames, as well as similar patches within each frame. TESTA can reduce
the number of visual tokens by 75% and thus accelerate video encoding. Building
upon TESTA, we introduce a pre-trained video-language model equipped with a
divided space-time token aggregation module in each video encoder block. We
evaluate our model on five datasets for paragraph-to-video retrieval and
long-form VideoQA tasks. Experimental results show that TESTA improves
computing efficiency by 1.7 times, and achieves significant performance gains
from its scalability in processing more input frames, e.g., +13.7 R@1 on
QuerYD and +6.5 R@1 on Condensed Movie.
Comment: 16 pages, 9 figures, code is available at
https://github.com/RenShuhuai-Andy/TEST
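The aggregation idea TESTA describes, merging visually similar frames and similar patches to shrink the token sequence, can be pictured with a minimal sketch. The greedy adjacent-pair merge below is an assumption made for illustration, not the authors' divided space-time module:

    # Minimal sketch of similarity-based token aggregation (illustrative, not the TESTA code).
    import torch
    import torch.nn.functional as F

    def aggregate_tokens(tokens: torch.Tensor, reduce_ratio: float = 0.5) -> torch.Tensor:
        """Greedily average the most similar adjacent token pairs.

        tokens: (N, D) frame or patch embeddings; reduce_ratio: fraction of tokens to merge away.
        """
        tokens = tokens.clone()
        n = tokens.size(0)
        num_merges = min(int(n * reduce_ratio), n - 1)
        sim = F.cosine_similarity(tokens[:-1], tokens[1:], dim=-1)   # similarity to right neighbour
        merge_idx = sim.topk(num_merges).indices
        keep = torch.ones(n, dtype=torch.bool)
        for i in merge_idx.tolist():
            if keep[i] and keep[i + 1]:
                tokens[i] = 0.5 * (tokens[i] + tokens[i + 1])        # fuse the pair into one token
                keep[i + 1] = False
        return tokens[keep]

Applied once along the temporal axis (frames) and once spatially within each frame, a merge of this kind is how a roughly 75% token reduction could be realized in principle.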
DCA: Diversified Co-Attention towards Informative Live Video Commenting
We focus on the task of Automatic Live Video Commenting (ALVC), which aims to
generate real-time video comments with both video frames and other viewers'
comments as inputs. A major challenge in this task is how to properly leverage
the rich and diverse information carried by video and text. In this paper, we
aim to collect diversified information from video and text for informative
comment generation. To achieve this, we propose a Diversified Co-Attention
(DCA) model for this task. Our model builds bidirectional interactions between
video frames and surrounding comments from multiple perspectives via metric
learning, to collect a diversified and informative context for comment
generation. We also propose an effective parameter orthogonalization technique
to avoid excessive overlap of information learned from different perspectives.
Results show that our approach outperforms existing methods in the ALVC task,
achieving new state-of-the-art results.
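One way to picture the parameter orthogonalization idea, which discourages different "perspectives" from learning overlapping projections, is a cross-Gram penalty added to the training loss. The sketch below is a generic illustration under that assumption, not the DCA implementation:

    # Illustrative orthogonality penalty between per-perspective projection matrices.
    import torch

    def orthogonality_penalty(weights):
        """weights: list of (k, d) projection matrices, one per perspective."""
        penalty = weights[0].new_zeros(())
        for i in range(len(weights)):
            for j in range(i + 1, len(weights)):
                # Frobenius norm of the cross-Gram matrix; zero when the rows are mutually orthogonal.
                penalty = penalty + (weights[i] @ weights[j].T).pow(2).sum()
        return penalty

    # Hypothetical usage: total_loss = generation_loss + lam * orthogonality_penalty([W_video, W_text])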
Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition
This work proposes POMP, a prompt pre-training method for vision-language
models. Being memory and computation efficient, POMP enables the learned prompt
to condense semantic information for a rich set of visual concepts with over
twenty-thousand classes. Once pre-trained, the prompt is strongly transferable
and can be directly plugged into a variety of visual recognition tasks,
including image classification, semantic segmentation, and object detection, to
boost recognition performance in a zero-shot manner.
Empirical evaluation shows that POMP achieves state-of-the-art performance on
21 downstream datasets, e.g., 67.0% average accuracy on 10 classification
datasets (+3.1% compared to CoOp) and 84.4 hIoU on open-vocabulary Pascal VOC
segmentation (+6.9 compared to ZSSeg).
Comment: Code is available at
https://github.com/amazon-science/prompt-pretrainin
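A natural question is how prompt tuning stays memory- and compute-efficient with over twenty-thousand classes; one common device is to contrast each image against its own class plus a sampled subset of negative classes rather than the full class list. The function below is a hypothetical sketch of that idea (the names and sampling scheme are assumptions, not the released POMP code):

    # Hypothetical class-sampled contrastive step for prompt tuning.
    import torch
    import torch.nn.functional as F

    def prompt_tuning_step(image_feat, label, class_text_feats, num_neg=1000, temperature=0.07):
        """image_feat: (D,) normalized image embedding; label: ground-truth class index;
        class_text_feats: (C, D) normalized text embeddings built from the learned prompt."""
        num_classes = class_text_feats.size(0)
        neg = torch.randperm(num_classes)[:num_neg]
        neg = neg[neg != label]                                   # drop the positive if it was sampled
        idx = torch.cat([torch.tensor([label]), neg])             # positive first, then negatives
        logits = image_feat @ class_text_feats[idx].T / temperature
        return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))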
VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models
The ability to perceive how objects change over time is a crucial ingredient
in human intelligence. However, current benchmarks cannot faithfully reflect
the temporal understanding abilities of video-language models (VidLMs) due to
the existence of static visual shortcuts. To remedy this issue, we present
VITATECS, a diagnostic VIdeo-Text dAtaset for the evaluation of TEmporal
Concept underStanding. Specifically, we first introduce a fine-grained taxonomy
of temporal concepts in natural language in order to diagnose the capability of
VidLMs to comprehend different temporal aspects. Furthermore, to disentangle
the correlation between static and temporal information, we generate
counterfactual video descriptions that differ from the original one only in the
specified temporal aspect. We employ a semi-automatic data collection framework
using large language models and human-in-the-loop annotation to obtain
high-quality counterfactual descriptions efficiently. Evaluation of
representative video-language understanding models confirms their deficiency in
temporal understanding, revealing the need for greater emphasis on the temporal
elements in video-language research.
Comment: 23 pages, 6 figures, 18 tables, data is available at
https://github.com/lscpku/VITATEC
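The counterfactual-generation step, producing a description that differs from the original only in one specified temporal aspect, can be sketched as a single LLM call followed by human verification. The prompt wording and the generic llm callable below are assumptions for illustration, not the paper's pipeline:

    # Hypothetical counterfactual caption generation for one temporal aspect.
    def make_counterfactual(caption: str, aspect: str, llm) -> str:
        """llm is any text-in/text-out callable (API client, local model, ...)."""
        prompt = (
            f"Rewrite the video description below so that it differs from the original "
            f"ONLY in its {aspect} (e.g., order of events, direction, or speed). "
            f"Keep all objects, scenes, and static attributes unchanged.\n\n"
            f"Description: {caption}\nCounterfactual:"
        )
        return llm(prompt).strip()

    # Generated candidates would then be filtered and checked by human annotators before inclusion.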
Towards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and Beyond
In this study, we explore the potential of Multimodal Large Language Models
(MLLMs) in improving embodied decision-making processes for agents. While Large
Language Models (LLMs) have been widely used due to their advanced reasoning
skills and vast world knowledge, MLLMs like GPT4-Vision offer enhanced visual
understanding and reasoning capabilities. We investigate whether
state-of-the-art MLLMs can handle embodied decision-making in an end-to-end
manner and whether collaborations between LLMs and MLLMs can enhance
decision-making. To address these questions, we introduce a new benchmark
called PCA-EVAL, which evaluates embodied decision-making from the perspectives
of Perception, Cognition, and Action. Additionally, we propose HOLMES, a
multi-agent cooperation framework that allows LLMs to leverage MLLMs and APIs
to gather multimodal information for informed decision-making. We compare
end-to-end embodied decision-making and HOLMES on our benchmark and find that
the GPT4-Vision model demonstrates strong end-to-end embodied decision-making
abilities, outperforming GPT4-HOLMES in terms of average decision accuracy
(+3%). However, this advantage is exclusive to the latest GPT4-Vision model,
which surpasses the open-source state-of-the-art MLLM by 26%. Our results indicate
that powerful MLLMs like GPT4-Vision hold promise for decision-making in
embodied agents, offering new avenues for MLLM research. Code and data are open
at https://github.com/pkunlp-icler/PCA-EVAL/.
Comment: FMDM@NeurIPS2023, Code and data:
https://github.com/pkunlp-icler/PCA-EVAL
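A HOLMES-style loop, in which a text-only LLM gathers visual evidence from MLLMs or APIs before committing to an action, can be sketched with plain callables. The tool interface and prompt below are simplifications assumed for illustration, not the released code:

    # Simplified sketch of an LLM consulting vision tools before deciding on an action.
    def decide(task: str, image, llm, vision_tools: dict) -> str:
        """llm and each tool are plain callables; vision_tools maps a tool name to fn(image)."""
        observations = [f"{name}: {tool(image)}" for name, tool in vision_tools.items()]
        prompt = (
            f"Task: {task}\n"
            "Visual observations:\n" + "\n".join(observations) + "\n"
            "Choose the best action and briefly justify it."
        )
        return llm(prompt)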
Development and external validation of dual online tools for prognostic assessment in elderly patients with high-grade glioma: a comprehensive study using SEER and Chinese cohorts
Background: Elderly individuals diagnosed with high-grade gliomas frequently experience unfavorable outcomes. We aimed to design two web-based prognostic instruments to predict overall survival (OS) and cancer-specific survival (CSS) and assist clinical decision-making.
Methods: We analyzed data from the SEER database on 5,245 elderly patients diagnosed with high-grade glioma between 2000 and 2020, splitting them into training (3,672) and validation (1,573) subsets. An additional external validation cohort was obtained from our institution. Prognostic determinants were identified using Cox regression analyses, which guided the construction of the nomograms. The nomograms' predictive precision for OS and CSS was gauged using calibration and ROC curves, the C-index, and decision curve analysis (DCA). Based on risk scores, patients were stratified into high- or low-risk categories, and survival disparities were explored.
Results: Using multivariate Cox regression, we identified several prognostic factors for OS and CSS in elderly patients with high-grade gliomas, including age, tumor location, size, surgical technique, and therapies. Two digital nomograms were formulated based on these determinants. For OS, the C-index values in the training, internal, and external validation cohorts were 0.734, 0.729, and 0.701, respectively. We also derived AUC values for 3-, 6-, and 12-month horizons. For CSS, the C-index values for the training and validation groups were 0.733 and 0.727, with analogous AUC metrics. The efficacy and clinical relevance of the nomograms were corroborated via ROC curves, calibration plots, and DCA for both cohorts.
Conclusion: Our investigation pinpointed pivotal risk factors in elderly glioma patients, leading to prognostic nomograms for OS and CSS. These instruments offer valuable insights to optimize treatment strategies.
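For readers who want to reproduce the modeling pattern, the core steps, a multivariate Cox fit followed by a concordance check, look roughly like the sketch below. The file and column names are placeholders, not the study's actual SEER variables:

    # Illustrative Cox regression and C-index with lifelines (placeholder columns).
    import pandas as pd
    from lifelines import CoxPHFitter

    df = pd.read_csv("elderly_hgg_cohort.csv")                   # hypothetical per-patient table
    covariates = ["age", "tumor_location", "tumor_size", "surgery", "radiotherapy", "chemotherapy"]
    data = pd.get_dummies(df[covariates + ["survival_months", "death_event"]], drop_first=True)

    cph = CoxPHFitter()
    cph.fit(data, duration_col="survival_months", event_col="death_event")
    print(cph.summary[["coef", "exp(coef)", "p"]])               # hazard ratios per covariate
    print("C-index:", cph.concordance_index_)                    # cf. 0.734 reported for the OS training cohort

The C-index, together with calibration, ROC, and DCA analyses, assesses how well such a nomogram discriminates and calibrates; risk scores from the fitted model can then stratify patients into high- and low-risk groups.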