EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Embedding Matching
Current metrics for video captioning are mostly based on text-level comparison
between reference and candidate captions. However, they have some insuperable
drawbacks: they cannot handle videos without references, and they may yield
biased evaluations due to the one-to-many nature of video-to-text mapping and
their neglect of visual relevance. From a human evaluator's viewpoint, a
high-quality caption should be consistent with the provided video, but need not
be literally or semantically similar to the reference. Inspired by human
evaluation, we propose EMScore (Embedding Matching-based score), a novel
reference-free metric for video captioning that directly measures the
similarity between a video and a candidate caption. Benefiting from recent
advances in large-scale pre-training, we exploit a well pre-trained
vision-language model to extract the visual and linguistic embeddings used to
compute EMScore. Specifically, EMScore combines matching scores at both the
coarse-grained (video and caption) and fine-grained (frames and words) levels,
taking into account both the overall understanding and the detailed
characteristics of the video. Furthermore, considering the potential
information gain, EMScore can be flexibly extended to settings where
human-labeled references are available. Finally, we collect the VATEX-EVAL and
ActivityNet-FOIL datasets to systematically evaluate existing metrics. The
VATEX-EVAL experiments demonstrate that EMScore achieves higher human
correlation and lower reference dependency. The ActivityNet-FOIL experiment
verifies that EMScore can effectively identify "hallucinating" captions. The
datasets will be released to facilitate the development of video captioning
metrics. The code is available at: https://github.com/ShiYaya/emscore.
Comment: cvpr202
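The two-level matching described in the abstract can be sketched as follows. This is a hypothetical illustration, not the authors' released code (see the repository above): it assumes L2-normalized frame and word embeddings from a vision-language model, pools them for the coarse-grained score, and uses BERTScore-style greedy matching for the fine-grained score.

```python
import numpy as np

def emscore_sketch(frame_embs: np.ndarray, word_embs: np.ndarray) -> float:
    """Hypothetical EMScore-style metric.

    frame_embs: (num_frames, dim) L2-normalized frame embeddings.
    word_embs:  (num_words, dim)  L2-normalized word embeddings.
    """
    # Coarse-grained: cosine similarity between pooled video and caption embeddings.
    v_global = frame_embs.mean(axis=0)
    v_global /= np.linalg.norm(v_global)
    t_global = word_embs.mean(axis=0)
    t_global /= np.linalg.norm(t_global)
    coarse = float(v_global @ t_global)

    # Fine-grained: greedy frame-word matching, aggregated as an F1 score.
    sim = frame_embs @ word_embs.T          # (num_frames, num_words) cosine matrix
    precision = sim.max(axis=0).mean()      # each word matched to its best frame
    recall = sim.max(axis=1).mean()         # each frame matched to its best word
    fine = 2 * precision * recall / (precision + recall)

    # Combine the two granularities (a simple average here).
    return (coarse + fine) / 2
```

A perfectly matching pair of embedding sets scores 1.0; the reference-augmented variant mentioned in the abstract would additionally match the candidate against reference-caption embeddings.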
Clinical Insight-Augmented Multi-View Learning for Alzheimer’s Detection in Retinal OCTA Images
Alzheimer’s disease (AD) poses a significant global challenge, with a notable absence of accessible and cost-effective diagnostic tools for widespread AD detection. The retina, mirroring the brain in anatomy and physiology, has emerged as a potential avenue for rapid AD identification through retinal imaging. Current retinal image-based AD detection methods usually focus primarily on the macular area but ignore the potential value that the optic disc region may have for the detection task. In this study, we leverage both macular- and disc-centered OCTA images and propose a multi-region fusion framework for AD detection. Based on clinical evidence, we integrate handcrafted features into the framework to improve model performance and interpretability. Specifically, vascular morphological parameters extracted from the macular and disc regions are used as input to a revalued KNN model to improve predictive capability. Furthermore, recognizing the significance of extracting and utilizing complementary information from the macular and optic disc regions, we propose an uncertainty-guided strategy based on Dempster-Shafer Theory (DST) to fuse knowledge from different regions. This approach accounts for each region’s prediction quality and significantly improves the effectiveness and robustness of the model. Through comparative analysis with existing methods, we demonstrate that our method outperforms state-of-the-art approaches and provides more valuable pathological evidence for the association between retinal vascular changes and AD.
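The DST-based fusion step can be illustrated with Dempster's rule of combination. The sketch below is a generic, hypothetical example (not the paper's implementation): two mass functions over the frame {AD, healthy}, one per retinal region, are combined, with mass on the whole frame encoding each region's uncertainty.

```python
def dempster_combine(m1: dict, m2: dict) -> dict:
    """Combine two mass functions (dicts mapping frozensets of hypotheses to
    masses) with Dempster's rule, normalizing out the conflicting mass."""
    combined: dict = {}
    conflict = 0.0
    for a, ma in m1.items():
        for b, mb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + ma * mb
            else:
                conflict += ma * mb  # contradictory evidence
    norm = 1.0 - conflict
    return {k: v / norm for k, v in combined.items()}

# Hypothetical per-region evidence: mass on the full frame {AD, H} models
# that region's uncertainty about its own prediction.
AD, H = frozenset({"AD"}), frozenset({"H"})
THETA = AD | H
m_macula = {AD: 0.6, H: 0.2, THETA: 0.2}
m_disc = {AD: 0.5, H: 0.3, THETA: 0.2}
fused = dempster_combine(m_macula, m_disc)
```

A region whose prediction quality is low would place more mass on the full frame THETA, so its singleton evidence contributes less to the fused decision.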