6 research outputs found
GPT-4 as an Effective Zero-Shot Evaluator for Scientific Figure Captions
There is growing interest in systems that generate captions for scientific
figures. However, assessing these systems' output poses a significant challenge.
Human evaluation requires academic expertise and is costly, while automatic
evaluation depends on often low-quality author-written captions. This paper
investigates using large language models (LLMs) as a cost-effective,
reference-free method for evaluating figure captions. We first constructed
SCICAP-EVAL, a human evaluation dataset that contains human judgments for 3,600
scientific figure captions, both original and machine-made, for 600 arXiv
figures. We then prompted LLMs like GPT-4 and GPT-3 to score (1-6) each caption
based on its potential to aid reader understanding, given relevant context such
as figure-mentioning paragraphs. Results show that GPT-4, used as a zero-shot
evaluator, outperformed all other models and even surpassed assessments made by
Computer Science and Informatics undergraduates, achieving a Kendall
correlation score of 0.401 with Ph.D. students' rankings.
Comment: To appear in EMNLP 2023 Findings
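The agreement metric above can be made concrete with a small sketch. This is a pure-Python Kendall correlation (tau-a, no tie correction, which is a simplification); the score lists are hypothetical, not the paper's data:

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall rank correlation (tau-a) between two score lists:
    (concordant pairs - discordant pairs) / total pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (len(a) * (len(a) - 1) / 2)

# Hypothetical 1-6 helpfulness scores for six captions:
# one list from an LLM evaluator, one from Ph.D. annotators.
llm_scores = [6, 4, 5, 2, 3, 1]
phd_scores = [5, 4, 6, 1, 3, 2]
print(round(kendall_tau(llm_scores, phd_scores), 3))  # 0.733
```

A value of 1.0 would mean identical rankings and -1.0 fully reversed ones, which puts the reported 0.401 in context as moderate agreement.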
Summaries as Captions: Generating Figure Captions for Scientific Documents with Automated Text Summarization
Good figure captions help paper readers understand complex scientific
figures. Unfortunately, even published papers often have poorly written
captions. Automatic caption generation could aid paper writers by providing
good starting captions that can be refined for better quality. Prior work often
treated figure caption generation as a vision-to-language task. In this paper,
we show that it can be more effectively tackled as a text summarization task in
scientific documents. We fine-tuned PEGASUS, a pre-trained abstractive
summarization model, to specifically summarize figure-referencing paragraphs
(e.g., "Figure 3 shows...") into figure captions. Experiments on large-scale
arXiv figures show that our method outperforms prior vision methods in both
automatic and human evaluations. We further conducted an in-depth investigation
focused on two key challenges: (i) the common presence of low-quality
author-written captions and (ii) the lack of clear standards for good captions.
Our code and data are available at:
https://github.com/Crowd-AI-Lab/Generating-Figure-Captions-as-a-Text-Summarization-Task.
Comment: Accepted by INLG 2023
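The summarization framing above depends on first locating figure-referencing paragraphs. A minimal sketch of that selection step (the regex and function name are my own illustration, not the authors' code):

```python
import re

def paragraphs_mentioning_figure(paragraphs, figure_number):
    """Return paragraphs that explicitly reference the given figure
    (e.g., "Figure 3 shows..." or "Fig. 3"); such paragraphs serve as
    the summarization input in the approach described above."""
    pattern = re.compile(
        rf"\b(?:Figure|Fig\.?)\s*{figure_number}\b", re.IGNORECASE
    )
    return [p for p in paragraphs if pattern.search(p)]

paras = [
    "Figure 3 shows accuracy rising with model size.",
    "We describe the training setup in Section 4.",
    "As seen in Fig. 3, the trend holds across datasets.",
]
print(len(paragraphs_mentioning_figure(paras, 3)))  # 2
```

The selected paragraphs would then be fed to an abstractive summarizer such as the fine-tuned PEGASUS model the paper describes.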
Reviving Static Charts into Live Charts
Data charts are prevalent across various fields due to their efficacy in
conveying complex data relationships. However, static charts may sometimes
struggle to engage readers and efficiently present intricate information,
potentially resulting in limited understanding. We introduce "Live Charts," a
new format of presentation that decomposes complex information within a chart
and explains the information pieces sequentially through rich animations and
accompanying audio narration. We propose an automated approach to revive static
charts into Live Charts. Our method integrates GNN-based techniques to analyze
the chart components and extract the underlying data. We then use large
language models to generate appropriate animated visuals along with a
voice-over to produce Live Charts from static ones. We conducted a thorough
evaluation of our approach, which involved the model performance, use cases, a
crowd-sourced user study, and expert interviews. The results demonstrate that
Live Charts offer a multi-sensory experience in which readers can follow the
information and understand the data insights better. We analyze the benefits
and drawbacks of Live Charts over static charts as a new information
consumption experience.
Parsing AUC Result-Figures in Machine Learning Specific Scholarly Documents for Semantically-enriched Summarization
Machine learning-specific scholarly full-text documents contain many
result-figures expressing valuable data, including experimental results,
evaluations, and cross-model comparisons. Scholarly search systems often
overlook this vital information while indexing important terms using
conventional text-based content extraction approaches. In this paper, we
propose creating semantically enriched document summaries by extracting
meaningful data from the result-figures specific to the evaluation metric of
the area under the curve (AUC) and their associated captions from full-text
documents. First, we classify the extracted figures and analyze them by
parsing the figure text, legends, and data plots, using a convolutional neural
network classification model with a ResNet-50 pre-trained on 1.2 million
images from ImageNet. Next, we extract information from the result-figures
specific to AUC by approximating the region under the function's graph as a
series of trapezoids and calculating its area, i.e., the trapezoidal rule.
Using over 12,000 figures extracted from 1,000 scholarly documents, we show
that figure-specialized summaries contain more enriched terms about figure
semantics. Furthermore, we empirically show that the trapezoidal rule can
calculate the area under the curve by dividing the curve into multiple
intervals. Finally, we measure the quality of the specialized summaries using
ROUGE, edit distance, and Jaccard similarity metrics. Overall, we observed
that figure-specialized summaries are more comprehensive and semantically
enriched. The applications of our research are numerous, including improved
document search, figure search, and figure-focused plagiarism detection. The
data and code used in this paper can be accessed at:
https://github.com/slab-itu/fig-ir/
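The trapezoidal rule mentioned above is easy to illustrate: given (x, y) points sampled from a curve, sum the area of the trapezoid over each interval. This pure-Python sketch uses hypothetical ROC-like points, not data from the paper:

```python
def trapezoidal_area(xs, ys):
    """Approximate the area under a sampled curve by summing a
    trapezoid over each consecutive interval (the trapezoidal rule)."""
    return sum(
        (xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2
        for i in range(len(xs) - 1)
    )

# Illustrative ROC-like points as might be read off a result-figure
# (hypothetical data): false-positive rate vs. true-positive rate.
fpr = [0.0, 0.1, 0.3, 0.6, 1.0]
tpr = [0.0, 0.5, 0.75, 0.9, 1.0]
print(trapezoidal_area(fpr, tpr))
```

Splitting the curve into more intervals tightens the approximation, which is the empirical point the abstract makes about dividing the curve into multiple intervals.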
VisText: A Benchmark for Semantically Rich Chart Captioning
Captions that describe or explain charts help improve recall and
comprehension of the depicted data and provide a more accessible medium for
people with visual disabilities. However, current approaches for automatically
generating such captions struggle to articulate the perceptual or cognitive
features that are the hallmark of charts (e.g., complex trends and patterns).
In response, we introduce VisText: a dataset of 12,441 pairs of charts and
captions that describe the charts' construction, report key statistics, and
identify perceptual and cognitive phenomena. In VisText, a chart is available
as three representations: a rasterized image, a backing data table, and a scene
graph -- a hierarchical representation of a chart's visual elements akin to a
web page's Document Object Model (DOM). To evaluate the impact of VisText, we
fine-tune state-of-the-art language models on our chart captioning task and
apply prefix-tuning to produce captions that vary the semantic content they
convey. Our models generate coherent, semantically rich captions and perform on
par with state-of-the-art chart captioning models across machine translation
and text generation metrics. Through qualitative analysis, we identify six
broad categories of errors that our models make that can inform future work.
Comment: Published at ACL 2023, 29 pages, 10 figures
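The DOM-like scene graph described above can be pictured as a small tree of visual elements. This sketch is my own simplification for illustration, not the VisText schema (field names and the example chart are invented):

```python
from dataclasses import dataclass, field

@dataclass
class SceneNode:
    """One visual element in a chart scene graph: a DOM-like tree where
    each node has a role, attributes, and child elements."""
    role: str                           # e.g., "chart", "axis", "mark"
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    def find(self, role):
        """Depth-first search for all nodes with the given role."""
        hits = [self] if self.role == role else []
        for child in self.children:
            hits.extend(child.find(role))
        return hits

# A toy bar chart as a scene graph (hypothetical structure).
chart = SceneNode("chart", {"title": "Revenue by year"}, [
    SceneNode("axis", {"orient": "x", "field": "year"}),
    SceneNode("axis", {"orient": "y", "field": "revenue"}),
    SceneNode("mark", {"type": "bar", "year": 2022, "revenue": 3.1}),
    SceneNode("mark", {"type": "bar", "year": 2023, "revenue": 4.7}),
])
print(len(chart.find("mark")))  # 2
```

Because the hierarchy is explicit, a captioning model can be conditioned on structural facts (axis fields, mark values) rather than having to recover them from pixels.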