Advancing Multi-Modal Deep Learning: Towards Language-Grounded Visual Understanding
Using deep learning, computer vision now rivals humans at object recognition
and detection, opening the door to new challenges in image understanding.
Among these challenges, understanding and reasoning about language-grounded
visual content is of fundamental importance to advancing artificial
intelligence. Recently, multiple datasets and algorithms have been created as
proxy tasks toward this goal, with visual question answering (VQA) being the
most widely studied: an algorithm must produce an answer to a natural
language question about an image. However, our survey of datasets and
algorithms for VQA uncovered several sources of dataset bias and sub-optimal
evaluation metrics that allowed algorithms to perform well merely by
exploiting superficial statistical patterns. In this dissertation, we
describe new algorithms and datasets that address these issues. We developed
two new datasets and evaluation metrics that enable a more accurate
measurement of a VQA model's abilities, and that expand VQA to include new
abilities such as reading text, handling out-of-vocabulary words, and
understanding data visualizations. We also created new algorithms that have
advanced the state of the art for VQA, including one that surpasses humans on
two chart question answering datasets covering bar charts, line graphs, and
pie charts. Finally, we provide a holistic overview of several yet-unsolved
challenges in not only VQA but vision-and-language research at large. Despite
enormous progress, we find that a robust understanding and integration of
vision and language remains an elusive goal, and much of the reported
progress may be misleading due to dataset bias, superficial correlations, and
flaws in standard evaluation metrics. We carefully study and categorize these
issues for several vision-and-language tasks and outline possible paths
toward the development of safe, robust, and trustworthy AI for
language-grounded visual understanding.
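As a concrete illustration of how evaluation can be gamed, the following
minimal Python sketch implements the consensus-based accuracy metric
popularized by the original VQA benchmark, where an answer scores
min(matching annotations / 3, 1); the helper name and example answers are
ours, not the dissertation's.

    def vqa_accuracy(predicted, human_answers):
        """Consensus accuracy over the 10 human answers per question:
        an answer is fully correct if at least 3 annotators gave it."""
        matches = sum(1 for ans in human_answers if ans == predicted)
        return min(matches / 3.0, 1.0)

    # A language-only model that always answers "2" to "how many ..."
    # questions scores perfectly here without ever looking at the image,
    # the kind of superficial statistical pattern the dissertation targets.
    human = ["2", "2", "2", "two", "3", "2", "2", "2", "2", "2"]
    print(vqa_accuracy("2", human))  # 1.0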
Multimodal Explainable Artificial Intelligence: A Comprehensive Review of Methodological Advances and Future Research Directions
This study systematically analyzes recent advances in the field of
Multimodal eXplainable Artificial Intelligence (MXAI). In particular, the
relevant primary prediction tasks and publicly available datasets are first
described. Subsequently, a structured presentation of the MXAI methods in the
literature is provided, organized by the following criteria: a) the number of
involved modalities, b) the stage at which explanations are produced, and
c) the type of methodology adopted (i.e., the mathematical formalism). The
metrics used for MXAI evaluation are then discussed. Finally, a comprehensive
analysis of current challenges and future research directions is provided.
ChartReader: A Unified Framework for Chart Derendering and Comprehension without Heuristic Rules
Charts are a powerful tool for visually conveying complex data, but their
comprehension poses a challenge due to the diverse chart types and intricate
components. Existing chart comprehension methods suffer from either heuristic
rules or an over-reliance on OCR systems, resulting in suboptimal performance.
To address these issues, we present ChartReader, a unified framework that
seamlessly integrates chart derendering and comprehension tasks. Our approach
includes a transformer-based chart component detection module and an extended
pre-trained vision-language model for chart-to-X tasks. By learning the rules
of charts automatically from annotated datasets, our approach eliminates the
need for manual rule-making, reducing effort and enhancing accuracy. We also
introduce a data variable replacement technique and extend the input and
position embeddings of the pre-trained model for cross-task training. We
evaluate ChartReader on Chart-to-Table, ChartQA, and Chart-to-Text tasks,
demonstrating its superiority over existing methods. Our proposed framework can
significantly reduce the manual effort involved in chart analysis, providing a
step towards a universal chart understanding model. Moreover, our approach
offers opportunities for plug-and-play integration with mainstream LLMs such as
T5 and TaPas, extending their capability to chart comprehension tasks. The code
is available at https://github.com/zhiqic/ChartReader.
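To make the data variable replacement idea concrete, here is a hedged Python
sketch of one plausible version of that step: numbers in chart text are
swapped for placeholder tokens so structure can be learned independently of
literal values. The function name, regex, and token format are our
assumptions, not ChartReader's actual code.

    import re

    def replace_data_variables(text):
        """Swap each number in chart text for a placeholder token."""
        mapping = {}
        def substitute(match):
            token = "<VAR_%d>" % len(mapping)
            mapping[token] = match.group(0)  # remember the original value
            return token
        templated = re.sub(r"\d+(?:\.\d+)?", substitute, text)
        return templated, mapping

    templated, mapping = replace_data_variables(
        "Sales rose from 120 to 345.5 in 2020")
    print(templated)  # Sales rose from <VAR_0> to <VAR_1> in <VAR_2>
    print(mapping)    # {'<VAR_0>': '120', '<VAR_1>': '345.5', '<VAR_2>': '2020'}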
Enhanced Chart Understanding in Vision and Language Task via Cross-modal Pre-training on Plot Table Pairs
Building cross-modal intelligence that can understand charts and communicate
the salient information hidden behind them is an appealing challenge in the
vision and language (V+L) community. The capability to uncover the underlying
table data of chart figures is critical to automatic chart understanding.
We introduce ChartT5, a V+L model that learns how to interpret table
information from chart images via cross-modal pre-training on plot table pairs.
Specifically, we propose two novel pre-training objectives, Masked Header
Prediction (MHP) and Masked Value Prediction (MVP), which equip the model with
different skills for interpreting table information. We have conducted
extensive experiments on chart question answering and chart summarization to
verify the effectiveness of the proposed pre-training strategies. In
particular, on the ChartQA benchmark, our ChartT5 outperforms the
state-of-the-art non-pre-training methods by over 8%.
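For intuition, a minimal sketch (not the authors' implementation) of how MHP-
and MVP-style corruption could be applied to a flattened chart table follows;
the mask token, masking rate, and table encoding are illustrative
assumptions.

    import random

    MASK = "<mask>"

    def mask_table(headers, rows, objective, p=0.3, seed=0):
        """Corrupt a flattened chart table for MHP or MVP pre-training."""
        rng = random.Random(seed)
        targets = []  # ground-truth tokens the model must reconstruct

        def maybe_mask(token):
            if rng.random() < p:
                targets.append(token)
                return MASK
            return token

        if objective == "MHP":  # Masked Header Prediction: hide headers
            headers = [maybe_mask(h) for h in headers]
        else:                   # Masked Value Prediction: hide cell values
            rows = [[maybe_mask(v) for v in row] for row in rows]
        return headers, rows, targets

    h, r, t = mask_table(["Year", "Sales"],
                         [["2020", "120"], ["2021", "345"]], objective="MVP")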
DCQA: Document-Level Chart Question Answering towards Complex Reasoning and Common-Sense Understanding
Visually-situated languages such as charts and plots are omnipresent in
real-world documents. These graphical depictions are human-readable and are
often analyzed in visually-rich documents to address a variety of questions
that necessitate complex reasoning and common-sense responses. Despite the
growing number of datasets that aim to answer questions over charts, most only
address this task in isolation, without considering the broader context of
document-level question answering. Moreover, such datasets lack adequate
common-sense reasoning information in their questions. In this work, we
introduce a novel task named document-level chart question answering (DCQA).
The goal of this task is to conduct document-level question answering,
extracting charts or plots in the document via document layout analysis (DLA)
first and subsequently performing chart question answering (CQA). The newly
developed benchmark dataset comprises 50,010 synthetic documents integrating
charts in a wide range of styles (6 styles in contrast to 3 for PlotQA and
ChartQA) and includes 699,051 questions that demand a high degree of reasoning
ability and common-sense understanding. In addition, we present a potent
question-answer generation engine that employs table data, a rich
color set, and basic question templates to produce a vast array of reasoning
question-answer pairs automatically. Based on DCQA, we devise an OCR-free
transformer for document-level chart-oriented understanding, capable of DLA and
answering complex reasoning and common-sense questions over charts in an
OCR-free manner. Our DCQA dataset is expected to foster research on
understanding visualizations in documents, especially in scenarios that
require complex reasoning over charts in visually-rich documents. We
implement and evaluate a set of baselines, and our proposed method achieves
comparable results.
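To illustrate what template-driven generation over table data can look like,
here is a hedged Python sketch of a tiny question-answer engine; the three
templates and the {category: value} table format are hypothetical stand-ins
for the far richer DCQA engine.

    def generate_qa_pairs(title, table):
        """Fill basic question templates from a {category: value} table."""
        items = sorted(table.items(), key=lambda kv: kv[1])
        (lo_k, lo_v), (hi_k, hi_v) = items[0], items[-1]
        return [
            ("Which category has the highest value in '%s'?" % title, hi_k),
            ("Which category has the lowest value in '%s'?" % title, lo_k),
            ("What is the difference between %s and %s?" % (hi_k, lo_k),
             str(hi_v - lo_v)),
        ]

    for q, a in generate_qa_pairs("Monthly sales",
                                  {"Jan": 120, "Feb": 95, "Mar": 180}):
        print(q, "->", a)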
A Survey on ML4VIS: Applying Machine Learning Advances to Data Visualization
Inspired by the great success of machine learning (ML), researchers have
applied ML techniques to visualizations to achieve a better design,
development, and evaluation of visualizations. This branch of studies, known as
ML4VIS, is gaining increasing research attention in recent years. To
successfully adapt ML techniques for visualizations, a structured
understanding of the integration of ML4VIS is needed. In this paper, we
systematically survey 88 ML4VIS studies, aiming to answer two motivating
questions: "what visualization processes can be assisted by ML?" and "how can
ML techniques be used to solve visualization problems?" This survey reveals
seven main processes where the employment of ML techniques can benefit
visualizations: Data Processing4VIS, Data-VIS Mapping, Insight Communication,
Style Imitation, VIS Interaction, VIS Reading, and User Profiling. The seven
processes are related
to existing visualization theoretical models in an ML4VIS pipeline, aiming to
illuminate the role of ML-assisted visualization in general
visualizations. Meanwhile, the seven processes are mapped into main learning
tasks in ML to align the capabilities of ML with the needs in visualization.
Current practices and future opportunities of ML4VIS are discussed in the
context of the ML4VIS pipeline and the ML-VIS mapping. While more studies are
still needed in the area of ML4VIS, we hope this paper can provide a
stepping-stone for future exploration. A web-based interactive browser of this
survey is available at https://ml4vis.github.io.