189 research outputs found
Multi-Modal Answer Validation for Knowledge-Based VQA
The problem of knowledge-based visual question answering involves answering
questions that require external knowledge in addition to the content of the
image. Such knowledge typically comes in various forms, including visual,
textual, and commonsense knowledge. Using more knowledge sources increases the
chance of retrieving more irrelevant or noisy facts, making it challenging to
comprehend the facts and find the answer. To address this challenge, we propose
Multi-modal Answer Validation using External knowledge (MAVEx), where the idea
is to validate a set of promising answer candidates based on answer-specific
knowledge retrieval. Instead of searching for the answer in a vast collection
of often irrelevant facts as most existing approaches do, MAVEx aims to learn
how to extract relevant knowledge from noisy sources, which knowledge source to
trust for each answer candidate, and how to validate the candidate using that
source. Our multi-modal setting is the first to leverage external visual
knowledge (images searched using Google), in addition to textual knowledge in
the form of Wikipedia sentences and ConceptNet concepts. Our experiments with
OK-VQA, a challenging knowledge-based VQA dataset, demonstrate that MAVEx
achieves new state-of-the-art results. Our code is available at
https://github.com/jialinwu17/MAVEX
Comment: AAAI 202
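The answer-validation idea can be sketched as scoring each candidate by trust-weighted support from answer-specific evidence. This is a hypothetical illustration only: `validate_candidates`, `retrieve`, `trust`, and every score below are invented stand-ins, not MAVEx's actual learned components.

```python
# Hypothetical sketch of MAVEx-style answer validation.
# All names and scores are illustrative, not the authors' implementation.

def validate_candidates(candidates, retrieve, sources, trust):
    """Score each answer candidate by how well each knowledge source
    supports it, weighting sources by a (learned) trust score."""
    best, best_score = None, float("-inf")
    for cand in candidates:
        # Retrieve answer-specific support from every source and combine
        # the support scores with per-source trust weights.
        score = sum(trust[s] * retrieve(s, cand) for s in sources)
        if score > best_score:
            best, best_score = cand, score
    return best

# Toy support scores for a question like "What sport is shown?" with
# candidates produced by a VQA model.
support = {
    ("wikipedia", "surfing"): 0.9, ("wikipedia", "skiing"): 0.2,
    ("conceptnet", "surfing"): 0.6, ("conceptnet", "skiing"): 0.3,
    ("images", "surfing"): 0.8, ("images", "skiing"): 0.1,
}
answer = validate_candidates(
    ["surfing", "skiing"],
    retrieve=lambda s, c: support[(s, c)],
    sources=["wikipedia", "conceptnet", "images"],
    trust={"wikipedia": 1.0, "conceptnet": 0.5, "images": 0.8},
)
```

The point of the sketch is the shift in direction: rather than searching one large fact pool for an answer, each candidate answer drives its own retrieval and is validated against it.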
Dual Attention on Pyramid Feature Maps for Image Captioning
Generating natural sentences from images is a fundamental learning task for
visual-semantic understanding in multimedia. In this paper, we propose to apply
dual attention on pyramid image feature maps to fully explore the
visual-semantic correlations and improve the quality of generated sentences.
Specifically, with the full consideration of the contextual information
provided by the hidden state of the RNN controller, the pyramid attention can
better localize the visually indicative and semantically consistent regions in
images. On the other hand, the contextual information can help re-calibrate the
importance of feature components by learning the channel-wise dependencies, to
improve the discriminative power of visual features for better content
description. We conducted comprehensive experiments on three well-known
datasets, Flickr8K, Flickr30K, and MS COCO, achieving impressive results in
generating descriptive and fluent natural sentences from images. Using either
convolutional visual features or more informative bottom-up attention features,
our composite captioning model achieves very promising performance in a
single-model setting. The proposed pyramid attention and dual attention methods
are highly modular and can be inserted into various image captioning models
to further improve performance.
Comment: in IEEE Transactions on Multimedia, 202
Rationalizing Text Matching: Learning Sparse Alignments via Optimal Transport
Selecting input features of top relevance has become a popular method for
building self-explaining models. In this work, we extend this selective
rationalization approach to text matching, where the goal is to jointly select
and align text pieces, such as tokens or sentences, as a justification for the
downstream prediction. Our approach employs optimal transport (OT) to find a
minimal cost alignment between the inputs. However, directly applying OT often
produces dense and therefore uninterpretable alignments. To overcome this
limitation, we introduce novel constrained variants of the OT problem that
result in highly sparse alignments with controllable sparsity. Our model is
end-to-end differentiable using the Sinkhorn algorithm for OT and can be
trained without any alignment annotations. We evaluate our model on the
StackExchange, MultiNews, e-SNLI, and MultiRC datasets. Our model achieves very
sparse rationale selections with high fidelity while preserving prediction
accuracy compared to strong attention baseline models.
Comment: To appear at ACL 202
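The OT machinery can be illustrated with a minimal Sinkhorn iteration, the standard solver for entropy-regularized OT that the abstract names. The cost matrix, `eps`, and iteration count below are illustrative; the paper's constrained sparse variants build on top of this basic scheme.

```python
import math

# Minimal Sinkhorn iteration for entropy-regularized optimal transport.
# A standard textbook sketch, not the paper's constrained variants.

def sinkhorn(cost, a, b, eps=0.1, iters=200):
    """Return an alignment matrix P whose row sums approach `a` and
    column sums approach `b`, minimizing <P, cost> - eps * H(P)."""
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * len(a)
    v = [1.0] * len(b)
    for _ in range(iters):
        # Alternate row and column scaling until the marginals match.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(len(b)))
             for i in range(len(a))]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(len(a)))
             for j in range(len(b))]
    return [[u[i] * K[i][j] * v[j] for j in range(len(b))]
            for i in range(len(a))]

# Toy token-alignment cost: low cost means similar tokens, so the
# recovered plan concentrates mass on the diagonal.
P = sinkhorn([[0.0, 1.0], [1.0, 0.0]], a=[0.5, 0.5], b=[0.5, 0.5])
```

Because each step is differentiable, the whole alignment can sit inside an end-to-end trained model, which is what makes annotation-free training of the rationale selector possible.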
What's in a Name? Beyond Class Indices for Image Recognition
Existing machine learning models demonstrate excellent performance in image
object recognition after training on a large-scale dataset under full
supervision. However, these models only learn to map an image to a predefined
class index, without revealing the actual semantic meaning of the object in the
image. In contrast, vision-language models like CLIP are able to assign
semantic class names to unseen objects in a 'zero-shot' manner, although they
still rely on a predefined set of candidate names at test time. In this paper,
we reconsider the recognition problem and task a vision-language model to
assign class names to images given only a large and essentially unconstrained
vocabulary of categories as prior information. We use non-parametric methods to
establish relationships between images which allow the model to automatically
narrow down the set of possible candidate names. Specifically, we propose
iteratively clustering the data and voting on class names within the clusters,
showing that this enables a roughly 50% improvement over the baseline on
ImageNet. Furthermore, we tackle this problem in both unsupervised and
partially supervised settings, as well as with both coarse-grained and
fine-grained search spaces as the unconstrained dictionary.
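The cluster-then-vote step can be sketched in a few lines. The cluster assignments and per-image candidate names below are toy stand-ins for what a vision-language model such as CLIP would produce over a large vocabulary; `vote_names` is a hypothetical helper, not the paper's implementation.

```python
from collections import Counter

# Hypothetical sketch of the cluster-then-vote idea: images in the same
# cluster pool their candidate names and adopt the majority winner.

def vote_names(clusters, candidates):
    """For each cluster of images, vote over the per-image candidate
    names and assign the winning name to every image in the cluster."""
    labels = {}
    for cluster in clusters:
        votes = Counter(name for img in cluster for name in candidates[img])
        winner = votes.most_common(1)[0][0]
        for img in cluster:
            labels[img] = winner
    return labels

# Three images of the same dog breed: individual guesses disagree,
# but the cluster-level vote settles on a single consistent name.
labels = vote_names(
    clusters=[["img1", "img2", "img3"]],
    candidates={"img1": ["husky", "wolf"],
                "img2": ["husky", "malamute"],
                "img3": ["husky"]},
)
```

Iterating this procedure lets noisy per-image guesses reinforce each other, which is how the candidate set shrinks without a predefined label list.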
Document And Query Expansion Method With Dirichlet Smoothing Model For Retrieval Of Metadata Content In Digital Resource Objects
In this thesis, an IR framework is proposed which consists of three main stages: an enhanced document expansion (EDE) method, an adaptive structured Dirichlet smoothing (ASDS) model, and a semantic query expansion (SQE) method. The first stage proposes the EDE method, in which a new procedure enlarges the content of each metadata unit by adding new information that is relevant and close to that unit in each document. The second stage proposes the ASDS model, which improves the Dirichlet smoothing model in two ways: the first takes the document structure into account, as in the proposed structured Dirichlet smoothing (SDS) model, while the second modifies the parameters used in the model, as in the proposed adaptive Dirichlet smoothing (ADS) model. The third stage proposes the SQE method, which enhances the retrieval performance of digital resource objects (DROs) by improving the quality of the candidate terms that are added semantically to the query terms.
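The Dirichlet smoothing model that the thesis builds on is the standard query-likelihood formulation p(w|d) = (c(w; d) + μ·p(w|C)) / (|d| + μ). The sketch below shows that baseline only; the document collection, query, and μ value are illustrative, and the thesis's SDS/ADS refinements (document structure, adaptive parameters) are not modeled here.

```python
import math
from collections import Counter

# Standard Dirichlet-smoothed language-model scoring: the baseline the
# proposed ASDS model refines. Data and mu are illustrative only.

def dirichlet_score(query, doc, collection, mu=2000.0):
    """log p(query | doc) with Dirichlet smoothing:
    p(w|d) = (c(w; d) + mu * p(w|C)) / (|d| + mu)."""
    doc_tf = Counter(doc)
    coll_tf = Counter(collection)
    coll_len = len(collection)
    score = 0.0
    for w in query:
        p_wc = coll_tf[w] / coll_len            # background (collection) model
        p_wd = (doc_tf[w] + mu * p_wc) / (len(doc) + mu)
        score += math.log(p_wd)
    return score

# A tiny metadata-unit "document" scored against a two-term query.
doc = ["retrieval", "of", "metadata", "content"]
coll = doc + ["digital", "resource", "objects", "metadata"]
s = dirichlet_score(["metadata", "retrieval"], doc, coll, mu=10.0)
```

The role of μ is the key design knob: larger values pull term probabilities toward the collection model, which is exactly the kind of parameter the proposed ADS variant adapts rather than fixing globally.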
Neural Natural Language Generation: A Survey on Multilinguality, Multimodality, Controllability and Learning
Developing artificial learning systems that can understand and generate natural language has been one of the long-standing goals of artificial intelligence. Recent decades have witnessed impressive progress on both of these problems, giving rise to a new family of approaches. In particular, the advances in deep learning over the past couple of years have led to neural approaches to natural language generation (NLG). These methods combine generative language learning techniques with neural network-based frameworks. With a wide range of applications in natural language processing, neural NLG (NNLG) is a new and fast-growing field of research. In this state-of-the-art report, we investigate the recent developments and applications of NNLG to their full extent from a multidimensional view, covering critical perspectives such as multimodality, multilinguality, controllability, and learning strategies. We summarize the fundamental building blocks of NNLG approaches from these aspects and provide detailed reviews of commonly used preprocessing steps and basic neural architectures. This report also focuses on the seminal applications of these NNLG models, such as machine translation, description generation, automatic speech recognition, abstractive summarization, text simplification, question answering and generation, and dialogue generation. Finally, we conclude with a thorough discussion of the described frameworks by pointing out some open research directions.
This work has been partially supported by the European Commission ICT COST Action “Multi-task, Multilingual, Multi-modal Language Generation” (CA18231). AE was supported by the BAGEP 2021 Award of the Science Academy. EE was supported in part by the TUBA GEBIP 2018 Award. BP is in part funded by Independent Research Fund Denmark (DFF) grant 9063-00077B. IC has received funding from the European Union’s Horizon 2020 research and innovation programme under Marie Sklodowska-Curie grant agreement No 838188. EL is partly funded by Generalitat Valenciana and the Spanish Government through projects PROMETEU/2018/089 and RTI2018-094649-B-I00, respectively. SMI is partly funded by UNIRI project uniri-drustv-18-20. GB is partly supported by the Ministry of Innovation and the National Research, Development and Innovation Office within the framework of the Hungarian Artificial Intelligence National Laboratory Programme. COT is partially funded by the Romanian Ministry of European Investments and Projects through the Competitiveness Operational Program (POC) project “HOLOTRAIN” (grant no. 29/221 ap2/07.04.2020, SMIS code: 129077) and by the German Academic Exchange Service (DAAD) through the project “AWAKEN: content-Aware and netWork-Aware faKE News mitigation” (grant no. 91809005). ESA is partially funded by the German Academic Exchange Service (DAAD) through the project “Deep-Learning Anomaly Detection for Human and Automated Users Behavior” (grant no. 91809358).
A survey on knowledge-enhanced multimodal learning
Multimodal learning has been a field of increasing interest, aiming to
combine various modalities in a single joint representation. Especially in the
area of visiolinguistic (VL) learning, multiple models and techniques have been
developed, targeting a variety of tasks that involve images and text. VL models
have reached unprecedented performance by extending the idea of Transformers,
so that both modalities can learn from each other. Massive pre-training
procedures enable VL models to acquire a certain level of real-world
understanding, although many gaps remain: the limited comprehension of
commonsense, factual, temporal, and other everyday knowledge aspects calls
into question the extensibility of VL tasks. Knowledge graphs and other knowledge
sources can fill those gaps by explicitly providing missing information,
unlocking novel capabilities of VL models. At the same time, knowledge graphs
enhance the explainability, fairness, and validity of decision making, issues of
utmost importance for such complex implementations. The current survey aims
to unify the fields of VL representation learning and knowledge graphs, and
provides a taxonomy and analysis of knowledge-enhanced VL models.
CONDA-PM -- A Systematic Review and Framework for Concept Drift Analysis in Process Mining
Business processes evolve over time to adapt to changing business
environments. This requires continuous monitoring of business processes to gain
insights into whether they conform to the intended design or deviate from it.
The situation when a business process changes while being analysed is denoted
as Concept Drift. Its analysis is concerned with studying how a business
process changes, in terms of detecting and localising changes and studying the
effects of the latter. Concept drift analysis is crucial to enable early
detection and management of changes, that is, whether to promote a change to
become part of an improved process, or to reject the change and make decisions
to mitigate its effects. Despite its importance, there exists no comprehensive
framework for analysing concept drift types, affected process perspectives, and
granularity levels of a business process. This article proposes the CONcept
Drift Analysis in Process Mining (CONDA-PM) framework describing phases and
requirements of a concept drift analysis approach. CONDA-PM was derived from a
Systematic Literature Review (SLR) of current approaches analysing concept
drift. We apply the CONDA-PM framework to current approaches to concept drift
analysis and evaluate their maturity. Applying the CONDA-PM framework highlights
areas where research is needed to complement existing efforts.
Comment: 45 pages, 11 tables, 13 figures