Open-ended visual question answering
Wearable cameras generate a large number of photos, many of which are useless or redundant. On the other hand, these devices provide an excellent opportunity to create automatic questions and answers for reminiscence therapy. This is a follow-up of the BSc thesis developed by Ricard Mestre during Fall 2014 and of the MSc thesis developed by Aniol Lidon. This thesis studies methods to solve Visual Question-Answering (VQA) tasks with a Deep Learning framework. As a preliminary step, we explore Long Short-Term Memory (LSTM) networks used in Natural Language Processing (NLP) to tackle text-based Question-Answering. We then modify the previous model to accept an image as input in addition to the question. For this purpose, we explore the VGG-16 and K-CNN convolutional neural networks to extract visual features from the image. These features are merged with the word embedding or with a sentence embedding of the question to predict the answer. This work was successfully submitted to the Visual Question Answering Challenge 2016, where it achieved 53.62% accuracy on the test dataset. The developed software follows best programming practices and Python code style, providing a consistent baseline in Keras for different configurations. The source code and models are publicly available at https://github.com/imatge-upc/vqa-2016-cvprw.
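The fusion step this abstract describes (visual features merged with a question embedding to predict an answer) can be sketched in plain Python. This is a minimal sketch: the dimensions, weights, and answer candidates below are hypothetical toy values, not the thesis's actual Keras configuration.

```python
import math
import random

random.seed(0)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def vqa_predict(img_feat, q_embed, W, answers):
    # Fuse modalities by element-wise sum (concatenation is the other
    # common choice); both vectors are assumed to share one dimension.
    fused = [i + q for i, q in zip(img_feat, q_embed)]
    # A single linear layer + softmax stands in for the classifier head.
    logits = [sum(w * f for w, f in zip(row, fused)) for row in W]
    probs = softmax(logits)
    return answers[max(range(len(probs)), key=probs.__getitem__)]

# Toy example: 4-d features, 3 candidate answers, random weights.
dim, answers = 4, ["yes", "no", "two"]
W = [[random.uniform(-1, 1) for _ in range(dim)] for _ in answers]
img_feat = [0.2, 0.8, 0.1, 0.5]   # e.g. pooled CNN activations
q_embed  = [0.4, 0.1, 0.9, 0.3]   # e.g. LSTM sentence embedding
print(vqa_predict(img_feat, q_embed, W, answers))
```

In the actual system the weights would be learned end-to-end; here they are random, so only the data flow of the fusion is illustrated.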
Video Question Answering via Attribute-Augmented Attention Network Learning
Video Question Answering is a challenging problem in visual information retrieval: given a question, the system must provide an answer grounded in the referenced video content. However, existing visual question answering approaches mainly tackle static image questions, and may be ineffective for video question answering because they insufficiently model the temporal dynamics of video content. In this paper, we study the problem of video question answering by modeling its temporal dynamics with a frame-level attention mechanism. We propose an attribute-augmented attention network learning framework that enables joint frame-level attribute detection and unified video representation learning for video question answering. We then incorporate a multi-step reasoning process into the proposed attention network to further improve performance. We construct a large-scale video question answering dataset and conduct experiments on both multiple-choice and open-ended video question answering tasks to show the effectiveness of the proposed method.
Comment: Accepted for SIGIR 201
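The frame-level attention described above can be sketched as follows: each frame feature is scored against the question representation, and the softmax-normalized scores weight a sum over frames to form a unified video representation. The dot-product scoring and toy vectors are assumptions for illustration, not the paper's learned attention network.

```python
import math

def frame_attention(frames, query):
    # Affinity of each frame feature with the question representation
    # (a dot product stands in for the learned attention scorer).
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in frames]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Unified video representation: attention-weighted sum over frames.
    dim = len(frames[0])
    video = [sum(w * frame[d] for w, frame in zip(weights, frames))
             for d in range(dim)]
    return weights, video

frames = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # toy per-frame features
query  = [0.0, 1.0]                            # toy question embedding
weights, video = frame_attention(frames, query)
print(weights)
```

The multi-step reasoning the paper mentions would correspond to repeating this attention with an updated query (e.g. the previous query plus the attended video vector).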
Answer-Type Prediction for Visual Question Answering
Recently, algorithms for object recognition and related tasks have become sufficiently proficient that new vision tasks can now be pursued. In this paper, we build a system capable of answering open-ended text-based questions about images, a task known as Visual Question Answering (VQA). Our approach's key insight is that we can predict the form of the answer from the question. We formulate our solution in a Bayesian framework. When our approach is combined with a discriminative model, the combined model achieves state-of-the-art results on four benchmark datasets for open-ended VQA: DAQUAR, COCO-QA, The VQA Dataset, and Visual7W.
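The key insight above (predicting the form of the answer from the question, in a Bayesian framework) amounts to marginalizing over answer types. A toy sketch with hypothetical probabilities, not the paper's actual model:

```python
# Hypothetical distributions for one question/image pair.
p_type_given_q = {"yes/no": 0.7, "number": 0.2, "color": 0.1}
p_ans_given_type = {
    "yes/no": {"yes": 0.6, "no": 0.4},
    "number": {"2": 0.5, "3": 0.5},
    "color":  {"red": 1.0},
}

def best_answer(p_type, p_ans):
    # Marginalize over the answer type:
    # P(a | q, x) = sum_t P(a | t, q, x) * P(t | q)
    scores = {}
    for t, pt in p_type.items():
        for a, pa in p_ans[t].items():
            scores[a] = scores.get(a, 0.0) + pt * pa
    return max(scores, key=scores.get), scores

ans, scores = best_answer(p_type_given_q, p_ans_given_type)
print(ans)  # "yes": 0.7 * 0.6 = 0.42 beats every other candidate
```

Conditioning the answer distribution on a predicted type narrows the output space per question, which is the effect the abstract attributes to answer-type prediction.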
Generate then Select: Open-ended Visual Question Answering Guided by World Knowledge
The open-ended Visual Question Answering (VQA) task requires AI models to jointly reason over visual and natural language inputs using world knowledge. Recently, pre-trained Language Models (PLM) such as GPT-3 have been applied to the task and shown to be powerful world knowledge sources. However, these methods suffer from low knowledge coverage caused by PLM bias -- the tendency to generate certain tokens over other tokens regardless of prompt changes -- and from high dependency on PLM quality: only models using GPT-3 can achieve the best result.
To address the aforementioned challenges, we propose RASO: a new VQA pipeline that deploys a generate-then-select strategy guided by world knowledge for the first time. Rather than following the de facto standard of training a multi-modal model that directly generates the VQA answer, RASO first adopts a PLM to generate all the possible answers, and then trains a lightweight answer selection model to pick the correct one. As shown in our analysis, RASO expands the knowledge coverage from in-domain training data by a large margin. We provide extensive experimentation and show the effectiveness of our pipeline by advancing the state-of-the-art by 4.1% on OK-VQA, without additional computation cost. Code and models are released at http://cogcomp.org/page/publication_view/1010
Comment: Accepted to ACL 2023 Findings
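The generate-then-select strategy can be sketched generically. The generator and selector below are toy stand-ins (a fixed candidate list and an evidence-overlap score), assumptions for illustration rather than RASO's actual PLM or selection model.

```python
def generate_then_select(question, context, generate, select):
    # Step 1: a PLM proposes many candidate answers (high recall).
    candidates = generate(question)
    # Step 2: a lightweight selector scores each candidate against the
    # available evidence and keeps the best one (high precision).
    scored = [(select(question, context, c), c) for c in candidates]
    return max(scored)[1]

# Toy stand-ins for the PLM generator and the answer selector.
generate = lambda q: ["umbrella", "raincoat", "boots"]
select = lambda q, ctx, c: ctx.count(c)  # e.g. evidence-overlap score
print(generate_then_select("what keeps you dry?",
                           "an umbrella keeps you dry",
                           generate, select))
```

Splitting generation from selection lets the expensive PLM run once to enumerate candidates while a small trained model does the discrimination, which is how the abstract avoids extra computation cost at answer time.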
Open-Ended Medical Visual Question Answering Through Prefix Tuning of Language Models
Medical Visual Question Answering (VQA) is an important challenge, as it
would lead to faster and more accurate diagnoses and treatment decisions. Most
existing methods approach it as a multi-class classification problem, which
restricts the outcome to a predefined closed set of curated answers. We focus
on open-ended VQA and, motivated by recent advances in language models,
consider it a generative task. Leveraging pre-trained language models, we
introduce a novel method particularly suited for small, domain-specific,
medical datasets. To properly communicate the medical images to the language
model, we develop a network that maps the extracted visual features to a set of
learnable tokens. Then, alongside the question, these learnable tokens directly
prompt the language model. We explore recent parameter-efficient fine-tuning
strategies for language models, which allow for resource- and data-efficient
fine-tuning. We evaluate our approach on the main medical VQA benchmarks,
namely Slake, OVQA and PathVQA. The results demonstrate that our approach
outperforms existing methods across various training settings while also being
computationally efficient.
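The mapping-network idea above can be sketched as follows: image features are projected into a fixed number of prefix tokens, each of the language model's embedding width, and prepended to the question tokens. The mapper and all dimensions below are toy assumptions, not the paper's trained network.

```python
def visual_prefix(img_feat, mapper, k, d):
    # The mapping network turns image features into k prefix tokens,
    # each matching the language model's embedding size d.
    flat = mapper(img_feat)
    assert len(flat) == k * d
    return [flat[i * d:(i + 1) * d] for i in range(k)]

def build_prompt(prefix_tokens, question_tokens):
    # Prefix tokens are simply prepended; a frozen language model then
    # attends over them alongside the question embeddings.
    return prefix_tokens + question_tokens

# Toy mapper: a fixed map from a 2-d feature to k * d = 2 * 3 values.
mapper = lambda f: [f[0], f[1], f[0] + f[1]] * 2
prefix = visual_prefix([0.5, 1.5], mapper, k=2, d=3)
prompt = build_prompt(prefix, [[0.1, 0.2, 0.3]])
print(len(prompt))  # 2 visual prefix tokens + 1 question token = 3
```

Because only the small mapper (and, with parameter-efficient fine-tuning, a few extra parameters) is trained while the language model stays frozen, this setup fits the small, domain-specific medical datasets the abstract targets.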