SVIT: Scaling up Visual Instruction Tuning
Thanks to the emergence of foundation models, large language and vision
models have been integrated to acquire multimodal abilities such as visual captioning,
dialogue, question answering, etc. Although existing multimodal models present
impressive performance of visual understanding and reasoning, their limits are
still largely under-explored due to the scarcity of high-quality instruction
tuning data. To push the limits of multimodal capability, we Scale up Visual
Instruction Tuning (SVIT) by constructing a dataset of 3.2 million visual
instruction tuning examples, including 1.6M conversation question-answer (QA)
pairs, 1.6M complex reasoning QA pairs, and 106K detailed image descriptions.
Beyond its volume, the proposed dataset also features high quality and rich
diversity, as it is generated by prompting GPT-4 with abundant manual image
annotations. We empirically verify that training multimodal models on SVIT
significantly improves multimodal performance in terms of visual perception,
reasoning, and planning.
Deconstructing Visual Images of 1Malaysia
As Malaysia is a multiracial country, Prime Minister Najib introduced the concept of 1Malaysia to protect each ethnic group and to bring unity to the country. To inform people about the importance of unity, the media has been employed to publicize the concept by distributing images of the 1Malaysia logo. 1Malaysia is now being fetishized so much so that even public transport vehicles are painted with the 1Malaysia logo. To an outsider's eye, this fetishization seems surprising and complex. By deconstructing images of 1Malaysia in the media from an outsider's perspective, this paper examines the function of these visual discourses on Malaysians and asks whether the Malay, Chinese, and Indian communities perceive themselves together as 1Malaysians. We argue that Malaysia is still a work in progress toward achieving 'unity in diversity.' Keywords: Malaysia, 1Malaysia, images, unity, diversity, ethnicity
Which visual questions are difficult to answer? Analysis with Entropy of Answer Distributions
We propose a novel approach to identify the difficulty of visual questions
for Visual Question Answering (VQA) without direct supervision or annotations
to the difficulty. Prior works have considered the diversity of ground-truth
answers of human annotators. In contrast, we analyze the difficulty of visual
questions based on the behavior of multiple different VQA models. We propose to
cluster the entropy values of the predicted answer distributions obtained by
three different models: a baseline method that takes as input images and
questions, and two variants that take as input images only and questions only.
We use a simple k-means to cluster the visual questions of the VQA v2
validation set. Then we use state-of-the-art methods to determine the accuracy
and the entropy of the answer distributions for each cluster. A benefit of the
proposed method is that no annotation of the difficulty is required, because
the accuracy of each cluster reflects the difficulty of visual questions that
belong to it. Our approach can identify clusters of difficult visual questions
that are not answered correctly by state-of-the-art methods. Detailed analysis
on the VQA v2 dataset reveals that 1) all methods show poor performance on the
most difficult cluster (about 10% accuracy), 2) as the cluster difficulty
increases, the answers predicted by the different methods begin to differ, and
3) the values of cluster entropy are highly correlated with the cluster
accuracy. We show that our approach has the advantage of being able to assess
the difficulty of visual questions without ground-truth (i.e. the test set of
VQA v2) by assigning them to one of the clusters. We expect that this can
stimulate the development of novel directions of research and new algorithms.
Clustering results are available online at https://github.com/tttamaki/vqd. Comment: accepted by IEEE Access, available at https://doi.org/10.1109/ACCESS.2020.3022063 as "An Entropy Clustering Approach for Assessing Visual Question Difficulty".
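The pipeline this abstract describes, per-question entropies of the answer distributions from three models, clustered with k-means, can be sketched roughly as follows. The toy distributions, feature layout, and cluster count below are illustrative assumptions, not the authors' exact setup:

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of one predicted answer distribution."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p + eps)))

def kmeans(X, k, iters=100, seed=0):
    """Minimal k-means over per-question entropy feature vectors."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign each question to its nearest cluster center.
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers

# Toy example: a peaked distribution (a confident model, low entropy,
# an "easy" question) vs. a near-uniform one (high entropy, "hard").
dists = [
    [0.98, 0.01, 0.01],
    [0.34, 0.33, 0.33],
]
```

In the paper's setting each question would contribute a three-dimensional feature, one entropy per model variant (image+question, image-only, question-only), and the cluster's mean accuracy then serves as its difficulty label.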
Effectiveness of dermoscopy in skin cancer diagnosis
Clinical Inquiries question: Does dermoscopy improve the effectiveness of skin cancer diagnosis when used for skin cancer screening? Evidence-based answer: Dermoscopy added to visual inspection is more accurate than visual inspection alone in the diagnosis of melanoma and basal cell carcinoma (BCC). However, there is insufficient evidence to draw conclusions on the effectiveness of dermoscopy in the diagnosis of squamous cell carcinoma (SCC; strength of recommendation B: based on systematic reviews of randomized controlled trials [RCTs], and prospective and retrospective observational studies). Sydney Davis, MD; Cleveland Piggott, MD, MPH; Corey Lyon, DO; Kristen DeSanto, MSLS, MS, RD, AHIP. Dr Davis is a resident family physician, Dr Piggott is Assistant Professor and Director of Diversity & Health Equity for Family Medicine, Dr Lyon is Associate Professor in the Department of Family Medicine, and Ms DeSanto is Clinical Librarian in the Strauss Health Sciences Library, all at the University of Colorado in Denver. Includes bibliographical references
Creativity: Generating Diverse Questions using Variational Autoencoders
Generating diverse questions for given images is an important task for
computational education, entertainment and AI assistants. Unlike many
conventional prediction tasks, this requires algorithms to generate a
diverse set of plausible questions, which we refer to as "creativity". In this
paper we propose a creative algorithm for visual question generation which
combines the advantages of variational autoencoders with long short-term memory
networks. We demonstrate that our framework is able to generate a large set of
varying questions given a single input image. Comment: Accepted to CVPR 201
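As a rough illustration of the idea (not the paper's implementation), the VAE's reparameterization trick is what lets the model draw many distinct latent codes for a single image; a decoder such as an LSTM would then map each code to a different question. The latent dimension and the encoder outputs below are made up for the sketch:

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample a latent code z = mu + sigma * eps (the VAE reparameterization trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(42)

# Hypothetical encoder output for one image: one (mu, log_var) pair
# over an 8-dimensional latent space.
mu = np.zeros(8)
log_var = np.zeros(8)

# Drawing several latent samples for the same image yields distinct codes;
# decoding each one would yield a distinct question, i.e. a diverse set.
samples = [reparameterize(mu, log_var, rng) for _ in range(5)]
```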
Learning by Asking Questions
We introduce an interactive learning framework for the development and
testing of intelligent visual systems, called learning-by-asking (LBA). We
explore LBA in the context of the Visual Question Answering (VQA) task. LBA differs
from standard VQA training in that most questions are not observed during
training time, and the learner must ask questions it wants answers to. Thus,
LBA more closely mimics natural learning and has the potential to be more
data-efficient than the traditional VQA setting. We present a model that
performs LBA on the CLEVR dataset, and show that it automatically discovers an
easy-to-hard curriculum when learning interactively from an oracle. Our LBA
generated data consistently matches or outperforms the CLEVR train data and is
more sample efficient. We also show that our model asks questions that
generalize to state-of-the-art VQA models and to novel test-time distributions.
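A toy version of the ask-what-you-are-unsure-about loop might look like the sketch below. Note the selection rule here is a simple fixed entropy heuristic, an assumption for illustration; in the paper the learner itself learns which questions to ask the oracle:

```python
import numpy as np

def pick_question(model_probs):
    """Select the candidate question whose predicted answer distribution
    has the highest entropy, i.e. the one the learner is least sure about."""
    probs = np.asarray(model_probs, dtype=float)
    ent = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return int(np.argmax(ent))

# Hypothetical pool: the model's current answer distributions for three
# candidate questions about an image. The learner asks the most uncertain
# one, receives the oracle's answer, and would add that (question, answer)
# pair to its training set before repeating.
pool = [
    [0.90, 0.05, 0.05],
    [0.40, 0.30, 0.30],
    [0.55, 0.35, 0.10],
]
chosen = pick_question(pool)  # 1: the near-uniform distribution
```

Because confident (low-entropy) questions are skipped and uncertain ones are asked first, a loop like this naturally traces out the easy-to-hard curriculum the abstract mentions.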
Hard to Cheat: A Turing Test based on Answering Questions about Images
Progress in language and image understanding by machines has sparked the
interest of the research community in more open-ended, holistic tasks, and
refueled an old AI dream of building intelligent machines. We discuss a few
prominent challenges that characterize such holistic tasks and argue for
"question answering about images" as a particular appealing instance of such a
holistic task. In particular, we point out that it is a version of a Turing
Test that is likely to be more robust to over-interpretations and contrast it
with tasks like grounding and generation of descriptions. Finally, we discuss
tools to measure progress in this field. Comment: Presented in AAAI-15 Workshop: Beyond the Turing Test
An Interactive Tablecloth for Facilitating Discussion in a Culturally Diverse Group
Group discussions are a useful tool in a number of environments: from working towards a common goal in a business setting, to gathering feedback on an exhibit in a museum, for example. One issue in such sessions is that some group members can talk more loudly and confidently than others, making some group members change their minds or keep quiet; this can result in interesting differences of opinion being lost. In this paper we present a tool for facilitating such group discussions. The tool is an interactive tablecloth that is controlled with tangible interfaces, and it provides a method for each group member's voice to be heard prior to discussion, thus preserving the diversity of responses. When tested after an immersive theatre performance, the tool effectively allowed each group member to answer questions individually before group discussion began. This also allowed the facilitator to coordinate the discussion efficiently.