SVIT: Scaling up Visual Instruction Tuning
Thanks to the emergence of foundation models, large language and vision
models have been integrated to acquire multimodal abilities such as visual captioning,
dialogue, question answering, etc. Although existing multimodal models present
impressive performance of visual understanding and reasoning, their limits are
still largely under-explored due to the scarcity of high-quality instruction
tuning data. To push the limits of multimodal capability, we Scale up Visual
Instruction Tuning (SVIT) by constructing a dataset of 3.2 million visual
instruction tuning examples, including 1.6M conversation question-answer (QA)
pairs, 1.6M complex reasoning QA pairs, and 106K detailed image descriptions.
Beyond its volume, the proposed dataset also features high quality and rich
diversity, as it is generated by prompting GPT-4 with abundant manual image
annotations. We empirically verify that training multimodal models on SVIT
significantly improves multimodal performance in terms of visual perception,
reasoning, and planning.
Deconstructing Visual Images of 1Malaysia
As Malaysia is a multiracial country, Prime Minister Najib introduced the concept of 1Malaysia to protect each ethnic group and to bring unity to the country. To inform people about the importance of unity, the media has been employed to publicize the concept by distributing images of the 1Malaysia logo. 1Malaysia is now being fetishized so much so that even public transport vehicles are painted with the 1Malaysia logo. To an outsider's eye, this fetishization seems surprising and complex. By deconstructing images of 1Malaysia in the media from an outsider's perspective, this paper examines the function of these visual discourses on Malaysians and asks whether the Malay, Chinese, and Indian communities perceive themselves together as 1Malaysians. We argue that Malaysia is still a work in progress toward achieving 'unity in diversity.' Keywords: Malaysia, 1Malaysia, images, unity, diversity, ethnicity
Which visual questions are difficult to answer? Analysis with Entropy of Answer Distributions
We propose a novel approach to identify the difficulty of visual questions
for Visual Question Answering (VQA) without direct supervision or annotations
to the difficulty. Prior works have considered the diversity of ground-truth
answers of human annotators. In contrast, we analyze the difficulty of visual
questions based on the behavior of multiple different VQA models. We propose to
cluster the entropy values of the predicted answer distributions obtained by
three different models: a baseline method that takes as input images and
questions, and two variants that take as input images only and questions only.
We use a simple k-means to cluster the visual questions of the VQA v2
validation set. Then we use state-of-the-art methods to determine the accuracy
and the entropy of the answer distributions for each cluster. A benefit of the
proposed method is that no annotation of the difficulty is required, because
the accuracy of each cluster reflects the difficulty of visual questions that
belong to it. Our approach can identify clusters of difficult visual questions
that are not answered correctly by state-of-the-art methods. Detailed analysis
on the VQA v2 dataset reveals that 1) all methods show poor performance on the
most difficult cluster (about 10% accuracy), 2) as the cluster difficulty
increases, the answers predicted by the different methods begin to differ, and
3) the values of cluster entropy are highly correlated with the cluster
accuracy. We show that our approach has the advantage of being able to assess
the difficulty of visual questions without ground-truth (i.e. the test set of
VQA v2) by assigning them to one of the clusters. We expect that this can
stimulate the development of novel directions of research and new algorithms.
Clustering results are available online at https://github.com/tttamaki/vqd. Comment: accepted by IEEE Access, available at https://doi.org/10.1109/ACCESS.2020.3022063 as "An Entropy Clustering Approach for Assessing Visual Question Difficulty".
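The pipeline this abstract describes, per-question entropies of the answer distributions from three models, clustered with k-means, can be sketched roughly as follows. The toy distributions, feature layout, and cluster count below are illustrative assumptions, not the authors' exact setup:

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of one predicted answer distribution."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p + eps)))

def kmeans(X, k, iters=100, seed=0):
    """Minimal k-means over per-question entropy feature vectors."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign each question to its nearest cluster center.
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers

# Toy example: a peaked distribution (a confident model, low entropy,
# an "easy" question) vs. a near-uniform one (high entropy, "hard").
dists = [
    [0.98, 0.01, 0.01],
    [0.34, 0.33, 0.33],
]
```

In the paper's setting each question would contribute a three-dimensional feature, one entropy per model variant (image+question, image-only, question-only), and the cluster's mean accuracy then serves as its difficulty label.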
Effectiveness of dermoscopy in skin cancer diagnosis
Clinical Inquiries question: Does dermoscopy improve the effectiveness of skin cancer diagnosis when used for skin cancer screening? Evidence-based answer: Dermoscopy added to visual inspection is more accurate than visual inspection alone in the diagnosis of melanoma and basal cell carcinoma (BCC). However, there is insufficient evidence to draw conclusions on the effectiveness of dermoscopy in the diagnosis of squamous cell carcinoma (SCC; strength of recommendation B: based on systematic reviews of randomized controlled trials [RCTs], and prospective and retrospective observational studies). Sydney Davis, MD; Cleveland Piggott, MD, MPH; Corey Lyon, DO; Kristen DeSanto, MSLS, MS, RD, AHIP. Dr Davis is a resident family physician, Dr Piggott is Assistant Professor and Director of Diversity & Health Equity for Family Medicine, Dr Lyon is Associate Professor in the Department of Family Medicine, and Ms DeSanto is Clinical Librarian in the Strauss Health Sciences Library, all at the University of Colorado in Denver. Includes bibliographical references
Creativity: Generating Diverse Questions using Variational Autoencoders
Generating diverse questions for given images is an important task for
computational education, entertainment and AI assistants. Unlike many
conventional prediction tasks, this requires algorithms to generate a
diverse set of plausible questions, which we refer to as "creativity". In this
paper we propose a creative algorithm for visual question generation which
combines the advantages of variational autoencoders with long short-term memory
networks. We demonstrate that our framework is able to generate a large set of
varying questions given a single input image. Comment: Accepted to CVPR 201
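As a rough illustration of the idea (not the paper's implementation), the VAE's reparameterization trick is what lets the model draw many distinct latent codes for a single image; a decoder such as an LSTM would then map each code to a different question. The latent dimension and the encoder outputs below are made up for the sketch:

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample a latent code z = mu + sigma * eps (the VAE reparameterization trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(42)

# Hypothetical encoder output for one image: one (mu, log_var) pair
# over an 8-dimensional latent space.
mu = np.zeros(8)
log_var = np.zeros(8)

# Drawing several latent samples for the same image yields distinct codes;
# decoding each one would yield a distinct question, i.e. a diverse set.
samples = [reparameterize(mu, log_var, rng) for _ in range(5)]
```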
Learning by Asking Questions
We introduce an interactive learning framework for the development and
testing of intelligent visual systems, called learning-by-asking (LBA). We
explore LBA in the context of the Visual Question Answering (VQA) task. LBA differs
from standard VQA training in that most questions are not observed during
training time, and the learner must ask questions it wants answers to. Thus,
LBA more closely mimics natural learning and has the potential to be more
data-efficient than the traditional VQA setting. We present a model that
performs LBA on the CLEVR dataset, and show that it automatically discovers an
easy-to-hard curriculum when learning interactively from an oracle. Our LBA
generated data consistently matches or outperforms the CLEVR train data and is
more sample efficient. We also show that our model asks questions that
generalize to state-of-the-art VQA models and to novel test-time distributions.
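A toy version of the ask-what-you-are-unsure-about loop might look like the sketch below. Note the selection rule here is a simple fixed entropy heuristic, an assumption for illustration; in the paper the learner itself learns which questions to ask the oracle:

```python
import numpy as np

def pick_question(model_probs):
    """Select the candidate question whose predicted answer distribution
    has the highest entropy, i.e. the one the learner is least sure about."""
    probs = np.asarray(model_probs, dtype=float)
    ent = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return int(np.argmax(ent))

# Hypothetical pool: the model's current answer distributions for three
# candidate questions about an image. The learner asks the most uncertain
# one, receives the oracle's answer, and would add that (question, answer)
# pair to its training set before repeating.
pool = [
    [0.90, 0.05, 0.05],
    [0.40, 0.30, 0.30],
    [0.55, 0.35, 0.10],
]
chosen = pick_question(pool)  # 1: the near-uniform distribution
```

Because confident (low-entropy) questions are skipped and uncertain ones are asked first, a loop like this naturally traces out the easy-to-hard curriculum the abstract mentions.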
Hard to Cheat: A Turing Test based on Answering Questions about Images
Progress in language and image understanding by machines has sparked the
interest of the research community in more open-ended, holistic tasks, and
refueled an old AI dream of building intelligent machines. We discuss a few
prominent challenges that characterize such holistic tasks and argue for
"question answering about images" as a particular appealing instance of such a
holistic task. In particular, we point out that it is a version of a Turing
Test that is likely to be more robust to over-interpretations and contrast it
with tasks like grounding and generation of descriptions. Finally, we discuss
tools to measure progress in this field. Comment: Presented in AAAI-15 Workshop: Beyond the Turing Test
An Interactive Tablecloth for Facilitating Discussion in a Culturally Diverse Group
Group discussions are a useful tool in a number of environments: from working towards a common goal in a business setting, to gathering feedback on an exhibit in a museum, for example. One issue in such sessions is that some group members can talk more loudly and confidently than others, making some group members change their minds or keep quiet; this can result in interesting differences of opinion being lost. In this paper we present a tool for facilitating such group discussions. The tool is an interactive tablecloth that is controlled with tangible interfaces, and it provides a method for each group member's voice to be heard prior to discussion, thus preserving the diversity of responses. When tested after an immersive theatre performance, the tool effectively allowed each group member to answer questions individually before group discussion began. This also allowed the facilitator to coordinate the discussion efficiently.