
    SVIT: Scaling up Visual Instruction Tuning

    Thanks to the emergence of foundation models, large language models and vision models have been integrated to acquire multimodal abilities such as visual captioning, dialogue, and question answering. Although existing multimodal models present impressive performance in visual understanding and reasoning, their limits remain largely under-explored due to the scarcity of high-quality instruction tuning data. To push the limits of multimodal capability, we Scale up Visual Instruction Tuning (SVIT) by constructing a dataset of 3.2 million visual instruction tuning examples, including 1.6M conversation question-answer (QA) pairs, 1.6M complex reasoning QA pairs, and 106K detailed image descriptions. Besides its volume, the proposed dataset is also characterized by high quality and rich diversity, being generated by prompting GPT-4 with the abundant manual annotations of images. We empirically verify that training multimodal models on SVIT can significantly improve multimodal performance in terms of visual perception, reasoning, and planning.
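    As a rough illustration of this kind of annotation-grounded data generation, the sketch below prompts GPT-4 with manual image annotations to produce QA pairs. The prompt wording, annotation format, and function names are illustrative assumptions, not the SVIT pipeline itself.

```python
# Minimal sketch of generating visual instruction tuning data by prompting GPT-4
# with manual image annotations, in the spirit of the abstract above. The prompt
# text, annotation format, and model name are assumptions, not the SVIT pipeline.
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def generate_qa(annotations: dict, task: str = "conversation") -> str:
    """Ask GPT-4 to write QA pairs grounded in an image's manual annotations."""
    prompt = (
        f"You are given manual annotations of an image:\n"
        f"{json.dumps(annotations, indent=2)}\n\n"
        f"Write five {task} question-answer pairs about the image, answerable "
        "only from the annotations, as a JSON list of "
        '{"question": ..., "answer": ...} objects.'
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    toy_annotations = {
        "captions": ["A man rides a bicycle along a beach at sunset."],
        "objects": [{"label": "person"}, {"label": "bicycle"}],
    }
    print(generate_qa(toy_annotations, task="complex reasoning"))
```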

    Deconstructing Visual Images of 1Malaysia

    As Malaysia is a multiracial country, Prime Minister Najib introduced the concept of 1Malaysia to protect each ethnic group and to bring unity to the country. To inform people about the importance of unity, the media has been employed to publicize the concept by distributing images of the 1Malaysia logo. 1Malaysia is now being fetishized to such an extent that even public transportation is painted with the 1Malaysia logo. To an outsider's eye, this fetishization seems surprising and complex. By deconstructing images of 1Malaysia in the media from an outsider's perspective, this paper examines the function of these visual discourses on Malaysians and tries to answer the question of whether the Malay, Chinese, and Indian communities perceive themselves together as 1Malaysians. We argue that Malaysia is still a work in progress toward achieving ‘unity in diversity.’ Keywords: Malaysia, 1Malaysia, images, unity, diversity, ethnicity

    Which visual questions are difficult to answer? Analysis with Entropy of Answer Distributions

    We propose a novel approach to identify the difficulty of visual questions for Visual Question Answering (VQA) without direct supervision or annotations of difficulty. Prior works have considered the diversity of ground-truth answers given by human annotators. In contrast, we analyze the difficulty of visual questions based on the behavior of multiple different VQA models. We propose to cluster the entropy values of the predicted answer distributions obtained by three different models: a baseline method that takes as input images and questions, and two variants that take as input images only and questions only. We use simple k-means to cluster the visual questions of the VQA v2 validation set. Then we use state-of-the-art methods to determine the accuracy and the entropy of the answer distributions for each cluster. A benefit of the proposed method is that no annotation of difficulty is required, because the accuracy of each cluster reflects the difficulty of the visual questions that belong to it. Our approach can identify clusters of difficult visual questions that are not answered correctly by state-of-the-art methods. Detailed analysis on the VQA v2 dataset reveals that 1) all methods show poor performance on the most difficult cluster (about 10% accuracy), 2) as the cluster difficulty increases, the answers predicted by the different methods begin to differ, and 3) the values of cluster entropy are highly correlated with cluster accuracy. We show that our approach has the advantage of being able to assess the difficulty of visual questions without ground truth (i.e., on the test set of VQA v2) by assigning them to one of the clusters. We expect that this can stimulate the development of novel directions of research and new algorithms. Clustering results are available online at https://github.com/tttamaki/vqd . Comment: accepted by IEEE Access, available at https://doi.org/10.1109/ACCESS.2020.3022063 as "An Entropy Clustering Approach for Assessing Visual Question Difficulty".
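    The entropy-and-clustering step can be made concrete with a minimal sketch, assuming the three models' softmax answer distributions are already available as arrays; the array shapes and the choice of k below are illustrative, not taken from the paper's released code.

```python
# Minimal sketch of entropy-based clustering of visual questions.
# Assumptions (not the paper's code): three models have already produced softmax
# answer distributions for each question, stored as NumPy arrays of shape
# (num_questions, num_answers); the number of clusters k is chosen arbitrarily.
import numpy as np
from sklearn.cluster import KMeans

def entropy(p, eps=1e-12):
    """Shannon entropy of each row of an (N, A) probability matrix."""
    return -np.sum(p * np.log(p + eps), axis=1)

def cluster_questions(probs_iq, probs_i, probs_q, k=5, seed=0):
    """Cluster questions by the entropies of three models' answer distributions.

    probs_iq: predictions of the image+question baseline
    probs_i:  predictions of the image-only variant
    probs_q:  predictions of the question-only variant
    """
    features = np.stack(
        [entropy(probs_iq), entropy(probs_i), entropy(probs_q)], axis=1
    )  # shape (N, 3): one entropy value per model per question
    labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(features)
    return labels, features

if __name__ == "__main__":
    rng = np.random.default_rng(0)

    def random_probs(n=1000, a=3000):
        # Toy stand-in for real model outputs: random softmax distributions.
        logits = rng.normal(size=(n, a))
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    labels, feats = cluster_questions(random_probs(), random_probs(), random_probs())
    print("cluster sizes:", np.bincount(labels))
```

    Per-cluster accuracy can then be computed by averaging the VQA accuracy of the questions assigned to each cluster, which is how cluster membership is read as a difficulty estimate.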

    Effectiveness of dermoscopy in skin cancer diagnosis

    Clinical Inquiries question: Does dermoscopy improve the effectiveness of skin cancer diagnosis when used for skin cancer screening? Evidence-based answer: Dermoscopy added to visual inspection is more accurate than visual inspection alone in the diagnosis of melanoma and basal cell carcinoma (BCC). However, there is insufficient evidence to draw conclusions on the effectiveness of dermoscopy in the diagnosis of squamous cell carcinoma (SCC; strength of recommendation B: based on systematic reviews of randomized controlled trials [RCTs], and prospective and retrospective observational studies). Sydney Davis, MD; Cleveland Piggott, MD, MPH; Corey Lyon, DO; Kristen DeSanto, MSLS, MS, RD, AHIP. Dr Davis is a resident family physician, Dr Piggott is Assistant Professor and Director of Diversity & Health Equity for Family Medicine, Dr Lyon is Associate Professor in the Department of Family Medicine, and Ms DeSanto is Clinical Librarian in the Strauss Health Sciences Library, all at the University of Colorado in Denver. Includes bibliographical references.

    Creativity: Generating Diverse Questions using Variational Autoencoders

    Generating diverse questions for given images is an important task for computational education, entertainment, and AI assistants. What distinguishes this task from many conventional prediction tasks is the need for algorithms to generate a diverse set of plausible questions, which we refer to as "creativity". In this paper we propose a creative algorithm for visual question generation which combines the advantages of variational autoencoders with long short-term memory networks. We demonstrate that our framework is able to generate a large set of varying questions given a single input image. Comment: Accepted to CVPR 2017.
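    A minimal PyTorch sketch of the general idea, a conditional VAE with an LSTM decoder over question tokens, is given below; the feature dimensions, the fusion of image features with the latent code, and the loss weighting are assumptions for illustration rather than the paper's exact architecture.

```python
# Minimal sketch of a conditional VAE with an LSTM decoder for visual question
# generation. Dimensions, feature fusion, and vocabulary handling are illustrative
# assumptions, not the paper's exact model.
import torch
import torch.nn as nn

class VQGVae(nn.Module):
    def __init__(self, vocab_size, img_dim=2048, embed_dim=256, hidden_dim=512, latent_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Encoder: reads question tokens plus image features, produces q(z | image, question).
        self.enc_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.to_mu = nn.Linear(hidden_dim + img_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim + img_dim, latent_dim)
        # Decoder: generates the question conditioned on image features and sampled z.
        self.init_h = nn.Linear(img_dim + latent_dim, hidden_dim)
        self.dec_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feat, question):
        emb = self.embed(question)                      # (B, T, E)
        _, (h_enc, _) = self.enc_lstm(emb)              # h_enc: (1, B, H)
        enc = torch.cat([h_enc[-1], img_feat], dim=1)   # (B, H + D_img)
        mu, logvar = self.to_mu(enc), self.to_logvar(enc)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        h0 = torch.tanh(self.init_h(torch.cat([img_feat, z], dim=1))).unsqueeze(0)
        c0 = torch.zeros_like(h0)
        dec_out, _ = self.dec_lstm(emb, (h0, c0))       # teacher forcing on gold tokens
        logits = self.out(dec_out)                      # (B, T, V)
        return logits, mu, logvar

def vae_loss(logits, targets, mu, logvar, pad_idx=0):
    # Reconstruction (cross-entropy over the vocabulary) plus KL divergence to N(0, I).
    recon = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1), ignore_index=pad_idx
    )
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

if __name__ == "__main__":
    model = VQGVae(vocab_size=1000)
    img = torch.randn(4, 2048)                  # stand-in for CNN image features
    q = torch.randint(1, 1000, (4, 12))         # stand-in for tokenized questions
    logits, mu, logvar = model(img, q)
    # Next-token prediction: logits at step t score the token at step t+1.
    print(vae_loss(logits[:, :-1], q[:, 1:], mu, logvar).item())
```

    Diversity at test time comes from sampling different latent codes z for the same image and decoding each one, which is what allows a single input image to yield many varying questions.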

    Learning by Asking Questions

    We introduce an interactive learning framework for the development and testing of intelligent visual systems, called learning-by-asking (LBA). We explore LBA in the context of the Visual Question Answering (VQA) task. LBA differs from standard VQA training in that most questions are not observed during training time, and the learner must ask questions it wants answers to. Thus, LBA more closely mimics natural learning and has the potential to be more data-efficient than the traditional VQA setting. We present a model that performs LBA on the CLEVR dataset, and show that it automatically discovers an easy-to-hard curriculum when learning interactively from an oracle. Our LBA-generated data consistently matches or outperforms the CLEVR training data and is more sample-efficient. We also show that our model asks questions that generalize to state-of-the-art VQA models and to novel test-time distributions.

    Hard to Cheat: A Turing Test based on Answering Questions about Images

    Progress in language and image understanding by machines has sparked the interest of the research community in more open-ended, holistic tasks, and refueled an old AI dream of building intelligent machines. We discuss a few prominent challenges that characterize such holistic tasks and argue for "question answering about images" as a particularly appealing instance of such a holistic task. In particular, we point out that it is a version of a Turing Test that is likely to be more robust to over-interpretations, and contrast it with tasks like grounding and generation of descriptions. Finally, we discuss tools to measure progress in this field. Comment: Presented in AAAI-15 Workshop: Beyond the Turing Test.