2,746 research outputs found

    Visual Question Answering: A Survey of Methods and Datasets

    Visual Question Answering (VQA) is a challenging task that has received increasing attention from both the computer vision and the natural language processing communities. Given an image and a question in natural language, it requires reasoning over visual elements of the image and general knowledge to infer the correct answer. In the first part of this survey, we examine the state of the art by comparing modern approaches to the problem. We classify methods by their mechanism to connect the visual and textual modalities. In particular, we examine the common approach of combining convolutional and recurrent neural networks to map images and questions to a common feature space. We also discuss memory-augmented and modular architectures that interface with structured knowledge bases. In the second part of this survey, we review the datasets available for training and evaluating VQA systems. The various datasets contain questions at different levels of complexity, which require different capabilities and types of reasoning. We examine in depth the question/answer pairs from the Visual Genome project, and evaluate the relevance of the structured annotations of images with scene graphs for VQA. Finally, we discuss promising future directions for the field, in particular the connection to structured knowledge bases and the use of natural language processing models. Comment: 25 pages
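    To make the joint-embedding approach described above concrete, here is a minimal sketch of a CNN+RNN VQA baseline: precomputed CNN image features and an LSTM question encoding are projected into a common space, fused, and classified over a fixed answer vocabulary. All module names, dimensions, and the elementwise-product fusion are illustrative assumptions, not details from the survey.

```python
# Minimal joint-embedding VQA baseline (illustrative; names and sizes
# are assumptions, not taken from the survey).
import torch
import torch.nn as nn

class JointEmbeddingVQA(nn.Module):
    def __init__(self, vocab_size, num_answers, embed_dim=300,
                 hidden_dim=512, img_feat_dim=2048):
        super().__init__()
        # Question side: word embeddings fed to an LSTM (the recurrent half).
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Image side: assumes precomputed CNN features (the convolutional half).
        self.img_proj = nn.Linear(img_feat_dim, hidden_dim)
        # Fusion in the common feature space, then answer classification.
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, img_feats, question_tokens):
        q_emb = self.embed(question_tokens)            # (B, T, embed_dim)
        _, (h, _) = self.lstm(q_emb)                   # final hidden state
        q_vec = h[-1]                                  # (B, hidden_dim)
        v_vec = torch.tanh(self.img_proj(img_feats))   # (B, hidden_dim)
        fused = q_vec * v_vec                          # elementwise fusion
        return self.classifier(fused)                  # answer logits

# Toy usage: 4 images with 2048-d CNN features, 10-token questions.
model = JointEmbeddingVQA(vocab_size=10000, num_answers=1000)
logits = model(torch.randn(4, 2048), torch.randint(0, 10000, (4, 10)))
print(logits.shape)  # torch.Size([4, 1000])
```

    Many joint-embedding baselines treat VQA as classification over the most frequent answers; the sketch follows that convention, with image features assumed to come from a pretrained CNN such as a ResNet.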

    DramaQA: Character-Centered Video Story Understanding with Hierarchical QA

    Despite recent progress in computer vision and natural language processing, developing video understanding intelligence is still hard to achieve due to the intrinsic difficulty of understanding a story in video. Moreover, there is no theoretical metric for evaluating the degree of video understanding. In this paper, we propose a novel video question answering (Video QA) task, DramaQA, for comprehensive understanding of a video story. DramaQA focuses on two perspectives: 1) hierarchical QAs as an evaluation metric based on the cognitive developmental stages of human intelligence, and 2) character-centered video annotations to model the local coherence of the story. Our dataset is built upon the TV drama "Another Miss Oh" and contains 16,191 QA pairs from 23,928 video clips of varying length, with each QA pair belonging to one of four difficulty levels. We provide 217,308 annotated images with rich character-centered annotations, including visual bounding boxes, behaviors, and emotions of main characters, as well as coreference-resolved scripts. Additionally, we provide analyses of the dataset along with a Dual Matching Multistream model that effectively learns character-centered representations of video to answer questions about the video. We plan to release our dataset and model publicly for research purposes and expect that our work will provide a new perspective on video story understanding research. Comment: 21 pages, 10 figures, submitted to ECCV 2020
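    As an illustration of the annotation scheme the abstract describes (four difficulty levels, character-centered bounding boxes, behaviors, emotions, and coreference-resolved scripts), the sketch below lays out one hypothetical sample. The field names and example values are assumptions for illustration only, not the released DramaQA schema.

```python
# Hypothetical layout of one DramaQA-style sample, based only on the
# abstract above. Field names are illustrative, not the released schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class CharacterAnnotation:
    name: str                 # main character, coreference-resolved
    bbox: List[int]           # [x, y, width, height] in the frame
    behavior: str             # e.g. "standing"
    emotion: str              # e.g. "surprise"

@dataclass
class DramaQASample:
    clip_id: str
    script: str               # dialogue with resolved character mentions
    difficulty: int           # one of four hierarchical levels (1-4)
    question: str
    answer: str
    frames: List[List[CharacterAnnotation]] = field(default_factory=list)

# Toy example with made-up values.
sample = DramaQASample(
    clip_id="clip_0042",
    script="CharacterA: Why are you here?",
    difficulty=2,
    question="How does CharacterA feel in this scene?",
    answer="She is surprised.",
    frames=[[CharacterAnnotation("CharacterA", [120, 40, 80, 200],
                                 "standing", "surprise")]],
)
print(sample.difficulty, sample.frames[0][0].emotion)
```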

    How might teachers enable self-confidence? A review study

    In the context of learner-centred learning and curricular reform, self-confidence is invoked as an important construct. However, there is no easily available research-informed guidance on what self-confidence means for the professional teacher. This study uses the analytic technique of Concept Analysis to review the psychology and education literatures and provide a 'take-home' message for teachers. The review identifies conceptual artefacts (ideas, theories, and concepts that explain, connect, predict, or apply knowledge) that the teacher can appropriate in order to enable learner self-confidence. These conceptual artefacts are classified in three groups: characterising self-confidence; self-judgements of confidence; and factors that influence the development of self-confidence. The review finds self-confidence to be a robust and stable psychological construct, best promoted through teachers' attention to learners' development of knowledge and engagement in socially designed learning activities. It further finds that teachers' attention to activities that involve learners' self-regulation is of importance.

    Emotional Storyteller for Vision Impaired and Hearing-Impaired Children

    Tellie is an innovative mobile app designed to offer an immersive and emotionally enriched storytelling experience for children who are visually and hearing impaired. It achieves this through four main objectives. Text extraction utilizes the CRAFT model together with a combination of Convolutional Neural Networks (CNNs), Connectionist Temporal Classification (CTC), and Long Short-Term Memory (LSTM) networks to accurately extract and recognize text from images in storybooks. Recognition of emotions in sentences employs BERT to detect and distinguish emotions at the sentence level, including happiness, anger, sadness, and surprise. Conversion of text to natural human audio with emotion transforms text into emotionally expressive audio using Tacotron2 and WaveGlow, enhancing the synthesized speech with emotional styles to create engaging audio narratives. Conversion of text to sign language translates text into sign language using CNNs, catering to the Deaf and hard-of-hearing community and ensuring alignment with real sign language expressions. Together, these objectives make Tellie a groundbreaking app that empowers visually and hearing-impaired children with access to captivating storytelling experiences, promoting accessibility and inclusivity through the harmonious integration of language, creativity, and technology. This research demonstrates the potential of advanced technologies in fostering inclusive and emotionally engaging storytelling for all children.
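    The sentence-level emotion recognition stage can be illustrated with a generic BERT classification head over the four emotions named above. This is a standard fine-tuning setup assumed from the abstract, not Tellie's actual code; before fine-tuning on labeled story sentences, the predictions are meaningless.

```python
# Generic BERT sentence-emotion classifier over the four emotions the
# abstract names. A standard setup assumed for illustration, not the
# app's actual implementation.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

EMOTIONS = ["happiness", "anger", "sadness", "surprise"]

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(EMOTIONS))
model.eval()

def classify_emotion(sentence: str) -> str:
    # Tokenize one sentence and pick the highest-scoring emotion label.
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return EMOTIONS[logits.argmax(dim=-1).item()]

# The classification head is randomly initialized here, so the output is
# arbitrary until the model is fine-tuned; the call shape is what matters.
print(classify_emotion("The little fox jumped with joy!"))
```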

    Inner Monologue: Embodied Reasoning through Planning with Language Models

    Recent works have shown how the reasoning capabilities of Large Language Models (LLMs) can be applied to domains beyond natural language processing, such as planning and interaction for robots. These embodied problems require an agent to understand many semantic aspects of the world: the repertoire of skills available, how these skills influence the world, and how changes to the world map back to language. LLMs planning in embodied environments need to consider not just which skills to perform, but also how and when to perform them - answers that change over time in response to the agent's own choices. In this work, we investigate to what extent LLMs used in such embodied contexts can reason over sources of feedback provided through natural language, without any additional training. We propose that by leveraging environment feedback, LLMs are able to form an inner monologue that allows them to more richly process and plan in robotic control scenarios. We investigate a variety of sources of feedback, such as success detection, scene description, and human interaction. We find that closed-loop language feedback significantly improves high-level instruction completion in three domains, including simulated and real tabletop rearrangement tasks and long-horizon mobile manipulation tasks in a real-world kitchen environment. Comment: Project website: https://innermonologue.github.io
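    The closed-loop idea lends itself to a compact sketch: textual feedback from the environment is appended to the prompt before each replanning step. The `llm` callable, the environment interface, and the feedback strings below are hypothetical placeholders assumed for illustration, not the paper's actual system.

```python
# Sketch of a closed-loop planning step: environment feedback is turned
# into text and appended to the LLM prompt before the next action is
# chosen. All interfaces here are hypothetical placeholders.
def inner_monologue(llm, env, instruction, max_steps=10):
    transcript = [f"Human: {instruction}"]
    for _ in range(max_steps):
        # Ask the LLM for the next skill, given the dialogue so far.
        skill = llm("\n".join(transcript) + "\nRobot action:")
        transcript.append(f"Robot action: {skill}")
        if skill.strip().lower() == "done":
            break
        env.execute(skill)
        # Closed loop: success detection and scene description re-enter
        # the prompt as natural language feedback.
        transcript.append(f"Success: {env.success_detector(skill)}")
        transcript.append(f"Scene: {env.describe_scene()}")
    return transcript

# Toy stand-ins so the loop runs end to end; a real system would wire in
# an actual LLM, a trained success detector, and a scene describer.
class ToyEnv:
    def execute(self, skill): pass
    def success_detector(self, skill): return True
    def describe_scene(self): return "the block is on the table"

scripted = iter(["pick up the block", "put the block in the bowl", "done"])
log = inner_monologue(lambda prompt: next(scripted), ToyEnv(),
                      "put the block in the bowl")
print("\n".join(log))
```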

    Conversational Agent: Developing a Model for Intelligent Agents with Transient Emotional States

    The inclusion of human characteristics (i.e., emotions, personality) within an intelligent agent can often increase the effectiveness of information delivery and retrieval. Chatbots offer a plethora of benefits within an eclectic range of disciplines (e.g., education, medicine, clinical and mental health), and hence an effective way to observe, assess, and evaluate human communication patterns. The current research aims to develop a computational model for conversational agents with an emotional component, to be applied to the army leadership training program, which will allow for the examination of interpersonal skills in future research. Overall, the current research explores the application of deep learning algorithms to the development of a generalized framework for modeling empathetic conversation between an intelligent conversational agent (chatbot) and a human user, in order to allow for higher-level observation of interpersonal communication skills. Preliminary results demonstrate the promising potential of the seq2seq technique (e.g., through the Dialogflow chatbot platform) when applied to emotion-oriented conversational tasks. Both the classification and generative conversational modeling tasks show promise for representing human-to-agent dialogue. However, this implementation could be extended by utilizing a larger, higher-quality dataset.
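    The generative conversational modeling the abstract refers to can be illustrated with a minimal emotion-conditioned seq2seq model: an encoder LSTM reads the user utterance, and a decoder LSTM generates the reply, with an emotion embedding prepended as a conditioning token. The sizes and the conditioning scheme are assumptions for illustration, not the study's implementation.

```python
# Minimal emotion-conditioned seq2seq sketch (illustrative; sizes and
# the conditioning scheme are assumptions, not the study's model).
import torch
import torch.nn as nn

class EmotionSeq2Seq(nn.Module):
    def __init__(self, vocab_size, num_emotions, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.emotion_embed = nn.Embedding(num_emotions, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, src_tokens, emotion_id, tgt_tokens):
        # Prepend an emotion embedding so the reply is emotion-conditioned.
        src = torch.cat([self.emotion_embed(emotion_id).unsqueeze(1),
                         self.embed(src_tokens)], dim=1)
        _, state = self.encoder(src)
        # Decode the (teacher-forced) target reply from the encoder state.
        dec_out, _ = self.decoder(self.embed(tgt_tokens), state)
        return self.out(dec_out)  # per-step vocabulary logits

# Toy usage: 2 user utterances of 12 tokens, 8-token target replies.
model = EmotionSeq2Seq(vocab_size=5000, num_emotions=6)
logits = model(torch.randint(0, 5000, (2, 12)),   # user utterances
               torch.tensor([1, 4]),              # emotion labels
               torch.randint(0, 5000, (2, 8)))    # shifted target replies
print(logits.shape)  # torch.Size([2, 8, 5000])
```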