    Human Attention in Image Captioning: Dataset and Analysis

    In this work, we present a novel dataset consisting of eye movements and verbal descriptions recorded synchronously over images. Using this data, we study the differences in human attention during free-viewing and image captioning tasks. We look into the relationship between human attention and language constructs during perception and sentence articulation. We also analyse attention deployment mechanisms in the top-down soft attention approach, which is argued to mimic human attention in captioning tasks, and investigate whether visual saliency can help image captioning. Our study reveals that (1) human attention behaviour differs between free-viewing and image description tasks, with humans fixating on a greater variety of regions in the latter; (2) there is a strong relationship between described objects and attended objects (97% of described objects are attended); (3) a convolutional neural network used as a feature encoder accounts for human-attended regions during image captioning to a great extent (around 78%); (4) the soft-attention mechanism differs from human attention both spatially and temporally, and there is low correlation between caption scores and attention consistency scores, indicating a large gap between humans and machines with regard to top-down attention; and (5) by integrating the soft attention model with image saliency, we can significantly improve the model's performance on the Flickr30k and MSCOCO benchmarks. The dataset can be found at: https://github.com/SenHe/Human-Attention-in-Image-Captioning. Comment: To appear at ICCV 2019.
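    A minimal sketch of the kind of top-down soft attention analysed above (in the style of Show, Attend and Tell): the decoder state scores each spatial CNN feature, and the context vector is the resulting convex combination. This is an illustrative PyTorch snippet under assumed dimensions, not the authors' code.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class SoftAttention(nn.Module):
            """Top-down soft attention: score each spatial CNN feature
            against the decoder state, softmax the scores, and return
            the attention-weighted sum of the features."""
            def __init__(self, feat_dim, hidden_dim, attn_dim):
                super().__init__()
                self.feat_proj = nn.Linear(feat_dim, attn_dim)
                self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
                self.score = nn.Linear(attn_dim, 1)

            def forward(self, feats, hidden):
                # feats: (B, L, feat_dim), L spatial locations from the CNN encoder
                # hidden: (B, hidden_dim), the caption decoder's LSTM state
                e = self.score(torch.tanh(
                    self.feat_proj(feats) + self.hidden_proj(hidden).unsqueeze(1)
                )).squeeze(-1)                       # (B, L) unnormalised scores
                alpha = F.softmax(e, dim=1)          # attention weights, sum to 1
                context = (alpha.unsqueeze(-1) * feats).sum(dim=1)  # (B, feat_dim)
                return context, alpha

    The saliency integration in finding (5) could, for example, reweight alpha with a predicted saliency map before renormalising; the abstract does not specify the exact fusion used.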

    Going Deeper into Action Recognition: A Survey

    Understanding human actions in visual data is tied to advances in complementary research areas including object recognition, human dynamics, domain adaptation and semantic segmentation. Over the last decade, human action analysis has evolved from early schemes, often limited to controlled environments, to advanced solutions that can learn from millions of videos and apply to almost all daily activities. Given the broad range of applications, from video surveillance to human-computer interaction, scientific milestones in action recognition are achieved rapidly, quickly rendering once-dominant methods obsolete. This motivated us to provide a comprehensive review of the notable steps taken towards recognizing human actions. To this end, we start our discussion with the pioneering methods that use handcrafted representations, and then navigate into the realm of deep-learning-based approaches. We aim to remain objective throughout this survey, touching upon encouraging improvements as well as inevitable setbacks, in the hope of raising fresh questions and motivating new research directions for the reader.

    Large-Scale Video Classification with Convolutional Neural Networks

    Convolutional Neural Networks (CNNs) have been established as a powerful class of models for image recognition problems. Encouraged by these results, we provide an extensive empirical evaluation of CNNs on large-scale video classification using a new dataset of 1 million YouTube videos belonging to 487 classes. We study multiple approaches for extending the connectivity of a CNN in the time domain to take advantage of local spatio-temporal information, and suggest a multiresolution, foveated architecture as a promising way of speeding up the training. Our best spatio-temporal networks display significant performance improvements compared to strong feature-based baselines (55.3% to 63.9%), but only a surprisingly modest improvement compared to single-frame models (59.3% to 60.9%). We further study the generalization performance of our best model by retraining the top layers on the UCF-101 Action Recognition dataset and observe significant performance improvements compared to the UCF-101 baseline model (63.3%, up from 43.9%).
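    A minimal sketch of the multiresolution, foveated input scheme mentioned above, assuming the paper's 178x178 frames split into two 89x89 streams; the function and variable names are illustrative, not the authors' code.

        import torch
        import torch.nn.functional as F

        def foveated_streams(frames, out=89):
            """Split frames into the two input streams of a multiresolution,
            foveated architecture: a low-resolution 'context' stream
            (downsampled full frame) and a 'fovea' stream (centre crop
            at the original resolution).
            frames: (B, C, 178, 178) float tensor."""
            # Context stream: the whole frame at half resolution.
            context = F.interpolate(frames, size=(out, out),
                                    mode='bilinear', align_corners=False)
            # Fovea stream: centre crop, exploiting the centre bias
            # of typical online video footage.
            _, _, H, W = frames.shape
            top, left = (H - out) // 2, (W - out) // 2
            fovea = frames[:, :, top:top + out, left:left + out]
            return context, fovea

    Each stream then feeds its own lower-resolution convolutional tower, which is how the architecture speeds up training relative to a single full-resolution network.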

    Towards Green Metaverse Networking: Technologies, Advancements and Future Directions

    As the Metaverse is iteratively being defined, its potential to unleash the next wave of digital disruption and create real-life value becomes increasingly clear. With distinctive features of immersive experience, simultaneous interactivity, and user agency, the Metaverse has the capability to transform all walks of life. However, the enabling technologies of the Metaverse, i.e., digital twin, artificial intelligence, blockchain, and extended reality, are known to be energy-hungry, raising concerns about the sustainability of its large-scale deployment and development. This article proposes Green Metaverse Networking for the first time to optimize the energy efficiency of all network components for sustainable Metaverse development. We first analyze the energy consumption, efficiency, and sustainability of energy-intensive technologies in the Metaverse. Next, focusing on computation and networking, we present major advancements related to energy efficiency and their integration into the Metaverse. A case study of energy conservation that incorporates semantic communication and stochastic resource allocation in the Metaverse is presented. Finally, we outline the critical challenges of sustainable Metaverse development, thereby indicating potential directions of future research towards the green Metaverse.
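    As an illustration of the stochastic resource allocation idea in the case study (a sketch, not the article's formulation), the snippet below picks the lowest transmit power whose expected Shannon rate over random channel fades still meets a rate target, using sample-average approximation; every constant is invented for the example.

        import numpy as np

        rng = np.random.default_rng(0)

        B, N0 = 1e6, 1e-9                        # bandwidth (Hz), noise power (W)
        R_MIN = 26e6                             # required mean rate (bit/s)
        powers = np.linspace(0.1, 2.0, 40)       # candidate transmit powers (W)
        gains = rng.exponential(1.0, 10_000)     # sampled Rayleigh-fading power gains

        def mean_rate(p):
            # Expected Shannon rate over the sampled channel realisations.
            return np.mean(B * np.log2(1 + p * gains / N0))

        # Energy-aware allocation: the cheapest power that still satisfies QoS.
        feasible = [p for p in powers if mean_rate(p) >= R_MIN]
        print(f"lowest feasible transmit power: {min(feasible):.2f} W")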

    A Comprehensive Literature Review on Convolutional Neural Networks

    The fields of computer vision and image processing have, from their initial days, been dealing with the problems of visual recognition. Convolutional Neural Networks (CNNs) are deep architectures built as feed-forward neural networks, inspired by research on visual analysis in the visual cortex of mammals such as cats. This work gives a detailed analysis of CNNs for computer vision tasks, natural language processing, fundamental sciences and engineering problems, along with other miscellaneous tasks. We also cover the general CNN structure along with its mathematical intuition and working, and offer a brief critical commentary on the advantages and disadvantages that lead researchers to search for alternatives to CNNs. The paper also serves as an appreciation of the work of past researchers that produced such a fecund architecture for handling multidimensional data, and discusses approaches to improve CNN performance further.
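    As a concrete instance of the general CNN structure the review describes, here is a minimal runnable PyTorch model: stacked convolution, nonlinearity and pooling stages that build increasingly abstract feature maps, followed by a fully connected classifier. The 28x28 single-channel input and all layer sizes are illustrative.

        import torch
        import torch.nn as nn

        model = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # local feature detectors
            nn.ReLU(),                                   # nonlinearity
            nn.MaxPool2d(2),                             # 28x28 -> 14x14
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 14x14 -> 7x7
            nn.Flatten(),
            nn.Linear(32 * 7 * 7, 10),                   # class scores
        )

        x = torch.randn(8, 1, 28, 28)   # dummy batch
        print(model(x).shape)           # torch.Size([8, 10])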