
    What value do explicit high level concepts have in vision to language problems?

    Much of the recent progress in Vision-to-Language (V2L) problems has been achieved through a combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). This approach does not explicitly represent high-level semantic concepts, but rather seeks to progress directly from image features to text. We propose here a method of incorporating high-level concepts into the very successful CNN-RNN approach, and show that it achieves a significant improvement on the state-of-the-art performance in both image captioning and visual question answering. We also show that the same mechanism can be used to introduce external semantic information, and that doing so further improves performance. In doing so we provide an analysis of the value of high-level semantic information in V2L problems. Comment: Accepted to IEEE Conf. Computer Vision and Pattern Recognition 2016. Fixed title.
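
    The abstract describes feeding explicit high-level concept (attribute) predictions into a CNN-RNN captioner instead of raw image features. The sketch below shows one plausible way to condition an RNN decoder on such a concept-probability vector; the class name, dimensions, and wiring are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): conditioning an RNN caption decoder
# on a vector of high-level concept probabilities rather than raw CNN features.
import torch
import torch.nn as nn

class ConceptConditionedCaptioner(nn.Module):
    def __init__(self, num_concepts=256, vocab_size=10000, embed_dim=512, hidden_dim=512):
        super().__init__()
        # Map predicted concept probabilities (e.g. from a multi-label CNN) to the
        # decoder's initial hidden state.
        self.concept_to_hidden = nn.Linear(num_concepts, hidden_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, concept_probs, captions):
        # concept_probs: (batch, num_concepts); captions: (batch, seq_len) token ids
        h0 = torch.tanh(self.concept_to_hidden(concept_probs)).unsqueeze(0)
        c0 = torch.zeros_like(h0)
        emb = self.embed(captions)
        hidden, _ = self.rnn(emb, (h0, c0))
        return self.out(hidden)  # (batch, seq_len, vocab_size) logits

# Toy usage with random inputs standing in for attribute-detector outputs.
model = ConceptConditionedCaptioner()
probs = torch.rand(2, 256)
tokens = torch.randint(0, 10000, (2, 12))
print(model(probs, tokens).shape)  # torch.Size([2, 12, 10000])
```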

    Learning visual representations with neural networks for video captioning and image generation

    The past decade has been marked as a golden era of neural network research. Not only have neural networks been successfully applied to solve more and more challenging real-world problems, but they have also become the dominant approach in many of the areas where they have been tested, including language understanding, game playing, and computer vision, thanks to their computational efficiency and statistical capacity. This thesis applies neural networks to problems in computer vision where high-level, semantically meaningful representations play a fundamental role.
It demonstrates, both in theory and in experiment, the ability of neural networks to learn such representations from data, with and without supervision. The main content of the thesis is divided into two parts. The first part studies neural networks in the context of learning visual representations for video captioning. A model is developed to dynamically focus on different frames while generating a natural language description of a short video; it is then improved with recurrent convolutional operations. The end of this part identifies fundamental challenges in video captioning and proposes a new type of evaluation metric that can be used experimentally as an oracle to benchmark performance. The second part turns to unsupervised learning and studies a family of models that generate images, focusing on Neural Autoregressive Density Estimators (NADEs), a tractable probabilistic model for natural images. This work first draws a connection between NADEs and Generative Stochastic Networks (GSNs). The standard NADE is then improved by introducing multiple iterations into its inference without increasing the number of parameters, a variant dubbed the iterative NADE. Opening with a historical review, the work closes with a summary of recent developments related to the contributions of the two main parts, around the central topic of learning visual representations for images and videos, and outlines promising research directions.
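
    The first part of the thesis describes a captioning model that attends to different video frames while generating each word. Below is a minimal sketch of such soft temporal attention, assuming per-frame CNN features and a recurrent decoder state; the module name and dimensions are illustrative, not taken from the thesis.

```python
# Minimal sketch (assumptions, not the thesis code): soft temporal attention that
# lets a decoder weight video frame features differently at each generation step.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttention(nn.Module):
    def __init__(self, feat_dim=1024, hidden_dim=512, attn_dim=256):
        super().__init__()
        self.proj_feat = nn.Linear(feat_dim, attn_dim)
        self.proj_state = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, frame_feats, decoder_state):
        # frame_feats: (batch, num_frames, feat_dim); decoder_state: (batch, hidden_dim)
        e = self.score(torch.tanh(
            self.proj_feat(frame_feats) + self.proj_state(decoder_state).unsqueeze(1)
        )).squeeze(-1)                      # (batch, num_frames) unnormalized scores
        alpha = F.softmax(e, dim=-1)        # attention weights over frames
        context = (alpha.unsqueeze(-1) * frame_feats).sum(dim=1)  # weighted frame summary
        return context, alpha

# Toy usage: 30 frames of CNN features attended by one decoder state.
attn = TemporalAttention()
feats = torch.randn(2, 30, 1024)
state = torch.randn(2, 512)
context, alpha = attn(feats, state)
print(context.shape, alpha.shape)  # torch.Size([2, 1024]) torch.Size([2, 30])
```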

    Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training

    While strong progress has been made in image captioning over the last years, machine and human captions are still quite distinct. A closer look reveals that this is due to deficiencies in the generated word distribution and vocabulary size, and to a strong bias in the generators towards frequent captions. Furthermore, humans -- rightfully so -- generate multiple, diverse captions, due to the inherent ambiguity of the captioning task, which is not considered in today's systems. To address these challenges, we change the training objective of the caption generator from reproducing ground-truth captions to generating a set of captions that is indistinguishable from human-generated captions. Instead of handcrafting such a learning target, we employ adversarial training in combination with an approximate Gumbel sampler to implicitly match the generated distribution to the human one. While our method achieves performance comparable to the state of the art in terms of the correctness of the captions, we generate a set of diverse captions that are significantly less biased and better match the word statistics in several respects. Comment: 16 pages, published in ICCV 2017.
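
    The key mechanism in this abstract is adversarial training of a caption generator whose word choices are discrete, made differentiable through an approximate Gumbel sampler. The sketch below illustrates a generic Gumbel-softmax (straight-through) relaxation in PyTorch; it is an assumption about the mechanism, not the paper's implementation, and the caption discriminator itself is omitted.

```python
# Minimal sketch: a Gumbel-softmax relaxation lets gradients from a caption
# discriminator flow back through the generator's discrete word choices.
import torch
import torch.nn.functional as F

def sample_word_relaxed(logits, tau=0.5):
    """Draw an (approximately) one-hot word sample that stays differentiable.

    logits: (batch, vocab_size) unnormalized scores from the caption generator.
    With hard=True the forward pass is exactly one-hot while the backward pass
    uses the soft relaxation (straight-through estimator).
    """
    return F.gumbel_softmax(logits, tau=tau, hard=True)

# Toy usage: the (soft) one-hot sample can be multiplied with an embedding matrix
# and fed to a discriminator that scores human-like vs. machine-like captions.
vocab_size, embed_dim = 10000, 300
logits = torch.randn(4, vocab_size, requires_grad=True)
embedding = torch.randn(vocab_size, embed_dim)
word_onehot = sample_word_relaxed(logits)
word_embed = word_onehot @ embedding     # (4, embed_dim), still differentiable
word_embed.sum().backward()              # gradients reach the generator logits
print(logits.grad.shape)                 # torch.Size([4, 10000])
```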

    Recent Trends in Computational Intelligence

    Traditional models struggle to cope with complexity, noise, and changing environments, while Computational Intelligence (CI) offers solutions to complicated problems as well as inverse problems. The main feature of CI is adaptability, spanning the fields of machine learning and computational neuroscience. CI also comprises biologically inspired techniques such as swarm intelligence, as part of evolutionary computation, and extends to wider areas such as image processing, data collection, and natural language processing. This book discusses the use of CI for the optimal solving of various applications, demonstrating its wide reach and relevance. Combining optimization methods with data mining strategies yields a strong and reliable prediction tool for handling real-life applications.

    Language-Driven Video Understanding

    Video understanding has come a long way in the past decade, progressing from low-level segmentation and tracking tasks that study objects as pixel-level segments or bounding boxes to higher-level activity recognition or classification tasks that assign a categorical action label to a video scene. Despite this progress, much of the work remains a proxy for an eventual task or application that requires a holistic view of the video, including objects, actions, attributes, and other semantic components. In this dissertation, we argue that language can deliver the required holistic representation. It plays a significant role in video understanding by allowing machines to communicate with humans and to understand our requests, as shown in tasks such as text-to-video search and voice-guided robot manipulation, to name a few. Our language-driven video understanding focuses on two specific problems: video description and visual grounding. What sets our viewpoint apart from prior literature is twofold. First, we propose a bottom-up structured learning scheme that decomposes a long video into individual procedure steps and represents each step with a description. Second, we propose both explicit (i.e., supervised) and implicit (i.e., weakly-supervised and self-supervised) grounding between words and visual concepts, which enables interpretable modeling of the two spaces. We start by drawing attention to the shortage of large benchmarks on long video-language and propose the largest-of-its-kind YouCook2 dataset and the ActivityNet-Entities dataset in Chap. II and III. The rest of the chapters revolve around the two main problems: video description and visual grounding. For video description, we first address the problem of decomposing a long video into compact and self-contained event segments in Chap. IV. Given an event segment, or a short video clip in general, we propose a non-recurrent approach (i.e., a Transformer) for video description generation in Chap. V, as opposed to prior RNN-based methods, and demonstrate superior performance. Moving forward, we note one potential issue in end-to-end video description generation: the lack of visual grounding ability and model interpretability that would allow humans to directly interact with machine vision models. To address this issue, we shift our focus from end-to-end video-to-text systems to systems that explicitly capture the grounding between the two modalities, with a novel grounded video description framework in Chap. VI. So far, all the methods are fully supervised, i.e., the model training signal comes directly from heavy and expensive human annotations. In the following chapter, we answer the question "Can we perform visual grounding without explicit supervision?" with a weakly-supervised framework in which models learn grounding from (weak) description signals. Finally, in Chap. VIII, we conclude the technical work by exploring a self-supervised grounding approach, vision-language pre-training, that implicitly learns visual grounding from web multi-modal data. This mimics how humans acquire commonsense from the environment through multi-modal interactions. PhD, Robotics, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/155174/1/luozhou_1.pd
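
    Chap. V of the dissertation replaces RNN decoders with a non-recurrent Transformer for video description. The following sketch shows a generic Transformer decoder attending over clip-level features to produce caption logits; the class name, feature dimensions, and layer counts are illustrative assumptions rather than the dissertation's architecture.

```python
# Minimal sketch (illustrative only, not the dissertation's model): a Transformer
# decoder that attends over per-clip video features to generate a description.
import torch
import torch.nn as nn

class TransformerVideoCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, d_model=512, vocab_size=10000, nhead=8, num_layers=2):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, d_model)   # project clip features to model width
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, clip_feats, tokens):
        # clip_feats: (batch, num_clips, feat_dim); tokens: (batch, seq_len) token ids
        memory = self.feat_proj(clip_feats)
        tgt = self.embed(tokens)
        # Causal mask so each position only attends to earlier words.
        sz = tokens.size(1)
        mask = torch.triu(torch.full((sz, sz), float('-inf')), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=mask)
        return self.out(hidden)  # (batch, seq_len, vocab_size) logits

# Toy usage: 16 segment-level features and a 20-token partial caption.
model = TransformerVideoCaptioner()
feats = torch.randn(2, 16, 2048)
tokens = torch.randint(0, 10000, (2, 20))
print(model(feats, tokens).shape)  # torch.Size([2, 20, 10000])
```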