113 research outputs found

    Understanding visual content through joint analysis of content and usage

    In this thesis we address the understanding of visual content, whether images, videos or 3D content. By understanding we mean the ability to infer semantic information about the visual content. The goal of this work is to study methods combining two approaches: 1) automatic content analysis and 2) analysis of the interactions involved in using this content (usage analysis, for short). We first review the state of the art from the Computer Vision and Multimedia communities. Twenty years ago, the dominant approach aimed at a fully automatic understanding of images. Today, this approach leaves more room for various forms of human intervention, which may take the form of building an annotated training set, solving problems interactively (for example detection or segmentation), or collecting implicit information derived from how the content is used. Rich and complex links exist between the human supervision of automatic algorithms and the adaptation of human contributions through automatic algorithms. These links give rise to modern research questions: how can human contributors be motivated? How can interactive scenarios be designed so that the interactions help to understand the manipulated content? How can the quality of the collected traces be verified? How should usage data be aggregated? How should usage data be fused with the more classical data produced by automatic analysis? Our literature review addresses these questions and positions the contributions of this thesis, which are organized in two main parts. The first part of our work revisits the detection of important (or salient) regions through implicit feedback from users who view or capture visual content. In 2D first, several interactive video interfaces (in particular zoomable video) are designed to coordinate content-based analyses with usage-based ones. We generalize these results to 3D with the introduction of a new salient-region detector derived from the simultaneous recording, by many users, of videos of the same public artistic performance (dance or singing shows, etc.). The second contribution of our work targets the semantic understanding of still images. We exploit data collected through a game, Ask'nSeek, that we created. Elementary interactions (such as clicks) and the textual data entered by players are, as before, combined with automatic analyses of the images. In particular, we show the value of interactions that reveal spatial relations between different objects detectable in the same scene. After the detection of the objects of interest in a scene, we also address the more ambitious problem of segmentation.
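
To make the coordination of content-based and usage-based analysis concrete, here is a minimal sketch of one plausible late-fusion scheme: implicit interactions (e.g. clicks or zoom centers) are accumulated into a smoothed density map and combined with a content-based saliency map. The function names, the Gaussian smoothing and the fixed fusion weight are illustrative assumptions, not the thesis's actual algorithm.

```python
# A minimal sketch (illustrative, not the thesis's actual algorithm) of late
# fusion between usage-based and content-based saliency. The function names,
# Gaussian smoothing and fixed fusion weight are assumptions.
import numpy as np
from scipy.ndimage import gaussian_filter

def usage_heatmap(clicks, shape, sigma=15.0):
    """Accumulate (x, y) interaction points into a smoothed, normalized density map."""
    heat = np.zeros(shape, dtype=float)
    for x, y in clicks:
        heat[int(y), int(x)] += 1.0        # one vote per implicit interaction
    heat = gaussian_filter(heat, sigma)    # diffuse isolated clicks into regions
    return heat / heat.max() if heat.max() > 0 else heat

def fuse_saliency(content_map, usage_map, alpha=0.5):
    """Convex combination of a content-based map and a usage-based map."""
    return alpha * content_map + (1.0 - alpha) * usage_map
```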

    Combining content analysis with usage analysis to better understand visual contents

    This thesis focuses on the problem of understanding visual contents, which can be images, videos or 3D contents. Understanding means that we aim at inferring semantic information about the visual content. The goal of our work is to study methods that combine two types of approaches: 1) automatic content analysis and 2) an analysis of how humans interact with the content (in other words, usage analysis). We start by reviewing the state of the art from both the Computer Vision and Multimedia communities. Twenty years ago, the main approach aimed at a fully automatic understanding of images. Today, this approach gives way to different forms of human intervention, whether through the constitution of annotated datasets, the interactive solving of problems (e.g. detection or segmentation), or the implicit collection of information gathered from content usage. These different types of human intervention are at the heart of modern research questions: how to motivate human contributors? How to design interactive scenarios that generate interactions contributing to content understanding? How to check or ensure the quality of human contributions? How to aggregate human contributions? How to fuse inputs obtained from usage analysis with the traditional outputs of content analysis? Our literature review addresses these questions and allows us to position the contributions of this thesis. In our first set of contributions, we revisit the detection of important (or salient) regions through implicit feedback from users who either consume or produce visual contents. In 2D, we develop several interactive video interfaces (e.g. zoomable video) in order to coordinate content analysis and usage analysis. We also generalize these results to 3D by introducing a new detector of salient regions that builds upon simultaneous video recordings of the same public artistic performance (dance or singing shows, etc.) by multiple users. The second contribution of our work aims at a semantic understanding of still images. With this goal in mind, we use data gathered through a game, Ask'nSeek, that we created. Elementary interactions (such as clicks), together with textual input from players, are, as before, combined with automatic analysis of the images. In particular, we show the usefulness of interactions that help reveal spatial relations between different objects in a scene. After studying the problem of detecting objects in a scene, we also address the more ambitious problem of segmentation.
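
As an illustration of how Ask'nSeek-style traces could be exploited, the sketch below scores candidate object regions by their consistency with (click, spatial relation) pairs such as "to the left of the dog". The box format, relation set and voting scheme are assumptions made for illustration; the thesis's actual model is not reproduced here.

```python
# Illustrative sketch only: scoring candidate object boxes against game traces
# where each player click comes with a spatial relation to a named object.
# Box format (x0, y0, x1, y1), relation names and voting are all assumptions.

def relation_holds(click, box, relation):
    """Check whether a click (x, y) is consistent with a relation to a box."""
    x, y = click
    x0, y0, x1, y1 = box
    if relation == "inside":
        return x0 <= x <= x1 and y0 <= y <= y1
    if relation == "left_of":
        return x < x0
    if relation == "right_of":
        return x > x1
    if relation == "above":
        return y < y0
    if relation == "below":
        return y > y1
    raise ValueError(f"unknown relation: {relation}")

def score_candidates(traces, candidate_boxes):
    """Each (click, relation) trace votes for the boxes it is consistent with."""
    scores = [0] * len(candidate_boxes)
    for click, relation in traces:
        for i, box in enumerate(candidate_boxes):
            if relation_holds(click, box, relation):
                scores[i] += 1
    return scores
```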

    Enhancing the use of online 3D multimedia content through the analysis of user interactions

    Recent years have seen the development of interactive 3D graphics on the Web. The ability to visualize and manipulate 3D content in real time, in a natural and intuitive way, appears to be the next evolution of the Web for a wide range of application areas such as e-commerce, education and training, architectural design, virtual museums and virtual communities. The use of online 3D graphics in these application domains is not meant to substitute for traditional web content of texts, images and videos, but rather to complement it. The Web is now a platform where hypertext, hypermedia, and 3D graphics are simultaneously available to users. This use of online 3D graphics, however, poses two main issues. First, since 3D interactions are cumbersome, involving numerous degrees of freedom, 3D browsing may be inefficient and slow. We tackle this problem by proposing a new crowdsourcing-based paradigm to ease online 3D interactions: we analyze 3D user interactions to identify Regions of Interest (ROIs) and generate recommendations for subsequent users. These recommendations both reduce the interaction time needed to reach an ROI of a 3D object and simplify the required 3D interactions. Second, 3D scenes and objects carry rich visual information, whereas traditional websites mainly contain descriptive (textual) information with hyperlinks as the means of navigation. Websites that combine two very different media (hypertext and 3D graphics) may therefore be difficult for users to apprehend. To enable coherent navigation between 3D and textual information, we propose to use crowdsourcing to build semantic associations between texts and 3D visualizations. The produced links are suggested to subsequent users so that they can readily navigate to the viewpoint of a 3D object associated with a piece of textual content. We evaluate both methods through experimental user studies. The evaluations show that the recommendations reduce 3D interaction time. Moreover, users appreciate the proposed semantic association: a majority of users report that the recommendations were helpful to them, and prefer browsing 3D objects using both the semantic links and the mouse over using mouse interactions alone.
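
A minimal sketch, under assumptions, of the crowdsourcing idea described above: camera viewpoints logged from previous users are clustered to obtain candidate Regions of Interest, and the most visited clusters are recommended to subsequent users. The use of k-means and the plain 3D position encoding of a viewpoint are illustrative choices, not the work's exact method.

```python
# Illustrative sketch: derive viewpoint recommendations from crowdsourced
# interaction logs. k-means and the (x, y, z) viewpoint encoding are
# assumptions, not the actual algorithm of the thesis.
import numpy as np
from sklearn.cluster import KMeans

def recommend_viewpoints(logged_viewpoints, n_regions=5, top_k=3):
    """logged_viewpoints: (N, 3) array of camera positions from past sessions."""
    km = KMeans(n_clusters=n_regions, n_init=10).fit(logged_viewpoints)
    counts = np.bincount(km.labels_, minlength=n_regions)
    best = np.argsort(counts)[::-1][:top_k]   # most densely visited clusters
    return km.cluster_centers_[best]          # viewpoints to suggest to new users
```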

    Multimodal Explainable Artificial Intelligence: A Comprehensive Review of Methodological Advances and Future Research Directions

    The current study focuses on systematically analyzing recent advances in the field of Multimodal eXplainable Artificial Intelligence (MXAI). In particular, the relevant primary prediction tasks and publicly available datasets are initially described. Subsequently, a structured presentation of the MXAI methods in the literature is provided, taking into account the following criteria: a) the number of involved modalities, b) the stage at which explanations are produced, and c) the type of the adopted methodology (i.e. the mathematical formalism). Then, the metrics used for MXAI evaluation are discussed. Finally, a comprehensive analysis of current challenges and future research directions is provided. Comment: 26 pages, 11 figures.
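
The survey's three classification axes can be pictured as a simple record type. This hypothetical encoding (the enum values and field names are generic examples, not the paper's exact category names) shows how a method would be indexed in such a taxonomy.

```python
# Hypothetical data structure mirroring the survey's three axes; values and
# names are illustrative, not taken from the paper.
from dataclasses import dataclass
from enum import Enum

class Stage(Enum):
    INTRINSIC = "explanation produced by the model itself"
    POST_HOC = "explanation produced after the prediction"

@dataclass
class MXAIMethod:
    name: str
    num_modalities: int   # axis (a): how many modalities are involved
    stage: Stage          # axis (b): when explanations are produced
    methodology: str      # axis (c): e.g. "attention-based", "gradient-based"
```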

    Look, Read and Feel: Benchmarking Ads Understanding with Multimodal Multitask Learning

    Given the massive advertising market and the sharply increasing amount of online multimedia content (such as videos), it is now fashionable to promote advertisements (ads) together with the multimedia content. Manually finding relevant ads to match the provided content is exhausting, and hence automatic advertising techniques have been developed. Since ads are usually hard to understand from their visual appearance alone, owing to the visual metaphors they contain, other modalities, such as the embedded text, should be exploited for understanding. To further improve user experience, it is necessary to understand both the topic and the sentiment of an ad. This motivates us to develop a novel deep multimodal multitask framework that integrates multiple modalities to achieve effective topic and sentiment prediction simultaneously for ads understanding. In particular, our model first extracts multimodal information from ads and learns high-level, comparable representations. The visual metaphor of the ad is decoded in an unsupervised manner. The obtained representations are then fed into the proposed hierarchical multimodal attention modules to learn task-specific representations for the final prediction. A multitask loss function is also designed to train the topic and sentiment prediction models jointly in an end-to-end manner. We conduct extensive experiments on the latest large-scale advertisement dataset and achieve state-of-the-art performance on both prediction tasks. The obtained results could be utilized as a benchmark for ads understanding. Comment: 8 pages, 5 figures.
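
As a rough illustration of the multitask setup described above, the sketch below fuses precomputed image and text features and trains two heads (topic and sentiment) with a weighted sum of cross-entropy losses. The feature dimensions, class counts, plain concatenation fusion and loss weighting are assumptions; the paper's hierarchical multimodal attention modules are not reproduced here.

```python
# Illustrative PyTorch sketch of a two-head multimodal multitask model; the
# dimensions, class counts and concatenation fusion are assumptions, not the
# paper's architecture.
import torch
import torch.nn as nn

class AdsMultitaskModel(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, hidden=512,
                 n_topics=38, n_sentiments=30):
        super().__init__()
        # Simple fusion of precomputed image and text features.
        self.fuse = nn.Sequential(nn.Linear(img_dim + txt_dim, hidden), nn.ReLU())
        self.topic_head = nn.Linear(hidden, n_topics)
        self.sentiment_head = nn.Linear(hidden, n_sentiments)

    def forward(self, img_feat, txt_feat):
        h = self.fuse(torch.cat([img_feat, txt_feat], dim=-1))
        return self.topic_head(h), self.sentiment_head(h)

def multitask_loss(topic_logits, sent_logits, topic_y, sent_y, w=0.5):
    """Weighted sum of the two task losses, trained jointly end-to-end."""
    ce = nn.functional.cross_entropy
    return w * ce(topic_logits, topic_y) + (1.0 - w) * ce(sent_logits, sent_y)
```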

    Geo-referenced video retrieval: text annotation and similarity search

    Ph.D. thesis (Doctor of Philosophy)