
    Multi-modal Transformer for Video Retrieval

    The task of retrieving video content relevant to natural language queries plays a critical role in effectively handling internet-scale datasets. Most of the existing methods for this caption-to-video retrieval problem do not fully exploit cross-modal cues present in video. Furthermore, they aggregate per-frame visual features with limited or no temporal information. In this paper, we present a multi-modal transformer to jointly encode the different modalities in video, which allows each of them to attend to the others. The transformer architecture is also leveraged to encode and model the temporal information. On the natural language side, we investigate the best practices to jointly optimize the language embedding together with the multi-modal transformer. This novel framework allows us to establish state-of-the-art results for video retrieval on three datasets. More details are available at http://thoth.inrialpes.fr/research/MMT. Comment: ECCV 2020 (spotlight paper)

    Retrieving Ambiguous Sounds Using Perceptual Timbral Attributes in Audio Production Environments

    For over a decade, one of the well-identified problems in audio production environments has been the effective retrieval and management of sound libraries. Most self-recorded and commercially produced sound libraries are well structured in terms of metadata and textual descriptions, which allows traditional text-based retrieval approaches to obtain satisfactory results. However, traditional information retrieval techniques are limited in retrieving ambiguous sound collections (i.e. sounds with no identifiable origin, foley sounds, synthesized sound effects, abstract sounds) because of the difficulty of describing them textually and the complex psychoacoustic nature of the sounds. Early psychoacoustical studies propose perceptual acoustical qualities as an effective way of describing this category of sounds [1]. In Music Information Retrieval (MIR) studies, this problem has mostly been studied and explored in the context of content-based audio retrieval. However, we observed that most commercially available systems integrate neither advanced content-based sound descriptions nor the visualization and interface design approaches that have evolved in recent years. Our research had two main aims: (1) to develop an audio retrieval system that incorporates high-level timbral features as search parameters, and (2) to investigate a user-centered approach to integrating these features into audio production pipelines using expert-user studies. In this project, we present a prototype that is similar to traditional sound browsers (list-based browsing), with the added functionality of filtering and ranking sounds by perceptual timbral features such as brightness, depth, roughness and hardness. Our main focus was on the retrieval process by timbral features.
Inspired by the recent focus on user-centered systems ([2], [3]) in the MIR community, we conducted in-depth interviews and a qualitative evaluation of the system with expert users in order to identify the underlying problems. Our studies observed the potential applications of high-level perceptual timbral features in audio production pipelines using a probe system and expert-user studies. We also outline future guidelines and possible improvements to the system based on the outcomes of this research.

    Conceptual Representations for Computational Concept Creation

    Computational creativity seeks to understand computational mechanisms that can be characterized as creative. The creation of new concepts is a central challenge for any creative system. In this article, we outline different approaches to computational concept creation and then review conceptual representations relevant to concept creation, and therefore to computational creativity. The conceptual representations are organized in accordance with two important perspectives on the distinctions between them. One distinction is between symbolic, spatial and connectionist representations. The other is between descriptive and procedural representations. Additionally, conceptual representations used in particular creative domains, such as language, music, image and emotion, are reviewed separately. For every representation reviewed, we cover the inference it affords, the computational means of building it, and its application in concept creation. Peer reviewed

    Extraction and representation of semantic information in digital media

