3,333 research outputs found
Multi-modal Transformer for Video Retrieval
The task of retrieving video content relevant to natural language queries
plays a critical role in effectively handling internet-scale datasets. Most of
the existing methods for this caption-to-video retrieval problem do not fully
exploit cross-modal cues present in video. Furthermore, they aggregate
per-frame visual features with limited or no temporal information. In this
paper, we present a multi-modal transformer to jointly encode the different
modalities in video, which allows each of them to attend to the others. The
transformer architecture is also leveraged to encode and model the temporal
information. On the natural language side, we investigate the best practices to
jointly optimize the language embedding together with the multi-modal
transformer. This novel framework allows us to establish state-of-the-art
results for video retrieval on three datasets. More details are available at
http://thoth.inrialpes.fr/research/MMT.Comment: ECCV 2020 (spotlight paper
Retrieving Ambiguous Sounds Using Perceptual Timbral Attributes in Audio Production Environments
For over an decade, one of the well identified problem within audio production environments is the effective retrieval and management of sound libraries. Most of the self-recorded and commercially produced sound libraries are usually well structured in terms of meta-data and textual descriptions and thus allowing traditional text-based retrieval approaches to obtain satisfiable results. However, traditional information retrieval techniques pose limitations in retrieving ambiguous sound collections (ie. sounds with no identifiable origin, foley sounds, synthesized sound effects, abstract sounds) due to the difficulties in textual descriptions and the complex psychoacoustic nature of the sound. Early psychoacoustical studies propose perceptual acoustical qualities as an effective way of describing these category of sounds [1]. In Music Information Retrieval (MIR) studies, this problem were mostly studied and explored in context of content-based audio retrieval. However, we observed that most of the commercial available systems in the market neither integrated advanced content-based sound descriptions nor the visualization and interface design approaches evolved in the last years.
Our research was mainly aimed to investigate two things; 1. Development of audio retrieval system incorporating high level timbral features as search parameters. 2. Investigate user-centered approach in integrating these features into audio production pipelines using expert-user studies. In this project, We present an prototype which is similar to traditional sound browsers (list-based browsing) with an added functionality of filtering and ranking sounds by perceptual timbral features such as brightness, depth, roughness and hardness. Our main focus was on the retrieval process by timbral features. Inspiring from the recent focus on user-centered systems ([2], [3]) in the MIR community, in-depth interviews and qualitative evaluation of the system were conducted with expert-user in order to identify the underlying problems. Our studies observed the potential applications of high-level perceptual timbral features in audio production pipelines using a probe system and expert-user studies. We also outlined future guidelines and possible improvements to the system from the outcomes of this research
Conceptual Representations for Computational Concept Creation
Computational creativity seeks to understand computational mechanisms that can be characterized as creative. The creation of new concepts is a central challenge for any creative system. In this article, we outline different approaches to computational concept creation and then review conceptual representations relevant to concept creation, and therefore to computational creativity. The conceptual representations are organized in accordance with two important perspectives on the distinctions between them. One distinction is between symbolic, spatial and connectionist representations. The other is between descriptive and procedural representations. Additionally, conceptual representations used in particular creative domains, such as language, music, image and emotion, are reviewed separately. For every representation reviewed, we cover the inference it affords, the computational means of building it, and its application in concept creation.Peer reviewe
Recommended from our members
Modality Bridging and Unified Multimodal Understanding
Multimodal understanding is a vast realm of research that covers multiple disciplines. Hence, it requires a correct understanding of the goal in a generic multimodal understanding research study. The definition of modalities of interest is important since each modality requires its own considerations. On the other hand, it is important to understand whether these modalities should be complimentary to each other or have significant overlap in terms of the information they carry. For example, most of the modalities in biological signals do not have significant overlap with each other, yet they can be used together to improve the range and accuracy of diagnoses. An extreme example of two modalities that have significant overlap is an instructional video and its corresponding instructions in detailed texts. In this study, we focus on multimedia, which includes image, video, audio, and text about real world everyday events, mostly focused on human activities.
We narrow our study to the important direction of common space learning since we want to bridge between different modalities using the overlap that a given pair of modalities have.There are multiple applications which require a strong common space to be able to perform desirably. We choose image-text grounding, video-audio autoencoding, video-conditioned text generation, and video-audio-text common space learning for semantic encoding. We examine multiple ideas in each direction and achieve important conclusions. In image-text grounding, we learn that different levels of semantic representations are helpful to achieve a thorough common space that is representative of two modalities. In video-audio autoencoding, we observe that reconstruction objectives can help with a representative common space. Moreover, there is an inherent problem when dealing with multiple modalities at the same time, and that is different levels of granularity. For example, the sampling rate and granularity of video is much higher and more complicated compared to audio. Hence, it might be more helpful to find a more semantically abstracted common space which does not carry redundant details, especially considering the temporal aspect of video and audio modalities. In video-conditioned text generation, we examine the possibility of encoding a video sequence using a Transformer (and later decoding the captions using a Transformer decoder). We further explore the possibility of learning latent states for storing real-world concepts without supervision.
Using the observations from these three directions, we propose a unified pipeline based on the Transformer architecture to examine whether it is possible to train a (true) unified pipeline on raw multimodal data without supervision in an end-to-end fashion. This pipeline eliminates ad-hoc feature extraction methods and is independent of any previously trained network, making it simpler and easier to use. Furthermore, since it only utilizes one architecture, which enables us to move towards even more simplicity. Hence, we take an ambitious step forward and further unify this pipeline by sharing only one backbone among four major modalities: image, video, audio, and text. We show that it is not only possible to achieve this goal, but we further show the inherent benefits of such pipeline. We propose a new research direction under multimodal understanding and that is Unified Multimodal Understanding. This study is the first that examines this idea and further pushes its limit by scaling up to multiple tasks, modalities, and datasets.
In a nutshell, we examine different possibilities for bridging between a pair of modalities in different applications and observe several limitations and propose solutions for them. Using these observations, we provide a unified and strong pipeline for learning a common space which could be used for many applications. We show that our approaches perform desirably and significantly outperform state-of-the-art in different downstream tasks. We set a new baseline with competitive performance for our proposed research direction, Unified Multimodal Understanding
Multi-modal Transformer for Video Retrieval
International audienceThe task of retrieving video content relevant to natural language queries plays a critical role in effectively handling internet-scale datasets. Most of the existing methods for this caption-to-video retrieval problem do not fully exploit cross-modal cues present in video. Furthermore, they aggregate per-frame visual features with limited or no temporal information. In this paper, we present a multi-modal transformer to jointly encode the different modalities in video, which allows each of them to attend to the others. The transformer architecture is also leveraged to encode and model the temporal information. On the natural language side, we investigate the best practices to jointly optimize the language embedding together with the multi-modal transformer. This novel framework allows us to establish state-of-the-art results for video retrieval on three datasets. More details are available at http://thoth.inrialpes.fr/research/MMT
- …