
    Multi-modal Video Content Understanding

    Video is an important format of information. Humans use videos for a variety of purposes such as entertainment, education, communication, information sharing, and capturing memories. To date, humankind has accumulated a colossal amount of freely available video material online, and manual processing at this scale is simply impossible. To this end, many research efforts have been dedicated to the automatic processing of video content. At the same time, human perception of the world is multi-modal: a human uses multiple senses to understand the environment, objects, and their interactions. When watching a video, we perceive the content via both the audio and visual modalities, and removing one of them results in a less immersive experience. Similarly, if the information in the two modalities does not correspond, it may create a sense of dissonance. Therefore, joint modelling of multiple modalities (such as audio, visual, and text) within one model is an active research area. In the last decade, the fields of automatic video understanding and multi-modal modelling have seen exceptional progress due to the ubiquitous success of deep learning models and, more recently, transformer-based architectures in particular. Our work draws on these advances and pushes the state of the art of multi-modal video understanding forward.
    Applications of automatic multi-modal video processing are broad and exciting. For instance, a content-based textual description of a video (video captioning) may allow a visually or hearing impaired person to understand the content and thus engage in richer social interactions. However, prior work in video content description relies on the visual input alone, missing vital information that is only available in the audio stream. To this end, we proposed two novel multi-modal transformer models that encode audio and visual interactions simultaneously. First, we introduced a late-fusion multi-modal transformer that is highly modular and allows the processing of an arbitrary set of modalities. Second, we presented an efficient bi-modal transformer that encodes audio-visual cues starting from the lower network layers, yielding richer audio-visual features and, as a result, stronger performance.
    Another application is automatic visually guided sound generation, which might help professional sound (foley) designers, who spend hours searching a database for audio relevant to a movie scene. Previous approaches to automatic conditional audio generation support only one class (e.g. “dog barking”), while real-life applications may require generation for hundreds of classes, and training one model per class can be infeasible. To bridge this gap, we introduced a novel two-stage model that first efficiently encodes audio as a set of codebook vectors (i.e. it learns audio “building blocks”) and then learns to sample these vectors given visual inputs to produce an audio track relevant to that input. Moreover, we studied the automatic evaluation of conditional audio generation models and proposed metrics that measure both the quality and the relevance of the generated samples.
    Finally, as video editing becomes more common among non-professionals due to the increased popularity of services such as YouTube, automatic assistance during video editing, e.g. detecting off-sync between the audio and visual tracks, is growing in demand.
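    As a rough illustration of the late-fusion design mentioned above, the sketch below encodes each modality with its own transformer and fuses the pooled features only at the end, so modalities can be added or dropped freely. It is a minimal sketch assuming PyTorch; the module names, dimensions, and mean pooling are illustrative choices, not the thesis implementation.

    import torch
    import torch.nn as nn

    class LateFusionEncoder(nn.Module):
        """Per-modality transformer encoders with late fusion of pooled features (sketch)."""
        def __init__(self, modality_dims, d_model=256, n_layers=2):
            super().__init__()
            self.proj = nn.ModuleDict(
                {name: nn.Linear(dim, d_model) for name, dim in modality_dims.items()})
            layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            self.encoders = nn.ModuleDict(
                {name: nn.TransformerEncoder(layer, n_layers) for name in modality_dims})
            self.fusion = nn.Linear(d_model * len(modality_dims), d_model)

        def forward(self, inputs):
            pooled = []
            for name in self.encoders:                    # fixed modality order
                h = self.encoders[name](self.proj[name](inputs[name]))
                pooled.append(h.mean(dim=1))              # average over time
            return self.fusion(torch.cat(pooled, dim=-1))

    # Example with pre-extracted audio and visual feature sequences (shapes assumed).
    model = LateFusionEncoder({"audio": 128, "visual": 2048})
    fused = model({"audio": torch.randn(1, 50, 128), "visual": torch.randn(1, 30, 2048)})

    Because each modality has its own encoder and they only interact in the final fusion layer, the design stays modular; the bi-modal transformer described above instead lets audio and visual tokens attend to each other from the lower layers onwards.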
    Prior work in audio-visual synchronization was devoted to solving the task on lip-syncing datasets with “dense” signals, such as interviews and presentations. In such videos, synchronization cues occur “densely” across time, and it is enough to process just a few tenths of a second to synchronize the tracks. In contrast, open-domain videos mostly have only “sparse” cues that occur just once in a seconds-long video clip (e.g. “chopping wood”). To address this, we: a) proposed a novel dataset with “sparse” sounds; b) designed a model that efficiently encodes seconds-long audio-visual tracks into a small set of “learnable selectors”, which are then used for synchronization. In addition, we explored the temporal artefacts that common audio and video compression algorithms leave in data streams, and, to prevent a model from learning to rely on these artefacts, we introduced a list of recommendations on how to mitigate them.
    This thesis provides the details of the proposed methodologies as well as a comprehensive overview of advances in the relevant fields of multi-modal video understanding. In addition, we discuss potential research directions that can bring significant contributions to the field.
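    To make the “learnable selectors” idea more concrete, the sketch below cross-attends a handful of learned query vectors over the concatenated audio and visual token sequences and classifies a discrete temporal offset from their outputs. It is only a sketch under assumed settings (feature width, number of selectors, and the offset grid are illustrative), not the thesis model.

    import torch
    import torch.nn as nn

    class SelectorSyncHead(nn.Module):
        """Summarise long audio-visual tracks with a few learned 'selector' queries (sketch)."""
        def __init__(self, d_model=256, n_selectors=8, n_offsets=21):
            super().__init__()
            self.selectors = nn.Parameter(torch.randn(n_selectors, d_model))
            layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
            self.decoder = nn.TransformerDecoder(layer, num_layers=2)
            self.classifier = nn.Linear(n_selectors * d_model, n_offsets)

        def forward(self, audio_tokens, visual_tokens):
            # Both inputs: (batch, time, d_model), already projected to a shared width.
            memory = torch.cat([audio_tokens, visual_tokens], dim=1)
            queries = self.selectors.unsqueeze(0).expand(memory.size(0), -1, -1)
            summary = self.decoder(queries, memory)       # (batch, n_selectors, d_model)
            return self.classifier(summary.flatten(1))    # logits over offset classes

    # A seconds-long clip becomes a few hundred tokens per modality, yet only
    # n_selectors vectors are carried into the offset classifier.
    logits = SelectorSyncHead()(torch.randn(2, 500, 256), torch.randn(2, 125, 256))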

    Multimodal and Embodied Learning with Language as the Anchor

    Since most worldly phenomena can be expressed via language, language is a crucial medium for transferring information and integrating multiple information sources. For example, humans can describe what they see, hear, and feel, and can also explain how they move with words. Conversely, humans can imagine scenes, sounds, and feelings, and move their bodies from language descriptions. Therefore, language plays an important role in solving machine learning (ML) and artificial intelligence (AI) problems with multimodal input sources. This thesis studies how different modalities can be integrated with language in multimodal learning settings, as follows.
    First, we explore integrating external information from a textual description of an image into a visual question answering system, which incorporates the key words/phrases of paragraph captions in semi-symbolic form to make the alignment between features easier. We extend this direction to a video question answering task: we employ dense captions, which provide object-level descriptions of an image, to help localize the key frames in a video clip for answering a question.
    Next, we build benchmarks to evaluate embodied agents that perform tasks according to natural language instructions from humans. We introduce a new instruction-following navigation and object assembly system, called ArraMon, in which agents follow natural language instructions to collect an object and put it in a target location, requiring them to deeply understand referring expressions and the concept of direction from an egocentric perspective. We also suggest a new task setup for the Cooperative Vision-and-Dialog Navigation (CVDN) dataset: we analyze the scoring behaviors of models, identify issues with the existing Navigation from Dialog History (NDH) task, and propose a more realistic and challenging task setup, called NDH-Full, which better serves the purpose of the CVDN dataset.
    Finally, we explore AI assistant systems that help humans with different tasks. We introduce a new correctional captioning dataset on human body pose, called FixMyPose, to encourage the ML/AI community to build guidance systems that require models to distinguish different levels of pose difference in order to describe the desired pose change. We also introduce a new conversational image search and editing assistant system, called CAISE, in which an agent helps a user search for images and edit them by holding a conversation.
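    As a toy illustration of how dense captions can point to key frames, the snippet below ranks frames by word overlap between the question and each frame's dense captions. It is deliberately simplified (bag-of-words overlap in place of learned alignment), and the function and variable names are illustrative, not taken from the thesis systems.

    from collections import Counter

    def rank_frames_by_dense_captions(question, frame_captions, top_k=3):
        """Return indices of the frames whose dense captions best match the question."""
        q_words = Counter(question.lower().split())
        scored = []
        for idx, captions in enumerate(frame_captions):
            cap_words = Counter(" ".join(captions).lower().split())
            overlap = sum((q_words & cap_words).values())   # shared word counts
            scored.append((overlap, idx))
        return [idx for _, idx in sorted(scored, reverse=True)[:top_k]]

    captions_per_frame = [
        ["a man holds a ball", "a cat sits on the grass"],
        ["a man throws a ball", "the dog runs after it"],
        ["a car is parked on the street"],
    ]
    print(rank_frames_by_dense_captions("what does the man throw to the dog",
                                        captions_per_frame, top_k=2))   # -> [1, 0]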