39 research outputs found

    Self-supervised learning of a facial attribute embedding from video

    We propose a self-supervised framework for learning facial attributes by simply watching videos of a human face speaking, laughing, and moving over time. To perform this task, we introduce a network, Facial Attributes-Net (FAb-Net), that is trained to embed multiple frames from the same video face-track into a common low-dimensional space. With this approach, we make three contributions: first, we show that the network can leverage information from multiple source frames by predicting confidence/attention masks for each frame; second, we demonstrate that using a curriculum learning regime improves the learned embedding; finally, we demonstrate that the network learns a meaningful face embedding that encodes information about head pose, facial landmarks and facial expression, i.e. facial attributes, without having been supervised with any labelled data. Our approach is comparable or superior to state-of-the-art self-supervised methods on these tasks and approaches the performance of supervised methods. Comment: To appear in BMVC 2018. Supplementary material can be found at http://www.robots.ox.ac.uk/~vgg/research/unsup_learn_watch_faces/fabnet.htm
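
    As an illustration of the core mechanism described above, the sketch below (in PyTorch) embeds several source frames from a face-track, predicts a scalar confidence per frame, and fuses the embeddings with softmax attention. The encoder layers, embedding size, and fusion details are illustrative assumptions, not the published FAb-Net architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameEncoder(nn.Module):
    def __init__(self, embed_dim=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.embed = nn.Linear(64, embed_dim)   # low-dimensional face embedding
        self.confidence = nn.Linear(64, 1)      # scalar confidence per source frame

    def forward(self, x):                       # x: (B, 3, H, W)
        h = self.features(x).flatten(1)         # (B, 64)
        return self.embed(h), self.confidence(h)

def fuse_source_frames(encoder, frames):
    """frames: (B, N, 3, H, W) -> attention-weighted embedding of shape (B, D)."""
    B, N = frames.shape[:2]
    emb, conf = encoder(frames.flatten(0, 1))   # run all frames through the encoder
    emb = emb.view(B, N, -1)
    attn = F.softmax(conf.view(B, N), dim=1)    # confidences -> attention over frames
    return (attn.unsqueeze(-1) * emb).sum(dim=1)

if __name__ == "__main__":
    encoder = FrameEncoder()
    fused = fuse_source_frames(encoder, torch.randn(2, 4, 3, 64, 64))
    print(fused.shape)                          # torch.Size([2, 256])
```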

    Helping hands: an object-aware ego-centric video recognition model

    We introduce an object-aware decoder for improving the performance of spatio-temporal representations on egocentric videos. The key idea is to enhance object-awareness during training by tasking the model to predict hand positions, object positions, and the semantic label of the objects using paired captions when available. At inference time the model only requires RGB frames as inputs, and is able to track and ground objects (although it has not been trained explicitly for this). We demonstrate the performance of the object-aware representations learned by our model by: (i) evaluating it for strong transfer, i.e. through zero-shot testing, on a number of downstream video-text retrieval and classification benchmarks; and (ii) using the representations learned as input for long-term video understanding tasks (e.g. Episodic Memory in Ego4D). In all cases the performance improves over the state of the art, even compared to networks trained with far larger batch sizes. We also show that by using noisy image-level detections as pseudo-labels in training, the model learns to provide better bounding boxes using video consistency, as well as grounding the words in the associated text descriptions. Overall, we show that the model can act as a drop-in replacement for an ego-centric video model to improve performance through visual-text grounding.
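
    A minimal sketch of the training-time idea, assuming a placeholder backbone and illustrative head shapes (none of which are the paper's actual components): auxiliary heads predict hand boxes, object boxes, and object semantic labels during training, while inference uses only the backbone representation computed from RGB frames.

```python
import torch
import torch.nn as nn

class ObjectAwareModel(nn.Module):
    def __init__(self, feat_dim=512, num_object_classes=300, max_boxes=4):
        super().__init__()
        # Placeholder backbone; the real model is a spatio-temporal video encoder.
        self.backbone = nn.Sequential(nn.Flatten(1), nn.LazyLinear(feat_dim), nn.ReLU())
        self.hand_head = nn.Linear(feat_dim, 2 * 4)                # two hands x (x, y, w, h)
        self.box_head = nn.Linear(feat_dim, max_boxes * 4)         # object bounding boxes
        self.label_head = nn.Linear(feat_dim, num_object_classes)  # object semantic labels

    def forward(self, clips, with_aux=False):
        feat = self.backbone(clips)                                # (B, feat_dim)
        if not with_aux:                                           # inference: RGB frames only
            return feat
        return feat, self.hand_head(feat), self.box_head(feat), self.label_head(feat)

model = ObjectAwareModel()
clips = torch.randn(2, 8, 3, 32, 32)                               # toy (B, T, C, H, W) clips
feat, hands, boxes, labels = model(clips, with_aux=True)
print(feat.shape, hands.shape, boxes.shape, labels.shape)
```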

    Using projective invariants for constant time library indexing in model based vision

    Projectively invariant shape descriptors allow fast indexing into model libraries, because recognition proceeds without reference to object pose. This paper describes progress in building a large model-based vision system which uses many projectively invariant descriptors. We give a brief account of these descriptors and then describe the recognition system, giving examples of the invariant techniques working on real images. We demonstrate the ease of model acquisition in our system, where models are generated directly from images, and we demonstrate fast recognition without determining object pose or camera parameters.
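
    A worked example of the kind of quantity involved: the cross-ratio of four collinear points is unchanged by any projective transformation, which is what allows indexing without recovering object pose or camera parameters. The cross-ratio is a textbook invariant used here purely for illustration; it is not necessarily one of the paper's specific shape descriptors.

```python
import numpy as np

def cross_ratio(a, b, c, d):
    """Cross-ratio (AC * BD) / (BC * AD) of four collinear points, using Euclidean
    distances (valid here because the points keep their order along the line)."""
    def dist(p, q):
        return np.linalg.norm(np.asarray(q, float) - np.asarray(p, float))
    return (dist(a, c) * dist(b, d)) / (dist(b, c) * dist(a, d))

def apply_homography(H, p):
    x, y, w = H @ np.array([p[0], p[1], 1.0])
    return np.array([x / w, y / w])

# Four collinear points on the line y = 2x + 1.
points = [np.array([t, 2.0 * t + 1.0]) for t in (0.0, 1.0, 3.0, 7.0)]
H = np.array([[1.2,   0.3,   5.0],
              [-0.1,  0.9,   2.0],
              [0.001, 0.002, 1.0]])        # an arbitrary projective transformation
warped = [apply_homography(H, p) for p in points]

print(cross_ratio(*points))                # ~1.2857 before the transformation ...
print(cross_ratio(*warped))                # ... and the same value after it
```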

    AutoAD II: The Sequel - who, when, and what in movie audio description

    Audio Description (AD) is the task of generating descriptions of visual content, at suitable time intervals, for the benefit of visually impaired audiences. For movies, this presents notable challenges -- AD must occur only during existing pauses in dialogue, should refer to characters by name, and ought to aid understanding of the storyline as a whole. To this end, we develop a new model for automatically generating movie AD, given CLIP visual features of the frames, the cast list, and the temporal locations of the speech; addressing all three of the 'who', 'when', and 'what' questions: (i) who -- we introduce a character bank consisting of the character's name, the actor that played the part, and a CLIP feature of their face, for the principal cast of each movie, and demonstrate how this can be used to improve naming in the generated AD; (ii) when -- we investigate several models for determining whether an AD should be generated for a time interval or not, based on the visual content of the interval and its neighbours; and (iii) what -- we implement a new vision-language model for this task, that can ingest the proposals from the character bank, whilst conditioning on the visual features using cross-attention, and demonstrate how this improves over previous architectures for AD text generation in an apples-to-apples comparison.
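
    To make the 'who' component concrete, the sketch below shows one plausible shape for a character bank: an entry per principal cast member holding the character name, the actor, and a CLIP face feature, with a nearest-neighbour lookup to name a detected face. The lookup and the similarity threshold are illustrative assumptions rather than the paper's exact mechanism for injecting names into the generated AD.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class CharacterEntry:
    character: str
    actor: str
    face_feature: np.ndarray            # L2-normalised CLIP embedding of the actor's face

class CharacterBank:
    def __init__(self, entries):
        self.entries = entries
        self.features = np.stack([e.face_feature for e in entries])

    def name_face(self, query_feature, threshold=0.25):
        """Return the character whose face feature is most similar to the query."""
        sims = self.features @ query_feature          # cosine similarity for unit vectors
        best = int(np.argmax(sims))
        return self.entries[best].character if sims[best] > threshold else None

def unit(v):
    return v / np.linalg.norm(v)

bank = CharacterBank([
    CharacterEntry("Rick Blaine", "Humphrey Bogart", unit(np.random.randn(512))),
    CharacterEntry("Ilsa Lund", "Ingrid Bergman", unit(np.random.randn(512))),
])
print(bank.name_face(unit(np.random.randn(512))))     # a character name, or None
```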

    A sound approach: using large language models to generate audio descriptions for egocentric text-audio retrieval

    Video databases from the internet are a valuable source of text-audio retrieval datasets. However, given that the sound and vision streams represent different "views" of the data, treating visual descriptions as audio descriptions is far from optimal. Even if audio class labels are present, they are commonly not very detailed, making them unsuited for text-audio retrieval. To exploit relevant audio information from video-text datasets, we introduce a methodology for generating audio-centric descriptions using Large Language Models (LLMs). In this work, we consider the egocentric video setting and propose three new text-audio retrieval benchmarks based on the EpicMIR and EgoMCQ tasks, and on the EpicSounds dataset. Our approach for obtaining audio-centric descriptions gives significantly higher zero-shot performance than using the original visual-centric descriptions. Furthermore, we show that, using the same prompts, we can successfully employ LLMs to improve retrieval on EpicSounds, compared to using the original audio class labels of the dataset. Finally, we confirm that LLMs can be used to determine the difficulty of identifying the action associated with a sound.
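
    A sketch of the description-conversion step, under the assumption of a generic chat-completion interface: a visual-centric caption is rewritten by an LLM into an audio-centric description for retrieval. The prompt wording and the call_llm() helper are hypothetical placeholders; the paper's exact prompts are not reproduced here.

```python
def build_audio_prompt(visual_caption: str) -> str:
    """Assemble a prompt asking the LLM to describe what would be heard, not seen."""
    return (
        "The following sentence describes what is seen in an egocentric video:\n"
        f'"{visual_caption}"\n'
        "Rewrite it as a short description of what would most likely be heard, "
        "mentioning only sounds (actions, materials, speech), not visual details."
    )

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in: wire this to whichever LLM client you use.
    raise NotImplementedError

if __name__ == "__main__":
    caption = "A person chops an onion on a wooden board"
    print(build_audio_prompt(caption))
    # audio_caption = call_llm(build_audio_prompt(caption))
```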

    Reshaping the future of Portuguese azulejo patterns

    This paper introduces a new approach to the inventory and cataloguing of azulejo patterns found in Portuguese buildings. It uses computer-vision-based software tools for automatic search and matching of azulejo patterns, thereby improving the scalability and speed of existing cataloguing methodologies. The online catalogue of azulejo patterns is called Az Infinitum (Azulejo Referencing and Indexation System), a publicly accessible online portal suitable for both researchers and the general public who are interested in exploring and understanding this cultural heritage of Portugal. The effectiveness of the catalogue as a research support tool is demonstrated using a case study based on the Marvila pattern (i.e. P-17-00999). The online catalogue has also inspired the development of an engaging application, called Azulejar, which allows one to create new patterns or understand the mathematical process behind existing azulejo patterns. This application has the potential to become an effective educational tool, inspiring everyone to explore and understand the science behind the beauty of azulejo patterns.
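
    As a rough illustration of how automatic search and matching of tile patterns can work, the sketch below scores a query photograph against catalogued pattern images using generic ORB local features from OpenCV and returns the best match. This is a stand-in for the idea only; the actual Az Infinitum matching pipeline is not reproduced here, and the file names are placeholders.

```python
import cv2

def match_score(query_path, pattern_path, ratio=0.75):
    """Count ORB feature matches that pass Lowe's ratio test."""
    orb = cv2.ORB_create(nfeatures=1000)
    query = cv2.imread(query_path, cv2.IMREAD_GRAYSCALE)
    pattern = cv2.imread(pattern_path, cv2.IMREAD_GRAYSCALE)
    _, des_q = orb.detectAndCompute(query, None)
    _, des_p = orb.detectAndCompute(pattern, None)
    if des_q is None or des_p is None:
        return 0
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    pairs = matcher.knnMatch(des_q, des_p, k=2)
    # Keep matches that are clearly better than their runner-up.
    return sum(1 for p in pairs if len(p) == 2 and p[0].distance < ratio * p[1].distance)

def best_pattern(query_path, catalogue_paths):
    """Return the catalogued pattern image that best matches the query photo."""
    return max(catalogue_paths, key=lambda path: match_score(query_path, path))

# Usage (placeholder file names):
# print(best_pattern("facade_photo.jpg", ["P-17-00999.png", "P-17-00123.png"]))
```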