
    Prompting Visual-Language Models for Dynamic Facial Expression Recognition

    This paper presents DFER-CLIP, a novel visual-language model based on CLIP and designed for in-the-wild Dynamic Facial Expression Recognition (DFER). The proposed DFER-CLIP consists of a visual part and a textual part. For the visual part, a temporal model consisting of several Transformer encoders is introduced on top of the CLIP image encoder to extract temporal facial expression features, and the final feature embedding is obtained as a learnable "class" token. For the textual part, we use as inputs textual descriptions of the facial behaviour related to the classes (facial expressions) we are interested in recognising; those descriptions are generated using large language models such as ChatGPT. This is in contrast to works that use only the class names, and it captures the relationships between the expressions more accurately. Alongside the textual descriptions, we introduce a learnable token that helps the model learn relevant context information for each expression during training. Extensive experiments demonstrate the effectiveness of the proposed method and show that DFER-CLIP achieves state-of-the-art results compared with current supervised DFER methods on the DFEW, FERV39k, and MAFW benchmarks.
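    As a rough illustration of the visual branch described above, the PyTorch sketch below runs a small Transformer encoder with a learnable "class" token over pre-extracted per-frame CLIP embeddings. The dimensions, depth, and names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): temporal modelling over
# pre-extracted CLIP frame embeddings with a learnable "class" token.
import torch
import torch.nn as nn

class TemporalHead(nn.Module):
    def __init__(self, dim=512, depth=2, heads=8, max_frames=32):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))         # learnable class token
        self.pos_embed = nn.Parameter(torch.zeros(1, max_frames + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, frame_feats):                                    # (B, T, dim) CLIP features
        B, T, _ = frame_feats.shape
        x = torch.cat([self.cls_token.expand(B, -1, -1), frame_feats], dim=1)
        x = self.encoder(x + self.pos_embed[:, : T + 1])
        return x[:, 0]                                                 # video-level embedding

# Example: 4 videos, 16 frames each, 512-d CLIP features per frame.
video_emb = TemporalHead()(torch.randn(4, 16, 512))                    # -> (4, 512)
```

    At recognition time, this video-level embedding would be matched (e.g. by cosine similarity, as in CLIP) against the text encoder's embeddings of the LLM-generated class descriptions together with the learnable context token.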

    TARN: Temporal Attentive Relation Network for Few-Shot and Zero-Shot Action Recognition

    In this paper we propose a novel Temporal Attentive Relation Network (TARN) for the problems of few-shot and zero-shot action recognition. At the heart of our network is a meta-learning approach that learns to compare representations of variable temporal length, that is, either two videos of different lengths (in the case of few-shot action recognition) or a video and a semantic representation such as a word vector (in the case of zero-shot action recognition). In contrast to other works in few-shot and zero-shot action recognition, we a) utilise attention mechanisms to perform temporal alignment, and b) learn a deep distance measure on the aligned representations at the video-segment level. We adopt an episode-based training scheme and train our network in an end-to-end manner. The proposed method does not require any fine-tuning in the target domain or the maintenance of additional representations, as is the case with memory networks. Experimental results show that the proposed architecture outperforms the state of the art in few-shot action recognition and achieves competitive results in zero-shot action recognition.
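    The sketch below illustrates the two ingredients highlighted above: attention-based temporal alignment between two segment sequences of different lengths, followed by a learned comparison score on the aligned representations. It is a simplified stand-in, not the paper's architecture; the dimensions and the scoring network are assumptions.

```python
# Minimal sketch (not the paper's implementation): align a query video's
# segment features to a support sequence with attention, then score the
# aligned pair with a small learned comparison network.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveRelation(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.compare = nn.Sequential(                  # learned distance on aligned segments
            nn.Linear(2 * dim, dim), nn.ReLU(),
            nn.Linear(dim, 1))

    def forward(self, query, support):                 # query: (Tq, dim), support: (Ts, dim)
        attn = F.softmax(query @ support.t() / query.shape[-1] ** 0.5, dim=-1)  # (Tq, Ts)
        aligned = attn @ support                       # support re-expressed per query segment
        pair = torch.cat([query, aligned], dim=-1)     # (Tq, 2*dim)
        return self.compare(pair).mean()               # scalar relation score

# Example episode: one query video (7 segments) vs. one support video (5 segments).
score = AttentiveRelation()(torch.randn(7, 256), torch.randn(5, 256))
```

    In the zero-shot setting, the support sequence would be replaced by a semantic representation (e.g. a word vector projected into the segment feature space).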

    Gesture-based Object Recognition using Histograms of Guiding Strokes

    Sadeghipour A, Morency L-P, Kopp S. Gesture-based Object Recognition using Histograms of Guiding Strokes. In: Bowden R, Collomosse J, Mikolajczyk K, eds. Proceedings of the British Machine Vision Conference. BMVA Press; 2012: 44.1-44.11.

    Learning Grimaces by Watching TV

    Unlike computer vision systems, which require explicit supervision, humans can learn facial expressions by observing the people in their environment. In this paper, we look at how similar capabilities could be developed in machine vision. As a starting point, we consider the problem of relating facial expressions to objectively measurable events occurring in videos. In particular, we consider a gameshow in which contestants play to win significant sums of money. We extract the events affecting the game and the corresponding facial expressions objectively and automatically from the videos, obtaining large quantities of labelled data for our study. We also develop, using benchmarks such as FER and SFEW 2.0, state-of-the-art deep neural networks for facial expression recognition, showing that pre-training on face verification data can be highly beneficial for this task. Then, we extend these models to use facial expressions to predict events in videos and to learn nameable expressions from them. The dataset and emotion recognition models are available at http://www.robots.ox.ac.uk/~vgg/data/facevalue. Comment: British Machine Vision Conference (BMVC) 2016.
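    The transfer recipe alluded to above (pre-training on face verification, then training for expression recognition) can be sketched as follows. The backbone, weight file, and class count here are placeholders for illustration, not the paper's networks or data.

```python
# Illustrative sketch only: fine-tuning a face-verification-pretrained backbone
# for facial expression recognition. The ResNet-50 stand-in and the weight path
# are hypothetical; the actual networks are those described in the paper.
import torch
import torch.nn as nn
from torchvision.models import resnet50

NUM_EXPRESSIONS = 7                           # e.g. the basic-expression classes used by FER benchmarks

backbone = resnet50(weights=None)
# backbone.load_state_dict(torch.load("face_verification_pretrained.pth"))   # hypothetical checkpoint
backbone.fc = nn.Linear(backbone.fc.in_features, NUM_EXPRESSIONS)            # new expression head

optimizer = torch.optim.SGD(backbone.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# One fine-tuning step on a (dummy) batch of face crops and expression labels.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, NUM_EXPRESSIONS, (8,))
loss = criterion(backbone(images), labels)
loss.backward()
optimizer.step()
```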

    Using illumination estimated from silhouettes to carve surface details on visual hull

    This paper deals with the problems of scene illumination estimation and shape recovery from an image sequence of a smooth, textureless object. A novel method is introduced that exploits surface points estimated from the silhouettes to recover the scene illumination. These surface points are acquired by a dual-space approach and filtered according to their rank errors. The selected surface points allow a direct closed-form solution for the illumination. In the mesh-evolution step, an algorithm for optimizing the visual-hull mesh is developed. It evolves the mesh by iteratively estimating both the surface normal and the depth that maximize the photometric consistency across the sequence. Compared with previous work, which optimizes the mesh by estimating the surface normal only, the proposed method shows better convergence and can recover finer surface details, especially when concavities are deep and sharp.
    The British Machine Vision Conference (BMVC) 2008, Leeds, U.K., 1-4 September 2008. In Proceedings of the British Machine Vision Conference, 2008, v. 2, p. 895-90
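    The closed-form illumination step described above can be illustrated with a simple least-squares formulation: given surface points with normals and their observed intensities, a Lambertian, single-distant-light model reduces illumination estimation to a linear system. These modelling assumptions are made here for illustration and are not necessarily the paper's exact lighting model.

```python
# Minimal sketch: closed-form illumination from oriented surface points,
# assuming Lambertian shading under a single distant light (an illustrative
# simplification of the closed-form solution mentioned in the abstract).
import numpy as np

def estimate_light(normals, intensities):
    """normals: (N, 3) unit surface normals; intensities: (N,) observed shading.
    Solves intensities ~= normals @ l for the albedo-scaled light vector l."""
    l, *_ = np.linalg.lstsq(normals, intensities, rcond=None)
    strength = np.linalg.norm(l)
    return l / strength, strength              # light direction, albedo * light intensity

# Synthetic check: recover a known light from noiseless Lambertian shading.
rng = np.random.default_rng(0)
n = rng.normal(size=(100, 3))
n /= np.linalg.norm(n, axis=1, keepdims=True)
true_dir = np.array([0.3, 0.5, 0.81])
true_dir /= np.linalg.norm(true_dir)
lit = n @ true_dir > 0                         # keep only points facing the light
direction, strength = estimate_light(n[lit], 0.8 * (n[lit] @ true_dir))
```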

    One-Shot Learning for Semantic Segmentation

    Low-shot learning methods for image classification support learning from sparse data. We extend these techniques to support dense semantic image segmentation. Specifically, we train a network that, given a small set of annotated images, produces parameters for a Fully Convolutional Network (FCN). We use this FCN to perform dense pixel-level prediction on a test image for the new semantic class. Our architecture shows a 25% relative meanIoU improvement over the best baseline methods for one-shot segmentation on unseen classes in the PASCAL VOC 2012 dataset and is at least 3 times faster. Comment: To appear in the proceedings of the British Machine Vision Conference (BMVC) 2017. The code is available at https://github.com/lzzcd001/OSLS
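    The core idea (a conditioning branch that turns the annotated support image into parameters applied densely to the query) can be sketched as below. The pooling scheme, dimensions, and the 1x1-classifier parameterisation are illustrative assumptions, not the released implementation linked above.

```python
# Minimal sketch (not the released code): a conditioning branch maps a support
# image's masked features to the weights of a 1x1 pixel classifier, which is
# then applied to dense features of the query image.
import torch
import torch.nn as nn

class OneShotSegHead(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        self.to_params = nn.Linear(feat_dim, feat_dim + 1)     # weight vector + bias

    def forward(self, support_feat, support_mask, query_feat):
        # support_feat/query_feat: (B, C, H, W); support_mask: (B, 1, H, W) in {0, 1}
        masked = support_feat * support_mask
        pooled = masked.sum(dim=(2, 3)) / support_mask.sum(dim=(2, 3)).clamp(min=1.0)
        params = self.to_params(pooled)                         # (B, C + 1)
        w, b = params[:, :-1], params[:, -1]
        logits = torch.einsum("bchw,bc->bhw", query_feat, w) + b[:, None, None]
        return logits                                           # per-pixel foreground score

# Example with random tensors standing in for an FCN backbone's dense features.
head = OneShotSegHead()
logits = head(torch.randn(2, 256, 32, 32),
              (torch.rand(2, 1, 32, 32) > 0.5).float(),
              torch.randn(2, 256, 32, 32))
```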