
    Learning Visual Attributes

    We present a probabilistic generative model of visual attributes, together with an efficient learning algorithm. Attributes are visual qualities of objects, such as ‘red’, ‘striped’, or ‘spotted’. The model sees attributes as patterns of image segments, repeatedly sharing some characteristic properties. These can be any combination of appearance, shape, or the layout of segments within the pattern. Moreover, attributes with general appearance are taken into account, such as the pattern of alternation of any two colors which is characteristic of stripes. To enable learning from unsegmented training images, the model is learnt discriminatively, by optimizing a likelihood ratio. As demonstrated in the experimental evaluation, our model can learn in a weakly supervised setting and encompasses a broad range of attributes. We show that attributes can be learnt starting from a text query to Google image search, and can then be used to recognize the attribute and determine its spatial extent in novel real-world images.
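    As an illustration of the likelihood-ratio idea, the sketch below scores image-segment features by the log ratio between an attribute model and a background model, both fitted as Gaussians. This is a minimal toy with hypothetical variable names, not the paper's actual generative model or learning algorithm.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood_ratio(features, attr_mean, attr_cov, bg_mean, bg_cov):
    """Per-segment log p(f | attribute) - log p(f | background)."""
    attr = multivariate_normal(attr_mean, attr_cov)
    bg = multivariate_normal(bg_mean, bg_cov)
    return attr.logpdf(features) - bg.logpdf(features)

# Toy data: segments from images returned by the text query vs. generic images.
rng = np.random.default_rng(0)
query_segments = rng.normal(loc=[0.8, 0.1, 0.1], scale=0.05, size=(200, 3))  # e.g. 'red' RGB stats
generic_segments = rng.uniform(size=(1000, 3))

# Fit the attribute model on query segments and the background model on generic ones.
attr_mean, attr_cov = query_segments.mean(0), np.cov(query_segments.T) + 1e-6 * np.eye(3)
bg_mean, bg_cov = generic_segments.mean(0), np.cov(generic_segments.T) + 1e-6 * np.eye(3)

# Segments with a high log-ratio are the ones the attribute model explains best.
scores = log_likelihood_ratio(query_segments, attr_mean, attr_cov, bg_mean, bg_cov)
print("mean log-ratio on query segments:", scores.mean())
```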

    Taking the bite out of automated naming of characters in TV video

    We investigate the problem of automatically labelling appearances of characters in TV or film material with their names. This is tremendously challenging due to the huge variation in imaged appearance of each character and the weakness and ambiguity of available annotation. However, we demonstrate that high precision can be achieved by combining multiple sources of information, both visual and textual. The principal novelties that we introduce are: (i) automatic generation of time-stamped character annotation by aligning subtitles and transcripts; (ii) strengthening the supervisory information by identifying when characters are speaking. In addition, we incorporate complementary cues of face matching and clothing matching to propose common annotations for face tracks, and consider choices of classifier which can potentially correct errors made in the automatic extraction of training data from the weak textual annotation. Results are presented on episodes of the TV series “Buffy the Vampire Slayer”.
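    The subtitle–transcript alignment step can be sketched as follows: subtitles carry time stamps but no speaker names, the transcript carries names but no times, and fuzzy text matching transfers names onto time intervals. The data, threshold, and helper names are made up for illustration; this is not the authors' alignment code.

```python
from difflib import SequenceMatcher

subtitles = [
    {"start": 12.0, "end": 14.5, "text": "We have to stop him tonight."},
    {"start": 15.0, "end": 16.8, "text": "Are you sure about this?"},
]
transcript = [
    {"speaker": "BUFFY", "text": "We have to stop him tonight"},
    {"speaker": "WILLOW", "text": "Are you sure about this"},
]

def align(subtitles, transcript, min_ratio=0.8):
    """Assign a speaker name to each subtitle by best fuzzy match against the transcript."""
    labelled = []
    for sub in subtitles:
        ratios = [(SequenceMatcher(None, sub["text"].lower(), line["text"].lower()).ratio(), line)
                  for line in transcript]
        ratio, best = max(ratios, key=lambda r: r[0])
        if ratio >= min_ratio:                      # keep only confident matches
            labelled.append({**sub, "speaker": best["speaker"]})
    return labelled

for entry in align(subtitles, transcript):
    print(f'{entry["start"]:.1f}-{entry["end"]:.1f}s  {entry["speaker"]}')
```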

    "'Who are you?' - Learning person specific classifiers from video"

    We investigate the problem of automatically labelling faces of characters in TV or movie material with their names, using only weak supervision from automatically aligned subtitle and script text. Our previous work (Everingham et al. [8]) demonstrated promising results on the task, but the coverage of the method (proportion of video labelled) and generalization was limited by a restriction to frontal faces and nearest neighbour classification. In this paper we build on that method, extending the coverage greatly by the detection and recognition of characters in profile views. In addition, we make the following contributions: (i) seamless tracking, integration and recognition of profile and frontal detections, and (ii) a character-specific multiple kernel classifier which is able to learn the features best able to discriminate between the characters. We report results on seven episodes of the TV series “Buffy the Vampire Slayer”, demonstrating significantly increased coverage and performance with respect to previous methods on this material.
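    The character-specific multiple kernel idea can be illustrated as learning per-character weights over a few base kernels (e.g. different facial-feature descriptors) by cross-validation, then training an SVM on the combined kernel. This is a hedged sketch with hypothetical names and a crude grid search, not the authors' MKL formulation.

```python
import numpy as np
from itertools import product
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def combine(kernels, weights):
    """Weighted sum of precomputed base kernel matrices."""
    return sum(w * K for w, K in zip(weights, kernels))

def fit_character_classifier(kernels, labels, grid=(0.0, 0.5, 1.0)):
    """Pick kernel weights by cross-validation, then fit an SVM on the combined kernel."""
    best_score, best_weights = -np.inf, None
    for weights in product(grid, repeat=len(kernels)):
        if sum(weights) == 0:
            continue
        score = cross_val_score(SVC(kernel="precomputed"),
                                combine(kernels, weights), labels, cv=3).mean()
        if score > best_score:
            best_score, best_weights = score, weights
    clf = SVC(kernel="precomputed").fit(combine(kernels, best_weights), labels)
    return clf, best_weights

# Toy usage with random descriptors; labels mark frames of the target character.
rng = np.random.default_rng(0)
face_feat, cloth_feat = rng.normal(size=(60, 16)), rng.normal(size=(60, 8))
y = rng.integers(0, 2, size=60)
kernels = [face_feat @ face_feat.T, cloth_feat @ cloth_feat.T]
clf, w = fit_character_classifier(kernels, y)
print("learned kernel weights:", w)
```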

    Self-supervised learning of a facial attribute embedding from video

    We propose a self-supervised framework for learning facial attributes by simply watching videos of a human face speaking, laughing, and moving over time. To perform this task, we introduce a network, Facial Attributes-Net (FAb-Net), that is trained to embed multiple frames from the same video face-track into a common low-dimensional space. With this approach, we make three contributions: first, we show that the network can leverage information from multiple source frames by predicting confidence/attention masks for each frame; second, we demonstrate that using a curriculum learning regime improves the learned embedding; finally, we demonstrate that the network learns a meaningful face embedding that encodes information about head pose, facial landmarks and facial expression, i.e. facial attributes, without having been supervised with any labelled data. We are comparable or superior to state-of-the-art self-supervised methods on these tasks and approach the performance of supervised methods.
    Comment: To appear in BMVC 2018. Supplementary material can be found at http://www.robots.ox.ac.uk/~vgg/research/unsup_learn_watch_faces/fabnet.htm
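    The confidence/attention idea can be sketched with a small PyTorch module that embeds several source frames and combines them with predicted per-frame weights. Layer choices, shapes, and names are assumptions for the sketch, not the released FAb-Net architecture.

```python
import torch
import torch.nn as nn

class FrameAggregator(nn.Module):
    """Combine several source-frame features into one embedding via predicted confidences."""
    def __init__(self, feat_dim=256, embed_dim=256):
        super().__init__()
        self.embed = nn.Linear(feat_dim, embed_dim)       # stand-in for the frame encoder
        self.confidence = nn.Linear(feat_dim, 1)          # per-frame confidence logit

    def forward(self, frame_features):
        # frame_features: (batch, n_frames, feat_dim)
        z = self.embed(frame_features)                                  # (batch, n_frames, embed_dim)
        w = torch.softmax(self.confidence(frame_features), dim=1)      # attention over frames
        return (w * z).sum(dim=1)                                       # (batch, embed_dim)

agg = FrameAggregator()
frames = torch.randn(4, 5, 256)           # 4 face-tracks, 5 source frames each
print(agg(frames).shape)                   # torch.Size([4, 256])
```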

    Two types of S phase precipitates in Al-Cu-Mg alloys

    Transmission electron microscopy (TEM) and differential scanning calorimetry (DSC) have been used to study S phase precipitation in an Al-4.2Cu-1.5Mg-0.6Mn-0.5Si (AA2024) and an Al-4.2Cu-1.5Mg-0.6Mn-0.08Si (AA2324) (wt-%) alloy. In DSC experiments on as-solution-treated samples, two distinct exothermic peaks are observed in the range 250 to 350°C, whereas only one peak is observed in solution-treated and subsequently stretched or cold-worked samples. Samples heated to 270°C and 400°C at a rate of 10°C/min in the DSC have been studied by TEM. The selected area diffraction patterns show that S phase precipitates with the classic orientation relationship form during the lower temperature peak and, for the solution-treated samples, that the higher temperature peak is caused by the formation of a second type of S phase precipitates, which have an orientation relationship rotated by ~4 degrees relative to the classic one. The effects of Si and cold work on the formation of the second type of S precipitates are discussed.

    Automatic face recognition for film character retrieval in feature-length films

    The objective of this work is to recognize all the frontal faces of a character in the closed world of a movie or situation comedy, given a small number of query faces. This is challenging because faces in a feature-length film are relatively uncontrolled, with a wide variability of scale, pose, illumination, and expression, and may also be partially occluded. We develop a recognition method based on a cascade of processing steps that normalize for the effects of the changing imaging environment. In particular, there are three areas of novelty: (i) we suppress the background surrounding the face, enabling the maximum area of the face to be retained for recognition rather than a subset; (ii) we include a pose refinement step to optimize the registration between the test image and face exemplar; and (iii) we use a robust distance to a sub-space to allow for partial occlusion and expression change. The method is applied and evaluated on several feature-length films. It is demonstrated that high recall rates (over 92%) can be achieved whilst maintaining good precision (over 93%).
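    A robust distance to a subspace can be sketched as an iteratively reweighted projection that down-weights pixels with large residuals, so occluded regions contribute little to the final distance. The basis, weighting function, and scale below are assumptions for illustration rather than the paper's exact formulation.

```python
import numpy as np

def robust_subspace_distance(x, mean, basis, n_iters=10, scale=0.1):
    """x: (d,) test vector; basis: (d, k) orthonormal columns of the exemplar subspace."""
    w = np.ones_like(x)
    for _ in range(n_iters):
        # weighted least-squares projection onto the subspace
        coeffs, *_ = np.linalg.lstsq(w[:, None] * basis, w * (x - mean), rcond=None)
        residual = x - (mean + basis @ coeffs)
        w = 1.0 / (1.0 + (residual / scale) ** 2)        # down-weight outlier pixels
    return np.sqrt(np.sum(w * residual ** 2) / np.sum(w))

# Toy usage: a face on the subspace with a simulated occlusion over the first pixels.
rng = np.random.default_rng(1)
d, k = 100, 5
basis, _ = np.linalg.qr(rng.normal(size=(d, k)))
mean = rng.normal(size=d)
face = mean + basis @ rng.normal(size=k)
face[:20] += 5.0                                         # partial occlusion
print(robust_subspace_distance(face, mean, basis))
```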

    Helping hands: an object-aware ego-centric video recognition model

    We introduce an object-aware decoder for improving the performance of spatio-temporal representations on egocentric videos. The key idea is to enhance object-awareness during training by tasking the model to predict hand positions, object positions, and the semantic label of the objects using paired captions when available. At inference time the model only requires RGB frames as inputs, and is able to track and ground objects (although it has not been trained explicitly for this). We demonstrate the performance of the object-aware representations learnt by our model by: (i) evaluating them for strong transfer, i.e. through zero-shot testing, on a number of downstream video-text retrieval and classification benchmarks; and (ii) using the learned representations as input for long-term video understanding tasks (e.g. Episodic Memory in Ego4D). In all cases the performance improves over the state of the art, even compared to networks trained with far larger batch sizes. We also show that by using noisy image-level detections as pseudo-labels in training, the model learns to provide better bounding boxes using video consistency, as well as grounding the words in the associated text descriptions. Overall, we show that the model can act as a drop-in replacement for an egocentric video model to improve performance through visual-text grounding.
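    The auxiliary supervision described above can be sketched as a set of extra heads whose losses are applied only when the corresponding (possibly noisy, pseudo-labelled) targets exist for a clip; at inference only the RGB pathway is used. Module names, shapes, and loss choices are assumptions, not the released model.

```python
import torch
import torch.nn as nn

class ObjectAwareHeads(nn.Module):
    """Auxiliary heads for hand boxes, object boxes, and object labels from captions."""
    def __init__(self, feat_dim=512, n_object_classes=300):
        super().__init__()
        self.hand_box = nn.Linear(feat_dim, 4)            # (x, y, w, h), normalised
        self.obj_box = nn.Linear(feat_dim, 4)
        self.obj_label = nn.Linear(feat_dim, n_object_classes)

    def loss(self, feats, targets):
        # Each term is added only when its (pseudo-)label is available for the clip.
        total = feats.new_zeros(())
        if "hand_box" in targets:
            total = total + nn.functional.l1_loss(self.hand_box(feats), targets["hand_box"])
        if "obj_box" in targets:
            total = total + nn.functional.l1_loss(self.obj_box(feats), targets["obj_box"])
        if "obj_label" in targets:
            total = total + nn.functional.cross_entropy(self.obj_label(feats), targets["obj_label"])
        return total

heads = ObjectAwareHeads()
feats = torch.randn(8, 512)                               # clip features from the video encoder
targets = {"hand_box": torch.rand(8, 4), "obj_label": torch.randint(0, 300, (8,))}
print(heads.loss(feats, targets))
```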

    Robust detection of degenerate configurations while estimating the fundamental matrix

    We present a new method for the detection of multiple solutions or degeneracy when estimating the fundamental matrix, with specific emphasis on robustness to data contamination (mismatches). The fundamental matrix encapsulates all the information on camera motion and internal parameters available from image feature correspondences between two views. It is often used as a first step in structure from motion algorithms. If the set of correspondences is degenerate, then this structure cannot be accurately recovered and many solutions explain the data equally well. It is essential that we are alerted to such eventualities. As current feature matchers are very prone to mismatching, the degeneracy detection method must also be robust to outliers. In this paper a definition of degeneracy is given and all two-view nondegenerate and degenerate cases are catalogued in a logical way by introducing the language of varieties from algebraic geometry. It is then shown how each of the cases can be robustly determined from image correspondences via a scoring function we develop. These ideas define a methodology which allows the simultaneous detection of degeneracy and outliers. The method is called PLUNDER-DL and is a generalization of the robust estimator RANSAC. The method is evaluated on many differing pairs of real images. In particular, it is demonstrated that proper modeling of degeneracy in the presence of outliers enables the detection of mismatches which would otherwise be missed. All processing, including point matching, degeneracy detection, and outlier detection, is automatic.
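    The idea of simultaneously scoring competing models under contamination can be sketched with a generic RANSAC-style loop: each candidate relation (e.g. a general fundamental matrix versus a degenerate one such as a homography) is fitted from minimal samples and scored with a robust, truncated cost, so outliers cannot mask a degenerate configuration. The skeleton below takes user-supplied fit and residual functions and is an illustration, not the PLUNDER-DL scoring function itself.

```python
import numpy as np

def robust_score(residuals, threshold):
    """Truncated quadratic cost; lower is better."""
    return np.sum(np.minimum(residuals ** 2, threshold ** 2))

def ransac_compare(correspondences, models, n_samples=500, threshold=1.5, rng=None):
    """correspondences: (n, ...) array; models: dict name -> (minimal_size, fit_fn, residual_fn)."""
    rng = rng or np.random.default_rng(0)
    best = {name: np.inf for name in models}
    n = len(correspondences)
    for _ in range(n_samples):
        for name, (k, fit, resid) in models.items():
            sample = correspondences[rng.choice(n, size=k, replace=False)]
            model = fit(sample)
            if model is None:                 # degenerate minimal sample, skip
                continue
            best[name] = min(best[name], robust_score(resid(model, correspondences), threshold))
    # If a degenerate model scores about as well as the general one, flag degeneracy.
    return best
```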
