Taking the bite out of automated naming of characters in TV video
We investigate the problem of automatically labelling appearances of characters in TV or film material with their names. This is tremendously challenging due to the huge variation in imaged appearance of each character and the weakness and ambiguity of available annotation. However, we demonstrate that high precision can be achieved by combining multiple sources of information, both visual and textual. The principal novelties that we introduce are: (i) automatic generation of time stamped character annotation by aligning subtitles and transcripts; (ii) strengthening the supervisory information by identifying when characters are speaking. In addition, we incorporate complementary cues of face matching and clothing matching to propose common annotations for face tracks, and consider choices of classifier which can potentially correct errors made in the automatic extraction of training data from the weak textual annotation. Results are presented on episodes of the TV series "Buffy the Vampire Slayer".
A data augmentation methodology for training machine/deep learning gait recognition algorithms
Several confounding factors can reduce the accuracy of gait recognition systems. These factors can reduce the distinctiveness of the features used to characterise gait, or alter them; they include variations in clothing, lighting, pose and environment, such as the walking surface. Full invariance to all confounding factors is challenging in the absence of high-quality labelled training data. We introduce a simulation-based methodology and a subject-specific dataset which can be used for generating synthetic video frames and sequences for data augmentation. With this methodology, we generated a multi-modal dataset. In addition, we supply simulation files that provide the ability to simultaneously sample from several confounding variables. The basis of the data is real motion capture data of subjects walking and running on a treadmill at different speeds. Results from gait recognition experiments suggest that information about the identity of subjects is retained within synthetically generated examples. The dataset and methodology allow studies into fully-invariant identity recognition spanning a far greater number of observation conditions than would otherwise be possible.
Predicting Motivations of Actions by Leveraging Text
Understanding human actions is a key problem in computer vision. However,
recognizing actions is only the first step of understanding what a person is
doing. In this paper, we introduce the problem of predicting why a person has
performed an action in images. This problem has many applications in human
activity understanding, such as anticipating or explaining an action. To study
this problem, we introduce a new dataset of people performing actions annotated
with likely motivations. However, the information in an image alone may not be
sufficient to automatically solve this task. Since humans can rely on their
lifetime of experiences to infer motivation, we propose to give computer vision
systems access to some of these experiences by using recently developed natural
language models to mine knowledge stored in massive amounts of text. While we
are still far away from fully understanding motivation, our results suggest
that transferring knowledge from language into vision can help machines
understand why people in images might be performing an action.
Comment: CVPR 201
BodyNet: Volumetric Inference of 3D Human Body Shapes
Human shape estimation is an important task for video editing, animation and
the fashion industry. Predicting 3D human body shape from natural images, however,
is highly challenging due to factors such as variation in human bodies,
clothing and viewpoint. Prior methods addressing this problem typically attempt
to fit parametric body models with certain priors on pose and shape. In this
work we argue for an alternative representation and propose BodyNet, a neural
network for direct inference of volumetric body shape from a single image.
BodyNet is an end-to-end trainable network that benefits from (i) a volumetric
3D loss, (ii) a multi-view re-projection loss, and (iii) intermediate
supervision of 2D pose, 2D body part segmentation, and 3D pose. Each of them
results in performance improvement as demonstrated by our experiments. To
evaluate the method, we fit the SMPL model to our network output and show
state-of-the-art results on the SURREAL and Unite the People datasets,
outperforming recent approaches. Besides achieving state-of-the-art
performance, our method also enables volumetric body-part segmentation.
Comment: Appears in: European Conference on Computer Vision 2018 (ECCV 2018). 27 page