Detecting and Naming Actors in Movies using Generative Appearance Models
We introduce a generative model for learning person- and costume-specific detectors from labeled examples. We demonstrate the model on the task of localizing and naming actors in long video sequences. More specifically, the actor's head and shoulders are each represented as a constellation of optional color regions, so detection can proceed despite changes in viewpoint and partial occlusions. We explain how to learn the models from a small number of labeled keyframes or video tracks, and how to detect novel appearances of the actors in a maximum-likelihood framework. We present results on a challenging example, detecting and naming the eight actors of Alfred Hitchcock's film "Rope", with 81% recall in actor detection (coverage) and 89% precision in actor identification (naming).
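The maximum-likelihood naming step described above can be illustrated with a toy sketch. The actor names, regions, and probabilities below are invented for illustration and are not the paper's learned models: each actor is scored by the log-likelihood of the observed region colours, and unobserved regions are simply skipped, which mirrors in spirit how optional regions tolerate partial occlusion.

```python
import math

# Hypothetical per-region colour distributions; in the paper these
# would be learned from labeled keyframes or video tracks.
ACTOR_MODELS = {
    "Brandon": {"hair": {"brown": 0.7, "black": 0.3},
                "shirt": {"white": 0.8, "grey": 0.2}},
    "Phillip": {"hair": {"black": 0.9, "brown": 0.1},
                "shirt": {"grey": 0.6, "white": 0.4}},
}

def name_actor(observed_regions):
    """Return the actor whose model assigns the observed colours the
    highest log-likelihood; missing regions are skipped (optional)."""
    best_name, best_score = None, float("-inf")
    for name, model in ACTOR_MODELS.items():
        score = 0.0
        for region, colour in observed_regions.items():
            # Small floor probability for colours the model never saw.
            score += math.log(model.get(region, {}).get(colour, 1e-3))
        if score > best_score:
            best_name, best_score = name, score
    return best_name

print(name_actor({"hair": "black", "shirt": "grey"}))  # → Phillip
```

Because each region contributes independently, an occluded shirt simply drops its term from the sum rather than vetoing the match.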
A Computational Framework for Vertical Video Editing
Vertical video editing is the process of digitally editing the image within the frame, as opposed to horizontal video editing, which arranges the shots along a timeline. Vertical editing can be a time-consuming and error-prone process when using manual key-framing and simple interpolation. In this paper, we present a general framework for automatically computing a variety of cinematically plausible shots from a single input video, suited to the special case of live performances. Drawing on working practices in traditional cinematography, the system acts as a virtual camera assistant to the film editor, who can call novel shots in the edit room with a combination of high-level instructions and manually selected keyframes.
Taking the bite out of automated naming of characters in TV video
We investigate the problem of automatically labelling appearances of characters in TV or film material
with their names. This is tremendously challenging due to the huge variation in imaged appearance of each character and the weakness and ambiguity of available annotation. However, we demonstrate that high precision can be achieved by combining multiple sources of information, both visual and textual. The principal novelties that we introduce are: (i) automatic generation of time stamped character annotation by aligning subtitles and transcripts; (ii) strengthening the supervisory information by identifying
when characters are speaking. In addition, we incorporate complementary cues of face matching and clothing matching to propose common annotations for face tracks, and consider choices of classifier which can potentially correct errors made in the automatic extraction of training data from the weak textual annotation. Results are presented on episodes of the TV series "Buffy the Vampire Slayer".
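The subtitle/transcript alignment in contribution (i) can be sketched in a few lines. This is a hedged illustration of the general idea, not the paper's implementation, and the dialogue lines are invented: transcripts carry speaker names but no timing, subtitles carry timing but no names, so matching each subtitle's text against transcript lines transfers a name onto a time stamp.

```python
import difflib

# Invented example data: (speaker, text) from a transcript, and
# ((start_sec, end_sec), text) from a subtitle file.
transcript = [
    ("BUFFY", "we have to stop the ritual tonight"),
    ("GILES", "the prophecy is quite clear on this point"),
]
subtitles = [
    ((12.0, 14.5), "We have to stop the ritual tonight."),
    ((15.1, 18.0), "The prophecy is quite clear on this point."),
]

def align(subtitles, transcript, threshold=0.8):
    """Label each subtitle with the speaker of its best-matching
    transcript line, keeping only confident matches."""
    labelled = []
    for (start, end), text in subtitles:
        scored = [
            (difflib.SequenceMatcher(None, text.lower(), line.lower()).ratio(),
             speaker)
            for speaker, line in transcript
        ]
        ratio, speaker = max(scored)
        if ratio >= threshold:
            labelled.append((start, end, speaker))
    return labelled

print(align(subtitles, transcript))
```

Real subtitles and transcripts differ in punctuation, line breaks, and paraphrase, which is why a fuzzy similarity with a threshold is used rather than exact string equality.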
Hiding in Plain Sight: A Longitudinal Study of Combosquatting Abuse
Domain squatting is a common adversarial practice where attackers register
domain names that are purposefully similar to popular domains. In this work, we
study a specific type of domain squatting called "combosquatting," in which
attackers register domains that combine a popular trademark with one or more
phrases (e.g., betterfacebook[.]com, youtube-live[.]com). We perform the first
large-scale, empirical study of combosquatting by analyzing more than 468
billion DNS records, collected from passive and active DNS data sources over
almost six years. We find that almost 60% of abusive combosquatting domains
live for more than 1,000 days, and even worse, we observe increased activity
associated with combosquatting year over year. Moreover, we show that
combosquatting is used to perform a spectrum of different types of abuse
including phishing, social engineering, affiliate abuse, trademark abuse, and
even advanced persistent threats. Our results suggest that combosquatting is a
real problem that requires increased scrutiny by the security community.
Comment: ACM CCS 1
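The combosquatting definition used above admits a minimal sketch. The trademark list and domains below are illustrative, not drawn from the study's data: a domain combosquats a trademark when its registrable label contains the trademark verbatim plus additional characters, which also distinguishes combosquatting from typosquatting (where the trademark itself is altered).

```python
# Illustrative trademark list; the study uses a much larger set.
TRADEMARKS = ["facebook", "youtube", "paypal"]

def is_combosquat(domain):
    """True if the domain's first label combines a trademark with
    extra characters, e.g. betterfacebook.com or youtube-live.com."""
    label = domain.split(".")[0]
    for tm in TRADEMARKS:
        if tm in label and label != tm:
            return True
    return False

print(is_combosquat("betterfacebook.com"))  # → True
print(is_combosquat("youtube-live.com"))    # → True
print(is_combosquat("facebook.com"))        # → False
print(is_combosquat("faceb00k.com"))        # → False (typosquat, not combosquat)
```

Note the last case: because the trademark must appear verbatim, character-substitution typosquats fall outside this definition, matching the paper's framing of combosquatting as a distinct abuse category.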
Person Recognition in Personal Photo Collections
Recognising persons in everyday photos presents major challenges (occluded
faces, different clothing, locations, etc.) for machine vision. We propose a
convnet-based person recognition system and provide an in-depth analysis of
the informativeness of different body cues, the impact of training data, and
the common failure modes of the system. In addition, we discuss the
limitations of existing benchmarks and propose more challenging ones. Our
method is simple and built on open source code and open data, yet it improves
the state-of-the-art results on a large dataset of social media photos (PIPA).
Comment: Accepted to ICCV 2015, revise
Finding Actors and Actions in Movies
We address the problem of learning a joint model of actors and actions in movies using weak supervision provided by scripts. Specifically, we extract actor/action pairs from the script and use them as constraints in a discriminative clustering framework. The corresponding optimization problem is formulated as a quadratic program under linear constraints. People in video are represented by automatically extracted and tracked faces together with corresponding motion features. First, we apply the proposed framework to the task of learning names of characters in the movie and demonstrate significant improvements over previous methods used for this task. Second, we explore the joint actor/action constraint and show its advantage for weakly supervised action learning. We validate our method in the challenging setting of localizing and recognizing characters and their actions in the feature-length movies Casablanca and American Beauty.
Detecting and Grounding Important Characters in Visual Stories
Characters are essential to the plot of any story. Establishing the
characters before writing a story can improve the clarity of the plot and the
overall flow of the narrative. However, previous work on visual storytelling
tends to focus on detecting objects in images and discovering relationships
between them. In this approach, characters are not distinguished from other
objects when they are fed into the generation pipeline. The result is a
coherent sequence of events rather than a character-centric story. In order to
address this limitation, we introduce the VIST-Character dataset, which
provides rich character-centric annotations, including visual and textual
co-reference chains and importance ratings for characters. Based on this
dataset, we propose two new tasks: important character detection and character
grounding in visual stories. For both tasks, we develop simple, unsupervised
models based on distributional similarity and pre-trained vision-and-language
models. Our new dataset, together with these models, can serve as the
foundation for subsequent work on analysing and generating stories from a
character-centric perspective.
Comment: AAAI 202
Detecting People Looking at Each Other in Videos
The objective of this work is to determine whether people in TV video are interacting, by detecting whether or not they are looking at each other. We determine both the temporal period of the interaction and spatially localize the relevant people. We make the following four contributions: (i) head detection with implicit coarse pose information (front, profile, back); (ii) continuous head pose estimation in unconstrained scenarios (TV video) using Gaussian process regression; (iii) proposal and evaluation of several methods for assessing whether and when pairs of people are looking at each other in a video shot; and (iv) new ground-truth annotation for this task, extending the TV human interactions dataset (Patron-Perez et al. 2010). The performance of the methods is evaluated on this dataset, which consists of 300 video clips extracted from TV shows. Despite the variety and difficulty of this video material, our best method obtains an average precision of 87.6% in a fully automatic manner.
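The core mutual-gaze test in contribution (iii) can be reduced to a toy geometric check. This is a simplified 2-D sketch, not the paper's method (which uses learned head pose from Gaussian process regression): each head is a position plus a gaze direction, and two people "look at each other" when each gaze vector points roughly toward the other head.

```python
import math

def looking_at(pos_a, yaw_a, pos_b, tol=math.radians(15)):
    """True if A's gaze direction falls within `tol` radians of the
    direction from A toward B."""
    to_b = math.atan2(pos_b[1] - pos_a[1], pos_b[0] - pos_a[0])
    # Wrap the angular difference into [-pi, pi).
    diff = (yaw_a - to_b + math.pi) % (2 * math.pi) - math.pi
    return abs(diff) <= tol

def mutual_gaze(pos_a, yaw_a, pos_b, yaw_b, tol=math.radians(15)):
    """True only when each person is looking toward the other."""
    return looking_at(pos_a, yaw_a, pos_b, tol) and \
           looking_at(pos_b, yaw_b, pos_a, tol)

# Two people on the x-axis, facing each other:
print(mutual_gaze((0, 0), 0.0, (5, 0), math.pi))  # → True
```

The asymmetry matters: if only one person turns away, `looking_at` fails in that direction and the pair is no longer in mutual gaze, which is the event the paper localizes in time.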