PHD-GIFs: Personalized Highlight Detection for Automatic GIF Creation
Highlight detection models are typically trained to identify cues that make
visual content appealing or interesting for the general public, with the
objective of reducing a video to such moments. However, the "interestingness"
of a video segment or image is subjective. Thus, such highlight models provide
results of limited relevance for the individual user. On the other hand,
training one model per user is inefficient and requires large amounts of
personal information which is typically not available. To overcome these
limitations, we present a global ranking model which conditions on each
particular user's interests. Rather than training one model per user, our model
is personalized via its inputs, which allows it to effectively adapt its
predictions, given only a few user-specific examples. To train this model, we
create a large-scale dataset of users and the GIFs they created, giving us an
accurate indication of their interests. Our experiments show that using the
user history substantially improves the prediction accuracy. On our test set of
850 videos, our model improves the recall by 8% with respect to generic
highlight detectors. Furthermore, our method proves more precise than the
user-agnostic baselines even with just one person-specific example. Comment: Accepted for publication at the 2018 ACM Multimedia Conference (MM '18).
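The personalization-via-inputs idea can be caricatured with a minimal sketch, assuming each video segment and each history GIF is already represented by a feature vector. The function name and the blending weight `alpha` are hypothetical; the paper's actual model is a learned ranking network, not this fixed similarity blend:

```python
import numpy as np

def personalized_highlight_scores(segment_feats, history_feats, generic_scores, alpha=0.5):
    """Blend a generic highlight score with similarity to the user's history
    (illustrative sketch, not the paper's model). The model is 'personalized
    via its inputs': the same function serves every user, and only the
    history features change."""
    # Summarize the user's interests as the mean of their history embeddings.
    profile = history_feats.mean(axis=0)
    profile = profile / (np.linalg.norm(profile) + 1e-8)
    # Cosine similarity of each candidate segment to the user profile.
    segs = segment_feats / (np.linalg.norm(segment_feats, axis=1, keepdims=True) + 1e-8)
    personal = segs @ profile
    return alpha * generic_scores + (1 - alpha) * personal
```

Even a single history example shifts the ranking, which mirrors the paper's observation that one person-specific example already beats user-agnostic baselines.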
Active video summarization: Customized summaries via on-line interaction with the user
To facilitate the browsing of long videos, automatic video summarization provides an excerpt that represents their content. In the case of egocentric and consumer videos, due to their personal nature, adapting the summary to a specific user's preferences is desirable. Current approaches to customizable video summarization obtain the user's preferences prior to the summarization process. As a result, the user needs to manually modify the summary to further meet their preferences. In this paper, we introduce Active Video Summarization (AVS), an interactive approach that gathers the user's preferences while creating the summary. AVS asks questions about the summary and updates it on-line until the user is satisfied. To minimize the interaction, the best segment to inquire about next is inferred from the previous feedback. We evaluate AVS on the commonly used UT Ego dataset. We also introduce a new dataset for customized video summarization (CSumm) recorded with a Google Glass. The results show that AVS achieves an excellent compromise between usability and quality. In 41% of the videos, AVS is considered the best over all tested baselines, including manually generated summaries. Also, when looking for specific events in the video, AVS provides a higher average level of satisfaction than all other baselines after only six questions to the user.
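The question-asking loop can be sketched in simplified form. This is not the paper's inference model: here the next query is simply the segment whose current score is most uncertain, and the user's answer overwrites that score; the function and parameter names are hypothetical:

```python
def active_summary(scores, ask_user, budget=6, k=3):
    """Interactive summary refinement (simplified sketch of the AVS idea):
    repeatedly query the segment whose score is closest to the inclusion
    threshold 0.5, update from the yes/no answer, then keep the top-k
    segments in temporal order."""
    scores = list(scores)
    asked = set()
    for _ in range(budget):
        candidates = [i for i in range(len(scores)) if i not in asked]
        if not candidates:
            break
        # Most uncertain segment: score nearest the inclusion threshold.
        i = min(candidates, key=lambda j: abs(scores[j] - 0.5))
        asked.add(i)
        scores[i] = 1.0 if ask_user(i) else 0.0
    # Final summary: indices of the k highest-scoring segments, in order.
    return sorted(sorted(range(len(scores)), key=lambda j: -scores[j])[:k])
```

The fixed question budget corresponds to the paper's finding that about six questions already yield high user satisfaction.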
Video Storytelling: Textual Summaries for Events
Bridging vision and natural language is a longstanding goal in computer
vision and multimedia research. While earlier works focus on generating a
single-sentence description for visual content, recent works have studied
paragraph generation. In this work, we introduce the problem of video
storytelling, which aims at generating coherent and succinct stories for long
videos. Video storytelling introduces new challenges, mainly due to the
diversity of the story and the length and complexity of the video. We propose
novel methods to address the challenges. First, we propose a context-aware
framework for multimodal embedding learning, where we design a Residual
Bidirectional Recurrent Neural Network to leverage contextual information from
past and future. Second, we propose a Narrator model to discover the underlying
storyline. The Narrator is formulated as a reinforcement learning agent which
is trained by directly optimizing the textual metric of the generated story. We
evaluate our method on the Video Story dataset, a new dataset that we have
collected to enable the study. We compare our method with multiple
state-of-the-art baselines, and show that our method achieves better
performance, in terms of both quantitative measures and a user study. Comment: Published in IEEE Transactions on Multimedia.
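The idea of training a Narrator agent by directly optimizing a textual metric can be illustrated with a toy REINFORCE-style sketch. This is illustrative only, not the paper's architecture: clips are selected by independent Bernoulli policies, `score_story` stands in for any textual metric, and all names are hypothetical:

```python
import math
import random

def reinforce_narrator(clips, score_story, n_iters=200, lr=0.1, seed=0):
    """REINFORCE-style sketch of a Narrator-like agent: sample a story,
    score it with a textual metric, and nudge per-clip selection
    probabilities toward samples that beat a running reward baseline."""
    rng = random.Random(seed)
    logits = [0.0] * len(clips)
    baseline = 0.0
    for _ in range(n_iters):
        probs = [1.0 / (1.0 + math.exp(-l)) for l in logits]
        picks = [rng.random() < p for p in probs]
        story = [c for c, take in zip(clips, picks) if take]
        advantage = score_story(story) - baseline
        baseline += 0.1 * advantage  # running reward baseline
        for i, take in enumerate(picks):
            # d log pi / d logit for a Bernoulli selection policy.
            grad = (1.0 if take else 0.0) - probs[i]
            logits[i] += lr * advantage * grad
    return logits
```

Because the metric is only queried on sampled stories, it never needs to be differentiable, which is precisely what makes policy-gradient training attractive for text metrics.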
Egocentric video summarisation via purpose-oriented frame scoring and selection
Existing video summarisation techniques are quite generic in nature: they generally overlook what actual purpose the summary will serve. In contrast with this mainstream work, the same video can be summarised for many different purposes. Accordingly, we consider a novel perspective: summaries with a purpose. This work is an attempt both to call attention to this neglected aspect of video summarisation research, and to illustrate and explore it with two concrete purposes, focusing on first-person-view videos. The proposed purpose-oriented summarisation techniques are framed under the common (frame-level) scoring and selection paradigm, and have been tested on two egocentric datasets, BEOID and EGTEA-Gaze+. The necessary purpose-specific evaluation metrics are also introduced.
The proposed approach is compared with two purpose-agnostic summarisation baselines. On the one hand, a partially agnostic method uses the scores obtained by the proposed approach, but follows a standard generic frame-selection technique. On the other hand, a fully agnostic method does not use any purpose-based information, relying instead on generic concepts such as diversity and representativeness. The results of the experimental work show that the proposed approaches compare favourably with both baselines. More specifically, the purpose-specific approach generally produces summaries with the best compromise between summary length and favourable purpose-specific metrics. Interestingly, the results of the partially agnostic baseline tend to be better than those of the fully agnostic one. These observations provide strong evidence of the advantage and relevance of purpose-specific summarisation techniques and evaluation metrics, and encourage further work on this important subject.
Funding for open access charge: CRUE-Universitat Jaume
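The frame-level scoring-and-selection paradigm the work builds on can be sketched generically. This is illustrative only: `score_fn` stands in for any purpose-specific scorer, and the selection step here is a plain top-k rather than the paper's actual selection techniques:

```python
def score_and_select(frames, score_fn, k):
    """Generic scoring-and-selection paradigm: score every frame with a
    (purpose-specific) function, then keep the indices of the k
    highest-scoring frames in temporal order."""
    ranked = sorted(range(len(frames)), key=lambda i: -score_fn(frames[i]))
    return sorted(ranked[:k])
```

Swapping `score_fn` is exactly where purpose enters: the same selection machinery yields different summaries for different purposes.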
Eyewear Computing – Augmenting the Human with Head-Mounted Wearable Assistants
The seminar was composed of workshops and tutorials on head-mounted eye tracking, egocentric
vision, optics, and head-mounted displays. The seminar welcomed 30 academic and industry
researchers from Europe, the US, and Asia with a diverse background, including wearable and
ubiquitous computing, computer vision, developmental psychology, optics, and human-computer
interaction. In contrast to several previous Dagstuhl seminars, we used an ignite talk format to
reduce the time of talks to one half-day and to leave the rest of the week for hands-on sessions,
group work, general discussions, and socialising. The key results of this seminar are 1) the
identification of key research challenges and summaries of breakout groups on multimodal eyewear
computing, egocentric vision, security and privacy issues, skill augmentation and task guidance,
eyewear computing for gaming, as well as prototyping of VR applications, 2) a list of datasets and
research tools for eyewear computing, 3) three small-scale datasets recorded during the seminar, 4)
an article in ACM Interactions entitled “Eyewear Computers for Human-Computer Interaction”,
as well as 5) two follow-up workshops on “Egocentric Perception, Interaction, and Computing”
at the European Conference on Computer Vision (ECCV) as well as “Eyewear Computing” at
the ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp).
Predicting visual context for unsupervised event segmentation in continuous photo-streams
Segmenting video content into events provides semantic structures for
indexing, retrieval, and summarization. Since motion cues are not available in
continuous photo-streams, and annotations in lifelogging are scarce and costly,
the frames are usually clustered into events by comparing the visual features
between them in an unsupervised way. However, such methodologies are
ineffective at dealing with heterogeneous events (e.g. taking a walk) and
temporary changes in viewing direction (e.g. at a meeting). To address these
limitations, we propose Contextual Event Segmentation (CES), a novel
segmentation paradigm that uses an LSTM-based generative network to model the
photo-stream sequences, predict their visual context, and track their
evolution. CES decides whether a frame is an event boundary by comparing the
visual context generated from the frames in the past, to the visual context
predicted from the future. We implemented CES on a new and massive lifelogging
dataset consisting of more than 1.5 million images spanning over 1,723 days.
Experiments on the popular EDUB-Seg dataset show that our model outperforms the
state-of-the-art by over 16% in f-measure. Furthermore, CES' performance is
only 3 points below that of human annotators. Comment: Accepted for publication at the 2018 ACM Multimedia Conference (MM '18).
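The past-versus-future comparison at the heart of CES can be sketched in simplified form. This is not the paper's model: plain moving averages stand in for the LSTM-based visual-context predictors, and the function name and threshold are hypothetical:

```python
import numpy as np

def event_boundaries(feats, window=3, threshold=0.5):
    """Boundary detection in the spirit of CES (simplified sketch).
    feats: (n_frames, dim) array of per-frame visual features.
    Frame t is declared a boundary when the context summarizing the
    recent past disagrees with the context summarizing the near future."""
    boundaries = []
    for t in range(window, len(feats) - window):
        past = feats[t - window:t].mean(axis=0)     # context from the past
        future = feats[t:t + window].mean(axis=0)   # context from the future
        # Cosine similarity between the two context vectors.
        sim = past @ future / (np.linalg.norm(past) * np.linalg.norm(future) + 1e-8)
        if sim < threshold:
            boundaries.append(t)
    return boundaries
```

Replacing the averages with learned sequence models that predict context, as CES does, is what lets the method cope with photo-streams where motion cues are unavailable.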