Sentence Specified Dynamic Video Thumbnail Generation
With the tremendous growth of videos over the Internet, video thumbnails,
providing video content previews, are becoming increasingly crucial to
users' online searching experiences. Conventional video thumbnails are
generated once, purely from the visual characteristics of a video, and then
displayed whenever the video is requested. Hence, such video thumbnails, which
do not consider the users' searching intentions, cannot provide a meaningful
snapshot of the video contents that users actually care about. In this paper,
we define a distinctly new
task, namely sentence specified dynamic video thumbnail generation, where the
generated thumbnails not only provide a concise preview of the original video
contents but also dynamically relate to the users' searching intentions with
semantic correspondences to the users' query sentences. To tackle such a
challenging task, we propose a novel graph convolved video thumbnail pointer
(GTP). Specifically, GTP leverages a sentence specified video graph
convolutional network to model both the sentence-video semantic interaction and
the internal video relationships incorporated with the sentence information,
based on which a temporal conditioned pointer network is then introduced to
sequentially generate the sentence specified video thumbnails. Moreover, we
annotate a new dataset based on ActivityNet Captions for the proposed new task,
which consists of 10,000+ video-sentence pairs with each accompanied by an
annotated sentence specified video thumbnail. We demonstrate that our proposed
GTP outperforms several baseline methods on the created dataset, and thus
believe that our initial results along with the release of the new dataset will
inspire further research on sentence specified dynamic video thumbnail
generation. Dataset and code are available at https://github.com/yytzsy/GTP
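As a rough illustration of the kind of architecture described above, the following Python sketch (not the authors' released code; dimensions, module names, and the scoring head are assumptions, and the sequential pointer decoding of GTP is omitted) shows a sentence-conditioned graph convolution over clip features followed by a per-clip relevance score:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SentenceConditionedGraphConv(nn.Module):
        """Toy stand-in for a sentence specified video graph convolution."""
        def __init__(self, clip_dim=512, sent_dim=512, hidden_dim=256):
            super().__init__()
            self.fuse = nn.Linear(clip_dim + sent_dim, hidden_dim)  # inject sentence info into each clip node
            self.gcn = nn.Linear(hidden_dim, hidden_dim)            # one graph-convolution step
            self.score = nn.Linear(hidden_dim, 1)                   # per-clip relevance to the query sentence

        def forward(self, clips, sentence):
            # clips: (T, clip_dim) clip features; sentence: (sent_dim,) sentence embedding
            nodes = torch.tanh(self.fuse(torch.cat([clips, sentence.expand(clips.size(0), -1)], dim=-1)))
            adj = F.softmax(nodes @ nodes.t() / nodes.size(-1) ** 0.5, dim=-1)  # clip-clip affinities
            nodes = torch.relu(self.gcn(adj @ nodes))                           # propagate sentence-aware context
            return self.score(nodes).squeeze(-1)

    model = SentenceConditionedGraphConv()
    scores = model(torch.randn(20, 512), torch.randn(512))  # 20 candidate clips, one query sentence
    print(scores.topk(3).indices)                           # clips ranked most relevant as thumbnail candidates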
Recognizing and Presenting the Storytelling Video Structure with Deep Multimodal Networks
In this paper, we propose a novel scene detection algorithm which employs semantic, visual, textual, and audio cues. We also show how the hierarchical decomposition of the storytelling video structure can improve the presentation of retrieval results with semantically and aesthetically effective thumbnails. Our method is built upon two advancements over the state of the art: 1) semantic feature extraction, which builds video-specific concept detectors; 2) multimodal feature embedding learning, which maps the feature vector of a shot to a space in which the Euclidean distance has task-specific semantic properties. The proposed method is able to decompose the video into annotated temporal segments, which allows for query-specific thumbnail extraction. Extensive experiments are performed on different datasets to demonstrate the effectiveness of our algorithm. An in-depth discussion on how to deal with the subjectivity of the task is conducted and a strategy to overcome the problem is suggested.
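The multimodal embedding idea in point 2) can be sketched with a standard triplet objective; the snippet below is only an illustrative approximation (feature dimensions and the specific loss are assumptions, not the paper's implementation):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ShotEmbedder(nn.Module):
        """Maps concatenated visual/audio/textual shot features into a space
        where Euclidean distance reflects scene membership."""
        def __init__(self, in_dim=2048, out_dim=128):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, out_dim))

        def forward(self, x):
            return F.normalize(self.net(x), dim=-1)

    embedder = ShotEmbedder()
    triplet = nn.TripletMarginLoss(margin=0.2)

    # anchor/positive drawn from the same scene, negative from a different scene (toy tensors)
    anchor, positive, negative = (torch.randn(8, 2048) for _ in range(3))
    loss = triplet(embedder(anchor), embedder(positive), embedder(negative))
    loss.backward()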
CulturAI: Semantic Enrichment of Cultural Data Leveraging Artificial Intelligence
In this paper, we propose an innovative tool able to enrich cultural and creative spots (hereinafter, gems) extracted from the European Commission Cultural Gems portal, by suggesting relevant keywords (tags) and YouTube videos (represented with proper thumbnails). On the one hand, the system queries the YouTube search portal, selects the videos most related to the given gem, and extracts a set of meaningful thumbnails for each video. On the other hand, each tag is selected by identifying semantically related popular search queries (i.e., trends). In particular, trends are retrieved by querying the Google Trends platform. A further novelty is that our system suggests content dynamically: since the results of a given query on both the YouTube and Google Trends platforms reflect the most popular videos and trends, a gem can be continually updated with trendy content by periodically re-running the tool. The system has been tested on a set of gems and evaluated with the support of human annotators. The results highlight the effectiveness of our proposal.
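For context, the YouTube retrieval step could look roughly like the following sketch against the public YouTube Data API v3 search endpoint; the ranking, thumbnail choice, and function name are assumptions, not the system's actual code:

    import requests

    def youtube_thumbnails(gem_name, api_key, max_results=5):
        """Fetch candidate videos and thumbnail URLs for a cultural gem."""
        resp = requests.get(
            "https://www.googleapis.com/youtube/v3/search",
            params={"part": "snippet", "q": gem_name, "type": "video",
                    "maxResults": max_results, "key": api_key},
            timeout=10,
        )
        resp.raise_for_status()
        return [
            {"video_id": item["id"]["videoId"],
             "title": item["snippet"]["title"],
             "thumbnail": item["snippet"]["thumbnails"]["high"]["url"]}
            for item in resp.json().get("items", [])
        ]

    # print(youtube_thumbnails("Ponte Vecchio", api_key="YOUR_KEY"))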
Automatic Image Cropping and Selection using Saliency: an Application to Historical Manuscripts
Automatic image cropping techniques are particularly important for improving the visual quality of cropped images and can be applied to a wide range of applications, such as photo editing, image compression, and thumbnail selection. In this paper, we propose a saliency-based image cropping method which produces meaningful crops by relying only on the corresponding saliency maps. Experiments on standard image cropping datasets demonstrate the benefit of the proposed solution with respect to other cropping methods. Moreover, we present an image selection method that can be effectively applied to automatically select the most representative pages of historical manuscripts, thus improving the navigation of historical digital libraries.
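A minimal sketch of saliency-driven cropping, assuming OpenCV's spectral-residual saliency (opencv-contrib) and a simple percentile-thresholded bounding box rather than the paper's actual procedure:

    import cv2
    import numpy as np

    def saliency_crop(image_bgr, keep=0.6):
        """Crop an image to the bounding box of its most salient pixels."""
        detector = cv2.saliency.StaticSaliencySpectralResidual_create()
        ok, sal_map = detector.computeSaliency(image_bgr)
        if not ok:
            return image_bgr
        thresh = np.percentile(sal_map, 100 * (1 - keep))  # keep the top `keep` fraction of saliency mass
        ys, xs = np.where(sal_map >= thresh)
        if xs.size == 0:
            return image_bgr
        return image_bgr[ys.min():ys.max() + 1, xs.min():xs.max() + 1]

    # cropped = saliency_crop(cv2.imread("manuscript_page.jpg"))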
From Thumbnails to Summaries - A single Deep Neural Network to Rule Them All
Video summaries come in many forms, ranging from traditional single-image
thumbnails and animated thumbnails to storyboards and trailer-like video
summaries. Content creators use summaries to display the most attractive
portions of their videos; users use them to quickly evaluate whether a video
is worth watching.
All forms of summaries are essential to video viewers, content creators, and
advertisers. Video content management systems often have to generate multiple
versions of summaries that vary in duration and presentation form. We
present a framework, ReconstSum, which utilizes an LSTM-based autoencoder
architecture to extract and select a sparse subset of video frames or keyshots
that optimally represent the input video in an unsupervised manner. The encoder
selects a subset from the input video while the decoder seeks to reconstruct
the video from the selection. The goal is to minimize the difference between
the original input video and the reconstructed video. Our method is easily
extendable to a variety of applications, including static video thumbnails,
animated thumbnails, storyboards, and "trailer-like" highlights. We
specifically study and evaluate the two most popular use cases: thumbnail
generation and storyboard generation. We demonstrate that our method generates
better results than state-of-the-art techniques in both use cases.
Comment: 6 pages, 2 figures, IEEE International Conference on Multimedia and Expo (ICME) 201
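The select-and-reconstruct idea can be caricatured as below; this is an illustrative sketch only (the soft gating, losses, and dimensions are assumptions, not the ReconstSum implementation):

    import torch
    import torch.nn as nn

    class SelectReconstruct(nn.Module):
        """Score frames, softly select a sparse subset, and reconstruct the full sequence."""
        def __init__(self, feat_dim=1024, hidden=256):
            super().__init__()
            self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
            self.importance = nn.Linear(2 * hidden, 1)    # per-frame selection score
            self.decoder = nn.LSTM(feat_dim, feat_dim, batch_first=True)

        def forward(self, frames):                        # frames: (B, T, feat_dim)
            enc, _ = self.encoder(frames)
            probs = torch.sigmoid(self.importance(enc))   # (B, T, 1) soft keyshot selection
            recon, _ = self.decoder(frames * probs)       # reconstruct from the gated selection
            recon_loss = ((recon - frames) ** 2).mean()   # match the original video
            sparsity = probs.mean()                       # encourage selecting few frames
            return recon_loss + 0.01 * sparsity, probs.squeeze(-1)

    model = SelectReconstruct()
    loss, frame_scores = model(torch.randn(2, 30, 1024))  # 2 videos, 30 frames each
    loss.backward()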
Object Referring in Videos with Language and Human Gaze
We investigate the problem of object referring (OR), i.e., localizing a target
object in a visual scene given a language description. Humans perceive
the world more as continued video snippets than as static images, and describe
objects not only by their appearance, but also by their spatio-temporal context
and motion features. Humans also gaze at the object when they issue a referring
expression. Existing works for OR mostly focus on static images only, which
fall short in providing many such cues. This paper addresses OR in videos with
language and human gaze. To that end, we present a new video dataset for OR,
with 30,000 objects over 5,000 stereo video sequences annotated for their
descriptions and gaze. We further propose a novel network model for OR in
videos, by integrating appearance, motion, gaze, and spatio-temporal context
into one network. Experimental results show that our method effectively
utilizes motion cues, human gaze, and spatio-temporal context. Our method
outperforms previous OR methods. For the dataset and code, please refer to
https://people.ee.ethz.ch/~arunv/ORGaze.html.
Comment: Accepted to CVPR 2018, 10 pages, 6 figures
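As a rough sketch of the fusion idea (assumptions throughout: feature sizes, the concatenation-based fusion, and the scoring head; the actual network is more involved), candidate object proposals could be scored from appearance, motion, gaze, and language features like this:

    import torch
    import torch.nn as nn

    class ORFusion(nn.Module):
        """Score object proposals from concatenated multimodal features."""
        def __init__(self, app=2048, mot=1024, gaze=64, lang=300, hidden=512):
            super().__init__()
            self.proj = nn.Linear(app + mot + gaze + lang, hidden)
            self.score = nn.Linear(hidden, 1)

        def forward(self, appearance, motion, gaze, language):
            # one row per candidate proposal; the language embedding is shared by all proposals
            n = appearance.size(0)
            fused = torch.cat([appearance, motion, gaze, language.expand(n, -1)], dim=-1)
            return self.score(torch.relu(self.proj(fused))).squeeze(-1)

    model = ORFusion()
    scores = model(torch.randn(10, 2048), torch.randn(10, 1024),
                   torch.randn(10, 64), torch.randn(300))
    print(scores.argmax())  # index of the proposal predicted as the referred object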