Spott : on-the-spot e-commerce for television using deep learning-based video analysis techniques
Spott is an innovative second-screen mobile multimedia application that offers viewers relevant information on objects (e.g., clothing, furniture, food) they see and like on their television screens. The application enables interaction between TV audiences and brands, so producers and advertisers can offer potential consumers tailored promotions, e-shop items, and/or free samples. In line with current views on innovation management, the technological excellence of the Spott application is coupled with iterative user involvement throughout the entire development process. This article discusses both of these aspects and how they impact each other. First, we focus on the technological building blocks that facilitate the (semi-)automatic interactive tagging of objects in video streams. The majority of these building blocks make extensive use of novel, state-of-the-art deep learning concepts and methodologies. We show how these deep-learning-based video analysis techniques facilitate video summarization, semantic keyframe clustering, and (similar) object retrieval. Second, we provide insights into the user tests that were performed to evaluate and optimize the application's user experience. The lessons learned from these open field tests have already been an essential input to the technology development and will further shape future modifications to the Spott application.
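To make the semantic keyframe clustering step concrete, here is a minimal sketch of how such a stage could be built from off-the-shelf components: frames are embedded with a pre-trained CNN and clustered with k-means, keeping the frame nearest each centroid as a keyframe. The backbone, cluster count, and helper names are illustrative assumptions, not the actual Spott pipeline.

```python
# Minimal sketch of semantic keyframe clustering: embed sampled frames
# with a pre-trained CNN, then cluster the embeddings. The backbone and
# cluster count are illustrative assumptions, not the Spott pipeline.
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.cluster import KMeans

# Pre-trained backbone with the classifier head removed -> 2048-d embeddings.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed_frames(frames):
    """frames: list of PIL images sampled from the video stream."""
    batch = torch.stack([preprocess(f) for f in frames])
    return backbone(batch)  # (N, 2048) semantic embeddings

def cluster_keyframes(frames, n_clusters=10):
    emb = embed_frames(frames).numpy()
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(emb)
    # Per cluster, keep the frame closest to the centroid as its keyframe.
    keyframes = []
    for c in range(n_clusters):
        idx = (km.labels_ == c).nonzero()[0]
        d = ((emb[idx] - km.cluster_centers_[c]) ** 2).sum(axis=1)
        keyframes.append(int(idx[d.argmin()]))
    return sorted(keyframes)
```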
Hierarchical Photo-Scene Encoder for Album Storytelling
In this paper, we propose a novel model with a hierarchical photo-scene encoder and a reconstructor for the task of album storytelling. The photo-scene encoder contains two sub-encoders, namely the photo and scene encoders, which are stacked together and behave hierarchically to fully exploit the structural information of the photos within an album. Specifically, the photo encoder generates a semantic representation for each photo while exploiting temporal relationships among them. The scene encoder, relying on the obtained photo representations, is responsible for detecting scene changes and generating scene representations. Subsequently, the decoder dynamically and attentively summarizes the encoded photo and scene representations to generate a sequence of album representations, based on which a story consisting of multiple coherent sentences is generated. In order to fully extract the useful semantic information from an album, a reconstructor is employed to reproduce the summarized album representations based on the hidden states of the decoder. The proposed model can be trained in an end-to-end manner, which results in improved performance over the state of the art on the public visual storytelling (VIST) dataset. Ablation studies further demonstrate the effectiveness of the proposed hierarchical photo-scene encoder and reconstructor.
Comment: 8 pages, 4 figures
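As a rough illustration of the hierarchical design described above, the following PyTorch sketch stacks a photo encoder and a scene encoder, with a learned gate standing in for scene-change detection. The layer types, sizes, and the soft-segmentation trick are assumptions made for exposition, not the paper's exact architecture.

```python
# Minimal sketch of a hierarchical photo-scene encoder, assuming
# pre-extracted photo features; layer sizes and module names are
# illustrative, not the paper's exact configuration.
import torch
import torch.nn as nn

class PhotoSceneEncoder(nn.Module):
    def __init__(self, feat_dim=2048, hidden=512):
        super().__init__()
        # Photo encoder: a Bi-LSTM models temporal relationships
        # among the photos of an album.
        self.photo_enc = nn.LSTM(feat_dim, hidden, batch_first=True,
                                 bidirectional=True)
        # Scene encoder: a second LSTM over the photo representations,
        # plus a gate that scores scene-change boundaries.
        self.scene_enc = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.boundary = nn.Linear(2 * hidden, 1)

    def forward(self, photo_feats):                     # (B, T, feat_dim)
        photo_repr, _ = self.photo_enc(photo_feats)     # (B, T, 2*hidden)
        change = torch.sigmoid(self.boundary(photo_repr))  # (B, T, 1)
        # Soft scene segmentation: weight photo representations by the
        # predicted change score before the scene-level pass.
        scene_repr, _ = self.scene_enc(photo_repr * change)
        return photo_repr, scene_repr, change

album = torch.randn(2, 5, 2048)   # 2 albums, 5 photos each
photos, scenes, boundaries = PhotoSceneEncoder()(album)
```

A decoder (not shown) would attend over both photo- and scene-level outputs, and a reconstructor would try to reproduce the album representations from the decoder's hidden states during training.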
Unsupervised Video Summarization via Attention-Driven Adversarial Learning
This paper presents a new video summarization approach that integrates an attention mechanism to identify the significant parts of the video, and is trained in an unsupervised manner via generative adversarial learning. Starting from the SUM-GAN model, we first develop an improved version of it (called SUM-GAN-sl) that has a significantly reduced number of learned parameters, performs incremental training of the model's components, and applies a stepwise, label-based strategy for updating the adversarial part. Subsequently, we introduce an attention mechanism to SUM-GAN-sl in two ways: i) by integrating an attention layer within the variational auto-encoder (VAE) of the architecture (SUM-GAN-VAAE), and ii) by replacing the VAE with a deterministic attention auto-encoder (SUM-GAN-AAE). Experimental evaluation on two datasets (SumMe and TVSum) documents the contribution of the attention auto-encoder to faster and more stable training of the model, resulting in a significant performance improvement with respect to the original model and demonstrating the competitiveness of the proposed SUM-GAN-AAE against the state of the art.
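The following sketch illustrates the idea behind the deterministic attention auto-encoder variant (SUM-GAN-AAE): an attention layer re-weights frame features before an encoder-decoder reconstructs them, while a scorer produces per-frame importance. The dimensions, the multi-head attention formulation, and the module names are assumptions, and the adversarial training loop is only indicated in comments.

```python
# Rough sketch of a deterministic attention auto-encoder for frame
# importance, in the spirit of SUM-GAN-AAE; dimensions and the exact
# attention formulation are assumptions, not the authors' code.
import torch
import torch.nn as nn

class AttentionAutoEncoder(nn.Module):
    def __init__(self, feat_dim=1024, hidden=512):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4,
                                          batch_first=True)
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, feat_dim, batch_first=True)
        self.scorer = nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())

    def forward(self, frames):                 # (B, T, feat_dim)
        scores = self.scorer(frames)           # per-frame importance
        weighted = frames * scores             # scorer-weighted features
        attended, _ = self.attn(weighted, weighted, weighted)
        latent, _ = self.encoder(attended)
        recon, _ = self.decoder(latent)
        # Training (not shown) pits this reconstruction against a
        # discriminator in a GAN setup: a good summary is one whose
        # reconstruction the discriminator cannot tell from the original.
        return recon, scores
```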
Query Twice: Dual Mixture Attention Meta Learning for Video Summarization
Video summarization aims to select representative frames that retain high-level information, which is usually done by predicting segment-wise importance scores via a softmax function. However, the softmax function struggles to retain high-rank representations for complex visual or sequential information, which is known as the Softmax Bottleneck problem. In this paper, we propose a novel framework named the Dual Mixture Attention (DMASum) model with Meta Learning for video summarization that tackles the softmax bottleneck problem, where the Mixture of Attention layer (MoA) effectively increases the model capacity by employing twice self-query attention that can capture second-order changes in addition to the initial query-key attention, and a novel Single Frame Meta Learning rule is then introduced to achieve better generalization to small datasets with limited training sources. Furthermore, DMASum exploits both visual and sequential attention, connecting local key-frame and global attention in an accumulative way. We adopt the new evaluation protocol on two public datasets, SumMe and TVSum. Both qualitative and quantitative experiments show significant improvements over state-of-the-art methods.
Comment: This manuscript has been accepted at ACM MM 2020
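To clarify how a mixture of attention distributions sidesteps the softmax bottleneck, here is an illustrative sketch of the "query twice" idea: the query is refined by a second self-query pass, and several softmax attention "experts" are mixed with learned weights, yielding a higher-rank attention than any single softmax. The head count, mixing scheme, and all names are assumptions, not the DMASum implementation.

```python
# Illustrative sketch of dual (twice self-query) attention mixed over
# several softmax "experts"; names and dimensions are assumptions,
# not the DMASum implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualMixtureAttention(nn.Module):
    def __init__(self, dim=512, n_mix=4):
        super().__init__()
        self.q1 = nn.Linear(dim, dim)   # initial query-key attention
        self.q2 = nn.Linear(dim, dim)   # second self-query pass
        self.key = nn.Linear(dim, dim)
        # One attention "expert" per mixture component, plus a prior.
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_mix))
        self.prior = nn.Linear(dim, n_mix)

    def forward(self, x):                          # (B, T, dim)
        q = self.q1(x)
        # Query the query: capture second-order changes in the sequence.
        q = q + self.q2(q)
        k = self.key(x)
        pi = F.softmax(self.prior(x), dim=-1)      # (B, T, n_mix) mixture weights
        out = 0
        for m, expert in enumerate(self.experts):
            logits = q @ expert(k).transpose(1, 2) / (x.size(-1) ** 0.5)
            attn = F.softmax(logits, dim=-1)       # (B, T, T) per-expert attention
            out = out + pi[..., m:m + 1] * (attn @ x)
        # A mixture of softmaxes has higher rank than a single softmax,
        # which is the crux of escaping the softmax bottleneck.
        return out
```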