177 research outputs found
Measuring scene detection performance
In this paper we evaluate the performance of scene detection techniques, starting from the classic precision/recall approach, moving to the better-designed coverage/overflow measures, and finally proposing an improved metric that resolves frequently observed cases in which the numeric scores disagree with the expected results. Numerical evaluation is performed on two recent proposals for automatic scene detection, comparing them with a simple but effective novel approach. Experiments are conducted to show how different measures may lead to different interpretations.
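The coverage/overflow measures mentioned above can be sketched in a simplified form (this is an illustrative reading of the standard definitions, not the paper's exact formulation): coverage is the fraction of a ground-truth scene's shots captured by its single best-matching detected scene, and overflow measures how much the detected scenes overlapping it spill into the neighbouring ground-truth scenes. Scenes are represented here as sets of shot indices.

```python
def coverage(gt_scene, detected_scenes):
    # Fraction of the ground-truth scene's shots covered by the
    # single best-matching detected scene (1.0 = perfectly covered).
    return max(len(gt_scene & d) for d in detected_scenes) / len(gt_scene)

def overflow(gt_scenes, t, detected_scenes):
    # Shots of detected scenes that overlap gt_scenes[t] but spill
    # into its neighbouring ground-truth scenes, normalised by the
    # neighbours' total length (0.0 = no spill, capped at 1.0).
    gt = gt_scenes[t]
    neighbours = set()
    if t > 0:
        neighbours |= gt_scenes[t - 1]
    if t + 1 < len(gt_scenes):
        neighbours |= gt_scenes[t + 1]
    if not neighbours:
        return 0.0
    spill = sum(len((d - gt) & neighbours)
                for d in detected_scenes if d & gt)
    return min(1.0, spill / len(neighbours))
```

For example, with ground truth `[{0,1,2}, {3,4,5}, {6,7}]` and detection `[{0,1}, {2,3,4,5}, {6,7}]`, the middle scene has coverage 1.0 but non-zero overflow, since the detected scene `{2,3,4,5}` absorbs shot 2 from its left neighbour.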
Shot and Scene Detection via Hierarchical Clustering for Re-using Broadcast Video
Video decomposition techniques are fundamental tools for effective video browsing and re-use. In this work, we consider the problem of segmenting broadcast videos into coherent scenes, and propose a scene detection algorithm based on hierarchical clustering, along with a very fast state-of-the-art shot segmentation approach. Experiments demonstrate the effectiveness of our algorithms in comparison with recent proposals for automatic shot and scene segmentation.
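The hierarchical-clustering idea can be illustrated with a minimal sketch (an assumption-laden simplification, not the paper's algorithm): temporally adjacent shot clusters are repeatedly merged when their mean features are closest, and merging stops once the smallest gap exceeds a threshold; the remaining cluster boundaries are the scene boundaries.

```python
import math

def detect_scenes(shot_features, threshold):
    # Agglomerative clustering constrained to temporally adjacent
    # shots: merge the closest neighbouring pair of clusters until
    # the smallest centroid distance exceeds the threshold.
    clusters = [[f] for f in shot_features]

    def centroid(cluster):
        return [sum(dim) / len(cluster) for dim in zip(*cluster)]

    while len(clusters) > 1:
        dists = [math.dist(centroid(clusters[i]), centroid(clusters[i + 1]))
                 for i in range(len(clusters) - 1)]
        i = min(range(len(dists)), key=dists.__getitem__)
        if dists[i] > threshold:
            break
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]

    # Each cluster is one scene; report its length in shots.
    return [len(c) for c in clusters]
```

With toy one-dimensional shot features `[[0.0], [0.1], [5.0], [5.1]]` and threshold 1.0, the first two and last two shots merge into two scenes of two shots each.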
A Video Library System Using Scene Detection and Automatic Tagging
We present a novel video browsing and retrieval system for edited videos, in which videos are automatically decomposed into meaningful and storytelling parts (i.e. scenes) and tagged according to their transcript. The system relies on a Triplet Deep Neural Network which exploits multimodal features, and has been implemented as a set of extensions to the eXo Platform Enterprise Content Management System (ECMS). This set of extensions enables the interactive visualization of a video, its automatic and semi-automatic annotation, and keyword-based search within the video collection. The platform also allows natural integration with third-party add-ons, so that automatic annotations can be exploited outside the proposed platform.
Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions
Current captioning approaches can describe images using black-box architectures whose behavior is hard to control and explain from the outside. As an image can be described in infinite ways depending on the goal and the context at hand, a higher degree of controllability is needed to apply captioning algorithms in complex scenarios. In this paper, we introduce a novel framework for image captioning which can generate diverse descriptions by allowing both grounding and controllability. Given a control signal in the form of a sequence or set of image regions, we generate the corresponding caption through a recurrent architecture which predicts textual chunks explicitly grounded on regions, following the constraints of the given control. Experiments are conducted on Flickr30k Entities and on COCO Entities, an extended version of COCO in which we add grounding annotations collected in a semi-automatic manner. Results demonstrate that our method achieves state-of-the-art performance on controllable image captioning, in terms of caption quality and diversity. Code and annotations are publicly available at: https://github.com/aimagelab/show-control-and-tell
SMArT: Training Shallow Memory-aware Transformers for Robotic Explainability
The ability to generate natural language explanations conditioned on the visual perception is a crucial step towards autonomous agents which can explain themselves and communicate with humans. While the research efforts in image and video captioning are giving promising results, this is often done at the expense of the computational requirements of the approaches, limiting their applicability to real contexts. In this paper, we propose a fully-attentive captioning algorithm which can provide state-of-the-art performance on language generation while restricting its computational demands. Our model is inspired by the Transformer model and employs only two Transformer layers in the encoding and decoding stages. Further, it incorporates a novel memory-aware encoding of image regions. Experiments demonstrate that our approach achieves competitive results in terms of caption quality while featuring reduced computational demands. Further, to evaluate its applicability on autonomous agents, we conduct experiments on simulated scenes taken from the perspective of domestic robots.

Comment: ICRA 202
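One common way to make attention "memory-aware" (a hedged sketch of the general technique, not necessarily this paper's exact design) is to append learnable memory slots to the projected keys and values of self-attention, so every query can also attend to a priori memory vectors alongside the image regions. A minimal NumPy version of single-head attention with such slots:

```python
import numpy as np

def memory_attention(x, w_q, w_k, w_v, mem_k, mem_v):
    # x: (n_regions, d) region features; w_*: (d, d) projections;
    # mem_k, mem_v: (n_slots, d) learnable memory slots appended to
    # the keys/values so queries attend to regions AND memory.
    q = x @ w_q
    k = np.concatenate([x @ w_k, mem_k], axis=0)
    v = np.concatenate([x @ w_v, mem_v], axis=0)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax rows
    return weights @ v                               # (n_regions, d)
```

The output keeps the shape of the region features, so the block drops into a standard Transformer encoder layer; all names here (`mem_k`, `mem_v`, etc.) are illustrative.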
Hierarchical Boundary-Aware Neural Encoder for Video Captioning
The use of Recurrent Neural Networks for video captioning has recently gained a lot of attention, since they can be used both to encode the input video and to generate the corresponding description. In this paper, we present a recurrent video encoding scheme which can discover and leverage the hierarchical structure of the video. Unlike the classical encoder-decoder approach, in which a video is encoded continuously by a recurrent layer, we propose a novel LSTM cell, which can identify discontinuity points between frames or segments and modify the temporal connections of the encoding layer accordingly. We evaluate our approach on three large-scale datasets: the Montreal Video Annotation dataset, the MPII Movie Description dataset and the Microsoft Video Description Corpus. Experiments show that our approach can discover appropriate hierarchical representations of input videos and improve the state-of-the-art results on movie description datasets.
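The boundary-aware mechanism can be caricatured as follows (a deliberately simplified sketch with hypothetical `step` and `boundary_fn` callables, not the paper's learned LSTM cell): a recurrent step accumulates state over frames, and whenever a detector fires between consecutive inputs the state is emitted and reset, so each segment is encoded independently.

```python
def boundary_aware_encode(frames, step, boundary_fn):
    # step(h, f): recurrent update; h is None at a segment start.
    # boundary_fn(prev, f): True when a discontinuity is detected
    # between consecutive inputs, which resets the hidden state.
    h, prev = None, None
    segment_codes = []
    for f in frames:
        if prev is not None and boundary_fn(prev, f):
            segment_codes.append(h)  # emit the finished segment
            h = None                 # cut the temporal connection
        h = step(h, f)
        prev = f
    segment_codes.append(h)
    return segment_codes
```

With a running-sum `step` and a simple jump detector, the scalar sequence `[1, 1, 5, 5]` yields two segment codes, one per homogeneous run.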