3 research outputs found
A Deep Ranking Model for Spatio-Temporal Highlight Detection from a 360 Video
We address the problem of highlight detection from a 360 degree video by
summarizing it both spatially and temporally. Given a long 360 degree video, we
spatially select pleasant-looking normal field-of-view (NFOV) segments from
the unlimited fields of view (FOV) of the 360 degree video, and temporally
summarize it into a concise and informative highlight as a selected subset of
subshots. We propose a novel deep ranking model named Composition View Score
(CVS) model, which produces a spherical score map of composition per video
segment, and determines which view is suitable for a highlight via a sliding
window kernel at inference. To evaluate the proposed framework, we perform
experiments on the Pano2Vid benchmark dataset and our newly collected 360
degree video highlight dataset from YouTube and Vimeo. Through evaluation using
both quantitative summarization metrics and user studies via Amazon Mechanical
Turk, we demonstrate that our approach outperforms several state-of-the-art
highlight detection methods. We also show that our model is 16 times faster at
inference than AutoCam, one of the first summarization algorithms for 360
degree videos.
Comment: In AAAI 2018, 9 pages
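As an illustration of the sliding-window inference described above, the sketch below scores every candidate NFOV window on an equirectangular composition score map and keeps the highest-scoring one. This is a minimal sketch, not the CVS model itself; the grid shape, window size, and mean-score criterion are assumptions for illustration.

```python
# Minimal sketch (not the authors' code): pick the best NFOV window from a
# per-segment spherical composition score map stored as an equirectangular grid.
import numpy as np

def best_nfov_window(score_map: np.ndarray, win_h: int, win_w: int):
    """Return the (row, col) of the window with the highest mean score.

    score_map : (H, W) equirectangular composition scores for one segment.
    win_h, win_w : window size in grid cells approximating the NFOV extent.
    Horizontal wrap-around is handled by tiling; vertical wrap is ignored.
    """
    H, W = score_map.shape
    wrapped = np.concatenate([score_map, score_map[:, :win_w - 1]], axis=1)
    best, best_pos = -np.inf, (0, 0)
    for r in range(H - win_h + 1):
        for c in range(W):
            s = wrapped[r:r + win_h, c:c + win_w].mean()
            if s > best:
                best, best_pos = s, (r, c)
    return best_pos, best

# Toy usage: pick the best 60x60-cell window from a 180x360 score map.
scores = np.random.rand(180, 360)
(row, col), score = best_nfov_window(scores, 60, 60)
```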
Dense 2D-3D Indoor Prediction with Sound via Aligned Cross-Modal Distillation
Sound can convey significant information for spatial reasoning in our daily
lives. To endow deep networks with such ability, we address the challenge of
dense indoor prediction with sound in both 2D and 3D via cross-modal knowledge
distillation. In this work, we propose a Spatial Alignment via Matching (SAM)
distillation framework that elicits local correspondence between the two
modalities in vision-to-audio knowledge transfer. SAM integrates audio features
with visually coherent learnable spatial embeddings to resolve inconsistencies
in multiple layers of a student model. Our approach does not rely on a specific
input representation, allowing for flexibility in the input shapes or
dimensions without performance degradation. With a newly curated benchmark
named Dense Auditory Prediction of Surroundings (DAPS), we are the first to
tackle dense indoor prediction of omnidirectional surroundings in both 2D and
3D with audio observations. Specifically, for audio-based depth estimation,
semantic segmentation, and challenging 3D scene reconstruction, the proposed
distillation framework consistently achieves state-of-the-art performance
across various metrics and backbone architectures.
Comment: Published at ICCV 2023
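The sketch below illustrates the general idea of vision-to-audio feature distillation with a learnable spatial embedding added on the audio-student side before matching against frozen visual-teacher features. It is a hedged, minimal example and not the SAM framework's actual matching scheme; module names, feature shapes, and the simple per-location MSE loss are assumptions.

```python
# Minimal sketch (not the SAM implementation) of cross-modal distillation:
# a learnable spatial embedding nudges audio features toward the spatial layout
# of a frozen visual teacher, and the two are matched with an MSE loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioStudentLayer(nn.Module):
    def __init__(self, channels: int, height: int, width: int):
        super().__init__()
        # Learnable spatial embedding intended to align audio features with the
        # visually coherent layout of the teacher's feature map (assumption).
        self.spatial_embed = nn.Parameter(torch.zeros(1, channels, height, width))
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, audio_feat: torch.Tensor) -> torch.Tensor:
        return self.proj(audio_feat + self.spatial_embed)

def distillation_loss(student_feat, teacher_feat):
    # Simple per-location feature matching; the paper's matching scheme may differ.
    return F.mse_loss(student_feat, teacher_feat.detach())

# Toy usage: one intermediate layer with 64 channels on a 16x32 grid.
layer = AudioStudentLayer(64, 16, 32)
audio_feat = torch.randn(2, 64, 16, 32)    # from an audio encoder (placeholder)
teacher_feat = torch.randn(2, 64, 16, 32)  # from a frozen visual teacher (placeholder)
loss = distillation_loss(layer(audio_feat), teacher_feat)
loss.backward()
```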
Multi-task self-supervised object detection via recycling of bounding box annotations
© 2019 IEEE. In spite of the recent enormous success of deep convolutional networks in object detection, they require a large amount of bounding box annotations, which are often time-consuming and error-prone to obtain. To make better use of the given limited labels, we propose a novel object detection approach that takes advantage of both multi-task learning (MTL) and self-supervised learning (SSL). We propose a set of auxiliary tasks that help improve the accuracy of object detection. They create their own labels by recycling the bounding box labels (i.e., the annotations of the main task) in an SSL manner, and are jointly trained with the object detection model in an MTL way. Our approach is integrable with any region-proposal-based detection model. We empirically validate that our approach effectively improves detection performance on various architectures and datasets. We test two state-of-the-art region-proposal-based object detectors, Faster R-CNN and R-FCN, with three CNN backbones, ResNet-101, Inception-ResNet-v2, and MobileNet, on two benchmark datasets, PASCAL VOC and COCO.
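As a hedged illustration of recycling bounding box annotations into a self-supervised auxiliary label, the sketch below derives a multi-hot image-level target from the classes of the ground-truth boxes and trains a small auxiliary head jointly with the detector. The specific auxiliary task, head, and loss weighting here are assumptions for illustration, not the paper's exact auxiliary tasks.

```python
# Minimal sketch (not the authors' code): recycle box annotations into a label
# for an auxiliary multi-label classification task trained jointly with detection.
import torch
import torch.nn as nn

NUM_CLASSES = 20  # e.g. PASCAL VOC

def recycle_boxes_to_multilabel(box_classes):
    """Turn the class ids of the ground-truth boxes into a multi-hot target."""
    target = torch.zeros(NUM_CLASSES)
    target[torch.tensor(box_classes, dtype=torch.long)] = 1.0
    return target

class AuxiliaryHead(nn.Module):
    """Small head on shared backbone features, trained jointly in an MTL fashion."""
    def __init__(self, in_dim: int):
        super().__init__()
        self.fc = nn.Linear(in_dim, NUM_CLASSES)

    def forward(self, pooled_feat):
        return self.fc(pooled_feat)

# Toy usage: the auxiliary loss would be added to the detector's main loss.
head = AuxiliaryHead(in_dim=2048)
pooled = torch.randn(1, 2048)                     # globally pooled backbone feature
target = recycle_boxes_to_multilabel([3, 3, 11])  # two boxes of class 3, one of class 11
aux_loss = nn.functional.binary_cross_entropy_with_logits(head(pooled), target.unsqueeze(0))
```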