cvpaper.challenge in 2016: Futuristic Computer Vision through 1,600 Papers Survey
The paper presents the futuristic challenges discussed in the cvpaper.challenge. In
2015 and 2016, we thoroughly studied 1,600+ papers in several
conferences/journals such as CVPR/ICCV/ECCV/NIPS/PAMI/IJCV.
Self-supervised Learning for Video Correspondence Flow
The objective of this paper is self-supervised learning of feature embeddings
that are suitable for matching correspondences along videos, which we term
correspondence flow. By leveraging the natural spatial-temporal coherence in
videos, we propose to train a "pointer" that reconstructs a target frame by
copying pixels from a reference frame.
We make the following contributions: First, we introduce a simple information
bottleneck that forces the model to learn robust features for correspondence
matching and prevents it from learning trivial solutions, e.g., matching based on
low-level colour information. Second, to tackle the challenges of tracker
drifting due to complex object deformations, illumination changes, and
occlusions, we propose to train a recursive model over long temporal windows
with scheduled sampling and cycle consistency. Third, we achieve
state-of-the-art performance on the DAVIS 2017 video segmentation and JHMDB
keypoint tracking tasks, outperforming all previous self-supervised learning
approaches by a significant margin. Fourth, in order to shed light on the
potential of self-supervised learning for video correspondence flow,
we probe the upper bound by training on additional data, i.e., more diverse
videos, further demonstrating significant improvements on video segmentation.
Comment: BMVC 2019 (Oral Presentation)
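To make the "pointer" mechanism concrete, one common reading is a soft attention over reference pixels: embeddings of the two frames form an affinity matrix, which is softmax-normalised and used to copy reference colours into the target frame. The NumPy sketch below illustrates this reading; the function name, shapes, and temperature value are illustrative assumptions, not the authors' code.

```python
import numpy as np

def reconstruct_by_copying(ref_feat, tgt_feat, ref_color, temperature=0.07):
    """Reconstruct a target frame by softly copying reference-frame pixels.

    ref_feat, tgt_feat: (N, C) L2-normalised per-pixel embeddings.
    ref_color:          (N, 3) colours of the reference frame's N pixels.
    Returns:            (N, 3) reconstructed colours for the target frame.
    """
    # Affinity between every target pixel and every reference pixel.
    logits = tgt_feat @ ref_feat.T / temperature    # (N, N)
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)   # softmax "pointer"
    return weights @ ref_color                      # copy colours

# Toy usage: 64 pixels with 16-dimensional embeddings.
rng = np.random.default_rng(0)
f_ref = rng.normal(size=(64, 16))
f_ref /= np.linalg.norm(f_ref, axis=1, keepdims=True)
f_tgt = rng.normal(size=(64, 16))
f_tgt /= np.linalg.norm(f_tgt, axis=1, keepdims=True)
recon = reconstruct_by_copying(f_ref, f_tgt, rng.uniform(size=(64, 3)))
```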
Crowded Scene Analysis: A Survey
Automated scene analysis has been a topic of great interest in computer
vision and cognitive science. Recently, with the growth of crowd phenomena in
the real world, crowded scene analysis has attracted much attention. However,
the visual occlusions and ambiguities in crowded scenes, as well as the complex
behaviors and scene semantics, make the analysis a challenging task. In the
past few years, an increasing number of works on crowded scene analysis have
been reported, covering different aspects including crowd motion pattern
learning, crowd behavior and activity analysis, and anomaly detection in
crowds. This paper surveys the state-of-the-art techniques on this topic. We
first provide the background knowledge and the available features related to
crowded scenes. Then, existing models, popular algorithms, evaluation
protocols, as well as system performance are provided corresponding to
different aspects of crowded scene analysis. We also outline the available
datasets for performance evaluation. Finally, some research problems and
promising future directions are presented with discussions.
Comment: 20 pages; in IEEE Transactions on Circuits and Systems for Video Technology, 201
Towards Storytelling from Visual Lifelogging: An Overview
Visual lifelogging consists of acquiring images that capture the daily
experiences of the user by wearing a camera over a long period of time. The
pictures taken offer considerable potential for knowledge mining concerning how
people live their lives; hence, they open up new opportunities for many
applications in fields including healthcare, security, leisure, and
the quantified self. However, automatically building a story from a huge
the quantified self. However, automatically building a story from a huge
collection of unstructured egocentric data presents major challenges. This
paper provides a thorough review of advances made so far in egocentric data
analysis, and in view of the current state of the art, indicates new lines of
research to move us towards storytelling from visual lifelogging.
Comment: 16 pages, 11 figures; submitted to IEEE Transactions on Human-Machine Systems
A Survey on Content-Aware Video Analysis for Sports
Sports data analysis is becoming increasingly large-scale, diversified, and
shared, but difficulty persists in rapidly accessing the most crucial
information. Previous surveys have focused on the methodologies of sports video
analysis from the spatiotemporal viewpoint instead of a content-based
viewpoint, and few of these studies have considered semantics. This study
develops a deeper interpretation of content-aware sports video analysis by
examining the insight offered by research into the structure of content under
different scenarios. On the basis of this insight, we provide an overview of
the themes particularly relevant to the research on content-aware systems for
broadcast sports. Specifically, we focus on the video content analysis
techniques applied in sportscasts over the past decade from the perspectives of
fundamentals and general review, a content hierarchical model, and trends and
challenges. Content-aware analysis methods are discussed with respect to
object-, event-, and context-oriented groups. In each group, the gap between
sensation and content excitement must be bridged using proper strategies. In
this regard, a content-aware approach is required to determine user demands.
Finally, the paper summarizes the future trends and challenges for sports video
analysis. We believe that our findings can advance the field of research on
content-aware video analysis for broadcast sports.
Comment: Accepted for publication in IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)
Large-Scale Object Discovery and Detector Adaptation from Unlabeled Video
We explore object discovery and detector adaptation based on unlabeled video
sequences captured from a mobile platform. We propose a fully automatic
approach for object mining from video which builds upon a generic object
tracking approach. By applying this method to three large video datasets from
autonomous driving and mobile robotics scenarios, we demonstrate its robustness
and generality. Based on the object mining results, we propose a novel approach
for unsupervised object discovery by appearance-based clustering. We show that
this approach successfully discovers interesting objects relevant to driving
scenarios. In addition, we perform self-supervised detector adaptation in order
to improve detection performance on the KITTI dataset for existing categories.
Our approach has direct relevance for enabling large-scale object learning for
autonomous driving.
Comment: CVPR'18 submission
Watch-Bot: Unsupervised Learning for Reminding Humans of Forgotten Actions
We present a robotic system that watches a human using a Kinect v2 RGB-D
sensor, detects what the person forgot to do while performing an activity, and,
if necessary, reminds them by using a laser pointer to point out the related
object. Our simple setup can be easily deployed on any assistive robot.
Our approach is based on a learning algorithm trained in a purely
unsupervised setting, which does not require any human annotations. This makes
our approach scalable and applicable to a variety of scenarios. Our model learns
the action/object co-occurrences and the temporal relations among actions in an
activity, and uses these learned relationships to infer the forgotten action and
the related object. We show that our approach not only improves unsupervised
action segmentation and action cluster assignment performance, but also
effectively detects forgotten actions on a challenging human activity RGB-D
video dataset. In robotic experiments, we show that our robot is able to remind
people of forgotten actions successfully.
Multigrid Predictive Filter Flow for Unsupervised Learning on Videos
We introduce multigrid Predictive Filter Flow (mgPFF), a framework for
unsupervised learning on videos. The mgPFF takes as input a pair of frames and
outputs per-pixel filters to warp one frame to the other. Compared to optical
flow used for warping frames, mgPFF is more powerful in modeling sub-pixel
movement and dealing with corruption (e.g., motion blur). We develop a
multigrid coarse-to-fine modeling strategy that avoids the requirement of
learning large filters to capture large displacement. This allows us to train
an extremely compact model (4.6MB) which operates in a progressive way over
multiple resolutions with shared weights. We train mgPFF in an unsupervised
fashion on free-form videos and show that mgPFF not only estimates long-range
flow for frame reconstruction and detects video shot transitions, but is also
readily amenable to video object segmentation and pose tracking, where it
substantially outperforms the published state-of-the-art without bells and
whistles. Moreover, owing to mgPFF's nature of per-pixel filter prediction, we
have the unique opportunity to visualize how each pixel evolves while
solving these tasks, thus gaining better interpretability.
Comment: webpage (https://www.ics.uci.edu/~skong2/mgpff.html)
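For intuition, "per-pixel filter flow" means the network predicts a separate small filter for every output pixel, and warping is just applying each filter over the corresponding input neighbourhood. Below is a minimal single-channel sketch, assuming the predicted filters are already normalised; the function and shapes are illustrative, not the released implementation.

```python
import numpy as np

def apply_per_pixel_filters(frame, filters):
    """Warp `frame` using one k x k filter per output pixel.

    frame:   (H, W) single-channel image.
    filters: (H, W, k, k) per-pixel filter weights (assumed normalised).
    """
    H, W, k, _ = filters.shape
    pad = k // 2
    padded = np.pad(frame, pad, mode="edge")
    out = np.empty((H, W))
    for y in range(H):
        for x in range(W):
            # The k x k neighbourhood of input pixel (y, x), weighted by
            # that output pixel's own predicted kernel.
            patch = padded[y:y + k, x:x + k]
            out[y, x] = np.sum(patch * filters[y, x])
    return out

# Sanity check: filters with weight 1 at the centre reproduce the frame.
frame = np.arange(25, dtype=float).reshape(5, 5)
identity = np.zeros((5, 5, 3, 3))
identity[..., 1, 1] = 1.0
assert np.allclose(apply_per_pixel_filters(frame, identity), frame)
```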
STEm-Seg: Spatio-temporal Embeddings for Instance Segmentation in Videos
Existing methods for instance segmentation in videos typically involve
multi-stage pipelines that follow the tracking-by-detection paradigm and model a
video clip as a sequence of images. Multiple networks are used to detect
objects in individual frames, and then associate these detections over time.
Hence, these methods are often non-end-to-end trainable and highly tailored to
specific tasks. In this paper, we propose a different approach that is
well-suited to a variety of tasks involving instance segmentation in videos. In
particular, we model a video clip as a single 3D spatio-temporal volume, and
propose a novel approach that segments and tracks instances across space and
time in a single stage. Our problem formulation is centered around the idea of
spatio-temporal embeddings which are trained to cluster pixels belonging to a
specific object instance over an entire video clip. To this end, we introduce
(i) novel mixing functions that enhance the feature representation of
spatio-temporal embeddings, and (ii) a single-stage, proposal-free network that
can reason about temporal context. Our network is trained end-to-end to learn
spatio-temporal embeddings as well as the parameters required to cluster these
embeddings, thus simplifying inference. Our method achieves state-of-the-art
results across multiple datasets and tasks. Code and models are available at
https://github.com/sabarim/STEm-Seg.
Comment: 28 pages, 6 figures
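As a rough illustration of the clustering idea, once per-pixel spatio-temporal embeddings and per-instance centres are available, inference can reduce to assigning every pixel in the clip to its nearest centre in embedding space. The sketch below is a hypothetical simplification (the paper learns the clustering parameters end-to-end; the names and shapes here are assumptions):

```python
import numpy as np

def cluster_pixels(embeddings, centers):
    """Assign every spatio-temporal pixel to its nearest instance centre.

    embeddings: (T, H, W, D) per-pixel embeddings for a whole video clip.
    centers:    (K, D) one centre per object instance.
    Returns:    (T, H, W) instance ids in [0, K).
    """
    flat = embeddings.reshape(-1, embeddings.shape[-1])  # (T*H*W, D)
    # Squared Euclidean distance from each pixel to each centre.
    d2 = ((flat[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1).reshape(embeddings.shape[:-1])

# Toy usage: a 2-frame, 4x4 clip with 8-dim embeddings and 2 instances.
rng = np.random.default_rng(1)
emb = rng.normal(size=(2, 4, 4, 8))
centers = np.stack([emb[0, 0, 0], emb[1, 3, 3]])
ids = cluster_pixels(emb, centers)  # (2, 4, 4) instance ids
```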
Learning Correspondence from the Cycle-Consistency of Time
We introduce a self-supervised method for learning visual correspondence from
unlabeled video. The main idea is to use cycle-consistency in time as a free
supervisory signal for learning visual representations from scratch. At
training time, our model learns a feature map representation that is useful for
performing cycle-consistent tracking. At test time, we use the acquired
representation to find nearest neighbors across space and time. We demonstrate
the generalizability of the representation -- without finetuning -- across a
range of visual correspondence tasks, including video object segmentation,
keypoint tracking, and optical flow. Our approach outperforms previous
self-supervised methods and performs competitively with strongly supervised
methods.
Comment: CVPR 2019 Oral. Project page: http://ajabri.github.io/timecycl
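The supervisory signal can be made concrete as follows: track a point forward through time, then backward, and penalise the drift between the end point and the start. Here is a toy NumPy sketch of this cycle-consistency error, assuming precomputed displacement fields; all names are illustrative, and the paper itself operates on patches and learned features rather than raw flow fields.

```python
import numpy as np

def track_step(pos, flow):
    """Move a 2-D point one step along an (H, W, 2) displacement field."""
    y = int(np.clip(round(pos[0]), 0, flow.shape[0] - 1))
    x = int(np.clip(round(pos[1]), 0, flow.shape[1] - 1))
    return pos + flow[y, x]

def cycle_consistency_error(start, forward_flows, backward_flows):
    """Track `start` forward through time and back; return the drift.

    A perfectly cycle-consistent tracker returns to `start`, so this
    Euclidean drift can serve as a free self-supervisory signal.
    """
    pos = np.asarray(start, dtype=float)
    for f in forward_flows:
        pos = track_step(pos, f)
    for b in reversed(backward_flows):
        pos = track_step(pos, b)
    return float(np.linalg.norm(pos - np.asarray(start, dtype=float)))

# Toy usage: constant motion (+1, +1) per step and its exact inverse.
H = W = 8
fwd = [np.ones((H, W, 2))] * 3
bwd = [-np.ones((H, W, 2))] * 3
assert cycle_consistency_error((2.0, 2.0), fwd, bwd) == 0.0
```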