Review of Visual Saliency Detection with Comprehensive Information
Visual saliency detection models simulate the human visual system to perceive
the scene and have been widely used in many vision tasks. With the development
of acquisition technology, more comprehensive information, such as depth cues,
inter-image correspondence, or temporal relationships, has become available,
extending image saliency detection to RGBD saliency detection, co-saliency
detection, and video saliency detection. RGBD saliency detection models focus
on extracting salient regions from RGBD images by incorporating depth
information. Co-saliency detection models introduce an inter-image
correspondence constraint to discover the common salient objects in an image
group. Video saliency detection models aim to locate motion-related salient
objects in video sequences by jointly considering motion cues and
spatiotemporal constraints. In this paper, we review the different types of
saliency detection algorithms, summarize the important issues of the existing
methods, and discuss open problems and future work. Moreover, the evaluation
datasets and quantitative measurements are briefly introduced, and an
experimental analysis and discussion are conducted to provide a holistic
overview of the different saliency detection methods.
Comment: 18 pages, 11 figures, 7 tables. Accepted by IEEE Transactions on
Circuits and Systems for Video Technology, 2018. https://rmcong.github.io
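As a concrete illustration of the depth cue mentioned above, here is a minimal
sketch of late fusion between an RGB saliency map and a simple depth-contrast
map. The depth_contrast heuristic, the alpha weight, and the random inputs are
illustrative assumptions, not any specific model from the review:

    import numpy as np

    def depth_contrast(depth):
        # Global depth contrast: pixels deviating from the mean depth score higher.
        d = (depth - depth.min()) / (np.ptp(depth) + 1e-8)
        return np.abs(d - d.mean())

    def fuse_rgbd_saliency(rgb_saliency, depth, alpha=0.6):
        # Late fusion of an RGB saliency map with a depth-contrast cue;
        # alpha weights the RGB map, both inputs are HxW float arrays.
        fused = alpha * rgb_saliency + (1.0 - alpha) * depth_contrast(depth)
        return (fused - fused.min()) / (np.ptp(fused) + 1e-8)

    # Toy usage with random maps standing in for real predictions.
    rgb_sal = np.random.rand(240, 320)
    depth = np.random.rand(240, 320)
    saliency = fuse_rgbd_saliency(rgb_sal, depth)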
A survey on trajectory clustering analysis
This paper comprehensively surveys the development of trajectory clustering.
Considering the critical role of trajectory data mining in modern intelligent
systems for surveillance security, abnormal behavior detection, crowd behavior
analysis, and traffic control, trajectory clustering has attracted growing
attention. Existing trajectory clustering methods can be grouped into three
categories: unsupervised, supervised, and semi-supervised algorithms. Despite a
certain level of progress, the success of trajectory clustering remains limited
by complex conditions such as diverse application scenarios and data
dimensions. This paper provides a holistic understanding of and deep insight
into trajectory clustering, and presents a comprehensive analysis of
representative methods and promising future directions.
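To make the unsupervised category concrete, here is a minimal sketch of a
common baseline: resample each trajectory to a fixed length and cluster the
resulting feature vectors with k-means. The resampling scheme, cluster count,
and toy data are illustrative assumptions, not a method from the survey:

    import numpy as np
    from sklearn.cluster import KMeans

    def resample(traj, n_points=16):
        # Resample a variable-length (x, y) trajectory to a fixed length by
        # linear interpolation over normalized cumulative arc length.
        traj = np.asarray(traj, dtype=float)
        seg = np.linalg.norm(np.diff(traj, axis=0), axis=1)
        t = np.concatenate([[0.0], np.cumsum(seg)])
        t = t / (t[-1] + 1e-8)
        ts = np.linspace(0.0, 1.0, n_points)
        x = np.interp(ts, t, traj[:, 0])
        y = np.interp(ts, t, traj[:, 1])
        return np.stack([x, y], axis=1).ravel()  # flatten into a feature vector

    # Toy data: noisy copies of two prototype motion patterns.
    rng = np.random.default_rng(0)
    protos = [np.linspace([0, 0], [10, 0], 20), np.linspace([0, 0], [0, 10], 20)]
    trajs = [p + rng.normal(0, 0.3, p.shape) for p in protos for _ in range(25)]

    X = np.stack([resample(t) for t in trajs])
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)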
Unsupervised Deep Context Prediction for Background Foreground Separation
In many advanced video-based applications, background modeling is a
pre-processing step used to eliminate redundant data, for instance in tracking
or video surveillance. In past years, background subtraction has usually been
based on low-level or hand-crafted features such as raw color components,
gradients, or local binary patterns. The performance of background subtraction
algorithms suffers in the presence of challenges such as dynamic backgrounds,
photometric variations, camera jitter, and shadows. To handle these challenges
and achieve accurate background modeling, we propose a unified framework based
on image inpainting: an unsupervised, hybrid generative adversarial algorithm
for visual feature learning based on context prediction. We also present a
solution for random region inpainting that fuses center region inpainting and
random region inpainting using Poisson blending. Furthermore, we evaluate
foreground object detection by fusing our proposed method with morphological
operations. A comparison of our proposed method with 12 state-of-the-art
methods shows its stability for background estimation and foreground detection.
Comment: 17 pages
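As a rough illustration of the foreground detection step, here is a minimal
sketch that thresholds the difference between a frame and an estimated
background (standing in for the inpainting-based estimate) and cleans the mask
with morphological operations. The threshold, kernel size, and synthetic inputs
are illustrative assumptions:

    import cv2
    import numpy as np

    def foreground_mask(frame, background, thresh=30, kernel_size=5):
        # Threshold the frame/background difference, then clean the binary
        # mask with morphological opening and closing.
        diff = cv2.absdiff(frame, background)
        gray = cv2.cvtColor(diff, cv2.COLOR_BGR2GRAY)
        _, mask = cv2.threshold(gray, thresh, 255, cv2.THRESH_BINARY)
        kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE,
                                           (kernel_size, kernel_size))
        mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)   # remove speckle
        mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)  # fill small holes
        return mask

    # Toy usage with synthetic images standing in for a real frame/background pair.
    frame = np.random.randint(0, 255, (240, 320, 3), dtype=np.uint8)
    background = np.random.randint(0, 255, (240, 320, 3), dtype=np.uint8)
    mask = foreground_mask(frame, background)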
You-Do, I-Learn: Unsupervised Multi-User Egocentric Approach Towards Video-Based Guidance
This paper presents an unsupervised approach towards automatically extracting
video-based guidance on object usage, from egocentric video and wearable gaze
tracking, collected from multiple users while performing tasks. The approach i)
discovers task relevant objects, ii) builds a model for each, iii)
distinguishes different ways in which each discovered object has been used and
iv) discovers the dependencies between object interactions. The work
investigates using appearance, position, motion and attention, and presents
results using each and a combination of relevant features. Moreover, an online
scalable approach is presented and is compared to offline results. The paper
proposes a method for selecting a suitable video guide to be displayed to a
novice user indicating how to use an object, purely triggered by the user's
gaze. The potential assistive mode can also recommend an object to be used next
based on the learnt sequence of object interactions. The approach was tested on
a variety of daily tasks, such as initialising a printer, preparing a coffee,
and setting up a gym machine.
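As an illustration of recommending the next object from a learnt sequence of
object interactions, here is a minimal sketch using first-order transition
counts. The function names, the toy task sequences, and the bigram model itself
are illustrative assumptions, not the paper's actual method:

    from collections import Counter, defaultdict

    def learn_transitions(sequences):
        # Count object-to-object transitions across observed task executions.
        transitions = defaultdict(Counter)
        for seq in sequences:
            for a, b in zip(seq, seq[1:]):
                transitions[a][b] += 1
        return transitions

    def recommend_next(transitions, current_object):
        # Recommend the object most frequently used after the current one.
        followers = transitions.get(current_object)
        if not followers:
            return None
        return followers.most_common(1)[0][0]

    # Toy sequences standing in for discovered object-interaction orders.
    runs = [["printer", "paper_tray", "power_button"],
            ["printer", "paper_tray", "ink_cartridge"],
            ["printer", "paper_tray", "power_button"]]
    model = learn_transitions(runs)
    print(recommend_next(model, "paper_tray"))  # -> power_button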
Multigrid Predictive Filter Flow for Unsupervised Learning on Videos
We introduce multigrid Predictive Filter Flow (mgPFF), a framework for
unsupervised learning on videos. The mgPFF takes as input a pair of frames and
outputs per-pixel filters to warp one frame to the other. Compared to optical
flow used for warping frames, mgPFF is more powerful in modeling sub-pixel
movement and dealing with corruption (e.g., motion blur). We develop a
multigrid coarse-to-fine modeling strategy that avoids the requirement of
learning large filters to capture large displacement. This allows us to train
an extremely compact model (4.6MB) which operates in a progressive way over
multiple resolutions with shared weights. We train mgPFF on unsupervised,
free-form videos and show that mgPFF is not only able to estimate long-range
flow for frame reconstruction and detect video shot transitions, but is also
readily amenable to video object segmentation and pose tracking, where it
substantially outperforms the published state-of-the-art without bells and
whistles. Moreover, owing to mgPFF's per-pixel filter prediction, we have the
unique opportunity to visualize how each pixel evolves while solving these
tasks, thus gaining better interpretability.
Comment: webpage (https://www.ics.uci.edu/~skong2/mgpff.html)
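To illustrate the per-pixel filter idea, here is a minimal sketch of warping a
frame with predicted per-pixel filters, where each output pixel is a
filter-weighted sum of its neighborhood in the source frame. The brute-force
loop and toy identity filters are illustrative simplifications of how such
filters could be applied, not mgPFF's multigrid implementation:

    import numpy as np

    def apply_pixel_filters(src, filters):
        # Warp src (H x W) with per-pixel filters (H x W x k x k): each output
        # pixel is the filter-weighted sum of its k x k neighborhood in src.
        H, W = src.shape
        k = filters.shape[2]
        r = k // 2
        padded = np.pad(src, r, mode="edge")
        out = np.zeros_like(src, dtype=float)
        for y in range(H):
            for x in range(W):
                patch = padded[y:y + k, x:x + k]
                out[y, x] = np.sum(patch * filters[y, x])
        return out

    # Toy usage: identity (delta) filters reproduce the source frame exactly.
    H, W, k = 32, 32, 5
    filters = np.zeros((H, W, k, k))
    filters[:, :, k // 2, k // 2] = 1.0
    frame = np.random.rand(H, W)
    assert np.allclose(apply_pixel_filters(frame, filters), frame)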
Human Centred Object Co-Segmentation
Co-segmentation is the automatic extraction of the common semantic regions
given a set of images. Different from previous approaches based mainly on
object appearance, in this paper we propose a human-centred object
co-segmentation approach, which uses the human as another source of strong
evidence. In order to discover the rich internal structure of the objects,
reflecting their human-object interactions and visual similarities, we propose
an unsupervised fully connected CRF auto-encoder that incorporates rich object
features and a novel human-object interaction representation. We propose an
efficient learning and inference algorithm that allows the full connectivity of
the CRF with the auto-encoder, establishing pairwise relations on all pairs of
the object proposals in the dataset. Moreover, the auto-encoder learns its
parameters from the data itself, rather than through supervised learning or
manually assigned parameters as in conventional CRFs. In extensive experiments
on four datasets, we show that our approach extracts the common objects more
accurately than state-of-the-art co-segmentation algorithms.
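As a rough illustration of establishing pairwise relations on all pairs of
object proposals, here is a minimal sketch that builds a dense Gaussian
affinity matrix from proposal feature vectors. The kernel choice, bandwidth,
and random features are illustrative assumptions, not the paper's CRF
auto-encoder:

    import numpy as np

    def pairwise_affinities(features, sigma=1.0):
        # Gaussian affinity between every pair of object-proposal feature
        # vectors, giving the dense pairwise term of a fully connected model.
        sq_dists = np.sum((features[:, None, :] - features[None, :, :]) ** 2,
                          axis=-1)
        return np.exp(-sq_dists / (2.0 * sigma ** 2))

    # Toy usage: 6 proposals with 8-dimensional appearance features.
    feats = np.random.rand(6, 8)
    A = pairwise_affinities(feats)  # A[i, j] is high when proposals i, j look alike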
Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey
Large-scale labeled data are generally required to train deep neural networks
in order to obtain better performance in visual feature learning from images or
videos for computer vision applications. To avoid the extensive cost of
collecting and annotating large-scale datasets, self-supervised learning
methods, a subset of unsupervised learning methods, have been proposed to learn
general image and video features from large-scale unlabeled data without using
any human-annotated labels. This paper provides an extensive review of deep
learning-based self-supervised general visual feature learning methods from
images and videos. First, the motivation, general pipeline, and terminology of
this field are described. Then, the common deep neural network architectures
used for self-supervised learning are summarized. Next, the main components and
evaluation metrics of self-supervised learning methods are reviewed, followed
by the commonly used image and video datasets and the existing self-supervised
visual feature learning methods. Finally, quantitative performance comparisons
of the reviewed methods on benchmark datasets are summarized and discussed for
both image and video feature learning. The paper concludes with a set of
promising future directions for self-supervised visual feature learning.
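To make the general pipeline concrete, here is a minimal sketch of one widely
used pretext task from this literature, rotation prediction: a network is
trained to classify which rotation was applied to an unlabeled image, and its
learned features can then transfer to downstream tasks. The tiny architecture,
random stand-in images, and hyperparameters are illustrative assumptions:

    import torch
    import torch.nn as nn

    # Tiny CNN that predicts which of 4 rotations (0/90/180/270 deg) was applied.
    model = nn.Sequential(
        nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(32, 4),
    )
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    images = torch.rand(8, 3, 32, 32)  # unlabeled images (random stand-ins)
    for step in range(10):
        k = torch.randint(0, 4, (images.size(0),))  # one rotation id per image
        rotated = torch.stack([torch.rot90(img, int(r), dims=(1, 2))
                               for img, r in zip(images, k)])
        loss = loss_fn(model(rotated), k)  # pseudo-label is the rotation id
        opt.zero_grad()
        loss.backward()
        opt.step()
    # After pretext training, the conv layers serve as a general feature extractor.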
AED-Net: An Abnormal Event Detection Network
Detecting anomalies in crowded scenes has long been challenging. In this paper,
a self-supervised framework, the abnormal event detection network (AED-Net),
composed of a PCAnet and kernel principal component analysis (kPCA), is
proposed to address this problem. Using surveillance video sequences of
different scenes as raw data, the PCAnet is trained to extract high-level
semantics of the crowd's situation. Next, kPCA, a one-class classifier, is
trained to determine whether a scene is anomalous. In contrast to some
prevailing deep learning methods, the framework is completely self-supervised
because it utilizes only video sequences of normal situations. Experiments on
global and local abnormal event detection are carried out on the UMN and UCSD
datasets, and competitive EER and AUC results compared with other
state-of-the-art methods are observed. Furthermore, by adding a local response
normalization (LRN) layer, we propose an improvement to the original AED-Net,
which the experiments show performs better by improving the framework's
generalization capacity.
Comment: 14 pages, 7 figures
Learning to see like children: proof of concept
In the last few years we have seen growing interest in machine learning
approaches to computer vision and, especially, to semantic labeling. Nowadays,
state-of-the-art systems use deep learning on millions of labeled images, with
very successful results on benchmarks, though similar results are unlikely in
unrestricted visual environments. Most learning schemes essentially ignore the
inherent sequential structure of videos: this might be a critical issue, since
any visual recognition process is remarkably more complex when video frames are
shuffled. Based on this remark, we propose a re-foundation of the communication
protocol between visual agents and the environment, which we refer to as
learning to see like children. As with human interaction, visual concepts are
acquired by the agents solely by processing their own visual stream along with
human supervision on selected pixels. We give a proof of concept that
remarkable semantic labeling can emerge within this protocol using only a few
supervised examples. This is made possible by exploiting a motion-coherence
constraint on labeling that virtually offers an abundance of supervision.
Additional visual constraints, including those associated with object
supervision, are used within the context of learning from constraints. The
framework is extended in the direction of lifelong learning, so that our visual
agents live in their own visual environment without distinguishing between a
learning and a test set. Learning takes place in deep architectures under a
progressive developmental scheme. To evaluate our Developmental Visual Agents
(DVAs), in addition to classic benchmarks, we open the doors of our lab,
allowing people to evaluate DVAs by crowd-sourcing. Such an assessment
mechanism might result in a paradigm shift in methodologies and algorithms for
computer vision, encouraging truly novel solutions within the proposed
framework.
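To illustrate how motion-coherent labeling can multiply a few supervised pixels
into abundant supervision, here is a minimal sketch that propagates per-pixel
labels to the next frame along dense optical flow. The Farneback parameters,
the approximate backward warp, and the synthetic frames are illustrative
assumptions, not the paper's learning-from-constraints formulation:

    import cv2
    import numpy as np

    def propagate_labels(labels, prev_gray, next_gray):
        # Carry per-pixel labels from frame t to frame t+1 along dense optical
        # flow, turning a few supervised pixels into supervision on many frames.
        flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        h, w = labels.shape
        grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
        # Approximate backward warp: assuming smooth flow, each pixel in frame
        # t+1 looks up its label at (p - flow(p)) in frame t.
        map_x = (grid_x - flow[..., 0]).astype(np.float32)
        map_y = (grid_y - flow[..., 1]).astype(np.float32)
        return cv2.remap(labels, map_x, map_y, cv2.INTER_NEAREST)

    # Toy usage with synthetic frames; real use would iterate over a video stream.
    f0 = np.random.randint(0, 255, (120, 160), dtype=np.uint8)
    f1 = np.random.randint(0, 255, (120, 160), dtype=np.uint8)
    seeds = np.zeros((120, 160), dtype=np.uint8)
    seeds[60, 80] = 1  # a single supervised pixel
    next_labels = propagate_labels(seeds, f0, f1)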
A Survey on Recent Advances of Computer Vision Algorithms for Egocentric Video
Recent technological advances have made lightweight, head-mounted cameras both
practical and affordable, and products like Google Glass show first approaches
to introducing the idea of egocentric (first-person) video to the mainstream.
Interestingly, the computer vision community has only recently started to
explore this new domain of egocentric vision, where research can roughly be
categorized into three areas: object recognition, activity
detection/recognition, and video summarization. In this paper, we aim to give a
broad overview of the different problems that have been addressed, and we
collect and compare evaluation results. Moreover, along with the emergence of
this new domain came the introduction of numerous new and versatile benchmark
datasets, which we summarize and compare as well.