Multi-feature Bottom-up Processing and Top-down Selection for an Object-based Visual Attention Model
Artificial vision systems cannot process all the information they receive from the world in real time, since doing so would be prohibitively expensive in computational terms. Inspired by biological perception systems, however, it is possible to develop an artificial attention model that selects only the relevant parts of the scene, as human vision does. This paper presents an attention model that directs attention to perceptual units of visual information, called proto-objects, and computes the saliency of each one as a linear combination of multiple low-level features (such as colour, symmetry, or shape). The model addresses not only bottom-up processing but also the top-down component of attention: it is shown how a high-level task can modulate the global saliency computation by modifying the weights of the linear feature combination.
Funding: Ministerio de Economía y Competitividad (MINECO), projects TIN2008-06196 and TIN2012-38079-C03-03; Campus de Excelencia Internacional Andalucía Tech.
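To make the saliency computation concrete, here is a minimal Python sketch (not the authors' code; the feature names, scores, and task weights are illustrative assumptions) that scores each proto-object as a weighted sum of low-level features and shows how a top-down task can re-weight the combination:

    # Minimal sketch: per-proto-object saliency as a linear feature combination.
    # Feature channels are hypothetical stand-ins for colour, symmetry, shape.
    FEATURES = ["colour", "symmetry", "shape"]

    def saliency(feature_scores, weights):
        # S(proto-object) = sum_i w_i * f_i
        return sum(weights[f] * feature_scores[f] for f in FEATURES)

    # Bottom-up: uniform weights over all feature channels.
    bottom_up = {f: 1.0 / len(FEATURES) for f in FEATURES}
    # Top-down: a high-level task (e.g. "find the red object") shifts the weights.
    top_down = {"colour": 0.7, "symmetry": 0.1, "shape": 0.2}

    proto_objects = [
        {"colour": 0.9, "symmetry": 0.2, "shape": 0.4},   # colourful, irregular
        {"colour": 0.3, "symmetry": 0.8, "shape": 0.7},   # dull but well-formed
    ]
    for obj in proto_objects:
        print(saliency(obj, bottom_up), saliency(obj, top_down))

Note how the task-dependent weights flip which proto-object is most salient, which is the top-down modulation the paper describes.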
Data-Driven Shape Analysis and Processing
Data-driven methods play an increasingly important role in discovering
geometric, structural, and semantic relationships between 3D shapes in
collections, and applying this analysis to support intelligent modeling,
editing, and visualization of geometric data. In contrast to traditional
approaches, a key feature of data-driven approaches is that they aggregate
information from a collection of shapes to improve the analysis and processing
of individual shapes. In addition, they are able to learn models that reason
about properties and relationships of shapes without relying on hard-coded
rules or explicitly programmed instructions. We provide an overview of the main
concepts and components of these techniques, and discuss their application to
shape classification, segmentation, matching, reconstruction, modeling and
exploration, as well as scene analysis and synthesis, through reviewing the
literature and relating the existing works with both qualitative and numerical
comparisons. We conclude our report with ideas that can inspire future research
in data-driven shape analysis and processing.
Comment: 10 pages, 19 figures
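As a minimal illustration of the data-driven idea, that analyzing one shape can borrow information from a whole collection, the sketch below (our own toy example, not from the survey) labels a query shape by a nearest-neighbour vote over hypothetical shape descriptors:

    import numpy as np

    def knn_classify(query_desc, collection_descs, collection_labels, k=3):
        # Majority vote among the k nearest neighbours in descriptor space:
        # the simplest way a collection informs the analysis of one shape.
        dists = np.linalg.norm(collection_descs - query_desc, axis=1)
        nearest = np.argsort(dists)[:k]
        votes = [collection_labels[i] for i in nearest]
        return max(set(votes), key=votes.count)

    rng = np.random.default_rng(0)
    descs = rng.normal(size=(100, 32))                    # toy shape descriptors
    labels = ["chair" if i < 50 else "table" for i in range(100)]
    print(knn_classify(descs[0] + 0.01, descs, labels))   # -> "chair"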
Towards Segmenting Anything That Moves
Detecting and segmenting individual objects, regardless of their category, is
crucial for many applications such as action detection or robotic interaction.
While this problem has been well-studied under the classic formulation of
spatio-temporal grouping, state-of-the-art approaches do not make use of
learning-based methods. To bridge this gap, we propose a simple learning-based
approach for spatio-temporal grouping. Our approach leverages motion cues from
optical flow as a bottom-up signal for separating objects from each other.
Motion cues are then combined with appearance cues that provide a generic
objectness prior for capturing the full extent of objects. We show that our
approach outperforms all prior work on the benchmark FBMS dataset. One
potential worry with learning-based methods is that they might overfit to the
particular type of objects that they have been trained on. To address this
concern, we propose two new benchmarks for generic, moving object detection,
and show that our model matches top-down methods on common categories, while
significantly outperforming both top-down and bottom-up methods on
never-before-seen categories.
Comment: Website: http://www.achaldave.com/projects/anything-that-moves/.
Code: https://github.com/achalddave/segment-any-movin
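The core fusion of the two cues can be sketched as follows; this is a hand-built stand-in (the paper learns both signals), with the flow field and objectness map supplied as toy arrays:

    import numpy as np

    def moving_object_mask(flow, objectness, motion_thresh=1.0):
        # Fuse a bottom-up motion cue (flow magnitude) with an appearance-based
        # objectness prior; both are stand-ins for the paper's learned signals.
        motion = np.linalg.norm(flow, axis=-1)            # per-pixel flow magnitude
        motion_cue = (motion > motion_thresh).astype(float)
        score = motion_cue * objectness                   # simple multiplicative fusion
        return score > 0.5

    h, w = 64, 64
    flow = np.zeros((h, w, 2))
    flow[20:40, 20:40] = [3.0, 0.0]                       # a moving patch
    objectness = np.full((h, w), 0.9)
    print(moving_object_mask(flow, objectness).sum())     # -> 400 moving pixels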
Multigrid Predictive Filter Flow for Unsupervised Learning on Videos
We introduce multigrid Predictive Filter Flow (mgPFF), a framework for
unsupervised learning on videos. The mgPFF takes as input a pair of frames and
outputs per-pixel filters to warp one frame to the other. Compared to optical
flow used for warping frames, mgPFF is more powerful in modeling sub-pixel
movement and dealing with corruption (e.g., motion blur). We develop a
multigrid coarse-to-fine modeling strategy that avoids the requirement of
learning large filters to capture large displacement. This allows us to train
an extremely compact model (4.6MB) which operates in a progressive way over
multiple resolutions with shared weights. We train mgPFF on unsupervised,
free-form videos and show that mgPFF not only estimates long-range flow for
frame reconstruction and detects video shot transitions, but is also readily
amenable to video object segmentation and pose tracking, where it
substantially outperforms the published state of the art without bells and
whistles. Moreover, owing to mgPFF's per-pixel filter prediction, we have the
unique opportunity to visualize how each pixel evolves while solving these
tasks, thus gaining better interpretability.
Comment: webpage (https://www.ics.uci.edu/~skong2/mgpff.html)
Spatio-temporal Video Parsing for Abnormality Detection
Abnormality detection in video poses particular challenges because the class
of all irregular objects and behaviors is effectively unbounded. Thus, no (or
far too few) abnormal training samples are available, and abnormalities must
be found in test data without knowing in advance what they are. Nevertheless,
the prevailing approach in the field is to search directly for individual
abnormal local patches or image regions independently of one another. To
address this problem, we propose a method for joint detection of abnormalities
in videos by spatio-temporal video parsing. The goal of video parsing is to
find a set of indispensable normal spatio-temporal object hypotheses that
jointly explain all the foreground of a video, while, at the same time, being
supported by normal training samples. Consequently, rather than detecting
abnormalities directly, we discover them indirectly as those hypotheses that
are needed to cover the foreground yet cannot themselves be explained by
normal samples. Abnormalities are localized by MAP inference in a graphical
model, which we solve efficiently by formulating it as a convex optimization
problem. We experimentally evaluate our approach on several challenging
benchmark sets, improving over the state of the art on all standard benchmarks
in both abnormality classification and localization.
Comment: 15 pages, 12 figures, 3 tables
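The covering idea can be illustrated with a greedy stand-in for the paper's convex MAP inference (masks and hypotheses below are toy data): normal hypotheses are selected until they explain as much foreground as possible, and whatever remains uncovered is flagged as abnormal.

    import numpy as np

    def parse_video(foreground, hypotheses, max_hyp=10):
        # Greedy stand-in for the convex optimization: pick normal hypotheses
        # that explain the most foreground; uncovered foreground is abnormal.
        covered = np.zeros_like(foreground, dtype=bool)
        for _ in range(max_hyp):
            gains = [np.logical_and(h, foreground & ~covered).sum()
                     for h in hypotheses]
            best = int(np.argmax(gains))
            if gains[best] == 0:
                break
            covered |= hypotheses[best]
        return foreground & ~covered          # abnormal pixels

    fg = np.zeros((8, 8), dtype=bool); fg[1:7, 1:7] = True
    hyps = [np.zeros((8, 8), dtype=bool) for _ in range(2)]
    hyps[0][1:7, 1:4] = True     # normal hypothesis: left half
    hyps[1][1:4, 4:7] = True     # normal hypothesis: upper right
    print(parse_video(fg, hyps).astype(int))  # lower-right block flagged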
Saliency-Guided Perceptual Grouping Using Motion Cues in Region-Based Artificial Visual Attention
Region-based artificial attention constitutes a framework for bio-inspired
attentional processes on an intermediate abstraction level for the use in
computer vision and mobile robotics. Segmentation algorithms produce regions of
coherently colored pixels. These serve as proto-objects on which the
attentional processes determine image portions of relevance. A single
region---which does not necessarily represent a full object---constitutes the focus
of attention. For many post-attentional tasks, however, such as identifying or
tracking objects, single segments are not sufficient. Here, we present a
saliency-guided approach that groups regions that potentially belong to the
same object based on proximity and similarity of motion. We compare our results
to object selection by thresholding saliency maps and to a further
attention-guided strategy.
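A minimal sketch of the grouping step under assumed thresholds (the region fields, distances, and thresholds are illustrative; the paper's grouping is additionally saliency-guided): regions are merged when their centroids are close and their mean optical-flow vectors are similar.

    import numpy as np

    class Region:
        def __init__(self, centroid, mean_flow):
            self.centroid = np.asarray(centroid, dtype=float)
            self.mean_flow = np.asarray(mean_flow, dtype=float)

    def group_regions(regions, dist_thresh=20.0, flow_thresh=1.0):
        # Union-find grouping: merge regions that are close and move alike.
        parent = list(range(len(regions)))
        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i
        for i, a in enumerate(regions):
            for j, b in enumerate(regions[i + 1:], start=i + 1):
                near = np.linalg.norm(a.centroid - b.centroid) < dist_thresh
                alike = np.linalg.norm(a.mean_flow - b.mean_flow) < flow_thresh
                if near and alike:
                    parent[find(i)] = find(j)
        return [find(i) for i in range(len(regions))]

    regs = [Region((0, 0), (2, 0)), Region((10, 0), (2, 0)),
            Region((12, 5), (-1, 0))]
    print(group_regions(regs))   # first two share a group; third stays alone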
Factors in Finetuning Deep Model for Object Detection
Finetuning from a pretrained deep model is found to yield state-of-the-art
performance for many vision tasks. This paper investigates many factors that
influence the performance in finetuning for object detection. There is a
long-tailed distribution of sample numbers across classes in object detection.
Our analysis and empirical results show that classes with more samples have a
higher impact on feature learning, and that it is better to make the sample
number more uniform across classes. Generic object detection can be considered
as multiple equally important tasks, one per class, and these classes/tasks
differ in their discriminative visual appearance representations. Taking this
individuality into account, we cluster objects into visually similar class
groups and learn deep representations for these groups separately. We propose
a hierarchical feature learning scheme in which knowledge from a group with a
large number of classes is transferred to feature learning in its sub-groups.
When finetuning the GoogLeNet model, our approach yields a 4.7% absolute mAP
improvement on the ImageNet object detection dataset with little additional
computational cost at the testing stage.
Comment: CVPR2016 camera-ready version. Our ImageNet Large Scale Visual
Recognition Challenge (ILSVRC15) object detection results (ranked 3rd for
provided data and 2nd for external data) are based on this method. Code
available later at
http://www.ee.cuhk.edu.hk/~wlouyang/projects/ImageNetFactors/CVPR16.htm
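One plausible way to form the visually similar class groups, clustering per-class mean deep features with k-means, is sketched below; the features are random toy data and the clustering routine is our own stand-in, not the paper's exact procedure:

    import numpy as np

    def cluster_classes(class_means, n_groups, iters=20, seed=0):
        # Plain k-means over per-class mean deep features: one way to form
        # visually similar class groups for group-wise finetuning.
        rng = np.random.default_rng(seed)
        centers = class_means[rng.choice(len(class_means), n_groups,
                                         replace=False)]
        for _ in range(iters):
            d = np.linalg.norm(class_means[:, None] - centers[None], axis=-1)
            assign = d.argmin(axis=1)
            for g in range(n_groups):
                if (assign == g).any():
                    centers[g] = class_means[assign == g].mean(axis=0)
        return assign

    feats = np.vstack([np.random.default_rng(1).normal(g, 0.1, size=(10, 16))
                       for g in range(3)])   # 30 toy "classes" in 3 groups
    print(cluster_classes(feats, 3))

Each resulting group would then get its own finetuned representation, with the largest group's features transferred to initialize its sub-groups.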
3D Object Discovery and Modeling Using Single RGB-D Images Containing Multiple Object Instances
Unsupervised object modeling is important in robotics, especially for
handling a large set of objects. We present a method for unsupervised 3D object
discovery, reconstruction, and localization that exploits multiple instances of
an identical object contained in a single RGB-D image. The proposed method does
not rely on segmentation, scene knowledge, or user input, and thus is easily
scalable. Our method aims to find recurrent patterns in a single RGB-D image by
utilizing appearance and geometry of the salient regions. We extract keypoints
and match them in pairs based on their descriptors. We then generate triplets
of mutually matching keypoints, applying several geometric criteria to
minimize false matches. The relative poses of the matched triplets are computed
and clustered to discover sets of triplet pairs with similar relative poses.
Triplets belonging to the same set are likely to belong to the same object and
are used to construct an initial object model. Detecting the remaining
instances with the initial object model using RANSAC allows us to further
expand and refine
the model. The automatically generated object models are both compact and
descriptive. We show quantitative and qualitative results on RGB-D images with
various objects including some from the Amazon Picking Challenge. We also
demonstrate the use of our method in an object picking scenario with a robotic
arm.
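The pairing step at the start of this pipeline can be sketched as mutual nearest-neighbour descriptor matching (toy descriptors below; triplet formation and relative-pose clustering would follow):

    import numpy as np

    def mutual_matches(desc_a, desc_b):
        # Mutual nearest-neighbour matching of keypoint descriptors:
        # the pairing step before triplets are formed and pose-clustered.
        d = np.linalg.norm(desc_a[:, None] - desc_b[None], axis=-1)
        ab = d.argmin(axis=1)                 # best b for each a
        ba = d.argmin(axis=0)                 # best a for each b
        return [(i, j) for i, j in enumerate(ab) if ba[j] == i]

    rng = np.random.default_rng(0)
    a = rng.normal(size=(6, 8))
    b = a[::-1] + rng.normal(scale=0.01, size=(6, 8))   # permuted + noise
    print(mutual_matches(a, b))               # recovers the reversed pairing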
Towards Storytelling from Visual Lifelogging: An Overview
Visual lifelogging consists of acquiring images that capture the user's daily
experiences through a camera worn over a long period of time. The
pictures taken offer considerable potential for knowledge mining concerning how
people live their lives, hence, they open up new opportunities for many
potential applications in fields including healthcare, security, leisure and
the quantified self. However, automatically building a story from a huge
collection of unstructured egocentric data presents major challenges. This
paper provides a thorough review of advances made so far in egocentric data
analysis, and in view of the current state of the art, indicates new lines of
research to move us towards storytelling from visual lifelogging.
Comment: 16 pages, 11 figures, submitted to IEEE Transactions on Human-Machine
Systems
Robust event-stream pattern tracking based on correlative filter
Object tracking with a retina-inspired, event-based dynamic vision sensor
(DVS) is challenging due to noise events, rapid changes in event-stream shape,
cluttered complex background textures, and occlusion. To address these
challenges, this paper presents a robust event-stream pattern tracking method
based on a correlative filter mechanism. In the proposed method, rate coding is
used to encode the event-stream object in each segment. Feature representations
from hierarchical convolutional layers of a deep convolutional neural network
(CNN) are used to represent the appearance of the rate-encoded event-stream
object. The results show that our method not only achieves good tracking
performance in many complicated scenes with noise events, complex background
textures, occlusion, and intersecting trajectories, but is also robust to
scale changes, pose changes, and non-rigid deformations. In addition, this
correlative-filter-based event-stream tracker runs at high speed. The proposed
approach should promote applications of event-based vision sensors in
self-driving vehicles, robotics, and other high-speed scenarios.
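Rate coding itself is simple to illustrate: count the events each pixel fires within a time segment and normalise, producing a frame-like tensor that a conventional CNN can consume. The event format below (timestamp, x, y, polarity) is an assumption for illustration:

    import numpy as np

    def rate_code(events, shape, t0, t1):
        # Rate coding: per-pixel event counts within [t0, t1), normalised,
        # yielding a frame-like input for a conventional CNN.
        frame = np.zeros(shape, dtype=float)
        for t, x, y, _polarity in events:
            if t0 <= t < t1:
                frame[y, x] += 1.0
        if frame.max() > 0:
            frame /= frame.max()              # firing rates scaled to [0, 1]
        return frame

    events = [(0.001, 3, 4, 1), (0.002, 3, 4, -1), (0.050, 7, 2, 1)]
    print(rate_code(events, (10, 10), 0.0, 0.01))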