Background Subtraction in Real Applications: Challenges, Current Models and Future Directions
Computer vision applications based on videos often require the detection of moving objects as their first step. Background subtraction is then applied to separate the background from the foreground. Background subtraction is among the most investigated topics in computer vision, with a large body of publications, most of which apply mathematical and machine learning models to gain robustness to the challenges met in videos. The ultimate goal, however, is for the background subtraction methods developed in research to be employed in real applications such as traffic surveillance. Looking at the literature, there is often a gap between the methods used in real applications and the methods studied in fundamental research. In addition, the videos in large-scale evaluation datasets are not exhaustive: they cover only part of the spectrum of challenges met in real applications. In this context, we attempt to provide as exhaustive a survey as possible of real applications that use background subtraction, in order to identify the challenges met in practice, survey the background models currently used, and provide future directions. Challenges are investigated in terms of camera, foreground objects, and environments. In addition, we identify the background models that are effectively used in these applications, in order to find recent background models that are potentially usable in terms of robustness, time, and memory requirements.

Comment: Submitted to Computer Science Review
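For readers new to the topic, a minimal running-average background model illustrates the basic subtraction pipeline such surveys discuss; the update rate `alpha` and the foreground `threshold` below are illustrative choices, not values from the paper:

```python
import numpy as np

def update_background(background, frame, alpha=0.05):
    """Exponential running average: B_t = (1 - alpha) * B_{t-1} + alpha * I_t."""
    return (1.0 - alpha) * background + alpha * frame.astype(np.float32)

def foreground_mask(background, frame, threshold=30.0):
    """Pixels far from the background model are labeled foreground."""
    diff = np.abs(frame.astype(np.float32) - background)
    return diff > threshold

# Usage: initialize with the first frame, then update per frame.
# background = first_frame.astype(np.float32)
# for frame in video_frames:
#     mask = foreground_mask(background, frame)
#     background = update_background(background, frame)
```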
FlightGoggles: A Modular Framework for Photorealistic Camera, Exteroceptive Sensor, and Dynamics Simulation
FlightGoggles is a photorealistic sensor simulator for perception-driven
robotic vehicles. The key contributions of FlightGoggles are twofold. First,
FlightGoggles provides photorealistic exteroceptive sensor simulation using
graphics assets generated with photogrammetry. Second, it provides the ability
to combine (i) synthetic exteroceptive measurements generated in silico in real
time and (ii) vehicle dynamics and proprioceptive measurements generated in
motio by vehicle(s) in a motion-capture facility. FlightGoggles is capable of
simulating a virtual-reality environment around autonomous vehicle(s). While a
vehicle is in flight in the FlightGoggles virtual reality environment,
exteroceptive sensors are rendered synthetically in real time while all complex
extrinsic dynamics are generated organically through the natural interactions
of the vehicle. The FlightGoggles framework allows researchers to
accelerate development by circumventing the need to estimate complex and
hard-to-model interactions such as aerodynamics, motor mechanics, battery
electrochemistry, and behavior of other agents. The ability to perform
vehicle-in-the-loop experiments with photorealistic exteroceptive sensor
simulation facilitates novel research directions involving, e.g., fast and
agile autonomous flight in obstacle-rich environments, safe human interaction,
and flexible sensor selection. FlightGoggles has been utilized as the main testbed for selecting the nine teams that will advance in the AlphaPilot autonomous drone
racing challenge. We survey approaches and results from the top AlphaPilot
teams, which may be of independent interest.

Comment: Initial version appeared at IROS 2019. Supplementary material can be found at https://flightgoggles.mit.edu. Revision includes a description of new FlightGoggles features, such as a photogrammetric model of the MIT Stata Center, new rendering settings, and a Python API.
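A hedged sketch of the vehicle-in-the-loop cycle described above; all class and method names are hypothetical placeholders, not the actual FlightGoggles API:

```python
class VehicleInTheLoopSim:
    """Illustrative only: combines real dynamics with synthetic perception."""

    def __init__(self, renderer, mocap, vehicle):
        self.renderer = renderer    # photorealistic renderer (in silico)
        self.mocap = mocap          # motion-capture facility (in motio)
        self.vehicle = vehicle      # physical vehicle flying in the facility

    def step(self):
        # Real dynamics: the pose comes from the physical vehicle, so
        # aerodynamics, motor mechanics, and battery effects are organic.
        pose = self.mocap.get_pose(self.vehicle.id)
        # Synthetic perception: exteroceptive sensors (e.g., cameras) are
        # rendered in real time at the measured pose.
        image = self.renderer.render_camera(pose)
        # Proprioception (e.g., IMU) comes from the real vehicle.
        imu = self.vehicle.read_imu()
        return image, imu, pose
```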
Robust event-stream pattern tracking based on correlative filter
Object tracking with a retina-inspired, event-based dynamic vision sensor (DVS) is challenging due to noise events, rapid changes in event-stream shape, cluttered and complex background textures, and occlusion. To address these challenges, this paper presents a robust event-stream pattern tracking method based on a correlative filter mechanism. In the proposed method, rate coding is used to encode the event-stream object in each segment. Feature representations from hierarchical convolutional layers of a deep convolutional neural network (CNN) represent the appearance of the rate-encoded event-stream object. The results show that our method not only achieves good tracking performance in many complicated scenes with noise events, complex background textures, occlusion, and intersecting trajectories, but is also robust to scale variation, pose variation, and non-rigid deformations. In addition, this correlative-filter-based event-stream tracking has the advantage of high speed. The proposed approach will promote potential applications of event-based vision sensors in autonomous driving, robotics, and many other high-speed scenarios.
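A minimal sketch of the two ingredients named in the abstract, rate coding and a correlative (MOSSE-style) filter; note the paper itself uses hierarchical CNN features rather than the raw count frames shown here:

```python
import numpy as np

def rate_code(events, shape, t0, t1):
    """Rate coding: accumulate per-pixel event counts over a time segment,
    turning the sparse event stream into a frame-like array."""
    frame = np.zeros(shape, dtype=np.float32)
    for x, y, t, polarity in events:   # events as (x, y, timestamp, polarity)
        if t0 <= t < t1:
            frame[y, x] += 1.0
    return frame

def train_filter(patch, desired_response, lam=1e-4):
    """Closed-form correlative filter in the Fourier domain (MOSSE-style):
    conj(H) = (G * conj(F)) / (F * conj(F) + lam)."""
    F = np.fft.fft2(patch)
    G = np.fft.fft2(desired_response)   # e.g., a Gaussian peak at the target
    return (G * np.conj(F)) / (F * np.conj(F) + lam)

def detect(H_conj, patch):
    """Correlate the filter with a new patch; the peak of the response map
    gives the target's translation."""
    response = np.real(np.fft.ifft2(H_conj * np.fft.fft2(patch)))
    dy, dx = np.unravel_index(np.argmax(response), response.shape)
    return dy, dx
```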
RGBD Datasets: Past, Present and Future
Since the launch of the Microsoft Kinect, scores of RGBD datasets have been
released. These have propelled advances in areas from reconstruction to gesture
recognition. In this paper we explore the field, reviewing datasets across
eight categories: semantics, object pose estimation, camera tracking, scene
reconstruction, object tracking, human actions, faces and identification. By
extracting relevant information in each category we help researchers to find
appropriate data for their needs, and we consider which datasets have succeeded
in driving computer vision forward and why.
Finally, we examine the future of RGBD datasets. We identify key areas which
are currently underexplored, and suggest that future directions may include
synthetic data and dense reconstructions of static and dynamic scenes.

Comment: 8 pages excluding references (CVPR style)
Facial Landmark Detection: a Literature Survey
The locations of the fiducial facial landmark points around facial components
and facial contour capture the rigid and non-rigid facial deformations due to
head movements and facial expressions. They are hence important for various
facial analysis tasks. Many facial landmark detection algorithms have been
developed to automatically detect those key points over the years, and in this
paper, we perform an extensive review of them. We classify the facial landmark
detection algorithms into three major categories: holistic methods, Constrained
Local Model (CLM) methods, and regression-based methods. They differ in the
ways to utilize the facial appearance and shape information. The holistic
methods explicitly build models to represent the global facial appearance and
shape information. The CLMs explicitly leverage the global shape model but
build the local appearance models. The regression-based methods implicitly
capture facial shape and appearance information. For algorithms within each
category, we discuss their underlying theories as well as their differences. We
also compare their performance on both controlled and in-the-wild benchmark
datasets, under varying facial expressions, head poses, and occlusion. Based on
the evaluations, we point out their respective strengths and weaknesses. There
is also a separate section to review the latest deep learning-based algorithms.
The survey also includes a listing of the benchmark databases and existing
software. Finally, we identify future research directions, including combining
methods in different categories to leverage their respective strengths to solve
landmark detection "in-the-wild".
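To make the regression-based category concrete, a schematic of generic cascaded shape regression; the feature extractors and linear regressors below are placeholders, not any specific algorithm from the survey:

```python
import numpy as np

def cascaded_regression(image, initial_shape, stages):
    """Generic cascaded shape regression, the pattern underlying many
    regression-based landmark detectors: each stage maps shape-indexed
    features to an additive shape update. `stages` is a list of
    (extract_features, R) pairs, where R is a learned linear regressor."""
    shape = initial_shape.copy()                  # (n_landmarks, 2) array
    for extract_features, R in stages:
        phi = extract_features(image, shape)      # features indexed by current shape
        shape = shape + (R @ phi).reshape(shape.shape)  # additive refinement
    return shape
```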
Towards a Robust Aerial Cinematography Platform: Localizing and Tracking Moving Targets in Unstructured Environments
The use of drones for aerial cinematography has revolutionized several
applications and industries that require live and dynamic camera viewpoints, such as entertainment, sports, and security. However, safely controlling a
drone while filming a moving target usually requires multiple expert human
operators; hence the need for an autonomous cinematographer. Current approaches
have severe real-life limitations such as requiring fully scripted scenes,
high-precision motion-capture systems or GPS tags to localize targets, and
prior maps of the environment to avoid obstacles and plan for occlusion.
In this work, we overcome such limitations and propose a complete system for
aerial cinematography that combines: (1) a vision-based algorithm for target
localization; (2) a real-time incremental 3D signed-distance map algorithm for
occlusion and safety computation; and (3) a real-time camera motion planner
that optimizes smoothness, collisions, occlusions and artistic guidelines. We
evaluate robustness and real-time performance in a series of field experiments
and simulations by tracking dynamic targets moving through unknown,
unstructured environments. Finally, we verify that despite removing previous
limitations, our system achieves state-of-the-art performance. Videos of the
system in action can be seen at https://youtu.be/ZE9MnCVmum
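An illustrative sketch of how a signed-distance map can drive the safety and occlusion terms mentioned in components (2) and (3); the cost forms and parameters are assumptions, not the paper's exact formulation:

```python
import numpy as np

def safety_cost(sdf, position, margin=1.0):
    """Penalize camera positions closer to obstacles than `margin`;
    sdf(p) returns the distance to the nearest obstacle (negative inside)."""
    d = sdf(position)
    return 0.0 if d >= margin else (margin - d) ** 2

def occlusion_cost(sdf, camera, target, n_samples=32):
    """Accumulate penetration of the camera-to-target ray into obstacles;
    zero means an unobstructed view of the actor."""
    cost = 0.0
    for s in np.linspace(0.0, 1.0, n_samples):
        p = (1.0 - s) * camera + s * target
        cost += max(0.0, -sdf(p))   # only points inside obstacles contribute
    return cost / n_samples
```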
Multi-modal Tracking for Object based SLAM
We present an on-line 3D visual object tracking framework for monocular
cameras by incorporating spatial knowledge and uncertainty from semantic
mapping along with high-frequency measurements from visual odometry. By tightly integrating vision and odometry, we increase the overall performance of object-based tracking for semantic mapping. We integrate the two data sources into a coherent framework through information-based fusion/arbitration. We demonstrate the framework in the context of OmniMapper[1] and present results on six challenging sequences over multiple objects, compared against data obtained from a motion-capture system. We achieve a mean per-frame tracking error of 0.23 m, a 9% relative error reduction over the state-of-the-art tracker.

Comment: Submitted to IROS 201
Goal-oriented Object Importance Estimation in On-road Driving Videos
We formulate a new problem, Object Importance Estimation (OIE), in on-road driving videos: road users are considered important objects if they influence the control decisions of the ego-vehicle's driver. The importance of a road user depends on both its visual dynamics in the driving scene, e.g., appearance, motion, and location, and the driving goal of the ego vehicle, e.g., the planned path. We propose a novel framework that incorporates both a visual model and a goal representation to conduct OIE. To evaluate our framework, we collect an on-road driving dataset at real-world traffic intersections and obtain human-labeled annotations of the important objects. Experimental results show that our goal-oriented method outperforms baselines, with the largest improvements in left-turn and right-turn scenarios. Furthermore, we explore the possibility of using object importance for driving control prediction and demonstrate that binary brake prediction can be improved with the information of object importance.
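A toy sketch of how per-object visual features and a goal representation might be fused into an importance score; the architecture, weights, and fusion form here are purely illustrative assumptions, not the paper's model:

```python
import numpy as np

def importance_scores(visual_feats, goal_feat, W_v, W_g, w):
    """Score each road user by embedding its visual features (appearance,
    motion, location) and a shared goal feature (e.g., planned path),
    then combining the two embeddings linearly."""
    goal_emb = np.tanh(W_g @ goal_feat)        # shared goal representation
    scores = []
    for v in visual_feats:                     # one feature vector per road user
        obj_emb = np.tanh(W_v @ v)             # per-object visual representation
        scores.append(float(w @ np.concatenate([obj_emb, goal_emb])))
    return scores
```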
Identifying Most Walkable Direction for Navigation in an Outdoor Environment
We present an approach for identifying the most walkable direction for
navigation using a hand-held camera. Our approach extracts semantically rich
contextual information from the scene using a custom encoder-decoder
architecture for semantic segmentation and models the spatial and temporal
behavior of objects in the scene using a spatio-temporal graph. The system
learns to minimize a cost function over the spatial and temporal object
attributes to identify the most walkable direction. We construct a new
annotated navigation dataset collected using a hand-held mobile camera in an
unconstrained outdoor environment, which includes challenging settings such as
highly dynamic scenes, occlusion between objects, and distortions. Our system
achieves an accuracy of 84% on predicting a safe direction. We also show that
our custom segmentation network is both fast and accurate, achieving mIoU (mean intersection over union) scores of 81 and 44.7 on the PASCAL VOC and the PASCAL Context datasets, respectively, while running at about 21 frames per second.
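A minimal sketch of the direction-selection step: minimizing a weighted cost over candidate headings; the cost terms and weights are placeholders for the learned spatial and temporal object attributes described above:

```python
import numpy as np

def most_walkable_direction(candidates, spatial_cost, temporal_cost,
                            w_spatial=1.0, w_temporal=1.0):
    """Pick the candidate heading minimizing a weighted sum of a spatial
    cost (e.g., free space from semantic segmentation) and a temporal cost
    (e.g., predicted object motion from the spatio-temporal graph)."""
    costs = [w_spatial * spatial_cost(d) + w_temporal * temporal_cost(d)
             for d in candidates]
    return candidates[int(np.argmin(costs))]
```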