6,586 research outputs found
Lucid Data Dreaming for Video Object Segmentation
Convolutional networks reach top quality in pixel-level video object
segmentation but require a large amount of training data (1k~100k) to deliver
such results. We propose a new training strategy which achieves
state-of-the-art results across three evaluation datasets while using 20x~1000x
less annotated data than competing methods. Our approach is suitable for both
single and multiple object segmentation. Instead of using large training sets
hoping to generalize across domains, we generate in-domain training data using
the provided annotation on the first frame of each video to synthesize ("lucid
dream") plausible future video frames. In-domain per-video training data allows
us to train high quality appearance- and motion-based models, as well as tune
the post-processing stage. This approach allows us to reach competitive results
even when training from only a single annotated frame, without ImageNet
pre-training. Our results indicate that using a larger training set is not
automatically better, and that for the video object segmentation task a smaller
training set that is closer to the target domain is more effective. This
changes the mindset regarding how many training samples and general
"objectness" knowledge are required for the video object segmentation task. Comment: Accepted in International Journal of Computer Vision (IJCV)
Multi-stream CNN based Video Semantic Segmentation for Automated Driving
The majority of semantic segmentation algorithms operate on a single frame even
in the case of videos. In this work, the goal is to exploit temporal
information within the algorithm model for leveraging motion cues and temporal
consistency. We propose two simple high-level architectures based on Recurrent
FCN (RFCN) and Multi-Stream FCN (MSFCN) networks. In the case of RFCN, a recurrent
network, namely an LSTM, is inserted between the encoder and decoder. MSFCN combines
the encoders of different frames into a fused encoder via 1x1 channel-wise
convolution. We use a ResNet50 network as the baseline encoder and construct
three networks, namely MSFCN of order 2 and 3 and RFCN of order 2. MSFCN-3
produces the best results with an accuracy improvement of 9% and 15% for
Highway and New York-like city scenarios in the SYNTHIA-CVPR'16 dataset using
the mean IoU metric. MSFCN-3 also produced gains of 11% and 6% for SegTrack V2 and DAVIS
datasets over the baseline FCN network. We also designed an efficient version
of MSFCN-2 and RFCN-2 using weight sharing among the two encoders. The
efficient MSFCN-2 provided an improvement of 11% and 5% for KITTI and SYNTHIA
with negligible increase in computational complexity compared to the baseline
version. Comment: Accepted for Oral Presentation at VISAPP 201
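The fusion step mentioned in this abstract (combining per-frame encoder outputs through a 1x1 convolution) can be illustrated with a minimal pure-Python sketch. The function name `fuse_1x1`, the channel-concatenation order, and the presence of a bias term are assumptions made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
def fuse_1x1(frame_features, weights, bias):
    """Fuse per-frame feature maps with a 1x1 convolution (illustrative sketch).

    frame_features: list of feature maps, one per frame, each shaped [C][H][W].
    weights:        [C_out][C_in_total], where C_in_total is the sum of all C.
    bias:           [C_out].
    Returns a fused feature map shaped [C_out][H][W].
    """
    # Assumption: channels of all frames are concatenated in frame order.
    stacked = [channel for fmap in frame_features for channel in fmap]
    height, width = len(stacked[0]), len(stacked[0][0])

    fused = []
    for out_idx, row in enumerate(weights):
        if len(row) != len(stacked):
            raise ValueError("weight row length must equal total input channels")
        plane = [
            [
                # A 1x1 convolution is a per-pixel weighted sum over channels.
                bias[out_idx] + sum(w * ch[y][x] for w, ch in zip(row, stacked))
                for x in range(width)
            ]
            for y in range(height)
        ]
        fused.append(plane)
    return fused


# Two frames with one 2x2 channel each, averaged into one output channel.
frame_a = [[[1.0, 2.0], [3.0, 4.0]]]
frame_b = [[[10.0, 20.0], [30.0, 40.0]]]
fused = fuse_1x1([frame_a, frame_b], weights=[[0.5, 0.5]], bias=[0.0])
```

Because the kernel is 1x1, the fusion mixes information across frames and channels only, never across neighbouring pixels.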
Multi-Scale 3D Scene Flow from Binocular Stereo Sequences
Scene flow methods estimate the three-dimensional motion field for points in the world, using multi-camera video data. Such methods combine multi-view reconstruction with motion estimation. This paper describes an alternative formulation for dense scene flow estimation that provides reliable results using only two cameras by fusing stereo and optical flow estimation into a single coherent framework. Internally, the proposed algorithm generates probability distributions for optical flow and disparity. Taking into account the uncertainty in the intermediate stages allows for more reliable estimation of the 3D scene flow than previous methods allow. To handle the aperture problems inherent in the estimation of optical flow and disparity, a multi-scale method along with a novel region-based technique is used within a regularized solution. This combined approach both preserves discontinuities and prevents over-regularization, two problems commonly associated with the basic multi-scale approaches. Experiments with synthetic and real test data demonstrate the strength of the proposed approach. National Science Foundation (CNS-0202067, IIS-0208876); Office of Naval Research (N00014-03-1-0108)
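To make the relation between optical flow, disparity, and 3D scene flow concrete, the sketch below shows only the deterministic geometry for a rectified two-camera rig. It does not reproduce the probabilistic, multi-scale method of the abstract. The function names, the pinhole parameters (`focal_px`, `baseline_m`, `cx`, `cy`), and the sample values are assumptions chosen for illustration.

```python
def triangulate(u, v, disparity, focal_px, baseline_m, cx=0.0, cy=0.0):
    """Back-project a pixel of a rectified stereo pair to a 3D point."""
    if disparity <= 0:
        raise ValueError("disparity must be positive")
    z = focal_px * baseline_m / disparity   # depth from disparity
    x = (u - cx) * z / focal_px
    y = (v - cy) * z / focal_px
    return (x, y, z)


def scene_flow(u, v, disparity, flow_u, flow_v, disparity_change,
               focal_px, baseline_m, cx=0.0, cy=0.0):
    """3D displacement of one point between two consecutive stereo frames.

    (flow_u, flow_v) is the optical flow at the pixel and disparity_change
    is the change in disparity of the same point over the same interval.
    """
    p0 = triangulate(u, v, disparity, focal_px, baseline_m, cx, cy)
    p1 = triangulate(u + flow_u, v + flow_v, disparity + disparity_change,
                     focal_px, baseline_m, cx, cy)
    return tuple(b - a for a, b in zip(p0, p1))


# Hypothetical camera: 500 px focal length, 0.1 m baseline.
motion = scene_flow(u=100.0, v=50.0, disparity=10.0,
                    flow_u=10.0, flow_v=0.0, disparity_change=2.5,
                    focal_px=500.0, baseline_m=0.1)
```

In this example the disparity grows from 10 to 12.5 pixels, so the point moves from 5 m to 4 m depth, which is why the third component of `motion` is negative.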
A framework based on Gaussian mixture models and Kalman filters for the segmentation and tracking of anomalous events in shipboard video
Anomalous indications in monitoring equipment on board U.S. Navy vessels must be handled in a timely manner to prevent catastrophic system failure. The development of sensor data analysis techniques to assist a ship's crew in monitoring machinery and summon required ship-to-shore assistance is of considerable benefit to the Navy. In addition, the Navy has a large interest in the development of distance support technology in its ongoing efforts to reduce manning on ships. In this thesis, algorithms have been developed for the detection of anomalous events that can be identified from the analysis of monochromatic stationary ship surveillance video streams. The specific anomalies that we have focused on are the presence and growth of smoke and fire events inside the frames of the video stream. The algorithm consists of the following steps. First, a foreground segmentation algorithm based on adaptive Gaussian mixture models is employed to detect the presence of motion in a scene. The algorithm is adapted to emphasize gray-level characteristics related to smoke and fire events in the frame. Next, shape discriminant features in the foreground are enhanced using morphological operations. Following this step, the anomalous indication is tracked between frames using Kalman filtering. Finally, gray level shape and motion features corresponding to the anomaly are subjected to principal component analysis and classified using a multilayer perceptron neural network. The algorithm is exercised on 68 video streams that include the presence of anomalous events (such as fire and smoke) and benign/nuisance events (such as humans walking through the field of view). Initial results show that the algorithm is successful in detecting anomalies in video streams, and is suitable for application in shipboard environments.
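The frame-to-frame tracking step described above can be illustrated with a minimal Kalman filter sketch. The abstract does not state the state model or noise settings, so everything here is an assumption: a constant-velocity model for a single centroid coordinate, the function name `kalman_track`, and the values of `q`, `r`, and the initial covariance.

```python
def kalman_track(measurements, dt=1.0, q=1e-3, r=1.0):
    """Track one coordinate of a region centroid with a constant-velocity
    Kalman filter. Returns a list of (position, velocity) estimates.

    q is the (assumed) process noise variance added to each state component,
    r is the (assumed) measurement noise variance.
    """
    pos, vel = 0.0, 0.0
    # 2x2 state covariance, stored element-wise; large values mean
    # the initial state is treated as very uncertain.
    p00, p01, p10, p11 = 100.0, 0.0, 0.0, 100.0
    estimates = []
    for z in measurements:
        # Predict: x <- F x, P <- F P F^T + Q, with F = [[1, dt], [0, 1]].
        pos = pos + dt * vel
        p00, p01, p10, p11 = (
            p00 + dt * (p01 + p10) + dt * dt * p11 + q,
            p01 + dt * p11,
            p10 + dt * p11,
            p11 + q,
        )
        # Update with a position-only measurement, H = [1, 0].
        innovation = z - pos
        s = p00 + r                      # innovation variance
        k0, k1 = p00 / s, p10 / s        # Kalman gain
        pos += k0 * innovation
        vel += k1 * innovation
        # P <- (I - K H) P
        p00, p01, p10, p11 = (
            (1.0 - k0) * p00,
            (1.0 - k0) * p01,
            p10 - k1 * p00,
            p11 - k1 * p01,
        )
        estimates.append((pos, vel))
    return estimates


# Hypothetical centroid drifting 2 pixels per frame for 20 frames.
track = kalman_track([2.0 * t for t in range(20)])
final_pos, final_vel = track[-1]
```

A full tracker would run one such filter per image coordinate (or a joint 2D state) and feed it the centroid of the segmented foreground region in each frame.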
Machine Understanding of Human Behavior
A widely accepted prediction is that computing will move to the background, weaving itself into the fabric of our everyday living spaces and projecting the human user into the foreground. If this prediction is to come true, then next generation computing, which we will call human computing, should be about anticipatory user interfaces that should be human-centered, built for humans based on human models. They should transcend the traditional keyboard and mouse to include natural, human-like interactive functions including understanding and emulating certain human behaviors such as affective and social signaling. This article discusses a number of components of human behavior, how they might be integrated into computers, and how far we are from realizing the front end of human computing, that is, how far we are from enabling computers to understand human behavior.
Machine Analysis of Facial Expressions
No abstract
Keeping track of worm trackers
C. elegans is used extensively as a model system in the neurosciences due to its well defined nervous system. However, the seeming simplicity of this nervous system in anatomical structure and neuronal connectivity, at least compared to higher animals, underlies a rich diversity of behaviors. The usefulness of the worm in genome-wide mutagenesis or RNAi screens, where thousands of strains are assessed for phenotype, emphasizes the need for computational methods for automated parameterization of generated behaviors. In addition, behaviors can be modulated upon external cues like temperature, O2 and CO2 concentrations, mechanosensory and chemosensory inputs. Different machine vision tools have been developed to aid researchers in their efforts to inventory and characterize defined behavioral "outputs". Here we aim at providing an overview of different worm-tracking packages or video analysis tools designed to quantify different aspects of locomotion such as the occurrence of directional changes (turns, omega bends), curvature of the sinusoidal shape (amplitude, body bend angles) and velocity (speed, backward or forward movement).
EV-IMO: Motion Segmentation Dataset and Learning Pipeline for Event Cameras
We present the first event-based learning approach for motion segmentation in
indoor scenes and the first event-based dataset - EV-IMO - which includes
accurate pixel-wise motion masks, egomotion and ground truth depth. Our
approach is based on an efficient implementation of the SfM learning pipeline
using a low parameter neural network architecture on event data. In addition to
camera egomotion and a dense depth map, the network estimates pixel-wise
independently moving object segmentation and computes per-object 3D
translational velocities for moving objects. We also train a shallow network
with just 40k parameters, which is able to compute depth and egomotion.
Our EV-IMO dataset features 32 minutes of indoor recording with up to 3 fast
moving objects simultaneously in the camera field of view. The objects and the
camera are tracked by the VICON motion capture system. By 3D scanning the room
and the objects, accurate depth map ground truth and pixel-wise object masks
are obtained, which are reliable even in poor lighting conditions and during
fast motion. We then train and evaluate our learning pipeline on EV-IMO and
demonstrate that our approach far surpasses its rivals and is well suited for
scene constrained robotics applications. Comment: 8 pages, 6 figures. Submitted to 2019 IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS 2019)