36,657 research outputs found
End-to-End Tracking and Semantic Segmentation Using Recurrent Neural Networks
In this work we present a novel end-to-end framework for tracking and
classifying a robot's surroundings in complex, dynamic and only partially
observable real-world environments. The approach deploys a recurrent neural
network to filter an input stream of raw laser measurements in order to
directly infer object locations, along with their identity in both visible and
occluded areas. To achieve this we first train the network using unsupervised
Deep Tracking, a recently proposed theoretical framework for end-to-end space
occupancy prediction. We show that by learning to track on a large amount of
unsupervised data, the network creates a rich internal representation of its
environment which we in turn exploit through the principle of inductive
transfer of knowledge to perform the task of it's semantic classification. As a
result, we show that only a small amount of labelled data suffices to steer the
network towards mastering this additional task. Furthermore we propose a novel
recurrent neural network architecture specifically tailored to tracking and
semantic classification in real-world robotics applications. We demonstrate the
tracking and classification performance of the method on real-world data
collected at a busy road junction. Our evaluation shows that the proposed
end-to-end framework compares favourably to a state-of-the-art, model-free
tracking solution and that it outperforms a conventional one-shot training
scheme for semantic classification
GUSOT: Green and Unsupervised Single Object Tracking for Long Video Sequences
Supervised and unsupervised deep trackers that rely on deep learning
technologies are popular in recent years. Yet, they demand high computational
complexity and a high memory cost. A green unsupervised single-object tracker,
called GUSOT, that aims at object tracking for long videos under a
resource-constrained environment is proposed in this work. Built upon a
baseline tracker, UHP-SOT++, which works well for short-term tracking, GUSOT
contains two additional new modules: 1) lost object recovery, and 2)
color-saliency-based shape proposal. They help resolve the tracking loss
problem and offer a more flexible object proposal, respectively. Thus, they
enable GUSOT to achieve higher tracking accuracy in the long run. We conduct
experiments on the large-scale dataset LaSOT with long video sequences, and
show that GUSOT offers a lightweight high-performance tracking solution that
finds applications in mobile and edge computing platforms
Clustering Learning for Robotic Vision
We present the clustering learning technique applied to multi-layer
feedforward deep neural networks. We show that this unsupervised learning
technique can compute network filters with only a few minutes and a much
reduced set of parameters. The goal of this paper is to promote the technique
for general-purpose robotic vision systems. We report its use in static image
datasets and object tracking datasets. We show that networks trained with
clustering learning can outperform large networks trained for many hours on
complex datasets.Comment: Code for this paper is available here:
https://github.com/culurciello/CL_paper1_cod
Unsupervised Learning of Visual Representations using Videos
Is strong supervision necessary for learning a good visual representation? Do
we really need millions of semantically-labeled images to train a Convolutional
Neural Network (CNN)? In this paper, we present a simple yet surprisingly
powerful approach for unsupervised learning of CNN. Specifically, we use
hundreds of thousands of unlabeled videos from the web to learn visual
representations. Our key idea is that visual tracking provides the supervision.
That is, two patches connected by a track should have similar visual
representation in deep feature space since they probably belong to the same
object or object part. We design a Siamese-triplet network with a ranking loss
function to train this CNN representation. Without using a single image from
ImageNet, just using 100K unlabeled videos and the VOC 2012 dataset, we train
an ensemble of unsupervised networks that achieves 52% mAP (no bounding box
regression). This performance comes tantalizingly close to its
ImageNet-supervised counterpart, an ensemble which achieves a mAP of 54.4%. We
also show that our unsupervised network can perform competitively in other
tasks such as surface-normal estimation
Memory Based Online Learning of Deep Representations from Video Streams
We present a novel online unsupervised method for face identity learning from
video streams. The method exploits deep face descriptors together with a memory
based learning mechanism that takes advantage of the temporal coherence of
visual data. Specifically, we introduce a discriminative feature matching
solution based on Reverse Nearest Neighbour and a feature forgetting strategy
that detect redundant features and discard them appropriately while time
progresses. It is shown that the proposed learning procedure is asymptotically
stable and can be effectively used in relevant applications like multiple face
identification and tracking from unconstrained video streams. Experimental
results show that the proposed method achieves comparable results in the task
of multiple face tracking and better performance in face identification with
offline approaches exploiting future information. Code will be publicly
available.Comment: arXiv admin note: text overlap with arXiv:1708.0361
- …