Mobile Video Object Detection with Temporally-Aware Feature Maps
This paper introduces an online model for object detection in videos designed
to run in real-time on low-powered mobile and embedded devices. Our approach
combines fast single-image object detection with convolutional long short-term
memory (LSTM) layers to create an interweaved recurrent-convolutional
architecture. Additionally, we propose an efficient Bottleneck-LSTM layer that
significantly reduces computational cost compared to regular LSTMs. Our network
achieves temporal awareness by using Bottleneck-LSTMs to refine and propagate
feature maps across frames. This approach is substantially faster than existing
detection methods in video, outperforming the fastest single-frame models in
model size and computational cost while attaining accuracy comparable to much
more expensive single-frame models on the ImageNet VID 2015 dataset. Our model
reaches a real-time inference speed of up to 15 FPS on a mobile CPU.
Comment: In CVPR 201
Deep Lidar CNN to Understand the Dynamics of Moving Vehicles
Perception technologies in Autonomous Driving are experiencing their golden
age due to the advances in Deep Learning. Yet, most of these systems rely on
the semantically rich information of RGB images. Deep Learning solutions
applied to the data of other sensors typically mounted on autonomous cars (e.g.
lidars or radars) remain largely unexplored. In this paper, we propose a novel
solution to understand the dynamics of moving vehicles in the scene using only
lidar information. The main challenge of this problem stems from the fact that
we need to disambiguate the proprio-motion of the 'observer' vehicle from that
of the external 'observed' vehicles. For this purpose, we devise a CNN
architecture which at testing time is fed with pairs of consecutive lidar
scans. However, in order to properly learn the parameters of this network,
during training we introduce a series of so-called pretext tasks which also
leverage image data. These tasks include semantic information about
vehicleness and a novel lidar-flow feature which combines standard image-based
optical flow with lidar scans. We obtain very promising results and show that
including distilled image information only during training improves
the inference results of the network at test time, even when image data is no
longer used.
Comment: Presented in IEEE ICRA 2018. IEEE Copyrights: Personal use of this
material is permitted. Permission from IEEE must be obtained for all other
uses. (V2 just corrected comments on arxiv submission
Massively Parallel Video Networks
We introduce a class of causal video understanding models that aims to
improve efficiency of video processing by maximising throughput, minimising
latency, and reducing the number of clock cycles. Leveraging operation
pipelining and multi-rate clocks, these models perform a minimal amount of
computation (e.g. as few as four convolutional layers) for each frame per
timestep to produce an output. The models are still very deep, with dozens of
such operations being performed but in a pipelined fashion that enables
depth-parallel computation. We illustrate the proposed principles by applying
them to existing image architectures and analyse their behaviour on two video
tasks: action recognition and human keypoint localisation. The results show
that a significant degree of parallelism, and implicitly speedup, can be
achieved with little loss in performance.
Comment: Fixed typos in densenet model definition in appendi
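
The depth-parallel pipelining idea in the abstract above can be illustrated with a toy sketch. This is not the authors' implementation: `run_pipelined` and the three arithmetic "stages" are hypothetical stand-ins for groups of network layers, and multi-rate clocks are not modelled. At every timestep each stage consumes what the previous stage produced one timestep earlier, so all stages could run concurrently and the sequential work per frame is a single stage, at the cost of a latency of `len(stages) - 1` frames.

```python
from typing import Callable, List, Optional


def run_pipelined(
    frames: List[float],
    stages: List[Callable[[float], float]],
) -> List[Optional[float]]:
    """Toy depth-parallel pipeline (illustrative assumption, not the paper's code).

    buffers[i] holds the output stage i produced at the previous timestep.
    """
    buffers: List[Optional[float]] = [None] * len(stages)
    outputs: List[Optional[float]] = []
    for frame in frames:
        # Stage 0 reads the new frame; stage i reads stage i-1's previous output.
        inputs = [frame] + buffers[:-1]
        # All stages depend only on last timestep's buffers, so in a real
        # system they could execute in parallel.
        buffers = [
            stage(x) if x is not None else None
            for stage, x in zip(stages, inputs)
        ]
        outputs.append(buffers[-1])  # None until the pipeline has filled
    return outputs


# Hypothetical stages standing in for blocks of layers.
stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
outputs = run_pipelined([10, 20, 30, 40], stages)
```

With three stages, the first two outputs are empty while the pipeline fills; the third output is the fully processed first frame.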
End-to-End Learning of Representations for Asynchronous Event-Based Data
Event cameras are vision sensors that record asynchronous streams of
per-pixel brightness changes, referred to as "events". They have appealing
advantages over frame-based cameras for computer vision, including high
temporal resolution, high dynamic range, and no motion blur. Due to the sparse,
non-uniform spatiotemporal layout of the event signal, pattern recognition
algorithms typically aggregate events into a grid-based representation and
subsequently process it with a standard vision pipeline, e.g., a Convolutional
Neural Network (CNN). In this work, we introduce a general framework to convert
event streams into grid-based representations through a sequence of
differentiable operations. Our framework comes with two main advantages: (i)
it allows learning the input event representation together with the task-dedicated
network in an end-to-end manner, and (ii) it lays out a taxonomy that unifies the
majority of extant event representations in the literature and identifies novel
ones. Empirically, we show that our approach to learning the event
representation end-to-end yields an improvement of approximately 12% on optical
flow estimation and object recognition over state-of-the-art methods.
Comment: To appear at ICCV 201
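
As a rough illustration of the grid-based aggregation step the abstract above describes, the sketch below bins events into a fixed, hand-chosen grid. It is an assumption-laden example, not the paper's method: the paper learns the representation through differentiable operations, whereas this uses hard nearest-bin assignment. The `Event` layout `(x, y, timestamp, polarity)`, the function name `events_to_grid`, and the use of integer microsecond timestamps are all hypothetical choices.

```python
from typing import List, Tuple

# Assumed event layout: (x, y, timestamp, polarity), polarity in {-1, +1}.
Event = Tuple[int, int, float, int]


def events_to_grid(
    events: List[Event], width: int, height: int, num_bins: int
) -> List[List[List[float]]]:
    """Accumulate signed event polarities into a [num_bins][height][width] grid.

    Each event is assigned to one temporal bin (hard assignment); a learned
    or interpolating kernel would replace this step in a trainable pipeline.
    """
    grid = [
        [[0.0 for _ in range(width)] for _ in range(height)]
        for _ in range(num_bins)
    ]
    if not events:
        return grid
    t_min = min(e[2] for e in events)
    t_max = max(e[2] for e in events)
    span = t_max - t_min
    for x, y, t, polarity in events:
        if span > 0:
            # Normalise time to [0, num_bins] and clamp the last event into range.
            b = min(int((t - t_min) / span * num_bins), num_bins - 1)
        else:
            b = 0  # all events share one timestamp
        grid[b][y][x] += polarity
    return grid


# Four made-up events on a 2x2 sensor, timestamps in microseconds.
events = [(0, 0, 0, +1), (1, 0, 10, -1), (0, 0, 20, +1), (1, 1, 40, +1)]
grid = events_to_grid(events, width=2, height=2, num_bins=2)
```

The resulting grid can be fed to a standard image network as a multi-channel input, which is the "grid-based representation" role described in the abstract.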