Motion tubes for the representation of images sequences
In this paper, we introduce a novel way to represent an image sequence, which naturally exhibits the temporal persistence of the textures. Standardized representations have been thoroughly optimized, and getting significant improvements has become more and more difficult. As an alternative, Analysis-Synthesis (AS) coders have focused on the use of texture within a video coder. We introduce here a new AS representation of image sequences that remains close to the classic block-based representation. By tracking textures throughout the sequence, we propose to reconstruct it from a set of moving textures which we call motion tubes. A new motion model is then proposed, which allows for motion field continuities and discontinuities, by hybridizing Block Matching and a low-computational mesh-based representation. Finally, we propose a bi-predictional framework for motion tubes management.
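The Block Matching half of the motion model mentioned above can be illustrated with a minimal sketch. It is not the authors' hybrid model: it shows only plain exhaustive block matching with a sum-of-absolute-differences cost, and the function names, block size, and search radius are assumptions chosen for illustration.

```python
def sad(prev, curr, top, left, dy, dx, size):
    """Sum of absolute differences between a block of `curr` at (top, left)
    and the block of `prev` displaced by (dy, dx)."""
    return sum(
        abs(curr[top + r][left + c] - prev[top + dy + r][left + dx + c])
        for r in range(size)
        for c in range(size)
    )


def block_match(prev, curr, top, left, size=4, radius=2):
    """Exhaustively test every displacement within `radius` and return the
    one with the lowest SAD, together with that cost.

    Frames are lists of rows of numeric pixel values (an assumption made to
    keep the sketch free of dependencies).
    """
    height, width = len(prev), len(prev[0])
    best, best_cost = (0, 0), float("inf")
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + size > height or x + size > width:
                continue  # candidate block falls outside the reference frame
            cost = sad(prev, curr, top, left, dy, dx, size)
            if cost < best_cost:
                best, best_cost = (dy, dx), cost
    return best, best_cost
```

Following the returned displacement from frame to frame for the same block is one simple way to picture a texture being tracked through the sequence.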
Action tube extraction based 3D-CNN for RGB-D action recognition
In this paper we propose a novel action tube extractor for RGB-D action recognition in trimmed videos. The action tube extractor takes as input a video and outputs an action tube. The method consists of two parts: spatial tube extraction and temporal sampling. The first part is built upon MobileNet-SSD and its role is to define the spatial region where the action takes place. The second part is based on the structural similarity index (SSIM) and is designed to remove frames without obvious motion from the primary action tube. The final extracted action tube has two benefits: 1) a higher ratio of ROI (subjects of action) to background; 2) most frames contain obvious motion change. We propose to use a two-stream (RGB and Depth) I3D architecture as our 3D-CNN model. Our approach outperforms the state-of-the-art methods on the OA and NTU RGB-D datasets. © 2018 IEEE. Peer reviewed. Postprint (published version).
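The SSIM-based temporal sampling step can be sketched as follows. The abstract does not give the window size, the threshold, or which frames are compared, so this sketch assumes a single global SSIM window per frame, a hypothetical threshold of 0.95, and comparison against the most recently kept frame.

```python
def global_ssim(x, y, dynamic_range=255.0):
    """SSIM computed with the whole frame as one window.

    `x` and `y` are flat lists of pixel values of equal length. Practical
    SSIM implementations usually average over local windows; the single
    window here is a simplification.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * dynamic_range) ** 2  # stabilising constants from the SSIM definition
    c2 = (0.03 * dynamic_range) ** 2
    return ((2 * mean_x * mean_y + c1) * (2 * cov + c2)) / (
        (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)
    )


def drop_static_frames(frames, threshold=0.95):
    """Return indices of frames to keep.

    The first frame is always kept. A later frame is kept only if its SSIM
    to the last kept frame is below `threshold`, i.e. it differs visibly.
    """
    kept = [0]
    for i in range(1, len(frames)):
        if global_ssim(frames[kept[-1]], frames[i]) < threshold:
            kept.append(i)
    return kept
```

Comparing against the last kept frame rather than the immediately preceding one prevents slow drift from being discarded one near-identical frame at a time; that is a design choice of this sketch, not something the abstract states.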
Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos
In this work, we propose an approach to the spatiotemporal localisation
(detection) and classification of multiple concurrent actions within temporally
untrimmed videos. Our framework is composed of three stages. In stage 1,
appearance and motion detection networks are employed to localise and score
actions from colour images and optical flow. In stage 2, the appearance network
detections are boosted by combining them with the motion detection scores, in
proportion to their respective spatial overlap. In stage 3, sequences of
detection boxes most likely to be associated with a single action instance,
called action tubes, are constructed by solving two energy maximisation
problems via dynamic programming. While in the first pass, action paths
spanning the whole video are built by linking detection boxes over time using
their class-specific scores and their spatial overlap, in the second pass,
temporal trimming is performed by ensuring label consistency for all
constituting detection boxes. We demonstrate the performance of our algorithm
on the challenging UCF101, J-HMDB-21 and LIRIS-HARL datasets, achieving new
state-of-the-art results across the board and significantly increasing
detection speed at test time. We achieve a huge leap forward in action
detection performance and report a 20% and 11% gain in mAP (mean average
precision) on UCF-101 and J-HMDB-21 datasets respectively when compared to the
state-of-the-art. Comment: Accepted by British Machine Vision Conference 201
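The first pass of stage 3, linking detection boxes over time by their scores and spatial overlap, can be sketched as a Viterbi-style dynamic programme. This is an illustration under assumptions, not the paper's implementation: the energy is taken to be the sum of detection scores plus a weighted sum of IoU between consecutive boxes, every frame is assumed to hold at least one detection, and `overlap_weight` is a hypothetical parameter. The second pass (temporal trimming) is not shown.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def link_detections(frames, overlap_weight=1.0):
    """Pick one detection per frame so that the path energy is maximal.

    `frames[t]` is a list of (box, score) pairs for one action class.
    Returns a list with the index of the chosen detection in each frame.
    """
    best = [score for _, score in frames[0]]  # best energy of a path ending at each box
    back = []                                 # back-pointers, one list per transition
    for t in range(1, len(frames)):
        new_best, pointers = [], []
        for box, score in frames[t]:
            candidates = [
                best[j] + overlap_weight * iou(prev_box, box)
                for j, (prev_box, _) in enumerate(frames[t - 1])
            ]
            j_star = max(range(len(candidates)), key=candidates.__getitem__)
            new_best.append(candidates[j_star] + score)
            pointers.append(j_star)
        best = new_best
        back.append(pointers)
    # Trace the best path back from the final frame.
    idx = max(range(len(best)), key=best.__getitem__)
    path = [idx]
    for pointers in reversed(back):
        idx = pointers[idx]
        path.append(idx)
    return path[::-1]
```

Because the overlap term rewards spatial continuity, the path can stay on the same actor even in a frame where another detection has a higher score.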
Am I Done? Predicting Action Progress in Videos
In this paper we deal with the problem of predicting action progress in
videos. We argue that this is an extremely important task since it can be
valuable for a wide range of interaction applications. To this end we introduce
a novel approach, named ProgressNet, capable of predicting when an action takes
place in a video, where it is located within the frames, and how far it has
progressed during its execution. To provide a general definition of action
progress, we ground our work in the linguistics literature, borrowing terms and
concepts to understand which actions can be the subject of progress estimation.
As a result, we define a categorization of actions and their phases. Motivated
by the recent success obtained from the interaction of Convolutional and
Recurrent Neural Networks, our model is based on a combination of the Faster
R-CNN framework, to make frame-wise predictions, and LSTM networks, to estimate
action progress through time. After introducing two evaluation protocols for
the task at hand, we demonstrate the capability of our model to effectively
predict action progress on the UCF-101 and J-HMDB datasets.
Generalized Rank Pooling for Activity Recognition
Most popular deep models for action recognition split video sequences into
short sub-sequences consisting of a few frames; frame-based features are then
pooled for recognizing the activity. Usually, this pooling step discards the
temporal order of the frames, which could otherwise be used for better
recognition. Towards this end, we propose a novel pooling method, generalized
rank pooling (GRP), that takes as input, features from the intermediate layers
of a CNN that is trained on tiny sub-sequences, and produces as output the
parameters of a subspace which (i) provides a low-rank approximation to the
features and (ii) preserves their temporal order. We propose to use these
parameters as a compact representation for the video sequence, which is then
used in a classification setup. We formulate an objective for computing this
subspace as a Riemannian optimization problem on the Grassmann manifold, and
propose an efficient conjugate gradient scheme for solving it. Experiments on
several activity recognition datasets show that our scheme leads to
state-of-the-art performance. Comment: Accepted at IEEE International Conference on Computer Vision and
Pattern Recognition (CVPR), 201
Detect to Track and Track to Detect
Recent approaches for high accuracy detection and tracking of object
categories in video consist of complex multistage solutions that become more
cumbersome each year. In this paper we propose a ConvNet architecture that
jointly performs detection and tracking, solving the task in a simple and
effective way. Our contributions are threefold: (i) we set up a ConvNet
architecture for simultaneous detection and tracking, using a multi-task
objective for frame-based object detection and across-frame track regression;
(ii) we introduce correlation features that represent object co-occurrences
across time to aid the ConvNet during tracking; and (iii) we link the frame
level detections based on our across-frame tracklets to produce high accuracy
detections at the video level. Our ConvNet architecture for spatiotemporal
object detection is evaluated on the large-scale ImageNet VID dataset where it
achieves state-of-the-art results. Our approach provides better single model
performance than the winning method of the last ImageNet challenge while being
conceptually much simpler. Finally, we show that by increasing the temporal
stride we can dramatically increase the tracker speed. Comment: ICCV 2017. Code and models:
https://github.com/feichtenhofer/Detect-Track Results:
https://www.robots.ox.ac.uk/~vgg/research/detect-track
A LabVIEW® based generic CT scanner control software platform
UGCT, the Centre for X-ray tomography at Ghent University (Belgium), does research on X-ray tomography and its applications. This includes the development and construction of state-of-the-art CT scanners for scientific research. Because these scanners are built for very different purposes, they differ considerably in their physical implementations. However, they all share common principal functionality. In this context a generic software platform was developed using LabVIEW® in order to provide the same interface and functionality on all scanners. This article describes the concept and features of this software, and its potential for tomography in a research setting. The core concept is to rigorously separate the abstract operation of a CT scanner from its actual physical configuration. This separation is achieved by implementing a sender-listener architecture. The advantages are that the resulting software platform is generic, scalable, highly efficient, easy to develop and to extend, and that it can be deployed on future scanners with minimal effort.
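The separation between abstract scanner operation and physical configuration can be pictured with a small sender-listener sketch. The platform itself is written in LabVIEW, so the Python below only illustrates the pattern; the class names, topic strings, and payload fields are all hypothetical and do not come from the article.

```python
from collections import defaultdict


class MessageBus:
    """Routes abstract messages from senders to whichever listeners registered."""

    def __init__(self):
        self._listeners = defaultdict(list)

    def listen(self, topic, callback):
        self._listeners[topic].append(callback)

    def send(self, topic, **payload):
        # The sender does not know which hardware, if any, handles the topic.
        return [callback(**payload) for callback in self._listeners[topic]]


class RotationStage:
    """Hypothetical hardware-specific listener for one scanner's sample stage."""

    def __init__(self, bus):
        self.angle_deg = 0.0
        bus.listen("rotate_sample", self.rotate)

    def rotate(self, angle_deg):
        self.angle_deg = angle_deg % 360.0
        return self.angle_deg


class Detector:
    """Hypothetical hardware-specific listener that records acquisitions."""

    def __init__(self, bus):
        self.acquired = []
        bus.listen("acquire_projection", self.acquire)

    def acquire(self, index):
        self.acquired.append(index)
        return index


def run_scan(bus, n_projections):
    """Scanner-agnostic scan logic: it only sends abstract messages."""
    for i in range(n_projections):
        bus.send("rotate_sample", angle_deg=360.0 * i / n_projections)
        bus.send("acquire_projection", index=i)
```

In this sketch, supporting a different scanner would mean registering different listener classes on the bus while `run_scan` stays unchanged.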