Interpretable Transformations with Encoder-Decoder Networks
Deep feature spaces have the capacity to encode complex transformations of
their input data. However, understanding the relative feature-space
relationship between two transformed, encoded images is difficult. For instance,
what is the relative feature-space relationship between two rotated images?
What is decoded when we interpolate in feature space? Ideally, we want to
disentangle confounding factors, such as pose, appearance, and illumination,
from object identity. Disentangling these is difficult because they interact in
very nonlinear ways. We propose a simple method to construct a deep feature
space, with explicitly disentangled representations of several known
transformations. A person or algorithm can then manipulate the disentangled
representation, for example, to re-render an image with explicit control over
parameterized degrees of freedom. The feature space is constructed using a
transforming encoder-decoder network with a custom feature transform layer,
acting on the hidden representations. We demonstrate the advantages of explicit
disentangling on a variety of datasets and transformations, and as an aid for
traditional tasks, such as classification.
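To make the idea concrete, here is a minimal sketch of a transforming encoder-decoder with an explicit feature transform layer. It assumes a PyTorch setup with 28x28 images, an arbitrary split of the latent code into pose and identity parts, and a 2D rotation applied pairwise to the pose dimensions; it illustrates the pattern only and is not the authors' implementation.

```python
# Illustrative sketch only -- not the paper's code. It shows the transforming
# encoder-decoder pattern: encode, apply an explicit transformation to part of
# the latent code, then decode. All sizes and the pose/identity split are
# assumptions made for the example.
import math
import torch
import torch.nn as nn


class TransformingAutoencoder(nn.Module):
    def __init__(self, latent_dim=64, pose_dim=32):
        super().__init__()
        assert pose_dim % 2 == 0, "pose dims are rotated in 2D pairs"
        self.pose_dim = pose_dim
        self.encoder = nn.Sequential(
            nn.Flatten(), nn.Linear(28 * 28, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, 28 * 28), nn.Sigmoid(),
        )

    def feature_transform(self, z, angle):
        """Rotate each 2D pair of the pose part of the latent code by `angle`."""
        pose, identity = z[:, :self.pose_dim], z[:, self.pose_dim:]
        pairs = pose.view(z.size(0), -1, 2)                 # (B, pose_dim/2, 2)
        c, s = torch.cos(angle), torch.sin(angle)
        rot = torch.stack([torch.stack([c, -s]), torch.stack([s, c])])  # (2, 2)
        pairs = pairs @ rot.T
        return torch.cat([pairs.reshape(z.size(0), -1), identity], dim=1)

    def forward(self, x, angle):
        z = self.encoder(x)
        z = self.feature_transform(z, angle)
        return self.decoder(z).view(-1, 1, 28, 28)


# Usage: re-render a batch of 28x28 images with a 30-degree feature-space rotation.
model = TransformingAutoencoder()
x = torch.rand(8, 1, 28, 28)
out = model(x, torch.tensor(math.pi / 6))
```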
CubeNet: Equivariance to 3D Rotation and Translation
3D Convolutional Neural Networks are sensitive to transformations applied to
their input. This is a problem because a voxelized version of a 3D object, and
its rotated clone, will look unrelated to each other by the time they reach
the last layer of a network. Instead, an idealized model would preserve a
meaningful representation of the voxelized object, while explaining the
pose-difference between the two inputs. An equivariant representation vector
has two components: an invariant identity part and a discernible encoding of
the transformation. Models that can't explain pose-differences risk "diluting"
the representation, in pursuit of optimizing a classification or regression
loss function.
We introduce a Group Convolutional Neural Network with linear equivariance to
translations and right-angle rotations in three dimensions. We call this
network CubeNet, reflecting its cube-like symmetry. By construction, this
network helps preserve a 3D shape's global and local signature, as it is
transformed through successive layers. We apply this network to a variety of 3D
inference problems, achieving state-of-the-art on the ModelNet10 classification
challenge, and comparable performance on the ISBI 2012 Connectome Segmentation
Benchmark. To the best of our knowledge, this is the first 3D rotation
equivariant CNN for voxel representations.
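A rough sketch of the lifting group-convolution idea behind such a network is given below, restricted for brevity to the four right-angle rotations about a single axis (the full cube rotation group has 24 elements). The module names, sizes, and the single-axis restriction are all assumptions for illustration, not CubeNet's actual code.

```python
# Sketch of a 'lifting' group convolution over voxel grids: convolve the input
# with rotated copies of each filter so the output carries an explicit group
# (rotation) axis. Restricted to the 4 right-angle rotations about the z-axis
# for brevity; the paper's network handles richer 3D rotation groups.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Z4LiftingConv3d(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k, k) * 0.1)

    def forward(self, x):                       # x: (B, C_in, D, H, W)
        outs = []
        for r in range(4):                      # rotate each filter by r * 90 deg
            w = torch.rot90(self.weight, r, dims=(-2, -1))
            outs.append(F.conv3d(x, w, padding=self.weight.shape[-1] // 2))
        # (B, 4, C_out, D, H, W): one feature map per rotation of the filters.
        return torch.stack(outs, dim=1)


# Rotating the input by 90 degrees about the z-axis rotates these maps and
# cyclically shifts the group axis, rather than scrambling the features.
layer = Z4LiftingConv3d(1, 8)
vox = torch.rand(2, 1, 16, 16, 16)
feat = layer(vox)                               # torch.Size([2, 4, 8, 16, 16, 16])
```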
A generic framework for median graph computation based on a recursive embedding approach
The median graph has been shown to be a good choice to obtain a representative of a set of graphs. However, its computation is a complex problem. Recently, graph embedding into vector spaces has been proposed to obtain approximations of the median graph. The difficulty with such an approach is how to go from a point in the vector space back to a graph in the graph space. The main contribution of this paper is a generalization of this previous method: a generic recursive procedure that recovers the graph corresponding to a point in the vector space, introducing only the amount of approximation inherent in the use of graph-matching algorithms. To evaluate the proposed method, we compare it with the set median and with other state-of-the-art embedding-based methods for median graph computation. The experiments are carried out on four different databases (one semi-artificial and three containing real-world data). Results show that the proposed approach obtains better medians, in terms of the sum of distances to the training graphs, than the previous existing methods. This work has been supported by the Spanish research programmes Consolider Ingenio 2010 CSD2007-00018, TIN2006-15694-C02-02 and TIN2008-04998 and the fellowship RYC-2009-05031.
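The overall recipe can be sketched as follows. Here `edit_distance` and the coordinate-wise median are hypothetical stand-ins for the graph-matching algorithm and the median actually computed in the vector space, and the final step returns the set-median baseline rather than the paper's recursively recovered graph.

```python
# High-level sketch of the embedding approach to median graph computation.
# `edit_distance` is a placeholder for an (approximate) graph edit distance
# solver; this is not the authors' implementation.
import numpy as np
import networkx as nx


def edit_distance(g1, g2):
    # Exact graph edit distance is expensive; real systems use approximate
    # matching (e.g., bipartite assignment). Fine for tiny toy graphs.
    return nx.graph_edit_distance(g1, g2, timeout=1.0)


def embed(graphs):
    """Dissimilarity embedding: each graph becomes its vector of distances
    to every graph in the set."""
    n = len(graphs)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = edit_distance(graphs[i], graphs[j])
    return d


def median_graph(graphs):
    vectors = embed(graphs)
    median_point = np.median(vectors, axis=0)   # median taken in vector space
    # Set-median baseline: the training graph whose embedding is closest to
    # the median point. The paper's contribution is a recursive procedure that
    # recovers a (possibly new) graph for this point instead of settling for
    # the nearest training graph.
    closest = int(np.argmin(np.linalg.norm(vectors - median_point, axis=1)))
    return graphs[closest]


# Usage on a toy set of graphs.
gs = [nx.path_graph(4), nx.cycle_graph(4), nx.star_graph(3)]
print(median_graph(gs))
```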
BasicTAD: an Astounding RGB-Only Baseline for Temporal Action Detection
Temporal action detection (TAD) is extensively studied in the video
understanding community by generally following the object detection pipeline in
images. However, complex designs are not uncommon in TAD, such as two-stream
feature extraction, multi-stage training, complex temporal modeling, and global
context fusion. In this paper, we do not aim to introduce any novel technique
for TAD. Instead, we study a simple, straightforward, yet essential baseline,
given the current state of complex designs and low detection efficiency in TAD.
In our simple baseline (termed BasicTAD), we decompose the TAD pipeline into
several essential components: data sampling, backbone design, neck
construction, and detection head. We extensively investigate the existing
techniques in each component for this baseline, and more importantly, perform
end-to-end training over the entire pipeline thanks to the simplicity of
design. As a result, this simple BasicTAD yields an astounding, real-time,
RGB-only baseline that comes very close to state-of-the-art methods with
two-stream inputs. In addition, we further improve BasicTAD by preserving more
temporal and spatial information in the network representation (termed PlusTAD).
Empirical results demonstrate that our PlusTAD is very efficient and
significantly outperforms the previous methods on the datasets of THUMOS14 and
FineAction. We also perform in-depth visualization and error analysis of our
proposed method to provide more insight into the TAD problem. Our approach can
serve as a strong baseline for future TAD research.
The code and model will be released at https://github.com/MCG-NJU/BasicTAD.
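As a sketch of how such a pipeline decomposes, the following PyTorch-style module wires placeholder versions of the four named components together; the layer choices and sizes are illustrative assumptions and do not reflect BasicTAD's actual architecture.

```python
# Illustrative decomposition of a one-stage TAD pipeline into backbone, neck,
# and detection head; module internals are stand-ins, not BasicTAD code.
import torch
import torch.nn as nn


class SimpleTADPipeline(nn.Module):
    def __init__(self, num_classes=20, feat_dim=256):
        super().__init__()
        # Backbone: per-frame visual features from RGB frames only.
        self.backbone = nn.Conv3d(3, feat_dim, kernel_size=(1, 7, 7),
                                  stride=(1, 4, 4), padding=(0, 3, 3))
        # Neck: pool space away and model time with temporal convolutions.
        self.neck = nn.Sequential(
            nn.AdaptiveAvgPool3d((None, 1, 1)), nn.Flatten(2),
            nn.Conv1d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(),
        )
        # Head: per-timestep class scores and boundary offsets, in the spirit
        # of anchor-free object detectors.
        self.cls_head = nn.Conv1d(feat_dim, num_classes, 1)
        self.reg_head = nn.Conv1d(feat_dim, 2, 1)   # (start, end) offsets

    def forward(self, clip):                        # clip: (B, 3, T, H, W)
        feat = self.backbone(clip)                  # (B, C, T, H', W')
        feat = self.neck(feat)                      # (B, C, T)
        return self.cls_head(feat), self.reg_head(feat)


# Data sampling would feed uniformly sampled RGB frames; end-to-end training
# then optimizes backbone, neck, and head jointly under a single loss.
model = SimpleTADPipeline()
scores, offsets = model(torch.rand(2, 3, 32, 112, 112))
```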