Learning Deep Representations for Scene Labeling with Semantic Context Guided Supervision
Scene labeling is a challenging classification problem where each input image
requires a pixel-level prediction map. Recently, deep-learning-based methods
have shown their effectiveness in solving this problem. However, we argue that
the large intra-class variation provides ambiguous training information and
hinders the deep models' ability to learn more discriminative deep feature
representations. Unlike existing methods that mainly utilize semantic context
for regularizing or smoothing the prediction map, we design novel supervisions
from semantic context for learning better deep feature representations. Two
types of semantic context, scene names of images and label map statistics of
image patches, are exploited to create label hierarchies between the original
classes and newly created subclasses as the learning supervision. Such subclasses show lower intra-class variation and help CNNs detect more
meaningful visual patterns and learn more effective deep features. Novel
training strategies and a network structure that take advantage of such label
hierarchies are introduced. Our proposed method is evaluated extensively on
four popular datasets: Stanford Background (8 classes), SIFTFlow (33 classes), Barcelona (170 classes), and LM+Sun (232 classes), with three different network structures, and shows state-of-the-art performance. The experiments show that our proposed method makes deep models learn more discriminative feature representations without increasing model size or complexity.
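An illustrative sketch of the subclass-creation idea: cluster the label-map statistics of image patches so that each original class splits into lower-variance subclasses that can serve as auxiliary supervision. The patch size, subclass count, and the use of k-means are assumptions for clarity, not the authors' exact procedure.

```python
# Split each original class into lower-variance subclasses by clustering
# the label-map statistics of image patches. Patch size, subclass count,
# and k-means are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

def make_subclasses(label_maps, num_classes, subs_per_class=4, patch=32):
    """Cluster per-patch label histograms; each original class then
    splits into subclasses usable as auxiliary supervision."""
    feats, owners = [], []
    for lm in label_maps:                        # lm: (H, W) integer label map
        H, W = lm.shape
        for y in range(0, H - patch + 1, patch):
            for x in range(0, W - patch + 1, patch):
                p = lm[y:y + patch, x:x + patch]
                hist = np.bincount(p.ravel(), minlength=num_classes)
                feats.append(hist / hist.sum())  # label-map statistics
                owners.append(int(hist.argmax()))  # patch's dominant class
    feats, owners = np.asarray(feats), np.asarray(owners)
    subclass_models = {}
    for c in range(num_classes):
        idx = np.where(owners == c)[0]
        if len(idx) >= subs_per_class:
            km = KMeans(n_clusters=subs_per_class, n_init=10).fit(feats[idx])
            subclass_models[c] = km  # subclass label = c * subs_per_class + cluster id
    return subclass_models
```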
Kernalised Multi-resolution Convnet for Visual Tracking
Visual tracking is intrinsically a temporal problem. Discriminative
Correlation Filters (DCF) have demonstrated excellent performance for
high-speed generic visual object tracking. Building on this seminal work, a plethora of recent improvements rely on a convolutional neural network (CNN) pretrained on ImageNet as a feature extractor for visual tracking. However, most of these works rely on ad hoc analysis to design the weights for different layers, using either boosting or hedging techniques as an ensemble tracker. In this paper, we go beyond the conventional DCF framework
and propose a Kernalised Multi-resolution Convnet (KMC) formulation that
utilises hierarchical response maps to directly output the target movement.
When the learnt network is deployed directly on a challenging, unseen UAV tracking dataset without any weight adjustment, the proposed model consistently achieves excellent tracking performance. Moreover, the transferred multi-resolution CNN can be integrated into an RNN temporal learning framework, thereby opening the door to end-to-end temporal deep learning (TDL) for visual tracking.
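For background, a minimal sketch of the kernelised correlation-filter response map that DCF/KCF-style trackers, including the multi-resolution formulation above, build upon. The single-channel patches, Gaussian kernel, and regularisation value are standard KCF choices, not the KMC network itself.

```python
# Kernelised correlation-filter (KCF-style) response map.
import numpy as np

def gaussian_correlation(x, y, sigma=0.5):
    """Gaussian kernel correlation of two equal-sized patches via the FFT."""
    c = np.fft.ifft2(np.fft.fft2(x) * np.conj(np.fft.fft2(y))).real
    d = (x ** 2).sum() + (y ** 2).sum() - 2.0 * c
    return np.exp(-np.clip(d, 0.0, None) / (sigma ** 2 * x.size))

def dcf_response(template, search, label, lam=1e-4, sigma=0.5):
    """Train a ridge-regression filter on `template` against a desired
    Gaussian `label` map, then evaluate its response on `search`; the
    response peak gives the target's translation."""
    kf = np.fft.fft2(gaussian_correlation(template, template, sigma))
    alpha_f = np.fft.fft2(label) / (kf + lam)   # closed-form training
    kz_f = np.fft.fft2(gaussian_correlation(search, template, sigma))
    return np.fft.ifft2(alpha_f * kz_f).real    # response map
```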
Beyond Temporal Pooling: Recurrence and Temporal Convolutions for Gesture Recognition in Video
Recent studies have demonstrated the power of recurrent neural networks for
machine translation, image captioning and speech recognition. For the task of
capturing temporal structure in video, however, there still remain numerous
open research questions. Current research suggests using a simple temporal
feature pooling strategy to take into account the temporal aspect of video. We
demonstrate that this method is not sufficient for gesture recognition, where
temporal information is more discriminative compared to general video
classification tasks. We explore deep architectures for gesture recognition in
video and propose a new end-to-end trainable neural network architecture
incorporating temporal convolutions and bidirectional recurrence. Our main
contributions are twofold: first, we show that recurrence is crucial for this
task; second, we show that adding temporal convolutions leads to significant
improvements. We evaluate the different approaches on the Montalbano gesture
recognition dataset, where we achieve state-of-the-art results.
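An illustrative sketch of an architecture in the spirit described above: temporal convolutions over per-frame features followed by bidirectional recurrence and per-frame classification. Layer sizes and the assumption of precomputed frame features are placeholders, not the paper's exact network.

```python
# Temporal convolutions + bidirectional recurrence over frame features.
import torch.nn as nn

class TemporalConvBiRNN(nn.Module):
    def __init__(self, feat_dim=256, hidden=128, num_classes=21):
        super().__init__()
        self.tconv = nn.Sequential(              # temporal convolutions
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1),
            nn.ReLU())
        self.birnn = nn.LSTM(feat_dim, hidden, batch_first=True,
                             bidirectional=True)  # bidirectional recurrence
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, frame_feats):              # (batch, time, feat_dim)
        x = self.tconv(frame_feats.transpose(1, 2)).transpose(1, 2)
        x, _ = self.birnn(x)
        return self.head(x)                      # per-frame class scores
```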
Point Linking Network for Object Detection
Object detection is a core problem in computer vision. With the development
of deep ConvNets, the performance of object detectors has been dramatically
improved. Deep-ConvNet-based object detectors mainly focus on regressing the coordinates of the bounding box, e.g., Faster R-CNN, YOLO, and SSD. Unlike these methods, which consider the bounding box as a whole, we propose a novel object bounding box representation using points and links, implemented with deep ConvNets and termed the Point Linking Network (PLN). Specifically, we regress the corner/center points of the bounding box and their links using a fully convolutional network; then we map the corner points and their links back to multiple bounding boxes; finally, an object detection result is obtained by fusing the multiple bounding boxes. PLN is naturally robust to object occlusion
fusing the multiple bounding boxes. PLN is naturally robust to object occlusion
and flexible to object scale variation and aspect ratio variation. In the
experiments, PLN with the Inception-v2 model achieves state-of-the-art
single-model and single-scale results on the PASCAL VOC 2007, the PASCAL VOC
2012 and the COCO detection benchmarks without bells and whistles. The source
code will be released.
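An illustrative sketch of the point-and-link decoding step: a centre point linked to one corner fixes the whole box, since the box is symmetric about its centre, and boxes from several point/link pairs can be fused. The score-weighted averaging below is a simplification of the full PLN pipeline.

```python
# Decode a box from a linked centre/corner pair, then fuse candidates.
def box_from_center_and_corner(center, corner):
    """A centre (cx, cy) and a linked corner (x, y) determine the box."""
    cx, cy = center
    x, y = corner
    w, h = 2.0 * abs(cx - x), 2.0 * abs(cy - y)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def fuse_boxes(boxes_with_scores):
    """Fuse boxes from several point/link pairs by weighted averaging."""
    total = sum(s for _, s in boxes_with_scores)
    return tuple(sum(b[i] * s for b, s in boxes_with_scores) / total
                 for i in range(4))

# A centre at (50, 40) linked to a top-left corner at (30, 20)
# yields the box (30, 20, 70, 60).
```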
Multi-scale Transformer Language Models
We investigate multi-scale transformer language models that learn
representations of text at multiple scales, and present three different
architectures that have an inductive bias to handle the hierarchical nature of
language. Experiments on large-scale language modeling benchmarks empirically
demonstrate favorable likelihood vs. memory footprint trade-offs; e.g., we show that it is possible to train a hierarchical variant with 30 layers that has a 23% smaller memory footprint and better perplexity, compared to a vanilla transformer with less than half the number of layers, on the Toronto
BookCorpus. We analyze the advantages of learned representations at multiple
scales in terms of memory footprint, compute time, and perplexity, which are
particularly appealing given the quadratic scaling of transformers' run time
and memory usage with respect to sequence length.
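An illustrative sketch of one way to obtain such a multi-scale inductive bias: pool the token sequence to a coarser scale, attend there (the quadratic cost drops by the square of the pooling factor), and broadcast the result back. The pooling factor and residual fusion are assumptions, not one of the paper's three architectures.

```python
# Coarse-scale attention block with pooled tokens.
import torch.nn as nn

class CoarseScaleBlock(nn.Module):
    def __init__(self, d_model=256, nhead=4, factor=4):
        super().__init__()
        self.factor = factor
        self.encoder = nn.TransformerEncoderLayer(d_model, nhead,
                                                  batch_first=True)

    def forward(self, x):          # x: (batch, T, d_model), T % factor == 0
        B, T, D = x.shape
        coarse = x.view(B, T // self.factor, self.factor, D).mean(dim=2)
        coarse = self.encoder(coarse)            # cheap coarse-scale attention
        up = coarse.repeat_interleave(self.factor, dim=1)
        return x + up                            # fuse coarse context back in
```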
Deep Learning Multi-View Representation for Face Recognition
Various factors, such as identities, views (poses), and illuminations, are
coupled in face images. Disentangling the identity and view representations is
a major challenge in face recognition. Existing face recognition systems either
use handcrafted features or learn features discriminatively to improve
recognition accuracy. This is different from the behavior of the human brain. Intriguingly, even without accessing 3D data, humans can not only recognize face identity but also imagine face images of a person under different viewpoints given a single 2D image, making face perception in the brain robust to view changes. In this sense, the human brain has learned and encoded 3D face models from 2D images. To take this instinct into account, this paper proposes a novel deep neural net, named the multi-view perceptron (MVP), which can untangle the identity and view features and, at the same time, infer a full spectrum of multi-view images given a single 2D face image. The identity features of MVP achieve superior performance on the MultiPIE dataset. MVP is also capable of interpolating and predicting images under viewpoints that are unobserved in the training data.
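An illustrative sketch of the disentangling idea: split an encoding into identity and view factors, then decode the identity together with a substituted view code to render an unseen viewpoint. The flat fully connected layers and code sizes are placeholders, not MVP's actual architecture.

```python
# Identity/view disentangling autoencoder (simplified placeholder).
import torch
import torch.nn as nn

class IdentityViewAutoencoder(nn.Module):
    def __init__(self, img_dim=64 * 64, id_dim=128, view_dim=8):
        super().__init__()
        self.id_dim = id_dim
        self.encoder = nn.Sequential(nn.Linear(img_dim, 512), nn.ReLU(),
                                     nn.Linear(512, id_dim + view_dim))
        self.decoder = nn.Sequential(nn.Linear(id_dim + view_dim, 512),
                                     nn.ReLU(), nn.Linear(512, img_dim))

    def forward(self, img, target_view=None):    # img: (batch, img_dim)
        code = self.encoder(img)
        identity, view = code[:, :self.id_dim], code[:, self.id_dim:]
        if target_view is not None:              # "imagine" another viewpoint
            view = target_view
        return self.decoder(torch.cat([identity, view], dim=1)), identity
```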
Intriguing properties of neural networks
Deep neural networks are highly expressive models that have recently achieved
state of the art performance on speech and visual recognition tasks. While
their expressiveness is the reason they succeed, it also causes them to learn
uninterpretable solutions that could have counter-intuitive properties. In this
paper we report two such properties.
First, we find that there is no distinction between individual high level
units and random linear combinations of high level units, according to various
methods of unit analysis. This suggests that it is the space, rather than the individual units, that contains the semantic information in the high layers
of neural networks.
Second, we find that deep neural networks learn input-output mappings that
are fairly discontinuous to a significant extent. We can cause the network to
misclassify an image by applying a certain imperceptible perturbation, which is
found by maximizing the network's prediction error. In addition, the specific
nature of these perturbations is not a random artifact of learning: the same
perturbation can cause a different network, that was trained on a different
subset of the dataset, to misclassify the same input.
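An illustrative sketch of perturbing an input to maximise the network's prediction error. The paper solves a box-constrained optimisation; the single signed-gradient step below (popularised in later work as FGSM) is a simplified stand-in.

```python
# One gradient step that increases the network's prediction error.
import torch.nn.functional as F

def adversarial_perturbation(model, image, true_label, eps=0.01):
    image = image.clone().requires_grad_(True)
    loss = F.cross_entropy(model(image), true_label)
    loss.backward()                          # gradient of the error w.r.t. pixels
    perturbed = image + eps * image.grad.sign()
    return perturbed.clamp(0, 1).detach()    # imperceptible for small eps
```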
Spatio-Temporal Action Graph Networks
Events defined by the interaction of objects in a scene are often of critical
importance; yet important events may have insufficient labeled examples to
train a conventional deep model to generalize to future object appearance.
Activity recognition models that represent object interactions explicitly have
the potential to learn in a more efficient manner than those that represent
scenes with global descriptors. We propose a novel inter-object graph
representation for activity recognition based on a disentangled graph embedding
with direct observation of edge appearance. We employ a novel factored
embedding of the graph structure, disentangling a representation hierarchy
formed over spatial dimensions from that found over temporal variation. We
demonstrate the effectiveness of our model on the Charades activity recognition
benchmark, as well as a new dataset of driving activities focusing on
multi-object interactions with near-collision events. Our model offers
significantly improved performance compared to baseline approaches without
object-graph representations, or with previous graph-based models.
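An illustrative sketch of a factored spatio-temporal object-graph model: message passing over object nodes within each frame (spatial), then a recurrent pass over per-frame graph summaries (temporal). The edge MLP, mean-pooling readout, and feature sizes are assumptions, not the paper's exact factorisation.

```python
# Factored spatial message passing + temporal recurrence over frames.
import torch
import torch.nn as nn

class SpatioTemporalGraphNet(nn.Module):
    def __init__(self, node_dim=128, edge_dim=64, hidden=128, num_classes=157):
        super().__init__()
        self.edge_mlp = nn.Sequential(
            nn.Linear(2 * node_dim + edge_dim, hidden), nn.ReLU())
        self.temporal = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, nodes, edges):
        # nodes: (B, T, N, node_dim); edges: (B, T, N, N, edge_dim)
        B, T, N, D = nodes.shape
        src = nodes.unsqueeze(3).expand(B, T, N, N, D)   # sender features
        dst = nodes.unsqueeze(2).expand(B, T, N, N, D)   # receiver features
        msg = self.edge_mlp(torch.cat([src, dst, edges], dim=-1))
        frame = msg.mean(dim=(2, 3))             # per-frame graph summary
        out, _ = self.temporal(frame)            # temporal pass over frames
        return self.head(out[:, -1])             # clip-level activity scores
```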
Time-series modeling with undecimated fully convolutional neural networks
We present a new convolutional neural network-based time-series model.
Typical convolutional neural network (CNN) architectures rely on the use of
max-pooling operators in between layers, which leads to reduced resolution at
the top layers. Instead, in this work we consider a fully convolutional network
(FCN) architecture that uses causal filtering operations, and allows for the
rate of the output signal to be the same as that of the input signal. We
furthermore propose an undecimated version of the FCN, which we refer to as the undecimated fully convolutional neural network (UFCNN) and which is motivated by the undecimated wavelet transform. Our experimental results verify that using the
undecimated version of the FCN is necessary in order to allow for effective
time-series modeling. The UFCNN has several advantages compared to other
time-series models such as the recurrent neural network (RNN) and long
short-term memory (LSTM), since it does not suffer from either the vanishing or exploding gradient problem and is therefore easier to train. Convolution
operations can also be implemented more efficiently compared to the recursion
that is involved in RNN-based models. We evaluate the performance of our model
in a synthetic target tracking task using bearing-only measurements generated from a state-space model, a probabilistic modeling problem for polyphonic music sequences, and a high-frequency trading task using a time series of
ask/bid quotes and their corresponding volumes. Our experimental results using
synthetic and real datasets verify the significant advantages of the UFCNN
compared to the RNN and LSTM baselines.
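An illustrative sketch of a causal, undecimated convolutional stack: instead of decimating with pooling, each level dilates its filter (the "à trous" trick behind the undecimated wavelet transform), so the output keeps the input's rate. Depth and channel widths are placeholders.

```python
# Causal, undecimated 1-D convolutions with dilation in place of pooling.
import torch.nn as nn

class CausalUndecimatedConv(nn.Module):
    def __init__(self, channels=32, levels=4, kernel=3):
        super().__init__()
        layers = []
        for lvl in range(levels):
            d = 2 ** lvl                         # dilation doubles per level
            layers += [nn.ConstantPad1d(((kernel - 1) * d, 0), 0.0),  # left-pad: causal
                       nn.Conv1d(1 if lvl == 0 else channels, channels,
                                 kernel, dilation=d),
                       nn.ReLU()]
        self.net = nn.Sequential(*layers)
        self.out = nn.Conv1d(channels, 1, kernel_size=1)

    def forward(self, x):                        # x: (batch, 1, T) -> (batch, 1, T)
        return self.out(self.net(x))
```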
An Interactive Insight Identification and Annotation Framework for Power Grid Pixel Maps using DenseU-Hierarchical VAE
Insights in power grid pixel maps (PGPMs) refer to important facility
operating states and unexpected changes in the power grid. Identifying insights
helps analysts understand the collaboration of various parts of the grid so
that preventive and corrective operations can be taken to avoid potential accidents. Existing solutions identify insights in PGPMs manually, which may be laborious and expertise-dependent. In this paper, we
propose an interactive insight identification and annotation framework by
leveraging an enhanced variational autoencoder (VAE). In particular, a new
architecture, DenseU-Hierarchical VAE (DUHiV), is designed to learn
representations from large-sized PGPMs, which achieves a significantly tighter
evidence lower bound (ELBO) than existing Hierarchical VAEs with a Multilayer
Perceptron architecture. Our approach supports modulating the derived representations in an interactive visual interface, discovering potential insights, and creating multi-label annotations. Evaluations using real-world PGPM datasets
show that our framework outperforms the baseline models in identifying and
annotating insights.
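An illustrative sketch of the evidence lower bound (ELBO) for a two-level hierarchical VAE of the kind DUHiV tightens, with diagonal-Gaussian posteriors and, for simplicity, standard-normal priors at both levels. The encoder/decoder callables stand in for the DenseU-style network and are assumptions.

```python
# Two-level hierarchical VAE ELBO with reparameterised sampling.
import torch

def gaussian_kl(mu, logvar):
    """KL(N(mu, diag(exp(logvar))) || N(0, I)), per example."""
    return 0.5 * (mu ** 2 + logvar.exp() - 1.0 - logvar).sum(dim=1)

def hierarchical_elbo(x, enc1, enc2, dec, recon_loglik):
    mu1, lv1 = enc1(x)                                    # q(z1 | x)
    z1 = mu1 + (0.5 * lv1).exp() * torch.randn_like(mu1)  # reparameterisation
    mu2, lv2 = enc2(z1)                                   # q(z2 | z1)
    z2 = mu2 + (0.5 * lv2).exp() * torch.randn_like(mu2)
    log_px = recon_loglik(dec(z1, z2), x)                 # log p(x | z1, z2)
    return (log_px - gaussian_kl(mu1, lv1) - gaussian_kl(mu2, lv2)).mean()
```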