Learning Spatiotemporal Features for Infrared Action Recognition with 3D Convolutional Neural Networks
Infrared (IR) imaging has the potential to enable more robust action
recognition systems compared to visible spectrum cameras due to lower
sensitivity to lighting conditions and appearance variability. While the action
recognition task on videos collected from visible spectrum imaging has received
much attention, action recognition in IR videos is significantly less explored.
Our objective is to exploit imaging data in this modality for the action
recognition task. In this work, we propose a novel two-stream 3D convolutional
neural network (CNN) architecture by introducing the discriminative code layer
and the corresponding discriminative code loss function. The proposed network
processes IR image sequences and IR-based optical flow field sequences. We pretrain
the 3D CNN model on the visible spectrum Sports-1M action dataset and finetune
it on the Infrared Action Recognition (InfAR) dataset. To the best of our knowledge,
this is the first application of the 3D CNN to action recognition in the IR
domain. We conduct an elaborate analysis of different fusion schemes (weighted
average, single and double-layer neural nets) applied to different 3D CNN
outputs. Experimental results demonstrate that our approach can achieve
state-of-the-art average precision (AP) performance on the InfAR dataset: (1)
the proposed two-stream 3D CNN achieves the best reported AP of 77.5%, and (2) our
3D CNN model applied to the optical flow fields achieves the best reported
single-stream AP of 75.42%.
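The weighted-average fusion scheme named in the abstract above can be illustrated with a minimal sketch. The function names `softmax`, `fuse_scores` and `predict`, the weight `w_flow`, and the example scores are hypothetical and are not taken from the paper; the abstract does not state the fusion weights the authors used.

```python
import math


def softmax(logits):
    """Convert raw class scores to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(ir_probs, flow_probs, w_flow=0.5):
    """Weighted average of per-class probabilities from two streams.

    w_flow is an assumed weight for the optical-flow stream;
    the IR-image stream receives 1 - w_flow.
    """
    if len(ir_probs) != len(flow_probs):
        raise ValueError("both streams must score the same classes")
    if not 0.0 <= w_flow <= 1.0:
        raise ValueError("w_flow must lie in [0, 1]")
    return [(1.0 - w_flow) * a + w_flow * b
            for a, b in zip(ir_probs, flow_probs)]


def predict(probs):
    """Return the index of the highest-scoring class."""
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical raw scores for three action classes from each stream.
ir_probs = softmax([2.0, 1.0, 0.1])
flow_probs = softmax([0.5, 2.5, 0.2])
fused = fuse_scores(ir_probs, flow_probs, w_flow=0.6)
print(predict(fused))
```

In practice the weight would be chosen on held-out data. The single-layer and double-layer neural net fusion schemes mentioned in the abstract would replace the fixed average with a learned mapping.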
Collaborative Layer-wise Discriminative Learning in Deep Neural Networks
Intermediate features at different layers of a deep neural network are known
to be discriminative for visual patterns of different complexities. However,
most existing works ignore such cross-layer heterogeneities when classifying
samples of different complexities. For example, if a training sample has
already been correctly classified at a specific layer with high confidence, we
argue that it is unnecessary to force the remaining layers to classify this sample
correctly and a better strategy is to encourage those layers to focus on other
samples.
In this paper, we propose a layer-wise discriminative learning method to
enhance the discriminative capability of a deep network by allowing its layers
to work collaboratively for classification. Towards this target, we introduce
multiple classifiers on top of multiple layers. Each classifier not only tries
to correctly classify the features from its input layer, but also coordinates
with other classifiers to jointly maximize the final classification
performance. Guided by the other companion classifiers, each classifier learns
to concentrate on certain training examples and boosts the overall performance.
Allowing for end-to-end training, our method can be conveniently embedded into
state-of-the-art deep networks. Experiments with multiple popular deep
networks, including Network in Network, GoogLeNet and VGGNet, on object
classification benchmarks of various scales, including CIFAR100, MNIST and ImageNet, and
scene classification benchmarks, including MIT67, SUN397 and Places205,
demonstrate the effectiveness of our method. We also analyze the
relationship between the proposed method and classical conditional random
field models.
Comment: To appear in ECCV 2016. May be subject to minor changes before the
camera-ready version
The Impact of Explanations on AI Competency Prediction in VQA
Explainability is one of the key elements for building trust in AI systems.
Among numerous attempts to make AI explainable, quantifying the effect of
explanations remains a challenge in conducting human-AI collaborative tasks.
Aside from the ability to predict the overall behavior of AI, in many
applications, users need to understand an AI agent's competency in different
aspects of the task domain. In this paper, we evaluate the impact of
explanations on the user's mental model of AI agent competency within the task
of visual question answering (VQA). We quantify users' understanding of
competency, based on the correlation between the actual system performance and
user rankings. We introduce an explainable VQA system that uses spatial and
object features and is powered by the BERT language model. Each group of users
sees only one kind of explanation to rank the competencies of the VQA model.
The proposed model is evaluated through between-subject experiments to probe
explanations' impact on the user's perception of competency. The comparison
between two VQA models shows BERT-based explanations and the use of object
features improve the user's prediction of the model's competencies.
Comment: Submitted to HCCAI 202
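The competency measure described in the abstract above correlates actual system performance with user rankings. A minimal sketch of one way to compute this follows, assuming Spearman rank correlation. The abstract does not say which coefficient the authors use, and the function names and example numbers are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences of length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


# Hypothetical accuracy of a VQA model on four question types, and one
# user's competency scores for the same types (higher = more competent).
model_accuracy = [0.90, 0.60, 0.75, 0.40]
user_scores = [4, 1, 3, 2]
print(spearman(model_accuracy, user_scores))
```

A value near 1 would indicate that the user's mental model orders the competencies the same way the measured performance does.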
Beyond One-hot Encoding: lower dimensional target embedding
Target encoding plays a central role when learning Convolutional Neural
Networks. In this realm, One-hot encoding is the most prevalent strategy due to
its simplicity. However, this widespread encoding scheme assumes a flat
label space, thus ignoring rich relationships existing among labels that can be
exploited during training. In large-scale datasets, data does not span the full
label space, but instead lies in a low-dimensional output manifold. Following
this observation, we embed the targets into a low-dimensional space,
drastically improving convergence speed while preserving accuracy. Our
contribution is twofold: (i) We show that random projections of the label
space are a valid tool to find such lower dimensional embeddings, dramatically
boosting convergence rates at zero computational cost; and (ii) we propose
a normalized eigenrepresentation of the class manifold that encodes the targets
with minimal information loss, improving the accuracy of random projections
encoding while enjoying the same convergence rates. Experiments on CIFAR-100,
CUB200-2011, ImageNet, and MIT Places demonstrate that the proposed approach
drastically improves convergence speed while reaching very competitive accuracy
rates.
Comment: Published at Image and Vision Computing
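Contribution (i) above, random projection of the label space, can be sketched as follows. Projecting a one-hot vector through a random matrix simply selects one row of that matrix, so each class receives a fixed random low-dimensional target. The Gaussian sampling, unit normalization, dot-product decoding, and all names below are assumptions made for illustration; the abstract does not give these details.

```python
import math
import random


def random_label_embeddings(num_classes, dim, seed=0):
    """Build one random unit-length target vector per class.

    Equivalent to multiplying each one-hot label by a random
    (num_classes x dim) Gaussian matrix and normalizing the result.
    """
    rng = random.Random(seed)
    table = []
    for _ in range(num_classes):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        table.append([x / norm for x in v])
    return table


def decode(prediction, table):
    """Map a predicted embedding back to the most similar class."""
    scores = [sum(p * t for p, t in zip(prediction, row)) for row in table]
    return max(range(len(table)), key=scores.__getitem__)


# 100 classes embedded in 16 dimensions instead of a 100-way one-hot.
table = random_label_embeddings(num_classes=100, dim=16)
target_for_class_7 = table[7]  # regression target used during training
print(decode(target_for_class_7, table))
```

A network trained this way would regress onto `table[label]` rather than a one-hot vector, and `decode` would recover the class at test time.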
Discriminative Feature Learning with Application to Fine-grained Recognition
For various computer vision tasks, finding suitable feature representations is fundamental. Fine-grained recognition, which distinguishes sub-categories under the same super-category (e.g., bird species, car makes and models), is a good task for studying discriminative feature learning for visual recognition. The main reason is that the inter-class variations between fine-grained categories are very subtle, and even smaller than the intra-class variations caused by pose or deformation.
This thesis focuses on tasks mostly related to fine-grained categories. After briefly discussing our earlier attempt to capture subtle visual differences using sparse/low-rank analysis, the main part of the thesis reflects the trends of the past few years as deep learning has come to prevail.
In the first part of the thesis, we address the problem of fine-grained recognition via a patch-based framework built upon Convolutional Neural Network (CNN) features. We introduce triplets of patches with two geometric constraints to improve the accuracy of patch localization, and automatically mine discriminative geometrically-constrained triplets for recognition.
In the second part we begin to learn discriminative features in an end-to-end fashion. We propose a supervised feature learning approach, Label Consistent Neural Network, which enforces direct supervision in late hidden layers. We associate each neuron in a hidden layer with a particular class and encourage it to be activated for input signals from the same class by introducing a label consistency regularization. This label consistency constraint makes the features more discriminative and tends to yield faster convergence.
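The label consistency idea in the paragraph above can be sketched as a penalty on hidden activations. This is a simplified illustration under stated assumptions: the penalty here is a plain squared error against a binary target pattern, and the neuron-to-class assignment, function names, and example values are hypothetical. The thesis's exact formulation is not given in the abstract.

```python
def ideal_activation(label, neuron_classes):
    """Target pattern: 1 for neurons assigned to `label`, else 0."""
    return [1.0 if c == label else 0.0 for c in neuron_classes]


def label_consistency_penalty(activations, label, neuron_classes):
    """Squared distance between hidden activations and the target pattern.

    Small when only the neurons tied to the sample's class fire.
    """
    if len(activations) != len(neuron_classes):
        raise ValueError("one class assignment is needed per neuron")
    target = ideal_activation(label, neuron_classes)
    return sum((a - t) ** 2 for a, t in zip(activations, target))


# Hypothetical layer with four neurons: two tied to class 0, two to class 1.
neuron_classes = [0, 0, 1, 1]
consistent = [0.9, 0.8, 0.1, 0.0]    # class-0 neurons fire for a class-0 input
inconsistent = [0.1, 0.0, 0.9, 0.8]  # wrong neurons fire
print(label_consistency_penalty(consistent, 0, neuron_classes))
print(label_consistency_penalty(inconsistent, 0, neuron_classes))
```

In training, such a penalty would be added to the classification loss with a weighting coefficient, so that the hidden layer receives direct supervision in addition to the gradient from the output layer.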
The third part proposes a more sophisticated and effective end-to-end network specifically designed for fine-grained recognition, which learns discriminative patches within a CNN. We show that patch-level learning capability of CNN can be enhanced by learning a bank of convolutional filters that capture class-specific discriminative patches without extra part or bounding box annotations. Such a filter bank is well structured, properly initialized and discriminatively learned through a novel asymmetric multi-stream architecture with convolutional filter supervision and a non-random layer initialization.
In the last part we go beyond obtaining category labels and study the problem of continuous 3D pose estimation for fine-grained object categories. We augment three existing popular fine-grained recognition datasets by annotating each instance in the image with its corresponding fine-grained 3D shape and ground-truth 3D pose. We cast the problem into a detection framework based on Faster/Mask R-CNN. To utilize the 3D information, we also introduce a novel 3D representation, named location field, that is effective for representing 3D shapes