Benchmark Analysis of Representative Deep Neural Network Architectures
This work presents an in-depth analysis of the majority of the deep neural
networks (DNNs) proposed in the state of the art for image recognition. For
each DNN multiple performance indices are observed, such as recognition
accuracy, model complexity, computational complexity, memory usage, and
inference time. The behavior of such performance indices and some combinations
of them are analyzed and discussed. To measure the indices, we run the
DNNs on two different computer architectures: a workstation equipped
with an NVIDIA Titan X Pascal and an embedded system based on an NVIDIA Jetson
TX1 board. This experimentation allows a direct comparison between DNNs running
on machines with very different computational capacity. This study is useful
for researchers to have a complete view of what solutions have been explored so
far and which research directions are worth exploring in the future, and for
practitioners to select the DNN architecture(s) that best fit the resource
constraints of practical deployments and applications. To complete this work,
all the DNNs, as well as the software used for the analysis, are available
online.
Comment: Will appear in IEEE Access
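As a rough illustration of how an inference-time index like the one listed in this abstract might be collected, here is a minimal sketch. It is not the authors' benchmarking software: `time_inference`, `dummy_model`, and the warm-up and run counts are hypothetical names and values chosen for the example, and the stand-in model is plain Python rather than a DNN.

```python
import statistics
import time


def time_inference(predict, batch, warmup=3, runs=10):
    """Return the median wall-clock time of predict(batch) in milliseconds.

    `predict` is any callable; warmup/runs are illustrative defaults,
    not values taken from the paper.
    """
    # Warm-up calls are discarded so one-off setup costs do not skew timings.
    for _ in range(warmup):
        predict(batch)

    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(batch)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings_ms)


def dummy_model(batch):
    """Hypothetical stand-in for a network forward pass."""
    return [sum(row) for row in batch]


batch = [[0.5] * 1000 for _ in range(8)]
median_ms = time_inference(dummy_model, batch)
print(f"median inference time: {median_ms:.3f} ms")
```

On a GPU, the timed call would also have to wait for the device to finish its queued work before the clock is stopped, since kernel launches are asynchronous.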
FoveaBox: Beyond Anchor-based Object Detector
We present FoveaBox, an accurate, flexible, and completely anchor-free
framework for object detection. While almost all state-of-the-art object
detectors utilize predefined anchors to enumerate possible locations, scales
and aspect ratios for the search of the objects, their performance and
generalization ability are also limited by the design of those anchors. Instead,
FoveaBox directly learns the likelihood that an object exists and the bounding box
coordinates without anchor reference. This is achieved by: (a) predicting
category-sensitive semantic maps for the object existence likelihood, and (b)
producing a category-agnostic bounding box for each position that potentially
contains an object. The scales of target boxes are naturally associated with
feature pyramid representations. In FoveaBox, an instance is assigned to
adjacent feature levels to make the model more accurate. We demonstrate its
effectiveness on standard benchmarks and report extensive experimental
analysis. Without bells and whistles, FoveaBox achieves state-of-the-art single
model performance on the standard COCO and Pascal VOC object detection
benchmarks. More importantly, FoveaBox avoids all computation and
hyper-parameters related to anchor boxes, to which the final detection
performance is often sensitive. We believe the simple and effective approach will
serve as a solid baseline and help ease future research for object detection.
The code has been made publicly available at
https://github.com/taokong/FoveaBox.
Comment: IEEE Transactions on Image Processing, code at:
https://github.com/taokong/FoveaBox
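The idea of tying target box scale to pyramid levels, and letting one instance land on adjacent levels, can be sketched as follows. This is only an illustration under stated assumptions: the abstract gives no numbers, so the per-level basic scales in `BASIC_SCALE`, the tolerance factor `ETA`, and the function name `assign_levels` are all hypothetical.

```python
import math

# Hypothetical basic scale (in pixels) for each pyramid level, and a
# hypothetical tolerance factor; neither value comes from the abstract.
BASIC_SCALE = {"P3": 32, "P4": 64, "P5": 128, "P6": 256, "P7": 512}
ETA = 2.0


def assign_levels(box, basic_scale=BASIC_SCALE, eta=ETA):
    """Return every pyramid level whose scale range covers the box.

    box is (x1, y1, x2, y2). A level with basic scale s accepts boxes whose
    size sqrt(width * height) lies in [s / eta, s * eta]. Because the ranges
    of neighbouring levels overlap, one box can map to adjacent levels.
    """
    x1, y1, x2, y2 = box
    size = math.sqrt((x2 - x1) * (y2 - y1))
    return [lvl for lvl, s in basic_scale.items() if s / eta <= size <= s * eta]


print(assign_levels((0, 0, 100, 100)))  # ['P4', 'P5']
```

A 100 x 100 box has size 100, which falls inside both the P4 range [32, 128] and the P5 range [64, 256], so it is assigned to two adjacent levels.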
VideoCapsuleNet: A Simplified Network for Action Detection
The recent advances in Deep Convolutional Neural Networks (DCNNs) have shown
extremely good results for video human action classification; however, action
detection is still a challenging problem. The current action detection
approaches follow a complex pipeline which involves multiple tasks such as tube
proposals, optical flow, and tube classification. In this work, we present a
more elegant solution for action detection based on the recently developed
capsule network. We propose a 3D capsule network for videos, called
VideoCapsuleNet: a unified network for action detection which can jointly
perform pixel-wise action segmentation along with action classification. The
proposed network is a generalization of capsule network from 2D to 3D, which
takes a sequence of video frames as input. The 3D generalization drastically
increases the number of capsules in the network, making capsule routing
computationally expensive. We introduce capsule-pooling in the convolutional
capsule layer to address this issue, which makes the voting algorithm tractable.
The routing-by-agreement in the network inherently models the action
representations and various action characteristics are captured by the
predicted capsules. This inspired us to utilize the capsules for action
localization, and the class-specific capsules predicted by the network are used
to determine a pixel-wise localization of actions. The localization is further
improved by parameterized skip connections with the convolutional capsule
layers, and the network is trained end-to-end with a classification as well as
localization loss. The proposed network achieves state-of-the-art performance on
multiple action detection datasets including UCF-Sports, J-HMDB, and UCF-101
(24 classes) with an impressive ~20% improvement on UCF-101 and ~15%
improvement on J-HMDB in terms of v-mAP scores.
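To make the capsule-pooling idea concrete, the sketch below collapses all capsules of the same type within a receptive field into a single capsule, so that routing has one vote per capsule type instead of one per position and type. The abstract does not state the pooling operator; averaging is an assumption here, and `capsule_pool` is a hypothetical name.

```python
def capsule_pool(capsules):
    """Average capsules of the same type across receptive-field positions.

    capsules[p][t] is the pose vector (list of floats) of capsule type t at
    position p. Returns one mean pose vector per type, which reduces the
    number of votes from positions * types to types.
    Mean pooling is an assumption, not a detail given in the abstract.
    """
    n_pos = len(capsules)
    n_types = len(capsules[0])
    dim = len(capsules[0][0])
    pooled = []
    for t in range(n_types):
        mean_pose = [
            sum(capsules[p][t][d] for p in range(n_pos)) / n_pos
            for d in range(dim)
        ]
        pooled.append(mean_pose)
    return pooled


# Two positions, two capsule types, 2-dimensional poses.
field = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[3.0, 4.0], [5.0, 6.0]],
]
print(capsule_pool(field))  # [[2.0, 3.0], [4.0, 5.0]]
```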
Exploiting Image-trained CNN Architectures for Unconstrained Video Classification
We conduct an in-depth exploration of different strategies for doing event
detection in videos using convolutional neural networks (CNNs) trained for
image classification. We study different ways of performing spatial and
temporal pooling, feature normalization, choice of CNN layers as well as choice
of classifiers. Making judicious choices along these dimensions led to a very
significant increase in performance over more naive approaches that have been
used till now. We evaluate our approach on the challenging TRECVID MED'14
dataset with two popular CNN architectures pretrained on ImageNet. On this
MED'14 dataset, our methods, based entirely on image-trained CNN features, can
outperform several state-of-the-art non-CNN models. Our proposed late fusion of
CNN- and motion-based features can further increase the mean average precision
(mAP) on MED'14 from 34.95% to 38.74%. The fusion approach achieves
state-of-the-art classification performance on the challenging UCF-101 dataset
- …
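The late fusion and the mAP metric mentioned in this abstract can be illustrated with a small sketch. The abstract does not say how the two score streams are combined, so the weighted average in `late_fusion` and its `weight` parameter are assumptions, and the scores and labels are made-up toy values rather than results from the paper.

```python
def late_fusion(cnn_scores, motion_scores, weight=0.5):
    """Combine per-video scores from two classifiers by weighted averaging.

    `weight` is a hypothetical mixing parameter for the CNN stream.
    """
    return [weight * c + (1.0 - weight) * m
            for c, m in zip(cnn_scores, motion_scores)]


def average_precision(scores, labels):
    """Average precision for one class: mean of precision at each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(per_class):
    """mAP: the mean of per-class AP over (scores, labels) pairs."""
    return sum(average_precision(s, l) for s, l in per_class) / len(per_class)


# Toy values for a single event class.
cnn = [0.9, 0.2, 0.4]
motion = [0.7, 0.6, 0.0]
labels = [1, 0, 1]
fused = late_fusion(cnn, motion)
print(average_precision(fused, labels))
```

Fusing after classification keeps the two feature pipelines independent, so either stream can be retrained or replaced without touching the other.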