CNN Architectures for Large-Scale Audio Classification
Convolutional Neural Networks (CNNs) have proven very effective in image
classification and show promise for audio. We use various CNN architectures to
classify the soundtracks of a dataset of 70M training videos (5.24 million
hours) with 30,871 video-level labels. We examine fully connected Deep Neural
Networks (DNNs), AlexNet [1], VGG [2], Inception [3], and ResNet [4]. We
investigate varying the size of both training set and label vocabulary, finding
that analogs of the CNNs used in image classification do well on our audio
classification task, and larger training and label sets help up to a point. A
model using embeddings from these classifiers does much better than raw
features on the Audio Set [5] Acoustic Event Detection (AED) classification
task.
Comment: Accepted for publication at ICASSP 2017. Changes: added definitions of
mAP, AUC, and d-prime; updated mAP/AUC/d-prime numbers for Audio Set based on
changes in the latest Audio Set revision; changed wording to fit the 4-page
limit with the new additions.
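As a concrete illustration of the pipeline the abstract describes, here is a minimal PyTorch sketch of a small VGG-style classifier over log-mel spectrogram patches whose penultimate layer doubles as a reusable embedding. The 96-frame/64-mel patch size, layer widths, and embedding dimension are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class AudioCNN(nn.Module):
    """Small VGG-style classifier over log-mel spectrogram patches.

    The penultimate activations serve as an audio embedding that can be
    reused for downstream tasks such as Audio Set AED.
    """
    def __init__(self, num_labels: int = 30871, embed_dim: int = 128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.embed = nn.Linear(128, embed_dim)        # embedding layer
        self.head = nn.Linear(embed_dim, num_labels)  # multi-label logits

    def forward(self, x):                      # x: (batch, 1, frames, mels)
        h = self.features(x).mean(dim=(2, 3))  # global average pooling
        emb = self.embed(h)
        return self.head(emb), emb

model = AudioCNN()
patches = torch.randn(8, 1, 96, 64)   # eight 96-frame, 64-mel patches
logits, embeddings = model(patches)   # embeddings reusable for AED
```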
Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning
Very deep convolutional networks have been central to the largest advances in
image recognition performance in recent years. One example is the Inception
architecture that has been shown to achieve very good performance at relatively
low computational cost. Recently, the introduction of residual connections in
conjunction with a more traditional architecture has yielded state-of-the-art
performance in the 2015 ILSVRC challenge; its performance was similar to the
latest generation Inception-v3 network. This raises the question of whether
there is any benefit in combining the Inception architecture with residual
connections. Here we give clear empirical evidence that training with residual
connections accelerates the training of Inception networks significantly. There
is also some evidence of residual Inception networks outperforming similarly
expensive Inception networks without residual connections by a thin margin. We
also present several new streamlined architectures for both residual and
non-residual Inception networks. These variations improve the single-frame
recognition performance on the ILSVRC 2012 classification task significantly.
We further demonstrate how proper activation scaling stabilizes the training of
very wide residual Inception networks. With an ensemble of three residual and
one Inception-v4, we achieve 3.08 percent top-5 error on the test set of the
ImageNet classification (CLS) challenge.
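The activation scaling mentioned above has a simple form: the residual branch is multiplied by a small constant before being added back to the shortcut. Below is a minimal PyTorch sketch assuming a generic two-convolution branch; the block structure and the 0.2 factor are illustrative (the paper reports factors of roughly 0.1 to 0.3).

```python
import torch
import torch.nn as nn

class ScaledResidualBlock(nn.Module):
    """Residual branch scaled by a small constant before the addition."""
    def __init__(self, channels: int, scale: float = 0.2):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.scale = scale  # illustrative; paper uses roughly 0.1-0.3

    def forward(self, x):
        # Down-scaling the residual keeps activation magnitudes bounded
        # early in training, which is what stabilizes very wide networks.
        return torch.relu(x + self.scale * self.branch(x))

block = ScaledResidualBlock(256)
y = block(torch.randn(2, 256, 35, 35))  # shape preserved: (2, 256, 35, 35)
```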
Benchmark Analysis of Representative Deep Neural Network Architectures
This work presents an in-depth analysis of the majority of state-of-the-art
deep neural networks (DNNs) proposed for image recognition. For
each DNN multiple performance indices are observed, such as recognition
accuracy, model complexity, computational complexity, memory usage, and
inference time. The behavior of such performance indices and some combinations
of them are analyzed and discussed. To measure the indices, we run the
DNNs on two different computer architectures: a workstation equipped
with an NVIDIA Titan X Pascal and an embedded system based on an NVIDIA Jetson
TX1 board. This experimentation allows a direct comparison between DNNs running
on machines with very different computational capacity. This study gives
researchers a complete view of the solutions explored so far and of the
research directions worth pursuing in the future, and helps practitioners
select the DNN architecture(s) that best fit the resource constraints of
practical deployments and applications. To complete this work,
all the DNNs, as well as the software used for the analysis, are available
online.
Comment: Will appear in IEEE Access.
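A minimal sketch of the kind of per-model measurement such a benchmark performs, combining parameter count with averaged forward-pass latency. The model choices and the `weights=None` torchvision constructor argument are assumptions for illustration, not the authors' actual harness.

```python
import time
import torch
import torchvision.models as models

def benchmark(model, input_size=(1, 3, 224, 224), runs=50, warmup=10):
    """Report parameter count and mean single-image forward latency."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(*input_size, device=device)
    params = sum(p.numel() for p in model.parameters())
    with torch.no_grad():
        for _ in range(warmup):            # warm up caches / cuDNN autotuner
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()       # GPU timing needs explicit sync
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
    ms = (time.perf_counter() - start) / runs * 1e3
    return params, ms

for ctor in (models.resnet18, models.mobilenet_v2):
    params, ms = benchmark(ctor(weights=None))
    print(f"{ctor.__name__}: {params / 1e6:.1f}M params, {ms:.2f} ms/image")
```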
Is Robustness the Cost of Accuracy? -- A Comprehensive Study on the Robustness of 18 Deep Image Classification Models
Prediction accuracy has long been the sole standard for comparing the
performance of different image classification models, including in the
ImageNet competition. However, recent studies have highlighted the lack of
robustness in well-trained deep neural networks to adversarial examples.
Visually imperceptible perturbations to natural images can easily be crafted
to mislead image classifiers into incorrect predictions. To demystify the
trade-offs between robustness and accuracy, in this paper we thoroughly
benchmark 18 ImageNet models using multiple robustness metrics, including the
distortion, success rate and transferability of adversarial examples between
306 pairs of models. Our extensive experimental results reveal several new
insights: (1) linear scaling law - the empirical $\ell_2$ and $\ell_\infty$
distortion metrics scale linearly with the logarithm of classification error;
(2) model architecture is a more critical factor to robustness than model size,
and the disclosed accuracy-robustness Pareto frontier can be used as an
evaluation criterion for ImageNet model designers; (3) for a similar network
architecture, increasing network depth slightly improves robustness in
$\ell_\infty$ distortion; (4) there exist models (in the VGG family) that exhibit
high adversarial transferability, while most adversarial examples crafted from
one model can only be transferred within the same family. Experiment code is
publicly available at \url{https://github.com/huanzhang12/Adversarial_Survey}.
Comment: Accepted by the European Conference on Computer Vision (ECCV) 2018.
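For intuition, a minimal sketch of measuring the $\ell_2$ and $\ell_\infty$ distortion of an adversarial example. It uses single-step FGSM as a stand-in for the stronger attacks benchmarked in the paper, and `model`, `x`, `y`, and `eps` are assumed placeholders.

```python
import torch
import torch.nn.functional as F

def fgsm_distortion(model, x, y, eps=4 / 255):
    """Craft an FGSM example and return its per-image l2/l_inf distortion."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    # One signed-gradient step, clipped to the valid image range.
    x_adv = (x + eps * x.grad.sign()).clamp(0, 1).detach()
    delta = (x_adv - x).detach()
    l2 = delta.flatten(1).norm(p=2, dim=1)           # per-image l2 norm
    linf = delta.flatten(1).abs().max(dim=1).values  # per-image l_inf norm
    return x_adv, l2, linf
```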
Speed/accuracy trade-offs for modern convolutional object detectors
The goal of this paper is to serve as a guide for selecting a detection
architecture that achieves the right speed/memory/accuracy balance for a given
application and platform. To this end, we investigate various ways to trade
accuracy for speed and memory usage in modern convolutional object detection
systems. A number of successful systems have been proposed in recent years, but
apples-to-apples comparisons are difficult due to different base feature
extractors (e.g., VGG, Residual Networks), different default image resolutions,
as well as different hardware and software platforms. We present a unified
implementation of the Faster R-CNN [Ren et al., 2015], R-FCN [Dai et al., 2016]
and SSD [Liu et al., 2015] systems, which we view as "meta-architectures" and
trace out the speed/accuracy trade-off curve created by using alternative
feature extractors and varying other critical parameters such as image size
within each of these meta-architectures. On one extreme end of this spectrum
where speed and memory are critical, we present a detector that achieves real
time speeds and can be deployed on a mobile device. On the opposite end in
which accuracy is critical, we present a detector that achieves
state-of-the-art performance measured on the COCO detection task.
Comment: Accepted to CVPR 2017.
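To illustrate the meta-architecture idea, here is a minimal sketch of assembling Faster R-CNN with an interchangeable feature extractor using torchvision. This mirrors torchvision's public API rather than the authors' TensorFlow implementation, and the MobileNetV2 backbone and `min_size` value are illustrative choices.

```python
import torchvision
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator
from torchvision.ops import MultiScaleRoIAlign

# Any classification network can serve as the feature extractor; swapping
# it (and the input resolution) traces out a speed/accuracy curve.
backbone = torchvision.models.mobilenet_v2(weights=None).features
backbone.out_channels = 1280  # FasterRCNN requires this attribute

model = FasterRCNN(
    backbone,
    num_classes=91,                       # COCO classes + background
    min_size=600,                         # image-resolution knob
    rpn_anchor_generator=AnchorGenerator(
        sizes=((32, 64, 128, 256, 512),),
        aspect_ratios=((0.5, 1.0, 2.0),)),
    box_roi_pool=MultiScaleRoIAlign(
        featmap_names=["0"], output_size=7, sampling_ratio=2),
)
model.eval()
```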
Deformable Convolutional Networks
Convolutional neural networks (CNNs) are inherently limited in modeling
geometric transformations due to the fixed geometric structures in their building
modules. In this work, we introduce two new modules to enhance the
transformation modeling capacity of CNNs, namely, deformable convolution and
deformable RoI pooling. Both are based on the idea of augmenting the spatial
sampling locations in the modules with additional offsets and learning the
offsets from target tasks, without additional supervision. The new modules can
readily replace their plain counterparts in existing CNNs and can be easily
trained end-to-end by standard back-propagation, giving rise to deformable
convolutional networks. Extensive experiments validate the effectiveness of our
approach on sophisticated vision tasks of object detection and semantic
segmentation. The code will be released.
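The core deformable-convolution idea can be sketched with torchvision's `DeformConv2d` op (a later reimplementation, not the authors' original release): a plain convolution predicts two offsets per kernel position, and the deformable convolution samples its inputs at those offset locations. Zero-initializing the offset predictor starts training from regular grid sampling.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableBlock(nn.Module):
    """Deformable convolution with offsets predicted by a plain conv."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        # Two offsets (dy, dx) per kernel position, per output location.
        self.offset = nn.Conv2d(in_ch, 2 * k * k, k, padding=k // 2)
        nn.init.zeros_(self.offset.weight)  # start from regular sampling
        nn.init.zeros_(self.offset.bias)
        self.conv = DeformConv2d(in_ch, out_ch, k, padding=k // 2)

    def forward(self, x):
        # Offsets are learned from the task loss; no extra supervision.
        return self.conv(x, self.offset(x))

block = DeformableBlock(64, 128)
y = block(torch.randn(2, 64, 32, 32))  # -> (2, 128, 32, 32)
```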