Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning
Very deep convolutional networks have been central to the largest advances in
image recognition performance in recent years. One example is the Inception
architecture that has been shown to achieve very good performance at relatively
low computational cost. Recently, the introduction of residual connections in
conjunction with a more traditional architecture has yielded state-of-the-art
performance in the 2015 ILSVRC challenge; its performance was similar to the
latest generation Inception-v3 network. This raises the question of whether
there is any benefit in combining the Inception architecture with residual
connections. Here we give clear empirical evidence that training with residual
connections accelerates the training of Inception networks significantly. There
is also some evidence of residual Inception networks outperforming similarly
expensive Inception networks without residual connections by a thin margin. We
also present several new streamlined architectures for both residual and
non-residual Inception networks. These variations improve the single-frame
recognition performance on the ILSVRC 2012 classification task significantly.
We further demonstrate how proper activation scaling stabilizes the training of
very wide residual Inception networks. With an ensemble of three residual
networks and one Inception-v4 network, we achieve 3.08 percent top-5 error on
the test set of the ImageNet classification (CLS) challenge.
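The "proper activation scaling" mentioned above refers to down-scaling the residual branch before it is added to the shortcut; the paper reports that factors roughly between 0.1 and 0.3 stabilize training. A minimal PyTorch sketch of the idea follows; the convolutional branch is a hypothetical stand-in for a full Inception-ResNet block, not the paper's architecture:

```python
import torch
import torch.nn as nn

class ScaledResidualBlock(nn.Module):
    """Residual block whose branch output is scaled before the addition,
    as suggested for very wide residual Inception networks. The branch
    here is a placeholder; a real Inception-ResNet block mixes several
    parallel convolution paths before the residual addition."""

    def __init__(self, channels: int, scale: float = 0.1):
        super().__init__()
        self.scale = scale
        self.branch = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Down-scale the residual branch before adding it to the shortcut.
        return self.relu(x + self.scale * self.branch(x))

if __name__ == "__main__":
    block = ScaledResidualBlock(channels=64, scale=0.1)
    out = block(torch.randn(1, 64, 32, 32))
    print(out.shape)  # torch.Size([1, 64, 32, 32])
```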
Rethinking the Inception Architecture for Computer Vision
Convolutional networks are at the core of most state-of-the-art computer
vision solutions for a wide variety of tasks. Since 2014 very deep
convolutional networks started to become mainstream, yielding substantial gains
in various benchmarks. Although increased model size and computational cost
tend to translate to immediate quality gains for most tasks (as long as enough
labeled data is provided for training), computational efficiency and low
parameter count are still enabling factors for various use cases such as mobile
vision and big-data scenarios. Here we explore ways to scale up networks that
aim to use the added computation as efficiently as possible through suitably
factorized convolutions and aggressive regularization. We benchmark our
methods on the ILSVRC 2012 classification challenge validation set,
demonstrating substantial gains over the state of the art: 21.2% top-1 and
5.6% top-5 error for single-frame evaluation using a network with a
computational cost of 5 billion multiply-adds per inference and fewer than 25
million parameters. With an ensemble of 4 models and multi-crop evaluation, we
report 3.5% top-5 error on the validation set (3.6% error on the test set) and
17.3% top-1 error on the validation set.
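Two of the factorizations the paper explores can be sketched directly: replacing a 5x5 convolution with two stacked 3x3 convolutions, and replacing an nxn convolution with a 1xn followed by an nx1 convolution. A minimal PyTorch sketch; the channel counts and ReLU placement are illustrative assumptions, not the network's actual configuration:

```python
import torch
import torch.nn as nn

def factorized_5x5(in_ch: int, out_ch: int) -> nn.Sequential:
    """Two 3x3 convolutions cover the same 5x5 receptive field with
    fewer weights per channel pair (2*9 = 18 instead of 25)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
    )

def factorized_nxn(in_ch: int, out_ch: int, n: int = 7) -> nn.Sequential:
    """Asymmetric factorization: a 1xn then an nx1 convolution,
    reducing n*n weights per channel pair to 2*n."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=(1, n), padding=(0, n // 2)),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=(n, 1), padding=(n // 2, 0)),
        nn.ReLU(inplace=True),
    )

if __name__ == "__main__":
    x = torch.randn(1, 32, 17, 17)
    print(factorized_5x5(32, 64)(x).shape)       # torch.Size([1, 64, 17, 17])
    print(factorized_nxn(32, 64, n=7)(x).shape)  # torch.Size([1, 64, 17, 17])
```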
Going Deeper with Convolutions
We propose a deep convolutional neural network architecture codenamed
"Inception", which was responsible for setting the new state of the art for
classification and detection in the ImageNet Large-Scale Visual Recognition
Challenge 2014 (ILSVRC 2014). The main hallmark of this architecture is the
improved utilization of the computing resources inside the network. This was
achieved by a carefully crafted design that allows for increasing the depth and
width of the network while keeping the computational budget constant. To
optimize quality, the architectural decisions were based on the Hebbian
principle and the intuition of multi-scale processing. One particular
incarnation used in our submission for ILSVRC 2014 is called GoogLeNet, a
network 22 layers deep, the quality of which is assessed in the context of
classification and detection.
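The Inception module at the heart of GoogLeNet processes its input through parallel 1x1, 3x3, and 5x5 convolution branches plus a pooled branch, with 1x1 "reduction" convolutions keeping the 3x3 and 5x5 paths cheap. A minimal PyTorch sketch; the branch widths below follow the inception(3a) values from the paper's architecture table, while the ReLU placement is simplified:

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Four parallel branches concatenated along the channel axis.
    1x1 convolutions reduce channel depth before the expensive 3x3/5x5
    paths, which keeps the computational budget roughly constant as
    the network grows wider and deeper."""

    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, c1, kernel_size=1)
        self.b2 = nn.Sequential(
            nn.Conv2d(in_ch, c3_red, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(c3_red, c3, kernel_size=3, padding=1),
        )
        self.b3 = nn.Sequential(
            nn.Conv2d(in_ch, c5_red, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(c5_red, c5, kernel_size=5, padding=2),
        )
        self.b4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, kernel_size=1),
        )

    def forward(self, x):
        # Multi-scale processing: each branch sees a different receptive
        # field; concatenation merges the scales into one feature map.
        return torch.cat(
            [self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

if __name__ == "__main__":
    m = InceptionModule(192, 64, 96, 128, 16, 32, 32)  # inception(3a)
    print(m(torch.randn(1, 192, 28, 28)).shape)  # torch.Size([1, 256, 28, 28])
```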
Using Simulation and Domain Adaptation to Improve Efficiency of Deep Robotic Grasping
Instrumenting and collecting annotated visual grasping datasets to train
modern machine learning algorithms can be extremely time-consuming and
expensive. An appealing alternative is to use off-the-shelf simulators to
render synthetic data for which ground-truth annotations are generated
automatically. Unfortunately, models trained purely on simulated data often
fail to generalize to the real world. We study how randomized simulated
environments and domain adaptation methods can be extended to train a grasping
system to grasp novel objects from raw monocular RGB images. We extensively
evaluate our approaches with a total of more than 25,000 physical test grasps,
studying a range of simulation conditions and domain adaptation methods,
including a novel extension of pixel-level domain adaptation that we term the
GraspGAN. We show that, by using synthetic data and domain adaptation, we are
able to reduce the number of real-world samples needed to achieve a given level
of performance by up to 50 times, using only randomly generated simulated
objects. We also show that by using only unlabeled real-world data and our
GraspGAN methodology, we obtain real-world grasping performance without any
real-world labels that is similar to that achieved with 939,777 labeled
real-world samples.
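Pixel-level domain adaptation of the kind GraspGAN extends can be sketched as an image-to-image GAN: a generator refines simulated images toward the real-image distribution while a discriminator learns to tell refined images from real ones. The sketch below is a generic skeleton under assumed layer sizes and a plain BCE adversarial loss; GraspGAN's actual architectures and task-specific losses are described in the paper:

```python
import torch
import torch.nn as nn

IMG = 64  # assumed square input resolution (illustrative)

G = nn.Sequential(  # simulated image -> "realistic" image, same resolution
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1), nn.Tanh(),
)
D = nn.Sequential(  # image -> real/fake logit
    nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Flatten(), nn.Linear(64 * (IMG // 4) ** 2, 1),
)
bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

def train_step(sim_batch, real_batch):
    # Discriminator: push real images toward 1, refined sim images toward 0.
    fake = G(sim_batch)
    d_loss = (bce(D(real_batch), torch.ones(real_batch.size(0), 1))
              + bce(D(fake.detach()), torch.zeros(sim_batch.size(0), 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: refine sim images so the discriminator scores them as real.
    g_loss = bce(D(G(sim_batch)), torch.ones(sim_batch.size(0), 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

train_step(torch.rand(4, 3, IMG, IMG) * 2 - 1,
           torch.rand(4, 3, IMG, IMG) * 2 - 1)
```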