Image classification and retrieval with random depthwise signed convolutional neural networks
We propose a random convolutional neural network to generate a feature space
in which we study image classification and retrieval performance. Put briefly,
we apply random convolutional blocks followed by global average pooling to
generate a new feature, and we repeat this k times to produce a k-dimensional
feature space. This can be interpreted as partitioning the space of image
patches with random hyperplanes which we formalize as a random depthwise
convolutional neural network. In the network's final layer we perform image
classification and retrieval with the linear support vector machine and
k-nearest neighbor classifiers, and study other empirical properties. We show
that the ratio of across-class to within-class pixel distribution similarity
is higher in our network's final layer than in the input space. When we apply
the linear support vector machine for image classification, we see
that the accuracy is higher than if we were to train just the final layer of
VGG16, ResNet18, and DenseNet40 with random weights. In the same setting we
compare it to an unsupervised feature learning method and find our accuracy to
be comparable on CIFAR10 but higher on CIFAR100 and STL10. We see that the
accuracy is not far behind that of trained networks, particularly in the top-k
setting. For example, the top-2 accuracy of our network is near 90% on both
CIFAR10 and a 10-class mini ImageNet, and 85% on STL10. We find that k-nearest
neighbor gives precision on the Corel Princeton Image Similarity Benchmark
comparable to that obtained with the final layer of trained networks. As with
other networks, we find that ours is vulnerable to a black-box attack even
though it lacks a gradient and uses the sign activation. We highlight the
sensitivity of our network to image background as both a potential pitfall and
an advantage. Overall, our work pushes the boundary of what can be achieved
with random weights.
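As a rough illustration of the construction described above, the sketch below builds k frozen random depthwise convolutional blocks with sign activations and reduces each to a single feature by global average pooling. All class names, hyperparameters, and the exact placement of the sign activation are our own assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn as nn

class RandomDepthwiseFeatures(nn.Module):
    """Sketch (assumed layout): k frozen random depthwise conv blocks with
    sign activations, each global-average-pooled to one scalar feature."""
    def __init__(self, channels=3, k=512, depth=2, kernel=3):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.ModuleList([
                nn.Conv2d(channels, channels, kernel, padding=kernel // 2,
                          groups=channels)   # depthwise: one filter per channel
                for _ in range(depth)
            ])
            for _ in range(k)
        ])
        for p in self.parameters():          # random hyperplanes, never trained
            p.requires_grad_(False)

    @torch.no_grad()
    def forward(self, x):
        feats = []
        for block in self.blocks:
            h = x
            for conv in block:
                h = torch.sign(conv(h))      # sign activation after each conv
            feats.append(h.mean(dim=(1, 2, 3)))  # global average pooling
        return torch.stack(feats, dim=1)     # (batch, k) feature space
```

The resulting (batch, k) matrix would then be handed to a linear SVM or k-nearest neighbor classifier, matching the evaluation setup the abstract describes.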
BlockQNN: Efficient Block-wise Neural Network Architecture Generation
Convolutional neural networks have achieved remarkable success in computer
vision. However, most usable network architectures are hand-crafted and usually
require expertise and elaborate design. In this paper, we provide a block-wise
network generation pipeline called BlockQNN which automatically builds
high-performance networks using the Q-Learning paradigm with an epsilon-greedy
exploration strategy. The optimal network block is constructed by the learning
agent, which is trained to choose component layers sequentially. The learned
block is then stacked to construct the whole auto-generated network. To accelerate the
generation process, we also propose a distributed asynchronous framework and an
early stop strategy. The block-wise generation brings unique advantages: (1) it
yields state-of-the-art results in comparison to hand-crafted networks on
image classification; in particular, the best network generated by BlockQNN
achieves a 2.35% top-1 error rate on CIFAR-10. (2) it offers a tremendous
reduction of the search space in designing networks, requiring only 3 days with 32 GPUs. A
faster version can yield a comparable result with only 1 GPU in 20 hours. (3)
it has strong generalizability in that the network built on CIFAR also performs
well on a larger-scale dataset. The best network achieves a very competitive
accuracy of 82.0% top-1 and 96.0% top-5 on ImageNet. Comment: 14 pages, 18 figures
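The generation loop pairs tabular Q-learning with epsilon-greedy exploration over a space of component layers. A minimal sketch of that loop follows; the action set, state encoding, and reward stub are placeholders, since the paper's block encoding and distributed training details are richer than shown here.

```python
import random

# Hypothetical action set; BlockQNN's real encoding also captures connections.
ACTIONS = ["conv3x3", "conv5x5", "maxpool", "avgpool", "identity", "terminal"]

def epsilon_greedy(q, state, eps):
    """Pick the next component layer: explore with probability eps, else exploit."""
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q.get((state, a), 0.0))

def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=1.0):
    """Standard tabular Q-learning update for the (state, action) pair."""
    best_next = max(q.get((next_state, a), 0.0) for a in ACTIONS)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)

def run_episode(q, eps, max_layers=5, evaluate=lambda block: 0.0):
    """Build one block layer by layer; the reward (validation accuracy of the
    network assembled from the finished block) is stubbed out here."""
    block, state = [], ()
    for _ in range(max_layers):
        action = epsilon_greedy(q, state, eps)
        block.append(action)
        next_state = state + (action,)
        done = action == "terminal" or len(block) == max_layers
        reward = evaluate(block) if done else 0.0  # accuracy only at episode end
        q_update(q, state, action, reward, next_state)
        state = next_state
        if done:
            break
    return block
```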
Selective Kernel Networks
In standard Convolutional Neural Networks (CNNs), the receptive fields of
artificial neurons in each layer are designed to share the same size. It is
well-known in the neuroscience community that the receptive field sizes of
visual cortical neurons are modulated by the stimulus, a property that has
rarely been considered in constructing CNNs. We propose a dynamic selection
mechanism in
CNNs that allows each neuron to adaptively adjust its receptive field size
based on multiple scales of input information. A building block called
Selective Kernel (SK) unit is designed, in which multiple branches with
different kernel sizes are fused using softmax attention that is guided by the
information in these branches. Different attentions on these branches yield
different sizes of the effective receptive fields of neurons in the fusion
layer. Multiple SK units are stacked to form a deep network termed Selective Kernel
Networks (SKNets). On the ImageNet and CIFAR benchmarks, we empirically show
that SKNet outperforms the existing state-of-the-art architectures with lower
model complexity. Detailed analyses show that the neurons in SKNet can capture
target objects at different scales, verifying their capability to adaptively
adjust their receptive field sizes according to the input.
The code and models are available at https://github.com/implus/SKNet. Comment: CVPR 2019
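The split-fuse-select mechanism of an SK unit can be summarized in a few lines of PyTorch. The sketch below uses two plain branches (3x3 and 5x5) and a squeeze layer before the softmax attention; the published SKNet realizes the 5x5 branch with a dilated 3x3 convolution and differs in other details.

```python
import torch
import torch.nn as nn

class SKUnit(nn.Module):
    """Sketch of a Selective Kernel unit: two branches with different kernel
    sizes, fused by channel-wise softmax attention computed from their sum."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.branch3 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.branch5 = nn.Sequential(
            nn.Conv2d(channels, channels, 5, padding=2, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        hidden = max(channels // reduction, 4)
        self.squeeze = nn.Sequential(nn.Linear(channels, hidden), nn.ReLU(inplace=True))
        self.att3 = nn.Linear(hidden, channels)
        self.att5 = nn.Linear(hidden, channels)

    def forward(self, x):
        u3, u5 = self.branch3(x), self.branch5(x)
        s = (u3 + u5).mean(dim=(2, 3))        # fuse: global average pooling
        z = self.squeeze(s)                   # compact channel descriptor
        a = torch.softmax(torch.stack([self.att3(z), self.att5(z)], dim=1), dim=1)
        a3 = a[:, 0].unsqueeze(-1).unsqueeze(-1)  # per-channel branch weights
        a5 = a[:, 1].unsqueeze(-1).unsqueeze(-1)
        return a3 * u3 + a5 * u5              # select: attention-weighted sum
```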
Convolutional Neural Networks with Layer Reuse
A convolutional layer in a Convolutional Neural Network (CNN) consists of
many filters which apply the convolution operation to the input, capture
particular patterns, and pass the result to the next layer. If the same
patterns also occur at deeper layers of the network, why not use the same
convolutional filters in those layers as well? In this paper, we propose a
CNN architecture, Layer Reuse Network (LruNet), where convolutional layers
are used repeatedly without introducing new layers to obtain better
performance. This approach brings several advantages: (i) a considerable
number of parameters is saved since layers are reused instead of introduced
anew, (ii) the Memory Access Cost (MAC) can be reduced since reused layer
parameters need to be fetched only once, (iii) the number of nonlinearities
increases with layer reuse, and (iv) reused layers receive gradient
updates from multiple parts of the network. The proposed approach is evaluated
on the CIFAR-10, CIFAR-100, and Fashion-MNIST datasets for the image
classification task, and layer reuse improves performance by 5.14%, 5.85%, and
2.29%, respectively. The source code and pretrained models are publicly
available. Comment: Computer Vision and Pattern Recognition
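The core idea reduces to a loop that pushes activations through one shared block several times. The sketch below is our simplified reading: the actual LruNet builds its reused block from depthwise separable convolutions and channel shuffling, and the widths here are illustrative.

```python
import torch.nn as nn

class LayerReuseNet(nn.Module):
    """Sketch: a single convolutional block applied repeatedly in place of a
    stack of distinct layers (simplified relative to the paper's LruNet)."""
    def __init__(self, channels=32, num_classes=10, reuse=4):
        super().__init__()
        self.stem = nn.Conv2d(3, channels, 3, padding=1)
        self.shared = nn.Sequential(          # the one reused block
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True))
        self.reuse = reuse
        self.head = nn.Linear(channels, num_classes)

    def forward(self, x):
        h = self.stem(x)
        for _ in range(self.reuse):           # same parameters every pass;
            h = self.shared(h)                # each pass adds a nonlinearity
        return self.head(h.mean(dim=(2, 3)))  # global average pool + classifier
```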
Hello Edge: Keyword Spotting on Microcontrollers
Keyword spotting (KWS) is a critical component for enabling speech-based user
interactions on smart devices. It requires real-time response and high accuracy
for good user experience. Recently, neural networks have become an attractive
choice for KWS architecture because of their superior accuracy compared to
traditional speech processing algorithms. Due to its always-on nature, a KWS
application has a highly constrained power budget and typically runs on tiny
microcontrollers with limited memory and compute capability. The design of
neural network architecture for KWS must consider these constraints. In this
work, we perform neural network architecture evaluation and exploration for
running KWS on resource-constrained microcontrollers. We train various neural
network architectures for keyword spotting published in literature to compare
their accuracy and memory/compute requirements. We show that it is possible to
optimize these neural network architectures to fit within the memory and
compute constraints of microcontrollers without sacrificing accuracy. We
further explore the depthwise separable convolutional neural network (DS-CNN)
and compare it against other neural network architectures. DS-CNN achieves an
accuracy of 95.4%, which is ~10% higher than the DNN model with a similar
number of parameters. Comment: Code available on GitHub at
https://github.com/ARM-software/ML-KWS-for-MC
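A depthwise separable convolution factorizes a standard convolution into a per-channel spatial filter followed by a 1x1 pointwise mix, which is what makes DS-CNN cheap enough for microcontrollers. A minimal PyTorch sketch of such a block follows; the layer widths and the toy model are illustrative, not the paper's exact architecture.

```python
import torch.nn as nn

def ds_conv(in_ch, out_ch, stride=1):
    """Depthwise separable convolution: 3x3 depthwise then 1x1 pointwise."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1,
                  groups=in_ch, bias=False),      # depthwise: one filter per channel
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),  # pointwise channel mixing
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

# Toy KWS classifier over (1, time, mel) spectrogram inputs, 12 keywords.
kws_model = nn.Sequential(
    nn.Conv2d(1, 64, 3, padding=1),
    ds_conv(64, 64), ds_conv(64, 64),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 12))
```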
Divide and Conquer: A Deep CASA Approach to Talker-independent Monaural Speaker Separation
We address talker-independent monaural speaker separation from the
perspectives of deep learning and computational auditory scene analysis (CASA).
Specifically, we decompose the multi-speaker separation task into the stages of
simultaneous grouping and sequential grouping. Simultaneous grouping is first
performed in each time frame by separating the spectra of different speakers
with a permutation-invariantly trained neural network. In the second stage, the
frame-level separated spectra are sequentially grouped to different speakers by
a clustering network. The proposed deep CASA approach optimizes frame-level
separation and speaker tracking in turn, and produces excellent results for
both objectives. Experimental results on the benchmark WSJ0-2mix database show
that the new approach achieves state-of-the-art results with a modest model
size. Comment: 10 pages, 5 figures
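The first stage relies on permutation-invariant training: the network's per-frame outputs are matched to reference speakers under the best permutation before the loss is computed. A generic two-speaker version of that loss (a sketch, not the paper's exact objective) looks like this:

```python
import itertools
import torch

def pit_loss(est, ref):
    """Frame-level permutation-invariant loss for two speakers (sketch).
    est, ref: (batch, speakers, freq) magnitude spectra for one frame."""
    losses = []
    for perm in itertools.permutations(range(est.shape[1])):
        diff = est[:, list(perm)] - ref        # try this speaker assignment
        losses.append((diff ** 2).mean(dim=(1, 2)))
    # keep, per example, the loss of the best-matching permutation
    return torch.stack(losses, dim=1).min(dim=1).values.mean()
```

The second stage then resolves speaker identity across frames with a clustering network, stitching frame-level permutations into consistent speaker tracks.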
Fully Learnable Group Convolution for Acceleration of Deep Neural Networks
Benefiting from its great success on many tasks, deep learning is
increasingly used on low-computational-cost devices such as smartphones and
embedded devices. To reduce the high computational and memory cost, in this work,
we propose a fully learnable group convolution module (FLGC for short) which is
quite efficient and can be embedded into any deep neural networks for
acceleration. Specifically, our proposed method automatically learns the group
structure in the training stage in a fully end-to-end manner, leading to a
better structure than existing pre-defined, two-step, or iterative
strategies. Moreover, our method can be further combined with depthwise
separable convolution, resulting in a 5x acceleration over the vanilla
ResNet50 on a single CPU. An additional advantage is that in our FLGC the number
of groups can be set to any value, not necessarily 2^k as in most existing
methods, allowing a better tradeoff between accuracy and speed. As evaluated in
our experiments, our method achieves better performance than existing learnable
group convolution and standard group convolution when using the same number of
groups. Comment: Accepted by CVPR 2019
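Our reading of the mechanism is that channel-to-group and filter-to-group assignments become trainable parameters, relaxed so gradients can flow. The sketch below encodes that reading with a softmax relaxation masking a full 1x1 convolution; the paper's actual formulation and binarization scheme differ, so treat this purely as an illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableGroupConv1x1(nn.Module):
    """Illustrative FLGC-style layer: soft channel->group and filter->group
    assignments, learned end-to-end, mask a full 1x1 convolution."""
    def __init__(self, in_ch, out_ch, groups):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch) * 0.01)
        self.in_assign = nn.Parameter(torch.zeros(in_ch, groups))    # logits
        self.out_assign = nn.Parameter(torch.zeros(out_ch, groups))  # logits

    def forward(self, x):                         # x: (batch, in_ch, H, W)
        pin = F.softmax(self.in_assign, dim=1)    # soft channel -> group
        pout = F.softmax(self.out_assign, dim=1)  # soft filter -> group
        mask = pout @ pin.t()                     # (out_ch, in_ch) same-group affinity
        w = (self.weight * mask).view(*self.weight.shape, 1, 1)
        return F.conv2d(x, w)                     # group structure emerges in the mask
```

Note that nothing here constrains the number of groups to a power of two, which is the flexibility the abstract highlights.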
FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search
Designing accurate and efficient ConvNets for mobile devices is challenging
because the design space is combinatorially large. Due to this, previous neural
architecture search (NAS) methods are computationally expensive. ConvNet
architecture optimality depends on factors such as input resolution and target
devices. However, existing approaches are too expensive for case-by-case
redesigns. Also, previous work focuses primarily on reducing FLOPs, but FLOP
count does not always reflect actual latency. To address these, we propose a
differentiable neural architecture search (DNAS) framework that uses
gradient-based methods to optimize ConvNet architectures, avoiding enumerating
and training individual architectures separately as in previous methods.
FBNets, a family of models discovered by DNAS, surpass state-of-the-art models
both designed manually and generated automatically. FBNet-B achieves 74.1%
top-1 accuracy on ImageNet with 295M FLOPs and 23.1 ms latency on a Samsung S8
phone, 2.4x smaller and 1.5x faster than MobileNetV2-1.3 with similar accuracy.
Although FBNet-B has higher accuracy and lower latency than MnasNet, we
estimate its search cost to be 420x smaller, at only 216 GPU-hours. Searched for
different resolutions and channel sizes, FBNets achieve 1.5% to 6.4% higher
accuracy than MobileNetV2. The smallest FBNet achieves 50.2% accuracy and 2.9
ms latency (345 frames per second) on a Samsung S8. Over a Samsung-optimized
FBNet, the iPhone-X-optimized model achieves a 1.4x speedup on an iPhone X.
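The DNAS idea is to make the choice among candidate blocks differentiable, typically by mixing candidates with Gumbel-softmax weights and penalizing the expected latency. The sketch below shows one such super-layer; the candidate set, latency table, and loss weighting are illustrative, not FBNet's actual search space.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DifferentiableChoice(nn.Module):
    """Sketch of a DNAS super-layer: candidate ops mixed by Gumbel-softmax
    weights, with an expected-latency estimate carried alongside."""
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 5, padding=2),
            nn.Identity()])
        self.latency = torch.tensor([3.0, 7.0, 0.0])  # e.g. measured ms per op
        self.theta = nn.Parameter(torch.zeros(len(self.ops)))  # architecture logits

    def forward(self, x, tau=1.0):
        g = F.gumbel_softmax(self.theta, tau=tau)     # differentiable sampling
        y = sum(w * op(x) for w, op in zip(g, self.ops))
        return y, (g * self.latency.to(g)).sum()      # expected layer latency

# Training would minimize task_loss + alpha * total_expected_latency, so the
# architecture logits trade accuracy against latency on the target device.
```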
Depthwise Separable Convolutions Allow for Fast and Memory-Efficient Spectral Normalization
An increasing number of models require control of the spectral norm of the
convolutional layers of a neural network. While there is an abundance of
methods for estimating and enforcing upper bounds on these norms during
training, they are typically costly in either memory or time. In this work, we introduce
a very simple method for spectral normalization of depthwise separable
convolutions, which introduces negligible computational and memory overhead. We
demonstrate the effectiveness of our method on image classification tasks using
standard architectures like MobileNetV2.
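Separability is what makes this cheap: under circular padding each depthwise channel acts as a circulant operator whose singular values are the DFT magnitudes of its kernel, and the pointwise 1x1 convolution is just a matrix. The sketch below computes a simple product upper bound from those two pieces; the paper's exact estimator may differ, and the circulant identity assumes circular padding with stride 1.

```python
import torch

def depthwise_spectral_norms(dw_kernel, h, w):
    """Per-channel spectral norms of a depthwise conv under circular padding:
    singular values are the 2D DFT magnitudes of the zero-padded kernel.
    dw_kernel: (C, 1, k, k) with k <= min(h, w)."""
    c, _, k, _ = dw_kernel.shape
    padded = torch.zeros(c, h, w)
    padded[:, :k, :k] = dw_kernel[:, 0]
    return torch.fft.fft2(padded).abs().amax(dim=(1, 2))   # (C,)

def separable_norm_bound(dw_kernel, pw_weight, h, w):
    """Upper bound for depthwise-then-pointwise: product of the largest
    depthwise channel norm and the pointwise matrix spectral norm.
    pw_weight: (C_out, C_in, 1, 1)."""
    dw = depthwise_spectral_norms(dw_kernel, h, w).max()
    pw = torch.linalg.matrix_norm(pw_weight[..., 0, 0], ord=2)
    return dw * pw
```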
Non-Volume Preserving-based Feature Fusion Approach to Group-Level Expression Recognition on Crowd Videos
Group-level emotion recognition (ER) is a growing research area, as the
demand for assessing crowds of all sizes is of increasing interest in both the
security arena and social media. This work extends earlier ER
investigations, which focused on either group-level ER on single images or
within a video, by fully investigating group-level expression recognition on
crowd videos. In this paper, we propose an effective deep feature level fusion
mechanism to model the spatial-temporal information in the crowd videos. In our
approach, the fusion process is performed in the deep feature domain by a
generative probabilistic model, Non-Volume Preserving Fusion (NVPF), that
models spatial relationships. Furthermore, we extend our proposed spatial
NVPF approach to a spatial-temporal NVPF (TNVPF) approach to learn the temporal
information between frames. In order to demonstrate the robustness and
effectiveness of each component in the proposed approach, three experiments
were conducted: (i) evaluation on the AffectNet database to benchmark the
proposed EmoNet for facial expression recognition; (ii) evaluation on EmotiW2018 to
benchmark the proposed deep feature level fusion mechanism NVPF; and, (iii)
examine the proposed TNVPF on an innovative Group-level Emotion on Crowd Videos
(GECV) dataset composed of 627 videos collected from publicly available
sources. The GECV dataset is a collection of videos containing crowds of
people. Each video is labeled with emotion categories at three levels:
individual faces, groups of people, and the entire video frame. Comment: Under
review at Pattern Recognition
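The abstract describes NVPF only at a high level, but its name points to the non-volume-preserving (RealNVP-style) family of invertible transforms, whose standard building block is the affine coupling layer. Purely as background, a minimal coupling layer over a fused feature vector might look as follows; it is not the paper's model.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Generic non-volume-preserving coupling layer (RealNVP-style sketch):
    half the features are scaled and shifted as a function of the other half."""
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim // 2, hidden), nn.ReLU(),
            nn.Linear(hidden, dim))               # outputs [log_scale, shift]

    def forward(self, x):                         # x: (batch, dim), dim even
        x1, x2 = x.chunk(2, dim=1)
        log_s, t = self.net(x1).chunk(2, dim=1)
        y2 = x2 * torch.exp(log_s) + t            # non-volume-preserving map
        return torch.cat([x1, y2], dim=1), log_s.sum(dim=1)  # log|det J|
```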