Fully Convolutional Networks for Semantic Segmentation
Convolutional networks are powerful visual models that yield hierarchies of
features. We show that convolutional networks by themselves, trained
end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic
segmentation. Our key insight is to build "fully convolutional" networks that
take input of arbitrary size and produce correspondingly-sized output with
efficient inference and learning. We define and detail the space of fully
convolutional networks, explain their application to spatially dense prediction
tasks, and draw connections to prior models. We adapt contemporary
classification networks (AlexNet, the VGG net, and GoogLeNet) into fully
convolutional networks and transfer their learned representations by
fine-tuning to the segmentation task. We then define a novel architecture that
combines semantic information from a deep, coarse layer with appearance
information from a shallow, fine layer to produce accurate and detailed
segmentations. Our fully convolutional network achieves state-of-the-art
segmentation of PASCAL VOC (20% relative improvement to 62.2% mean IU on 2012),
NYUDv2, and SIFT Flow, while inference takes one third of a second for a
typical image. Comment: to appear in CVPR (2015)
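The mean IU figure quoted in this abstract is the per-class intersection over union, averaged over classes. A minimal sketch of that metric follows, assuming flattened lists of integer class labels; the function name `mean_iu` and the choice to skip classes that appear in neither the ground truth nor the prediction are assumptions made for illustration, not details taken from the paper.

```python
def mean_iu(truth, pred, num_classes):
    """Mean intersection over union for two equal-length label sequences."""
    # conf[i][j] counts pixels whose true class is i and predicted class is j.
    conf = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(truth, pred):
        conf[t][p] += 1

    ious = []
    for i in range(num_classes):
        intersection = conf[i][i]
        true_total = sum(conf[i])                                  # pixels labeled i
        pred_total = sum(conf[j][i] for j in range(num_classes))   # pixels predicted i
        union = true_total + pred_total - intersection
        if union > 0:  # assumption: ignore classes absent from both inputs
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    # Class 0: IU = 1/2, class 1: IU = 2/3.
    print(mean_iu([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```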
Performance Primitives for Artificial Neural Networks
Optimized software implementations of artificial neural networks leverage primitives from performance libraries, such as the BLAS. However, these primitives were prototyped decades ago, and do not necessarily reflect the patterns of computations in neural networks. I propose modifications to common primitives provided by performance libraries to make them better building blocks for artificial neural networks, with a focus on inference, i.e. evaluation of a pre-trained artificial neural network. I suggest three classes of performance primitives for the convolutional operators and two optimized building blocks for softmax operators.
High-intensity convolutional operators with large kernel sizes and unit stride benefit from asymptotically fast convolution algorithms based on the Winograd transform and the Fast Fourier transform. I jointly consider the Fourier or Winograd transform and the matrix-matrix multiplication of blocks of transformed coefficients, and suggest a tuple-GEMM primitive which balances the number of irregular memory writes in the transformation with sufficient register blocking and instruction-level parallelism in the matrix-matrix multiplication part. The tuple-GEMM primitive can be thought of as a batched GEMM with a fixed architecture-dependent batch size and can be efficiently implemented as a modification of the Goto matrix-matrix multiplication algorithm. I additionally analyze small 2D Fast Fourier transforms, and suggest options that work best for modern wide-SIMD processors.
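To illustrate how a Winograd-style transform trades multiplications for additions, the sketch below computes the standard one-dimensional F(2,3) case: two outputs of a 3-tap filter from four inputs using four multiplications instead of six. It is only a toy instance of the idea; the function names are illustrative, and the tuple-GEMM batching described in the abstract is not shown.

```python
def winograd_f23(d, g):
    """Two outputs of a 3-tap correlation over 4 inputs, using 4 multiplications."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # Filter transform (can be precomputed once per filter).
    u1 = (g0 + g1 + g2) / 2.0
    u2 = (g0 - g1 + g2) / 2.0
    # Four element-wise products in the transformed domain.
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * u1
    m3 = (d2 - d1) * u2
    m4 = (d1 - d3) * g2
    # Output transform.
    return [m1 + m2 + m3, m2 - m3 - m4]


def direct_f23(d, g):
    """Reference: the same two outputs with 6 multiplications."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]


if __name__ == "__main__":
    d, g = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0]
    print(winograd_f23(d, g), direct_f23(d, g))
```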
Lower-intensity convolutional operators with small kernel sizes, non-unit strides, or dilation do not benefit from the fast convolution algorithms and require a different set of optimizations. To accelerate these cases I suggest replacing the traditional GEMM primitive with a novel Indirect GEMM primitive. The Indirect GEMM primitive is a slight modification of GEMM and can leverage the extensive research on efficient GEMM implementations. I further introduce the Indirect Convolution algorithm, which builds on top of the Indirect GEMM primitive, eliminates the runtime overhead of patch-building memory transformations, and substantially reduces the memory complexity of convolutional operators compared to the traditional GEMM-based algorithms.
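The idea of replacing patch copies with indirection can be sketched as follows. Instead of copying input patches into a matrix (im2col), the code builds a buffer of references to input pixel rows and lets the multiplication loop read through them. This is a simplified illustration, not the library implementation: the data layout (`inp[y][x]` is a list of channel values, `weights[o][tap][c]` with `tap = ky * kw + kx`), the shared zero row for padding, and all function names are assumptions.

```python
def build_indirection(inp, kh, kw, stride=1, pad=0):
    """For every output pixel, collect references to the input rows it reads."""
    h, w = len(inp), len(inp[0])
    channels = len(inp[0][0])
    zero = [0.0] * channels  # one shared row stands in for all padded positions
    out_h = (h + 2 * pad - kh) // stride + 1
    out_w = (w + 2 * pad - kw) // stride + 1
    buf = []
    for oy in range(out_h):
        for ox in range(out_w):
            taps = []
            for ky in range(kh):
                for kx in range(kw):
                    y = oy * stride + ky - pad
                    x = ox * stride + kx - pad
                    inside = 0 <= y < h and 0 <= x < w
                    taps.append(inp[y][x] if inside else zero)  # reference, no copy
            buf.append(taps)
    return buf, out_h, out_w


def indirect_conv(inp, weights, kh, kw, stride=1, pad=0):
    """Convolution whose inner product reads inputs through the indirection buffer."""
    buf, out_h, out_w = build_indirection(inp, kh, kw, stride, pad)
    flat = []
    for taps in buf:
        pixel = []
        for w_out in weights:  # one output channel at a time
            acc = 0.0
            for w_tap, row in zip(w_out, taps):
                for wv, xv in zip(w_tap, row):
                    acc += wv * xv
            pixel.append(acc)
        flat.append(pixel)
    return [flat[r * out_w:(r + 1) * out_w] for r in range(out_h)]


if __name__ == "__main__":
    image = [[[float(3 * y + x + 1)] for x in range(3)] for y in range(3)]
    ones = [[[1.0] for _ in range(4)]]  # one output channel, 2x2 kernel of ones
    print(indirect_conv(image, ones, kh=2, kw=2))
```

The buffer depends only on the input geometry, so in this sketch it could be built once and reused across calls with the same shape.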
Pointwise, or 1x1, convolutional operators directly map to matrix-matrix multiplication, and prompt yet another approach to optimization. I demonstrate that neural networks heavy on pointwise convolutions can greatly benefit from sparsification of the weights tensor and representing the operation as a sparse-matrix-dense-matrix multiplication (SpMM) and introduce neural network-optimized SpMM primitives. While SpMM primitives in Sparse BLAS libraries target problems with extremely high sparsity (commonly 99+% sparsity) and non-random sparsity patterns, the proposed SpMM primitive is demonstrated to work well with moderate sparsity in the 70-95% range and unpredictable sparsity patterns.
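A pointwise convolution with sparse weights can be written as a sparse-matrix-dense-matrix product. The sketch below stores the weights in compressed sparse row (CSR) form and multiplies them by a dense activation matrix. CSR is chosen here only because it is a common sparse format; the abstract does not say which format the proposed primitives use, and the function names and layouts (`weights[out_channel][in_channel]`, `activations[in_channel][pixel]`) are assumptions.

```python
def dense_to_csr(weights):
    """Convert a dense [out_channels][in_channels] matrix to CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in weights:
        for c, w in enumerate(row):
            if w != 0.0:
                values.append(w)
                col_idx.append(c)
        row_ptr.append(len(values))  # end of this row's nonzeros
    return values, col_idx, row_ptr


def spmm(csr, activations):
    """Multiply CSR weights by dense activations [in_channels][pixels]."""
    values, col_idx, row_ptr = csr
    n_pixels = len(activations[0])
    out = []
    for r in range(len(row_ptr) - 1):
        acc = [0.0] * n_pixels
        # Only nonzero weights contribute, so work scales with their count.
        for k in range(row_ptr[r], row_ptr[r + 1]):
            w = values[k]
            src = activations[col_idx[k]]
            for p in range(n_pixels):
                acc[p] += w * src[p]
        out.append(acc)
    return out


if __name__ == "__main__":
    weights = [[0.0, 2.0, 0.0], [1.0, 0.0, 0.0]]
    activations = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(spmm(dense_to_csr(weights), activations))
```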
The softmax operator is light on elementary floating-point operations, but involves evaluation of the exponential function, which in many implementations becomes the bottleneck. I demonstrate that with a high-throughput vector exponential function the softmax computation saturates the memory bandwidth and can be further improved only by reducing the number of memory access operations. I then constructively prove that it is possible to replace the traditional three-pass softmax algorithms with a novel two-pass algorithm for up to 28% runtime reduction.
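The abstract does not spell out the two-pass algorithm, so the following is only a sketch of one way to drop the separate maximum-finding pass. It assumes each exponential is kept as a (mantissa, exponent) pair with e^x = m * 2^n, which avoids overflow without knowing the maximum in advance. The three-pass version is the conventional baseline; the helper `ext_exp` and both function names are hypothetical.

```python
import math

LN2 = math.log(2.0)
LOG2E = 1.0 / LN2


def ext_exp(x):
    """Return (m, n) with exp(x) == m * 2**n and m close to 1 (hypothetical helper)."""
    n = round(x * LOG2E)
    return math.exp(x - n * LN2), n


def softmax_three_pass(xs):
    mx = xs[0]
    for x in xs:                      # pass 1: maximum
        mx = max(mx, x)
    total = 0.0
    for x in xs:                      # pass 2: sum of shifted exponentials
        total += math.exp(x - mx)
    return [math.exp(x - mx) / total for x in xs]  # pass 3: normalize


def softmax_two_pass(xs):
    # Pass 1: accumulate the sum as a (mantissa, exponent) pair.
    m_sum, n_sum = ext_exp(xs[0])
    for x in xs[1:]:
        m, n = ext_exp(x)
        n_max = max(n, n_sum)
        # Rescale both terms to the larger exponent; shifts are never positive.
        m_sum = math.ldexp(m_sum, n_sum - n_max) + math.ldexp(m, n - n_max)
        n_sum = n_max
    # Pass 2: recompute each exponential and divide by the accumulated sum.
    out = []
    for x in xs:
        m, n = ext_exp(x)
        out.append(math.ldexp(m / m_sum, n - n_sum))
    return out


if __name__ == "__main__":
    logits = [1000.0, 1001.0, 1002.0]  # a naive exp(x) would overflow here
    print(softmax_three_pass(logits))
    print(softmax_two_pass(logits))
```

In this sketch each input is read twice rather than three times, at the cost of evaluating the exponential in both passes.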
I implemented the proposed ideas in the open source NNPACK, QNNPACK, and XNNPACK libraries for acceleration of neural networks on CPUs, which at the time of release delivered state-of-the-art performance on mobile, server, and Web platforms. Ph.D.
Deep Contrast Learning for Salient Object Detection
Salient object detection has recently witnessed substantial progress due to
powerful features extracted using deep convolutional neural networks (CNNs).
However, existing CNN-based methods operate at the patch level instead of the
pixel level. Resulting saliency maps are typically blurry, especially near the
boundary of salient objects. Furthermore, image patches are treated as
independent samples even when they are overlapping, giving rise to significant
redundancy in computation and storage. In this CVPR 2016 paper, we propose an
end-to-end deep contrast network to overcome the aforementioned limitations.
Our deep network consists of two complementary components, a pixel-level fully
convolutional stream and a segment-wise spatial pooling stream. The first
stream directly produces a saliency map with pixel-level accuracy from an input
image. The second stream extracts segment-wise features very efficiently, and
better models saliency discontinuities along object boundaries. Finally, a
fully connected CRF model can be optionally incorporated to improve spatial
coherence and contour localization in the fused result from these two streams.
Experimental results demonstrate that our deep model significantly improves the
state of the art. Comment: To appear in CVPR 201
Semantic Instance Annotation of Street Scenes by 3D to 2D Label Transfer
Semantic annotations are vital for training models for object recognition,
semantic segmentation or scene understanding. Unfortunately, pixelwise
annotation of images at very large scale is labor-intensive, and little
labeled data is available, particularly at instance level and for street
scenes. In this paper, we propose to tackle this problem by lifting the
semantic instance labeling task from 2D into 3D. Given reconstructions from
stereo or laser data, we annotate static 3D scene elements with rough bounding
primitives and develop a model which transfers this information into the image
domain. We leverage our method to obtain 2D labels for a novel suburban video
dataset which we have collected, resulting in 400k semantic and instance image
annotations. A comparison of our method to state-of-the-art label transfer
baselines reveals that 3D information enables more efficient annotation while
at the same time resulting in improved accuracy and time-coherent labels. Comment: 10 pages in Conference on Computer Vision and Pattern Recognition
(CVPR), 201