Deep Surface Normal Estimation with Hierarchical RGB-D Fusion
The growing availability of commodity RGB-D cameras has boosted the
applications in the field of scene understanding. However, as a fundamental
scene understanding task, surface normal estimation from RGB-D data lacks
thorough investigation. In this paper, a hierarchical fusion network with
adaptive feature re-weighting is proposed for surface normal estimation from a
single RGB-D image. Specifically, features from the color image and the depth
map are successively integrated at multiple scales to ensure global surface smoothness
while preserving visually salient details. Meanwhile, the depth features are
re-weighted with a confidence map estimated from depth before merging into the
color branch to avoid artifacts caused by input depth corruption. Additionally,
a hybrid multi-scale loss function is designed to learn accurate normal
estimation from a noisy ground-truth dataset. Extensive experimental results
validate the effectiveness of the fusion strategy and the loss design, with the
proposed method outperforming state-of-the-art normal estimation schemes.
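The confidence-based re-weighting described in this abstract can be sketched roughly as follows. The function name, the 2-D list shapes, and the simple additive merge rule are illustrative assumptions, not the paper's actual network layer:

```python
def fuse_features(color_feat, depth_feat, confidence):
    """Merge depth features into color features, down-weighting
    low-confidence (e.g. corrupted) depth pixels.

    All arguments are 2-D lists of floats with identical shape;
    confidence values lie in [0, 1]."""
    fused = []
    for c_row, d_row, w_row in zip(color_feat, depth_feat, confidence):
        fused.append([c + w * d for c, d, w in zip(c_row, d_row, w_row)])
    return fused

color = [[1.0, 2.0], [3.0, 4.0]]
depth = [[10.0, 10.0], [10.0, 10.0]]
conf  = [[1.0, 0.0], [0.5, 1.0]]   # 0.0 marks a corrupted depth pixel
print(fuse_features(color, depth, conf))  # [[11.0, 2.0], [8.0, 14.0]]
```

Where confidence is zero, the depth contribution vanishes and the color features pass through unchanged, which is the behavior the abstract motivates for corrupted depth input.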
PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space
Few prior works study deep learning on point sets. PointNet by Qi et al. is a
pioneer in this direction. However, by design PointNet does not capture local
structures induced by the metric space points live in, limiting its ability to
recognize fine-grained patterns and generalizability to complex scenes. In this
work, we introduce a hierarchical neural network that applies PointNet
recursively on a nested partitioning of the input point set. By exploiting
metric space distances, our network is able to learn local features with
increasing contextual scales. Further observing that point sets are usually
sampled with varying densities, which greatly degrades the performance of
networks trained on uniform densities, we propose novel set learning layers to
adaptively combine features from multiple scales.
Experiments show that our network called PointNet++ is able to learn deep point
set features efficiently and robustly. In particular, it obtains results
significantly better than the state of the art on challenging 3D point cloud
benchmarks.
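The nested partitioning this abstract relies on is typically built by farthest-point sampling, which PointNet++ uses to pick the centroids that its local PointNets are applied around. A minimal pure-Python sketch (greedy FPS over 3-D tuples; the starting index and return convention are our own choices):

```python
import math

def farthest_point_sampling(points, k):
    """Greedy farthest-point sampling: pick k centroids so that each new
    centroid is the point farthest from the already-chosen set.
    `points` is a list of (x, y, z) tuples; returns indices of chosen points."""
    chosen = [0]                                   # start from an arbitrary point
    dist = [math.dist(p, points[0]) for p in points]
    for _ in range(k - 1):
        idx = max(range(len(points)), key=dist.__getitem__)
        chosen.append(idx)
        # each point keeps its distance to the nearest chosen centroid
        for i, p in enumerate(points):
            dist[i] = min(dist[i], math.dist(p, points[idx]))
    return chosen

pts = [(0, 0, 0), (0.1, 0, 0), (5, 0, 0), (0, 5, 0)]
print(farthest_point_sampling(pts, 3))  # [0, 2, 3] -- the spread-out points
```

Note how the near-duplicate point at index 1 is skipped: FPS spreads centroids over the metric space, which is what lets the hierarchy cover regions of varying sampling density.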
Learning to Forecast Videos of Human Activity with Multi-granularity Models and Adaptive Rendering
We propose an approach for forecasting video of complex human activity
involving multiple people. Direct pixel-level prediction is too simple to
handle the appearance variability in complex activities. Hence, we develop
novel intermediate representations. An architecture combining a hierarchical
temporal model for predicting human poses and encoder-decoder convolutional
neural networks for rendering target appearances is proposed. Our hierarchical
model captures interactions among people by adopting a dynamic group-based
interaction mechanism. Next, our appearance rendering network encodes the
targets' appearances by learning adaptive appearance filters using a fully
convolutional network. Finally, these filters are placed in encoder-decoder
neural networks to complete the rendering. We demonstrate that our model can
generate videos that are superior to state-of-the-art methods, and can handle
complex human activity scenarios in video forecasting.
Environmental Sound Classification Based on Multi-temporal Resolution Convolutional Neural Network Combining with Multi-level Features
Motivated by the fact that characteristics of different sound classes are
highly diverse in different temporal scales and hierarchical levels, a novel
deep convolutional neural network (CNN) architecture is proposed for the
environmental sound classification task. This network architecture takes raw
waveforms as input, and a set of separated parallel CNNs are utilized with
different convolutional filter sizes and strides, in order to learn feature
representations with multiple temporal resolutions. In addition, the proposed
architecture aggregates hierarchical features from multi-level CNN layers for
classification through direct connections between convolutional layers, going
beyond the single-level CNN features employed by the majority of previous
studies. These direct connections also improve the flow of information and
mitigate the vanishing gradient problem. The combination of
multi-level features boosts the classification performance significantly.
Comparative experiments are conducted on two datasets: the environmental sound
classification dataset (ESC-50), and DCASE 2017 audio scene classification
dataset. Results demonstrate that the proposed method is highly effective in
the classification tasks by employing multi-temporal resolution and multi-level
features, and it outperforms previous methods that account only for
single-level features.
Comment: Submitted to PCM 201
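The multi-temporal-resolution idea in this abstract, parallel filter banks with different kernel sizes and strides over the raw waveform, can be illustrated with a toy valid 1-D convolution. The filter values and the flat concatenation are illustrative assumptions, not the paper's architecture:

```python
def conv1d(signal, kernel, stride):
    """Valid 1-D convolution (cross-correlation) over a raw waveform."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(0, len(signal) - k + 1, stride)]

def multi_resolution_features(signal, filter_banks):
    """Run parallel filter banks with different sizes and strides, then
    concatenate the resulting feature sequences."""
    feats = []
    for kernel, stride in filter_banks:
        feats.extend(conv1d(signal, kernel, stride))
    return feats

wave = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
banks = [([0.5, 0.5], 1),                 # short filter: fine temporal detail
         ([0.25, 0.25, 0.25, 0.25], 2)]   # long filter: coarse envelope
print(multi_resolution_features(wave, banks))
```

The short filter preserves the oscillation while the long, strided filter averages it away, showing how different temporal scales capture different characteristics of the same sound.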
CG-OoO: Energy-Efficient Coarse-Grain Out-of-Order Execution
We introduce the Coarse-Grain Out-of-Order (CG-OoO) general purpose
processor, designed to achieve close to In-Order processor energy while
maintaining Out-of-Order (OoO) performance. CG-OoO is an energy-performance
proportional general purpose architecture that scales according to the program
load. Block-level code processing is at the heart of this architecture;
CG-OoO speculates, fetches, schedules, and commits code at block-level
granularity. It eliminates unnecessary accesses to energy-consuming tables and
replaces large tables with smaller, distributed tables that are cheaper to
access. CG-OoO leverages compiler-level code optimizations to deliver efficient
static code, and exploits dynamic instruction-level parallelism and block-level
parallelism. CG-OoO introduces Skipahead issue, a complexity-effective, limited
out-of-order instruction scheduling model. Through the energy efficiency
techniques applied to the compiler and processor pipeline stages, CG-OoO closes
64% of the average energy gap between the In-Order and Out-of-Order baseline
processors at the performance of the OoO baseline. This makes CG-OoO 1.9x more
efficient than the OoO baseline on the inverse energy-delay product metric.
Comment: 11 pages
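A limited out-of-order issue scheme like the Skipahead model named above can be illustrated with a toy in-window scheduler. This is our own simplified reading, not the paper's hardware design: each cycle, scan a small window from the head of the instruction queue and issue the first instruction whose sources are ready.

```python
def skipahead_issue(block, window=2):
    """Toy limited out-of-order issue: each step, scan at most `window`
    instructions from the head of the queue and issue the first one whose
    source registers have already been produced; if none is ready, issue
    the head anyway (a crude stall model).
    Each instruction is (name, dest_reg, src_regs)."""
    ready = set()            # registers whose values have been produced
    queue = list(block)
    order = []
    while queue:
        for pos in range(min(window, len(queue))):
            name, dest, srcs = queue[pos]
            if all(s in ready for s in srcs):
                order.append(name)
                ready.add(dest)
                queue.pop(pos)
                break
        else:
            name, dest, _ = queue.pop(0)
            order.append(name)
            ready.add(dest)
    return order

block = [("i0", "r1", []),          # r1 = const
         ("i1", "r2", ["r9"]),      # depends on r9, not yet produced
         ("i2", "r9", []),          # r9 = const
         ("i3", "r3", ["r1", "r9"])]
print(skipahead_issue(block))       # ['i0', 'i2', 'i1', 'i3']
```

The dependent instruction i1 is skipped until i2 produces r9, giving limited out-of-order behavior without scanning the whole queue, which is the complexity-versus-performance trade the abstract describes.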
Codestitcher: Inter-Procedural Basic Block Layout Optimization
Modern software executes a large amount of code. Previous techniques of code
layout optimization were developed one or two decades ago and have become
inadequate to cope with the scale and complexity of new types of applications
such as compilers, browsers, interpreters, language VMs and shared libraries.
This paper presents Codestitcher, an inter-procedural basic block code layout
optimizer which reorders basic blocks in an executable to benefit from better
cache and TLB performance. Codestitcher provides a hierarchical framework which
can be used to improve locality in various layers of the memory hierarchy. Our
evaluation shows that Codestitcher improves the performance of the original
program by 3% to 25% (on average, by 10%) on 5 widely used applications with
large code sizes: MySQL, Clang, Firefox, Apache, and Python. It gives an
additional improvement of 4% over LLVM's PGO and 3% over PGO combined with
the best function reordering technique.
Comment: 24 pages, 6 figures, preprint
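Profile-guided basic-block reordering of the kind this abstract builds on is often introduced via the classic greedy chaining baseline: merge the two chains joined by the hottest remaining fall-through edge so hot successors become adjacent in memory. Codestitcher's actual hierarchical algorithm is more involved; this sketch shows only the baseline idea, with illustrative block names:

```python
def greedy_layout(blocks, edge_counts):
    """Greedy basic-block chaining: process profile edges hottest-first and
    concatenate two chains whenever the edge's source ends one chain and its
    destination starts another. `edge_counts` maps (src, dst) -> count."""
    chain_of = {b: [b] for b in blocks}      # block -> chain containing it
    for (src, dst), _ in sorted(edge_counts.items(), key=lambda kv: -kv[1]):
        a, b = chain_of[src], chain_of[dst]
        if a is not b and a[-1] == src and b[0] == dst:
            a.extend(b)                      # place dst right after src
            for blk in b:
                chain_of[blk] = a
    seen, layout = set(), []
    for blk in blocks:                       # emit chains in first-seen order
        c = chain_of[blk]
        if id(c) not in seen:
            seen.add(id(c))
            layout.extend(c)
    return layout

edges = {("A", "B"): 10, ("A", "C"): 90, ("C", "B"): 80}
print(greedy_layout(["A", "B", "C"], edges))  # ['A', 'C', 'B']
```

The hot path A -> C -> B becomes contiguous, so the frequent control transfers fall through rather than jump, which is exactly the cache/TLB locality benefit the abstract targets.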
End-to-End Video Captioning with Multitask Reinforcement Learning
Although end-to-end (E2E) learning has led to impressive progress on a
variety of visual understanding tasks, it is often impeded by hardware
constraints (e.g., GPU memory) and is prone to overfitting. When it comes to
video captioning, one of the most challenging benchmark tasks in computer
vision, those limitations of E2E learning are especially amplified by the fact
that both the input videos and output captions are lengthy sequences. Indeed,
state-of-the-art methods for video captioning process video frames by
convolutional neural networks and generate captions by unrolling recurrent
neural networks. If we connect them in an E2E manner, the resulting model is
both memory-consuming and data-hungry, making it extremely hard to train. In
this paper, we propose a multitask reinforcement learning approach to training
an E2E video captioning model. The main idea is to mine and construct as many
effective tasks (e.g., attributes, rewards, and captions) as possible from
the human captioned videos such that they can jointly regulate the search space
of the E2E neural network, from which an E2E video captioning model can be
found and generalized to the testing phase. To the best of our knowledge, this
is the first video captioning model that is trained end-to-end from the raw
video input to the caption output. Experimental results show that such a model
outperforms existing ones by a large margin on two benchmark video captioning
datasets.
Prediction and Description of Near-Future Activities in Video
Most of the existing works on human activity analysis focus on recognition or
early recognition of the activity labels from complete or partial observations.
Similarly, existing video captioning approaches focus on the observed events in
videos. Predicting the labels and the captions of future activities where no
frames of the predicted activities have been observed is a challenging problem,
with important applications that require anticipatory response. In this work,
we propose a system that can infer the labels and the captions of a sequence of
future activities. Our proposed network for label prediction of a future
activity sequence is similar to a hybrid Siamese network with three branches
where the first branch takes visual features from the objects present in the
scene, the second branch takes observed activity features and the third branch
captures the last observed activity features. The predicted labels and the
observed scene context are then mapped to meaningful captions using a
sequence-to-sequence learning-based method. Experiments on three challenging
activity analysis datasets and a video description dataset demonstrate that
both our label prediction framework and our captioning framework outperform the
state of the art.
Comment: 14 pages, 4 figures, 14 tables
Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation
Recently, Neural Architecture Search (NAS) has successfully identified neural
network architectures that exceed human-designed ones on large-scale image
classification. In this paper, we study NAS for semantic image segmentation.
Existing works often focus on searching the repeatable cell structure, while
hand-designing the outer network structure that controls the spatial resolution
changes. This choice simplifies the search space, but becomes increasingly
problematic for dense image prediction which exhibits a lot more network level
architectural variations. Therefore, we propose to search the network level
structure in addition to the cell level structure, which forms a hierarchical
architecture search space. We present a network level search space that
includes many popular designs, and develop a formulation that allows efficient
gradient-based architecture search (3 P100 GPU days on Cityscapes images). We
demonstrate the effectiveness of the proposed method on the challenging
Cityscapes, PASCAL VOC 2012, and ADE20K datasets. Auto-DeepLab, our
architecture searched specifically for semantic image segmentation, attains
state-of-the-art performance without any ImageNet pretraining.
Comment: To appear in CVPR 2019 as oral. Code for Auto-DeepLab released at
https://github.com/tensorflow/models/tree/master/research/deepla
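Gradient-based architecture search of the kind this abstract uses rests on a continuous relaxation: the output of a searchable edge is a softmax-weighted sum of all candidate operations, so the architecture weights can be trained by gradient descent alongside the network weights. A scalar toy version (the candidate ops here are illustrative, not Auto-DeepLab's search space):

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def mixed_op(x, ops, alphas):
    """Continuously relaxed search edge: output is the softmax-weighted sum
    of every candidate operation applied to the input. After search, the
    op with the largest alpha is kept and the rest are pruned."""
    w = softmax(alphas)
    return sum(wi * op(x) for wi, op in zip(w, ops))

ops = [lambda x: x, lambda x: 2 * x, lambda x: 0.0]  # identity, scale, zero
print(mixed_op(3.0, ops, [0.0, 0.0, 0.0]))  # equal weights -> (3 + 6 + 0)/3
```

Auto-DeepLab applies this relaxation not only to cell-level operations but also to network-level choices (keep, halve, or double the spatial resolution at each layer), which is what makes the hierarchical search space differentiable.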
BranchConnect: Large-Scale Visual Recognition with Learned Branch Connections
We introduce an architecture for large-scale image categorization that
enables the end-to-end learning of separate visual features for the different
classes to distinguish. The proposed model consists of a deep CNN shaped like a
tree. The stem of the tree includes a sequence of convolutional layers common
to all classes. The stem then splits into multiple branches implementing
parallel feature extractors, which are ultimately connected to the final
classification layer via learned gated connections. These learned gates
determine for each individual class the subset of features to use. Such a
scheme naturally encourages the learning of a heterogeneous set of specialized
features through the separate branches and it allows each class to use the
subset of features that are optimal for its recognition. We show the generality
of our proposed method by reshaping several popular CNNs from the literature
into our proposed architecture. Our experiments on the CIFAR100, CIFAR10, and
Synth datasets show that in each case our resulting model yields a substantial
improvement in accuracy over the original CNN. Our empirical analysis also
suggests that our scheme acts as a form of beneficial regularization improving
generalization performance.
Comment: WACV 201
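The per-class gated branch selection this abstract describes can be sketched with a toy classifier head. The top-k gating rule, the function name, and the linear branch features are our simplification of the paper's learned gates:

```python
def branch_connect_logit(branch_feats, gates, weights, k):
    """Toy gated classifier head for one class: keep the k branches with the
    largest gate values, zero the rest, and form the class logit from the
    gated sum of linear branch responses."""
    top = sorted(range(len(gates)), key=gates.__getitem__, reverse=True)[:k]
    logit = 0.0
    for b in top:
        logit += gates[b] * sum(f * w for f, w in zip(branch_feats[b], weights))
    return logit

feats = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # features from 3 branches
gates = [0.1, 0.9, 0.8]                        # learned gates for one class
w = [1.0, 1.0]
print(branch_connect_logit(feats, gates, w, k=2))  # 0.9*7 + 0.8*11 = 15.1
```

Because each class has its own gate vector, different classes read from different branch subsets, which is how the architecture encourages the branches to specialize.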