Cross-Modal Attentional Context Learning for RGB-D Object Detection
Recognizing objects from simultaneously sensed photometric (RGB) and depth
channels is a fundamental problem in many practical machine vision
applications such as robot grasping and autonomous driving. In this paper, we
address this problem by developing a Cross-Modal Attentional Context (CMAC)
learning framework, which enables full exploitation of the context information
in both RGB and depth data. Compared to existing RGB-D object detection
frameworks, our approach has several appealing properties. First, it employs
an attention-based global context model that extracts adaptive contextual
information and incorporates it into a region-based CNN (e.g., Fast R-CNN)
framework to improve object detection performance. Second, our CMAC framework
further contains a fine-grained object part attention module that harnesses
multiple discriminative object parts inside each candidate object region for a
superior local feature representation. Besides greatly improving the accuracy
of RGB-D object detection, the effective cross-modal information fusion and
attentional context modeling in our model also provide an interpretable
visualization scheme. Experimental results demonstrate that the proposed
method significantly improves upon the state of the art on all public
benchmarks.
Comment: Accepted as a regular paper to IEEE Transactions on Image Processing.
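To make the global-context idea concrete, here is a minimal PyTorch sketch of attention-pooling a feature map into a context vector and appending it to each region feature; the module name, tensor shapes, and the single 1x1-conv scoring head are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class GlobalContextAttention(nn.Module):
    """Sketch (hypothetical): attention-pool a backbone feature map into a
    global context vector and concatenate it onto every RoI feature."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)  # per-location attention logit

    def forward(self, feat_map, roi_feats):
        # feat_map: (B, C, H, W) backbone features; roi_feats: (B, R, C)
        B, C, H, W = feat_map.shape
        attn = self.score(feat_map).view(B, 1, H * W).softmax(dim=-1)
        context = torch.bmm(attn, feat_map.view(B, C, H * W).transpose(1, 2))  # (B, 1, C)
        context = context.expand(-1, roi_feats.size(1), -1)  # broadcast to all RoIs
        return torch.cat([roi_feats, context], dim=-1)       # (B, R, 2C) enriched features
```

The fine-grained part attention module described in the abstract could reuse the same scoring pattern inside each region rather than over the whole image.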
An Efficient Bit Plane X-OR Algorithm for Irreversible Image Steganography
Steganography is the science of hiding secret information inside another
message, so that the very presence of the secret information is concealed. It
is the method of hiding cognitive content in the same or another medium to
avoid recognition by intruders. This paper introduces a new method in which
irreversible steganography is used to hide an image in the same medium so that
the secret data is masked. The secret image is known as the payload and the
carrier is known as the cover image. An XOR operation between the mid-level
bit planes of the carrier image and the high-level bit planes of the data
image generates the new low-level bit planes of the stego image. Recovery XORs
the low-level and mid-level bit planes of the stego image, and the data image
is reconstructed from the result. An RGB color image is used as the carrier,
and the data image is a grayscale image with dimensions less than or equal to
those of the carrier image. The proposed method greatly increases the
embedding capacity without significantly decreasing the PSNR value.
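Since the abstract fully specifies the embed/recover arithmetic, a minimal NumPy sketch follows; the particular bit-plane indices (carrier mid planes 3-4, payload high planes 6-7, stego low planes 0-1) are assumptions for illustration, as the abstract does not pin them down.

```python
import numpy as np

MID, HIGH, LOW = (3, 4), (6, 7), (0, 1)  # assumed bit-plane choices

def embed(cover: np.ndarray, payload: np.ndarray) -> np.ndarray:
    """Stego low planes = carrier mid planes XOR payload high planes.
    Both arrays are assumed uint8 and of equal shape for simplicity."""
    stego = cover.copy()
    for m, h, l in zip(MID, HIGH, LOW):
        mid_bits = (cover >> m) & 1
        high_bits = (payload >> h) & 1
        stego = (stego & ~np.uint8(1 << l)) | ((mid_bits ^ high_bits) << l)
    return stego

def recover(stego: np.ndarray) -> np.ndarray:
    """Payload high planes = stego low planes XOR stego mid planes
    (the mid planes are untouched by embedding, so the XOR cancels)."""
    data = np.zeros_like(stego)
    for m, h, l in zip(MID, HIGH, LOW):
        data |= (((stego >> l) & 1) ^ ((stego >> m) & 1)) << h
    return data
```

Because only the two lowest bit planes of the cover are overwritten, the per-pixel distortion is at most 3 gray levels, which is why the PSNR penalty stays small.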
cvpaper.challenge in 2015 - A review of CVPR2015 and DeepSurvey
The "cvpaper.challenge" is a group composed of members from AIST, Tokyo Denki
Univ. (TDU), and Univ. of Tsukuba that aims to systematically summarize papers
on computer vision, pattern recognition, and related fields. For this
particular review, we focused on reading all 602 conference papers presented
at CVPR 2015, the premier annual computer vision event held in June 2015, in
order to grasp the trends in the field. Further, we propose "DeepSurvey" as a
mechanism embodying the entire process, from reading all the papers, through
the generation of ideas, to the writing of a paper.
Comment: Survey Paper.
cvpaper.challenge in 2016: Futuristic Computer Vision through 1,600 Papers Survey
This paper presents the futuristic challenges discussed in the
cvpaper.challenge. In 2015 and 2016, we thoroughly studied 1,600+ papers from
several conferences and journals, including CVPR, ICCV, ECCV, NIPS, PAMI, and
IJCV.
Video-based Sign Language Recognition without Temporal Segmentation
Millions of hearing-impaired people around the world routinely use some
variant of sign language to communicate, so the automatic translation of sign
language is meaningful and important. Currently, there are two
sub-problems in Sign Language Recognition (SLR), i.e., isolated SLR that
recognizes word by word and continuous SLR that translates entire sentences.
Existing continuous SLR methods typically utilize isolated SLRs as building
blocks, with an extra layer of preprocessing (temporal segmentation) and
another layer of post-processing (sentence synthesis). Unfortunately, temporal
segmentation itself is non-trivial and inevitably propagates errors into
subsequent steps. Worse still, isolated SLR methods typically require strenuous
labeling of each word separately in a sentence, severely limiting the amount of
attainable training data. To address these challenges, we propose a novel
continuous sign recognition framework, the Hierarchical Attention Network with
Latent Space (LS-HAN), which eliminates the preprocessing of temporal
segmentation. The proposed LS-HAN consists of three components: a two-stream
Convolutional Neural Network (CNN) for video feature representation generation,
a Latent Space (LS) for semantic gap bridging, and a Hierarchical Attention
Network (HAN) for latent space based recognition. Experiments are carried out
on two large scale datasets. Experimental results demonstrate the effectiveness
of the proposed framework.
Comment: 32nd AAAI Conference on Artificial Intelligence (AAAI-18), Feb. 2-7,
2018, New Orleans, Louisiana, USA.
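The latent-space component can be pictured as a standard joint embedding; the sketch below is a hypothetical simplification (module names and dimensions are ours, not the paper's) showing how video and sentence features might be scored in a shared space without any temporal segmentation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentSpaceBridge(nn.Module):
    """Hypothetical sketch: project pooled two-stream video features and a
    sentence embedding into one latent space and score their similarity."""
    def __init__(self, video_dim, text_dim, latent_dim=256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, latent_dim)
        self.text_proj = nn.Linear(text_dim, latent_dim)

    def forward(self, video_feat, sent_feat):
        v = F.normalize(self.video_proj(video_feat), dim=-1)
        s = F.normalize(self.text_proj(sent_feat), dim=-1)
        return (v * s).sum(dim=-1)  # cosine similarity; higher = better match
```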
Correlated and Individual Multi-Modal Deep Learning for RGB-D Object Recognition
In this paper, we propose a new correlated and individual multi-modal deep
learning (CIMDL) method for RGB-D object recognition. Unlike most conventional
RGB-D object recognition methods, which extract features from the RGB and
depth channels individually, our CIMDL jointly learns feature representations
from raw RGB-D data with a pair of deep neural networks, so that the sharable
and modal-specific information can be exploited simultaneously. Specifically,
we construct a pair of deep convolutional neural networks (CNNs) for the RGB
and depth data, and concatenate them at the top layer of the network with a
loss function that learns a new feature space in which both the correlated and
the individual parts of the RGB-D information are well modeled. The parameters
of the whole network are updated via back-propagation. Experimental results on
two widely used RGB-D object image benchmark datasets clearly show that our
method outperforms the state of the art.
Comment: 11 pages, 7 figures, submitted to a conference in 201
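A rough sketch of the correlated/individual split might look like the following; the layer names, the dimensions, and the simple L2 agreement penalty standing in for the correlation loss are all assumptions, since the abstract does not specify the exact objective.

```python
import torch
import torch.nn as nn

class CIMDLHead(nn.Module):
    """Hypothetical sketch: split RGB and depth CNN features into a shared
    (correlated) part and a modality-specific (individual) part, then
    classify from the concatenation of all four pieces."""
    def __init__(self, feat_dim, shared_dim, specific_dim, num_classes):
        super().__init__()
        self.shared_rgb = nn.Linear(feat_dim, shared_dim)
        self.shared_depth = nn.Linear(feat_dim, shared_dim)
        self.spec_rgb = nn.Linear(feat_dim, specific_dim)
        self.spec_depth = nn.Linear(feat_dim, specific_dim)
        self.classifier = nn.Linear(2 * (shared_dim + specific_dim), num_classes)

    def forward(self, f_rgb, f_depth):
        c_rgb, c_depth = self.shared_rgb(f_rgb), self.shared_depth(f_depth)
        s_rgb, s_depth = self.spec_rgb(f_rgb), self.spec_depth(f_depth)
        corr_loss = (c_rgb - c_depth).pow(2).mean()  # push shared parts to agree
        logits = self.classifier(torch.cat([c_rgb, c_depth, s_rgb, s_depth], dim=-1))
        return logits, corr_loss  # total loss = CE(logits, y) + lambda * corr_loss
```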
Super-Resolution via Deep Learning
The recent phenomenal interest in convolutional neural networks (CNNs) has
made it inevitable for the super-resolution (SR) community to explore their
potential. The response has been immense: in the three years since the
pioneering work, enough works have appeared to warrant a comprehensive survey.
This paper surveys the SR literature in the context of deep learning. We focus
on three important aspects of multimedia, namely image, video, and higher
dimensions, especially depth maps. In each case, we first introduce the
relevant benchmarks in the form of datasets and state-of-the-art
non-deep-learning SR methods. This is followed by a detailed analysis of the
individual works, each including a short description of the method and a
critique of the results, with special reference to the benchmarking performed.
We conclude with a minimal overall benchmarking, comparing methods on common
datasets using the results reported in the respective works.
Multigrid Predictive Filter Flow for Unsupervised Learning on Videos
We introduce multigrid Predictive Filter Flow (mgPFF), a framework for
unsupervised learning on videos. The mgPFF takes as input a pair of frames and
outputs per-pixel filters to warp one frame to the other. Compared to optical
flow used for warping frames, mgPFF is more powerful in modeling sub-pixel
movement and dealing with corruption (e.g., motion blur). We develop a
multigrid coarse-to-fine modeling strategy that avoids the requirement of
learning large filters to capture large displacement. This allows us to train
an extremely compact model (4.6MB) which operates in a progressive way over
multiple resolutions with shared weights. We train mgPFF without supervision
on free-form videos and show that mgPFF not only estimates long-range flow for
frame reconstruction and detects video shot transitions, but is also readily
amenable to video object segmentation and pose tracking, where it
substantially outperforms the published state of the art without bells and
whistles. Moreover, owing to mgPFF's nature of per-pixel filter prediction, we
have the unique opportunity to visualize how each pixel evolves during these
tasks, gaining better interpretability.
Comment: webpage: https://www.ics.uci.edu/~skong2/mgpff.html
Depth Adaptive Deep Neural Network for Semantic Segmentation
In this work, we present the depth-adaptive deep neural network using a depth
map for semantic segmentation. Typical deep neural networks receive inputs at
predetermined locations, regardless of the distance from the camera. This
fixed receptive field makes it difficult to generalize the features of objects
at various distances. Specifically, the predetermined receptive fields are too
small at a short distance, and vice versa. To overcome this challenge, we
develop a neural network that can adapt its receptive field not only for each
layer but also for each neuron at each spatial location. To adjust the
receptive field, we propose the depth-adaptive multiscale (DaM) convolution
layer, consisting of an adaptive perception neuron and an in-layer multiscale
neuron. The adaptive perception neuron adjusts the receptive field at each
spatial location using the corresponding depth information, while the in-layer
multiscale neuron applies receptive fields of different sizes in each feature
space to learn features at multiple scales. The proposed DaM convolution is
applied to two fully convolutional neural networks. We demonstrate the
effectiveness of the proposed networks on a publicly available RGB-D dataset
for semantic segmentation and on a novel hand segmentation dataset for
hand-object interaction. The experimental results show that the proposed
method outperforms state-of-the-art methods without any additional layers or
pre/post-processing.
Comment: IEEE Transactions on Multimedia, 201
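One way to approximate the described behavior, not necessarily the paper's exact formulation, is to run parallel dilated convolutions and blend them per pixel with weights derived from the depth map; everything in this sketch (branch dilations, the 1x1 gating conv) is an illustrative assumption.

```python
import torch
import torch.nn as nn

class DepthAdaptiveConv(nn.Module):
    """Hypothetical sketch: per-pixel blend of dilated conv branches,
    gated by the aligned depth map, so the effective receptive field
    varies with distance from the camera."""
    def __init__(self, in_ch, out_ch, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=d, dilation=d) for d in dilations
        )
        self.gate = nn.Conv2d(1, len(dilations), 1)  # depth -> per-pixel branch weights

    def forward(self, x, depth):
        # x: (B, C, H, W) features; depth: (B, 1, H, W) aligned depth map
        w = self.gate(depth).softmax(dim=1)                       # (B, n, H, W)
        outs = torch.stack([b(x) for b in self.branches], dim=1)  # (B, n, C', H, W)
        return (outs * w.unsqueeze(2)).sum(dim=1)                 # per-pixel blend
```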
Choosing Smartly: Adaptive Multimodal Fusion for Object Detection in Changing Environments
Object detection is an essential task for autonomous robots operating in
dynamic and changing environments. A robot should be able to detect objects in
the presence of sensor noise that can be induced by changing lighting
conditions for cameras and false depth readings for range sensors, especially
RGB-D cameras. To tackle these challenges, we propose a novel adaptive fusion
approach for object detection that learns to weight the predictions of
different sensor modalities in an online manner. Our approach is based on a
mixture of convolutional neural network (CNN) experts and incorporates multiple
modalities including appearance, depth and motion. We test our method in
extensive robot experiments, in which we detect people in a combined indoor and
outdoor scenario from RGB-D data, and we demonstrate that our method can adapt
to harsh lighting changes and severe camera motion blur. Furthermore, we
present a new RGB-D dataset for people detection in mixed in- and outdoor
environments, recorded with a mobile robot. Code, pretrained models and dataset
are available at http://adaptivefusion.cs.uni-freiburg.de
Comment: Published at the 2016 IEEE/RSJ International Conference on
Intelligent Robots and Systems. Added a new baseline with respect to the IROS
version. Project page with code, pretrained models, and our InOutDoorPeople
RGB-D dataset at http://adaptivefusion.cs.uni-freiburg.de
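The mixture-of-experts weighting can be sketched as a small gating network over per-modality detector outputs; the gate's input features and the plain softmax weighting below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Hypothetical sketch: a gating network predicts per-expert weights and
    the fused detection scores are the weighted sum of expert scores."""
    def __init__(self, num_experts, feat_dim):
        super().__init__()
        self.gate = nn.Linear(num_experts * feat_dim, num_experts)

    def forward(self, expert_scores, expert_feats):
        # expert_scores: (B, E, num_classes) per-modality detector scores
        # expert_feats:  (B, E, feat_dim) features the gate conditions on
        B, E, D = expert_feats.shape
        w = self.gate(expert_feats.reshape(B, E * D)).softmax(dim=-1)  # (B, E)
        return (expert_scores * w.unsqueeze(-1)).sum(dim=1)            # fused scores
```

Because the gate conditions on the current input, the fusion weights adapt frame by frame as lighting changes or motion blur degrades one modality.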