Beyond Semantic Image Segmentation: Exploring Efficient Inference in Video
We explore the efficiency of the CRF inference module beyond image-level
semantic segmentation. The key idea is to combine the best of two worlds:
semantic co-labeling and more expressive models. Similar to [Alvarez14], our
formulation enables us to perform inference over ten thousand images within
seconds. At the same time, it can handle higher-order clique potentials as in
[vineet2014], enforcing region-level label consistency and co-occurrence
context. We follow the mean-field updates for higher-order potentials of
[vineet2014] and, inspired by [Alvarez14], extend the spatial smoothness and
appearance kernels of [DenseCRF13] to video data, making the system well
suited to efficient video semantic segmentation. Comment: CVPR 2015 workshop
WiCV: Extended Abstract
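As an illustration of the kernel extension described above, the sketch below
(not the authors' code) appends the frame index to each pixel's DenseCRF
feature vector, so the Gaussian smoothness/appearance kernels also measure
temporal proximity; the bandwidths and shapes are assumptions for the demo.

```python
# Hedged sketch: spatio-temporal DenseCRF features for video.
# The bandwidths (sigma_xy, sigma_t, sigma_rgb) are illustrative assumptions.
import numpy as np

def video_features(frames, sigma_xy=10.0, sigma_t=1.0, sigma_rgb=5.0):
    """frames: (T, H, W, 3) video. Returns (T*H*W, 6) features (x, y, t, r, g, b)."""
    T, H, W, _ = frames.shape
    t, y, x = np.meshgrid(np.arange(T), np.arange(H), np.arange(W), indexing="ij")
    pos = np.stack([x / sigma_xy, y / sigma_xy, t / sigma_t], axis=-1)
    return np.concatenate([pos, frames / sigma_rgb], axis=-1).reshape(-1, 6)

feats = video_features(np.random.rand(2, 4, 4, 3) * 255)
print(feats.shape)  # (32, 6): pixels from both frames share one feature space
```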
Real-time Sign Language Fingerspelling Recognition using Convolutional Neural Networks from Depth Maps
Sign language recognition is important for natural and convenient
communication between the deaf community and the hearing majority. We take
the highly efficient initial step toward an automatic fingerspelling
recognition system, applying convolutional neural networks (CNNs) to depth
maps. In this work, we consider a larger number of classes than the previous
literature: we train CNNs to classify 31 letters and digits using a subset of
depth data collected from multiple subjects. Across different learning
configurations, such as hyper-parameter selection with and without
validation, we achieve 99.99% accuracy for observed signers and 83.58% to
85.49% accuracy for new signers. The results show that accuracy improves as
we include more data from different subjects during training. The processing
time is 3 ms for the prediction of a single image. To the best of our
knowledge, the system achieves the highest accuracy and speed reported to
date. The trained model and dataset are available in our repository. Comment:
2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR)
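For concreteness, a minimal sketch of a depth-map CNN classifier with 31
output classes follows. The architecture, 64x64 input size, and layer widths
are assumptions for illustration, not the network from the paper.

```python
# Illustrative sketch only: small CNN over single-channel depth maps.
import torch
import torch.nn as nn

class DepthCNN(nn.Module):
    def __init__(self, n_classes=31):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
        )
        self.classifier = nn.Linear(64 * 4 * 4, n_classes)

    def forward(self, x):  # x: (B, 1, H, W) depth maps
        return self.classifier(self.features(x).flatten(1))

model = DepthCNN()
logits = model(torch.randn(8, 1, 64, 64))  # batch of 8 fake depth maps
print(logits.shape)                         # torch.Size([8, 31])
```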
Detecting Temporally Consistent Objects in Videos through Object Class Label Propagation
Object proposals for detecting moving or static video objects need to address
issues such as speed, memory complexity, and temporal consistency. We propose
an efficient Video Object Proposal (VOP) generation method and show its
efficacy in learning a better video object detector. A deep-learning-based
video object detector learned using the proposed VOPs achieves
state-of-the-art detection performance on the Youtube-Objects dataset. We
further propose a clustering of VOPs that enables efficient object detection
in video in a streaming fashion. Rather than applying per-frame convolutional
neural network (CNN) object detection, our proposed method, Objects in Video
Enabler thRough LAbel Propagation (OVERLAP), classifies only a small fraction
of all candidate proposals in each video frame via streaming clustering of
object proposals and class-label propagation. Source code will be made
available soon. Comment: Accepted for publication in WACV 201
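The label-propagation idea can be sketched as follows: classify one
representative proposal per streaming cluster, then propagate its label to
the cluster's remaining proposals. The clustering output and classifier below
are stand-ins, not the OVERLAP implementation.

```python
# Hedged sketch: one classifier call per cluster, labels propagated to members.
import numpy as np

def propagate_labels(proposal_feats, cluster_ids, classify):
    labels = np.empty(len(proposal_feats), dtype=object)
    for c in np.unique(cluster_ids):
        members = np.where(cluster_ids == c)[0]
        rep = members[0]                                  # representative proposal
        labels[members] = classify(proposal_feats[rep])   # one CNN call per cluster
    return labels

feats = np.random.rand(6, 128)            # 6 proposals, toy features
clusters = np.array([0, 0, 1, 1, 1, 2])   # streaming-clustering output (assumed)
fake_classify = lambda f: "car" if f.mean() > 0.5 else "dog"
print(propagate_labels(feats, clusters, fake_classify))
```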
Correction by Projection: Denoising Images with Generative Adversarial Networks
Generative adversarial networks (GANs) transform low-dimensional latent
vectors into visually plausible images. If the real dataset contains only
clean images, then ostensibly, the manifold learned by the GAN should contain
only clean images. In this paper, we propose to denoise corrupted images by
finding the nearest point on the GAN manifold, recovering latent vectors by
minimizing distances in image space. We first demonstrate that, given a
corrupted version of an image that truly lies on the GAN manifold, we can
approximately recover the latent vector and denoise the image, obtaining
significantly higher quality than BM3D. Next, we demonstrate that latent
vectors recovered from noisy images exhibit a consistent bias. By subtracting
this bias before projecting back to image space, we improve denoising results
even further. Finally, even for unseen images, our method denoises better
than BM3D. Notably, the basic version of our method (without bias correction)
requires no prior knowledge of the noise variance, whereas the
best-performing signal-processing methods, such as BM3D, require such an
estimate to achieve the highest possible denoising quality.
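A minimal sketch of the projection step follows: recover a latent vector by
minimizing image-space distance to the corrupted input, then decode it. The
toy generator, step count, and learning rate are assumptions; any pretrained
GAN generator mapping z to an image would take its place.

```python
# Hedged sketch of "correction by projection" with a stand-in generator.
import torch

G = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.Tanh())  # toy G

def project(noisy_img, n_steps=200, lr=0.05):
    z = torch.zeros(1, 16, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        loss = ((G(z) - noisy_img) ** 2).mean()  # distance in image space
        loss.backward()
        opt.step()
    return G(z).detach()                          # denoised estimate

noisy = torch.randn(1, 64)                        # stand-in corrupted image
print(project(noisy).shape)
```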
Pose2Instance: Harnessing Keypoints for Person Instance Segmentation
Human keypoints are a well-studied representation of people. We explore how
to use keypoint models to improve instance-level person segmentation. The
main idea is to harness a distance transform of oracle-provided keypoints or
estimated keypoint heatmaps as a prior for the person instance segmentation
task within a deep neural network. For training and evaluation, we consider
all images from COCO where both instance segmentation and human keypoint
annotations are available. We first show how oracle keypoints can boost the
performance of an existing human segmentation model during inference, without
any training. Next, we propose a framework to directly learn a deep instance
segmentation model conditioned on human pose. Experimental results show that,
at various Intersection over Union (IoU) thresholds, in a constrained
environment with oracle keypoints, instance segmentation accuracy achieves
10% to 12% relative improvement over a strong baseline of oracle bounding
boxes. In a more realistic environment without oracle keypoints, the proposed
deep person instance segmentation model conditioned on human pose achieves
3.8% to 10.5% relative improvement over its strongest baseline, a deep
network trained only for segmentation.
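A small sketch of the keypoint prior follows: the distance transform of
keypoint locations yields a map that can be stacked with the image as an
extra input channel to a segmentation network. The keypoint coordinates and
map size are made up for the demo.

```python
# Hedged sketch: distance-transform prior from keypoints (scipy's
# distance_transform_edt computes distance to the nearest zero entry).
import numpy as np
from scipy.ndimage import distance_transform_edt

def keypoint_prior(keypoints, shape):
    mask = np.ones(shape, dtype=bool)     # True everywhere except keypoints
    for y, x in keypoints:
        mask[y, x] = False                # mark keypoint locations
    return distance_transform_edt(mask)   # per-pixel distance to nearest keypoint

prior = keypoint_prior([(10, 12), (40, 30)], (64, 64))
print(prior.shape, prior.max())
```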
Structured Query-Based Image Retrieval Using Scene Graphs
A structured query can capture the complexity of object interactions (e.g.,
'woman rides motorcycle') in a way that single objects (e.g., 'woman' or
'motorcycle') cannot. Retrieval using structured queries is therefore much
more useful than single-object retrieval, but also a much more challenging
problem. In this paper we present a method that uses scene graph embeddings
as the basis for image retrieval. We examine how visual relationships,
derived from scene graphs, can be used as structured queries. The visual
relationships are directed subgraphs of the scene graph, with a subject and
an object as nodes connected by a predicate relationship. Notably, we achieve
high recall even on low- to medium-frequency objects in the long-tailed
COCO-Stuff dataset, and find that adding a visual-relationship-inspired loss
boosts our recall by 10% in the best case. Comment: Accepted to the Diagram
Image Retrieval and Analysis (DIRA) Workshop at CVPR 202
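As a sketch of how such retrieval might look, the toy example below embeds a
(subject, predicate, object) query and ranks images by cosine similarity in a
shared space; the embedding tables and dimensions are placeholders, not the
trained model from the paper.

```python
# Hedged sketch: rank images against a relationship query embedding.
import numpy as np

rng = np.random.default_rng(1)
image_embs = rng.normal(size=(100, 32))   # 100 indexed images (toy embeddings)

def embed_triplet(s, p, o, table):        # average of word embeddings (assumption)
    return np.mean([table[w] for w in (s, p, o)], axis=0)

table = {w: rng.normal(size=32) for w in ("woman", "rides", "motorcycle")}
q = embed_triplet("woman", "rides", "motorcycle", table)
scores = image_embs @ q / (np.linalg.norm(image_embs, axis=1) * np.linalg.norm(q))
print(np.argsort(-scores)[:5])            # top-5 retrieved image ids
```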
Context Matters: Refining Object Detection in Video with Recurrent Neural Networks
Given the vast amounts of video available online and recent breakthroughs in
object detection with static images, object detection in video offers a
promising new frontier. However, motion blur and compression artifacts cause
substantial frame-level variability, even in videos that appear smooth to the
eye. Additionally, video datasets tend to have sparsely annotated frames. We
present a new framework for improving object detection in videos that
captures temporal context and encourages consistency of predictions. First,
we train a pseudo-labeler, that is, a domain-adapted convolutional neural
network for object detection; it is trained on the subset of labeled frames
and then applied to all frames. Then we train a recurrent neural network that
takes sequences of pseudo-labeled frames as input and optimizes an objective
encouraging both accuracy on the target frame and consistency across
consecutive frames. The approach incorporates strong supervision of target
frames, weak supervision on context frames, and regularization via a
smoothness penalty. Our approach achieves a mean Average Precision (mAP) of
68.73, an improvement of 7.1 points over the strongest image-based baselines
on the Youtube-Video Objects dataset. Our experiments demonstrate that
neighboring frames can provide valuable information, even absent labels.
Comment: To appear in BMVC 201
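The objective can be sketched as a strong supervised loss on the target frame
plus a smoothness penalty tying consecutive frames together; the shapes and
loss weight below are illustrative assumptions, not the paper's values.

```python
# Hedged sketch: target-frame supervision + temporal consistency penalty.
import torch
import torch.nn.functional as F

def video_loss(probs, target_idx, target_label, lam=0.1):
    """probs: (T, C) per-frame class probabilities from the recurrent model."""
    strong = F.nll_loss(torch.log(probs[target_idx:target_idx + 1] + 1e-8),
                        target_label)
    smooth = ((probs[1:] - probs[:-1]) ** 2).mean()  # consecutive-frame consistency
    return strong + lam * smooth

probs = torch.softmax(torch.randn(5, 21), dim=1)     # 5 frames, 21 classes (toy)
print(video_loss(probs, target_idx=2, target_label=torch.tensor([7])))
```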
Triplet-Aware Scene Graph Embeddings
Scene graphs have become an important form of structured knowledge for tasks
such as image generation, visual relation detection, visual question
answering, and image retrieval. While visualizing and interpreting word
embeddings is well understood, scene graph embeddings have not been fully
explored. In this work, we train scene graph embeddings in a layout
generation task with different forms of supervision, specifically introducing
triplet supervision and data augmentation. After adding triplet supervision
and data augmentation, we see a significant performance increase in both
metrics that measure the goodness of layout prediction: mean
intersection-over-union (mIoU) (52.3% vs. 49.2%) and relation score (61.7%
vs. 54.1%). To understand how these different methods affect the scene graph
representation, we apply several new visualization and evaluation methods to
explore the evolution of the scene graph embedding. We find that triplet
supervision significantly improves embedding separability, which is highly
correlated with the performance of the layout prediction model. Comment:
Accepted to the Scene Graph Representation Learning workshop at ICCV 201
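A minimal sketch of triplet supervision follows: a margin loss pulls an
anchor embedding toward a positive and pushes it from a negative. The margin
and dimensions are assumptions; applying this to scene graph (subject,
predicate, object) embeddings is the idea named above.

```python
# Hedged sketch: standard margin-based triplet loss on toy embeddings.
import torch
import torch.nn.functional as F

anchor, pos, neg = (torch.randn(4, 64, requires_grad=True) for _ in range(3))
loss = F.triplet_margin_loss(anchor, pos, neg, margin=1.0)
loss.backward()   # gradients flow to all three embedding sets
print(loss.item())
```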
Toward Joint Image Generation and Compression using Generative Adversarial Networks
In this paper, we present a generative adversarial network framework that
generates compressed images instead of synthesizing raw RGB images and
compressing them separately. In the real world, most images and videos are
stored and transferred in a compressed format to save storage capacity and data
transfer bandwidth. However, since typical generative adversarial networks
generate raw RGB images, those generated images need to be compressed by a
post-processing stage to reduce the data size. Among image compression methods,
JPEG has been one of the most commonly used lossy compression methods for still
images. Hence, we propose a novel framework that generates JPEG compressed
images using generative adversarial networks. The novel generator consists of
the proposed locally connected layers, chroma subsampling layers, quantization
layers, residual blocks, and convolution layers. The locally connected layer is
proposed to enable block-based operations. We also discuss training
strategies for the proposed architecture, including the loss function and the
transformation between its generator and its discriminator. The proposed
method is evaluated on the publicly available CIFAR-10 and LSUN bedroom
datasets. The results demonstrate that the proposed method generates
compressed data of competitive quality, making it a promising baseline for
joint image generation and compression using generative adversarial networks.
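One building block such a generator needs is a differentiable path through
quantization. A common trick, assumed here rather than confirmed by the
abstract, is a straight-through estimator that rounds in the forward pass and
passes gradients unchanged in the backward pass; the quantization step size
below is illustrative.

```python
# Hedged sketch: straight-through estimator for JPEG-style quantization.
import torch

class STEQuantize(torch.autograd.Function):
    @staticmethod
    def forward(ctx, coeffs, step):
        return torch.round(coeffs / step) * step   # uniform quantization

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None                      # identity gradient (STE)

coeffs = (torch.randn(8, 8) * 50).requires_grad_()  # toy DCT-like coefficients
q = STEQuantize.apply(coeffs, 16.0)
q.sum().backward()
print(coeffs.grad.unique())   # tensor([1.]): gradients pass straight through
```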
Semantic Video Segmentation: Exploring Inference Efficiency
We explore the efficiency of CRF inference beyond image-level semantic
segmentation and perform joint inference over video frames. The key idea is
to combine the best of two worlds: semantic co-labeling and more expressive
models. Our formulation enables us to perform inference over ten thousand
images within seconds and makes the system well suited to efficient video
semantic segmentation. On the CamVid dataset, with TextonBoost unaries, our
proposed method achieves up to 8% improvement in accuracy over individual
semantic image segmentation with no additional time overhead. The source code
is available at https://github.com/subtri/video_inference. Comment: To appear
in Proc. of ISOCC 201
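For intuition, a naive O(N^2) mean-field loop over joint video features (such
as those sketched for the first entry above) follows; the paper's speed comes
from efficient filtering, which this toy version deliberately omits, and the
Potts weight w is an assumption.

```python
# Hedged sketch: naive dense-CRF mean-field inference, toy scale only.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def mean_field(unary, feats, n_iters=5, w=1.0):
    """unary: (N, L) negative log-probs; feats: (N, D). Returns (N, L) marginals."""
    d2 = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1)
    K = np.exp(-0.5 * d2)                    # Gaussian pairwise affinities
    np.fill_diagonal(K, 0.0)                 # no self-messages
    Q = softmax(-unary)
    for _ in range(n_iters):
        Q = softmax(-unary + w * (K @ Q))    # Potts: same-label mass lowers energy
    return Q

rng = np.random.default_rng(0)
Q = mean_field(rng.random((32, 3)), rng.random((32, 6)))
print(Q.argmax(axis=1))                      # smoothed labels across frames
```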