7 research outputs found
Video-based Sign Language Recognition without Temporal Segmentation
Millions of hearing impaired people around the world routinely use some
variants of sign languages to communicate, thus the automatic translation of a
sign language is meaningful and important. Currently, there are two
sub-problems in Sign Language Recognition (SLR), i.e., isolated SLR that
recognizes word by word and continuous SLR that translates entire sentences.
Existing continuous SLR methods typically utilize isolated SLRs as building
blocks, with an extra layer of preprocessing (temporal segmentation) and
another layer of post-processing (sentence synthesis). Unfortunately, temporal
segmentation itself is non-trivial and inevitably propagates errors into
subsequent steps. Worse still, isolated SLR methods typically require strenuous
labeling of each word separately in a sentence, severely limiting the amount of
attainable training data. To address these challenges, we propose a novel
continuous sign recognition framework, the Hierarchical Attention Network with
Latent Space (LS-HAN), which eliminates the preprocessing of temporal
segmentation. The proposed LS-HAN consists of three components: a two-stream
Convolutional Neural Network (CNN) for video feature representation generation,
a Latent Space (LS) for semantic gap bridging, and a Hierarchical Attention
Network (HAN) for latent space based recognition. Experiments are carried out
on two large scale datasets. Experimental results demonstrate the effectiveness
of the proposed framework.Comment: 32nd AAAI Conference on Artificial Intelligence (AAAI-18), Feb. 2-7,
2018, New Orleans, Louisiana, US
RGB-NIR image categorization with prior knowledge transfer
Abstract
Recent development on image categorization, especially scene categorization, shows that the combination of standard visible RGB image data and near-infrared (NIR) image data performs better than RGB-only image data. However, the size of RGB-NIR image collection is often limited due to the difficulty of acquisition. With limited data, it is difficult to extract effective features using the common deep learning networks. It is observed that humans are able to learn prior knowledge from other tasks or a good mentor, which is helpful to solve the learning problems with limited training samples. Inspired by this observation, we propose a novel training methodology for introducing the prior knowledge into a deep architecture, which allows us to bypass the burdensome labeling large quantity of image data to meet the big data requirements in deep learning. At first, transfer learning is adopted to learn single modal features from a large source database, such as ImageNet. Then, a knowledge distillation method is explored to fuse the RGB and NIR features. Finally, a global optimization method is employed to fine-tune the entire network. The experimental results on two RGB-NIR datasets demonstrate the effectiveness of our proposed approach in comparison with the state-of-the-art multi-modal image categorization methods.https://deepblue.lib.umich.edu/bitstream/2027.42/146762/1/13640_2018_Article_388.pd
Domain Adaptation and Privileged Information for Visual Recognition
The automatic identification of entities like objects, people or their actions in visual data, such as images or video, has significantly improved, and is now being deployed in access control, social media, online retail, autonomous vehicles, and several other applications. This visual recognition capability leverages supervised learning techniques, which require large amounts of labeled training data from the target distribution representative of the particular task at hand. However, collecting such training data might be expensive, require too much time, or even be impossible. In this work, we introduce several novel approaches aiming at compensating for the lack of target training data. Rather than leveraging prior knowledge for building task-specific models, typically easier to train, we focus on developing general visual recognition techniques, where the notion of prior knowledge is better identified by additional information, available during training. Depending on the nature of such information, the learning problem may turn into domain adaptation (DA), domain generalization (DG), leaning using privileged information (LUPI), or domain adaptation with privileged information (DAPI).;When some target data samples are available and additional information in the form of labeled data from a different source is also available, the learning problem becomes domain adaptation. Unlike previous DA work, we introduce two novel approaches for the few-shot learning scenario, which require only very few labeled target samples, and even one can be very effective. The first method exploits a Siamese deep neural network architecture for learning an embedding where visual categories from the source and target distributions are semantically aligned and yet maximally separated. The second approach instead, extends adversarial learning to simultaneously maximize the confusion between source and target domains while achieving semantic alignment.;In complete absence of target data, several cheaply available source datasets related to the target distribution can be leveraged as additional information for learning a task. This is the domain generalization setting. We introduce the first deep learning approach to address the DG problem, by extending a Siamese network architecture for learning a representation of visual categories that is invariant with respect to the sources, while imposing semantic alignment and class separation to maximize generalization performance on unseen target domains.;There are situations in which target data for training might come equipped with additional information that can be modeled as an auxiliary view of the data, and that unfortunately is not available during testing. This is the LUPI scenario. We introduce a novel framework based on the information bottleneck that leverages the auxiliary view to improve the performance of visual classifiers. We do so by introducing a formulation that is general, in the sense that can be used with any visual classifier.;Finally, when the available target data is unlabeled, and there is closely related labeled source data, which is also equipped with an auxiliary view as additional information, we pose the question of how to leverage the source data views to train visual classifiers for unseen target data. This is the DAPI scenario. We extend the LUPI framework based on the information bottleneck to learn visual classifiers in DAPI settings and show that privileged information can be leveraged to improve the learning on new domains. Also, the novel DAPI framework is general and can be used with any visual classifier.;Every use of auxiliary information has been validated extensively using publicly available benchmark datasets, and several new state-of-the-art accuracy performance values have been set. Examples of application domains include visual object recognition from RGB images and from depth data, handwritten digit recognition, and gesture recognition from video
Context-driven Object Detection and Segmentation with Auxiliary Information
One fundamental problem in computer vision and robotics is to
localize objects of interest in an image. The task can either be
formulated as an object detection problem if the objects are
described by a set of pose parameters, or an object segmentation
one if we recover object boundary precisely. A key issue in
object detection and segmentation concerns exploiting the spatial
context, as local evidence is often insufficient to determine
object pose in the presence of heavy occlusions or large object
appearance variations. This thesis addresses the object detection
and segmentation problem in such adverse conditions with
auxiliary depth data provided by RGBD cameras. We focus on four
main issues in context-aware object detection and segmentation:
1) what are the effective context representations? 2) how can we
work with limited and imperfect depth data? 3) how to design
depth-aware features and integrate depth cues into conventional
visual inference tasks? 4) how to make use of unlabeled data to
relax the labeling requirements for training data?
We discuss three object detection and segmentation scenarios
based on varying amounts of available auxiliary information. In
the first case, depth data are available for model training but
not available for testing. We propose a structured Hough voting
method for detecting objects with heavy occlusion in indoor
environments, in which we extend the Hough hypothesis space to
include both the object's location, and its visibility pattern.
We design a new score function that accumulates votes for object
detection and occlusion prediction. In addition, we explore the
correlation between objects and their environment, building a
depth-encoded object-context model based on RGBD data. In the
second case, we address the problem of localizing glass objects
with noisy and incomplete depth data. Our method integrates the
intensity and depth information from a single view point, and
builds a Markov Random Field that predicts glass boundary and
region jointly. In addition, we propose a nonparametric,
data-driven label transfer scheme for local glass boundary
estimation. A weighted voting scheme based on a joint feature
manifold is adopted to integrate depth and appearance cues, and
we learn a distance metric on the depth-encoded feature manifold.
In the third case, we make use of unlabeled data to relax the
annotation requirements for object detection and segmentation,
and propose a novel data-dependent margin distribution learning
criterion for boosting, which utilizes the intrinsic geometric
structure of datasets. One key aspect of this method is that it
can seamlessly incorporate unlabeled data by including a graph
Laplacian regularizer. We demonstrate the performance of our
models and compare with baseline methods on several real-world
object detection and segmentation tasks, including indoor object
detection, glass object segmentation and foreground segmentation
in video