Refining Image Categorization by Exploiting Web Images and General Corpus
Studies show that refining real-world categories into semantic subcategories
contributes to better image modeling and classification. Previous image
sub-categorization work relying on labeled images and WordNet's hierarchy is
not only labor-intensive but also restricted to classifying images into noun
subcategories. To tackle these problems, we exploit general corpus
information to automatically select web images and classify them into
semantically rich (sub-)categories. The following two major challenges are
well studied: 1) noise in the labels of subcategories derived from the general
corpus; 2) noise in the labels of images retrieved from the web. Specifically,
we first obtain semantically refined subcategories from the text perspective
and remove the noise with a relevance-based approach. To suppress noisy
images induced by search errors, we then formulate image selection and
classifier learning as a multi-class multi-instance learning problem and
solve it with the cutting-plane algorithm. The experiments show significant
performance gains from using the data generated by our approach on both
image categorization and sub-categorization tasks. The proposed approach also
consistently outperforms existing weakly supervised and web-supervised
approaches.
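The multi-class multi-instance formulation treats each query's retrieved images as a labeled bag whose clean instances are unknown. A minimal numpy sketch of that idea, using an MI-SVM style "witness" subgradient update rather than the paper's actual cutting-plane solver (all data, prototypes, and hyperparameters here are toy assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

d, C = 5, 3
protos = np.eye(C, d) * 3.0            # one "clean" prototype per subcategory

def bag_scores(W, bag):
    """Score every instance in a bag against all classes; W has shape (C, d)."""
    return bag @ W.T

def mi_step(W, bags, labels, lr=0.1):
    """One subgradient step of a bag-level multi-class hinge loss: each bag is
    represented by its highest-scoring ("witness") instance for its own label."""
    grad = np.zeros_like(W)
    for bag, y in zip(bags, labels):
        s = bag_scores(W, bag)
        i = int(np.argmax(s[:, y]))                 # witness instance for label y
        margins = s[i] + (np.arange(C) != y)        # unit margin for wrong classes
        wrong = int(np.argmax(margins))
        if wrong != y:                              # margin violated -> update
            grad[wrong] += bag[i]
            grad[y] -= bag[i]
    return W - lr * grad

# toy web bags: each holds 4 irrelevant (noise) images and 1 clean one
bags, labels = [], []
for _ in range(60):
    y = int(rng.integers(C))
    noise = rng.normal(0.0, 0.5, size=(4, d))
    clean = protos[y] + 0.1 * rng.normal(size=d)
    bags.append(np.vstack([noise, clean]))
    labels.append(y)

W = np.zeros((C, d))
for _ in range(80):
    W = mi_step(W, bags, labels)

pred = [int(np.argmax(bag_scores(W, b).max(axis=0))) for b in bags]
acc = float(np.mean(np.array(pred) == np.array(labels)))
```

The witness step is what lets the learner ignore the noisy instances in each bag: only the most confidently on-label image drives the update.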
Adaptive SVM+: Learning with Privileged Information for Domain Adaptation
Incorporating additional knowledge in the learning process can be beneficial
for several computer vision and machine learning tasks. Whether privileged
information originates from a source domain that is adapted to a target domain,
or as additional features available at training time only, using such
privileged (i.e., auxiliary) information is of high importance as it improves
the recognition performance and generalization. However, both primary and
privileged information are rarely derived from the same distribution, which
poses an additional challenge to the recognition task. To address these
challenges, we present a novel learning paradigm that leverages privileged
information in a domain adaptation setup to perform visual recognition tasks.
The proposed framework, named Adaptive SVM+, combines the advantages of both
the learning using privileged information (LUPI) paradigm and the domain
adaptation framework, which are naturally embedded in the objective function of
a regular SVM. We demonstrate the effectiveness of our approach on the publicly
available Animals with Attributes and INTERACT datasets and report
state-of-the-art results on both of them.
Comment: To appear in ICCV Workshops 2017 (TASK-CV)
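In SVM+, the slack variables are modeled as a function of the privileged features, so the privileged space "explains" which training examples are hard. A rough numpy sketch of that coupling, using a penalty form of a LUPI-style objective and crude numerical gradients rather than the paper's actual Adaptive SVM+ solver (the data, hyperparameters, and optimizer are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# toy binary problem: x is the primary feature; x_star is a cleaner
# "privileged" feature available only at training time
n, d, ds = 200, 4, 2
y = rng.choice([-1.0, 1.0], size=n)
x = y[:, None] * rng.normal(1.0, 1.5, size=(n, d))
x_star = y[:, None] * rng.normal(1.0, 0.3, size=(n, ds))

def lupi_loss(params, C=1.0, gamma=0.5, lam=5.0):
    w, b, ws, bs = params
    xi = x_star @ ws + bs                        # slack model on privileged features
    margin = y * (x @ w + b)
    hinge = np.maximum(0.0, 1.0 - xi - margin)   # correcting-function hinge
    neg = np.maximum(0.0, -xi)                   # slacks must stay non-negative
    return (0.5 * w @ w + 0.5 * gamma * ws @ ws
            + C * xi.sum() + lam * (hinge.sum() + neg.sum()))

def num_grad_step(params, lr=1e-4, eps=1e-5):
    """Crude numerical-gradient descent step (fine for a toy sketch)."""
    flat = np.concatenate([params[0], [params[1]], params[2], [params[3]]])
    def unpack(f):
        return f[:d], f[d], f[d + 1:d + 1 + ds], f[d + 1 + ds]
    g = np.zeros_like(flat)
    base = lupi_loss(unpack(flat))
    for i in range(flat.size):
        p = flat.copy(); p[i] += eps
        g[i] = (lupi_loss(unpack(p)) - base) / eps
    return unpack(flat - lr * g)

params = (np.zeros(d), 0.0, np.zeros(ds), 0.0)
loss0 = lupi_loss(params)
for _ in range(30):
    params = num_grad_step(params)
loss1 = lupi_loss(params)
acc = float(np.mean(np.sign(x @ params[0] + params[1]) == y))
```

Only the primary weights (w, b) are needed at test time; the privileged slack model (ws, bs) exists solely to shape training.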
Fine-grained Classification using Heterogeneous Web Data and Auxiliary Categories
Fine-grained classification remains a very challenging problem, because of
the absence of well-labeled training data caused by the high cost of annotating
a large number of fine-grained categories. In the extreme case, where the
test categories have no well-labeled training data at all, most existing
works fall into one of two research directions: 1) crawl noisily labeled web
data for the test categories as training data, which is dubbed webly
supervised learning; 2) transfer knowledge from auxiliary categories with
well-labeled training data to the test categories, which corresponds to the
zero-shot learning setting. Nevertheless, the above two research
directions still have critical issues to be addressed. For the first direction,
web data have noisy labels and a data distribution considerably different
from that of the test data. For the second direction, zero-shot learning
still struggles to achieve compelling results compared with conventional
supervised learning. The
issues of the above two directions motivate us to develop a novel approach
which can jointly exploit both noisy web training data from test categories and
well-labeled training data from auxiliary categories. In particular, on one
hand, we crawl web data for test categories as noisy training data. On the
other hand, we transfer the knowledge from auxiliary categories with
well-labeled training data to test categories by virtue of free semantic
information (e.g., word vector) of all categories. Moreover, given the fact
that web data are generally associated with additional textual information
(e.g., title and tag), we extend our method by using the surrounding textual
information of web data as privileged information. Extensive experiments show
the effectiveness of our proposed methods.
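The knowledge-transfer step can be pictured as building a classifier for an unseen test category from the classifiers of semantically close auxiliary categories, with closeness measured in word-vector space. A toy numpy sketch (the embeddings, category names, and classifier weights below are made up, and the paper's actual transfer mechanism may differ):

```python
import numpy as np

# word vectors for auxiliary (seen) and test (unseen) categories
# (hypothetical 3-d embeddings; real ones would come from word2vec/GloVe)
aux_vecs = {"dog": np.array([0.9, 0.1, 0.0]),
            "cat": np.array([0.8, 0.3, 0.1]),
            "car": np.array([0.0, 0.1, 0.9])}
test_vec = {"wolf": np.array([0.85, 0.15, 0.05])}   # unseen category

# image classifiers (weight vectors) trained on the auxiliary categories
aux_clf = {"dog": np.array([1.0, 0.0]),
           "cat": np.array([0.7, 0.7]),
           "car": np.array([0.0, 1.0])}

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def transfer_classifier(t_vec, k=2):
    """Build a classifier for an unseen class as the similarity-weighted
    average of the k most related auxiliary classifiers."""
    sims = sorted(((cos(t_vec, v), name) for name, v in aux_vecs.items()),
                  reverse=True)[:k]
    total = sum(s for s, _ in sims)
    return sum((s / total) * aux_clf[name] for s, name in sims)

wolf_clf = transfer_classifier(test_vec["wolf"])
```

Because "wolf" sits near "dog" and "cat" in the embedding space, its synthesized classifier inherits mostly animal-like weights rather than car-like ones.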
Learning from Noisy Web Data with Category-level Supervision
As tons of photos are being uploaded to public websites (e.g., Flickr, Bing,
and Google) every day, learning from web data, also referred to as webly
supervised learning, has become an increasingly popular research direction
because of the freely available web resources. Nevertheless, the performance
gap between webly supervised learning and traditional supervised learning is
still very large, owing to the label noise of web data. To be exact, the
labels of images crawled from public websites are very noisy and often
inaccurate. Some existing works tend to facilitate learning from web data with
the aid of extra information, such as augmenting or purifying web data by
virtue of instance-level supervision, which usually demands heavy manual
annotation. Instead, we propose to tackle the label noise by leveraging
more accessible category-level supervision. In particular, we build our method
upon a variational autoencoder (VAE), in which the classification network is
attached to the hidden layer of the VAE so that the classification network
and the VAE can jointly leverage category-level hybrid semantic information.
The effectiveness of our proposed method is clearly demonstrated by extensive
experiments on three benchmark datasets.
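The architecture described, a classification network attached to the VAE's latent layer so that both objectives shape the same code, can be sketched as a single joint loss. The numpy sketch below is a forward pass only, with random (hypothetical) weights standing in for jointly learned ones:

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_z, n_cls, n = 8, 3, 4, 16
x = rng.normal(size=(n, d_in))
y = rng.integers(n_cls, size=n)

# hypothetical linear parameters; in the actual method these are learned jointly
W_mu = rng.normal(size=(d_in, d_z)) * 0.3    # encoder mean head
W_lv = rng.normal(size=(d_in, d_z)) * 0.1    # encoder log-variance head
W_dec = rng.normal(size=(d_z, d_in)) * 0.3   # decoder
W_cls = rng.normal(size=(d_z, n_cls)) * 0.3  # classifier on the latent code

def joint_loss(x, y):
    mu, logvar = x @ W_mu, x @ W_lv
    z = mu + np.exp(0.5 * logvar) * rng.normal(size=mu.shape)   # reparameterize
    recon = np.mean((x - z @ W_dec) ** 2)                        # VAE reconstruction
    kl = -0.5 * np.mean(1.0 + logvar - mu**2 - np.exp(logvar))   # KL to N(0, I)
    logits = z @ W_cls
    logits = logits - logits.max(axis=1, keepdims=True)          # stable softmax
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    ce = -np.mean(logp[np.arange(n), y])                         # classification loss
    return recon + kl + ce

loss = joint_loss(x, y)
```

Because the cross-entropy term backpropagates through the same latent code as the reconstruction and KL terms, category-level supervision can reshape the representation without any instance-level labels on the web images.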
Human Activity Recognition Using Robust Adaptive Privileged Probabilistic Learning
In this work, we propose a novel method for recognizing complex human
activities that is based on the learning using privileged information (LUPI)
paradigm and handles missing information during testing. We present a
supervised probabilistic approach that integrates LUPI into a hidden
conditional random field (HCRF) model. The proposed model is called HCRF+ and
may be trained using both maximum likelihood and maximum margin approaches. It
employs a self-training technique for automatic estimation of the
regularization parameters of the objective functions. Moreover, the method
provides robustness to outliers (such as noise or missing data) by modeling the
conditional distribution of the privileged information by a Student's
\textit{t}-density function, which is naturally integrated into the HCRF+
framework. Different forms of privileged information were investigated. The
proposed method was evaluated using four challenging publicly available
datasets and the experimental results demonstrate its effectiveness with
respect to the state of the art in the LUPI framework, using both
hand-crafted features and features extracted from a convolutional neural
network.
WebVision Challenge: Visual Learning and Understanding With Web Data
We present the 2017 WebVision Challenge, a public image recognition challenge
designed for deep learning based on web images without instance-level human
annotation. Following the spirit of previous vision challenges, such as ILSVRC,
Places2 and PASCAL VOC, which have played critical roles in the development of
computer vision by contributing to the community with large scale annotated
data for model design and standardized benchmarking, we contribute with this
challenge a large-scale web image dataset and a public competition with a
workshop co-located with CVPR 2017. The WebVision dataset contains more than
million web images crawled from the Internet by using queries generated
from the semantic concepts of the benchmark ILSVRC 2012 dataset. Meta
information is also included. A validation set and test set containing human
annotated images are also provided to facilitate algorithmic development. The
2017 WebVision challenge consists of two tracks, the image classification task
on WebVision test set, and the transfer learning task on PASCAL VOC 2012
dataset. In this paper, we describe the details of data collection and
annotation, highlight the characteristics of the dataset, and introduce the
evaluation metrics.
Comment: project page: http://www.vision.ee.ethz.ch/webvision
WebVision Database: Visual Learning and Understanding from Web Data
In this paper, we present a study on learning visual recognition models from
large scale noisy web data. We build a new database called WebVision, which
contains more than million web images crawled from the Internet by using
queries generated from the 1,000 semantic concepts of the benchmark ILSVRC 2012
dataset. Meta information along with those web images (e.g., title,
description, and tags) is also crawled. A validation set and test set
containing human annotated images are also provided to facilitate algorithmic
development. Based on our new database, we obtain a few interesting
observations: 1) the noisy web images are sufficient for training a good deep
CNN model for visual recognition; 2) the model learnt from our WebVision
database exhibits comparable or even better generalization ability than the one
trained from the ILSVRC 2012 dataset when being transferred to new datasets and
tasks; 3) a domain adaptation issue (a.k.a. dataset bias) is observed, which
means the dataset can also serve as the largest benchmark dataset for visual
domain adaptation. Our new WebVision database and the relevant studies in
this work should benefit the advancement of learning state-of-the-art visual
models with minimal supervision from web data.
Exploiting Multi-modal Curriculum in Noisy Web Data for Large-scale Concept Learning
Learning video concept detectors automatically from the big but noisy web
data with no additional manual annotation is a novel but challenging problem
in the multimedia and machine learning communities. A considerable amount of
videos on the web are associated with rich but noisy contextual information,
such as the title, which provides weak annotations or labels about the video
content. To leverage the big noisy web labels, this paper proposes a novel
method called WEbly-Labeled Learning (WELL), which builds on a
state-of-the-art machine learning algorithm inspired by the human learning
process. WELL introduces a number of novel multi-modal approaches to
incorporate meaningful prior knowledge, called a curriculum, from the noisy
web videos. To
investigate this problem, we empirically study the curriculum constructed from
the multi-modal features of the videos collected from YouTube and Flickr. The
efficacy and the scalability of WELL have been extensively demonstrated on two
public benchmarks, including the largest multimedia dataset and the largest
manually-labeled video set. The comprehensive experimental results demonstrate
that WELL outperforms state-of-the-art studies by a statistically significant
margin on learning concepts from noisy web video data. In addition, the results
also verify that WELL is robust to the level of noisiness in the video data.
Notably, WELL trained on sufficient noisy web labels is able to achieve
accuracy comparable to that of supervised learning methods trained on clean,
manually-labeled data.
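The curriculum idea, start from samples the current model finds easy and gradually admit harder ones, can be reduced to a few lines. Below is a generic self-paced-learning sketch on a toy regression with corrupted labels, not WELL's actual multi-modal curriculum (the data, threshold, and growth rate are invented):

```python
import numpy as np

rng = np.random.default_rng(2)

# toy 1-d regression whose "web labels" are 20% corrupted
n = 200
x = rng.uniform(-1.0, 1.0, size=n)
y = 2.0 * x                                # clean relation: y = 2x
noisy = rng.random(n) < 0.2
y[noisy] += rng.normal(0.0, 5.0, size=noisy.sum())

w = 0.0
lam = 0.5                                  # start by trusting only "easy" samples
for _ in range(10):
    loss = (w * x - y) ** 2
    keep = loss < lam                      # curriculum: currently-easy samples
    if keep.any():
        w = (x[keep] @ y[keep]) / (x[keep] @ x[keep])  # refit on them
    lam *= 1.5                             # gradually admit harder samples
```

Because the corrupted labels stay above the loss threshold for most of training, the fitted slope stays close to the clean value of 2 despite 20% label noise.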
Dynamically Visual Disambiguation of Keyword-based Image Search
Due to the high cost of manual annotation, learning directly from the web
has attracted broad attention. One issue that limits the performance of such
methods is visual polysemy. To address this issue, we present an adaptive
multi-model framework that resolves polysemy by visual disambiguation.
Compared to existing methods, the primary advantage of our approach lies in
its ability to adapt to dynamic changes in the search results. Our proposed
framework consists of two major steps: we first discover and dynamically select
the text queries according to the image search results, then we employ the
proposed saliency-guided deep multi-instance learning network to remove
outliers and learn classification models for visual disambiguation. Extensive
experiments demonstrate the superiority of our proposed approach.
Comment: Accepted by the International Joint Conference on Artificial
Intelligence (IJCAI), 201
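The outlier-removal role of a multi-instance network over a set of search results can be illustrated with saliency-style attention pooling: the bag score is a weighted average in which low-scoring outlier instances receive near-zero weight. A toy numpy sketch (the instance scores and temperature are invented, and this is a generic stand-in for the paper's saliency-guided network):

```python
import numpy as np

def attention_pool(scores, tau=0.5):
    """Saliency-style pooling: the bag score is an attention-weighted average,
    so low-scoring outlier instances contribute almost nothing."""
    a = np.exp(scores / tau)
    a = a / a.sum()
    return a, float(a @ scores)

# one bag of search results: four relevant images and one outlier
inst_scores = np.array([2.1, 1.8, 2.3, 1.9, -3.0])
weights, bag_score = attention_pool(inst_scores)
```

Compared with plain averaging (which the outlier would drag down to about 1.0), the attention-pooled bag score stays near the relevant instances' scores.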
Road Segmentation with Image-LiDAR Data Fusion
Robust road segmentation is a key challenge in self-driving research. Though
many image-based methods have been studied and high performance on dataset
evaluations has been reported, developing robust and reliable road
segmentation is still a major challenge. Data fusion across different sensors
to improve the performance of road segmentation is widely considered an
important and irreplaceable solution. In this paper, we propose a novel
structure to fuse image and LiDAR point cloud in an end-to-end semantic
segmentation network, in which the fusion is performed at the decoder stage
rather than, as is more common, at the encoder stage. During fusion, we
increase the precision of the multi-scale LiDAR map generation by
introducing a pyramid projection method. Additionally, we adapt the
multi-path refinement network to our fusion strategy and improve the road
prediction compared with transposed convolution with skip layers. Our
approach has been tested on the KITTI ROAD dataset and achieves competitive
performance.
Comment: Accepted by Multimedia Tools and Applications
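The pyramid projection idea can be sketched as rasterizing the LiDAR depths separately at each pyramid scale by rescaling the camera intrinsics, instead of downsampling one full-resolution projection. A minimal numpy sketch with a hypothetical pinhole camera (the intrinsics and points are made up; the paper's exact projection may differ):

```python
import numpy as np

# hypothetical pinhole intrinsics and a few LiDAR points in the camera frame
K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])
points = np.array([[1.0, 0.5, 10.0],
                   [-2.0, 0.2, 20.0],
                   [0.0, -0.5, 5.0]])

def project_depth_map(points, K, h, w, scale):
    """Rasterize LiDAR depths into an image-aligned map at a given pyramid
    scale, instead of naively downsampling a full-resolution projection."""
    Ks = K.copy()
    Ks[:2] *= scale                           # rescale intrinsics per level
    uvz = points @ Ks.T
    uv = (uvz[:, :2] / uvz[:, 2:3]).astype(int)
    depth = np.zeros((int(h * scale), int(w * scale)))
    for (u, v), z in zip(uv, points[:, 2]):
        if 0 <= v < depth.shape[0] and 0 <= u < depth.shape[1]:
            depth[v, u] = z                   # store the point's depth
    return depth

pyramid = [project_depth_map(points, K, 480, 640, s) for s in (1.0, 0.5, 0.25)]
```

Projecting at each scale keeps every point landing on its own correct pixel, whereas downsampling a sparse full-resolution map can wash points out or misplace them.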