Hybrid Generative/Discriminative Learning for Automatic Image Annotation
Automatic image annotation (AIA) raises tremendous challenges to machine
learning as it requires modeling of data that are both ambiguous in input and
output, e.g., images containing multiple objects and labeled with multiple
semantic tags. Even more challenging is that the number of candidate tags is
usually huge (as large as the vocabulary size) yet each image is only related
to a few of them. This paper presents a hybrid generative-discriminative
classifier to simultaneously address the extreme data-ambiguity and
overfitting-vulnerability issues in tasks such as AIA. Particularly: (1) an
Exponential-Multinomial Mixture (EMM) model is established to capture both the
input and output ambiguity while encouraging prediction sparsity; and
(2) the prediction ability of the EMM model is explicitly
maximized through discriminative learning that integrates variational inference
of graphical models and the pairwise formulation of ordinal regression.
Experiments show that our approach achieves both superior annotation
performance and better tag scalability.
Comment: Appears in Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence (UAI 2010).
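The pairwise formulation of ordinal regression mentioned above can be illustrated as a ranking loss over (relevant, irrelevant) tag pairs: each relevant tag should outscore each irrelevant one. The sketch below is a generic RankNet-style logistic pairwise loss, not the paper's exact EMM objective; the function name and interface are illustrative.

```python
import math

def pairwise_rank_loss(scores, relevant):
    """Logistic pairwise ranking loss over tag scores.

    A generic sketch of the pairwise-ranking idea (RankNet-style),
    not the paper's EMM model: every relevant tag is encouraged to
    score higher than every irrelevant tag.
    """
    loss, pairs = 0.0, 0
    for i, s_pos in enumerate(scores):
        if i not in relevant:
            continue
        for j, s_neg in enumerate(scores):
            if j in relevant:
                continue
            # Penalty shrinks toward 0 as s_pos exceeds s_neg.
            loss += math.log(1.0 + math.exp(s_neg - s_pos))
            pairs += 1
    return loss / max(pairs, 1)
```

With only a few relevant tags out of a vocabulary-sized candidate set, the loss averages over relevant-irrelevant pairs, which matches the sparse-prediction setting the abstract describes.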
A Supervised Neural Autoregressive Topic Model for Simultaneous Image Classification and Annotation
Topic modeling based on latent Dirichlet allocation (LDA) has been a
framework of choice to perform scene recognition and annotation. Recently, a
new type of topic model called the Document Neural Autoregressive Distribution
Estimator (DocNADE) was proposed and demonstrated state-of-the-art performance
for document modeling. In this work, we show how to successfully apply and
extend this model to the context of visual scene modeling. Specifically, we
propose SupDocNADE, a supervised extension of DocNADE, that increases the
discriminative power of the hidden topic features by incorporating label
information into the training objective of the model. We also describe how to
leverage information about the spatial position of the visual words and how to
embed additional image annotations, so as to simultaneously perform image
classification and annotation. We test our model on the Scene15, LabelMe and
UIUC-Sports datasets and show that it compares favorably to other topic models
such as the supervised variant of LDA.
Comment: 13 pages, 5 figures.
A Deep and Autoregressive Approach for Topic Modeling of Multimodal Data
Topic modeling based on latent Dirichlet allocation (LDA) has been a
framework of choice to deal with multimodal data, such as in image annotation
tasks. Another popular approach to model the multimodal data is through deep
neural networks, such as the deep Boltzmann machine (DBM). Recently, a new type
of topic model called the Document Neural Autoregressive Distribution Estimator
(DocNADE) was proposed and demonstrated state-of-the-art performance for text
document modeling. In this work, we show how to successfully apply and extend
this model to multimodal data, such as simultaneous image classification and
annotation. First, we propose SupDocNADE, a supervised extension of DocNADE,
that increases the discriminative power of the learned hidden topic features
and show how to employ it to learn a joint representation from image visual
words, annotation words and class label information. We test our model on the
LabelMe and UIUC-Sports data sets and show that it compares favorably to other
topic models. Second, we propose a deep extension of our model and provide an
efficient way of training the deep model. Experimental results show that our
deep model outperforms its shallow version and reaches state-of-the-art
performance on the Multimedia Information Retrieval (MIR) Flickr data set.
Comment: 24 pages, 10 figures. A version was accepted by TPAMI on Aug 4th, 2015. Adds a footnote on how to train the model in practice in Section 5.1. arXiv admin note: substantial text overlap with arXiv:1305.530
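The autoregressive factorization DocNADE builds on, log p(v) = Σ_i log p(v_i | v_<i), can be shown with a toy forward pass: a hidden state accumulated from the embeddings of previous words feeds a softmax over the vocabulary for the next word. The small dense weight lists below stand in for trained parameters; this is an illustrative sketch, not the paper's model or training code.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def docnade_log_likelihood(doc, W, V, b, c):
    """Toy DocNADE-style autoregressive log-likelihood.

    doc: list of word indices; W: hidden x vocab embeddings;
    V: vocab x hidden output weights; b: vocab bias; c: hidden bias.
    Names and shapes are illustrative assumptions.
    """
    H = len(c)
    acc = [0.0] * H  # running sum of embeddings of the words seen so far
    ll = 0.0
    for w in doc:
        # Hidden state depends only on preceding words (autoregressive).
        h = [sigmoid(c[j] + acc[j]) for j in range(H)]
        p = softmax([b[k] + sum(V[k][j] * h[j] for j in range(H))
                     for k in range(len(b))])
        ll += math.log(p[w])
        acc = [acc[j] + W[j][w] for j in range(H)]
    return ll
```

With all-zero parameters the conditionals are uniform, which is a quick sanity check on the factorization.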
Bayesian Joint Topic Modelling for Weakly Supervised Object Localisation
We address the problem of localisation of objects as bounding boxes in images
with weak labels. This weakly supervised object localisation problem has been
tackled in the past using discriminative models where each object class is
localised independently from other classes. We propose a novel framework based
on Bayesian joint topic modelling. Our framework has three distinctive
advantages over previous works: (1) All object classes and image backgrounds
are modelled jointly together in a single generative model so that "explaining
away" inference can resolve ambiguity and lead to better learning and
localisation. (2) The Bayesian formulation of the model enables easy
integration of prior knowledge about object appearance to compensate for
limited supervision. (3) Our model can be learned with a mixture of weakly
labelled and unlabelled data, allowing the large volume of unlabelled images on
the Internet to be exploited for learning. Extensive experiments on the
challenging VOC dataset demonstrate that our approach outperforms the
state-of-the-art competitors.
Comment: ICCV 2013.
A Comprehensive Survey on Cross-modal Retrieval
In recent years, cross-modal retrieval has drawn much attention due to the
rapid growth of multimodal data. It takes one type of data as the query to
retrieve relevant data of another type. For example, a user can use a text to
retrieve relevant pictures or videos. Since the query and its retrieved results
can be of different modalities, how to measure the content similarity between
different modalities of data remains a challenge. Various methods have been
proposed to deal with such a problem. In this paper, we first review a number
of representative methods for cross-modal retrieval and classify them into two
main groups: 1) real-valued representation learning, and 2) binary
representation learning. Real-valued representation learning methods aim to
learn real-valued common representations for different modalities of data. To
speed up the cross-modal retrieval, a number of binary representation learning
methods are proposed to map different modalities of data into a common Hamming
space. Then, we introduce several multimodal datasets in the community, and
show the experimental results on two commonly used multimodal datasets. The
comparison reveals the characteristic of different kinds of cross-modal
retrieval methods, which is expected to benefit both practical applications and
future research. Finally, we discuss open problems and future research
directions.
Comment: 20 pages, 11 figures, 9 tables.
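The Hamming-space idea behind binary representation learning can be sketched with random-hyperplane hashing: a real-valued feature from either modality is binarized into a compact code, and retrieval ranks database items by Hamming distance, which reduces to cheap bitwise operations. This is a generic LSH-style illustration, not any specific method from the survey; the hyperplanes here would be learned jointly across modalities in practice.

```python
def hamming(a, b):
    """Hamming distance between two integer-packed binary codes."""
    return bin(a ^ b).count("1")

def sign_hash(vec, hyperplanes):
    """Binarize a feature vector: one bit per hyperplane, packed
    into an int. A sketch; real methods learn the projections."""
    code = 0
    for h in hyperplanes:
        code <<= 1
        if sum(x * w for x, w in zip(vec, h)) >= 0:
            code |= 1
    return code

def retrieve(query_code, db_codes, k=1):
    """Return indices of the k database codes nearest in Hamming space."""
    return sorted(range(len(db_codes)),
                  key=lambda i: hamming(query_code, db_codes[i]))[:k]
```

Because nearby vectors tend to fall on the same side of most hyperplanes, similar content across modalities lands on similar codes, which is what makes the Hamming search fast.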
Unsupervised Category Discovery via Looped Deep Pseudo-Task Optimization Using a Large Scale Radiology Image Database
Obtaining semantic labels on a large scale radiology image database (215,786
key images from 61,845 unique patients) is a prerequisite for, yet a bottleneck to,
train highly effective deep convolutional neural network (CNN) models for image
recognition. Nevertheless, conventional methods for collecting image labels
(e.g., Google search followed by crowd-sourcing) are not applicable due to the
formidable difficulties of medical annotation tasks for those who are not
clinically trained. This type of image labeling task remains non-trivial even
for radiologists due to uncertainty and possible drastic inter-observer
variation or inconsistency.
In this paper, we present a looped deep pseudo-task optimization procedure
for automatic category discovery of visually coherent and clinically semantic
(concept) clusters. Our system can be initialized by domain-specific (CNN
trained on radiology images and text report derived labels) or generic
(ImageNet-based) CNN models. Afterwards, a sequence of pseudo-tasks is
exploited by the looped deep image feature clustering (to refine image labels)
and deep CNN training/classification using new labels (to obtain more task
representative deep features). Our method is conceptually simple and based on
the hypothesized "convergence" of better labels leading to better trained CNN
models which in turn feed more effective deep image features to facilitate more
meaningful clustering/labels. We have empirically validated the convergence and
demonstrated promising quantitative and qualitative results. Category labels of
significantly higher quality than those in previous work are discovered. This
allows for further investigation of the hierarchical semantic nature of the
given large-scale radiology image database.
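The looped procedure above can be sketched as alternating clustering and retraining until the labels stabilize. The `extract`, `cluster`, and `train` callables below are stand-ins for the paper's CNN feature extractor, clustering step, and CNN fine-tuning; the loop structure is what the sketch shows.

```python
def looped_pseudo_task(features, extract, cluster, train, rounds=3):
    """Looped deep pseudo-task optimization, schematically.

    Each round: cluster current features into pseudo-labels, retrain
    the model on those labels, then re-extract (more task-specific)
    features with the retrained model. The callables are hypothetical
    stand-ins for the CNN components described in the abstract.
    """
    labels = None
    for _ in range(rounds):
        labels = cluster(features)           # refine image labels
        model = train(features, labels)      # retrain on the new labels
        features = extract(model, features)  # obtain better features
    return labels
```

The hypothesized "convergence" corresponds to this loop reaching a fixed point where clustering the re-extracted features no longer changes the labels.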
A Survey on Object Detection in Optical Remote Sensing Images
Object detection in optical remote sensing images, being a fundamental but
challenging problem in the field of aerial and satellite image analysis, plays
an important role in a wide range of applications and has been receiving
significant attention in recent years. While numerous methods exist, a
thorough review of the
literature concerning generic object detection is still lacking. This paper
aims to provide a review of the recent progress in this field. Different from
several previously published surveys that focus on a specific object class such
as building or road, we concentrate on more generic object categories
including, but not limited to, road, building, tree, vehicle, ship,
airport, and urban area. Covering about 270 publications, we survey 1) template
matching-based object detection methods, 2) knowledge-based object detection
methods, 3) object-based image analysis (OBIA)-based object detection methods,
4) machine learning-based object detection methods, and 5) five publicly
available datasets and three standard evaluation metrics. We also discuss the
challenges of current studies and propose two promising research directions,
namely deep learning-based feature representation and weakly supervised
learning-based geospatial object detection. We hope this survey will
help researchers gain a better understanding of this research field.
Comment: This manuscript is the accepted version for ISPRS Journal of Photogrammetry and Remote Sensing.
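Of the surveyed families, template matching is the most direct to sketch: slide a template over the image and score each offset with normalized cross-correlation (NCC), taking the best-scoring position as the detection. The 1-D version below is purely illustrative; real detectors slide a 2-D template over the image, often across scales and rotations.

```python
import math

def ncc(patch, template):
    """Normalized cross-correlation between two equal-length patches;
    1.0 means a perfect (up to affine intensity change) match."""
    n = len(template)
    mp = sum(patch) / n
    mt = sum(template) / n
    num = sum((p - mp) * (t - mt) for p, t in zip(patch, template))
    dp = math.sqrt(sum((p - mp) ** 2 for p in patch))
    dt = math.sqrt(sum((t - mt) ** 2 for t in template))
    return num / (dp * dt) if dp and dt else 0.0

def match_template(signal, template):
    """Slide the template over a 1-D signal and return the offset
    with the highest NCC score."""
    k = len(template)
    scores = [ncc(signal[i:i + k], template)
              for i in range(len(signal) - k + 1)]
    return max(range(len(scores)), key=scores.__getitem__)
```

NCC's invariance to brightness and contrast shifts is what historically made it a workable baseline on remote sensing imagery, though, as the survey notes, it generalizes poorly across object appearance variation.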
Mid-level Deep Pattern Mining
Mid-level visual element discovery aims to find clusters of image patches
that are both representative and discriminative. In this work, we study this
problem from the perspective of pattern mining, relying on the recently
popularized Convolutional Neural Networks (CNNs). Specifically, we find that
for an image patch, activations extracted from the first fully-connected layer
of a CNN have two appealing properties that enable their seamless integration
with pattern mining. Patterns are then discovered from a large number of CNN
activations of image patches through the well-known association rule mining.
When we retrieve and visualize image patches with the same pattern,
surprisingly, they are not only visually similar but also semantically
consistent. We apply our approach to scene and object classification tasks, and
demonstrate that our approach outperforms all previous works on mid-level
visual element discovery by a sizeable margin with far fewer elements being
used. Our approach also outperforms or matches recent CNN-based works on
these tasks. Source code of the complete system is available online.
Comment: Published in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR) 2015.
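The association-rule step can be illustrated with brute-force frequent-itemset mining over binarized activations: each transaction is the set of activation indices that fire for one image patch, and patterns are itemsets whose support clears a threshold. The paper applies the Apriori-style association rule mining at scale; this exhaustive version is only a small-scale sketch, and `max_size` is an assumption for brevity.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=2):
    """Return itemsets (as sorted tuples) whose support, the fraction
    of transactions containing them, is at least min_support.

    Brute-force enumeration up to max_size items; a sketch of the
    mining step, not an efficient Apriori implementation.
    """
    counts = Counter()
    for t in transactions:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(t), size):
                counts[combo] += 1
    n = len(transactions)
    return {items: c / n for items, c in counts.items()
            if c / n >= min_support}
```

Patches sharing a frequent itemset fire the same subset of CNN activations, which is why retrieving them yields the visually and semantically consistent clusters the abstract describes.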
cvpaper.challenge in 2016: Futuristic Computer Vision through 1,600 Papers Survey
The paper presents the futuristic challenges discussed in the
cvpaper.challenge. In 2015 and 2016, we thoroughly studied 1,600+ papers
in several conferences/journals such as CVPR/ICCV/ECCV/NIPS/PAMI/IJCV.
An Overview of Cross-media Retrieval: Concepts, Methodologies, Benchmarks and Challenges
Multimedia retrieval plays an indispensable role in big data utilization.
Past efforts mainly focused on single-media retrieval. However, user
requirements are highly flexible, such as retrieving relevant audio
clips with an image query. Thus, challenges stemming from the "media
gap", which means that representations of different media types are
inconsistent, have attracted increasing attention. Cross-media retrieval is
designed for the scenarios where the queries and retrieval results are of
different media types. As a relatively new research topic, its concepts,
methodologies and benchmarks are still not clear in the literature. To address
these issues, we review more than 100 references, give an overview including
the concepts, methodologies, major challenges and open issues, as well as build
up the benchmarks including datasets and experimental results. Researchers can
directly adopt the benchmarks to promptly evaluate their proposed methods. This
will help them focus on algorithm design rather than the time-consuming
reproduction of compared methods and results. Note that we have constructed a new
dataset XMedia, which is the first publicly available dataset with up to five
media types (text, image, video, audio and 3D model). We believe this overview
will attract more researchers to focus on cross-media retrieval and be helpful
to them.
Comment: 14 pages, accepted by IEEE Transactions on Circuits and Systems for Video Technology.