Refining Image Categorization by Exploiting Web Images and General Corpus
Studies show that refining real-world categories into semantic subcategories
contributes to better image modeling and classification. Previous image
sub-categorization work relying on labeled images and WordNet's hierarchy is
not only labor-intensive but also restricted to classifying images into NOUN
subcategories. To tackle these problems, in this work we exploit general corpus
information to automatically select and subsequently classify web images into
semantically rich (sub-)categories. Two major challenges are addressed: 1)
noise in the labels of subcategories derived from the general corpus; 2) noise
in the labels of images retrieved from the web. Specifically, we first obtain
the semantically refined subcategories from the text perspective and remove the
label noise with a relevance-based approach. To suppress noisy images caused by
search errors, we then formulate image selection and classifier learning as a
multi-class multi-instance learning problem and solve it with the cutting-plane
algorithm. Experiments show significant performance gains on both image
categorization and sub-categorization tasks when using the data generated by
our approach. The proposed approach also consistently outperforms existing
weakly supervised and web-supervised approaches.
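As a rough illustration of the multi-instance idea described in this abstract, the sketch below alternates between selecting the instances a current classifier trusts most within each bag of web images and retraining a multi-class classifier on that selection. The alternating heuristic, the keep_ratio parameter, and the use of scikit-learn's LinearSVC are illustrative assumptions; the paper itself solves the problem with a cutting-plane algorithm, which is not reproduced here.

import numpy as np
from sklearn.svm import LinearSVC

def train_with_noisy_bags(bags, labels, keep_ratio=0.7, n_rounds=5):
    """bags: list of (n_i, d) feature arrays (one bag of web images per
    subcategory query); labels: the bag-level subcategory labels."""
    # Round 0: trust every instance in every bag.
    X = np.vstack(bags)
    y = np.concatenate([np.full(len(b), lab) for b, lab in zip(bags, labels)])
    clf = LinearSVC().fit(X, y)
    for _ in range(n_rounds):
        X_sel, y_sel = [], []
        for b, lab in zip(bags, labels):
            # Keep the instances the current model scores highest for the bag label.
            scores = clf.decision_function(b)
            if scores.ndim > 1:
                col_scores = scores[:, list(clf.classes_).index(lab)]
            else:  # binary case: flip the sign for the "negative" class
                col_scores = scores if lab == clf.classes_[1] else -scores
            keep = np.argsort(col_scores)[-max(1, int(keep_ratio * len(b))):]
            X_sel.append(b[keep])
            y_sel.append(np.full(len(keep), lab))
        clf = LinearSVC().fit(np.vstack(X_sel), np.concatenate(y_sel))
    return clf

In practice the bags would be the image sets returned for each subcategory query, with features taken from a pretrained network.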
Fine-grained Visual-textual Representation Learning
Fine-grained visual categorization aims to recognize hundreds of subcategories
belonging to the same basic-level category, a highly challenging task due to
the subtle and local visual distinctions among similar
subcategories. Most existing methods generally learn part detectors to discover
discriminative regions for better categorization performance. However, not all
parts are beneficial and indispensable for visual categorization, and the
setting of the number of part detectors heavily relies on prior knowledge as
well as experimental validation. When we describe the object in an image with a
textual description, we naturally focus on its pivotal characteristics and
rarely mention common characteristics or the background. This involuntary
transfer from human visual attention to textual attention means that textual
attention tells us how many and which parts are discriminative and significant
for categorization, so textual attention can help us discover visual attention
in images. Inspired by
this, we propose a fine-grained visual-textual representation learning (VTRL)
approach, and its main contributions are: (1) Fine-grained visual-textual
pattern mining is devoted to discovering discriminative visual-textual pairwise
information for boosting categorization performance through jointly modeling
vision and text with generative adversarial networks (GANs), which
automatically and adaptively discovers discriminative parts. (2) Visual-textual
representation learning jointly combines visual and textual information, which
preserves the intra-modality and inter-modality information to generate a
complementary fine-grained representation and further improves categorization
performance. Comment: 12 pages, accepted by IEEE Transactions on Circuits and
Systems for Video Technology (TCSVT).
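The GAN-based joint modeling in VTRL is not detailed in the abstract; as a minimal, assumed simplification of the visual-textual representation learning step, the sketch below simply L2-normalizes each modality and concatenates the features before training a linear classifier. The fuse_and_train helper and the use of scikit-learn's LogisticRegression are illustrative choices, not the paper's architecture.

import numpy as np
from sklearn.linear_model import LogisticRegression

def l2_normalize(x, eps=1e-8):
    # Row-wise L2 normalization so neither modality dominates the fusion.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def fuse_and_train(visual_feats, textual_feats, labels):
    """visual_feats: (n, dv) image features; textual_feats: (n, dt) text
    features; labels: (n,) subcategory labels."""
    joint = np.hstack([l2_normalize(visual_feats), l2_normalize(textual_feats)])
    return LogisticRegression(max_iter=1000).fit(joint, labels)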
Few-Shot Adaptation for Multimedia Semantic Indexing
We propose a few-shot adaptation framework, which bridges zero-shot learning
and supervised many-shot learning, for semantic indexing of image and video
data. Few-shot adaptation provides robust parameter estimation with few
training examples, by optimizing the parameters of zero-shot learning and
supervised many-shot learning simultaneously. In this method, we first build a
zero-shot detector and then update it using the few available examples. Our
experiments show the effectiveness of the proposed framework on three datasets:
TRECVID Semantic Indexing 2010, 2014, and ImageNet. On the ImageNet dataset, we
show that our method outperforms recent few-shot learning methods. On the
TRECVID 2014 dataset, we achieve 15.19% and 35.98% in Mean Average Precision
under the zero-shot condition and the supervised condition, respectively. To
the best of our knowledge, these are the best results on this dataset.
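A minimal sketch of the few-shot adaptation idea, assuming the zero-shot detector can be represented by class semantic vectors and the many-shot detector by per-class mean features in the same space, with the two blended by a convex combination. The alpha parameter and the blending form are assumptions for illustration, not the paper's joint optimization.

import numpy as np

def normalize_rows(m, eps=1e-8):
    return m / (np.linalg.norm(m, axis=1, keepdims=True) + eps)

def few_shot_weights(features, labels, n_classes):
    # Mean feature per class, computed from the few labelled examples.
    w = np.zeros((n_classes, features.shape[1]))
    for c in range(n_classes):
        cls_feats = features[labels == c]
        if len(cls_feats):
            w[c] = cls_feats.mean(axis=0)
    return normalize_rows(w)

def few_shot_adapted_scores(x, class_vectors, few_features, few_labels, alpha=0.5):
    """x: (d,) query feature; class_vectors: (n_classes, d) semantic vectors
    assumed to lie in the same space as the image features (e.g., after a
    learned projection, not shown here). Returns one score per class."""
    w_zero = normalize_rows(class_vectors)                    # zero-shot part
    w_few = few_shot_weights(few_features, few_labels, len(class_vectors))
    w = (1 - alpha) * w_zero + alpha * w_few                  # assumed blend
    return x @ w.T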
Pairwise Constraint Propagation on Multi-View Data
This paper presents a graph-based learning approach to pairwise constraint
propagation on multi-view data. Although pairwise constraint propagation has
been studied extensively, pairwise constraints are usually defined over pairs
of data points from a single view, i.e., only intra-view constraint propagation
is considered for multi-view tasks. In fact, very little attention has been
paid to inter-view constraint propagation, which is more challenging since
pairwise constraints are now defined over pairs of data points from different
views. In this paper, we propose to decompose the challenging inter-view
constraint propagation problem into semi-supervised learning subproblems so
that they can be efficiently solved based on graph-based label propagation. To
the best of our knowledge, this is the first attempt to give an efficient
solution to inter-view constraint propagation from a semi-supervised learning
viewpoint. Moreover, since graph-based label propagation has been adopted for
basic optimization, we develop two constrained graph construction methods for
inter-view constraint propagation, which only differ in how the intra-view
pairwise constraints are exploited. The experimental results in cross-view
retrieval have shown the promising performance of our inter-view constraint
propagation.
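Since the approach reduces inter-view constraint propagation to graph-based label propagation subproblems, a minimal sketch of the standard label-propagation building block is given below (a Zhou et al.-style closed-form solution). The Gaussian affinity construction and the alpha value are assumptions; the paper's constrained graph construction methods are not shown.

import numpy as np

def label_propagation(X, Y, alpha=0.9, sigma=1.0):
    """X: (n, d) data points; Y: (n, c) initial label/constraint scores with
    zeros for unlabelled entries. Returns the propagated score matrix F."""
    # Gaussian affinity graph with zeroed diagonal (an assumed construction).
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Symmetric normalization S = D^{-1/2} W D^{-1/2}.
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d + 1e-12))
    S = D_inv_sqrt @ W @ D_inv_sqrt
    # Closed-form solution F = (1 - alpha)(I - alpha S)^{-1} Y.
    n = X.shape[0]
    return (1 - alpha) * np.linalg.solve(np.eye(n) - alpha * S, Y)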
A Survey on Web Multimedia Mining
Modern developments in digital media technologies have made transmitting and
storing large amounts of multi/rich media data (e.g., text, images, music,
video, and their combinations) more feasible and affordable than ever before.
However, state-of-the-art techniques for processing, mining, and managing such
rich media are still in their infancy. Rapid progress in multimedia acquisition
and storage technology has led to a fast-growing, enormous amount of data
stored in databases. Useful information can be revealed to users if these
multimedia files are analyzed. Multimedia mining deals with the extraction of
implicit knowledge, multimedia data relationships, or other patterns not
explicitly stored in multimedia files. In the retrieval, indexing, and
classification of multimedia data, efficient information fusion of the
different modalities is also essential for a system's overall performance. The
purpose of this paper is to provide a systematic overview of multimedia mining.
This article also presents the issues in the application process component for
multimedia mining, followed by the multimedia mining models. Comment: 13 pages;
The International Journal of Multimedia & Its Applications (IJMA) Vol.3, No.3,
August 201
Semantic Diversity versus Visual Diversity in Visual Dictionaries
Visual dictionaries are a critical component for image
classification/retrieval systems based on the bag-of-visual-words (BoVW) model.
Dictionaries are usually learned without supervision from a training set of
images sampled from the collection of interest. However, for large,
general-purpose, dynamic image collections (e.g., the Web), obtaining a
representative sample in terms of semantic concepts is not straightforward. In
this paper, we evaluate the impact of semantics on dictionary quality, aiming
to verify the importance of semantic diversity in relation to visual diversity
for visual dictionaries. In the experiments, we vary the number of classes used
for creating the dictionary and then compute different BoVW
descriptors, using multiple codebook sizes and different coding and pooling
methods (standard BoVW and Fisher Vectors). Results for image classification
show that, since visual dictionaries are based on low-level visual appearances,
visual diversity is more important than semantic diversity. Our conclusions
open the opportunity to alleviate the burden of generating visual dictionaries,
as only a visually diverse set of images, rather than the whole collection, is
needed to create a good dictionary.
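For readers unfamiliar with the BoVW pipeline being evaluated, here is a minimal sketch of codebook learning and hard-assignment coding with sum pooling, assuming local descriptors (e.g., SIFT) are already extracted. The codebook size and the use of scikit-learn's KMeans are illustrative choices, not the exact setup used in the experiments.

import numpy as np
from sklearn.cluster import KMeans

def learn_codebook(descriptor_sets, n_words=256, seed=0):
    """descriptor_sets: list of (n_i, d) arrays of local descriptors, one per
    image in the (possibly semantically restricted) training sample."""
    all_desc = np.vstack(descriptor_sets)
    return KMeans(n_clusters=n_words, random_state=seed, n_init=10).fit(all_desc)

def bovw_histogram(descriptors, codebook):
    # Hard assignment of each descriptor to its nearest visual word,
    # sum pooling, then L1 normalization.
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / (hist.sum() + 1e-12)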
Trace transform based method for color image domain identification
Context categorization is a fundamental pre-requisite for multi-domain
multimedia content analysis applications in order to manage contextual
information in an efficient manner. In this paper, we introduce a new color
image context categorization method (DITEC) based on the trace transform. The
problem of dimensionality reduction of the obtained trace transform signal is
addressed through statistical descriptors that keep the underlying information.
These extracted features offer a highly discriminant behavior for content
categorization. The theoretical properties of the method are analyzed and
validated experimentally on two different datasets. Comment: This paper has been temporarily withdrawn.
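As a rough illustration of a trace-transform-style descriptor, the sketch below applies a functional along image rows over a set of rotations and then summarizes the resulting 2-D trace signal with a few statistics. The chosen functionals, angle step, and summary statistics are assumptions and do not reproduce the DITEC descriptors.

import numpy as np
from scipy.ndimage import rotate

def trace_signal(image, angles=np.arange(0, 180, 5), functional=np.sum):
    """image: 2-D grayscale array. Applies the functional along each row of
    the rotated image, giving one value per trace line and per angle."""
    rows = []
    for theta in angles:
        rotated = rotate(image, angle=float(theta), reshape=False, order=1)
        rows.append(functional(rotated, axis=1))
    return np.array(rows)          # shape: (n_angles, n_rows)

def trace_descriptor(image):
    # Compact statistical summary of the 2-D trace signal.
    t = trace_signal(image)
    return np.array([t.mean(), t.std(), np.median(t), t.max(), t.min()])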
Recent Advances in Zero-shot Recognition
With the recent renaissance of deep convolutional neural networks, encouraging
breakthroughs have been achieved on supervised recognition tasks, where each
class has sufficient and fully annotated training data. However, scaling
recognition to a large number of classes with few or no training samples per
class remains an unsolved problem. One approach to scaling up recognition is to
develop models capable of recognizing unseen categories without any training
instances, i.e., zero-shot recognition/learning. This article provides a
comprehensive review of existing zero-shot recognition techniques, covering
various aspects ranging from representations and models to datasets and
evaluation settings. We also overview related recognition tasks, including
one-shot and open set recognition, which can be used as natural extensions of
zero-shot recognition when a limited number of class samples becomes available
or when zero-shot recognition is deployed in a real-world setting. Importantly,
we highlight the limitations of existing approaches and point out future
research directions in this new research area. Comment: accepted by IEEE
Signal Processing Magazine.
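A minimal sketch of one common zero-shot recognition recipe covered by such surveys: learn a linear map from image features to a semantic (attribute or word-vector) space on seen classes, then label an unseen-class image by its nearest class embedding. The ridge-regression mapping and cosine matching are illustrative assumptions, not a specific method from the article.

import numpy as np

def learn_mapping(X_seen, S_seen, lam=1.0):
    """X_seen: (n, d) image features of seen-class images; S_seen: (n, k)
    semantic vectors of their classes. Ridge regression from features to
    the semantic space: W = (X^T X + lam I)^{-1} X^T S."""
    d = X_seen.shape[1]
    return np.linalg.solve(X_seen.T @ X_seen + lam * np.eye(d), X_seen.T @ S_seen)

def zero_shot_predict(x, W, unseen_class_embeddings):
    # Project the image into the semantic space and pick the unseen class
    # whose embedding is closest under cosine similarity.
    s = x @ W
    s = s / (np.linalg.norm(s) + 1e-12)
    C = unseen_class_embeddings / (
        np.linalg.norm(unseen_class_embeddings, axis=1, keepdims=True) + 1e-12)
    return int(np.argmax(C @ s))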
Web Mining Research: A Survey
With the huge amount of information available online, the World Wide Web is a
fertile area for data mining research. Web mining research lies at the
crossroads of several research communities, such as databases, information
retrieval, and, within AI, especially the sub-areas of machine learning and
natural language processing. However, there is a lot of confusion when
comparing research efforts from different points of view. In this paper, we
survey the research in the area of Web mining, point out some confusion
regarding the usage of the term Web mining, and suggest three Web mining
categories. Then we situate some of the research with respect to these three
categories. We also explore the connection between the Web mining categories
and the related agent paradigm. For the survey, we use representation issues,
the process, the learning algorithm, and the application of recent works as
the criteria. We conclude the paper with some research issues. Comment: 15
pages.
Fast Fine-grained Image Classification via Weakly Supervised Discriminative Localization
Fine-grained image classification aims to recognize hundreds of subcategories
in each basic-level category. Existing methods employ discriminative
localization to find the key distinctions among subcategories. However, they
generally have two limitations: (1) Discriminative localization relies on
region proposal methods to hypothesize the locations of discriminative regions,
which are time-consuming. (2) The training of discriminative localization
depends on object or part annotations, which are heavily labor-intensive to obtain. It is
highly challenging to address the two key limitations simultaneously, and
existing methods only focus on one of them. Therefore, we propose a weakly
supervised discriminative localization approach (WSDL) for fast fine-grained
image classification to address the two limitations at the same time, and its
main advantages are: (1) An n-pathway end-to-end discriminative localization
network is designed to improve classification speed, which simultaneously
localizes multiple different discriminative regions for one image to boost
classification accuracy, and shares full-image convolutional features generated
by the region proposal network to accelerate the process of generating region
proposals and reduce the computation of convolutional operations. (2)
Multi-level attention guided localization learning is proposed to localize
discriminative regions with different focuses automatically, without using
object and part annotations, thus avoiding labor-intensive annotation.
Attentions at different levels focus on different characteristics of the image, which are
complementary and boost the classification accuracy. Both are jointly employed
to simultaneously improve classification speed and eliminate dependence on
object and part annotations. Compared with state-of-the-art methods on two
widely used fine-grained image classification datasets, our WSDL approach
achieves the best performance. Comment: 13 pages, submitted to IEEE
Transactions on Circuits and Systems for Video Technology. arXiv admin note:
text overlap with arXiv:1709.0829
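As a generic illustration of weakly supervised discriminative localization (not the paper's n-pathway WSDL network), the sketch below computes a class activation map from the last convolutional features and a global-average-pooling classifier, then extracts a bounding box from it using only image-level labels; the threshold value is an assumption.

import numpy as np

def class_activation_map(conv_feats, fc_weights, class_idx):
    """conv_feats: (c, h, w) last convolutional feature maps of one image;
    fc_weights: (n_classes, c) weights of a global-average-pooling classifier."""
    cam = np.tensordot(fc_weights[class_idx], conv_feats, axes=([0], [0]))
    cam -= cam.min()
    return cam / (cam.max() + 1e-12)   # normalized (h, w) heat map

def discriminative_box(cam, threshold=0.5):
    # Tight bounding box (y0, x0, y1, x1) around activations above threshold.
    ys, xs = np.where(cam >= threshold)
    if len(ys) == 0:
        return 0, 0, cam.shape[0], cam.shape[1]
    return ys.min(), xs.min(), ys.max() + 1, xs.max() + 1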