Fine-Grained Image Analysis with Deep Learning: A Survey
Fine-grained image analysis (FGIA) is a longstanding and fundamental problem
in computer vision and pattern recognition, and underpins a diverse set of
real-world applications. The task of FGIA is to analyze visual objects from
subordinate categories, e.g., species of birds or models of cars. The small
inter-class and large intra-class variation inherent to fine-grained image
analysis makes it a challenging problem. Capitalizing on advances in deep
learning, in recent years we have witnessed remarkable progress in deep
learning powered FGIA. In this paper we present a systematic survey of these
advances, where we attempt to re-define and broaden the field of FGIA by
consolidating two fundamental fine-grained research areas -- fine-grained image
recognition and fine-grained image retrieval. In addition, we also review other
key issues of FGIA, such as publicly available benchmark datasets and related
domain-specific applications. We conclude by highlighting several research
directions and open problems which need further exploration from the community.
Comment: Accepted by IEEE TPAMI
Expert Knowledge-Guided Length-Variant Hierarchical Label Generation for Proposal Classification
To advance the development of science and technology, research proposals are
submitted to open-call competitive programs developed by government agencies
(e.g., NSF). Proposal classification is one of the most important tasks to
achieve effective and fair review assignments. Proposal classification aims to
classify a proposal into a length-variant sequence of labels. In this paper, we
formulate the proposal classification problem into a hierarchical multi-label
classification task. Although there are some prior studies, proposal
classification exhibits unique features: 1) the classification result of a
proposal is in a hierarchical discipline structure with different levels of
granularity; 2) proposals contain multiple types of documents; 3) domain
experts can empirically provide partial labels that can be leveraged to improve
task performance. In this paper, we focus on developing a new deep proposal
classification framework to jointly model the three features. In particular, to
sequentially generate labels, we leverage previously-generated labels to
predict the label at the next level; to integrate partial labels from experts, we
use the embedding of these empirical partial labels to initialize the state of
neural networks. Our model automatically identifies the appropriate length of
the label sequence and stops predicting further labels. Finally, we present extensive results
to demonstrate that our method can jointly model partial labels, textual
information, and semantic dependencies in label sequences, and thus achieve
strong performance.
Comment: 10 pages, accepted as a regular paper by ICDM 2021
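The abstract names two concrete mechanisms: each level's label is predicted conditioned on the previously generated label, and the recurrent state is initialized from embeddings of the expert-provided partial labels, with the model deciding when to stop. A minimal PyTorch sketch of such a length-variant decoder is given below; the GRU/stop-head design and all names and dimensions are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class HierarchicalLabelDecoder(nn.Module):
    """Length-variant label decoder (illustrative, not the paper's code)."""
    def __init__(self, num_labels, proposal_dim, emb_dim=128, hid_dim=256):
        super().__init__()
        self.start_idx = num_labels                       # reserved <start> token
        self.label_emb = nn.Embedding(num_labels + 1, emb_dim)
        self.init_proj = nn.Linear(proposal_dim + emb_dim, hid_dim)
        self.gru = nn.GRUCell(emb_dim, hid_dim)
        self.label_head = nn.Linear(hid_dim, num_labels)  # next-level label
        self.stop_head = nn.Linear(hid_dim, 1)            # continue vs. stop

    def forward(self, proposal_vec, partial_labels, max_depth=4):
        # Seed the recurrent state with the proposal representation plus the
        # mean embedding of the expert-provided partial labels.
        partial = self.label_emb(partial_labels).mean(dim=1)
        h = torch.tanh(self.init_proj(torch.cat([proposal_vec, partial], dim=-1)))
        prev = torch.full((proposal_vec.size(0),), self.start_idx,
                          dtype=torch.long, device=proposal_vec.device)
        label_logits, stop_logits = [], []
        for _ in range(max_depth):
            h = self.gru(self.label_emb(prev), h)         # condition on prev label
            step_logits = self.label_head(h)
            label_logits.append(step_logits)
            stop_logits.append(self.stop_head(h).squeeze(-1))
            prev = step_logits.argmax(dim=-1)             # feed prediction forward
        return torch.stack(label_logits, 1), torch.stack(stop_logits, 1)
```

At inference, decoding would halt at the first step whose stop logit is positive, yielding the length-variant label sequence the abstract describes.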
Context Embedding Networks
Low dimensional embeddings that capture the main variations of interest in
collections of data are important for many applications. One way to construct
these embeddings is to acquire estimates of similarity from the crowd. However,
similarity is a multi-dimensional concept that varies from individual to
individual. Existing models for learning embeddings from the crowd typically
make simplifying assumptions, such as that all individuals estimate similarity
using the same criteria, that the list of criteria is known in advance, or that
crowd workers are not influenced by the data they see. To overcome these
limitations we introduce Context Embedding Networks (CENs). In addition to
learning interpretable embeddings from images, CENs also model worker biases
for different attributes along with the visual context, i.e., the visual
attributes highlighted by a set of images. Experiments on two noisy
crowd-annotated datasets show that modeling both worker bias and visual context
results in more interpretable embeddings compared to existing approaches.
Comment: CVPR 2018 spotlight
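To make the worker-bias idea concrete, here is a small, purely illustrative PyTorch sketch (not the CEN formulation, which additionally conditions on the visual context of the image set shown to each worker): a shared item embedding is combined with per-worker gates over embedding dimensions, and crowd triplet judgments are scored through the gated distances.

```python
import torch
import torch.nn as nn

class CrowdTripletModel(nn.Module):
    """Worker-aware similarity model (illustrative, not the CEN code)."""
    def __init__(self, num_items, num_workers, dim=8):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, dim)       # shared item embedding
        self.worker_gate = nn.Embedding(num_workers, dim)  # per-worker bias

    def forward(self, worker, anchor, pos, neg):
        # Each worker attends to their own subset of embedding dimensions.
        w = torch.sigmoid(self.worker_gate(worker))
        a, p, n = self.item_emb(anchor), self.item_emb(pos), self.item_emb(neg)
        d_pos = (w * (a - p) ** 2).sum(-1)
        d_neg = (w * (a - n) ** 2).sum(-1)
        # Probability the worker judges the anchor closer to `pos`.
        return torch.sigmoid(d_neg - d_pos)
```

Training would then maximize the likelihood of the observed crowd answers, e.g. with binary cross-entropy on the returned probabilities.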
Delving into Multimodal Prompting for Fine-grained Visual Classification
Fine-grained visual classification (FGVC) involves categorizing fine
subdivisions within a broader category, which poses challenges due to subtle
inter-class discrepancies and large intra-class variations. However, prevailing
approaches primarily focus on uni-modal visual concepts. Recent advancements in
pre-trained vision-language models have demonstrated remarkable performance in
various high-level vision tasks, yet the applicability of such models to FGVC
tasks remains uncertain. In this paper, we aim to fully exploit the
capabilities of cross-modal description to tackle FGVC tasks and propose a
novel multimodal prompting solution, denoted as MP-FGVC, based on the
contrastive language-image pre-training (CLIP) model. Our MP-FGVC comprises a
multimodal prompt scheme and a multimodal adaptation scheme. The former
includes a Subcategory-specific Vision Prompt (SsVP) and a Discrepancy-aware Text
Prompt (DaTP), which explicitly highlight the subcategory-specific
discrepancies from the perspectives of both vision and language. The latter
aligns the vision and text prompting elements in a common semantic space,
facilitating cross-modal collaborative reasoning through a Vision-Language
Fusion Module (VLFM) for further improvement on FGVC. Moreover, we tailor a
two-stage optimization strategy for MP-FGVC to fully leverage the pre-trained
CLIP model and expedite efficient adaptation for FGVC. Extensive experiments
conducted on four FGVC datasets demonstrate the effectiveness of our MP-FGVC.
Comment: The first two authors contributed equally to this work
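As a rough illustration of the general recipe (learnable prompt vectors plus a fusion layer over frozen CLIP features), the sketch below is a hypothetical stand-in; the SsVP/DaTP internals, dimensions, and module names are assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn

class PromptedFusionHead(nn.Module):
    """Prompt + fusion head over frozen CLIP features (illustrative only)."""
    def __init__(self, num_classes, embed_dim=512, n_prompts=4):
        super().__init__()
        # Learnable text-side prompt vectors (a stand-in for DaTP; real
        # prompts would be prepended to CLIP's token embeddings).
        self.prompts = nn.Parameter(torch.randn(n_prompts, embed_dim) * 0.02)
        # Stand-in for the Vision-Language Fusion Module (VLFM).
        self.fuse = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, image_feats, text_feats):
        # image_feats, text_feats: (B, D) from frozen CLIP encoders, with
        # vision-side prompts (SsVP) assumed to be injected upstream.
        B = image_feats.size(0)
        prompts = self.prompts.unsqueeze(0).expand(B, -1, -1)
        tokens = torch.cat([image_feats[:, None], text_feats[:, None], prompts], 1)
        fused = self.fuse(tokens)            # cross-modal reasoning in one space
        return self.classifier(fused[:, 0])  # classify from the fused image token
```

In this sketch the prompts and fusion layer are the only trainable parts, mirroring the idea of efficiently adapting a frozen pre-trained CLIP model to FGVC.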