Multi-Modal Multi-Scale Deep Learning for Large-Scale Image Annotation
Image annotation aims to annotate a given image with a variable number of
class labels corresponding to diverse visual concepts. In this paper, we
address two main issues in large-scale image annotation: 1) how to learn a rich
feature representation suitable for predicting a diverse set of visual concepts
ranging from objects and scenes to abstract concepts; 2) how to annotate an image
with the optimal number of class labels. To address the first issue, we propose
a novel multi-scale deep model for extracting rich and discriminative features
capable of representing a wide range of visual concepts. Specifically, we
propose a two-branch deep neural network architecture comprising a very
deep main network branch and a companion feature fusion network branch designed
for fusing the multi-scale features computed from the main branch. The deep
model is also made multi-modal by taking noisy user-provided tags as model
input to complement the image input. To tackle the second issue, we
introduce a label quantity prediction auxiliary task to the main label
prediction task to explicitly estimate the optimal label number for a given
image. Extensive experiments are carried out on two large-scale image
annotation benchmark datasets and the results show that our method
significantly outperforms the state-of-the-art.
Comment: Submitted to IEEE TI
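The abstract describes the architecture only at a high level. Below is a minimal PyTorch sketch of how such a two-branch, multi-modal annotator with a label-quantity auxiliary head could be wired up; the ResNet-50 backbone, feature dimensions, tag encoding, and module names are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the described two-branch, multi-modal model.
# Backbone choice, dimensions, and naming are assumptions, not the paper's exact design.
import torch
import torch.nn as nn
import torchvision.models as models

class MultiScaleMultiModalAnnotator(nn.Module):
    def __init__(self, num_labels, tag_vocab_size, max_labels=10):
        super().__init__()
        backbone = models.resnet50(weights=None)              # main (very deep) branch
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)
        self.stages = nn.ModuleList([backbone.layer1, backbone.layer2,
                                     backbone.layer3, backbone.layer4])
        self.pool = nn.AdaptiveAvgPool2d(1)
        # companion branch: project each stage's features, then fuse them
        dims = [256, 512, 1024, 2048]
        self.proj = nn.ModuleList([nn.Linear(d, 256) for d in dims])
        # noisy user tags enter as a bag-of-words vector and are embedded
        self.tag_encoder = nn.Sequential(nn.Linear(tag_vocab_size, 256), nn.ReLU())
        fused_dim = 256 * len(dims) + 256
        self.label_head = nn.Linear(fused_dim, num_labels)     # main label prediction task
        self.quantity_head = nn.Linear(fused_dim, max_labels)  # auxiliary label-quantity task

    def forward(self, image, tags):
        x = self.stem(image)
        feats = []
        for stage, proj in zip(self.stages, self.proj):
            x = stage(x)
            feats.append(proj(self.pool(x).flatten(1)))        # multi-scale features
        fused = torch.cat(feats + [self.tag_encoder(tags)], dim=1)
        return self.label_head(fused), self.quantity_head(fused)
```

At inference time, one plausible use of the two heads is to rank labels by the sigmoid scores from `label_head` and keep the top-k, with k taken as the argmax of `quantity_head`.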
Detecting and Grounding Multi-Modal Media Manipulation and Beyond
Misinformation has become a pressing issue. Fake media, in both visual and
textual forms, is widespread on the web. While various deepfake detection and
text fake news detection methods have been proposed, they are designed only for
single-modality forgery based on binary classification, and cannot analyze or
reason about subtle forgery traces across different modalities. In this paper, we
highlight a new research problem for multi-modal fake media, namely Detecting
and Grounding Multi-Modal Media Manipulation (DGM^4). DGM^4 aims to not only
detect the authenticity of multi-modal media, but also ground the manipulated
content, which requires deeper reasoning of multi-modal media manipulation. To
support a large-scale investigation, we construct the first DGM^4 dataset,
where image-text pairs are manipulated by various approaches, with rich
annotation of diverse manipulations. Moreover, we propose a novel HierArchical
Multi-modal Manipulation rEasoning tRansformer (HAMMER) to fully capture the
fine-grained interaction between different modalities. HAMMER performs 1)
manipulation-aware contrastive learning between two uni-modal encoders as
shallow manipulation reasoning, and 2) modality-aware cross-attention by
a multi-modal aggregator as deep manipulation reasoning. Dedicated manipulation
detection and grounding heads are integrated from shallow to deep levels based
on the interacted multi-modal information. To exploit more fine-grained
contrastive learning for cross-modal semantic alignment, we further integrate
Manipulation-Aware Contrastive Loss with Local View and construct a more
advanced model HAMMER++. Finally, we build an extensive benchmark and set up
rigorous evaluation metrics for this new research problem. Comprehensive
experiments demonstrate the superiority of HAMMER and HAMMER++.
Comment: Extension of our CVPR 2023 paper: arXiv:2304.02556. Code:
https://github.com/rshaojimmy/MultiModal-DeepFak
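As a rough illustration of the shallow-to-deep reasoning the abstract outlines, the sketch below pairs an InfoNCE-style contrastive loss between uni-modal embeddings (shallow reasoning) with a cross-attention aggregator feeding detection and grounding heads (deep reasoning). The encoders here are placeholder linear projections standing in for ViT/BERT-like backbones, and all dimensions and head shapes are assumptions, not HAMMER's actual design.

```python
# Hypothetical sketch of the two-level reasoning described for HAMMER.
# Encoder choices, dimensions, and head shapes are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HammerSketch(nn.Module):
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.img_encoder = nn.Linear(2048, dim)   # stand-in for an image encoder
        self.txt_encoder = nn.Linear(768, dim)    # stand-in for a text encoder
        # deep reasoning: modality-aware cross-attention aggregator
        self.aggregator = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.binary_head = nn.Linear(dim, 2)      # real vs. manipulated
        self.grounding_head = nn.Linear(dim, 4)   # e.g. a manipulated-region box

    def contrastive_loss(self, img, txt, temperature=0.07):
        # shallow reasoning: align matched pairs, push apart mismatched ones
        img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
        logits = img @ txt.t() / temperature
        targets = torch.arange(img.size(0), device=img.device)
        return F.cross_entropy(logits, targets)

    def forward(self, img_feats, txt_feats):
        img = self.img_encoder(img_feats)          # (B, N_img, dim)
        txt = self.txt_encoder(txt_feats)          # (B, N_txt, dim)
        loss_mac = self.contrastive_loss(img.mean(1), txt.mean(1))
        fused, _ = self.aggregator(txt, img, img)  # text tokens attend to image tokens
        cls = fused.mean(1)
        return self.binary_head(cls), self.grounding_head(cls), loss_mac
```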
Fine-grained Image Classification via Combining Vision and Language
Fine-grained image classification aims to recognize hundreds of sub-categories
belonging to the same basic-level category, and is challenging due to large
intra-class variance and small inter-class variance. Most
existing fine-grained image classification methods generally learn part
detection models to obtain the semantic parts for better classification
accuracy. Despite achieving promising results, these methods mainly have two
limitations: (1) not all the parts obtained through the part detection
models are beneficial and indispensable for classification, and (2)
fine-grained image classification requires more detailed visual descriptions
which cannot be provided by the part locations or attribute annotations. To
address the above two limitations, this paper proposes a two-stream model
combining vision and language (CVL) for learning latent semantic
representations. The vision stream learns deep representations from the
original visual information via a deep convolutional neural network. The language
stream utilizes natural language descriptions, which can point out the
discriminative parts or characteristics for each image, and provides a flexible
and compact way of encoding the salient visual aspects for distinguishing
sub-categories. Since the two streams are complementary, combining them
further improves classification accuracy. Compared with 12 state-of-the-art
methods on the widely used CUB-200-2011 dataset for fine-grained image
classification, our CVL approach achieves the best performance.
Comment: 9 pages, to appear in CVPR 201
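To make the two-stream idea concrete, here is a minimal PyTorch sketch that fuses a CNN vision stream with a recurrent language stream over each image's natural-language description. The ResNet-50 backbone, GRU text encoder, and simple concatenation fusion are assumptions for illustration, not the paper's exact CVL design.

```python
# Hypothetical sketch of a two-stream vision + language classifier in the
# spirit of CVL. Backbone, text encoder, and fusion strategy are assumptions.
import torch
import torch.nn as nn
import torchvision.models as models

class TwoStreamCVL(nn.Module):
    def __init__(self, num_classes, text_vocab_size, text_dim=256):
        super().__init__()
        # vision stream: deep CNN features from the image
        cnn = models.resnet50(weights=None)
        self.vision = nn.Sequential(*list(cnn.children())[:-1])  # drop the final fc layer
        # language stream: encode the natural-language description
        self.embed = nn.Embedding(text_vocab_size, text_dim, padding_idx=0)
        self.text_rnn = nn.GRU(text_dim, text_dim, batch_first=True)
        self.classifier = nn.Linear(2048 + text_dim, num_classes)

    def forward(self, image, description_tokens):
        v = self.vision(image).flatten(1)                  # (B, 2048) visual features
        e = self.embed(description_tokens)                 # (B, T, text_dim)
        _, h = self.text_rnn(e)                            # final hidden state
        t = h.squeeze(0)                                   # (B, text_dim) text features
        return self.classifier(torch.cat([v, t], dim=1))   # fused prediction
```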