Learning Social Image Embedding with Deep Multimodal Attention Networks
Learning social media data embeddings with deep models has attracted
extensive research interest and has fueled many applications, such as link
prediction, classification, and cross-modal search. However, for social
images, which contain both link information and multimodal content (e.g.,
text descriptions and visual content), an embedding learnt from the network
structure or the data content alone yields a sub-optimal social image
representation. In this paper, we propose a novel social image embedding
approach called Deep Multimodal Attention Networks (DMAN), which employs a deep
model to jointly embed multimodal contents and link information. Specifically,
to effectively capture correlations across modalities, we propose
a multimodal attention network that encodes the fine-grained relations between
image regions and textual words. To leverage the network structure for
embedding learning, a novel Siamese-Triplet neural network is proposed to model
the links among images. With the joint deep model, the learnt embedding
captures both the multimodal content and the nonlinear network structure.
Extensive experiments are conducted to investigate the effectiveness of our
approach in the applications of multi-label classification and cross-modal
search. Compared to state-of-the-art image embeddings, our proposed DMAN
achieves significant improvements in both tasks.
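The two core ideas in the abstract, attention over image-region/word pairs and a triplet objective over linked images, can be illustrated with a minimal sketch. This is not the authors' implementation; all shapes, function names, and the pooling scheme are illustrative assumptions:

```python
# Hedged sketch of (1) region-word attention and (2) a Siamese-triplet-style
# link loss, as described at a high level in the DMAN abstract.
# Shapes and names are assumptions, not the paper's actual architecture.
import numpy as np

def region_word_attention(regions, words):
    """regions: (R, d) image-region features; words: (W, d) word features.
    Each region attends over all words; the result is pooled to one vector."""
    scores = regions @ words.T                     # (R, W) affinities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    attended = weights @ words                     # (R, d) word context per region
    return (regions + attended).mean(axis=0)       # pooled joint embedding, (d,)

def triplet_loss(anchor, linked, unlinked, margin=0.2):
    """Hinge loss pushing a linked image closer to the anchor than an
    unlinked one, by at least `margin` in squared Euclidean distance."""
    d_pos = np.sum((anchor - linked) ** 2)
    d_neg = np.sum((anchor - unlinked) ** 2)
    return max(0.0, margin + d_pos - d_neg)

rng = np.random.default_rng(0)
emb = region_word_attention(rng.normal(size=(4, 8)), rng.normal(size=(6, 8)))
print(emb.shape)  # (8,)
```

In a real model the attention and triplet components would be trained jointly with learned projections; the sketch only shows how the two losses fit together on fixed features.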
RGB2LIDAR: Towards Solving Large-Scale Cross-Modal Visual Localization
We study an important, yet largely unexplored problem of large-scale
cross-modal visual localization by matching ground RGB images to a
geo-referenced aerial LIDAR 3D point cloud (rendered as depth images). Prior
works were demonstrated on small datasets and do not scale to larger
applications. To enable large-scale evaluation, we
introduce a new dataset containing over 550K pairs (covering 143 km^2 area) of
RGB and aerial LIDAR depth images. We propose a novel joint embedding based
method that effectively combines the appearance and semantic cues from both
modalities to handle drastic cross-modal variations. Experiments on the
proposed dataset show that our model achieves a strong result of a median rank
of 5 in matching across a large test set of 50K location pairs collected from a
14 km^2 area. This represents a significant advancement over prior works in
performance and scale. We conclude with qualitative results to highlight the
challenging nature of this task and the benefits of the proposed model. Our
work provides a foundation for further research in cross-modal visual
localization.

Comment: ACM Multimedia 202
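The headline metric in this abstract, median rank over a large cross-modal test set, is straightforward to compute once embeddings exist for both modalities. A small sketch (not the paper's code) with random stand-in embeddings:

```python
# Hedged sketch of median-rank evaluation for cross-modal retrieval:
# for each RGB query embedding, rank all LIDAR-depth embeddings by
# distance and record where the true match lands. Names and the
# distance choice (squared Euclidean) are assumptions.
import numpy as np

def median_rank(query_emb, ref_emb):
    """query_emb[i] should match ref_emb[i]; both are (N, d) arrays.
    Returns the median 1-based rank of each query's true match."""
    # pairwise squared Euclidean distances, (N, N)
    d = ((query_emb[:, None, :] - ref_emb[None, :, :]) ** 2).sum(-1)
    order = np.argsort(d, axis=1)  # reference indices, nearest first
    # position of the true match (index i) within row i's ranking
    ranks = np.argmax(order == np.arange(len(d))[:, None], axis=1) + 1
    return float(np.median(ranks))

rng = np.random.default_rng(0)
refs = rng.normal(size=(100, 16))
queries = refs + 0.01 * rng.normal(size=refs.shape)  # near-perfect matches
print(median_rank(queries, refs))  # 1.0 when every true match ranks first
```

At the paper's 50K-pair scale the full (N, N) distance matrix would be computed in blocks, but the metric itself is the same.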