SMAN: Stacked Multi-Modal Attention Network for cross-modal image-text retrieval
This article tackles cross-modal image-text retrieval, an interdisciplinary topic spanning the computer vision and natural language processing communities. Existing global representation alignment-based methods fail to pinpoint the semantically meaningful portions of images and texts, while local representation alignment schemes suffer from the huge computational burden of exhaustively aggregating the similarities between visual fragments and textual words. In this article, we propose a stacked multimodal attention network (SMAN) that uses a stacked multimodal attention mechanism to exploit the fine-grained interdependencies between image and text, thereby mapping the aggregation of attentive fragments into a common space for measuring cross-modal similarity. Specifically, we sequentially employ intramodal and multimodal information as guidance to perform multiple-step attention reasoning, so that the fine-grained correlation between image and text can be modeled. As a consequence, we can discover the semantically meaningful visual regions or sentence words that contribute to measuring cross-modal similarity more precisely. Moreover, we present a novel bidirectional ranking loss that pulls matched multimodal instances closer together, allowing us to make full use of pairwise supervision to preserve the manifold structure of heterogeneous pairwise data. Extensive experiments on two benchmark datasets demonstrate that our SMAN consistently yields competitive performance compared to state-of-the-art methods.
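To make the bidirectional ranking idea concrete, the minimal PyTorch sketch below implements a hinge-based bidirectional ranking loss for image-text retrieval: in both the image-to-text and text-to-image directions, any mismatched pair that comes within a margin of the matched pair is penalized. The class name, margin value, and plain (unweighted) hinge terms are illustrative assumptions, not SMAN's exact formulation.

import torch
import torch.nn as nn

class BidirectionalRankingLoss(nn.Module):
    """Hinge-based ranking loss applied in both retrieval directions (a sketch)."""
    def __init__(self, margin: float = 0.2):
        super().__init__()
        self.margin = margin

    def forward(self, img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        # Cosine similarity matrix (embeddings assumed L2-normalized);
        # entry (i, j) = sim(image_i, text_j); matched pairs lie on the diagonal.
        scores = img_emb @ txt_emb.t()
        pos = scores.diag().view(-1, 1)

        # Image-to-text: each image should rank its paired text above every
        # other text by at least `margin`.
        cost_i2t = (self.margin + scores - pos).clamp(min=0)
        # Text-to-image: each text should rank its paired image above every
        # other image by at least `margin`.
        cost_t2i = (self.margin + scores - pos.t()).clamp(min=0)

        # Exclude the positive pairs (diagonal) from both cost matrices.
        mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
        cost_i2t = cost_i2t.masked_fill(mask, 0)
        cost_t2i = cost_t2i.masked_fill(mask, 0)
        return cost_i2t.sum() + cost_t2i.sum()

Usage: loss = BidirectionalRankingLoss()(normalized_image_embeddings, normalized_text_embeddings), where both tensors have shape (batch, dim) and row i of each tensor forms a matched pair.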
UniMAP: Universal SMILES-Graph Representation Learning
Molecular representation learning is fundamental for many drug-related applications. Most existing molecular pre-training models are limited to a single molecular modality, either SMILES or graph representation. To effectively leverage both modalities, we argue that it is critical to capture the fine-grained 'semantics' between SMILES and graph, because subtle sequence/graph differences may lead to contrary molecular properties. In this paper, we propose a universal SMILES-graph representation learning model, namely UniMAP. Firstly, an embedding layer is employed to obtain the token and node/edge representations in SMILES and graph, respectively. A multi-layer Transformer is then utilized to conduct deep cross-modality fusion. Specifically, four kinds of pre-training tasks are designed for UniMAP, including Multi-Level Cross-Modality Masking (CMM), SMILES-Graph Matching (SGM), Fragment-Level Alignment (FLA), and Domain Knowledge Learning (DKL). In this way, both global (i.e., SGM and DKL) and local (i.e., CMM and FLA) alignments are integrated to achieve comprehensive cross-modality fusion. We evaluate UniMAP on various downstream tasks, i.e., molecular property prediction, drug-target affinity prediction, and drug-drug interaction. Experimental results show that UniMAP outperforms current state-of-the-art pre-training methods. We also visualize the learned representations to demonstrate the effect of multi-modality integration.
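As a rough illustration of the kind of cross-modality fusion described above, the sketch below concatenates SMILES token embeddings and graph node embeddings into one sequence, runs it through a shared Transformer encoder, and attaches a binary SMILES-Graph Matching (SGM) head. All dimensions, vocabulary sizes, the mean pooling, and the omission of edge features are illustrative assumptions, not UniMAP's actual configuration.

import torch
import torch.nn as nn

class SmilesGraphFusion(nn.Module):
    """Joint SMILES-token / graph-node encoder with a matching head (a sketch)."""
    def __init__(self, vocab_size=600, num_atom_types=120, d_model=256,
                 nhead=8, num_layers=4):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)      # SMILES tokens
        self.node_emb = nn.Embedding(num_atom_types, d_model)   # graph nodes (atoms)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers)  # shared cross-modality encoder
        self.sgm_head = nn.Linear(d_model, 2)                   # matched vs. mismatched pair

    def forward(self, smiles_tokens, atom_ids):
        # Concatenate both modalities into a single sequence so self-attention
        # can mix SMILES tokens and graph nodes freely.
        seq = torch.cat([self.token_emb(smiles_tokens),
                         self.node_emb(atom_ids)], dim=1)
        fused = self.fusion(seq)
        pooled = fused.mean(dim=1)          # simple mean pooling over the joint sequence
        return self.sgm_head(pooled)        # SGM logits

Toy usage: SmilesGraphFusion()(torch.randint(0, 600, (2, 40)), torch.randint(0, 120, (2, 20))) returns (2, 2) matching logits for two SMILES-graph pairs.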
Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval
Most existing cross-modal retrieval methods employ two-stream encoders with different architectures for images and texts, e.g., a CNN for images and an RNN/Transformer for texts. Such a discrepancy in architectures may induce different semantic distribution spaces, limit the interactions between images and texts, and further result in inferior alignment between the two modalities. To fill this research gap, inspired by recent advances of Transformers in vision tasks, we propose to unify the encoder architectures with Transformers for both modalities. Specifically, we design a cross-modal retrieval framework purely based on two-stream Transformers, dubbed Hierarchical Alignment Transformers (HAT), which consists of an image Transformer, a text Transformer, and a hierarchical alignment module. With such identical architectures, the encoders can produce representations with more similar characteristics for images and texts, making the interactions and alignments between them much easier. Besides, to leverage the rich semantics, we devise a hierarchical alignment scheme to explore multi-level correspondences between images and texts across different layers. To evaluate the effectiveness of the proposed HAT, we conduct extensive experiments on two benchmark datasets, MSCOCO and Flickr30K. Experimental results demonstrate that HAT outperforms SOTA baselines by a large margin. Specifically, on the two key tasks, i.e., image-to-text and text-to-image retrieval, HAT achieves 7.6% and 16.7% relative Recall@1 improvement on MSCOCO, and 4.4% and 11.6% on Flickr30K, respectively. The code is available at https://github.com/LuminosityX/HAT.
Comment: Accepted at ACM Multimedia 202
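The hierarchical alignment idea can be illustrated with a small sketch: pooled image and text features from several encoder layers are compared, and the per-layer cosine similarities are averaged into one cross-modal score. The layer selection, mean pooling, and simple averaging are assumptions for illustration, not HAT's exact alignment module.

import torch
import torch.nn.functional as F

def hierarchical_alignment_score(img_layers, txt_layers):
    """img_layers / txt_layers: lists of (batch, seq_len, dim) hidden states,
    one entry per selected encoder layer (same number of layers for both)."""
    scores = []
    for img_h, txt_h in zip(img_layers, txt_layers):
        img_vec = F.normalize(img_h.mean(dim=1), dim=-1)   # pool image patch features
        txt_vec = F.normalize(txt_h.mean(dim=1), dim=-1)   # pool text token features
        scores.append(img_vec @ txt_vec.t())               # (batch_img, batch_txt) cosine similarities
    # Average the per-layer similarity matrices; a higher entry indicates a
    # better multi-level cross-modal match between that image and text.
    return torch.stack(scores).mean(dim=0)

At retrieval time, the resulting matrix can be ranked row-wise (image-to-text) or column-wise (text-to-image) to compute Recall@K.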