Expressing Objects just like Words: Recurrent Visual Embedding for Image-Text Matching
Existing image-text matching approaches typically infer the similarity of an
image-text pair by capturing and aggregating the affinities between the text
and each independent object of the image. However, they ignore the connections
between the objects that are semantically related. These objects may
collectively determine whether the image corresponds to a text or not. To
address this problem, we propose a Dual Path Recurrent Neural Network (DP-RNN)
which processes images and sentences symmetrically by recurrent neural networks
(RNN). In particular, given an input image-text pair, our model reorders the
image objects based on the positions of their most related words in the text.
In the same way as extracting the hidden features from word embeddings, the
model leverages RNN to extract high-level object features from the reordered
object inputs. We validate that the high-level object features contain useful
joint information of semantically related objects, which benefit the retrieval
task. To compute the image-text similarity, we incorporate a Multi-attention
Cross Matching Model into DP-RNN. It aggregates the affinity between objects
and words with cross-modality guided attention and self-attention. Our model
achieves state-of-the-art performance on the Flickr30K dataset and competitive
performance on the MS-COCO dataset. Extensive experiments demonstrate the
effectiveness of our model.
Comment: Accepted by AAAI-2
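The object-reordering step described above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the function and variable names are hypothetical, and cosine similarity is assumed as the object-word affinity. Each object is assigned the position of its most related word, and objects are sorted by that position so an RNN can consume them in the same order as the sentence.

```python
import numpy as np

def reorder_objects(obj_feats, word_feats):
    """Reorder image-object features by the text position of each object's
    most related word, so an RNN can read objects in sentence order.
    Hypothetical sketch; cosine similarity is an assumed affinity measure."""
    # Normalize both modalities for cosine similarity.
    o = obj_feats / np.linalg.norm(obj_feats, axis=1, keepdims=True)
    w = word_feats / np.linalg.norm(word_feats, axis=1, keepdims=True)
    sim = o @ w.T                       # (num_objects, num_words) affinities
    best_word = sim.argmax(axis=1)      # most related word index per object
    order = np.argsort(best_word, kind="stable")  # sort objects by text position
    return obj_feats[order]
```

With the objects in text order, the same recurrent machinery used for word embeddings can extract high-level object features that carry joint information about semantically related objects.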
Enabling Efficient Equivariant Operations in the Fourier Basis via Gaunt Tensor Products
Developing equivariant neural networks for the E(3) group plays an important
role in modeling 3D data across real-world applications. Enforcing this
equivariance primarily involves the tensor products of irreducible
representations (irreps). However, the computational complexity of such
operations increases significantly as higher-order tensors are used. In this
work, we propose a systematic approach to substantially accelerate the
computation of the tensor products of irreps. We mathematically connect the
commonly used Clebsch-Gordan coefficients to the Gaunt coefficients, which are
integrals of products of three spherical harmonics. Through Gaunt coefficients,
the tensor product of irreps becomes equivalent to the multiplication between
spherical functions represented by spherical harmonics. This perspective
further allows us to change the basis for the equivariant operations from
spherical harmonics to a 2D Fourier basis. Consequently, the multiplication
between spherical functions represented by a 2D Fourier basis can be
efficiently computed via the convolution theorem and Fast Fourier Transforms.
This transformation reduces the complexity of full tensor products of irreps
from O(L^6) to O(L^3), where L is the max degree of
irreps. Leveraging this approach, we introduce the Gaunt Tensor Product, which
serves as a new method to construct efficient equivariant operations across
different model architectures. Our experiments on the Open Catalyst Project and
3BPA datasets demonstrate both the increased efficiency and improved
performance of our approach.
Comment: 36 pages; ICLR 2024 (Spotlight Presentation); Code:
https://github.com/lsj2408/Gaunt-Tensor-Produc
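The core computational idea, multiplying functions represented in a 2D Fourier basis via the convolution theorem, can be illustrated with a generic sketch. This is not the paper's implementation (which operates on irreps of E(3)); it only shows the underlying fact that a product of two functions corresponds to a convolution of their Fourier coefficient grids, which zero-padded FFTs compute efficiently:

```python
import numpy as np

def fourier_product_coeffs(a, b):
    """2D Fourier coefficients of the product of two functions whose
    coefficient grids are a and b. The product of functions corresponds
    to the full 2D (linear) convolution of their coefficients, computed
    here via the convolution theorem with zero-padded FFTs.
    Generic illustration, not the paper's exact code."""
    # Pad to the full linear-convolution size to avoid circular wrap-around.
    n0 = a.shape[0] + b.shape[0] - 1
    n1 = a.shape[1] + b.shape[1] - 1
    A = np.fft.fft2(a, s=(n0, n1))
    B = np.fft.fft2(b, s=(n0, n1))
    # Pointwise product in the transform domain = convolution of coefficients.
    return np.fft.ifft2(A * B)
```

For coefficient grids of side N, the direct convolution costs O(N^4) while the FFT route costs O(N^2 log N), which is the kind of saving that makes higher-degree equivariant operations tractable.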