FACTUAL: A Benchmark for Faithful and Consistent Textual Scene Graph Parsing
Textual scene graph parsing has become increasingly important in various
vision-language applications, including image caption evaluation and image
retrieval. However, existing scene graph parsers that convert image captions
into scene graphs often suffer from two types of errors. First, the generated
scene graphs fail to capture the true semantics of the captions or the
corresponding images, resulting in a lack of faithfulness. Second, the
generated scene graphs have high inconsistency, with the same semantics
represented by different annotations.
To address these challenges, we propose a novel dataset, which involves
re-annotating the captions in Visual Genome (VG) using a new intermediate
representation called FACTUAL-MR. FACTUAL-MR can be directly converted into
faithful and consistent scene graph annotations. Our experimental results
clearly demonstrate that the parser trained on our dataset outperforms existing
approaches in terms of faithfulness and consistency. This improvement leads to
a significant performance boost in both image caption evaluation and zero-shot
image retrieval tasks. Furthermore, we introduce a novel metric for measuring
scene graph similarity, which, when combined with the improved scene graph
parser, achieves state-of-the-art (SOTA) results on multiple benchmark datasets
for the aforementioned tasks. The code and dataset are available at
https://github.com/zhuang-li/FACTUAL.
Comment: 9 pages, ACL 2023 (findings
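To make the idea of comparing textual scene graphs concrete, here is a minimal sketch that stores a parsed caption as a set of (subject, predicate, object) triples and scores two graphs by set overlap. The triple layout, the example parses, and the Jaccard score are illustrative assumptions; this is not the FACTUAL-MR format and not the similarity metric the paper introduces.

```python
from typing import FrozenSet, Tuple

# Assumed representation: one scene graph = a set of (subject, predicate, object) triples.
Triple = Tuple[str, str, str]


def jaccard_similarity(a: FrozenSet[Triple], b: FrozenSet[Triple]) -> float:
    """Share of triples the two graphs have in common (1.0 = identical sets)."""
    if not a and not b:
        return 1.0  # two empty graphs are treated as identical
    return len(a & b) / len(a | b)


# Hypothetical parses of two captions that differ only in the horse's colour.
graph_a = frozenset({("man", "ride", "horse"), ("horse", "is", "brown")})
graph_b = frozenset({("man", "ride", "horse"), ("horse", "is", "white")})

score = jaccard_similarity(graph_a, graph_b)
print(f"similarity = {score:.3f}")
```

Exact-match overlap of this kind only works when the same semantics always yield the same triples, which is one way to see why the abstract stresses consistent annotations.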
Image-to-Image Retrieval by Learning Similarity between Scene Graphs
As a scene graph compactly summarizes the high-level content of an image in a
structured and symbolic manner, the similarity between scene graphs of two
images reflects the relevance of their contents. Based on this idea, we propose
a novel approach for image-to-image retrieval using scene graph similarity
measured by graph neural networks. In our approach, graph neural networks are
trained to predict the proxy image relevance measure, computed from
human-annotated captions using a pre-trained sentence similarity model. We
collect and publish the dataset for image relevance measured by human
annotators to evaluate retrieval algorithms. The collected dataset shows that
our method agrees better with the human perception of image similarity than other
competitive baselines.
Comment: Accepted to AAAI 202
Attribute-Graph: A Graph based approach to Image Ranking
We propose a novel image representation, termed Attribute-Graph, to rank
images by their semantic similarity to a given query image. An Attribute-Graph
is an undirected fully connected graph, incorporating both local and global
image characteristics. The graph nodes characterise objects as well as the
overall scene context using mid-level semantic attributes, while the edges
capture the object topology. We demonstrate the effectiveness of
Attribute-Graphs by applying them to the problem of image ranking. We benchmark
the performance of our algorithm on the 'rPascal' and 'rImageNet' datasets,
which we have created in order to evaluate the ranking performance on complex
queries containing multiple objects. Our experimental evaluation shows that
modelling images as Attribute-Graphs results in improved ranking performance
over existing techniques.
Comment: In IEEE International Conference on Computer Vision (ICCV) 201
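The structure described in this abstract (an undirected, fully connected graph whose nodes carry attribute scores and whose edges describe layout) can be sketched as follows. The field names, the attribute values, and the use of centre-to-centre distance as the edge weight are hypothetical choices for illustration; the abstract does not specify how the authors encode nodes or edges.

```python
from dataclasses import dataclass
from itertools import combinations
from math import dist
from typing import Dict, List, Tuple


@dataclass
class Node:
    name: str
    attributes: List[float]      # hypothetical mid-level attribute scores
    center: Tuple[float, float]  # hypothetical normalised image position


def build_fully_connected(nodes: List[Node]) -> Dict[Tuple[str, str], float]:
    """Create one undirected edge per node pair.

    The key is the sorted pair of node names, so (a, b) and (b, a) map to
    the same edge. The weight is the distance between centres, an assumed
    stand-in for object topology.
    """
    return {
        tuple(sorted((a.name, b.name))): dist(a.center, b.center)
        for a, b in combinations(nodes, 2)
    }


# One global scene node (placed at the image centre by assumption) plus two objects.
nodes = [
    Node("scene", [0.9, 0.1], (0.5, 0.5)),
    Node("dog", [0.2, 0.8], (0.2, 0.5)),
    Node("ball", [0.7, 0.3], (0.2, 0.9)),
]
edges = build_fully_connected(nodes)
for pair, weight in sorted(edges.items()):
    print(pair, round(weight, 3))
```

A fully connected graph over n nodes has n(n-1)/2 edges, so the three nodes here produce three edges.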