Because of the severely imbalanced predicate distributions in common subject-object
relations, current Scene Graph Generation (SGG) methods tend to predict
frequent predicate categories and fail to recognize rare ones. To improve the
robustness of SGG models on different predicate categories, recent research has
focused on unbiased SGG and adopted mean Recall@K (mR@K) as the main evaluation
metric. However, we identify two overlooked issues with the de facto
standard metric mR@K that make current unbiased SGG evaluation vulnerable
and unfair: 1) mR@K neglects the correlations among predicates and
unintentionally breaks category independence when ranking all the triplet
predictions together regardless of the predicate categories, so that the
performance of some predicates is underestimated. 2) mR@K neglects the
compositional diversity of different predicates and assigns excessively high
weights to samples of overly simple categories that have few composable relation
triplet types. This directly conflicts with the goal of the SGG task, which encourages
models to detect more types of visual relationship triplets. In addition, we
investigate the under-explored correlation between objects and predicates,
which can serve as a simple but strong baseline for unbiased SGG. In this
paper, we refine mR@K and propose two complementary evaluation metrics for
unbiased SGG: Independent Mean Recall (IMR) and weighted IMR (wIMR). These two
metrics are designed by considering the category independence and diversity of
composable relation triplets, respectively. We compare the proposed metrics
with the de facto standard metrics through extensive experiments and discuss
the solutions to evaluate unbiased SGG in a more trustworthy way.
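
The first issue can be made concrete with a toy example. The sketch below is an illustration only and is not the definition of IMR or wIMR, which this abstract does not spell out. It assumes a single image, assumes each prediction is already marked as a hit or miss against a distinct ground-truth triplet, and uses hypothetical function names. It compares the usual joint top-K ranking behind mR@K with a hypothetical variant that ranks each predicate category separately.

```python
from collections import defaultdict

def mean_recall_joint(preds, gt_counts, k):
    """mR@K-style score: rank ALL predictions together, keep the top k.

    preds: list of (score, predicate, hit); hit is True when the prediction
           matches a distinct ground-truth triplet (toy assumption).
    gt_counts: dict mapping predicate -> number of ground-truth triplets.
    """
    top = sorted(preds, key=lambda p: p[0], reverse=True)[:k]
    hits = defaultdict(int)
    for _, predicate, hit in top:
        if hit:
            hits[predicate] += 1
    recalls = {c: hits[c] / n for c, n in gt_counts.items()}
    return sum(recalls.values()) / len(recalls), recalls

def mean_recall_per_category(preds, gt_counts, k):
    """Hypothetical variant: rank each predicate category on its own."""
    recalls = {}
    for c, n in gt_counts.items():
        own = sorted((p for p in preds if p[1] == c),
                     key=lambda p: p[0], reverse=True)[:k]
        recalls[c] = sum(1 for p in own if p[2]) / n
    return sum(recalls.values()) / len(recalls), recalls

# Toy data: the frequent predicate "on" receives high scores and fills the
# joint top-3, so the correct but low-scored "standing on" is never counted.
preds = [
    (0.9, "on", True),
    (0.8, "on", True),
    (0.7, "on", False),
    (0.3, "standing on", True),
]
gt_counts = {"on": 2, "standing on": 1}

joint_score, joint_recalls = mean_recall_joint(preds, gt_counts, k=3)
separate_score, separate_recalls = mean_recall_per_category(preds, gt_counts, k=3)
print(joint_score, joint_recalls)
print(separate_score, separate_recalls)
```

In this toy case the joint ranking reports zero recall for "standing on" even though the model predicted it correctly, which is the kind of underestimation described in issue 1.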