The field of relation extraction (RE) is experiencing a notable shift towards
generative relation extraction (GRE), leveraging the capabilities of large
language models (LLMs). However, we discovered that traditional RE metrics such as precision and recall fall short in evaluating GRE
methods. This shortfall arises because these metrics rely on exact matching
with human-annotated reference relations, while GRE methods often produce
diverse and semantically accurate relations that differ from the references. To
fill this gap, we introduce GenRES, a multi-dimensional framework that assesses GRE results in terms of topic similarity, uniqueness, granularity, factualness, and completeness. With GenRES, we empirically identified that (1)
precision and recall fail to reflect the actual performance of GRE methods; (2) human-annotated reference relations can be incomplete; (3) prompting LLMs
with a fixed set of relations or entities can cause hallucinations. Next, we
conducted a human evaluation of GRE methods, which shows that GenRES is consistent with human preferences for RE quality. Last, we performed a comprehensive evaluation of fourteen leading LLMs using GenRES on document-, bag-, and sentence-level RE datasets to establish a benchmark for future GRE research.