On the Ambiguity of Rank-Based Evaluation of Entity Alignment or Link Prediction Methods
In this work, we take a closer look at the evaluation of two families of
methods for enriching information from knowledge graphs: Link Prediction and
Entity Alignment. In the current experimental setting, multiple different
scores are employed to assess different aspects of model performance. We
analyze the informativeness of these evaluation measures and identify several
shortcomings. In particular, we demonstrate that none of the existing scores
is suitable for comparing results across different datasets. Moreover, merely
varying the size of the test set changes the reported performance of the same
model under the metrics commonly used for the Entity Alignment task.
We show that this leads to various problems in the interpretation of results,
which may support misleading conclusions. Therefore, we propose adjustments to
the evaluation and demonstrate empirically how this supports a fair,
comparable, and interpretable assessment of model performance. Our code is
available at https://github.com/mberr/rank-based-evaluation
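To make the comparability problem concrete, the sketch below contrasts the
plain mean rank with a size-adjusted variant that divides by the expected rank
of a uniformly random scorer, (n + 1) / 2 for n candidates. The function names
and the simulated ranks are illustrative assumptions, and the adjustment shown
is only one plausible instance of the kind of correction the abstract alludes
to, not necessarily the paper's exact proposal.

```python
import numpy as np

def mean_rank(ranks):
    # Plain mean rank: lower is better, but its scale grows with the number
    # of candidates, so values are not comparable across datasets.
    return float(np.mean(ranks))

def adjusted_mean_rank(ranks, num_candidates):
    # Mean rank divided by its expectation under a uniformly random scorer,
    # (num_candidates + 1) / 2. A value near 1.0 means "no better than
    # random"; values near 0 mean near-perfect ranking, independent of the
    # candidate-set size.
    expected = (num_candidates + 1) / 2.0
    return mean_rank(ranks) / expected

rng = np.random.default_rng(0)
for n in (1_000, 10_000):  # two candidate-set (test) sizes
    # A model of fixed quality: the true entity lands in the top 1 percent.
    ranks = rng.integers(1, max(2, n // 100), size=5_000)
    print(n, mean_rank(ranks), adjusted_mean_rank(ranks, n))
```

The raw mean rank roughly scales with the candidate-set size, while the
adjusted value stays in the same range, which is the behaviour a cross-dataset
comparison needs.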
Knowledge Base Completion: Baseline strikes back (Again)
Knowledge Base Completion has been a very active area recently, with
multiplicative models generally outperforming additive and other deep learning
approaches such as GNN-, CNN-, and path-based models. Several recent KBC papers
propose architectural changes, new training methods, or even a new problem
reformulation. They evaluate their methods on standard benchmark datasets --
FB15k, FB15k-237, WN18, WN18RR, and Yago3-10. Recently, some papers discussed
how 1-N scoring can speed up training and evaluation. In this paper, we discuss
how simply applying this training regime to a basic model such as ComplEx
gives near-SOTA performance on all the datasets -- we call this model
COMPLEX-V2. We also highlight how various multiplicative methods recently
proposed in the literature benefit from this trick and become
indistinguishable in terms of performance on most datasets. This paper calls
for a reassessment of their individual value in light of these findings.
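As an illustration of the 1-N scoring trick mentioned above, the sketch below
scores a batch of (head, relation) queries against every candidate tail entity
in a single matrix product, using a ComplEx-style scoring function
Re(<h, r, conj(t)>). The function name, tensor layout, and toy usage are
assumptions for illustration; this is not the paper's implementation.

```python
import torch

def complex_1_to_n(h_re, h_im, r_re, r_im, E_re, E_im):
    # 1-N scoring: score each (head, relation) query against all N entities
    # at once instead of one (h, r, t) triple at a time.
    #   h_re, h_im : [B, d]  head embeddings (real / imaginary parts)
    #   r_re, r_im : [B, d]  relation embeddings
    #   E_re, E_im : [N, d]  embedding table of all candidate tail entities
    #   returns    : [B, N]  Re(<h, r, conj(t)>) for every entity as tail
    hr_re = h_re * r_re - h_im * r_im  # real part of the product h * r
    hr_im = h_re * r_im + h_im * r_re  # imaginary part of the product h * r
    # Reduce the embedding dimension against the whole entity table in one
    # matmul per part -- this is where the speed-up over 1-1 scoring comes from.
    return hr_re @ E_re.t() + hr_im @ E_im.t()

# Toy usage with random embeddings (dimensions are illustrative).
B, N, d = 4, 1000, 64
h_re, h_im = torch.randn(B, d), torch.randn(B, d)
r_re, r_im = torch.randn(B, d), torch.randn(B, d)
E_re, E_im = torch.randn(N, d), torch.randn(N, d)
scores = complex_1_to_n(h_re, h_im, r_re, r_im, E_re, E_im)  # shape [4, 1000]
```

Training with this layout typically treats each row of the [B, N] score matrix
as a set of binary labels over all entities, which is what makes both training
and evaluation against the full entity set cheap.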