Evaluating the performance of graph neural networks (GNNs) is essential for practical
GNN model deployment and serving, since deployed GNNs face significant performance
uncertainty when performing inference on unseen and unlabeled test graphs due to
mismatched training-test graph distributions. In this paper, we
study a new problem, GNN model evaluation, which aims to assess the performance
of a specific GNN model trained on observed, labeled graphs by precisely
estimating its performance (e.g., node classification accuracy) on unseen,
unlabeled graphs. Concretely, we propose a two-stage GNN model evaluation
framework consisting of (1) DiscGraph set construction and (2) GNNEvaluator
training and inference. The DiscGraph set captures a wide range of diverse graph
data distribution discrepancies through a discrepancy measurement function that
exploits the GNN's outputs, namely its latent node embeddings and node class
predictions. Supervised by the DiscGraph set, the GNNEvaluator learns to
precisely estimate the node classification accuracy of the GNN model under
evaluation and makes accurate inferences for evaluating
GNN model performance. Extensive experiments on real-world unseen and unlabeled
test graphs demonstrate the effectiveness of our proposed method for GNN model
evaluation.
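
The abstract describes the two stages only at a high level. The sketch below is a minimal, hypothetical instantiation of the idea, not the paper's actual implementation: it assumes a plain-PyTorch setting where a graph is a (node features, normalized adjacency) pair, the trained GNN exposes node embeddings and class logits, and accuracy labels are available for the observed graphs used to build a DiscGraph-style training set. All names (TrainedGNN, compute_discrepancy, GNNEvaluatorSketch, train_evaluator) are assumptions introduced for illustration.

    # Minimal sketch (assumed setting, not the paper's code) of the two-stage idea:
    # (1) build discrepancy features from a trained GNN's latent node embeddings
    #     and class predictions on observed vs. meta graphs,
    # (2) train a small regressor ("GNNEvaluator") mapping those features to accuracy.
    import torch
    import torch.nn as nn

    class TrainedGNN(nn.Module):
        """Stand-in for the to-be-evaluated GNN: one GCN-like layer + classifier."""
        def __init__(self, in_dim, hid_dim, n_classes):
            super().__init__()
            self.enc = nn.Linear(in_dim, hid_dim)
            self.cls = nn.Linear(hid_dim, n_classes)

        def forward(self, x, adj):
            h = torch.relu(adj @ self.enc(x))   # latent node embeddings
            return h, self.cls(h)               # embeddings and class logits

    def compute_discrepancy(gnn, obs_graph, other_graph):
        """Hypothetical discrepancy measurement: compares the GNN's embeddings and
        predicted class distributions between an observed graph and another graph."""
        (x_o, a_o), (x_t, a_t) = obs_graph, other_graph
        with torch.no_grad():
            h_o, logits_o = gnn(x_o, a_o)
            h_t, logits_t = gnn(x_t, a_t)
        emb_gap = (h_o.mean(0) - h_t.mean(0)).norm()        # embedding-space shift
        p_o = logits_o.softmax(-1).mean(0)
        p_t = logits_t.softmax(-1).mean(0)
        pred_gap = (p_o - p_t).abs().sum()                  # class-distribution shift
        return torch.stack([emb_gap, pred_gap])             # 2-d discrepancy feature

    class GNNEvaluatorSketch(nn.Module):
        """Tiny regressor that maps a discrepancy feature to an accuracy estimate."""
        def __init__(self):
            super().__init__()
            self.mlp = nn.Sequential(nn.Linear(2, 16), nn.ReLU(),
                                     nn.Linear(16, 1), nn.Sigmoid())

        def forward(self, disc_feat):
            return self.mlp(disc_feat).squeeze(-1)

    def train_evaluator(gnn, obs_graph, meta_graphs, meta_accuracies, epochs=200):
        """Fits the evaluator on (discrepancy feature, accuracy) pairs built from
        meta graphs whose accuracy is known (a DiscGraph-style training set)."""
        feats = torch.stack([compute_discrepancy(gnn, obs_graph, g) for g in meta_graphs])
        accs = torch.tensor(meta_accuracies)
        evaluator = GNNEvaluatorSketch()
        opt = torch.optim.Adam(evaluator.parameters(), lr=1e-2)
        for _ in range(epochs):
            opt.zero_grad()
            loss = nn.functional.mse_loss(evaluator(feats), accs)
            loss.backward()
            opt.step()
        return evaluator

    # Inference: estimate accuracy on an unseen, unlabeled test graph.
    # est_acc = evaluator(compute_discrepancy(gnn, obs_graph, unseen_graph))

The design choice here is that the evaluator never touches test labels: it only sees distribution discrepancies computed from the frozen GNN's own outputs, which is what lets it be applied to unseen, unlabeled graphs at inference time.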