In statistics and machine learning, measuring the similarity between two or
more datasets is important for several purposes. The performance of a
predictive model on novel datasets, referred to as generalizability, critically
depends on how similar the dataset used for fitting the model is to the novel
datasets. Exploiting or transferring insights between similar datasets is a key
aspect of meta-learning and transfer learning. Two-sample testing asks whether
the underlying (multivariate) distributions of two datasets coincide.
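For illustration, the following minimal Python sketch (our own example with
arbitrary simulated data, not a method singled out by this review) performs a
classical univariate Kolmogorov-Smirnov two-sample test via
scipy.stats.ks_2samp:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    x = rng.normal(loc=0.0, scale=1.0, size=200)  # sample from P
    y = rng.normal(loc=0.5, scale=1.0, size=200)  # sample from Q

    # H0: both samples are drawn from the same distribution.
    stat, p_value = stats.ks_2samp(x, y)
    print(f"KS statistic = {stat:.3f}, p-value = {p_value:.3g}")

A small p-value provides evidence against the null hypothesis that the two
datasets stem from the same distribution.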
Numerous approaches for quantifying dataset similarity have been
proposed in the literature. A structured overview is a crucial first step
toward comparing them. We examine more than 100 methods and provide a
taxonomy, classifying them into ten classes, including (i) comparisons of
cumulative distribution functions, density functions, or characteristic
functions, (ii) methods based on multivariate ranks, (iii) discrepancy measures
for distributions, (iv) graph-based methods, (v) methods based on inter-point
distances, (vi) kernel-based methods (see the sketch below), (vii) methods based on binary
classification, (viii) distance and similarity measures for datasets, (ix)
comparisons based on summary statistics, and (x) different testing approaches.
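As a concrete instance of class (vi), the following sketch (our own minimal
implementation, not code from any reviewed work) computes a biased estimate of
the squared maximum mean discrepancy (MMD) with a Gaussian kernel, a widely
used kernel-based measure of the discrepancy between two datasets; the
function name and the bandwidth default are illustrative assumptions:

    import numpy as np

    def mmd2_biased(X, Y, bandwidth=1.0):
        # Biased estimate of the squared MMD between samples X and Y
        # (rows = observations) under a Gaussian RBF kernel; the
        # bandwidth of 1.0 is an arbitrary illustrative default.
        def gram(A, B):
            sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
            return np.exp(-sq_dists / (2.0 * bandwidth ** 2))
        return gram(X, X).mean() + gram(Y, Y).mean() - 2.0 * gram(X, Y).mean()

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 3))            # dataset 1
    Y = rng.normal(loc=0.3, size=(100, 3))   # dataset 2, shifted mean
    print(mmd2_biased(X, Y))                 # larger value = less similar

Values near zero indicate similar distributions; in practice, the bandwidth is
often chosen by the median heuristic and significance is assessed by a
permutation test.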
Here, we present an extensive review of these methods. We introduce the main
underlying ideas, formal definitions, and important properties.