In statistics and machine learning, measuring the similarity between two or
more datasets is important for several purposes. The performance of a
predictive model on novel datasets, referred to as generalizability, critically
depends on how similar the dataset used for fitting the model is to the novel
datasets. Exploiting or transferring insights between similar datasets is a key
aspect of meta-learning and transfer learning. Two-sample testing asks whether
the underlying (multivariate) distributions of two datasets coincide.
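For illustration, the following minimal Python sketch (our own example with
arbitrary simulated data, not a method singled out by this review) performs a
classical univariate Kolmogorov-Smirnov two-sample test via
scipy.stats.ks_2samp:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    x = rng.normal(loc=0.0, scale=1.0, size=200)  # sample from P
    y = rng.normal(loc=0.5, scale=1.0, size=200)  # sample from Q

    # H0: both samples are drawn from the same distribution.
    stat, p_value = stats.ks_2samp(x, y)
    print(f"KS statistic = {stat:.3f}, p-value = {p_value:.3g}")

A small p-value provides evidence against the null hypothesis that the two
datasets stem from the same distribution.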
Numerous approaches for quantifying dataset similarity have been
proposed in the literature. A structured overview is a crucial first step
toward comparing them. We examine more than 100 methods and provide a
taxonomy, classifying them into ten classes, including (i) comparisons of
cumulative distribution functions, density functions, or characteristic
functions, (ii) methods based on multivariate ranks, (iii) discrepancy measures
for distributions, (iv) graph-based methods, (v) methods based on inter-point
distances, (vi) kernel-based methods (see the sketch below), (vii) methods based on binary
classification, (viii) distance and similarity measures for datasets, (ix)
comparisons based on summary statistics, and (x) different testing approaches.
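As a concrete instance of class (vi), the following sketch (our own minimal
implementation, not code from any reviewed work) computes a biased estimate of
the squared maximum mean discrepancy (MMD) with a Gaussian kernel, a widely
used kernel-based measure of the discrepancy between two datasets; the
function name and the bandwidth default are illustrative assumptions:

    import numpy as np

    def mmd2_biased(X, Y, bandwidth=1.0):
        # Biased estimate of the squared MMD between samples X and Y
        # (rows = observations) under a Gaussian RBF kernel; the
        # bandwidth of 1.0 is an arbitrary illustrative default.
        def gram(A, B):
            sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
            return np.exp(-sq_dists / (2.0 * bandwidth ** 2))
        return gram(X, X).mean() + gram(Y, Y).mean() - 2.0 * gram(X, Y).mean()

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 3))            # dataset 1
    Y = rng.normal(loc=0.3, size=(100, 3))   # dataset 2, shifted mean
    print(mmd2_biased(X, Y))                 # larger value = less similar

Values near zero indicate similar distributions; in practice, the bandwidth is
often chosen by the median heuristic and significance is assessed by a
permutation test.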
Here, we present an extensive review of these methods. We introduce the main
underlying ideas, formal definitions, and important properties.