Machine-learning datasets are typically characterized by measuring their size
and class balance. However, there exists a richer and potentially more useful
set of measures, termed diversity measures, that incorporate elements'
frequencies and between-element similarities. Although these have been
available in the R and Julia programming languages for other applications, they
have not been as readily available in Python, which is widely used for machine
learning, and are not easily applied to machine-learning-sized datasets without
special coding considerations. To address these issues, we developed
greylock, a Python package that calculates diversity measures and is
tailored to large datasets. greylock can calculate any of the
frequency-sensitive measures of Hill's D-number framework and, going beyond
Hill, their similarity-sensitive counterparts (Greylock is a mountain).
greylock also outputs measures that compare datasets (beta
diversities). We first briefly review the D-number framework, illustrating how
it incorporates elements' frequencies and between-element similarities. We then
describe greylock's key features and usage. We end with several
examples (immunomics, metagenomics, computational pathology, and medical
imaging) illustrating greylock's applicability across a range of
dataset types and fields.

Comment: 42 pages, many figures. Many thanks to Ralf Bundschuh for help with
the submission process.
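
To make the D-number framework mentioned in the abstract concrete, the following is a minimal NumPy sketch of a frequency-sensitive Hill number and a similarity-sensitive counterpart. It is not greylock's API: the function name `diversity` and its arguments are hypothetical, and the similarity-sensitive case assumes the Leinster-Cobbold form, in which each element's frequency is replaced by its similarity-weighted "ordinariness" (Zp)_i.

```python
import numpy as np

def diversity(counts, q, similarity=None):
    """Effective number of elements (D number) of order q.

    Hypothetical helper for illustration only; not part of greylock.
    counts: per-class (or per-species) counts.
    q: viewpoint parameter; larger q emphasizes frequent elements.
    similarity: optional square matrix Z with entries in [0, 1] and ones on
        the diagonal. If omitted, elements are treated as fully distinct
        (Z = identity), which gives the ordinary Hill number.
    """
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()                      # relative frequencies
    if similarity is None:
        zp = p                                     # Z = identity, so Zp = p
    else:
        zp = np.asarray(similarity, dtype=float) @ p
    present = p > 0                                # absent elements contribute nothing
    p, zp = p[present], zp[present]
    if q == 1:                                     # limit q -> 1 (Shannon case)
        return float(np.exp(-np.sum(p * np.log(zp))))
    if np.isinf(q):                                # limit q -> infinity
        return float(1.0 / zp.max())
    return float(np.sum(p * zp ** (q - 1)) ** (1.0 / (1.0 - q)))

# Example: three classes with unequal frequencies.
counts = [50, 30, 20]
richness = diversity(counts, q=0)                  # counts every class present
simpson = diversity(counts, q=2)                   # down-weights rare classes
# If all classes are treated as identical, the dataset behaves like one class.
all_same = diversity(counts, q=1, similarity=np.ones((3, 3)))
```

In this sketch, setting `similarity` to an all-ones matrix collapses the effective number to one regardless of q, while omitting it recovers the frequency-only measure; intermediate similarity matrices fall between these two extremes.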