$\textit{greylock}$: A Python Package for Measuring The Composition of Complex Datasets

Arnaout, Ramy; Arnaout, Rima; Arora, Rohit; Braun, Jasper; Hill, Elliot D.; Lee, Ghee Rye; Mazzoni, Gabrielle; Morgan, Alexandra; Nguyen, Phuc; Quintana, Liza M.

Publication date: 29 December 2023

Abstract

Machine-learning datasets are typically characterized by measuring their size and class balance. However, there exists a richer and potentially more useful set of measures, termed diversity measures, that incorporate elements' frequencies and between-element similarities. Although these have been available in the R and Julia programming languages for other applications, they have not been as readily available in Python, which is widely used for machine learning, and are not easily applied to machine-learning-sized datasets without special coding considerations. To address these issues, we developed \textit{greylock}, a Python package that calculates diversity measures and is tailored to large datasets. \textit{greylock} can calculate any of the frequency-sensitive measures of Hill's D-number framework, and going beyond Hill, their similarity-sensitive counterparts (Greylock is a mountain). \textit{greylock} also outputs measures that compare datasets (beta diversities). We first briefly review the D-number framework, illustrating how it incorporates elements' frequencies and between-element similarities. We then describe \textit{greylock}'s key features and usage. We end with several examples - immunomics, metagenomics, computational pathology, and medical imaging - illustrating \textit{greylock}'s applicability across a range of dataset types and fields.

Comment: 42 pages, many figures. Many thanks to Ralf Bundschuh for help with the submission process.
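To make the abstract's distinction between frequency-sensitive and similarity-sensitive measures concrete, the following is a minimal NumPy sketch of a D-number calculation. It does not use \textit{greylock}'s own API; the function name `diversity` and its parameters are hypothetical, and the similarity-sensitive form is assumed to follow the standard Leinster-Cobbold generalization of Hill numbers, $D_q^Z = \left(\sum_i p_i \, (Zp)_i^{\,q-1}\right)^{1/(1-q)}$, which reduces to the Hill number when $Z$ is the identity matrix.

```python
import numpy as np


def diversity(counts, q, similarity=None):
    """Effective number of elements (D-number) at viewpoint q.

    Illustrative sketch only, not greylock's API.
    counts:     abundance of each element (class, species, clone, ...).
    q:          viewpoint parameter; larger q weights common elements more.
    similarity: optional square matrix Z with entries in [0, 1] and ones on
                the diagonal; None means elements are treated as fully
                distinct (Z = identity), which gives the plain Hill number.
    """
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()

    # Elements with zero abundance contribute nothing; drop them to avoid
    # division by zero and log(0).
    keep = p > 0
    p = p[keep]

    if similarity is None:
        zp = p  # identity similarity: (Zp)_i = p_i
    else:
        z = np.asarray(similarity, dtype=float)[np.ix_(keep, keep)]
        zp = z @ p  # "ordinariness" of each element

    if q == 1:
        # Limit q -> 1: exponential of Shannon-type entropy.
        return float(np.exp(-np.sum(p * np.log(zp))))
    if np.isinf(q):
        # Limit q -> infinity: dominated by the most ordinary element.
        return float(1.0 / zp.max())
    return float(np.sum(p * zp ** (q - 1)) ** (1.0 / (1.0 - q)))


if __name__ == "__main__":
    balanced = [50, 50]
    skewed = [90, 10]
    # At q = 0 only presence matters, so both datasets count two classes.
    print(diversity(balanced, q=0), diversity(skewed, q=0))
    # At q = 2 the skewed dataset has fewer "effective" classes.
    print(diversity(balanced, q=2), diversity(skewed, q=2))
    # If the two classes are treated as identical, diversity collapses to 1.
    print(diversity(skewed, q=1, similarity=np.ones((2, 2))))
```

In this sketch a dense similarity matrix is held in memory, which is exactly the kind of cost that becomes prohibitive for machine-learning-sized datasets; the abstract indicates that \textit{greylock} is tailored to that setting.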

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2401.00102

Last time updated on 14/08/2024