The identification of relevant features, i.e., the driving variables that
determine a process or the property of a system, is an essential part of the
analysis of data sets whose entries are described by a large number of
variables. The preferred measure for quantifying the relevance of nonlinear
statistical dependencies is mutual information, which requires probability
distributions as input. However, probability distributions cannot be reliably
sampled and estimated from limited data, especially for real-valued data samples
such as lengths or energies. Here, we introduce total cumulative mutual information
(TCMI), a measure of the relevance of mutual dependencies based on cumulative
probability distributions. TCMI can be estimated directly from sample data and
is a non-parametric, robust and deterministic measure that facilitates
comparisons and rankings between feature sets with different cardinality. The
ranking induced by TCMI allows for feature selection, i.e., the identification
of the set of relevant features that are statistically related to the process or
the property of a system, while taking into account the number of data samples
as well as the cardinality of the feature subsets. We evaluate the performance
of our measure on simulated data, compare it with similar
multivariate dependence measures, and demonstrate the effectiveness of our
feature selection method on a set of standard data sets and a typical scenario
in materials science.

Comment: 36 pages, 7 figures, 6 tables