Azadkia-Chatterjee's correlation coefficient adapts to manifold data
In their seminal work, Azadkia and Chatterjee (2021) initiated graph-based
methods for measuring variable dependence strength. By appealing to nearest
neighbor graphs, they gave an elegant solution to a problem of Rényi
(Rényi, 1959). Their idea was later developed in Deb et al. (2020) and the
authors there proved that, quite interestingly, Azadkia and Chatterjee's
correlation coefficient can automatically adapt to the manifold structure of
the data. This paper extends their study by calculating the statistic's
limiting variance under independence and showing that it depends only on
the manifold dimension.
Comment: 25 pages
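To make the nearest-neighbor-graph construction concrete, the following is a minimal sketch of the unconditional coefficient T_n(Y, Z) as defined in Azadkia and Chatterjee (2021), assuming no ties among the Z points and using a brute-force neighbor search. The function name `azadkia_chatterjee` and the toy data are illustrative assumptions, not code from the paper.

```python
import numpy as np

def azadkia_chatterjee(y, z):
    """Unconditional coefficient T_n(Y, Z) built from a nearest-neighbor graph on Z."""
    y = np.asarray(y, dtype=float)
    z = np.asarray(z, dtype=float).reshape(len(y), -1)
    n = len(y)
    # M(i): index of the nearest neighbor of z_i among the other points (brute force).
    diff = z[:, None, :] - z[None, :, :]
    dist = np.einsum("ijk,ijk->ij", diff, diff)
    np.fill_diagonal(dist, np.inf)
    nn = dist.argmin(axis=1)
    y_sorted = np.sort(y)
    r = np.searchsorted(y_sorted, y, side="right")     # R_i = #{j : y_j <= y_i}
    l = n - np.searchsorted(y_sorted, y, side="left")  # L_i = #{j : y_j >= y_i}
    num = np.sum(n * np.minimum(r, r[nn]) - l**2)
    den = np.sum(l * (n - l))
    return num / den

# Toy data (an assumption for illustration): Z uniform on the unit square.
rng = np.random.default_rng(0)
z = rng.uniform(size=(1000, 2))
t_dependent = azadkia_chatterjee(z.sum(axis=1), z)            # Y is a function of Z
t_independent = azadkia_chatterjee(rng.normal(size=1000), z)  # Y independent of Z
```

The statistic should be close to 1 when Y is a function of Z and close to 0 under independence; the brute-force distance matrix is only practical for a few thousand points.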
Background Modeling for Double Higgs Boson Production: Density Ratios and Optimal Transport
We study the problem of data-driven background estimation, arising in the
search for physics signals predicted by the Standard Model at the Large Hadron
Collider. Our work is motivated by the search for the production of pairs of
Higgs bosons decaying into four bottom quarks. A number of other physical
processes, known as background, also share the same final state. The data
arising in this problem is therefore a mixture of unlabeled background and
signal events, and the primary aim of the analysis is to determine whether the
proportion of unlabeled signal events is nonzero. A challenging but necessary
first step is to estimate the distribution of background events. Past work in
this area has determined regions of the space of collider events where signal
is unlikely to appear, and where the background distribution is therefore
identifiable. The background distribution can be estimated in these regions,
and extrapolated into the region of primary interest using transfer learning of
a multivariate classifier. We build upon this existing approach in two ways. On
the one hand, we revisit this method by developing a powerful new classifier
architecture tailored to collider data. On the other hand, we develop a new
method for background estimation, based on the optimal transport problem, which
relies on distinct modeling assumptions. These two methods can serve as
powerful cross-checks for each other in particle physics analyses, due to the
complementarity of their underlying assumptions. We compare their performance
on simulated collider data.
Mode-Seeking Clustering and Density Ridge Estimation via Direct Estimation of Density-Derivative-Ratios
Modes and ridges of the probability density function behind observed data are useful geometric features. Mode-seeking clustering assigns cluster labels by associating data samples with the nearest modes, and estimation of density ridges enables us to find lower-dimensional structures hidden in data. A key technical challenge in both mode-seeking clustering and density ridge estimation is accurate estimation of the ratios of the first- and second-order density derivatives to the density. A naive method takes three steps: first estimating the data density, then computing its derivatives, and finally taking their ratios. However, this three-step approach can be unreliable because a good density estimator is not necessarily a good density derivative estimator, and division by the estimated density could significantly magnify the estimation error. To cope with these problems, we propose a novel estimator for the density-derivative-ratios. The proposed estimator does not involve density estimation, but rather directly approximates the ratios of density derivatives of any order. Moreover, we establish a convergence rate of the proposed estimator. Based on the proposed estimator, novel methods for both mode-seeking clustering and density ridge estimation are developed, and the respective convergence rates to the mode and ridge of the underlying density are also established. Finally, we experimentally demonstrate that the developed methods significantly outperform existing methods, particularly for relatively high-dimensional data.
Peer reviewed
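For reference, the naive three-step baseline described in this abstract can be sketched as follows for the first-order ratio (gradient of the density divided by the density), assuming a Gaussian kernel density estimate with a hand-picked bandwidth `h`. This is the plug-in approach the abstract criticizes, not the proposed direct estimator; the function name is hypothetical.

```python
import numpy as np

def kde_gradient_ratio(x, samples, h):
    """Plug-in estimate of grad p(x) / p(x) from a Gaussian KDE with bandwidth h."""
    x = np.asarray(x, dtype=float)
    samples = np.asarray(samples, dtype=float)
    diff = samples - x                                 # shape (n, d)
    w = np.exp(-np.sum(diff**2, axis=1) / (2 * h**2))  # unnormalized kernel weights
    density = w.sum()                                  # step 1: density, up to a constant
    gradient = (w[:, None] * diff).sum(axis=0) / h**2  # step 2: gradient, same constant
    return gradient / density                          # step 3: ratio (constant cancels)
```

For standard normal data the true ratio at x is -x, while the smoothed estimate targets -x / (1 + h**2), which illustrates how the bandwidth choice biases the plug-in ratio.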
Minimax optimal approaches to the label shift problem
We study minimax rates of convergence in the label shift problem. In addition
to the usual setting in which the learner only has access to unlabeled examples
from the target domain, we also consider the setting in which a small number of
labeled examples from the target domain are available to the learner. Our study
reveals a difference in the difficulty of the label shift problem in the two
settings. We attribute this difference to the availability of data from the
target domain to estimate the class conditional distributions in the latter
setting. We also show that a distributional matching approach is minimax
rate-optimal in the former setting.
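As a rough illustration of what distributional matching can look like under label shift, the sketch below assumes discrete features and known source class-conditional probabilities, and recovers the target class proportions by least squares. It is a generic moment-matching example with hypothetical names, not the estimator analyzed in the paper.

```python
import numpy as np

def match_label_proportions(cond_source, marg_target):
    """Solve cond_source @ pi ~= marg_target for the target class proportions pi.

    cond_source[k, y] = P(X = k | Y = y), shared by source and target under label shift.
    marg_target[k]    = P_target(X = k), estimated from unlabeled target data.
    """
    pi, *_ = np.linalg.lstsq(cond_source, marg_target, rcond=None)
    pi = np.clip(pi, 0.0, None)  # crude correction toward the probability simplex
    return pi / pi.sum()

# Toy numbers (assumptions for illustration): three feature values, two classes.
cond = np.array([[0.7, 0.2],
                 [0.2, 0.3],
                 [0.1, 0.5]])
true_pi = np.array([0.25, 0.75])
estimated_pi = match_label_proportions(cond, cond @ true_pi)
```

With exact distributions the proportions are recovered exactly; with finite samples both inputs are estimates and the solution inherits their error.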
Lipschitz Density-Ratios, Structured Data, and Data-driven Tuning
Density-ratio estimation (i.e., estimating f = fQ/fP for two unknown distributions Q and P) has proved useful in many machine learning tasks, e.g., risk calibration in transfer learning and two-sample tests, as well as in common techniques such as importance sampling and bias correction. While there are many important analyses of this estimation problem, the present paper derives convergence rates in other practical settings that are less understood, namely, extensions of traditional Lipschitz smoothness conditions, and common high-dimensional settings with structured data (e.g., manifold data, sparse data). Various interesting facts, which hold in earlier settings, are shown to extend to these settings. Namely, (1) optimal rates depend only on the smoothness of the ratio f, and not on the densities fQ, fP, supporting the belief that plugging in estimates for fQ, fP is suboptimal; (2) optimal rates depend only on the intrinsic dimension of data, i.e., this problem, unlike density estimation, escapes the curse of dimension. We further show that near-optimal rates are attainable by estimators tuned from data alone, i.e., with no prior distributional information. This last fact is of special interest in unsupervised settings such as this one, where only oracle rates seem to be known, i.e., rates which assume critical distributional information usually unavailable in practice.
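One way to estimate f = fQ/fP at a point without estimating either density separately is to compare sample counts in a common ball. The sketch below assumes a k-nearest-neighbor ball taken on the P sample and a hand-picked `k`; it is an illustrative estimator with hypothetical names and is not claimed to be the one analyzed in the paper.

```python
import numpy as np

def knn_density_ratio(x, p_samples, q_samples, k):
    """Estimate f(x) = fQ(x) / fP(x) by comparing sample fractions in one ball."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    p = np.asarray(p_samples, dtype=float).reshape(len(p_samples), -1)
    q = np.asarray(q_samples, dtype=float).reshape(len(q_samples), -1)
    dist_p = np.linalg.norm(p - x, axis=1)
    dist_q = np.linalg.norm(q - x, axis=1)
    radius = np.partition(dist_p, k - 1)[k - 1]  # distance to the k-th nearest P sample
    q_count = np.count_nonzero(dist_q <= radius)
    # Fraction of Q samples in the ball over fraction of P samples in the ball.
    return (q_count / len(q)) / (k / len(p))

# Toy data (assumptions for illustration): P uniform on [0, 1], Q with density 2x,
# so the true ratio is f(x) = 2x.
rng = np.random.default_rng(0)
p_draws = rng.uniform(size=20000)
q_draws = np.sqrt(rng.uniform(size=20000))
ratio_at_half = knn_density_ratio(0.5, p_draws, q_draws, k=400)
```

The volume of the ball cancels in the ratio, so no dimension-dependent normalizing constant is needed; choosing `k` from data is the tuning question the abstract refers to and is not addressed here.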