7 research outputs found

Azadkia-Chatterjee's correlation coefficient adapts to manifold data

Authors: Han Fang; Huang Zhihan
Publication date: 22/09/2022

In their seminal work, Azadkia and Chatterjee (2021) initiated graph-based methods for measuring the strength of dependence between variables. By appealing to nearest neighbor graphs, they gave an elegant solution to a problem of Rényi (Rényi, 1959). Their idea was later developed in Deb et al. (2020), whose authors proved that, quite interestingly, Azadkia and Chatterjee's correlation coefficient can automatically adapt to the manifold structure of the data. This paper furthers their study by calculating the statistic's limiting variance under independence and showing that it depends only on the manifold dimension.

Comment: 25 pages

arXiv.org e-Print Archive
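
The nearest-neighbor construction mentioned in the abstract above can be illustrated with a small sketch. It assumes the unconditional form of the Azadkia-Chatterjee statistic, T_n = Σ_i (n·min(R_i, R_M(i)) − L_i²) / Σ_i L_i(n − L_i), where R_i and L_i count the responses at most and at least y_i, and M(i) indexes the nearest neighbor of z_i. The function name `ac_coefficient` is hypothetical, the distance computation is a brute-force O(n²) choice made for brevity, and nearest-neighbor ties are broken by index rather than at random.

```python
import numpy as np

def ac_coefficient(y, z):
    """Unconditional Azadkia-Chatterjee statistic T_n(Y, Z) (brute-force sketch)."""
    y = np.asarray(y, dtype=float)
    z = np.asarray(z, dtype=float).reshape(len(y), -1)
    n = len(y)
    # R_i = #{j : y_j <= y_i},  L_i = #{j : y_j >= y_i}
    R = (y[None, :] <= y[:, None]).sum(axis=1)
    L = (y[None, :] >= y[:, None]).sum(axis=1)
    # M(i): index of the nearest neighbor of z_i among the other points
    # (assumption: Euclidean distance, ties broken by lowest index)
    d2 = ((z[:, None, :] - z[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)
    M = d2.argmin(axis=1)
    num = (n * np.minimum(R, R[M]) - L ** 2).sum()
    den = (L * (n - L)).sum()
    return num / den

rng = np.random.default_rng(0)
z = rng.normal(size=(400, 2))
print(ac_coefficient(z[:, 0] + z[:, 1], z))     # y is a function of z: near 1
print(ac_coefficient(rng.normal(size=400), z))  # y independent of z: near 0
```

For larger samples, a tree-based nearest-neighbor search would likely replace the pairwise distance matrix.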

Background Modeling for Double Higgs Boson Production: Density Ratios and Optimal Transport

Authors: Alison John; Bryant Patrick; Kuusela Mikael; Manole Tudor; Wasserman Larry
Publication date: 04/08/2022

We study the problem of data-driven background estimation, arising in the search for physics signals predicted by the Standard Model at the Large Hadron Collider. Our work is motivated by the search for the production of pairs of Higgs bosons decaying into four bottom quarks. A number of other physical processes, known as background, also share the same final state. The data arising in this problem is therefore a mixture of unlabeled background and signal events, and the primary aim of the analysis is to determine whether the proportion of unlabeled signal events is nonzero. A challenging but necessary first step is to estimate the distribution of background events. Past work in this area has determined regions of the space of collider events where signal is unlikely to appear, and where the background distribution is therefore identifiable. The background distribution can be estimated in these regions, and extrapolated into the region of primary interest using transfer learning of a multivariate classifier. We build upon this existing approach in two ways. On the one hand, we revisit this method by developing a powerful new classifier architecture tailored to collider data. On the other hand, we develop a new method for background estimation, based on the optimal transport problem, which relies on distinct modeling assumptions. These two methods can serve as powerful cross-checks for each other in particle physics analyses, due to the complementarity of their underlying assumptions. We compare their performance on simulated collider data.

arXiv.org e-Print Archive

Mode-Seeking Clustering and Density Ridge Estimation via Direct Estimation of Density-Derivative-Ratios

Authors: Hyvärinen Aapo; Kanamori Takafumi; Niu Gang; Sasaki Hiroaki; Sugiyama Masashi
Publication date: 01/01/2018

Modes and ridges of the probability density function behind observed data are useful geometric features. Mode-seeking clustering assigns cluster labels by associating data samples with the nearest modes, and estimation of density ridges enables us to find lower-dimensional structures hidden in data. A key technical challenge in both mode-seeking clustering and density ridge estimation is accurate estimation of the ratios of the first- and second-order density derivatives to the density. A naive approach proceeds in three steps: first estimating the data density, then computing its derivatives, and finally taking their ratios. However, this three-step approach can be unreliable because a good density estimator does not necessarily yield a good density derivative estimator, and division by the estimated density could significantly magnify the estimation error. To cope with these problems, we propose a novel estimator for the density-derivative-ratios. The proposed estimator does not involve density estimation, but rather directly approximates the ratios of density derivatives of any order. Moreover, we establish a convergence rate of the proposed estimator. Based on the proposed estimator, novel methods for both mode-seeking clustering and density ridge estimation are developed, and the respective convergence rates to the mode and ridge of the underlying density are also established. Finally, we experimentally demonstrate that the developed methods significantly outperform existing methods, particularly for relatively high-dimensional data.

Peer reviewed

arXiv.org e-Print Archive

Helsingin yliopiston digitaalinen arkisto
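
To make the mode-seeking idea in the abstract above concrete, here is a minimal sketch of classical Gaussian mean shift. It is the plug-in baseline, not the direct density-derivative-ratio estimator the paper proposes: for a Gaussian kernel density estimate with bandwidth h, the ratio of the estimated gradient to the estimated density equals (m(x) − x)/h², where m(x) is the kernel-weighted mean, so each update moves a point to m(x). The function name `mean_shift` and the fixed iteration count are assumptions made for illustration.

```python
import numpy as np

def mean_shift(X, h, n_iter=100):
    """Move every sample uphill on a Gaussian KDE (classical mean shift)."""
    X = np.asarray(X, dtype=float)
    pts = X.copy()
    for _ in range(n_iter):
        # Gaussian kernel weights between current positions and the data
        d2 = ((pts[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        w = np.exp(-d2 / (2 * h ** 2))
        # Kernel-weighted mean m(x); equivalent to x + h^2 * grad p(x) / p(x)
        pts = (w @ X) / w.sum(axis=1, keepdims=True)
    return pts

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-4, 0], 0.5, size=(50, 2)),
               rng.normal([4, 0], 0.5, size=(50, 2))])
modes = mean_shift(X, h=1.0)
# Samples that end at the same mode belong to the same cluster
labels = (modes[:, 0] > 0).astype(int)
print(np.bincount(labels))
```

Because the update divides by the estimated density implicitly, it inherits the sensitivity to density estimation error that the abstract describes.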

Minimax optimal approaches to the label shift problem

Authors: Banerjee Moulinath; Maity Subha; Sun Yuekai
Publication date: 04/04/2020

We study minimax rates of convergence in the label shift problem. In addition to the usual setting in which the learner only has access to unlabeled examples from the target domain, we also consider the setting in which a small number of labeled examples from the target domain are available to the learner. Our study reveals a difference in the difficulty of the label shift problem in the two settings. We attribute this difference to the availability of data from the target domain to estimate the class conditional distributions in the latter setting. We also show that a distributional matching approach is minimax rate-optimal in the former setting.

arXiv.org e-Print Archive
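
One well-known instance of distribution matching under label shift can be sketched as follows. This is an illustration in the style of black-box shift estimation, not necessarily the estimator analyzed in the paper above. It assumes a fixed classifier whose predictions are available on labeled source data and unlabeled target data, and it solves C w = μ, where C[i, j] is the source joint frequency of (prediction i, label j), μ[i] is the target frequency of prediction i, and w[j] estimates the ratio of target to source class probabilities. The function name `label_shift_weights` is hypothetical.

```python
import numpy as np

def label_shift_weights(pred_src, y_src, pred_tgt, n_classes):
    """Estimate w[j] = q(y=j) / p(y=j) by matching prediction distributions."""
    # C[i, j]: source joint frequency of (prediction = i, true label = j)
    C = np.zeros((n_classes, n_classes))
    np.add.at(C, (pred_src, y_src), 1.0)
    C /= len(y_src)
    # mu[i]: frequency of prediction i on the unlabeled target sample
    mu = np.bincount(pred_tgt, minlength=n_classes) / len(pred_tgt)
    # Under label shift, C @ w = mu; clip small negative estimates to zero
    w = np.linalg.solve(C, mu)
    return np.clip(w, 0.0, None)

def noisy_predictions(y, rng, flip=0.2):
    """Simulated binary classifier (assumption): wrong with probability `flip`."""
    return np.where(rng.random(len(y)) < flip, 1 - y, y)

rng = np.random.default_rng(0)
y_src = rng.choice(2, size=20000, p=[0.5, 0.5])
y_tgt = rng.choice(2, size=20000, p=[0.2, 0.8])  # labels hidden from the learner
w = label_shift_weights(noisy_predictions(y_src, rng), y_src,
                        noisy_predictions(y_tgt, rng), n_classes=2)
print(w * np.bincount(y_src, minlength=2) / len(y_src))  # estimated target prior
```

The solve step requires the confusion matrix to be invertible, which in practice means the classifier must be informative about every class.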

Lipschitz Density-Ratios, Structured Data, and Data-driven Tuning

Author: Kpotufe S
Publication date: 01/01/2017

Density-ratio estimation (i.e. estimating f = fQ/fP for two unknown distributions Q and P) has proved useful in many machine learning tasks (e.g., risk calibration in transfer learning and two-sample tests) and in common techniques such as importance sampling and bias correction. While there are many important analyses of this estimation problem, the present paper derives convergence rates in other practical settings that are less understood, namely, extensions of traditional Lipschitz smoothness conditions, and common high-dimensional settings with structured data (e.g. manifold data, sparse data). Various interesting facts, which hold in earlier settings, are shown to extend to these settings. Namely, (1) optimal rates depend only on the smoothness of the ratio f, and not on the densities fQ, fP, supporting the belief that plugging in estimates for fQ, fP is suboptimal; (2) optimal rates depend only on the intrinsic dimension of data, i.e. this problem, unlike density estimation, escapes the curse of dimension. We further show that near-optimal rates are attainable by estimators tuned from data alone, i.e. with no prior distributional information. This last fact is of special interest in unsupervised settings such as this one, where only oracle rates seem to be known, i.e., rates which assume critical distributional information usually unavailable in practice.

Princeton University Open Access Repository
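
As an illustration of estimating f = fQ/fP directly, without plugging in separate density estimates, the sketch below uses a least-squares importance fitting approach (uLSIF-style) with Gaussian basis functions. This is a swapped-in textbook technique, not the estimator studied in the paper above. The function name `fit_density_ratio`, the choice of centers, and the default bandwidth `sigma` and ridge penalty `lam` are assumptions; in practice they would be tuned, for example by cross-validation.

```python
import numpy as np

def fit_density_ratio(x_p, x_q, centers, sigma=1.0, lam=1e-2):
    """Least-squares fit of r(x) ~ fQ(x) / fP(x) with Gaussian basis functions."""
    def phi(x):
        # Gaussian basis functions evaluated at the rows of x
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2 * sigma ** 2))
    Phi_p, Phi_q = phi(x_p), phi(x_q)
    H = Phi_p.T @ Phi_p / len(x_p)   # second moment of the basis under P
    h = Phi_q.mean(axis=0)           # mean of the basis under Q
    # Ridge-regularized solution of min_theta E_P[(phi^T theta - f)^2]
    theta = np.linalg.solve(H + lam * np.eye(len(centers)), h)
    # Density ratios are nonnegative, so clip negative fitted values
    return lambda x: np.maximum(phi(x) @ theta, 0.0)

rng = np.random.default_rng(0)
x_p = rng.normal(0.0, 1.0, size=(2000, 1))   # sample from P
x_q = rng.normal(0.5, 1.0, size=(2000, 1))   # sample from Q
ratio = fit_density_ratio(x_p, x_q, centers=x_q[:50])
print(ratio(np.array([[-1.0], [0.0], [1.0]])))  # should increase with x here
```

In this toy setting the true ratio is exp(0.5x − 0.125), so the fitted values can be compared against it directly.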