A Partially Linear Framework for Massive Heterogeneous Data
We consider a partially linear framework for modelling massive heterogeneous
data. The major goal is to extract common features across all sub-populations
while exploring heterogeneity of each sub-population. In particular, we propose
an aggregation type estimator for the commonality parameter that possesses the
(non-asymptotic) minimax optimal bound and asymptotic distribution as if there
were no heterogeneity. This oracular result holds when the number of
sub-populations does not grow too fast. A plug-in estimator for the
heterogeneity parameter is further constructed, and shown to possess the
asymptotic distribution as if the commonality information were available. We
also test the heterogeneity among a large number of sub-populations. All the
above results require regularizing each sub-estimation as though it had the
entire sample size. Our general theory applies to the divide-and-conquer
approach that is often used to deal with massive homogeneous data. A technical
by-product of this paper is statistical inference for general kernel
ridge regression. Thorough numerical results are also provided to back up our
theory.
Comment: 40 pages main text + 40 pages supplement, to appear in the Annals of
Statistics
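
The abstract's remark on divide-and-conquer can be illustrated with a small simulation. The sketch below is not the paper's estimator. It assumes a purely nonparametric model with one covariate, a Gaussian kernel, and a hypothetical penalty rate `N ** (-2/3)`, all chosen for illustration only. What it tries to show is the one idea stated in the abstract: each sub-sample fit uses a penalty chosen from the entire sample size `N` rather than from the sub-sample size, and the sub-sample fits are then averaged.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth=0.2):
    """Gaussian kernel matrix between 1-D point sets a and b."""
    diff = a[:, None] - b[None, :]
    return np.exp(-diff**2 / (2.0 * bandwidth**2))

def krr_fit(z, y, lam, bandwidth=0.2):
    """Kernel ridge regression: solve (K + n*lam*I) alpha = y."""
    n = len(z)
    K = rbf_kernel(z, z, bandwidth)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
    return lambda z_new: rbf_kernel(z_new, z, bandwidth) @ alpha

def dc_krr(z, y, s, lam, bandwidth=0.2):
    """Split the data into s blocks, fit KRR on each block, average the fits."""
    blocks = np.array_split(np.arange(len(z)), s)
    fits = [krr_fit(z[idx], y[idx], lam, bandwidth) for idx in blocks]
    return lambda z_new: np.mean([f(z_new) for f in fits], axis=0)

# Simulated data (hypothetical): N points split into s sub-samples.
rng = np.random.default_rng(0)
N, s = 2000, 10
z = rng.uniform(0.0, 1.0, N)
f0 = lambda t: np.sin(2 * np.pi * t)   # true regression function (assumed)
y = f0(z) + 0.3 * rng.standard_normal(N)

# Penalty chosen from the full sample size N, not from N / s.
# The exponent -2/3 is an illustrative assumption, not taken from the paper.
lam = N ** (-2.0 / 3.0)

f_bar = dc_krr(z, y, s, lam)
grid = np.linspace(0.05, 0.95, 50)
mse = np.mean((f_bar(grid) - f0(grid)) ** 2)
print(f"mean squared error of averaged fit on grid: {mse:.4f}")
```

Setting `lam` from `N` under-smooths each sub-sample fit relative to what would be chosen for a sample of size `N / s`; averaging across the `s` fits then reduces the extra variance. The appropriate rate depends on the kernel and smoothness assumptions, which the abstract does not specify.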