GEE analysis of clustered binary data with diverging number of covariates
Clustered binary data with a large number of covariates have become
increasingly common in many scientific disciplines. This paper develops an
asymptotic theory for generalized estimating equations (GEE) analysis of
clustered binary data when the number of covariates grows to infinity with the
number of clusters. In this "large n, diverging p_n" framework, we provide
appropriate regularity conditions and establish the existence, consistency and
asymptotic normality of the GEE estimator. Furthermore, we prove that the
sandwich variance formula remains valid. Even when the working correlation
matrix is misspecified, the use of the sandwich variance formula leads to an
asymptotically valid confidence interval and Wald test for an estimable linear
combination of the unknown parameters. The accuracy of the asymptotic
approximation is examined via numerical simulations. We also discuss the
"diverging p" asymptotic theory for general GEE. The results in this paper
extend the recent elegant work of Xie and Yang [Ann. Statist. 31 (2003)
310--347] and Balan and Schiopu-Kratina [Ann. Statist. 33 (2005) 522--541] in
the "fixed p" setting.
Comment: Published in the Annals of Statistics (http://www.imstat.org/aos/) at
http://dx.doi.org/10.1214/10-AOS846 by the Institute of Mathematical
Statistics (http://www.imstat.org).
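For orientation, the estimating equation and the sandwich variance formula that the abstract refers to can be written in their standard textbook form. The notation below (Y_i, \mu_i, D_i, V_i, H_n, M_n) is assumed for illustration and is not taken from the abstract; the paper's own notation and regularity conditions may differ.

```latex
% Assumed notation: Y_i is the response vector of cluster i, \mu_i(\beta) its mean,
% D_i = \partial \mu_i / \partial \beta^{\top}, and V_i the working covariance matrix.
\begin{align*}
  \sum_{i=1}^{n} D_i(\beta)^{\top} V_i(\beta)^{-1} \{ Y_i - \mu_i(\beta) \} &= 0, \\
  \widehat{\operatorname{Cov}}(\hat\beta) &= H_n^{-1} M_n H_n^{-1}, \\
  H_n &= \sum_{i=1}^{n} D_i^{\top} V_i^{-1} D_i, \\
  M_n &= \sum_{i=1}^{n} D_i^{\top} V_i^{-1}
         (Y_i - \hat\mu_i)(Y_i - \hat\mu_i)^{\top} V_i^{-1} D_i .
\end{align*}
```

The middle term M_n uses the empirical residuals rather than the working covariance, which is why the formula can remain valid when the working correlation matrix is misspecified.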
Partially linear additive quantile regression in ultra-high dimension
We consider a flexible semiparametric quantile regression model for analyzing
high dimensional heterogeneous data. This model has several appealing features:
(1) By considering different conditional quantiles, we may obtain a more
complete picture of the conditional distribution of a response variable given
high dimensional covariates. (2) The sparsity level is allowed to be different
at different quantile levels. (3) The partially linear additive structure
accommodates nonlinearity and circumvents the curse of dimensionality. (4) It
is naturally robust to heavy-tailed distributions. In this paper, we
approximate the nonlinear components using B-spline basis functions. We first
study estimation under this model when the nonzero components are known in
advance and the number of covariates in the linear part diverges. We then
investigate a nonconvex penalized estimator for simultaneous variable selection
and estimation. We derive its oracle property for a general class of nonconvex
penalty functions in the presence of ultra-high dimensional covariates under
relaxed conditions. To tackle the challenges of a nonsmooth loss function, a
nonconvex penalty function and the presence of nonlinear components, we combine
a recently developed convex-differencing method with modern empirical process
techniques. Monte Carlo simulations and an application to a microarray study
demonstrate the effectiveness of the proposed method. We also discuss how the
method for a single quantile of interest can be extended to simultaneous
variable selection and estimation at multiple quantiles.
Comment: Published at http://dx.doi.org/10.1214/15-AOS1367 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org).
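The nonsmooth loss mentioned in this abstract is, in standard quantile regression, the check loss. The following is a minimal sketch of that loss only, not of the paper's penalized estimator; the function names are hypothetical and chosen for illustration.

```python
def check_loss(residual, tau):
    """Standard quantile check loss: rho_tau(u) = u * (tau - 1{u < 0})."""
    indicator = 1.0 if residual < 0 else 0.0
    return residual * (tau - indicator)


def mean_check_loss(y, q, tau):
    """Average check loss of a constant fit q to the sample y (illustrative helper)."""
    return sum(check_loss(yi - q, tau) for yi in y) / len(y)


def best_constant_fit(y, tau):
    """Pick the sample point minimizing the average check loss (brute-force search)."""
    return min(y, key=lambda q: mean_check_loss(y, q, tau))
```

With tau = 0.5 the minimizer is a sample median, which illustrates why the loss is insensitive to how extreme the largest observations are.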
Adaptive Dispatching of Tasks in the Cloud
The increasingly wide application of Cloud Computing enables the
consolidation of tens of thousands of applications in shared infrastructures.
Thus, meeting the quality of service requirements of so many diverse
applications in such shared resource environments has become a real challenge,
especially since the characteristics and workload of applications differ widely
and may change over time. This paper presents an experimental system that can
exploit a variety of online quality-of-service-aware adaptive task allocation
schemes, and three such schemes are designed and compared: first, a
measurement-driven algorithm that uses reinforcement learning; second, a
"sensible" allocation algorithm that assigns jobs to sub-systems that are
observed to provide a lower response time; and third, an algorithm that splits
the job arrival stream into sub-streams at rates computed from the hosts'
processing capabilities. All of these schemes are compared via measurements
among themselves and with a simple round-robin scheduler, on two experimental
test-beds with homogeneous and heterogeneous hosts having different processing
capacities.
Comment: 10 pages, 9 figures
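A minimal sketch of how two of these allocation ideas and the round-robin baseline could look. The abstract does not give the exact rules, so the inverse-response-time weighting and the capacity-proportional split below are assumptions, and all function names are hypothetical.

```python
import itertools
import random


def sensible_weights(response_times):
    """Assumed rule: choose hosts with probability inversely proportional
    to their observed response time, so faster hosts get more jobs."""
    inverse = [1.0 / t for t in response_times]
    total = sum(inverse)
    return [v / total for v in inverse]


def rate_split_weights(capacities):
    """Assumed rule: split the arrival stream in proportion to each host's
    processing capacity."""
    total = sum(capacities)
    return [c / total for c in capacities]


def pick_host(weights, rng=random):
    """Draw one host index according to the given probabilities."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def round_robin(num_hosts):
    """Baseline scheduler: cycle through host indices in order."""
    return itertools.cycle(range(num_hosts))
```

In a running system the response times would be refreshed from measurements (for example with a moving average) before each call to `sensible_weights`.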
Modelling galaxy stellar mass evolution from z~0.8 to today
We apply the empirical method built for z=0 in the previous work of Wang et
al. to a higher redshift, to link galaxy stellar mass directly with its hosting
dark matter halo mass at z~0.8. The relation of the galaxy stellar mass and the
host halo mass M_infall is constrained by fitting both the stellar mass
function and the correlation functions at different stellar mass intervals of
the VVDS observation, where M_infall is the mass of the hosting halo at the
time when the galaxy was last the central galaxy. We find that central galaxies
residing in low mass haloes are less massive at high redshift than
those at low redshift. For high mass haloes, central galaxies
at high redshift are slightly more massive than those at low redshift.
Satellite galaxies are less massive at earlier times, for any given mass of
hosting haloes. Fitting both the SDSS and VVDS observations simultaneously, we
also propose a unified model of the M_stars-M_infall relation, which describes
the evolution of central galaxy mass as a function of time. The stellar mass of
a satellite galaxy is determined by the same M_stars-M_infall relation of
central galaxies at the time when the galaxy is accreted. With these models, we
study how much galaxy stellar mass has increased from z~0.8 to the present day
through galaxy mergers and star formation. Low mass galaxies gain their stellar
masses from z~0.8 to z=0 mainly through star formation. For galaxies of higher
mass, the increase of stellar mass solely through mergers from z~0.8 can make
the massive galaxies a factor ~2 larger than observed at z=0. We can also
predict stellar mass functions of redshifts up to z~3, and the results are
consistent with the latest observations.
Comment: 12 pages, 10 figures, accepted for publication in MNRAS
Calibrating nonconvex penalized regression in ultra-high dimension
We investigate high-dimensional nonconvex penalized regression, where the
number of covariates may grow at an exponential rate. Although recent
asymptotic theory established that there exists a local minimum possessing the
oracle property under general conditions, it is still largely an open problem
how to identify the oracle estimator among potentially multiple local minima.
There are two main obstacles: (1) due to the presence of multiple minima, the
solution path is nonunique and is not guaranteed to contain the oracle
estimator; (2) even if a solution path is known to contain the oracle
estimator, the optimal tuning parameter depends on many unknown factors and is
hard to estimate. To address these two challenging issues, we first prove that
an easy-to-calculate calibrated CCCP algorithm produces a consistent solution
path which contains the oracle estimator with probability approaching one.
Furthermore, we propose a high-dimensional BIC criterion and show that it can
be applied to the solution path to select the optimal tuning parameter which
asymptotically identifies the oracle estimator. The theory for a general class
of nonconvex penalties in the ultra-high dimensional setup is established when
the random errors follow a sub-Gaussian distribution. Monte Carlo studies
confirm that the calibrated CCCP algorithm combined with the proposed
high-dimensional BIC has desirable performance in identifying the underlying
sparsity pattern for high-dimensional data analysis.
Comment: Published in the Annals of Statistics (http://www.imstat.org/aos/) at
http://dx.doi.org/10.1214/13-AOS1159 by the Institute of Mathematical
Statistics (http://www.imstat.org).
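As one concrete member of the class of nonconvex penalties discussed here, the SCAD penalty is sketched below. Choosing SCAD and the conventional default a = 3.7 is an assumption made for illustration; the abstract only speaks of a general class, and this is not the calibrated CCCP algorithm itself.

```python
def scad_penalty(t, lam, a=3.7):
    """SCAD penalty: linear near zero, quadratic transition, then constant."""
    t = abs(t)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2


def scad_derivative(t, lam, a=3.7):
    """Derivative of the SCAD penalty with respect to |t| (for |t| > 0)."""
    t = abs(t)
    if t <= lam:
        return lam
    return max(a * lam - t, 0.0) / (a - 1)
```

The derivative falls to zero once |t| exceeds a * lam, so large coefficients are not shrunk. CCCP-type algorithms write such a penalty as the convex term lam * |t| plus a concave remainder and linearize the concave part at each iteration.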