
    GEE analysis of clustered binary data with diverging number of covariates

    Clustered binary data with a large number of covariates have become increasingly common in many scientific disciplines. This paper develops an asymptotic theory for generalized estimating equations (GEE) analysis of clustered binary data when the number of covariates grows to infinity with the number of clusters. In this "large n, diverging p" framework, we provide appropriate regularity conditions and establish the existence, consistency and asymptotic normality of the GEE estimator. Furthermore, we prove that the sandwich variance formula remains valid. Even when the working correlation matrix is misspecified, the use of the sandwich variance formula leads to an asymptotically valid confidence interval and Wald test for an estimable linear combination of the unknown parameters. The accuracy of the asymptotic approximation is examined via numerical simulations. We also discuss the "diverging p" asymptotic theory for general GEE. The results in this paper extend the recent elegant work of Xie and Yang [Ann. Statist. 31 (2003) 310--347] and Balan and Schiopu-Kratina [Ann. Statist. 32 (2005) 522--541] in the "fixed p" setting. Comment: Published at http://dx.doi.org/10.1214/10-AOS846 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org).

    Partially linear additive quantile regression in ultra-high dimension

    We consider a flexible semiparametric quantile regression model for analyzing high dimensional heterogeneous data. This model has several appealing features: (1) By considering different conditional quantiles, we may obtain a more complete picture of the conditional distribution of a response variable given high dimensional covariates. (2) The sparsity level is allowed to be different at different quantile levels. (3) The partially linear additive structure accommodates nonlinearity and circumvents the curse of dimensionality. (4) It is naturally robust to heavy-tailed distributions. In this paper, we approximate the nonlinear components using B-spline basis functions. We first study estimation under this model when the nonzero components are known in advance and the number of covariates in the linear part diverges. We then investigate a nonconvex penalized estimator for simultaneous variable selection and estimation. We derive its oracle property for a general class of nonconvex penalty functions in the presence of ultra-high dimensional covariates under relaxed conditions. To tackle the challenges of a nonsmooth loss function, a nonconvex penalty function and the presence of nonlinear components, we combine a recently developed convex-differencing method with modern empirical process techniques. Monte Carlo simulations and an application to a microarray study demonstrate the effectiveness of the proposed method. We also discuss how the method for a single quantile of interest can be extended to simultaneous variable selection and estimation at multiple quantiles. Comment: Published at http://dx.doi.org/10.1214/15-AOS1367 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org).
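    The "nonsmooth loss function" mentioned in the abstract above is, in standard quantile regression, the check loss, which has a kink at zero. The sketch below only illustrates that loss and how minimizing it recovers a sample quantile; it is not the penalized estimator from the paper, and the function names `check_loss` and `sample_quantile` are hypothetical.

```python
def check_loss(residual, tau):
    """Quantile (check) loss: tau * r for r >= 0, (tau - 1) * r for r < 0."""
    indicator = 1.0 if residual < 0 else 0.0
    return residual * (tau - indicator)


def sample_quantile(values, tau):
    """Return the data point that minimizes the total check loss at level tau.

    Illustrative brute-force search over the observed values only.
    """
    return min(values, key=lambda q: sum(check_loss(y - q, tau) for y in values))


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5]
    print(sample_quantile(data, 0.5))  # the median of the toy data
```

    The asymmetric weighting (tau on positive residuals, 1 - tau on negative ones) is what lets different quantile levels be targeted with the same loss family.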

    Adaptive Dispatching of Tasks in the Cloud

    The increasingly wide application of Cloud Computing enables the consolidation of tens of thousands of applications in shared infrastructures. Thus, meeting the quality of service requirements of so many diverse applications in such shared resource environments has become a real challenge, especially since the characteristics and workload of applications differ widely and may change over time. This paper presents an experimental system that can exploit a variety of online quality-of-service-aware adaptive task allocation schemes, and three such schemes are designed and compared: a measurement-driven algorithm that uses reinforcement learning, a "sensible" allocation algorithm that assigns jobs to sub-systems that are observed to provide a lower response time, and an algorithm that splits the job arrival stream into sub-streams at rates computed from the hosts' processing capabilities. All of these schemes are compared via measurements among themselves and with a simple round-robin scheduler, on two experimental test-beds with homogeneous and heterogeneous hosts having different processing capacities. Comment: 10 pages, 9 figures
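    The allocation schemes named in the abstract above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the abstract gives no formulas, so weighting hosts by the inverse of their observed response time and splitting arrival rates in direct proportion to capacity are both assumptions, and all function names are hypothetical. The reinforcement-learning scheme is omitted.

```python
import random
from itertools import cycle


def round_robin(hosts):
    """Baseline scheduler: yield hosts in a fixed cyclic order."""
    return cycle(hosts)


def sensible_choice(avg_response_time, rng=random):
    """Pick a host at random, favoring hosts with lower observed response time.

    Assumption: selection probability is proportional to 1 / response time.
    """
    hosts = list(avg_response_time)
    weights = [1.0 / avg_response_time[h] for h in hosts]
    return rng.choices(hosts, weights=weights, k=1)[0]


def split_rates(arrival_rate, capacity):
    """Split the job arrival rate into per-host sub-stream rates.

    Assumption: each host's share is proportional to its processing capacity.
    """
    total = sum(capacity.values())
    return {h: arrival_rate * c / total for h, c in capacity.items()}


if __name__ == "__main__":
    print(split_rates(10.0, {"a": 1.0, "b": 3.0}))  # {'a': 2.5, 'b': 7.5}
    print(sensible_choice({"a": 0.2, "b": 0.8}))    # "a" is four times as likely as "b"
```

    Randomizing the "sensible" choice, rather than always picking the currently fastest host, keeps slower hosts sampled so that their response-time estimates stay current.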

    Modelling galaxy stellar mass evolution from z~0.8 to today

    We apply the empirical method built for z=0 in the previous work of Wang et al. to a higher redshift, to link galaxy stellar mass directly with its hosting dark matter halo mass at z~0.8. The relation of the galaxy stellar mass and the host halo mass M_infall is constrained by fitting both the stellar mass function and the correlation functions at different stellar mass intervals of the VVDS observation, where M_infall is the mass of the hosting halo at the time when the galaxy was last the central galaxy. We find that for low mass haloes, their residing central galaxies are less massive at high redshift than those at low redshift. For high mass haloes, central galaxies in these haloes at high redshift are a bit more massive than the galaxies at low redshift. Satellite galaxies are less massive at earlier times, for any given mass of hosting haloes. Fitting both the SDSS and VVDS observations simultaneously, we also propose a unified model of the M_stars-M_infall relation, which describes the evolution of central galaxy mass as a function of time. The stellar mass of a satellite galaxy is determined by the same M_stars-M_infall relation of central galaxies at the time when the galaxy is accreted. With these models, we study how much galaxy stellar mass has increased from z~0.8 to the present day through galaxy mergers and star formation. Low mass galaxies gain their stellar masses from z~0.8 to z=0 mainly through star formation. For galaxies of higher mass, the increase of stellar mass solely through mergers from z=0.8 can make the massive galaxies a factor ~2 larger than observed at z=0. We can also predict stellar mass functions of redshifts up to z~3, and the results are consistent with the latest observations. Comment: 12 pages, 10 figures, accepted for publication in MNRAS

    Calibrating nonconvex penalized regression in ultra-high dimension

    We investigate high-dimensional nonconvex penalized regression, where the number of covariates may grow at an exponential rate. Although recent asymptotic theory established that there exists a local minimum possessing the oracle property under general conditions, it is still largely an open problem how to identify the oracle estimator among potentially multiple local minima. There are two main obstacles: (1) due to the presence of multiple minima, the solution path is nonunique and is not guaranteed to contain the oracle estimator; (2) even if a solution path is known to contain the oracle estimator, the optimal tuning parameter depends on many unknown factors and is hard to estimate. To address these two challenging issues, we first prove that an easy-to-calculate calibrated CCCP algorithm produces a consistent solution path which contains the oracle estimator with probability approaching one. Furthermore, we propose a high-dimensional BIC criterion and show that it can be applied to the solution path to select the optimal tuning parameter which asymptotically identifies the oracle estimator. The theory for a general class of nonconvex penalties in the ultra-high dimensional setup is established when the random errors follow a sub-Gaussian distribution. Monte Carlo studies confirm that the calibrated CCCP algorithm combined with the proposed high-dimensional BIC has desirable performance in identifying the underlying sparsity pattern for high-dimensional data analysis. Comment: Published at http://dx.doi.org/10.1214/13-AOS1159 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org).