18 research outputs found

    UPS delivers optimal phase diagram in high-dimensional variable selection

    Full text link
    Consider a linear model Y = Xβ + z, z ∼ N(0, I_n). Here X = X_{n,p}, where both p and n are large but p > n. We model the rows of X as i.i.d. samples from N(0, (1/n)Ω), where Ω is a p × p correlation matrix, which is unknown to us but is presumably sparse. The vector β is also unknown but has relatively few nonzero coordinates, and we are interested in identifying these nonzeros. We propose the Univariate Penalization Screening (UPS) for variable selection. This is a screen and clean method where we screen with univariate thresholding and clean with penalized MLE. It has two important properties: sure screening and separable after screening. These properties enable us to reduce the original regression problem to many small-size regression problems that can be fitted separately. The UPS is effective both in theory and in computation. Comment: Published in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org); DOI: http://dx.doi.org/10.1214/11-AOS947
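    As a rough illustration of the screen-and-clean idea (not the paper's exact procedure, whose cleaning step uses a penalized MLE and exploits the sparsity of Ω), the sketch below screens by thresholding the univariate scores |X_j'y| and then cleans with an ordinary least-squares refit followed by hard thresholding; the threshold t and the cleaning level lam are hypothetical tuning parameters:

```python
import numpy as np

def ups_sketch(X, y, t, lam):
    """Illustrative screen-and-clean (simplified stand-in for UPS):
    screen with univariate thresholding, clean with a least-squares
    refit plus hard thresholding in place of the penalized MLE."""
    scores = np.abs(X.T @ y)                 # univariate scores |X_j' y|
    screened = np.flatnonzero(scores > t)    # survivors of the screening step
    beta_hat = np.zeros(X.shape[1])
    if screened.size:
        coef, *_ = np.linalg.lstsq(X[:, screened], y, rcond=None)
        coef[np.abs(coef) < lam] = 0.0       # crude "cleaning" of the survivors
        beta_hat[screened] = coef
    return beta_hat
```

Because only the screened coordinates are refitted, the expensive step runs on a problem of size |screened| rather than p, which is the computational point of screen-and-clean methods.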

    Rate optimal multiple testing procedure in high-dimensional regression

    Full text link
    Multiple testing and variable selection have gained much attention in statistical theory and methodology research. Both address the same problem of identifying the important variables among many (Jin, 2012), yet there is little overlap in the literature. Research on variable selection has focused on selection consistency, i.e., both type I and type II errors converging to zero. This is only possible when the signals are sufficiently strong, contrary to many modern applications. In the regime where the signals are both rare and weak, it is inevitable that a certain amount of false discoveries will be allowed, as long as some error rate can be controlled. In this paper, motivated by the work of Ji and Jin (2012) and Jin (2012) in the rare/weak regime, we extend their UPS procedure for variable selection to multiple testing. Under certain conditions, the new UPT procedure achieves the fastest convergence rate of the marginal false non-discovery rate, while controlling the marginal false discovery rate at any designated level α asymptotically. Numerical results are provided to demonstrate the advantage of the proposed method. Comment: 27 pages
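    The UPT procedure itself is specific to this regression setting, but the familiar Benjamini-Hochberg step-up rule illustrates the general template of controlling a false discovery rate at a designated level α. This is the standard textbook procedure, shown only for orientation, not the method of the paper:

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Standard BH step-up rule: reject the k* smallest p-values, where
    k* is the largest k with p_(k) <= alpha * k / m."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Indices (in sorted order) where the step-up condition holds.
    passed = np.flatnonzero(p[order] <= alpha * np.arange(1, m + 1) / m)
    reject = np.zeros(m, dtype=bool)
    if passed.size:
        reject[order[: passed[-1] + 1]] = True
    return reject
```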

    Selected Topics In Nonparametric Testing And Variable Selection For High Dimensional Data

    Full text link
    Part I: The Gaussian white noise model has been used as a general framework for nonparametric problems. The asymptotic equivalence of this model to density estimation and nonparametric regression has been established by Nussbaum (1996) and Brown and Low (1996). In Chapter 1, we consider testing for the presence of a signal in Gaussian white noise with intensity n^{-1/2}, when the alternatives are given by smoothness ellipsoids with an L2-ball of radius ρ removed. It is known that, for a fixed Sobolev-type ellipsoid Σ(β, M) of smoothness β and size M, the radius rate ρ ≍ n^{-4β/(4β+1)} is the critical separation rate, in the sense that the minimax error of second kind over α-tests stays asymptotically strictly between 0 and 1 (Ingster, 1982). In addition, Ermakov (1990) found the sharp asymptotics of the minimax error of second kind at the separation rate. For adaptation over both β and M in that context, it is known that a log-log penalty over the separation rate for ρ is necessary for nonzero asymptotic power. Here, following an example in nonparametric estimation related to the Pinsker constant, we investigate the adaptation problem over the ellipsoid size M only, for fixed smoothness degree β. It is established that the Ermakov-type sharp asymptotics can be preserved in that adaptive setting if ρ → 0 slower than the separation rate. The penalty for adaptation in that setting turns out to be a sequence tending to infinity arbitrarily slowly. In Chapter 2, motivated by the sharp asymptotics of nonparametric estimation for non-Gaussian regression (Golubev and Nussbaum, 1990), we extend Ermakov's sharp asymptotics for the minimax testing errors to the nonparametric regression model with nonnormal errors. The paper entitled "Sharp Asymptotics for Risk Bounds in Nonparametric Testing with Uncertainty in Error Distributions" is in preparation. This part is joint work with Michael Nussbaum.
Part II: Consider a linear model Y = Xβ + z, z ∼ N(0, I_n). Here X = X_{n,p}, where both p and n are large but p > n. We model the rows of X as i.i.d. samples from N(0, (1/n)Ω), where Ω is a p × p correlation matrix, which is unknown to us but is presumably sparse. The vector β is also unknown but has relatively few nonzero coordinates, and we are interested in identifying these nonzeros. We propose the Univariate Penalization Screening (UPS) for variable selection. This is a Screen and Clean method where we screen with Univariate thresholding, and clean with Penalized MLE. It has two important properties: Sure Screening and Separable After Screening. These properties enable us to reduce the original regression problem to many small-size regression problems that can be fitted separately. The UPS is effective both in theory and in computation. We measure the performance of a procedure by the Hamming distance, and use an asymptotic framework where p → ∞ and other quantities (e.g., n, sparsity level and strength of signals) are linked to p by fixed parameters. We find that in many cases, the UPS achieves the optimal rate of convergence. Also, for many different Ω, there is a common three-phase diagram in the two-dimensional phase space quantifying the signal sparsity and signal strength. In the first phase, it is possible to recover all signals. In the second phase, it is possible to recover most of the signals, but not all of them. In the third phase, successful variable selection is impossible. UPS partitions the phase space in the same way that the optimal procedures do, and recovers most of the signals as long as successful variable selection is possible. The lasso and the subset selection are well-known approaches to variable selection.
However, somewhat surprisingly, there are regions in the phase space where neither of them is rate optimal, even in very simple settings such as a tridiagonal Ω, and when the tuning parameter is ideally set. This part is joint work with Jiashun Jin, and has appeared in the Annals of Statistics.
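    The Hamming distance used as the performance measure above is simply the number of coordinates on which the selected support disagrees with the true support, i.e., false positives plus false negatives:

```python
import numpy as np

def hamming_selection_error(beta_hat, beta_true):
    """Support disagreements between estimate and truth:
    false positives + false negatives."""
    sel = np.asarray(beta_hat) != 0
    true = np.asarray(beta_true) != 0
    return int(np.sum(sel != true))
```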

    UPS delivers optimal phase diagram in high-dimensional variable selection

    No full text
    Abstract: We consider a linear regression model where both p and n are large but p > n. The vector β is unknown but is sparse in the sense that only a small proportion of its coordinates is nonzero, and we are interested in identifying these nonzero ones. We model the coordinates of β as samples from a two-component mixture (1 − ε)ν_0 + επ, and the rows of X as samples from N(0, (1/n)Ω), where ν_0 is the point mass at 0, π is a distribution, and Ω is a p × p correlation matrix which is unknown but is presumably sparse. We propose a two-stage variable selection procedure which we call the UPS. This is a Screen and Clean procedure. We measure the performance of a variable selection procedure by the Hamming distance, and use an asymptotic framework where p → ∞ and (ε, π, n, Ω) depend on p. We find that in many situations, the UPS achieves the optimal rate of convergence. We also find that in the (ε_p, π_p) space, there is a three-phase diagram shared by many choices of Ω. In the first phase, it is possible to recover all signals. In the second phase, exact recovery is impossible, but it is possible to recover most of the signals. In the third phase, successful variable selection is impossible. The UPS partitions the phase space in the same way that the optimal procedures do, and recovers most of the signals as long as successful variable selection is possible. The lasso and the subset selection (also known as the L1- and L0-penalization methods, respectively) are well-known approaches to variable selection. However, somewhat surprisingly, there are regions in the phase space where neither the lasso nor the subset selection is rate optimal, even for very simple Ω. The lasso is non-optimal because it is too loose in filtering out fake signals (i.e., noise that is highly correlated with a signal), and the subset selection is non-optimal because it tends to kill one or more signals when signals appear in pairs, triplets, etc.
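    To make the modeling assumptions concrete, the snippet below draws β from the two-component mixture and the rows of X from a scaled Gaussian. For illustration, π is taken to be a point mass at a value τ and Ω the identity; both are simplifying choices for this sketch, not assumptions made in the paper:

```python
import numpy as np

def simulate_model(n, p, eps, tau, seed=0):
    """Sample from the abstract's model with pi = point mass at tau and
    Omega = I (illustrative simplifications)."""
    rng = np.random.default_rng(seed)
    beta = np.where(rng.random(p) < eps, tau, 0.0)        # (1-eps)*nu_0 + eps*pi
    X = rng.normal(scale=1.0 / np.sqrt(n), size=(n, p))   # rows ~ N(0, I/n)
    y = X @ beta + rng.normal(size=n)                     # z ~ N(0, I_n)
    return X, y, beta
```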

    Prognostic value of LGR5 in colorectal cancer: a meta-analysis.

    No full text
    Leucine-rich repeat-containing G protein-coupled receptor 5 (LGR5) has recently been reported to be a marker of cancer stem cells (CSCs) in colorectal cancer (CRC), and the prognostic value of LGR5 in CRC has been evaluated in several studies. However, the conclusions remain controversial. In this study, we aimed to evaluate the association between the expression of LGR5 and the outcome of CRC patients by performing a meta-analysis. We systematically searched for relevant studies published up to February 2014 using the PubMed, Web of Science, EMBASE and Wanfang databases. Only articles in which LGR5 expression was detected by immunohistochemistry were included. A meta-analysis was performed using STATA 12.0, and pooled hazard ratios (HRs) with 95% confidence intervals (CIs) were used to estimate the strength of the association between LGR5 expression and the prognosis of CRC patients. A total of 7 studies comprising 1833 CRC patients met the inclusion criteria, including 6 studies comprising 1781 patients for overall survival (OS) and 3 studies comprising 528 patients for disease-free survival (DFS). Our results showed that high LGR5 expression was significantly associated with poor prognosis in terms of OS (HR: 1.87, 95% CI: 1.23-2.84; P = 0.003) and DFS (HR: 2.44, 95% CI: 1.49-3.98; P < 0.001). Further subgroup analysis revealed that many factors, including the study region, number of patients, follow-up duration and cutoff value, affected the significance of the association between LGR5 expression and a worse prognosis in patients with CRC. In addition, there was no evidence of publication bias, as suggested by Begg's and Egger's tests. The present meta-analysis indicated that high LGR5 expression was associated with poor prognosis in patients with CRC and that LGR5 is an efficient prognostic factor in CRC.
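    The pooled hazard ratios reported above come from inverse-variance weighting of study-level log hazard ratios. A minimal fixed-effect version of that calculation (a generic sketch, not the STATA 12.0 routine used in the study) recovers each study's standard error from its reported 95% CI:

```python
import numpy as np

def pool_hazard_ratios(hrs, ci_lows, ci_highs, z=1.959964):
    """Fixed-effect inverse-variance pooling on the log-HR scale."""
    log_hr = np.log(hrs)
    se = (np.log(ci_highs) - np.log(ci_lows)) / (2 * z)   # SE back-calculated from 95% CI
    w = 1.0 / se**2                                       # inverse-variance weights
    pooled = np.sum(w * log_hr) / np.sum(w)
    pooled_se = 1.0 / np.sqrt(np.sum(w))
    lo, hi = np.exp(pooled - z * pooled_se), np.exp(pooled + z * pooled_se)
    return float(np.exp(pooled)), (float(lo), float(hi))
```

A random-effects (e.g., DerSimonian-Laird) model would additionally add a between-study variance estimate to each study's variance before weighting, which is what the REM/FEM distinction in the stratified table refers to.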

    Stratified analysis of pooled hazard ratios for colorectal cancer patients with high LGR5 expression.

    No full text
    OS: Overall survival; HR: Hazard ratio; CI: Confidence interval; REM: Random-effects model; FEM: Fixed-effects model. P (Z): P value for significance test; P (BON): P value from step-down Bonferroni testing.