263 research outputs found
Bayesian nonparametric tests via sliced inverse modeling
We study the problem of independence and conditional independence tests
between categorical covariates and a continuous response variable, which has an
immediate application in genetics. Instead of estimating the conditional
distribution of the response given values of covariates, we model the
conditional distribution of covariates given the discretized response (aka
"slices"). By assigning a prior probability to each possible discretization
scheme, we can compute efficiently a Bayes factor (BF)-statistic for the
independence (or conditional independence) test using a dynamic programming
algorithm. Asymptotic and finite-sample properties such as power and null
distribution of the BF statistic are studied, and a stepwise variable selection
method based on the BF statistic is further developed. We compare the BF
statistic with some existing classical methods and demonstrate its statistical
power through extensive simulation studies. We apply the proposed method to a
mouse genetics data set aiming to detect quantitative trait loci (QTLs) and
obtain promising results. Comment: 32 pages, 7 figures
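The dynamic-programming idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a Dirichlet-multinomial model for the covariate within each slice, slices that are contiguous blocks of the sorted response, and a prior that cuts each gap between adjacent sorted observations independently with probability `cut_prob`. The function names, `cut_prob`, and `alpha` are hypothetical choices made for illustration.

```python
import math


def log_dm(counts, alpha=1.0):
    """Log Dirichlet-multinomial marginal likelihood of an ordered sample
    with the given category counts (symmetric Dirichlet(alpha) prior)."""
    k = len(counts)
    n = sum(counts)
    out = math.lgamma(k * alpha) - math.lgamma(k * alpha + n)
    for c in counts:
        out += math.lgamma(alpha + c) - math.lgamma(alpha)
    return out


def logsumexp(vals):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))


def log_bayes_factor(x, y, n_levels, cut_prob=0.1, alpha=1.0):
    """Log Bayes factor of 'x depends on y through some slicing' versus
    'x is independent of y', summing over all contiguous slicings.

    x: categorical covariate coded 0..n_levels-1; y: continuous response.
    Assumed prior (hypothetical): each gap is a cut with prob. cut_prob.
    """
    order = sorted(range(len(y)), key=lambda i: y[i])
    xs = [x[i] for i in order]
    n = len(xs)
    log_cut, log_stay = math.log(cut_prob), math.log1p(-cut_prob)
    # f[j] = log of the prior-weighted marginal likelihood of the first j
    # sorted observations, summed over all slicings of those observations.
    f = [0.0] + [-math.inf] * n
    for j in range(1, n + 1):
        counts = [0] * n_levels
        terms = []
        # The last slice covers sorted positions i .. j-1.
        for i in range(j - 1, -1, -1):
            counts[xs[i]] += 1
            log_prior = (j - i - 1) * log_stay + (log_cut if i > 0 else 0.0)
            terms.append(f[i] + log_prior + log_dm(counts, alpha))
        f[j] = logsumexp(terms)
    null_counts = [xs.count(k) for k in range(n_levels)]
    return f[n] - log_dm(null_counts, alpha)
```

The recursion sums over all 2^(n-1) slicings in O(n²) slice evaluations, which is what makes the statistic cheap to compute. For example, `log_bayes_factor([0, 0, 1, 1, 1, 0], [0.1, 0.2, 0.9, 1.1, 1.3, 0.15], n_levels=2)` is positive, because the covariate separates perfectly once the response is sorted.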
The Bright and Dark Side of Cooperation for Regional Innovation Performance
Studies analyzing the importance of intra- and inter-regional cooperation for regional innovation performance are mainly qualitative in nature and focus strongly on the positive effects that high levels of cooperation can yield. For the German labor market regions and the Electrics & Electronics industry, the paper provides a quantitative-empirical analysis that takes into account possible negative effects related to regional lock-in, lock-out, and cooperation overload. Using conditional nonparametric frontier techniques and measures of cooperation behavior, we find positive as well as substantial negative effects of cooperation, the latter being induced by excessive and unbalanced cooperation behavior. Keywords: regional innovation performance, cooperation, lock-out, lock-in, cooperation overload
Adaptive Basis Sampling for Smoothing Splines
Smoothing splines provide flexible nonparametric regression estimators. The penalized likelihood method is adopted when responses are from exponential families, and multivariate models are constructed with a certain analysis of variance decomposition. However, the high computational cost of smoothing splines for large data sets has hindered their wide application. We develop a new method, named adaptive basis sampling, for efficient computation of smoothing splines in super-large samples. Generally, a smoothing spline for a regression problem with sample size n can be expressed as a linear combination of n basis functions, and its computational complexity is O(n³). We achieve a more scalable computation in the multivariate case by evaluating the smoothing spline using a smaller set of basis functions, obtained by an adaptive sampling scheme that uses values of the response variable. Our asymptotic analysis shows that smoothing splines computed via adaptive basis sampling converge to the true function at the same rate as full basis smoothing splines. Simulation studies show that the proposed method outperforms a sampling method that does not use the values of the response variable, and we apply it to several real data examples.
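One plausible reading of "an adaptive sampling scheme that uses values of the response variable" is sketched below. This is an assumption, not a detail stated in the abstract: the range of the response is split into equal-width slices, and the same number of observations is drawn from each slice, so that sparsely populated response regions still contribute basis functions. The function name and its parameters are hypothetical.

```python
import random


def adaptive_basis_indices(y, n_slices, per_slice, seed=0):
    """Pick observation indices whose basis functions will be kept.

    Assumed scheme: slice the range of y into n_slices equal-width
    intervals and sample up to per_slice indices uniformly from each.
    """
    rng = random.Random(seed)
    lo, hi = min(y), max(y)
    width = (hi - lo) / n_slices or 1.0  # guard against a constant response
    slices = [[] for _ in range(n_slices)]
    for i, v in enumerate(y):
        k = min(int((v - lo) / width), n_slices - 1)  # max(y) goes in last slice
        slices[k].append(i)
    chosen = []
    for members in slices:
        chosen.extend(rng.sample(members, min(per_slice, len(members))))
    return sorted(chosen)
```

Compared with sampling indices uniformly at random, this stratified draw guarantees that rare response values are represented. The selected indices would then define the reduced basis used to fit the spline, so the cost depends on the number of sampled basis functions rather than on n.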
New developments of dimension reduction
Variable selection becomes more crucial than before, since high dimensional data are frequently seen in many research areas. Many model-based variable selection methods have been developed. However, the performance might be poor when the model is mis-specified. Sufficient dimension reduction (SDR, Li 1991; Cook 1998) provides a general framework for model-free variable selection methods.
In this thesis, we first propose a novel model-free variable selection method to deal with multi-population data by incorporating the grouping information. Theoretical properties of our proposed method are also presented. Simulation studies show that our new method significantly improves the selection performance compared with methods that ignore the grouping information. In the second part of this dissertation, we apply the partial SDR method to conduct conditional model-free variable (feature) screening for ultra-high dimensional data, when researchers have prior information regarding the importance of certain predictors based on experience or previous investigations. Our method greatly outperforms the state-of-the-art conditional screening method, conditional sure independence screening (CSIS; Barut, Fan and Verhasselt, 2016), for nonlinear models. The sure screening consistency property of our proposed method is also established. --Abstract, page iv
Partition Models for Variable Selection and Interaction Detection
Variable selection methods play important roles in modeling high-dimensional data and are key to data-driven scientific discoveries. In this thesis, we consider the problem of variable selection with interaction detection. Instead of building a predictive model of the response given combinations of predictors, we start by modeling the conditional distribution of predictors given partitions based on responses. We use this inverse modeling perspective as motivation to propose a stepwise procedure for effectively detecting interactions with few assumptions on parametric form. The proposed procedure is able to detect pairwise interactions among p predictors with a computational time of instead of under moderate conditions. We establish consistency of the proposed procedure in variable selection under a diverging number of predictors and sample size. We demonstrate its excellent empirical performance in comparison with some existing methods through simulation studies as well as real data examples. Next, we combine the forward and inverse modeling perspectives under the Bayesian framework to detect pleiotropic and epistatic effects in expression quantitative trait loci (eQTL) studies. We augment the Bayesian partition model proposed by Zhang et al. (2010) to capture the complex dependence structure among gene expression and genetic markers. In particular, we propose a sequential partition prior to model the asymmetric roles played by the response and the predictors, and we develop an efficient dynamic programming algorithm for sampling latent individual partitions. The augmented partition model significantly improves the power in detecting eQTLs compared to previous methods in both simulations and real data examples pertaining to yeast. Finally, we study the application of Bayesian partition models in the unsupervised learning of transcription factor (TF) families based on protein binding microarray (PBM).
The problem of TF subclass identification can be viewed as the clustering of TFs with variable selection on their binding DNA sequences. Our model provides simultaneous identification of TF families and their shared sequence preferences, as well as DNA sequences bound preferentially by individual members of TF families. Our analysis may aid in deciphering cis regulatory codes and determinants of protein-DNA binding specificity.
A new sliced inverse regression method for multivariate response
A semiparametric regression model of a q-dimensional multivariate response y on a p-dimensional covariate x is considered. A new approach is proposed based on sliced inverse regression (SIR) for estimating the effective dimension reduction (EDR) space without requiring a prespecified parametric model. The estimated EDR space is shown to converge at rate square root of n. The choice of the dimension of the EDR space is discussed. Moreover, a way to cluster components of y related to the same EDR space is provided. Thus, the proposed multivariate SIR method can be applied properly to each cluster instead of blindly to all components of y. The numerical performance of multivariate SIR is illustrated in a simulation study. Applications to a remote sensing dataset and to the Minneapolis elementary schools data are also provided. Although the proposed methodology relies on SIR, it opens the door for new regression approaches with a multivariate response.
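For orientation, the classical univariate-response SIR estimator on which the abstract above builds can be summarized as follows. This is the standard construction, not the new multivariate method, whose details the abstract does not give. The symbols H (number of slices), p_h (slice proportions), and d (dimension of the EDR space) are notation introduced here.

```latex
% Standardize the covariate using its sample mean and covariance.
z_i = \hat{\Sigma}^{-1/2} (x_i - \bar{x}), \qquad i = 1, \dots, n.

% Partition the range of y into slices S_1, \dots, S_H and compute
% slice proportions and slice means of the standardized covariate.
\hat{p}_h = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{ y_i \in S_h \}, \qquad
\bar{z}_h = \frac{1}{n \hat{p}_h} \sum_{i=1}^{n} z_i \, \mathbf{1}\{ y_i \in S_h \}.

% Weighted covariance matrix of the slice means.
\hat{M} = \sum_{h=1}^{H} \hat{p}_h \, \bar{z}_h \bar{z}_h^{\top}.

% The d leading eigenvectors \hat{\eta}_1, \dots, \hat{\eta}_d of \hat{M}
% give the estimated EDR directions on the original scale.
\hat{\beta}_k = \hat{\Sigma}^{-1/2} \hat{\eta}_k, \qquad k = 1, \dots, d.
```

With a q-dimensional response, slicing is no longer a matter of cutting a single axis, which is the difficulty a multivariate extension has to address.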