Novel specification tests for synchronous additive concurrent model formulation based on martingale difference divergence
This paper presents new specification tests for a general synchronous additive concurrent model formulation. As a novelty, our proposal requires neither a preliminary estimation of the model or error structure nor any tuning parameters. We develop a suitable test statistic using the martingale difference divergence coefficient. This statistic measures the departure from conditional mean independence in the concurrent model framework, using the information of all observed time instants. In particular, both global and partial dependence tests are introduced, allowing one to quantify the effect of a group of covariates or to select covariates one by one. We obtain the asymptotic distribution of the statistic under the null hypothesis and propose a bootstrap algorithm to compute p-values in practice. Through simulations, we illustrate our method and compare its performance with existing competitors. In addition, we apply it to three real datasets related to gait data, flu activity, and casual bike rentals.
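The quantity at the core of the test can be sketched numerically. The following is a minimal illustration, not the paper's algorithm: for scalar X and Y, the sample martingale difference divergence admits the closed form MDD²ₙ(Y|X) = −(1/n²) Σₖ,ₗ (Yₖ−Ȳ)(Yₗ−Ȳ)|Xₖ−Xₗ|, and a resampling p-value can be attached to it. The function names and the Rademacher-weight wild bootstrap below are illustrative assumptions, not the bootstrap scheme of the paper.

```python
import numpy as np

def mdd_sq(x, y):
    """Sample MDD^2(Y|X) for scalar samples:
    -(1/n^2) * sum_{k,l} (Y_k - Ybar)(Y_l - Ybar) * |X_k - X_l|."""
    yc = y - y.mean()
    dist = np.abs(x[:, None] - x[None, :])
    return -(np.outer(yc, yc) * dist).mean()

def mdd_test(x, y, n_boot=199, seed=0):
    """Illustrative wild-bootstrap p-value for H0: E[Y|X] = E[Y].
    Rademacher weights destroy any conditional-mean dependence."""
    rng = np.random.default_rng(seed)
    stat = mdd_sq(x, y)
    yc = y - y.mean()
    boot = np.array([
        mdd_sq(x, yc * rng.choice([-1.0, 1.0], size=len(y)))
        for _ in range(n_boot)
    ])
    pval = (1 + (boot >= stat).sum()) / (n_boot + 1)
    return stat, pval
```

Under conditional mean dependence (e.g. Y = X² + ε) the statistic is clearly positive and the bootstrap p-value small; under independence it fluctuates around zero at rate O(1/n).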
New covariates selection approaches in high dimensional or functional regression models
In a Big Data context, the number of covariates used to explain a variable of interest, p, is likely to be high, sometimes even higher than the available sample size (p > n). Ordinary procedures for fitting regression models start to perform poorly in this situation, so other approaches are needed. A first covariates selection step is of interest in order to retain only the relevant terms and to reduce the dimensionality of the problem. The purpose of this thesis is the study and development of covariates selection techniques for regression models in complex settings. In particular, we focus on high dimensional and functional data contexts of current interest. When some model structure is assumed, regularization techniques are widely employed alternatives for simultaneous model estimation and covariates selection. Specifically, an extensive and critical review of penalization techniques for covariates selection is carried out in the context of the high dimensional linear model of the vectorial framework. Conversely, if one does not want to assume a model structure, state-of-the-art dependence measures based on distances are an attractive option for covariates selection. New specification tests using these ideas are proposed for the functional concurrent model, considering the synchronous and the asynchronous case separately. These approaches are based on novel dependence measures derived from the distance covariance coefficient.
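For reference, the distance covariance coefficient on which these dependence measures build has a simple V-statistic sample form: double-center the pairwise distance matrices of X and Y and average their elementwise product. A minimal sketch for scalar samples (the function name is an illustrative choice):

```python
import numpy as np

def dcov_sq(x, y):
    """Sample distance covariance V_n^2(X, Y) for scalar samples:
    mean of the elementwise product of double-centered distance matrices."""
    def center(d):
        # subtract row means, column means, add back the grand mean
        return d - d.mean(axis=0) - d.mean(axis=1, keepdims=True) + d.mean()
    A = center(np.abs(x[:, None] - x[None, :]))
    B = center(np.abs(y[:, None] - y[None, :]))
    return (A * B).mean()
```

The coefficient is nonnegative and equals zero (in population) exactly under independence, which is what makes it usable for model-free covariates selection.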
Test and Measure for Partial Mean Dependence Based on Deep Neural Networks
It is of great importance to investigate the significance of a subset of
covariates W for the response Y given covariates Z in regression modeling. To
this end, we propose a new significance test for the partial mean independence
problem based on deep neural networks and data splitting. The test statistic
converges to the standard chi-squared distribution under the null hypothesis
while it converges to a normal distribution under the alternative hypothesis.
We also suggest a powerful ensemble algorithm based on multiple data splitting
to enhance the testing power. If the null hypothesis is rejected, we propose a
new partial Generalized Measure of Correlation (pGMC) to measure the partial
mean dependence of Y given W after controlling for the nonlinear effect of Z,
which is an interesting extension of the GMC proposed by Zheng et al. (2012).
We present the appealing theoretical properties of the pGMC and establish the
asymptotic normality of its estimator with the optimal root-n convergence rate.
Furthermore, a valid confidence interval for the pGMC is also derived. As an
important special case, when there are no conditioning covariates Z, we also
consider a new test of the overall significance of the covariates for the
response in a model-free setting. We also introduce a new estimator of the GMC
and derive its asymptotic normality. Numerical studies and real data analyses
are conducted to compare with existing approaches and to illustrate the
validity and flexibility of our proposed procedures.
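The data-splitting scheme behind such a test can be sketched as follows. This is a simplified illustration only: an ordinary least-squares fit stands in for the paper's deep neural network regressions, and a one-sided normal-approximation p-value stands in for the chi-squared and ensemble machinery; all function names are assumptions.

```python
import math
import numpy as np

def ls_fit(X, y):
    """Least-squares stand-in for the neural-network regressions."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return lambda Xnew: Xnew @ coef

def split_test(W, Z, Y, seed=0):
    """Data-splitting test of H0: E[Y|W,Z] = E[Y|Z].
    One half trains fits of E[Y|Z] and E[Y|W,Z]; the other half compares
    held-out squared residuals via a studentized mean difference."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    idx = rng.permutation(n)
    tr, te = idx[: n // 2], idx[n // 2:]
    ones = np.ones((n, 1))
    Zd = np.hstack([ones, Z])        # intercept + Z (null model)
    WZd = np.hstack([ones, W, Z])    # intercept + W + Z (full model)
    g = ls_fit(Zd[tr], Y[tr])
    f = ls_fit(WZd[tr], Y[tr])
    # positive mean difference => W improves held-out prediction of Y
    d = (Y[te] - g(Zd[te])) ** 2 - (Y[te] - f(WZd[te])) ** 2
    t = math.sqrt(len(d)) * d.mean() / d.std(ddof=1)
    pval = 0.5 * math.erfc(t / math.sqrt(2))  # one-sided normal p-value
    return t, pval
```

The ensemble idea in the abstract corresponds to repeating this over multiple random splits and aggregating the resulting statistics to stabilize the power.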
Semiparametric inference in mixture models with predictive recursion marginal likelihood
Predictive recursion is an accurate and computationally efficient algorithm
for nonparametric estimation of mixing densities in mixture models. In
semiparametric mixture models, however, the algorithm fails to account for any
uncertainty in the additional unknown structural parameter. As an alternative
to existing profile likelihood methods, we treat predictive recursion as a
filter approximation to fitting a fully Bayes model, whereby an approximate
marginal likelihood of the structural parameter emerges and can be used for
inference. We call this the predictive recursion marginal likelihood.
Convergence properties of predictive recursion under model mis-specification
also lead to an attractive construction of this new procedure. We show
pointwise convergence of a normalized version of this marginal likelihood
function. Simulations compare the performance of this new marginal likelihood
approach with that of existing profile likelihood methods, as well as Dirichlet
process mixtures, in density estimation. Mixed-effects models and an empirical
Bayes multiple testing application in time series analysis are also considered.
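Predictive recursion itself is a one-pass update of a mixing density on a grid, and the marginal likelihood described above falls out of the same pass as the product of the one-step predictive densities. A minimal sketch for a location mixture (the grid, the weight sequence, and the function names are illustrative choices):

```python
import numpy as np

def predictive_recursion(x, grid, like, gamma=0.67):
    """One-pass predictive recursion for a mixing density on a grid.
    like(xi, grid) returns the kernel density p(xi | u) at each grid point u.
    Returns the estimated mixing density and the log marginal likelihood
    (the sum of log one-step predictive densities)."""
    du = np.gradient(grid)              # quadrature weights
    f = np.ones(grid.size)
    f /= (f * du).sum()                 # uniform initial guess, integrates to 1
    log_ml = 0.0
    for i, xi in enumerate(x, start=1):
        k = like(xi, grid) * f          # unnormalized posterior on the grid
        m = (k * du).sum()              # one-step predictive density at xi
        log_ml += np.log(m)
        w = (i + 1.0) ** (-gamma)       # weights: sum diverges, sum of squares converges
        f = (1 - w) * f + w * k / m
    return f, log_ml
```

For a semiparametric mixture with a structural parameter theta (say, a common kernel scale inside `like`), running the recursion at each candidate theta and comparing the returned log marginal likelihoods is the predictive recursion marginal likelihood idea described above.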
High Dimensional Data Analysis: variable screening and inference
This dissertation focuses on the problem of high dimensional data analysis, which arises in many fields including genomics, finance, and social sciences. In such settings, the number of features or variables is much larger than the number of observations, posing significant challenges to traditional statistical methods.
To address these challenges, this dissertation proposes novel methods for variable screening and inference. The first part of the dissertation focuses on variable screening, which aims to identify a subset of important variables that are strongly associated with the response variable. Specifically, we propose a robust nonparametric screening method to effectively select predictors that are marginally independent of, but conditionally dependent on, the response.
The second part of the dissertation focuses on high dimensional inference, with a problem arising from a microbiome and metabolome study. The microbial community in the human gut is teeming with metabolic activity and plays a key role in host physiology and health, but host-microbiome interactions are not well understood at the level of molecular mechanism, and microbial metabolites have been hypothesized to play a critical role. This motivates us to develop a statistical framework that first quantifies the abundances of microbial metabolites and then examines the associations between such metabolites and disease outcomes. The framework also accounts for potential high-dimensional microbiome confounders, thereby avoiding false discoveries of disease-associated metabolites. We overcome this challenging inference problem using the idea of the debiased lasso. In numerical studies, we demonstrate its significant power improvement over popular existing methods.
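The debiasing idea can be sketched in a few lines: fit a lasso, then correct the coordinate of interest with a score vector built by a nodewise lasso of that column on the remaining ones. The numpy illustration below makes simplifying assumptions (a plain ISTA solver, hand-picked penalties) and omits the high-dimensional microbiome confounder structure that the dissertation's framework handles; the function names are not from the dissertation.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=500):
    """Plain ISTA solver for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    L = np.linalg.norm(X, ord=2) ** 2 / n   # Lipschitz constant of the gradient
    b = np.zeros(p)
    for _ in range(n_iter):
        g = X.T @ (X @ b - y) / n
        u = b - g / L                       # gradient step
        b = np.sign(u) * np.maximum(np.abs(u) - lam / L, 0.0)  # soft-threshold
    return b

def debiased_coord(X, y, b, j, lam):
    """Debiased lasso estimate of coordinate j via a nodewise regression."""
    xj = X[:, j]
    Xmj = np.delete(X, j, axis=1)
    gamma = lasso_ista(Xmj, xj, lam)        # nodewise lasso of x_j on the rest
    z = xj - Xmj @ gamma                    # score vector
    return b[j] + z @ (y - X @ b) / (z @ xj)
```

The correction term removes the shrinkage bias of the lasso in the chosen coordinate, which is what makes asymptotically valid p-values and confidence intervals possible in the p > n regime.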