Sparse high-dimensional linear mixed modeling with a partitioned empirical Bayes ECM algorithm
High-dimensional longitudinal data is increasingly used in a wide range of
scientific studies. However, there are few statistical methods for
high-dimensional linear mixed models (LMMs), as most Bayesian variable
selection or penalization methods are designed for independent observations.
Additionally, the few available software packages for high-dimensional LMMs
suffer from scalability issues. This work presents an efficient and accurate
Bayesian framework for high-dimensional LMMs. We use empirical Bayes estimators
of hyperparameters for increased flexibility and an
Expectation-Conditional-Minimization (ECM) algorithm for computationally
efficient maximum a posteriori probability (MAP) estimation of parameters. The
novelty of the approach lies in its partitioning and parameter expansion as
well as its fast and scalable computation. We illustrate Linear Mixed Modeling
with PaRtitiOned empirical Bayes ECM (LMM-PROBE) in simulation studies
evaluating fixed and random effects estimation along with computation time. A
real-world example is provided using data from a study of lupus in children,
where we identify genes and clinical factors associated with a new lupus
biomarker and predict the biomarker over time.
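The full LMM-PROBE procedure (partitioning, parameter expansion, empirical Bayes hyperparameters) is specific to the paper, but the ECM idea behind MAP estimation in an LMM can be sketched on the simplest case, a random-intercept model. The function below is a hypothetical illustration, not the authors' implementation: an E-step computes the posterior of each group's random intercept, and conditional maximization steps then update the variance components and the fixed effects in turn.

```python
import numpy as np

def ecm_random_intercept(y, X, groups, n_iter=200):
    """ECM-style fit of y_ij = x_ij' beta + b_i + e_ij,
    with b_i ~ N(0, s2b) and e_ij ~ N(0, s2e).  Illustrative only."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    s2b, s2e = 1.0, 1.0
    ids = np.unique(groups)
    for _ in range(n_iter):
        resid = y - X @ beta
        m = np.zeros_like(y)          # posterior mean of b_i, per observation
        sum_b2, sum_v = 0.0, 0.0
        for g in ids:                 # E-step: posterior of each intercept
            idx = groups == g
            n_i = idx.sum()
            v = 1.0 / (n_i / s2e + 1.0 / s2b)   # posterior variance
            mu = v * resid[idx].sum() / s2e     # posterior mean
            m[idx] = mu
            sum_b2 += mu ** 2 + v
            sum_v += n_i * v
        # CM-steps: variance components, then fixed effects given E[b]
        s2b = sum_b2 / len(ids)
        s2e = (np.sum((resid - m) ** 2) + sum_v) / len(y)
        beta = np.linalg.lstsq(X, y - m, rcond=None)[0]
    return beta, s2b, s2e
```

Each conditional update has a closed form, which is what makes ECM-type schemes attractive when the full joint maximization would be expensive.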
Mixed-effects high-dimensional multivariate regression via group-lasso regularization
Linear mixed modeling is a well-established technique widely employed
when observations possess a grouping structure. Nonetheless, this standard methodology
is no longer applicable when the learning framework encompasses a multivariate
response and high-dimensional predictors. To overcome these issues, in the
present paper a penalized estimation procedure for multivariate linear mixed-effects
models (MLMM) is introduced. In detail, we propose to regularize the likelihood
via a group-lasso penalty, forcing only a subset of the estimated parameters to be
preserved across all components of the multivariate response. The methodology is
employed to develop novel surrogate biomarkers for cardiovascular risk factors,
such as lipids and blood pressure, from whole-genome DNA methylation data in
a multi-center study. The described methodology performs better than current
state-of-the-art alternatives in predicting a multivariate continuous outcome.
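The core of the estimation scheme is a group-lasso penalty that either keeps or discards a predictor jointly across all components of the multivariate response. A minimal proximal-gradient sketch of that penalty (not the authors' code; function names and the row-wise grouping are assumptions) is:

```python
import numpy as np

def prox_group_lasso(B, thresh):
    # Row-wise group soft-thresholding: each row of B holds one
    # predictor's coefficients across all responses and is shrunk
    # as a group, so a predictor is dropped for all outcomes at once.
    norms = np.linalg.norm(B, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - thresh / np.maximum(norms, 1e-12))
    return B * scale

def fit_group_lasso(X, Y, lam, n_iter=500):
    # Proximal gradient descent on (1/2n)||Y - XB||_F^2 + lam * sum_j ||B_j||_2
    n, p = X.shape
    lr = n / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz constant of the gradient
    B = np.zeros((p, Y.shape[1]))
    for _ in range(n_iter):
        grad = X.T @ (X @ B - Y) / n
        B = prox_group_lasso(B - lr * grad, lr * lam)
    return B
```

Because the proximal operator zeroes entire rows, the sparsity pattern is shared across the response components, which is exactly the structure the abstract describes.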
Phenotype prediction and variable selection in high-dimensional linear and linear mixed models
Recent technologies have provided scientists with high-dimensional genomic and post-genomic data, in which the number of measured variables always exceeds the number of individuals. Such high-dimensional datasets usually require additional assumptions in order to be analyzed, such as a sparsity condition under which only a small subset of the variables is assumed to be relevant. In this high-dimensional context we worked on a real dataset from the pig species obtained with high-throughput biotechnologies: metabolomic data measured by NMR spectroscopy and phenotypic data obtained mostly post-mortem. There are two objectives. On the one hand, we aim at obtaining good predictions of phenotypes of interest for pig production; on the other hand, we want to pinpoint the metabolomic variables that explain the phenotype under study. Using the Lasso method in a linear model, we show that metabolomic data have real predictive power for some phenotypes important for livestock production, such as lean meat percentage and average daily food consumption. The second objective is a variable selection problem. Classic statistical tools such as the Lasso method and the FDR procedure are investigated, and new, more powerful methods are developed: we propose a variable selection method for linear models based on multiple hypothesis testing, with non-asymptotic power results under a condition on the signal. Since supplemental data are available on the animals, such as the batches in which they were raised or their family relationships, linear mixed models are also considered.
A new algorithm for fixed-effects selection is developed, and it turns out to be much faster than existing algorithms with the same goal. Thanks to its decomposition into distinct steps, the algorithm can be combined with any variable selection method built for the classical linear model; however, its convergence properties depend on the method used. Combining this algorithm with the multiple hypothesis testing procedure gives very good empirical results. All of these methods are applied to the real dataset, and biological relationships are brought to light.
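The Lasso-based prediction and selection step used throughout the thesis can be sketched with a plain NumPy implementation of ISTA (iterative soft-thresholding); this is a generic illustration of the Lasso, not the thesis's own algorithm or code:

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=2000):
    # ISTA: a gradient step on the least-squares loss followed by
    # soft-thresholding, which produces exact zeros and hence
    # performs variable selection.
    n, p = X.shape
    lr = n / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz constant
    b = np.zeros(p)
    for _ in range(n_iter):
        b = b - lr * (X.T @ (X @ b - y)) / n
        b = np.sign(b) * np.maximum(np.abs(b) - lr * lam, 0.0)
    return b
```

The selected variables are simply the coordinates with nonzero (or non-negligible) coefficients; for a mixed model, a fixed-effects selection scheme of the kind described above would apply such a method after accounting for the random-effects structure.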
A general framework for penalized mixed-effects multitask learning with applications on DNA methylation surrogate biomarkers creation
Recent evidence highlights the usefulness of DNA methylation (DNAm)
biomarkers as surrogates for exposure to risk factors for noncommunicable
diseases in epidemiological studies and randomized trials. DNAm variability
has been demonstrated to be tightly related to lifestyle behavior and exposure
to environmental risk factors, ultimately providing an unbiased proxy of
an individual's state of health. At present, the creation of DNAm surrogates
relies on univariate penalized regression models, with the elastic-net
regularizer being the gold standard for this task. Nonetheless, more advanced
modeling procedures are required in the presence of multivariate outcomes
with a structured dependence pattern among the study samples. In this
work we propose a general framework for mixed-effects multitask learning
in the presence of high-dimensional predictors to develop a multivariate DNAm
biomarker from a multicenter study. A penalized estimation scheme, based
on an expectation-maximization algorithm, is devised in which any penalty
criteria for fixed-effects models can be conveniently incorporated in the fitting
process. We apply the proposed methodology to create novel DNAm
surrogate biomarkers for multiple correlated risk factors for cardiovascular
diseases and comorbidities. We show that the proposed approach, modeling
multiple outcomes together, outperforms state-of-the-art alternatives both in
predictive power and biomolecular interpretation of the results.
Ensemble methods for ranking data with and without position weights
The main goal of this Thesis is to build suitable Ensemble Methods for ranking data with weights assigned to the items' positions, in the cases of rankings with and without ties.
The Thesis begins with the definition of a new rank correlation coefficient able to take into account the importance of the items' positions. Inspired by the rank correlation coefficient τ_x proposed by Emond and Mason (2002) for unweighted rankings and by the weighted Kemeny distance proposed by García-Lapresta and Pérez-Román (2010), this work proposes τ_x^w, a new rank correlation coefficient corresponding to the weighted Kemeny distance. The new coefficient is analyzed analytically and empirically and represents the core of the consensus ranking process. Simulations and applications to real cases are presented. In a second step, in order to detect which predictors better explain a phenomenon, the Thesis proposes decision trees for ranking data with and without weights, discussing and comparing the results. A simulation study shows the impact of different weight structures on the ability of decision trees to describe the data. In the third part, ensemble methods for ranking data, more specifically Bagging and Boosting, are introduced. Last but not least, a review on a different topic is included in this Thesis. The review compares a significant number of linear mixed model selection procedures available in the literature, answering a pressing issue in the framework of LMMs: how to identify the best approach to adopt in a specific case. The work outlines essentially all approaches found in the literature. This review represents my first academic training in conducting research.
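The unweighted coefficient τ_x of Emond and Mason (2002), which the thesis extends with position weights, can be computed directly from its score matrices. The sketch below implements only the unweighted τ_x (the weighted τ_x^w would additionally scale each pairwise term by the weights of the positions involved, which is not shown here):

```python
import numpy as np

def tau_x(r1, r2):
    # Emond & Mason (2002) rank correlation tau_x.
    # r[i] is the position of item i (1 = best); tied items share a position.
    r1, r2 = np.asarray(r1), np.asarray(r2)
    n = r1.size
    # Score matrices: a_ij = +1 if item i is ranked ahead of or tied
    # with item j, -1 if ranked behind it, 0 on the diagonal.
    A = np.where(r1[:, None] <= r1[None, :], 1, -1)
    B = np.where(r2[:, None] <= r2[None, :], 1, -1)
    np.fill_diagonal(A, 0)
    np.fill_diagonal(B, 0)
    return (A * B).sum() / (n * (n - 1))
```

The coefficient is 1 for identical rankings and -1 for completely reversed ones, and, unlike Kendall's τ, it handles ties so that a ranking correlates perfectly with itself.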