Sparse reduced-rank regression for imaging genetics studies: models and applications
We present a novel statistical technique, the sparse reduced-rank regression (sRRR) model,
a strategy for multivariate modelling of high-dimensional imaging responses and
genetic predictors. By adopting penalisation techniques, the model is able to enforce sparsity
in the regression coefficients, identifying subsets of genetic markers that best explain
the variability observed in subsets of the phenotypes. To properly exploit the rich structure
present in each of the imaging and genetics domains, we additionally propose the use of
several structured penalties within the sRRR model. Using simulation procedures that accurately
reflect realistic imaging genetics data, we present detailed evaluations of the sRRR
method in comparison with the more traditional univariate linear modelling approach. In
all settings considered, we show that sRRR possesses better power to detect the deleterious
genetic variants. Moreover, using a simple genetic model, we demonstrate the potential
benefits, in terms of statistical power, of carrying out voxel-wise searches as opposed to
extracting averages over regions of interest in the brain. Since this entails the use of phenotypic
vectors of enormous dimensionality, we suggest the use of a sparse classification
model as a de-noising step, prior to the imaging genetics study. Finally, we present the
application of a data re-sampling technique within the sRRR model for model selection.
Using this approach we are able to rank the genetic markers in order of importance of association
to the phenotypes, and similarly rank the phenotypes in order of importance to
the genetic markers. To conclude, we illustrate the practical application of the proposed
statistical models on three real imaging genetics datasets and highlight some potential
associations.
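The core sRRR idea, a low-rank coefficient matrix B = U Vᵀ with an ℓ1 penalty that zeroes out rows of U (i.e. irrelevant genetic markers), can be sketched with a simple alternating scheme. This is a hypothetical minimal implementation, not the authors' code; the structured penalties and re-sampling steps described above are omitted:

```python
import numpy as np

def soft_threshold(z, lam):
    """Elementwise soft-thresholding; sets small entries exactly to zero."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def srrr(X, Y, rank=1, lam=1.0, n_outer=50, n_inner=10):
    """Sparse reduced-rank regression: Y ~ X @ U @ V.T with orthonormal V
    and an l1 penalty on U that enforces sparsity over predictors."""
    p, q = X.shape[1], Y.shape[1]
    rng = np.random.default_rng(0)
    V = np.linalg.qr(rng.standard_normal((q, rank)))[0]
    U = np.zeros((p, rank))
    step = 1.0 / np.linalg.norm(X, 2) ** 2  # safe gradient step size
    for _ in range(n_outer):
        # U-step: proximal gradient on ||Y @ V - X @ U||_F^2 + lam * ||U||_1
        for _ in range(n_inner):
            grad = X.T @ (X @ U - Y @ V)
            U = soft_threshold(U - step * grad, step * lam)
        # V-step: orthogonal Procrustes solution for fixed U
        Uo, _, Vh = np.linalg.svd(Y.T @ (X @ U), full_matrices=False)
        V = Uo @ Vh
    return U, V
```

With a suitable penalty level, predictors unrelated to the low-rank signal receive exactly zero coefficients, which is what permits the marker-subset interpretation described above.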
Graph based fusion of high-dimensional gene- and microRNA expression data
One of the main goals in cancer studies including high-throughput microRNA
(miRNA) and mRNA data is to find and assess prognostic signatures capable
of predicting clinical outcome. Both mRNA and miRNA expression changes in
cancer diseases are described to reflect clinical characteristics like staging and
prognosis. Furthermore, miRNA abundance can directly affect target transcripts
and translation in tumor cells. Prediction models are trained to identify either
mRNA or miRNA signatures for patient stratification. With the increasing
number of microarray studies collecting mRNA and miRNA from the same
patient cohort there is a need for statistical methods to integrate or fuse both
kinds of data into one prediction model in order to find a combined signature
that improves the prediction.
Here, we propose a new method to fuse miRNA and mRNA data into one
prediction model. Since miRNAs are known regulators of mRNAs, correlations
between miRNA and mRNA expression data as well as target prediction
information were used to build a bipartite graph representing the relations
between miRNAs and mRNAs.
Feature selection is a critical part when fitting prediction models to high-
dimensional data. Most methods treat features, in this case genes or miRNAs,
as independent, an assumption that does not hold true when dealing with
combined gene and miRNA expression data. To improve prediction accuracy, a
description of the correlation structure in the data is needed. In this work, the
bipartite graph was used to guide the feature selection, thereby improving
prediction results and yielding a stable prognostic signature of miRNAs and genes.
The method is evaluated on a prostate cancer data set comprising 98 patient
samples with miRNA and mRNA expression data. The biochemical relapse, an
important event in prostate cancer treatment, was used as clinical endpoint.
Biochemical relapse denotes the renewed rise of the blood level of a prostate
marker (PSA) after surgical removal of the prostate. A relapse may indicate
metastases and is usually the point in clinical practice at which further
treatment is decided.
A boosting approach was used to predict the biochemical relapse. It could
be shown that the bipartite graph in combination with miRNA and mRNA
expression data could improve prediction performance. Furthermore, the approach
improved the stability of the feature selection and thereby yielded more
consistent marker sets. Notably, the marker sets produced by this new method
contain mRNAs as well as miRNAs.
The new approach was compared with two state-of-the-art methods suited for
high-dimensional data and showed better prediction performance in both cases.
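The graph-construction step described above can be sketched as follows: connect a miRNA to an mRNA only when target-prediction information nominates the pair and the expression correlation across samples is sufficiently negative. This is a hypothetical simplification; the correlation cutoff and the shape of the `targets` map are assumptions, and the paper's method additionally feeds the graph into the boosting-based feature selection:

```python
import numpy as np

def build_bipartite_graph(mirna, mrna, targets, corr_cut=-0.3):
    """Link miRNA i to mRNA j when j is a predicted target of i AND their
    expression across samples is sufficiently negatively correlated,
    reflecting the repressive action of miRNAs on their targets.
    `targets` maps miRNA index -> iterable of predicted target mRNA
    indices (in practice taken from a target-prediction database)."""
    edges = set()
    for i in range(mirna.shape[1]):
        for j in targets.get(i, ()):
            r = np.corrcoef(mirna[:, i], mrna[:, j])[0, 1]
            if r <= corr_cut:
                edges.add((i, j))
    return edges
```

The resulting edge set is exactly the kind of correlation structure that lets a feature selector treat a miRNA and its regulated genes as related rather than independent.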
Sparse Functional Linear Discriminant Analysis
Functional linear discriminant analysis offers a simple yet efficient method for classification, with the possibility of achieving perfect classification. Several methods proposed in the literature mostly address the dimensionality of the problem. On the other hand, there is a growing interest in the interpretability of the analysis, which favors a simple and sparse solution. In this work, we propose a new approach that incorporates a type of sparsity that identifies nonzero sub-domains in the functional setting, offering a solution that is easier to interpret without compromising performance. With the need to embed additional constraints in the solution, we reformulate functional linear discriminant analysis as a regularization problem with an appropriate penalty. Inspired by the success of ℓ1-type regularization at inducing zero coefficients for scalar variables, we develop a new regularization method for functional linear discriminant analysis that incorporates an ℓ1-type penalty, ∫|f|, to induce zero regions. We demonstrate that our formulation has a well-defined solution that contains zero regions, achieving functional sparsity in the sense of domain selection. In addition, the misclassification probability of the regularized solution is shown to converge to the Bayes error if the data are Gaussian. Our method does not presume that the underlying function has zero regions in the domain, but produces a sparse estimator that consistently estimates the true function whether or not the latter is sparse. Numerical comparisons with existing methods demonstrate this property in finite samples with both simulated and real data examples.
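The mechanism by which an ∫|f| penalty produces zero sub-domains can be seen in a toy discretized setting: with a pointwise fit term the penalized solution is soft-thresholding, which zeroes the function exactly wherever the data pull is weak. This is only an illustration of why an ℓ1-type penalty yields domain sparsity, not the paper's actual estimator (which also enforces smoothness and operates on the discriminant function):

```python
import numpy as np

def penalized_fit(y, lam):
    """Minimize sum_i (y_i - f_i)^2 + lam * sum_i |f_i|, a discretized
    version of ||y - f||^2 + lam * int |f|.  The problem separates over
    grid points and is solved by soft-thresholding, so f is exactly zero
    wherever |y_i| <= lam / 2 -- a zero 'sub-domain' of the function."""
    return np.sign(y) * np.maximum(np.abs(y) - lam / 2.0, 0.0)
```

Because the penalty is non-smooth at zero, entire regions of the estimated function collapse to exactly zero rather than merely becoming small, which is what makes the fitted discriminant interpretable as a domain selection.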
Model-Free Variable Screening, Sparse Regression Analysis and Other Applications with Optimal Transformations
Variable screening and variable selection methods play important roles in modeling high dimensional data. Variable screening is the process of filtering out irrelevant variables, with the aim to reduce the dimensionality from ultrahigh to high while retaining all important variables. Variable selection is the process of selecting a subset of relevant variables for use in model construction. The main theme of this thesis is to develop variable screening and variable selection methods for high dimensional data analysis. In particular, we will present two relevant methods for variable screening and selection under a unified framework based on optimal transformations.
In the first part of the thesis, we develop a maximum correlation-based sure independence screening (MC-SIS) procedure to screen features in an ultrahigh-dimensional setting. We show that MC-SIS possesses the sure screen property without imposing model or distributional assumptions on the response and predictor variables. MC-SIS is a model-free method in contrast with some other existing model-based sure independence screening methods in the literature. In the second part of the thesis, we develop a novel method called SParse Optimal Transformations (SPOT) to simultaneously select important variables and explore relationships between the response and predictor variables in high dimensional nonparametric regression analysis. Not only are the optimal transformations identified by SPOT interpretable, they can also be used for response prediction. We further show that SPOT achieves consistency in both variable selection and parameter estimation.
Besides variable screening and selection, we also consider other applications with optimal transformations. In the third part of the thesis, we propose several dependence measures, for both univariate and multivariate random variables, based on maximum correlation and B-spline approximation. B-spline based Maximum Correlation (BMC) and Trace BMC (T-BMC) are introduced to measure dependence between two univariate random variables. As extensions to BMC and T-BMC, Multivariate BMC (MBMC) and Trace Multivariate BMC (T-MBMC) are proposed to measure dependence between multivariate random variables. We give convergence rates for both BMC and T-BMC.
Numerical simulations and real data applications are used to demonstrate the performance of the proposed methods. The results show that the proposed methods outperform existing ones and can serve as effective tools in practice.
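MC-SIS ranks predictors by their maximal correlation with the response, which the thesis estimates via B-spline optimal transformations. A crude stand-in, taking the best |Pearson r| over a small fixed dictionary of transformations, already shows why this model-free score catches nonlinear signals that plain correlation screening misses. The dictionary below is an arbitrary assumption, not the thesis's B-spline construction:

```python
import numpy as np

def mc_screen(X, y, keep):
    """Keep the `keep` features with the largest approximate 'maximal
    correlation' score: the best |Pearson r| over a small dictionary of
    transformations of each predictor (a stand-in for B-spline bases)."""
    transforms = (lambda v: v, np.abs, np.square, np.sin)
    scores = np.array([
        max(abs(np.corrcoef(t(X[:, j]), y)[0, 1]) for t in transforms)
        for j in range(X.shape[1])
    ])
    return np.argsort(scores)[::-1][:keep]
```

A purely quadratic signal has near-zero Pearson correlation with the response, yet survives this screen because its squared transformation correlates strongly.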
Determination of the composition of failure time models with long-term survivors
Indiana University-Purdue University Indianapolis (IUPUI)
Failure-time data with long-term survivors are frequently encountered in clinical
investigations. A standard approach for analyzing such data is to add a logistic
regression component to the traditional proportional hazards model to accommodate
the individuals that are not at risk of the event. One such formulation is the
cure rate model; other formulations with similar structures are also used in
practice. The increased complexity presents a great challenge for determining the
model composition. Importantly, no existing model selection tools are directly
applicable for determining the composition of such models. This dissertation
focuses on two key questions concerning the construction of complex survival
models with long-term survivors: (1) which independent variables should be
included in which modeling components? (2) what functional form should each
variable assume? I address these questions by proposing a set of regularized
estimation procedures using the Least Absolute Shrinkage and Selection Operator
(LASSO). Specifically, I present variable selection and structural discovery
procedures for a broad class of survival models with long-term survivors. The
selection performance of the proposed methods is evaluated through carefully
designed simulation studies.
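One building block of such regularized procedures is an ℓ1-penalized logistic regression for the cure (not-at-risk) component; a minimal proximal-gradient (ISTA) sketch is below. This is a hypothetical illustration of the LASSO machinery only; the dissertation's procedures couple it with the penalized proportional-hazards part and with structural discovery:

```python
import numpy as np

def lasso_logistic(X, y, lam=0.05, step=0.5, n_iter=1000):
    """Proximal gradient for l1-penalized logistic regression: a gradient
    step on the mean log-loss followed by soft-thresholding, which drives
    coefficients of irrelevant covariates exactly to zero."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        p_hat = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (p_hat - y) / n          # gradient of the mean log-loss
        z = beta - step * grad
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return beta
```

The same penalize-then-threshold pattern carries over to the hazards component, which is what makes joint variable selection across both model parts feasible.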
Image Guided Respiratory Motion Analysis: Time Series and Image Registration.
The efficacy of image-guided radiation therapy (IGRT) systems relies on accurately extracting, modeling, and predicting tumor movement with imaging techniques. This thesis
investigates two key problems associated with such systems: motion modeling and image
processing. For thoracic and upper abdominal tumors, respiratory motion is the dominant
factor for tumor movement. We have studied several special structured time series analysis techniques to incorporate the semi-periodicity characteristics of respiratory motion.
The proposed methods are robust towards large variations among fractions and populations; the algorithms perform stably in the presence of sparse radiographic observations
with noise. We have proposed a subspace projection method to quantitatively evaluate the
semi-periodicity of a given observation trace; a nonparametric local regression approach
for real-time prediction of respiratory motion; a state augmentation scheme to model hysteresis; and an ellipse tracking algorithm to estimate the trend of respiratory motion in
real time. For image processing, we have focused on designing regularizations to account
for prior information in image registration problems. We investigated a penalty function design that accommodates tissue-type-dependent elasticity information. We studied a class of discontinuity preserving regularizers that yield smooth deformation estimates
in most regions, yet allow discontinuities supported by the data. We have further proposed a
discriminative regularizer that preserves shear discontinuities but discourages folding or vacuum-generating flows. In addition, we have initiated a preliminary principled study on the
fundamental performance limit of image registration problems. We proposed a statistical
generative model to account for noise effect in both source and target images, and investigated the approximate performance of the maximum-likelihood estimator corresponding
to the generative model and the commonly adopted M-estimator. A simple example suggests that the approximation is reasonably accurate.
Our studies in both time series analysis and image registration constitute essential
building blocks for clinical applications such as adaptive treatment. Beyond their theoretical interest, it is our sincere hope that, with further justification, the proposed techniques
will realize their clinical value and improve the quality of life for patients.
Ph.D., Electrical Engineering: Systems, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/60673/1/druan_1.pd
Stochastic bandit algorithms for electricity demand-side management
As electricity is hard to store, the balance between production and consumption must be strictly maintained. With the integration of intermittent renewable energies into the production mix, managing this balance becomes complex. At the same time, the deployment of smart meters makes demand response possible. More precisely, sending signals, such as changes in the price of electricity, would encourage users to modulate their consumption according to the production of electricity. The algorithms used to choose these signals have to learn consumer reactions and, at the same time, optimize them (exploration-exploitation trade-off). Our approach is based on bandit theory and formalizes this sequential learning problem. We propose a first algorithm to control the electrical demand of a homogeneous population of consumers and prove a T^(2/3) upper bound on its regret. Experiments on a real data set in which price incentives were offered illustrate these theoretical results. As a "full information" dataset is required to test bandit algorithms, a consumption data generator based on variational autoencoders is built. In order to drop the assumption of population homogeneity, we propose an approach to cluster households according to their consumption profiles. These different works are finally combined to propose and test a bandit algorithm for personalized demand-side management.
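The exploration-exploitation trade-off at the heart of this problem can be illustrated with the classical UCB1 strategy over a finite set of price signals. This generic sketch is not the thesis's algorithm (which targets a different objective with a T^(2/3) regret regime); it only shows the mechanics of learning consumer response while simultaneously optimizing it:

```python
import numpy as np

def ucb1(reward_fn, n_arms, horizon):
    """UCB1: play each price signal once, then repeatedly send the signal
    with the best optimistic (mean + confidence bonus) response estimate."""
    counts = np.zeros(n_arms)
    sums = np.zeros(n_arms)
    for t in range(horizon):
        if t < n_arms:
            arm = t                                    # initial exploration
        else:
            bonus = np.sqrt(2.0 * np.log(t) / counts)  # optimism term
            arm = int(np.argmax(sums / counts + bonus))
        r = reward_fn(arm)                             # observed response
        counts[arm] += 1
        sums[arm] += r
    return counts
```

The confidence bonus shrinks as a signal is sent more often, so the controller keeps probing under-explored price levels while concentrating on the one that elicits the best consumer response.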
SCEE 2008 book of abstracts : the 7th International Conference on Scientific Computing in Electrical Engineering (SCEE 2008), September 28 – October 3, 2008, Helsinki University of Technology, Espoo, Finland
This report contains abstracts of presentations given at the SCEE 2008 conference.