1,135 research outputs found

    Improved Performance and Stability of the Knockoff Filter and an Approach to Mixed Effects Modeling of Sequentially Randomized Trials

    Full text link
    The knockoff filter is a variable selection technique for linear regression with finite-sample control of the regression false discovery rate (FDR). The regression FDR is the expected proportion of selected variables that, in fact, have no effect in the regression model. The knockoff filter constructs a set of synthetic variables that are known to be irrelevant to the regression and, by serving as negative controls, help identify relevant variables. The first two thirds of this thesis describe tradeoffs between power and collinearity due to tuning choices in the knockoff filter, and provide a stabilization method that reduces the variance and improves the replicability of the variable set selected by the knockoff filter. The final third of this thesis develops an approach to mixed modeling and estimation for sequential multiple assignment randomized trials (SMARTs). SMARTs are an important data collection tool for informing the construction of dynamic treatment regimens (DTRs), which use cumulative patient information to recommend specific treatments during the course of an intervention. A common primary aim in a SMART is the marginal mean comparison between two or more of the DTRs embedded in the trial, and the mixed modeling approach is developed for these primary aim comparisons based on a continuous, longitudinal outcome. The method is illustrated using data from a SMART in autism research. (Ph.D., Statistics, University of Michigan, Horace H. Rackham School of Graduate Studies.) http://deepblue.lib.umich.edu/bitstream/2027.42/163099/1/luers_1.pd
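    The negative-control idea behind the knockoff filter can be illustrated with a minimal sketch. The construction below uses row-permuted copies of the features as stand-in knockoffs (the real fixed-X or model-X construction also matches the feature correlation structure) and a simple correlation-based importance statistic; the simulated data, the statistic, and the target FDR level `q` are all illustrative assumptions, not details from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: 100 samples, 10 features, only the first 3 relevant.
n, p = 100, 10
X = rng.standard_normal((n, p))
y = X[:, 0] + X[:, 1] + X[:, 2] + 0.5 * rng.standard_normal(n)

# Stand-in "knockoffs": row-permuted features.  Permutation preserves each
# feature's marginal distribution but severs any association with y, so the
# copies act as negative controls.  (The actual knockoff construction also
# preserves cross-feature correlations.)
X_knock = X[rng.permutation(n), :]

def abs_corr(A, y):
    """Absolute correlation-type statistic of each column of A with y."""
    yc = y - y.mean()
    return np.abs(A.T @ yc) / (np.linalg.norm(A, axis=0) * np.linalg.norm(yc))

# Importance statistic: original-feature correlation minus knockoff correlation.
# A large positive W_j is evidence that feature j is genuinely relevant.
W = abs_corr(X, y) - abs_corr(X_knock, y)

# Knockoff-style data-dependent threshold targeting FDR level q:
# smallest t with (1 + #{W_j <= -t}) / max(#{W_j >= t}, 1) <= q.
q = 0.2
threshold = np.inf
for t in np.sort(np.abs(W[W != 0])):
    if (1 + np.sum(W <= -t)) / max(np.sum(W >= t), 1) <= q:
        threshold = t
        break

selected = np.flatnonzero(W >= threshold)
```

    The counting rule in the threshold uses the symmetry of W under the null: irrelevant features are roughly as likely to produce large negative W as large positive W, so the negative tail estimates the false discoveries in the positive tail.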

    Generation of a Land Cover Atlas of environmental critic zones using unconventional tools

    Get PDF
    The abstract is in the attachment.

    Variable Selection and Prediction in "Messy" High-Dimensional Data

    Get PDF
    University of Minnesota Ph.D. dissertation. July 2017. Major: Biostatistics. Advisors: Julian Wolfson, Wei Pan. 1 computer file (PDF); x, 85 pages. When dealing with high-dimensional data, performing variable selection in a regression model reduces statistical noise and simplifies interpretation. There are many ways to perform variable selection when standard regression assumptions are met, but few that work well when one or more assumptions is violated. In this thesis, we propose three variable selection methods that outperform existing methods in such "messy data" situations where standard regression assumptions are violated. First, we introduce Thresholded EEBoost (ThrEEBoost), an iterative algorithm which applies a gradient-boosting-type algorithm to estimating equations. Extending its progenitor, EEBoost (Wolfson, 2011), ThrEEBoost allows multiple coefficients to be updated at each iteration. The number of coefficients updated is controlled by a threshold parameter on the magnitude of the estimating equation. By allowing more coefficients to be updated at each iteration, ThrEEBoost can explore a greater diversity of variable selection "paths" (i.e., sequences of coefficient vectors) through the model space, possibly finding models with smaller prediction error than any of those on the path defined by EEBoost. In a simulation of data with correlated outcomes, ThrEEBoost reduced prediction error compared to more naive methods and the less flexible EEBoost. We also applied our method to the Box Lunch Study, where we reduced the error in predicting BMI from longitudinal data. Next, we propose a novel method, MEBoost, for variable selection and prediction when covariates are measured with error. To do this, we incorporate a measurement-error-corrected score function due to Nakamura (1990) into the ThrEEBoost framework.
    In both simulated and real data, MEBoost outperformed the CoCoLasso (Datta and Zou, 2017), a recently proposed penalization-based approach to variable selection in the presence of measurement error, and the (non-measurement-error-corrected) Lasso. Lastly, we consider the case where multiple regression assumptions may be simultaneously violated. Motivated by the idea of stacking, specifically the SuperLearner technique (van der Laan et al., 2007), we propose a novel method, Super Learner Estimating Equation Boosting (SuperBoost). SuperBoost performs variable selection in the presence of multiple data challenges by combining the results from variable selection procedures which are each tailored to address a different regression assumption violation. The ThrEEBoost framework is a natural fit for this approach, since the component "learners" (i.e., violation-specific variable selection techniques) are fairly straightforward to construct and implement using various estimating equations. We illustrate the application of SuperBoost on simulated data with both correlated outcomes and covariate measurement error, and show that it performs as well as or better than methods which address only one (or neither) of these challenges.
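    The thresholded update rule that distinguishes ThrEEBoost from one-coefficient-at-a-time boosting can be sketched as follows. This is a simplified reading of the idea using the least-squares estimating equation U(beta) = X'(y - X beta); the step size, threshold, and iteration count are illustrative choices, not values from the thesis.

```python
import numpy as np

def threeboost(X, y, eps=0.01, tau=0.8, n_iter=200):
    """Sketch of thresholded boosting on a least-squares estimating equation.

    At each iteration, every coefficient whose score magnitude is within a
    factor `tau` of the largest score is nudged by `eps`, instead of updating
    only the single best coefficient as in classic stagewise boosting.
    Lowering `tau` updates more coefficients per step, tracing a different
    variable selection "path" through the model space.
    """
    beta = np.zeros(X.shape[1])
    path = [beta.copy()]
    for _ in range(n_iter):
        score = X.T @ (y - X @ beta)          # least-squares estimating equation
        top = np.abs(score).max()
        if top == 0:
            break
        active = np.abs(score) >= tau * top   # thresholded active set
        beta[active] += eps * np.sign(score[active])
        path.append(beta.copy())
    return beta, path

# Illustrative use: two true signals among ten candidate covariates.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = 2 * X[:, 0] + 2 * X[:, 1] + rng.standard_normal(100)
beta, path = threeboost(X, y)
```

    Swapping in a different estimating equation (e.g., a measurement-error-corrected score, as MEBoost does with Nakamura's correction) changes only the `score` line, which is what makes the framework a natural host for the violation-specific "learners" described above.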

    Development of Joint Estimating Equation Approaches to Merging Clustered or Longitudinal Datasets from Multiple Biomedical Studies.

    Full text link
    Jointly analyzing multiple datasets arising from similar studies has drawn increasing attention in recent years. In this dissertation, we investigate three primary problems pertinent to merging clustered or longitudinal datasets from multiple biomedical studies. The first project concerns the development of a rigorous hypothesis testing procedure to assess the validity of data merging, and a joint estimation approach to obtaining regression coefficient estimates when merging is permitted. The proposed methods can account for different within-subject correlations and follow-up schedules in different longitudinal studies. The second project concerns the development of an effective statistical method that makes it possible to merge multiple longitudinal datasets subject to various heterogeneous characteristics, such as different follow-up schedules and study-specific missing covariates (e.g., covariates observed in some studies but completely missing in others). The presence of study-specific missing covariates poses a major challenge to data merging and analysis, where methods of imputation and inverse probability weighting are not directly applicable. We propose a joint estimating function approach to addressing this key challenge, in which a novel nonparametric estimating function, constructed via spline-based sieve approximation, is used to bridge estimating equations from studies with missing covariates to those with fully observed covariates. Under mild regularity conditions, we show that the proposed estimator is consistent and asymptotically normal. The third project is devoted to the development of a screening procedure for parameter homogeneity, which is key to reducing model complexity in the process of data merging.
    We consider the longitudinal marginal model for merged studies, in which the classical hypothesis testing approach of evaluating all possible subsets of common regression parameters can be combinatorially complex and computationally prohibitive. We develop a regularization method that overcomes this difficulty by applying the idea of the adaptive fused lasso, imposing restrictions on differences of pairs of parameters between studies. The selection procedure automatically detects common parameters across all or subsets of studies. (Ph.D., Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies.) http://deepblue.lib.umich.edu/bitstream/2027.42/95928/1/wafei_1.pd
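    The pairwise fusion idea can be made concrete in the simplest case of one parameter estimated in two studies, where the fused-lasso problem has a closed form. The function below is an illustrative two-study sketch (the dissertation's procedure handles pairs across many studies, with adaptive weights); the penalty level `lam` is a free tuning input.

```python
import numpy as np

def fuse_pair(b1_hat, b2_hat, lam):
    """Fused-lasso solution for two studies' estimates of one parameter.

    Minimizes (b1 - b1_hat)^2 + (b2 - b2_hat)^2 + lam * |b1 - b2|.
    Each estimate moves toward the other by lam/2, and the pair fuses
    (b1 == b2) once lam >= |b1_hat - b2_hat| -- i.e. the procedure declares
    the parameter homogeneous across the two studies.
    """
    d = b1_hat - b2_hat
    shift = min(lam / 2.0, abs(d) / 2.0) * np.sign(d)
    return b1_hat - shift, b2_hat + shift

# A small penalty shrinks the gap between the studies' estimates;
# a large one fuses them, declaring the parameter common.
partial = fuse_pair(1.0, 0.2, lam=0.4)   # gap shrinks from 0.8 to 0.4
fused = fuse_pair(1.0, 0.2, lam=2.0)     # estimates fuse at 0.6 (up to rounding)
```

    This is why the screening procedure scales: instead of testing every subset of parameters for equality, one solves a single penalized problem whose solution automatically sets differences to zero where the data support homogeneity.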

    Fused lasso with the adaptation of parameter ordering in combining multiple studies with repeated measurements

    Full text link
    Peer Reviewed
    http://deepblue.lib.umich.edu/bitstream/2027.42/135531/1/biom12496.pdf
    http://deepblue.lib.umich.edu/bitstream/2027.42/135531/2/biom12496_am.pdf
    http://deepblue.lib.umich.edu/bitstream/2027.42/135531/3/biom12496-sup-0001-SuppData.pd

    Statistical Modelling

    Get PDF
    The book collects the proceedings of the 19th International Workshop on Statistical Modelling, held in Florence in July 2004. Statistical modelling is an important cornerstone in many scientific disciplines, and the workshop has provided a rich environment for cross-fertilization of ideas from different disciplines. The volume comprises four invited lectures, 48 contributed papers, and 47 posters. The contributions are arranged in sessions: Statistical Modelling; Statistical Modelling in Genomics; Semi-parametric Regression Models; Generalized Linear Mixed Models; Correlated Data Modelling; Missing Data, Measurement Error and Survival Analysis; Spatial Data Modelling; and Time Series and Econometrics.

    Statistical Models to Assess Associations between the Built Environment and Health: Examining Food Environment Contributions to the Childhood Obesity Epidemic.

    Full text link
    Models are developed and applied to examine the associations between built environment features and health. These developments are motivated by studies examining the contribution of features of the built food environment near schools, such as the availability of fast food restaurants and convenience stores, to children's body weight. The data used in this dissertation come from a surveillance database that captures body weight and other characteristics for all children in 5th, 7th, and 9th grades enrolled in public schools in California during 2001-2010, and from a commercial data source that contains the locations of all food establishments in California for the same time period. First, we develop a hierarchical multiple informants model (HMIM) for clustered data that estimates the marginal association of multiple built environment features and formally tests whether the strength of their association with the outcome differs. Using this new model, we establish that the contribution of the availability of convenience stores to children's body mass index z-scores (BMIz) is stronger than that of fast food restaurants. Second, we propose a distributed lag model (DLM) to examine whether and how the association between the number of convenience stores and children's BMIz decays with longer distance from schools. In this model, the distributed lag (DL) covariates are the numbers of convenience stores within several contiguous "ring"-shaped areas around schools rather than circular buffers, and their coefficients are modeled as a function of distance using smoothing splines. We find that associations are stronger with closer proximity to schools and vanish by about 2 miles from school locations.
    Third, we develop a hierarchical distributed lag model (HDLM) to systematically examine the variability of the built environment association across regions, helping to address a yet-unanswered question in the built environment literature: whether and how activity spaces relevant to health vary across regions. We find that DL coefficients vary across regions, implying that variation in activity spaces also exists. We also identify areas where children's BMIz is more vulnerable to built environment factors. This dissertation provides novel methods with which to study how built environment factors affect health. (Ph.D., Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies.) http://deepblue.lib.umich.edu/bitstream/2027.42/110362/1/jongguri_1.pd
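    The ring-based distributed lag construction can be sketched numerically. Everything below is simulated and illustrative: hypothetical store counts in eight concentric rings around each school, a true lag curve that decays toward zero near 2 miles (mirroring the finding described above), and a low-order polynomial basis standing in for the smoothing splines used in the dissertation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: store counts in 8 concentric rings around 500 schools.
n_schools, n_rings = 500, 8
rings = rng.poisson(3.0, size=(n_schools, n_rings))
dist = (np.arange(n_rings) + 0.5) * 0.25        # ring midpoints, in miles

# Illustrative true lag coefficients: decay with distance, near zero by 2 mi.
true_beta = np.maximum(0.0, 0.3 * (1.0 - dist / 2.0))
y = rings @ true_beta + rng.standard_normal(n_schools)

# Constrain the fitted lag curve beta(d) to be smooth in distance by
# expressing it in a 3-term polynomial basis: beta = B @ theta.  Fitting
# theta by least squares on the reduced design Z = rings @ B is the usual
# basis-expansion trick behind spline-constrained distributed lag models.
B = np.vander(dist, 3, increasing=True)         # (n_rings, 3) basis
Z = rings @ B
theta, *_ = np.linalg.lstsq(Z, y, rcond=None)
beta_hat = B @ theta                            # smoothed lag coefficients
```

    The payoff of the basis expansion is that eight (or many more) correlated ring coefficients are estimated through only three basis weights, so the fitted lag curve is smooth in distance by construction rather than by post-hoc smoothing.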

    Statistical analysis of complex neuroimaging data

    Get PDF
    This dissertation is composed of two major topics: a) regression models for identifying noise sources in magnetic resonance images, and b) multiscale adaptive methods in neuroimaging studies. The first topic is covered by the first thesis paper. In this paper, we formally introduce three regression models, including a Rician regression model and two associated normal models, to characterize stochastic noise in various magnetic resonance imaging modalities, including diffusion weighted imaging (DWI) and functional MRI (fMRI). Estimation algorithms are introduced to maximize the likelihood functions of the three regression models. We also develop a diagnostic procedure for systematically exploring MR images to identify noise components other than simple stochastic noise, and to detect discrepancies between the fitted regression models and the MRI data. The diagnostic procedure includes goodness-of-fit statistics, measures of influence, and tools for graphical display. The goodness-of-fit statistics can assess the key assumptions of the three regression models, whereas the measures of influence can isolate outliers caused by certain noise components, including motion artifacts. The tools for graphical display permit visualization of the values of the goodness-of-fit statistics and influence measures. Finally, we conduct simulation studies to evaluate the performance of these methods, and we analyze a real dataset to illustrate how our diagnostic procedure localizes subtle image artifacts by detecting intravoxel variability that is not captured by the regression models. The second topic, multiscale adaptive methods for neuroimaging data, consists of two thesis papers. The goal of the first paper is to develop a multiscale adaptive regression model (MARM) for spatial and adaptive analysis of neuroimaging data. Compared with the existing voxel-wise approach to the analysis of imaging data, MARM has three unique features: being spatial, being hierarchical, and being adaptive.
    MARM creates a small sphere with a given radius at each location (called a voxel), analyzes all observations in the sphere of each voxel, and then uses these consecutively connected spheres across all voxels to capture spatial dependence among imaging observations. MARM builds hierarchically nested spheres by increasing the radius of the spherical neighborhood around each voxel and utilizes information in each of the nested spheres at each voxel. Finally, MARM combines imaging observations with adaptive weights in the voxels within the sphere of the current voxel to adaptively calculate parameter estimates and test statistics. Theoretically, we establish the consistency and asymptotic normality of the adaptive estimates and the asymptotic distributions of the adaptive test statistics under mild conditions. Three sets of simulation studies are used to demonstrate the methodology and examine the finite-sample performance of the adaptive estimates and test statistics in MARM. We apply MARM to quantify spatiotemporal white matter maturation patterns in an early postnatal population using diffusion tensor imaging. Our simulation studies and real data analysis confirm that MARM significantly outperforms voxel-wise methods. The goal of the second paper is to develop a multiscale adaptive generalized estimating equation (MAGEE) method for spatial and adaptive analysis of longitudinal neuroimaging data. Longitudinal imaging studies have been valuable for better understanding disease progression and normal brain development and aging. Compared to cross-sectional imaging studies, longitudinal imaging studies can increase the statistical power to detect subtle spatiotemporal changes of brain structure and function. MAGEE is a hierarchical, spatial, semiparametric, and adaptive procedure, in contrast with the existing voxel-wise approach.
    The key ideas of MAGEE are to build hierarchically nested spheres with increasing radii at each location, to analyze all observations in the sphere of each voxel using weighted generalized estimating equations, and to use the consecutively connected spheres across all voxels to adaptively capture spatial patterns. Simulation studies and real data analysis clearly show the advantage of the MAGEE method over existing voxel-wise methods. Our results also reveal i) an increase in fractional anisotropy during this early postnatal stage, and ii) five different growth patterns in the brain regions under examination.
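    The core mechanism shared by MARM and MAGEE, growing neighbourhoods combined with similarity-based adaptive weights, can be illustrated in one dimension. The sketch below is a loose analogue, not the published algorithm: it adaptively smooths a noisy piecewise-constant "image" by averaging each voxel's neighbours with weights that downweight neighbours whose current estimates differ, so boundaries are preserved while noise is averaged away. The signal, radii, and bandwidth are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# 1-D stand-in for an image: a piecewise-constant signal plus noise,
# one "voxel" per position, with a sharp boundary at position 50.
signal = np.r_[np.zeros(50), np.ones(50)]
data = signal + 0.3 * rng.standard_normal(100)

def multiscale_adaptive_mean(y, radii=(1, 2, 4, 8), bandwidth=0.5):
    """Multiscale adaptive estimation in the spirit of MARM (a 1-D sketch).

    For growing neighbourhood radii, each voxel averages its neighbours
    with weights that downweight voxels whose current estimates differ
    from its own, so smoothing adapts to (and preserves) boundaries.
    """
    est = y.copy()
    n = len(y)
    for r in radii:                       # hierarchically nested neighbourhoods
        new = np.empty(n)
        for i in range(n):
            lo, hi = max(0, i - r), min(n, i + r + 1)
            neigh = np.arange(lo, hi)
            # similarity weights based on the previous-scale estimates
            w = np.exp(-((est[neigh] - est[i]) / bandwidth) ** 2)
            new[i] = np.sum(w * y[neigh]) / np.sum(w)
        est = new
    return est

smoothed = multiscale_adaptive_mean(data)
```

    Starting from small radii stabilizes the estimates before larger neighbourhoods are consulted, which is the multiscale part; the similarity weights are the adaptive part that keeps a flat kernel from blurring across the boundary.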