37 research outputs found

    Renewable Estimation and Incremental Inference with Streaming Health Datasets

    The overarching objective of my dissertation is to develop a new methodology that sequentially updates parameter estimates and their standard errors along with data streams. The key technical novelty is that the proposed estimation method, termed renewable estimation in my dissertation, uses current data and summary statistics of historical data but makes no use of any historical subject-level data. To implement renewable estimation, I utilize the powerful Lambda architecture in Apache Spark to design a new paradigm that includes an inference layer in addition to the existing speed layer. This expanded architecture, named the Rho architecture, accommodates inference-related statistics and facilitates sequential updating of the quantities involved in estimation and inference. The first project focuses on renewable estimation in the setting of generalized linear models (RenewGLM), in which I develop a new sequential updating algorithm to calculate numerical solutions of parameter estimates and related inferential quantities. The proposed procedure aggregates both score functions and information matrices over streaming data batches through summary statistics. I show that the resulting estimator is asymptotically equivalent to the maximum likelihood estimator (MLE) obtained with the entire data at once. I demonstrate this new methodology on the analysis of the National Automotive Sampling System-Crashworthiness Data System (NASS CDS) to evaluate the effectiveness of graduated driver licensing (GDL) in the USA. The second project focuses on a substantial extension of the first project to analyze streaming datasets with correlated outcomes, such as clustered data and longitudinal data. I establish the theoretical guarantees for the proposed renewable quadratic inference function (RenewQIF) for dependent outcomes and implement it within the Rho architecture.
Furthermore, I relax the homogeneity assumption in the first project and consider regime-switching regression models with a structural change-point. I propose a real-time hypothesis testing procedure based on a goodness-of-fit test statistic that is shown to achieve both proper type I error control and desirable change-point detection power. The third project concerns data streams that involve both inter-data batch correlation and dynamic heterogeneity, arising typically from various types of electronic health records (EHR) and mobile health data. This project is built in the framework of state space models in which the observed data stream is driven by a latent state process that may incorporate trend, seasonal, or time-varying covariate effects. In this setting, calculating the online MLE is challenging due to the involvement of high-dimensional integrals and complex covariance structures. In this project, I develop a Kalman filter to facilitate a multivariate online regression analysis (MORA) in the context of linear state space mixed models. MORA enables the renewal of both point estimates and standard errors of the fixed effects. I also apply the MORA method to analyze an EHR data example, adjusting for some heterogeneous batch-specific effects. PhD, Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/163085/1/luolsph_1.pd

    Spatio‑temporal modelling of high‑throughput phenotyping data

    High-throughput phenotyping (HTP) platforms and devices are increasingly used to characterise growth and developmental processes for large sets of plant genotypes. This dissertation is motivated by the need to accurately estimate genetic effects over time when analysing data from such HTP experiments. The HTP data we deal with here are characterised by phenotypic traits measured multiple times in the presence of spatial and temporal noise and a hierarchical organisation at three levels (populations, genotypes within populations, and plants within genotypes). The challenge is to balance efficient statistical models and computational solutions to deal with the complexity and dimensionality of the experimental data. To that aim, we propose two strategies. The first proposal divides the problem into two stages. The first stage (spatial model) focuses on correcting the phenotypic data for experimental design factors and spatial variation, while the second stage (hierarchical longitudinal model) aims to estimate the evolution over time of the genetic signal. The second proposal addresses the problem in a single step (one-stage approach): it models the longitudinal evolution of the genetic effect on a given phenotypic trait while accounting for the temporal and spatial effects of environmental and design factors (spatio-temporal hierarchical model). We follow the same modelling philosophy throughout our work and propose multidimensional P-spline-based hierarchical approaches. We provide the user with tools that take advantage of the sparse structure of the model matrices to reduce computational complexity. All our code is publicly available in the R package statgenHTP and at https://gitlab.bcamath.org/dperez/htp_one_stage_approach. We illustrate the performance of our methods using spatio-temporal simulated data and data from the PhenoArch greenhouse platform at INRAE Montpellier and the outdoor Field Phenotyping platform at ETH Zürich.
In the plant breeding context, we show how to extract new time-independent phenotypes for genomic selection purposes. MTM2017-82379-R; BERC 2018-2021; BERC 2022-2025; SEV-2017-0718; CEX2021-001142-S/MICIN/AEI/10.13039/50110001103

    A statistical analysis of dissolving timber pulp properties using linear mixed models.

    Doctoral Degree. University of KwaZulu-Natal, Pietermaritzburg. The main focus of the study was to understand the behaviour of seven timber genotypes based on seven chemical properties observed during the chemical pulping process, with the prime objective of developing methods of grouping different timber genotypes into compatible groups of timber that can be optimally processed together. Four related statistical methods were used in analysing the data and each had a specific objective. The random coefficients model was used to investigate how the genotypes evolve over the processing stages, and it was discovered that the rates of change of the chemical properties studied depended on their initial readings at the beginning of processing. This trend applied to all seven genotypes of pulping trees studied. The important result that came out of fitting the random coefficients model to the data is that the higher the raw stage readings (initial values), the higher the rates of change in the chemical properties over the processing stages. The changes were either increases or decreases in the chemical property studied. The random coefficients model was also used to suggest a rudimentary mixing index for the different genotypes based on the average ranking of their slope parameters (rates of change) for the seven variables studied. It was found, for example, that the genotypes GUA and GUW are the least mixable ones. Piecewise linear regression models were used to identify important variables when classifying genotypes, and it was generally found that viscosity is not a very useful variable in the classification of genotypes. Using piecewise linear regression models together with kernel density estimation, a mixing index (scale) was developed that can be used to determine which genotypes are the most mixable for chemical processing.
A comparison of the random coefficients and the piecewise linear regression models shows that the two models yielded very similar conclusions on which genotypes are most mixable during processing. Joint modelling was used to analyse the correlations between the evolutions of the different chemical properties studied. The various levels of correlation between these variables were discussed. The main limitation of the joint modelling method was its computational burden, because many parameters need to be estimated at the same time.

    Bounded Influence Approaches to Constrained Mixed Vector Autoregressive Models

    Clinical studies that obtain multiple biophysical signals from several individuals repeatedly in time are proliferating, and this has generated growth in statistical models that analyze cross-sectional time series data. In general, these statistical models try to answer two questions: (i) what the intra-individual dynamics of the response are and how they relate to some covariates; and (ii) how these dynamics can be aggregated consistently in a group. In response to the first question, we propose a covariate-adjusted constrained Vector Autoregressive model, a technique similar to the STARMAX model (Stoffer, JASA 81, 762-772), to describe serial dependence of observations. In this way, the number of parameters to be estimated is kept minimal while offering flexibility for the model to explore higher order dependence. In response to (ii), we use mixed effects analysis that accommodates modelling of heterogeneity among cross-sections arising from covariate effects that vary from one cross-section to another. Although estimation of the model can proceed using standard maximum likelihood techniques, we believe it is advantageous to use bounded influence procedures in the modelling (such as choosing constraints) and parameter estimation so that the effects of outliers can be controlled. In particular, we use M-estimation with a redescending bounding function because its influence function is always bounded. Furthermore, assuming consistency, this influence function is useful to obtain the limiting distribution of the estimates. However, this distribution may not necessarily yield accurate inference in the presence of contamination, as the actual asymptotic distribution might have wider tails. This led us to investigate bootstrap approximation techniques. A sampling scheme based on IID innovations is modified to accommodate the cross-sectional structure of the data.
Then the M-estimation is applied naively to each bootstrap sample to obtain the asymptotic distribution of the estimates. We apply these strategies to the extracted BOLD activation from several regions of the brain from a group of individuals to describe joint dynamic behavior between these locations. We used simulated data with both innovation and additive outliers to test whether the estimation procedure is accurate despite contamination.

    A Bayes Linear Analysis of Multilevel Models

    In this thesis, Bayes Linear methods for modeling multilevel data are presented and discussed. Second-order exchangeability judgements are exploited to formulate subjectivist versions of multilevel models. Bayes linear methods are applied to estimate model parameters and for diagnostic checks. Closed-form expressions of estimators are derived, allowing insight into relationships between the quantities thereof. The canonical analysis and resolution transforms are used to guide sample design and sample size determination under cost constraints. A finite version of a multilevel model is formulated, analysed and compared to infinite versions, giving further insight into sample design issues via the finite resolution transform. A new Bayes Linear Minimum Variance Estimation (BLIMVE) approach is developed to estimate variances. Estimated variances are used to perform two-stage Bayes linear analysis of more complex multilevel models. The methods developed are shown to be applicable in cases of small level-2 samples. The Bayes linear analyses of multilevel models are applied to an educational data set using special-purpose codes written in the R Statistical Language.

    Genetics and Improvement of Forest Trees

    Forest tree improvement has mainly been implemented to enhance the productivity of artificial forests. However, given the drastically changing global environment, improvement of various traits related to environmental adaptability is more essential than ever. This book focuses on genetic information, including trait heritability and the physiological mechanisms thereof, which facilitate tree improvement. Nineteen papers are included, reporting genetic approaches to improving various species, including conifers, broad-leaf trees, and bamboo. All of the papers in this book provide cutting-edge genetic information on tree genetics and suggest research directions for future tree improvement

    Longitudinal survey data analysis.

    Thesis (M.Sc.)-University of KwaZulu-Natal, Pietermaritzburg, 2006. To investigate the effect of environmental pollution on the health of children in the Durban South Industrial Basin (DSIB) due to its proximity to industrial activities, 233 children from five primary schools were considered. Three of these schools were located in the south of Durban while the other two were in the northern residential areas that were closer to industrial activities. Data collected included the participants' demographic, health, occupational, social and economic characteristics. In addition, environmental information was monitored throughout the study, specifically measurements of the levels of some ambient air pollutants. The objective of this thesis is to investigate which of these factors had an effect on the lung function of the children. In order to achieve this objective, different sample survey data analysis techniques are investigated. These include the design-based and model-based approaches. The nature of the survey data finally leads to the longitudinal mixed model approach. The multicollinearity between the pollutant variables leads to the fitting of two separate models: one with the peak counts as the independent pollutant measures and the other with the 8-hour maximum moving average as the independent pollutant variables. In the selection of the fixed-effects structure, a scatter-plot smoother known as the loess fit is applied to the individual profile plots of the response variable. The random effects and the residual effect are assumed to have different covariance structures. The unstructured (UN) covariance structure is used for the random effects, while the compound symmetric (CS) covariance structure is selected for the residual effects using the Akaike information criterion (AIC). To check the model fit, the profiles of the fitted and observed values of the dependent variables are compared graphically.
The data are also characterized by the problem of intermittent missingness. The type of missingness is investigated by applying a modified logistic regression test for missing at random (MAR). The results indicate that school location, sex and weight are the significant factors for the children's respiratory conditions. More specifically, the children in schools located in the northern residential areas are found to have poor respiratory conditions compared to those in the Durban-South schools. In addition, poor respiratory conditions are also identified for overweight children.

    Genomic prediction of resistance to Photobacterium damselae subsp. piscicida and Sparicotyle chrysophrii in gilthead seabream (Sparus aurata) using 2B-RAD sequencing

    Context: Gilthead seabream (Sparus aurata) is a highly important farmed fish species, especially in the Mediterranean aquaculture industry. Infectious diseases present a significant threat to the sustainability of aquaculture, with high economic losses due to mortalities, reduced productivity, and the necessity of additional treatments/vaccinations. Specific and sensitive methods for the detection of fish pathogens represent useful tools to investigate infection dynamics and enable early detection of the disease for better prevention. Selection and breeding for resistance against infectious diseases is also a highly valuable tool to help prevent or diminish disease outbreaks, and applying genomic information to the currently available advanced selection methods could accelerate the response to selection. The gram-negative bacterium Photobacterium damselae subsp. piscicida (Phdp) and the ectoparasite Sparicotyle chrysophrii (Sc) are two of the most important pathogens affecting seabream cultivation. Purpose of the study: The aims of this work are: (i) to investigate the genomic prediction of resistance to two highly problematic diseases in seabream through the application of 2b-RAD with the objective of achieving selective breeding goals and (ii) to design an effective assay for the detection and quantification of Phdp. Materials and methods: (i) 1233 and 1001 seabream individuals were challenged through intramuscular injection with a virulent strain of Phdp and by co-habitation with naturally Sc-infected seabreams, respectively. Animals were monitored daily and data of dead/survived fish and number of parasites in the gills/body length were recorded. Genomic DNA was extracted from the finfish of all individuals and used to construct 2b-RAD libraries. Data were analyzed in order to find SNP-based genotypes (GATK, SAMtools), perform Genome-Wide Association Studies, estimate breeding values (ASReml 4.0) and construct linkage maps (Lep-Map v2).
(ii) A primer set was designed from a partial sequence of the bamB gene (Primer3 web) considering two SNPs that discriminate between Phdp and its closely related subspecies Phdd. The assay was tested for specificity/sensitivity on laboratory-generated samples as well as on previously experimentally infected seabream tissue samples. Results and discussion: (i) The reference catalogue contained 175,725 and 269,660 tags for the Phdp and the Sc challenge, respectively. The SNP detection process yielded genotypic data for 19,313 and 21,773 quality SNPs for Phdp and Sc, respectively, both grouped into 24 linkage groups (LG), which is consistent with the karyotype of this species. Genomic heritability for resistance to photobacteriosis was 0.31-0.33 and genomic heritability for tolerance to Sc was 0.11-0.22, suggesting potential to enhance both resistances through family-based selection. Estimated breeding values (EBV) using genomic (GBLUP) information presented 5-43% higher accuracy in comparison to those estimated using only pedigree information (PBLUP). GWAS revealed a quantitative trait locus (QTL) including 7 SNPs at LG17 which presented significant association with resistance to Phdp, while one SNP (LG17) was found affecting tolerance to Sc. (ii) The molecular method proposed for P. damselae diagnosis, with high specificity and sensitivity, proved to be suitable for detection, quantification and subspecies identification in one step, overcoming the limitations of previous assays. Conclusions: The SNPs discovered through 2b-RAD genotyping could be used to implement new marker-assisted selection programs for the generation of more resistant fish, preventing important disease outbreaks in fish farms. In addition, the original molecular method proposed holds the potential to improve the current knowledge of Phdp infection dynamics and the development of better strategies to control this important fish disease.