
    Robust Sure Independence Screening for Non-polynomial dimensional Generalized Linear Models

    We consider the problem of variable screening in ultra-high dimensional (of non-polynomial order) generalized linear models (GLMs). Since the popular SIS approach is extremely unstable in the presence of contamination and noise, which may frequently arise in large-scale sample data (e.g., Omics data), we discuss a new robust screening procedure based on the minimum density power divergence estimator (MDPDE) of the marginal regression coefficients. Our proposed screening procedure performs extremely well both under pure and contaminated data scenarios. We also theoretically justify the use of these marginal MDPDEs for variable screening from the population as well as sample aspects; in particular, we prove that these marginal MDPDEs are uniformly consistent, leading to the sure screening property of our proposed algorithm. We have also proposed an appropriate MDPDE-based extension for robust conditional screening in the GLMs along with the derivation of its sure screening property. Comment: Work in Progress

    Robust sure independence screening for nonpolynomial dimensional generalized linear models

    We consider the problem of variable screening in ultra-high-dimensional generalized linear models (GLMs) of nonpolynomial orders. Since the popular SIS approach is extremely unstable in the presence of contamination and noise, we discuss a new robust screening procedure based on the minimum density power divergence estimator (MDPDE) of the marginal regression coefficients. Our proposed screening procedure performs well under pure and contaminated data scenarios. We provide a theoretical motivation for the use of marginal MDPDEs for variable screening from both population and sample aspects; in particular, we prove that the marginal MDPDEs are uniformly consistent, leading to the sure screening property of our proposed algorithm. Finally, we propose an appropriate MDPDE-based extension for robust conditional screening in GLMs along with the derivation of its sure screening property. Our proposed methods are illustrated through extensive numerical studies along with an interesting real data application.
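
    The marginal screening idea that this abstract builds on can be sketched as follows. This is a minimal, non-robust illustration only: it ranks predictors by ordinary least-squares marginal slopes in a linear model, whereas the paper replaces those marginal estimates with MDPDEs. The function names, the simulated data and the cutoff d = 10 are assumptions made for this example, not taken from the paper.

```python
import random

def marginal_slope(x, y):
    """Least-squares slope from regressing y on a single predictor x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx

def sis(columns, y, d):
    """Return the indices of the d predictors with the largest |marginal slope|.

    Assumes the predictors are on a common scale (standardise them otherwise).
    """
    scores = [abs(marginal_slope(col, y)) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return ranked[:d]

# Simulated example (hypothetical): p = 500 predictors, n = 200 samples,
# only the first three predictors influence the response.
rng = random.Random(42)
n, p = 200, 500
active = [0, 1, 2]
columns = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
y = [sum(3.0 * columns[j][i] for j in active) + rng.gauss(0, 1) for i in range(n)]

selected = sis(columns, y, d=10)
print(sorted(selected))
```

    With clean data the three active predictors should survive the screening step; the abstract's point is that such least-squares or likelihood-based marginal estimates can break down under contamination, which motivates the robust MDPDE version.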

    International Journal of Cancer / DNA methylation changes measured in pre-diagnostic peripheral blood samples are associated with smoking and lung cancer risk

    DNA methylation changes are associated with cigarette smoking. We used the Illumina Infinium HumanMethylation450 array to determine whether methylation in DNA from pre-diagnostic, peripheral blood samples is associated with lung cancer risk. We used a case-control study nested within the EPIC-Italy cohort and a study within the MCCS cohort as discovery sets (a total of 552 case-control pairs). We validated the top signals in 429 case-control pairs from another 3 studies. We identified six CpGs for which hypomethylation was associated with lung cancer risk: cg05575921 in the AHRR gene (pooled p-value = 4 × 10^-17), cg03636183 in the F2RL3 gene (pooled p-value = 2 × 10^-13), cg21566642 and cg05951221 in 2q37.1 (pooled p-values = 7 × 10^-16 and 1 × 10^-11, respectively), cg06126421 in 6p21.33 (pooled p-value = 2 × 10^-15) and cg23387569 in 12q14.1 (pooled p-value = 5 × 10^-7). For cg05951221 and cg23387569 the strength of association was virtually identical in never and current smokers. For all these CpGs except for cg23387569, the methylation levels were different across smoking categories in controls (heterogeneity p-values 1.8 × 10^-7), were lowest for current smokers and increased with time since quitting for former smokers. We observed a gain in discrimination between cases and controls measured by the area under the ROC curve of at least 8% (p-values 0.003) in former smokers by adding methylation at the 6 CpGs into risk prediction models including smoking status and number of pack-years. Our findings provide convincing evidence that smoking and possibly other factors lead to DNA methylation changes measurable in peripheral blood that may improve prediction of lung cancer risk.

    Measurement error modelling in biological applications


    The simulation extrapolation technique meets ecology and evolution: A general and intuitive method to account for measurement error

    Measurement error and other forms of uncertainty are commonplace in ecology and evolution, and may bias estimates of parameters of interest. Although a variety of approaches to obtain unbiased estimators are available, these usually require the formulation of an explicit (parametric) model for the error‐prone variable and a latent model for the unobserved (latent) error‐free variable. In practice, this is often difficult. We propose to generalize the simulation extrapolation (SIMEX) technique, a heuristic approach to correct for measurement error, to situations where it is difficult to explicitly formulate an error model or latent model for a variable of interest. We illustrate the idea with the example of error in pedigrees. Pedigree error causes error in estimates of inbreeding coefficients and the relatedness matrix, thus biasing estimates of inbreeding depression or heritability. Instead of formulating error models for inbreeding coefficients or the relatedness matrix, we directly apply the SIMEX idea to the pedigree. The initially known error proportion in the pedigree is progressively increased, all models are refitted, and the observed trend in the quantities of interest is extrapolated back to a hypothetical error‐free pedigree to obtain bias‐corrected estimates. We tested this pedigree‐SIMEX (PSIMEX) method with simulated pedigrees and with data from a free‐living population of song sparrows. The simulation study indicates that the PSIMEX estimator is almost unbiased for inbreeding depression and heritability, and that it has a much lower mean squared error (MSE) than the naive estimator. In the application to the song sparrows, the error‐corrected results could be validated against the actual values thanks to the availability of both an error‐prone and an error‐free pedigree. The results indicate that bias and MSE are reduced by PSIMEX. For easy accessibility of the method, we provide the R‐package PSIMEX. By transferring the SIMEX philosophy to error in pedigrees, we have illustrated how this heuristic approach can be generalized to situations where explicit error models are difficult to formulate. Thanks to the simplicity of the idea, many other error problems in ecology and evolution might be amenable to SIMEX‐like error correction methods.
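
    The simulate-then-extrapolate loop described in this abstract can be illustrated with the classical SIMEX setting, which is simpler than the pedigree case. The sketch below is not PSIMEX and does not reproduce the R package: it assumes a single covariate observed with additive Gaussian error of known standard deviation sigma_u, a straight-line regression, and a quadratic extrapolation function. All function names and simulation settings are hypothetical choices for this example.

```python
import random

def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def quadratic_extrapolate(lams, vals, target):
    """Fit vals ~ a + b*lam + c*lam^2 by least squares and evaluate at target."""
    rows = [[1.0, l, l * l] for l in lams]
    # Normal equations (3x3), solved by Gaussian elimination with pivoting.
    A = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    b = [sum(r[i] * v for r, v in zip(rows, vals)) for i in range(3)]
    for i in range(3):
        piv = max(range(i, 3), key=lambda k: abs(A[k][i]))
        A[i], A[piv] = A[piv], A[i]
        b[i], b[piv] = b[piv], b[i]
        for k in range(i + 1, 3):
            f = A[k][i] / A[i][i]
            for j in range(i, 3):
                A[k][j] -= f * A[i][j]
            b[k] -= f * b[i]
    coef = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        s = b[i] - sum(A[i][j] * coef[j] for j in range(i + 1, 3))
        coef[i] = s / A[i][i]
    return coef[0] + coef[1] * target + coef[2] * target ** 2

def simex_slope(w, y, sigma_u, lams=(0.0, 0.5, 1.0, 1.5, 2.0), n_sim=20, seed=0):
    """SIMEX estimate of the slope when w = x + u and sd(u) = sigma_u is known."""
    rng = random.Random(seed)
    means = []
    for lam in lams:
        if lam == 0.0:
            means.append(ols_slope(w, y))  # naive fit, no extra error added
            continue
        sd = (lam ** 0.5) * sigma_u  # extra error so total variance is (1 + lam) * sigma_u^2
        fits = [ols_slope([wi + rng.gauss(0, sd) for wi in w], y) for _ in range(n_sim)]
        means.append(sum(fits) / n_sim)
    # Extrapolate the trend back to lam = -1, i.e. the hypothetical error-free case.
    return quadratic_extrapolate(lams, means, -1.0)

# Simulated data (hypothetical): true slope 2, error variance 0.5 on the covariate.
rng = random.Random(1)
true_slope, sigma_u, n = 2.0, 0.5 ** 0.5, 2000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [true_slope * xi + rng.gauss(0, 0.5) for xi in x]
w = [xi + rng.gauss(0, sigma_u) for xi in x]

naive = ols_slope(w, y)
simex = simex_slope(w, y, sigma_u)
print(f"naive = {naive:.3f}, SIMEX = {simex:.3f}, truth = {true_slope}")
```

    In this setting the naive slope is attenuated towards zero, and the extrapolated estimate should land closer to the true value, although a quadratic extrapolant typically removes only part of the bias. The pedigree version in the abstract follows the same pattern, with the step of adding noise replaced by the step of increasing the proportion of pedigree errors.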

    The simulation extrapolation technique meets ecology and evolution: A general and intuitive method to account for measurement error

    1. Measurement error and other forms of uncertainty are commonplace in ecology and evolution, and may bias estimates of parameters of interest. Although a variety of approaches to obtain unbiased estimators are available, these usually require the formulation of an explicit (parametric) model for the error‐prone variable and a latent model for the unobserved (latent) error‐free variable. In practice, this is often difficult. 2. We propose to generalize the simulation extrapolation (SIMEX) technique, a heuristic approach to correct for measurement error, to situations where it is difficult to explicitly formulate an error model or latent model for a variable of interest. We illustrate the idea with the example of error in pedigrees. Pedigree error causes error in estimates of inbreeding coefficients and the relatedness matrix, thus biasing estimates of inbreeding depression or heritability. Instead of formulating error models for inbreeding coefficients or the relatedness matrix, we directly apply the SIMEX idea to the pedigree. The initially known error proportion in the pedigree is progressively increased, all models are refitted, and the observed trend in the quantities of interest is extrapolated back to a hypothetical error‐free pedigree to obtain bias‐corrected estimates. We tested this pedigree‐SIMEX (PSIMEX) method with simulated pedigrees and with data from a free‐living population of song sparrows. 3. The simulation study indicates that the PSIMEX estimator is almost unbiased for inbreeding depression and heritability, and that it has a much lower mean squared error (MSE) than the naive estimator. In the application to the song sparrows, the error‐corrected results could be validated against the actual values thanks to the availability of both an error‐prone and an error‐free pedigree. The results indicate that bias and MSE are reduced by PSIMEX. For easy accessibility of the method, we provide the R‐package PSIMEX. 4. By transferring the SIMEX philosophy to error in pedigrees, we have illustrated how this heuristic approach can be generalized to situations where explicit error models are difficult to formulate. Thanks to the simplicity of the idea, many other error problems in ecology and evolution might be amenable to SIMEX‐like error correction methods.

    Heritability, selection, and the response to selection in the presence of phenotypic measurement error: Effects, cures, and the role of repeated measurements

    Quantitative genetic analyses require extensive measurements of phenotypic traits, a task that is often not trivial, especially in wild populations. On top of instrumental measurement error, some traits may undergo transient (i.e. non-persistent) fluctuations that are biologically irrelevant for selection processes. These two sources of variability, which we denote here as measurement error in a broad sense, are possible causes for bias in the estimation of quantitative genetic parameters. We illustrate how in a continuous trait transient effects with a classical measurement error structure may bias estimates of heritability, selection gradients, and the predicted response to selection. We propose strategies to obtain unbiased estimates with the help of repeated measurements taken at an appropriate temporal scale. However, the fact that in quantitative genetic analyses repeated measurements are also used to isolate permanent environmental instead of transient effects requires that the information content of repeated measurements is carefully assessed. To this end, we propose to distinguish "short-term" from "long-term" repeats, where the former capture transient variability and the latter the permanent effects. We show how the inclusion of the corresponding variance components in quantitative genetic models yields unbiased estimates of all quantities of interest, and we illustrate the application of the method to data from a Swiss snow vole population.
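
    As a rough illustration of the heritability bias mentioned in this abstract, assume a classical measurement error: additive, mean zero, independent of the true trait value, with variance \sigma_e^2. Assume also that the additive genetic variance V_A is still estimated correctly, so that the error only inflates the phenotypic variance. The notation below is chosen for this sketch and is not taken from the paper.

```latex
\begin{align*}
  z^{\ast} &= z + e, \qquad \mathrm{E}[e] = 0,\quad \mathrm{Var}(e) = \sigma_e^2,\quad e \text{ independent of } z, \\
  V_P^{\ast} &= V_P + \sigma_e^2, \\
  h^2_{\text{naive}} &= \frac{V_A}{V_P + \sigma_e^2}
                      = h^2 \cdot \frac{V_P}{V_P + \sigma_e^2} \;\le\; h^2 .
\end{align*}
```

    For instance, with V_A = 2, V_P = 5 and \sigma_e^2 = 1, the true heritability is 2/5 = 0.4, while the naive value is 2/6, or about 0.33. Under these assumptions, separating \sigma_e^2 from the remaining variance, for example through short-term repeated measurements as the abstract proposes, is what allows the denominator to be corrected.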

    Heritability, selection, and the response to selection in the presence of phenotypic measurement error: effects, cures, and the role of repeated measurements.

    Quantitative genetic analyses require extensive measurements of phenotypic traits, a task that is often not trivial, especially in wild populations. On top of instrumental measurement error, some traits may undergo transient (i.e., nonpersistent) fluctuations that are biologically irrelevant for selection processes. These two sources of variability, which we denote here as measurement error in a broad sense, are possible causes for bias in the estimation of quantitative genetic parameters. We illustrate how in a continuous trait transient effects with a classical measurement error structure may bias estimates of heritability, selection gradients, and the predicted response to selection. We propose strategies to obtain unbiased estimates with the help of repeated measurements taken at an appropriate temporal scale. However, the fact that in quantitative genetic analyses repeated measurements are also used to isolate permanent environmental instead of transient effects requires that the information content of repeated measurements is carefully assessed. To this end, we propose to distinguish “short‐term” from “long‐term” repeats, where the former capture transient variability and the latter help isolate permanent effects. We show how the inclusion of the corresponding variance components in quantitative genetic models yields unbiased estimates of all quantities of interest, and we illustrate the application of the method to data from a Swiss snow vole population.

    Integrative, multi-omics, analysis of blood samples improves model predictions: applications to cancer

    Background: Cancer genomic studies often include data collected from several omics platforms. Each omics data source contributes to the understanding of the underlying biological process via source specific (“individual”) patterns of variability. At the same time, statistical associations and potential interactions among the different data sources can reveal signals from common biological processes that might not be identified by single source analyses. These common patterns of variability are referred to as “shared” or “joint”. In this work, we show how the use of joint and individual components can lead to better predictive models, and to a deeper understanding of the biological process at hand. We identify joint and individual contributions of DNA methylation, miRNA and mRNA expression collected from blood samples in a lung cancer case–control study nested within the Norwegian Women and Cancer (NOWAC) cohort study, and we use such components to build prediction models for case–control and metastatic status. To assess the quality of predictions, we compare models based on simultaneous, integrative analysis of multi-source omics data to a standard non-integrative analysis of each single omics dataset, and to penalized regression models. Additionally, we apply the proposed approach to a breast cancer dataset from The Cancer Genome Atlas. Results: Our results show how an integrative analysis that preserves both components of variation is more appropriate than standard multi-omics analyses that are not based on such a distinction. Both joint and individual components are shown to contribute to a better quality of model predictions, and facilitate the interpretation of the underlying biological processes in lung cancer development. Conclusions: In the presence of multiple omics data sources, we recommend the use of data integration techniques that preserve the joint and individual components across the omics sources. We show how the inclusion of such components increases the quality of model predictions of clinical outcomes.