
    A generalization of moderated statistics to data adaptive semiparametric estimation in high-dimensional biology

    The widespread availability of high-dimensional biological data has made the simultaneous screening of numerous biological characteristics a central statistical problem in computational biology. While the dimensionality of such datasets continues to increase, the problem of teasing out the effects of biomarkers in studies measuring baseline confounders while avoiding model misspecification remains only partially addressed. Efficient estimators constructed from data adaptive estimates of the data-generating distribution provide an avenue for avoiding model misspecification; however, in the context of high-dimensional problems requiring simultaneous estimation of numerous parameters, standard variance estimators have proven unstable, resulting in unreliable Type-I error control under standard multiple testing corrections. We present the formulation of a general approach for applying empirical Bayes shrinkage to asymptotically linear estimators of parameters defined in the nonparametric model. The proposal applies existing shrinkage estimators to the estimated variance of the influence function, allowing for increased inferential stability in high-dimensional settings. A methodology for nonparametric variable importance analysis for use with high-dimensional biological datasets with modest sample sizes is introduced, and the proposed technique is demonstrated to be robust in small samples even when relying on data adaptive estimators that eschew parametric forms. Use of the proposed variance moderation strategy in constructing stabilized variable importance measures of biomarkers is demonstrated by application to an observational study of occupational exposure. The result is a data adaptive approach for robustly uncovering stable associations in high-dimensional data with limited sample sizes.
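
The variance moderation idea can be illustrated with a minimal sketch. It assumes one common empirical Bayes shrinkage form, a degrees-of-freedom-weighted average of each biomarker's influence-function variance and a shared prior variance, as used for moderated t-statistics. The function names, the use of `n - 1` degrees of freedom, and the prior parameters `prior_df` and `prior_var` (which would normally be estimated from all biomarkers jointly) are assumptions for illustration, not details taken from the abstract.

```python
import math


def moderated_variances(ic_variances, n, prior_df, prior_var):
    """Shrink each influence-function variance toward a shared prior variance.

    ic_variances: per-biomarker sample variance of the estimated influence function
    n: sample size (n - 1 residual degrees of freedom is an assumption here)
    prior_df, prior_var: hypothetical prior parameters, supplied by the caller
    """
    df = n - 1
    return [
        (prior_df * prior_var + df * s2) / (prior_df + df)
        for s2 in ic_variances
    ]


def moderated_t(estimates, ic_variances, n, prior_df, prior_var):
    """Test statistics using the shrunken variances; the standard error of an
    asymptotically linear estimator is sqrt(var(IC) / n)."""
    shrunk = moderated_variances(ic_variances, n, prior_df, prior_var)
    return [est / math.sqrt(v / n) for est, v in zip(estimates, shrunk)]
```

With few observations, a biomarker whose estimated variance happens to be very small no longer produces an inflated test statistic, because its variance is pulled toward the shared value.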

    Supervised Distance Matrices: Theory and Applications to Genomics

    We propose a new approach to studying the relationship between a very high dimensional random variable and an outcome. Our method is based on a novel concept, the supervised distance matrix, which quantifies pairwise similarity between variables based on their association with the outcome. A supervised distance matrix is derived in two stages. The first stage involves a transformation based on a particular model for association. In particular, one might regress the outcome on each variable and then use the residuals or the influence curve from each regression as a data transformation. In the second stage, a choice of distance measure is used to compute all pairwise distances between variables in this transformed data. When the outcome is right-censored, we show that the supervised distance matrix can be consistently estimated using inverse probability of censoring weighted (IPCW) estimators based on the mean and covariance of the transformed data. The proposed methodology is illustrated with examples of gene expression data analysis with a survival outcome. This approach is widely applicable in genomics and other fields where high-dimensional data are collected on each subject.
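
The two-stage construction can be sketched as follows. This is a minimal illustration under stated assumptions: the outcome is uncensored (no IPCW weighting is applied), the association model is a simple least-squares regression of the outcome on each variable with the residuals as the transformation, and Euclidean distance is the chosen distance measure. The function names are hypothetical.

```python
import math


def residuals(x, y):
    """Residuals from the simple least-squares regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def supervised_distance_matrix(variables, outcome):
    """Stage 1: transform each variable into its regression residuals.
    Stage 2: compute all pairwise Euclidean distances between the transforms."""
    transformed = [residuals(x, outcome) for x in variables]
    p = len(transformed)
    dist = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1, p):
            d = math.dist(transformed[i], transformed[j])
            dist[i][j] = dist[j][i] = d
    return dist
```

Two variables that explain the outcome in the same way (for example, one a rescaling of the other) yield identical residuals and therefore a distance of zero, regardless of how different their raw values are.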

    Resampling-based Multiple Testing: Asymptotic Control of Type I Error and Applications to Gene Expression Data

    We define a general statistical framework for multiple hypothesis testing and show that the correct null distribution for the test statistics is obtained by projecting the true distribution of the test statistics onto the space of mean zero distributions. For common choices of test statistics (based on an asymptotically linear parameter estimator), this distribution is asymptotically multivariate normal with mean zero and the covariance of the vector influence curve for the parameter estimator. This test statistic null distribution can be estimated by applying the non-parametric or parametric bootstrap to correctly centered test statistics. We prove that this bootstrap estimated null distribution provides asymptotic control of most type I error rates. We show that obtaining a test statistic null distribution from a data null distribution (e.g., projecting the data generating distribution onto the space of all distributions satisfying the complete null) only provides the correct test statistic null distribution if the covariance of the vector influence curve is the same under the data null distribution as under the true data distribution. This condition is a weak version of the subset pivotality condition. We show that our multiple testing methodology controlling type I error is equivalent to constructing an error-specific confidence region for the true parameter and checking if it contains the hypothesized value. We also study the two sample problem and show that the permutation distribution produces an asymptotically correct null distribution if (i) the sample sizes are equal or (ii) the populations have the same covariance structure. We include a discussion of the application of multiple testing to gene expression data, where the dimension typically far exceeds the sample size. An analysis of a cancer gene expression data set illustrates the methodology.
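
The role of centering in the bootstrap null distribution can be shown with a small sketch. It assumes the parameters are column means, the test statistics are unstandardized differences scaled by the square root of the sample size, and the error rate is the family-wise error rate controlled with a single-step common cutoff on the maximum absolute statistic. These choices and all function names are illustrative assumptions, not the paper's implementation.

```python
import math
import random


def column_means(rows):
    """Mean of each column of a list of equal-length rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def bootstrap_max_cutoff(rows, n_boot=500, alpha=0.05, seed=0):
    """Estimate the (1 - alpha) quantile of the maximum absolute statistic
    under a mean-zero null, using the nonparametric bootstrap."""
    rng = random.Random(seed)
    n = len(rows)
    observed = column_means(rows)
    max_stats = []
    for _ in range(n_boot):
        sample = [rows[rng.randrange(n)] for _ in range(n)]
        boot = column_means(sample)
        # Center each resampled estimate at the observed estimate, so the
        # resampled statistics have mean zero whatever the true parameter is.
        max_stats.append(
            max(abs(math.sqrt(n) * (b - o)) for b, o in zip(boot, observed))
        )
    max_stats.sort()
    index = min(n_boot - 1, math.ceil((1 - alpha) * n_boot) - 1)
    return max_stats[index]


def reject(rows, null_values, n_boot=500, alpha=0.05, seed=0):
    """Reject each hypothesis whose absolute statistic exceeds the common cutoff."""
    n = len(rows)
    cutoff = bootstrap_max_cutoff(rows, n_boot=n_boot, alpha=alpha, seed=seed)
    stats = [math.sqrt(n) * (m - v) for m, v in zip(column_means(rows), null_values)]
    return [abs(t) > cutoff for t in stats]
```

Because the resampled statistics are centered at the observed estimates rather than at the hypothesized values, the estimated null distribution keeps the covariance structure of the data without requiring a separate data null distribution.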

    Statistical Inference for Simultaneous Clustering of Gene Expression Data

    Current methods for analysis of gene expression data are mostly based on clustering and classification of either genes or samples. We offer support for the idea that more complex patterns can be identified in the data if genes and samples are considered simultaneously. We formalize the approach and propose a statistical framework for two-way clustering. A simultaneous clustering parameter is defined as a function of the true data generating distribution, and an estimate is obtained by applying this function to the empirical distribution. We illustrate that a wide range of clustering procedures, including generalized hierarchical methods, can be defined as parameters which are compositions of individual mappings for clustering patients and genes. This framework allows one to assess classical properties of clustering methods, such as consistency, and to formally study statistical inference regarding the clustering parameter. We present results of simulations designed to assess the asymptotic validity of different bootstrap methods for estimating the distributions of estimated simultaneous clustering parameters. The method is illustrated on a publicly available data set.

    Family in Rehabilitation, Empowering Carers for Improved Malnutrition Outcomes: Protocol for the FREER Pilot Study

    Interventions to improve the nutritional status of older adults and the integration of formal and family care systems are critical research areas to improve the independence and health of aging communities and are particularly relevant in the rehabilitation setting. The primary aim is to determine whether the FREER (Family in Rehabilitation: EmpowERing Carers for improved malnutrition outcomes) intervention in malnourished older adults during and postrehabilitation improves nutritional status, physical function, quality of life, service satisfaction, and hospital and aged care admission rates up to 3 months postdischarge, compared with usual care. Secondary outcomes evaluated include family carer burden, carer services satisfaction, and patient and carer experiences. This pilot study will also assess feasibility and intervention fidelity to inform a larger randomized controlled trial. This protocol is for a mixed-methods two-arm historically-controlled prospective pilot study intervention. The historical control group has 30 participants, and the pilot intervention group aims to recruit 30 patient-carer pairs. The FREER intervention delivers nutrition counseling during rehabilitation, 3 months of postdischarge telehealth follow-up, and provides supportive resources using a novel model of patient-centered and carer-centered nutrition care. The primary outcome is nutritional status measured by the Scored Patient-Generated Subjective Global Assessment Score. Qualitative outcomes such as experiences and perceptions of value will be measured using semistructured interviews followed by thematic analysis. The process evaluation addresses intervention fidelity and feasibility. Recruitment commenced on July 4, 2018, and is ongoing with eight patient-carer pairs recruited at the time of manuscript submission. This research will inform a larger randomized controlled trial, with potential for translation to health service policies and new models of dietetic care to support the optimization of nutritional status across a continuum of nutrition care from rehabilitation to home. Australian New Zealand Clinical Trials Registry Number (ACTRN) 12618000338268; https://www.anzctr.org.au/Trial/Registration/TrialReview.aspx?id=374608&isReview=true (Archived by WebCite at http://www.webcitation.org/74gtZplU2). DERR1-10.2196/12647

    Revisiting the propensity score's central role: Towards bridging balance and efficiency in the era of causal machine learning

    About forty years ago, in a now-seminal contribution, Rosenbaum & Rubin (1983) introduced a critical characterization of the propensity score as a central quantity for drawing causal inferences in observational study settings. In the decades since, much progress has been made across several research fronts in causal inference, notably including the re-weighting and matching paradigms. Focusing on the former and specifically on its intersection with machine learning and semiparametric efficiency theory, we re-examine the role of the propensity score in modern methodological developments. As Rosenbaum & Rubin's (1983) contribution spurred a focus on the balancing property of the propensity score, we re-examine the degree to which and how this property plays a role in the development of asymptotically efficient estimators of causal effects; moreover, we discuss a connection between the balancing property and efficient estimation in the form of score equations and propose a score test for evaluating whether an estimator achieves balance. Comment: Accepted for publication in a forthcoming special issue of Observational Studies.
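
The balancing property in the re-weighting paradigm can be illustrated with a small diagnostic. The sketch below is not the score test proposed in the paper; it is a basic check, under the assumption that propensity scores are already available, of whether inverse probability weighting equalizes the weighted mean of a covariate across treatment arms. The function names are hypothetical.

```python
def weighted_mean(values, weights):
    """Weighted average of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def ipw_balance_gap(covariate, treatment, propensity):
    """Difference in inverse-probability-weighted covariate means
    between treated (A = 1) and control (A = 0) units.

    Treated units get weight 1 / e(X); controls get 1 / (1 - e(X)).
    A gap near zero indicates the weights balance this covariate.
    """
    treated = [(x, 1.0 / e) for x, a, e in zip(covariate, treatment, propensity) if a == 1]
    control = [(x, 1.0 / (1.0 - e)) for x, a, e in zip(covariate, treatment, propensity) if a == 0]
    mean_treated = weighted_mean([x for x, _ in treated], [w for _, w in treated])
    mean_control = weighted_mean([x for x, _ in control], [w for _, w in control])
    return mean_treated - mean_control


# Toy data: treatment is more likely when the binary covariate equals 1.
X = [0, 0, 0, 0, 1, 1, 1, 1]
A = [1, 0, 0, 0, 1, 1, 1, 0]
true_propensity = [0.25] * 4 + [0.75] * 4
gap_weighted = ipw_balance_gap(X, A, true_propensity)
gap_unweighted = ipw_balance_gap(X, A, [0.5] * 8)  # constant weights: raw mean difference
```

In this toy data the raw covariate means differ between arms, while weighting by the true propensity score removes the difference.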

    Natural convection in high heat flux tanks at the Hanford Waste Site / [by] Mark van der Helm and Mujid S. Kazimi

    "February 1996." Series statement handwritten on title page. Page 118 blank. Also issued as an M.S. thesis written by the first author and supervised by the second author, MIT Dept. of Nuclear Engineering. Includes bibliographical references (pages 115-117).
    A study was carried out on the potential for, and the effect of, natural convection in a high heat flux tank, Tank 241-C-106, at the Hanford Reservation. To determine the existence of natural convection, multiple computations based on analytical models were made using the known tank geometry and the thermal characteristics of its contents. Each computation determined the onset of natural convection by treating the tank as a 1-D porous medium. Computations were done for a range of permeabilities, considering the porous medium alone, with a superposed fluid layer, and with a salt gradient. Considering only the porous medium, the higher permeability value, 3.2 × 10⁻¹⁰ ft², allowed convection, though the lower permeability, 2.6 × 10⁻¹⁴ ft², did not. The presence of the superposed layer induced convection throughout the porous medium for the full range of permeabilities. Considering the salt gradient and superposed layer together, the superposed layer is expected to induce convection despite the stabilizing salt gradient. Therefore, natural convection is expected to exist in Tank 241-C-106. Secondly, because temperature measurements indicated lower temperatures at a location near the center of the tank, a thermal model was used to compute the local effects of a convective annulus around a thermocouple tree at that location. A conduction model of the tank and surroundings was used to bound the local model. The local model, which allows convection in the annulus, set the size of the annulus based on the known temperature measurements of the thermocouple tree and the boundary conditions set by the conduction model. Previously published calculations on Tank 241-C-106, allowing for only conduction within the tank, reported a steam region at the bottom of the tank with an approximately 24 foot radius. In the present analysis, using the computer code TEMPEST, it is found that the cooling effect of the convective annulus creates a region with a 12 foot radius surrounding the thermocouple tree in which the temperature is suppressed below the saturation temperature. The annulus gap width needed to match the measured temperatures and the boundary conditions is on the order of 1 inch.