
    Gene ranking and biomarker discovery under correlation

    Biomarker discovery and gene ranking are standard tasks in genomic high-throughput analysis. Typically, the ordering of markers is based on a stabilized variant of the t-score, such as the moderated t or the SAM statistic. However, these procedures ignore gene-gene correlations, which may have a profound impact on the gene orderings and on the power of the subsequent tests. We propose a simple procedure that adjusts gene-wise t-statistics to take account of correlations among genes. The resulting correlation-adjusted t-scores ("cat" scores) are derived from a predictive perspective, i.e. as a score for variable selection to discriminate group membership in two-class linear discriminant analysis. In the absence of correlation the cat score reduces to the standard t-score. Moreover, using the cat score it is straightforward to evaluate groups of features (i.e. gene sets). For computation of the cat score from small-sample data we propose a shrinkage procedure. In a comparative study comprising six different synthetic and empirical correlation structures we show that the cat score improves estimation of gene orderings and leads to higher power for fixed true discovery rate, and vice versa. Finally, we also illustrate the cat score by analyzing metabolomic data. The shrinkage cat score is implemented in the R package "st", available from http://cran.r-project.org/web/packages/st/.
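
    The core idea, decorrelating ordinary t-scores with the inverse matrix square root of the gene-gene correlation matrix, can be sketched as follows. This is a minimal illustration in Python rather than the reference implementation in the "st" R package; the function name and the use of the plain empirical correlation matrix (instead of a shrinkage estimate) are assumptions made for the example.

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

def cat_scores(X, y):
    """Sketch of correlation-adjusted t-scores: cat = R^{-1/2} t.

    X : (n_samples, n_genes) expression matrix
    y : (n_samples,) binary group labels (0/1)

    A real small-sample analysis would replace np.corrcoef with a
    shrinkage estimate of the correlation matrix, as proposed above.
    """
    X0, X1 = X[y == 0], X[y == 1]
    n0, n1 = len(X0), len(X1)
    # pooled variance and ordinary two-sample t-scores
    s2 = ((n0 - 1) * X0.var(axis=0, ddof=1) +
          (n1 - 1) * X1.var(axis=0, ddof=1)) / (n0 + n1 - 2)
    t = (X1.mean(axis=0) - X0.mean(axis=0)) / np.sqrt(s2 * (1 / n0 + 1 / n1))
    # decorrelate the t-score vector with R^{-1/2}
    R = np.corrcoef(X, rowvar=False)
    return fractional_matrix_power(R, -0.5) @ t

# Rank genes by absolute cat score on toy data
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 10))
y = np.repeat([0, 1], 20)
ranking = np.argsort(-np.abs(cat_scores(X, y)))
```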

    A Multivariate Framework for Variable Selection and Identification of Biomarkers in High-Dimensional Omics Data

    In this thesis, we address the identification of biomarkers in high-dimensional omics data. The identification of valid biomarkers is especially relevant for personalized medicine that depends on accurate prediction rules. Moreover, biomarkers elucidate the provenance of disease, or molecular changes related to disease. From a statistical point of view the identification of biomarkers is best cast as variable selection. In particular, we refer to variables as the molecular attributes under investigation, e.g. genes, genetic variation, or metabolites; and we refer to observations as the specific samples whose attributes we investigate, e.g. patients and controls. Variable selection in high-dimensional omics data is a complicated challenge due to the characteristic structure of omics data. For one, omics data is high-dimensional, comprising cellular information in unprecedented detail. Moreover, there is an intricate correlation structure among the variables due to, e.g., internal cellular regulation or external, latent factors. Variable selection for uncorrelated data is well established. In contrast, there is no consensus on how to approach variable selection under correlation. Here, we introduce a multivariate framework for variable selection that explicitly accounts for the correlation among markers. In particular, we present two novel quantities for variable importance: the correlation-adjusted t (CAT) score for classification, and the correlation-adjusted (marginal) correlation (CAR) score for regression. The CAT score is defined as the Mahalanobis-decorrelated t-score vector, and the CAR score as the Mahalanobis-decorrelated correlation between the predictor variables and the outcome. We derive the CAT and CAR score from a predictive point of view in linear discriminant analysis and regression; both quantities assess the weight of a decorrelated and standardized variable on the prediction rule. Furthermore, we discuss properties of both scores and relations to established quantities. Above all, the CAT score decomposes Hotelling's T² and the CAR score the proportion of variance explained. Notably, the decomposition of total variance into explained and unexplained variance in the linear model can be rewritten in terms of CAR scores. To render our approach applicable to high-dimensional omics data we devise an efficient algorithm for shrinkage estimates of the CAT and CAR score. Subsequently, we conduct extensive simulation studies to investigate the performance of our novel approaches in ranking and prediction under correlation. Here, CAT and CAR scores consistently improve over marginal approaches in terms of more true positives selected and a lower model error. Finally, we illustrate the application of CAT and CAR scores on real omics data. In particular, we analyze genomics, transcriptomics, and metabolomics data. We ascertain that CAT and CAR scores are competitive with or outperform state-of-the-art techniques in terms of true positives detected and prediction error.
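
    Introducing notation not used explicitly above, with P the correlation matrix among predictors, τ the vector of gene-wise t-scores, and P_Xy the vector of marginal correlations between predictors and outcome, the two scores and the decompositions referred to in the abstract can be written compactly as follows (a restatement in symbols, not an additional result):

```latex
% CAT score: Mahalanobis-decorrelated t-score vector
\tau^{\mathrm{adj}} = P^{-1/2}\,\tau
% CAR score: Mahalanobis-decorrelated marginal correlations
\omega = P^{-1/2}\,P_{Xy}
% Decompositions of Hotelling's T^2 and of the proportion of explained variance
T^2 = \sum_i \bigl(\tau^{\mathrm{adj}}_i\bigr)^2,
\qquad
R^2 = \sum_i \omega_i^2
```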

    Inferring Causal Relationships Between Risk Factors and Outcomes from Genome-Wide Association Study Data.

    An observational correlation between a suspected risk factor and an outcome does not necessarily imply that interventions on levels of the risk factor will have a causal impact on the outcome (correlation is not causation). If genetic variants associated with the risk factor are also associated with the outcome, then this increases the plausibility that the risk factor is a causal determinant of the outcome. However, if the genetic variants in the analysis do not have a specific biological link to the risk factor, then causal claims can be spurious. We review the Mendelian randomization paradigm for making causal inferences using genetic variants. We consider monogenic analysis, in which genetic variants are taken from a single gene region, and polygenic analysis, which includes variants from multiple regions. We focus on answering two questions: When can Mendelian randomization be used to make reliable causal inferences, and when can it be used to make relevant causal inferences?
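
    To make the monogenic/polygenic distinction concrete, the following sketch shows the two most basic summary-data estimators: the ratio (Wald) estimate from a single variant and the inverse-variance weighted (IVW) combination of ratio estimates across several variants. All numbers are invented for illustration; the review itself covers many further methods and their assumptions.

```python
import numpy as np

def wald_ratio(beta_gx, beta_gy):
    """Monogenic analysis: causal effect from a single variant as
    (variant-outcome association) / (variant-risk factor association)."""
    return beta_gy / beta_gx

def ivw_estimate(beta_gx, beta_gy, se_gy):
    """Polygenic analysis: fixed-effect inverse-variance weighted
    combination of per-variant ratio estimates."""
    beta_gx, beta_gy, se_gy = map(np.asarray, (beta_gx, beta_gy, se_gy))
    ratios = beta_gy / beta_gx
    weights = beta_gx**2 / se_gy**2          # first-order weights
    estimate = np.sum(weights * ratios) / np.sum(weights)
    return estimate, 1.0 / np.sqrt(np.sum(weights))   # estimate, standard error

# Illustrative (made-up) summary statistics for three variants
bx = np.array([0.10, 0.08, 0.12])   # variant-risk factor associations
by = np.array([0.05, 0.03, 0.07])   # variant-outcome associations
sy = np.array([0.01, 0.01, 0.02])   # SEs of the variant-outcome associations
print(wald_ratio(bx[0], by[0]))
print(ivw_estimate(bx, by, sy))
```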

    Task-related edge density (TED) - a new method for revealing large-scale network formation in fMRI data of the human brain

    The formation of transient networks in response to external stimuli or as a reflection of internal cognitive processes is a hallmark of human brain function. However, its identification in fMRI data of the human brain is notoriously difficult. Here we propose a new method of fMRI data analysis that tackles this problem by considering large-scale, task-related synchronisation networks. Networks consist of nodes and edges connecting them, where nodes correspond to voxels in fMRI data, and the weight of an edge is determined via task-related changes in dynamic synchronisation between their respective time series. Based on these definitions, we developed a new data analysis algorithm that identifies edges in a brain network that differentially respond in unison to a task onset and that occur in dense packs with similar characteristics. Hence, we call this approach "Task-related Edge Density" (TED). TED proved to be a very strong marker for dynamic network formation that easily lends itself to statistical analysis using large-scale statistical inference. A major advantage of TED compared to other methods is that it does not depend on any specific hemodynamic response model, and it also does not require a presegmentation of the data for dimensionality reduction, as it can handle large networks consisting of tens of thousands of voxels. We applied TED to fMRI data of a finger-tapping task provided by the Human Connectome Project. TED revealed network-based involvement of a large number of brain areas that evaded detection using traditional GLM-based analysis. We show that our proposed method provides an entirely new window into the immense complexity of human brain function.
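
    The following Python sketch conveys the flavour of scoring an edge by how its windowed synchronisation changes around task onsets. It is a deliberately simplified toy: the actual TED edge definition, the grouping of edges into dense packs, and the large-scale statistical inference step are not reproduced here, and the window parameters are arbitrary assumptions.

```python
import numpy as np

def sliding_correlation(x, y, width):
    """Windowed Pearson correlation between two voxel time series."""
    return np.array([np.corrcoef(x[i:i + width], y[i:i + width])[0, 1]
                     for i in range(len(x) - width + 1)])

def task_related_edge_weight(x, y, onsets, width=20, lag=5):
    """Toy edge weight: mean change in windowed synchronisation after each
    task onset relative to the immediately preceding baseline window."""
    r = sliding_correlation(x, y, width)
    deltas = []
    for t in onsets:
        pre = r[max(t - width - lag, 0):max(t - lag, 1)]
        post = r[t:t + width]
        if len(pre) and len(post):
            deltas.append(post.mean() - pre.mean())
    return float(np.mean(deltas)) if deltas else 0.0
```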

    Selection of invalid instruments can improve estimation in Mendelian randomization

    Mendelian randomization (MR) is a widely used method to identify causal links between a risk factor and disease. A fundamental part of any MR analysis is to choose appropriate genetic variants as instrumental variables. Current practice usually involves selecting only those genetic variants that are deemed to satisfy certain exclusion restrictions, in a bid to remove bias from unobserved confounding. Many more genetic variants may violate these exclusion restrictions due to unknown pleiotropic effects (i.e. direct effects on the outcome not via the exposure), but their inclusion could increase the precision of causal effect estimates at the cost of allowing some bias. We explore how to optimally tackle this bias-variance trade-off by carefully choosing from many weak and locally invalid instruments. Specifically, we study a focused instrument selection approach for publicly available two-sample summary data on genetic associations, whereby genetic variants are selected on the basis of how they impact the asymptotic mean square error of causal effect estimates. We show how different restrictions on the nature of pleiotropic effects have important implications for the quality of post-selection inferences. In particular, a focused selection approach under systematic pleiotropy allows for consistent model selection, but in practice can be susceptible to winner's curse biases. In contrast, a more general form of idiosyncratic pleiotropy allows only conservative model selection but offers uniformly valid confidence intervals. We propose a novel method to tighten honest confidence intervals through support restrictions on pleiotropy. We apply our results to several real data examples, which suggest that the optimal selection of instruments involves not only biologically justified valid instruments but also hundreds of potentially pleiotropic variants.
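
    The bias-variance trade-off described above can be conveyed with a deliberately simplified greedy heuristic: starting from a core of assumed-valid instruments, a candidate variant is added only if the variance reduction it brings appears to outweigh its estimated squared bias. This sketch and its bias proxy are illustrative assumptions, not the paper's focused selection criterion or its post-selection inference.

```python
import numpy as np

def ivw(bx, by, sy):
    """Fixed-effect IVW estimate and its variance from summary data."""
    w = bx**2 / sy**2
    return np.sum(w * by / bx) / np.sum(w), 1.0 / np.sum(w)

def greedy_mse_selection(bx, by, sy, core):
    """Add a candidate variant if estimated variance + squared bias with the
    variant is smaller than the variance without it (toy criterion)."""
    bx, by, sy = map(np.asarray, (bx, by, sy))
    selected = list(core)
    theta_core, _ = ivw(bx[selected], by[selected], sy[selected])
    for j in [j for j in range(len(bx)) if j not in core]:
        idx = selected + [j]
        _, var_without = ivw(bx[selected], by[selected], sy[selected])
        _, var_with = ivw(bx[idx], by[idx], sy[idx])
        w = bx[idx]**2 / sy[idx]**2
        # crude bias proxy: variant j's deviation, scaled by its weight share
        bias = (w[-1] / w.sum()) * (by[j] / bx[j] - theta_core)
        if var_with + bias**2 < var_without:
            selected.append(j)
    return selected
```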

    Modal-based estimation via heterogeneity-penalized weighting: model averaging for consistent and efficient estimation in Mendelian randomization when a plurality of candidate instruments are valid.

    BACKGROUND: A robust method for Mendelian randomization does not require all genetic variants to be valid instruments to give consistent estimates of a causal parameter. Several such methods have been developed, including a mode-based estimation method giving consistent estimates if a plurality of genetic variants are valid instruments; i.e. there is no larger subset of invalid instruments estimating the same causal parameter than the subset of valid instruments. METHODS: We here develop a model-averaging method that gives consistent estimates under the same 'plurality of valid instruments' assumption. The method considers a mixture distribution of estimates derived from each subset of genetic variants. The estimates are weighted such that subsets with more genetic variants receive more weight, unless variants in the subset have heterogeneous causal estimates, in which case that subset is severely down-weighted. The mode of this mixture distribution is the causal estimate. This heterogeneity-penalized model-averaging method has several technical advantages over the previously proposed mode-based estimation method. RESULTS: The heterogeneity-penalized model-averaging method outperformed the mode-based estimation method in terms of efficiency and outperformed other robust methods in terms of Type 1 error rate in an extensive simulation analysis. The proposed method suggests two distinct mechanisms by which inflammation affects coronary heart disease risk, with subsets of variants suggesting both positive and negative causal effects. CONCLUSIONS: The heterogeneity-penalized model-averaging method is an additional robust method for Mendelian randomization with excellent theoretical and practical properties, and can reveal features in the data such as the presence of multiple causal mechanisms.
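
    A toy version of the weighting logic described in the METHODS paragraph can be written as follows: every subset of variants contributes its IVW estimate to a mixture, larger subsets receive more weight, heterogeneous subsets (high Cochran's Q) are exponentially down-weighted, and the mode of the weighted mixture is taken as the causal estimate. The specific weight formula, penalty parameter, and kernel-density mode-finding below are illustrative assumptions; the published weighting scheme differs in detail.

```python
import numpy as np
from itertools import combinations
from scipy.stats import gaussian_kde

def subset_estimate(bx, by, sy, idx):
    """IVW estimate and Cochran's Q heterogeneity for a subset of variants."""
    w = bx[idx]**2 / sy[idx]**2
    ratios = by[idx] / bx[idx]
    theta = np.sum(w * ratios) / np.sum(w)
    return theta, np.sum(w * (ratios - theta) ** 2)

def heterogeneity_penalized_mode(bx, by, sy, min_size=2, penalty=1.0):
    """Mode of a mixture of subset estimates, with subsets weighted by size
    and exponentially down-weighted for heterogeneity (toy weighting).
    Subset enumeration is exponential, so use only a handful of variants."""
    bx, by, sy = map(np.asarray, (bx, by, sy))
    p = len(bx)
    thetas, weights = [], []
    for k in range(min_size, p + 1):
        for idx in combinations(range(p), k):
            theta, q = subset_estimate(bx, by, sy, list(idx))
            thetas.append(theta)
            weights.append(k * np.exp(-penalty * q))
    kde = gaussian_kde(thetas, weights=weights)
    grid = np.linspace(min(thetas), max(thetas), 1000)
    return grid[np.argmax(kde(grid))]
```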

    Selecting likely causal risk factors from high-throughput experiments using multivariable Mendelian randomization

    Modern high-throughput experiments provide a rich resource to investigate causal determinants of disease risk. Mendelian randomization (MR) is the use of genetic variants as instrumental variables to infer the causal effect of a specific risk factor on an outcome. Multivariable MR is an extension of the standard MR framework to consider multiple potential risk factors in a single model. However, current implementations of multivariable MR use standard linear regression and hence perform poorly with many risk factors. Here, we propose a two-sample multivariable MR approach based on Bayesian model averaging (MR-BMA) that scales to high-throughput experiments. In a realistic simulation study, we show that MR-BMA can detect true causal risk factors even when the candidate risk factors are highly correlated. We illustrate MR-BMA by analysing publicly available summarized data on metabolites to prioritise likely causal biomarkers for age-related macular degeneration.
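
    The model-averaging logic can be sketched in a few lines: regress the variant-outcome associations on the variant-risk-factor associations restricted to each candidate subset of risk factors, score each subset by an approximate marginal likelihood, and sum the resulting posterior model probabilities per risk factor. The BIC approximation and the names below are assumptions for the example; MR-BMA itself uses a conjugate prior formulation, inverse-variance weighting of variants, and permutation-based significance assessment.

```python
import numpy as np
from itertools import combinations

def mr_bma_sketch(bX, bY, max_size=2):
    """Toy model averaging over subsets of risk factors.

    bX : (n_variants, n_risk_factors) genetic associations with risk factors
    bY : (n_variants,) genetic associations with the outcome
    """
    n, p = bX.shape
    models, scores = [], []
    for k in range(1, max_size + 1):
        for idx in combinations(range(p), k):
            Xs = bX[:, list(idx)]
            beta, rss, *_ = np.linalg.lstsq(Xs, bY, rcond=None)
            rss = rss[0] if np.size(rss) else np.sum((bY - Xs @ beta) ** 2)
            bic = n * np.log(rss / n) + k * np.log(n)
            models.append(idx)
            scores.append(-0.5 * bic)
    post = np.exp(np.array(scores) - np.max(scores))
    post /= post.sum()
    # marginal inclusion probability for each risk factor
    marginal = np.array([post[[j in m for m in models]].sum() for j in range(p)])
    return models, post, marginal
```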

    High-throughput multivariable Mendelian randomization analysis prioritizes apolipoprotein B as key lipid risk factor for coronary artery disease.

    BACKGROUND: Genetic variants can be used to prioritize risk factors as potential therapeutic targets via Mendelian randomization (MR). An agnostic statistical framework using Bayesian model averaging (MR-BMA) can disentangle the causal role of correlated risk factors with shared genetic predictors. Here, our objective is to identify lipoprotein measures as mediators between lipid-associated genetic variants and coronary artery disease (CAD) for the purpose of detecting therapeutic targets for CAD. METHODS: As risk factors we consider 30 lipoprotein measures and metabolites derived from a high-throughput metabolomics study including 24 925 participants. We fit multivariable MR models of genetic associations with CAD, estimated in 453 595 participants (including 113 937 cases), regressed on genetic associations with the risk factors. MR-BMA assigns to each combination of risk factors a model score quantifying how well the genetic associations with CAD are explained. Risk factors are ranked by their marginal score and selected using false discovery rate (FDR) criteria. We perform supplementary and sensitivity analyses varying the dataset for genetic associations with CAD. RESULTS: In the main analysis, the top combination of risk factors ranked by the model score contains apolipoprotein B (ApoB) only. ApoB is also the highest-ranked risk factor with respect to the marginal score (FDR < 0.005). Additionally, ApoB is selected in all sensitivity analyses. No other measure of cholesterol or triglycerides is consistently selected. CONCLUSIONS: Our agnostic genetic investigation prioritizes ApoB across all datasets considered, suggesting that ApoB, representing the total number of hepatic-derived lipoprotein particles, is the primary lipid determinant of CAD.
