
    Estimation of significance thresholds for genomewide association scans

    The question of what significance threshold is appropriate for genomewide association studies is somewhat unresolved. Previous theoretical suggestions have yet to be validated in practice, whereas permutation testing does not resolve a discrepancy between the genomewide multiplicity of the experiment and the subset of markers actually tested. We used genotypes from the Wellcome Trust Case-Control Consortium to estimate a genomewide significance threshold for the UK Caucasian population. We subsampled the genotypes at increasing densities, using permutation to estimate the nominal P-value for 5% family-wise error. By extrapolating to infinite density, we estimated the genomewide significance threshold to be about 7.2 × 10⁻⁸. To reduce the computation time, we considered Patterson's eigenvalue estimator of the effective number of tests, but found it to be an order of magnitude too low for multiplicity correction. However, by fitting a Beta distribution to the minimum P-value from permutation replicates, we showed that the effective number is a useful heuristic and suggest that its estimation in this context is an open problem. We conclude that permutation is still needed to obtain genomewide significance thresholds, but with subsampling, extrapolation and estimation of an effective number of tests, the threshold can be standardized for all studies of the same population.
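
    As an illustration of the Beta-fitting idea described above, the sketch below fits a Beta(1, n) distribution to minimum P-values from permutation replicates and converts the fitted shape parameter into a 5% family-wise threshold. It is a minimal stand-in, not the study's pipeline: the permutation minima are simulated here, whereas in a real analysis they would come from permuting phenotype labels and re-testing every marker.

```python
# Minimal sketch (assumed workflow, not the paper's code): estimate an
# "effective number of tests" by fitting Beta(1, n_eff) to the minimum
# P-value observed in each permutation replicate.
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for permutation replicates: each replicate would normally permute
# the phenotype, re-test all markers and record the smallest P-value.
n_markers, n_replicates = 5000, 1000
min_pvalues = rng.uniform(size=(n_replicates, n_markers)).min(axis=1)

# For Beta(1, b), the log-likelihood is maximised at b = -n / sum(log(1 - p)).
n_eff = -len(min_pvalues) / np.sum(np.log1p(-min_pvalues))

# Per-marker threshold t giving 5% family-wise error: 1 - (1 - t)^n_eff = 0.05.
threshold = 1.0 - (1.0 - 0.05) ** (1.0 / n_eff)
print(f"effective number of tests ~ {n_eff:.0f}")
print(f"5% family-wise threshold  ~ {threshold:.2e}")
```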

    MODEL AVERAGING, AN ALTERNATIVE APPROACH TO MODEL SELECTION IN HIGH DIMENSIONAL DATA ESTIMATION

    Model averaging is an alternative to classical model selection in model estimation. Model selection procedures such as forward or stepwise regression use criteria such as AIC and BIC to choose the single best-fitting model. Model averaging, by contrast, estimates one model whose parameters are determined by a weighted average of the parameters of the candidate models. Instead of conducting inference and prediction based on only one chosen model, model averaging addresses the problem of model uncertainty by including all possible models when determining the prediction model. Some of its developments, applications and challenges are described in this paper, with an emphasis on frequentist model averaging. Keywords: model selection, frequentist model averaging, high-dimensional data
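
    A minimal sketch of the frequentist idea, under assumptions not taken from the paper: candidate ordinary least squares models are weighted by smoothed-AIC weights, w_k ∝ exp(-AIC_k/2), and their coefficient vectors are averaged.

```python
# Illustrative frequentist model averaging with smoothed-AIC weights
# (a generic sketch, not the estimator studied in the paper).
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, p = 60, 4
X = rng.normal(size=(n, p))
y = X @ np.array([1.5, 0.0, -2.0, 0.0]) + rng.normal(size=n)

def fit_ols(Xs, y):
    """Fit OLS on a subset of columns and return (coefficients, AIC)."""
    beta = np.linalg.lstsq(Xs, y, rcond=None)[0]
    rss = float(np.sum((y - Xs @ beta) ** 2))
    aic = n * np.log(rss / n) + 2 * Xs.shape[1]
    return beta, aic

# Candidate models: every non-empty subset of the predictors.
coefs, aics = [], []
for r in range(1, p + 1):
    for subset in itertools.combinations(range(p), r):
        beta, aic = fit_ols(X[:, subset], y)
        full = np.zeros(p)
        full[list(subset)] = beta      # excluded predictors keep coefficient 0
        coefs.append(full)
        aics.append(aic)

aics = np.array(aics)
weights = np.exp(-(aics - aics.min()) / 2)
weights /= weights.sum()
beta_avg = np.average(np.array(coefs), axis=0, weights=weights)
print("model-averaged coefficients:", np.round(beta_avg, 2))
```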

    Detecting multiple associations in genome-wide studies.

    Recent developments in the statistical analysis of genome-wide studies are reviewed. Genome-wide analyses are becoming increasingly common in areas such as scans for disease-associated markers and gene expression profiling. The data generated by these studies present new problems for statistical analysis, owing to the large number of hypothesis tests, comparatively small sample size and modest number of true gene effects. In this review, strategies are described for optimising the genotyping cost by discarding unpromising genes at an earlier stage, saving resources for the genes that show a trend of association. In addition, there is a review of new methods of analysis that combine evidence across genes to increase sensitivity to multiple true associations in the presence of many non-associated genes. Some methods achieve this by including only the most significant results, whereas others model the overall distribution of results as a mixture of distributions from true and null effects. Because genes are correlated even when having no effect, permutation testing is often necessary to estimate the overall significance, but this can be very time consuming. Efficiency can be improved by fitting a parametric distribution to permutation replicates, which can be re-used in subsequent analyses. Methods are also available to generate random draws from the permutation distribution. The review also includes discussion of new error measures that give a more reasonable interpretation of genome-wide studies, together with improved sensitivity. The false discovery rate allows a controlled proportion of positive results to be false, while detecting more true positives; the local false discovery rate and false-positive report probability give clarity on whether or not a statistically significant test represents a real discovery.
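
    The false discovery rate mentioned above is most often controlled with the Benjamini-Hochberg step-up rule; the sketch below is a generic illustration of that rule applied to simulated P-values, not code from any of the reviewed methods.

```python
# Benjamini-Hochberg FDR control: a generic illustration of the error measure
# discussed above, applied to simulated P-values.
import numpy as np

def benjamini_hochberg(pvalues, q=0.05):
    """Return a boolean mask of discoveries at FDR level q."""
    p = np.asarray(pvalues)
    m = len(p)
    order = np.argsort(p)
    passed = p[order] <= q * np.arange(1, m + 1) / m
    discoveries = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.nonzero(passed)[0].max()   # largest rank passing the step-up rule
        discoveries[order[:k + 1]] = True
    return discoveries

# Example: 950 null P-values plus 50 genes with a real effect.
rng = np.random.default_rng(2)
pvals = np.concatenate([rng.uniform(size=950), rng.beta(0.05, 1.0, size=50)])
print("discoveries at FDR 5%:", int(benjamini_hochberg(pvals).sum()))
```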

    Discussion on the paper ‘Statistical contributions to bioinformatics: Design, modelling, structure learning and integration’ by Jeffrey S. Morris and Veerabhadran Baladandayuthapani

    Bioinformatics is an important research area for statisticians. This discussion contributes some additional topics to the paper, namely statistical contributions to the detection of differentially expressed genes, to protein structure prediction, and to the analysis of highly correlated features in glycomics datasets.

    Identification of candidate genes linking systemic inflammation to atherosclerosis; results of a human in vivo LPS infusion study.

    BACKGROUND: It is widely accepted that atherosclerosis and inflammation are intimately linked. Monocytes play a key role in both of these processes, and we hypothesized that activation of inflammatory pathways in monocytes would lead to, among others, proatherogenic changes in the monocyte transcriptome. Such differentially expressed genes in circulating monocytes would be strong candidates for further investigation in disease association studies. METHODS: Endotoxin, lipopolysaccharide (LPS), or saline control was infused in healthy volunteers. Monocyte RNA was isolated, processed and hybridized to Hver 2.1.1 spotted cDNA microarrays. Differential expression of key genes was confirmed by RT-PCR, and results were compared to in vitro data obtained by our group to identify candidate genes. RESULTS: All subjects who received LPS experienced the anticipated clinical response, indicating successful stimulation. One hour after LPS infusion, 11 genes were identified as differentially expressed: 1 down-regulated and 10 up-regulated. Four hours after LPS infusion, 28 genes were identified as differentially expressed: 3 down-regulated and 25 up-regulated. No genes were significantly differentially expressed following saline infusion. Comparison with results obtained in in vitro experiments led to the identification of 6 strong candidate genes (BATF, BID, C3aR1, IL1RN, SEC61B and SLC43A3). CONCLUSION: In vivo endotoxin exposure of healthy individuals resulted in the identification of several candidate genes through which systemic inflammation links to atherosclerosis.

    Copy number alterations in cancer: detection and precision medicine

    Copy number alterations (CNAs) are genomic alterations in which some regions exhibit more or fewer copies than the normal two. In this talk, I will describe two ideas: (1) how CNAs are estimated from data generated by next generation sequencing (NGS) and what steps are required to make the data interpretable, and (2) how CNAs can be utilised for precision medicine in terms of predicting tumour subtypes and predicting cancer patients’ survival. If time permits, I will also discuss how to estimate genomic markers from CNA profiles across cancer patients.
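
    A hypothetical, minimal sketch related to the first idea: one common starting point for CNA estimation from NGS data is the log2 ratio of tumour to matched-normal read counts in genomic bins, after normalising for library size. This is a generic illustration, not the pipeline described in the talk.

```python
# Generic first step of CNA calling from binned read counts (illustration only).
import numpy as np

rng = np.random.default_rng(3)
n_bins = 200
normal_counts = rng.poisson(100, size=n_bins)

# Simulated tumour coverage: bins 50-99 gained (3 copies), bins 150-179 lost.
expected = np.full(n_bins, 100.0)
expected[50:100] *= 1.5
expected[150:180] *= 0.5
tumour_counts = rng.poisson(expected)

# Normalise each sample by its total coverage, then take the log2 copy ratio.
ratio = (tumour_counts / tumour_counts.sum()) / (normal_counts / normal_counts.sum())
log2_ratio = np.log2(ratio + 1e-9)

gains = np.flatnonzero(log2_ratio > 0.3)
losses = np.flatnonzero(log2_ratio < -0.3)
print(f"bins flagged as gain: {len(gains)}, as loss: {len(losses)}")
```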

    Regression on high-dimensional predictor space : With application in chemometrics and microarray data

    This thesis focuses on regression methodology for prediction and classification in situations where there are many predictors but a limited number of observations. This situation is common in chemometrics and microarray data. In chemometrics, we obtain the absorbance level of each sample at hundreds or thousands of wavelengths (variables) in a near-infrared (NIR) calibration. In microarray data, we have expression levels of thousands of genes or proteins (variables) for each sample. When these variables enter a regression analysis, we have a response vector and hundreds or thousands of predictors in the model. The challenge in regression analysis is how to infer the pattern in the data when the number of samples is limited. The situation of many variables but few samples raises many problems; we address some of them in this research, including parameter estimation, variable selection and inference, and develop methodology to deal with them. We deal with the question of variable selection in NIR calibration and conclude that variable selection does not guarantee better prediction; a case-by-case investigation is necessary to determine whether all available variables are relevant for prediction. In microarray data, we infer that procedures that select variables into logistic regression based on multivariate information give a better model fit than those based on t-statistics. We deal with the problem of parameter estimation with such a large number of variables by considering models in which all of the available variables can be included. Selecting differentially expressed genes from logistic regression with random effects turns out not to be possible owing to the limited amount of information. To deal with this inference problem, we investigate a linear mixed model in which the random effects are assumed to follow a mixture of three normal distributions, corresponding to genes that are down-regulated, not differentially expressed, and up-regulated. Inference for each gene then amounts to deciding which mixture component it belongs to. In this context, estimation of fold-change and identification of differentially expressed genes can be done simultaneously. We conclude that the method performs reasonably well in identifying the genes, as validated by a spike-in study and simulation. When the model is applied to find coregulated genes, it identifies them, although its performance depends on the amount of information in the data.
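
    The sketch below illustrates the three-component mixture idea on simulated per-gene effect estimates, with a Gaussian mixture standing in for the down-regulated, non-differentially expressed and up-regulated groups. It is not the thesis's linear mixed model, only an assumed simplification of the classification step.

```python
# Illustrative three-component mixture classification of per-gene effects
# (a simplified stand-in for the random-effects mixture described above).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
effects = np.concatenate([
    rng.normal(-2.0, 0.5, size=50),    # down-regulated genes
    rng.normal(0.0, 0.3, size=900),    # not differentially expressed
    rng.normal(2.0, 0.5, size=50),     # up-regulated genes
]).reshape(-1, 1)

gmm = GaussianMixture(n_components=3, random_state=0).fit(effects)
labels = gmm.predict(effects)

# Order components by their means so they map to down / non / up.
order = np.argsort(gmm.means_.ravel())
counts = [int((labels == k).sum()) for k in order]
print("genes assigned to (down, non, up):", counts)
```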

    Detection of Oscillatory Modes in Power Systems using Empirical Wavelet Transform

    In electric power systems, detecting inter-area oscillations is crucial for system operators to maintain the security of the grid, especially in the case of unstable oscillatory behaviour. However, extracting information from unstable, noisy signals is difficult, and conventional signal processing tools suffer from insufficient adaptability. In this paper, we propose a method based on the Empirical Wavelet Transform (EWT) to estimate in real time the dominant inter-area modes in electricity grids. EWT extracts the inherent modulation information by decomposing the signal into its mono-components under an orthogonal basis. The instantaneous amplitude and instantaneous frequency are estimated by applying the Hilbert transform to the narrow-band components of the EWT decomposition. The performance of the proposed method is demonstrated using the Nordic test system.
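
    As a small illustration of the Hilbert step described above, the sketch below recovers the instantaneous amplitude and frequency of a single simulated narrow-band mode; the EWT decomposition itself is omitted and a mono-component signal is assumed.

```python
# Instantaneous amplitude/frequency of one narrow-band component via the
# Hilbert transform (the EWT decomposition is assumed to have been done).
import numpy as np
from scipy.signal import hilbert

fs = 50.0                                   # assumed sampling rate in Hz
t = np.arange(0, 20, 1 / fs)
# Simulated inter-area mode: a 0.5 Hz oscillation with slowly growing amplitude.
mode = np.exp(0.05 * t) * np.sin(2 * np.pi * 0.5 * t)

analytic = hilbert(mode)
inst_amplitude = np.abs(analytic)
inst_phase = np.unwrap(np.angle(analytic))
inst_frequency = np.diff(inst_phase) * fs / (2 * np.pi)    # in Hz

print(f"median instantaneous frequency: {np.median(inst_frequency):.2f} Hz")
```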