16 research outputs found

    Quick Calculation for Sample Size while Controlling False Discovery Rate with Application to Microarray Analysis

    Sample size estimation is important in microarray or proteomic experiments, since biologists can typically afford only a few replicates. In the multiple testing problems these experiments involve, it is more powerful and more reasonable to control the false discovery rate (FDR) or positive FDR (pFDR) than a type I error rate such as the family-wise error rate (FWER) (Storey and Tibshirani, 2003). However, the traditional approach to estimating sample size is not applicable when controlling FDR, which has left most practitioners to rely on ad hoc guesswork. We propose a procedure to calculate sample size while controlling the false discovery rate. The two major definitions of the false discovery rate (FDR in Benjamini and Hochberg, 1995, and pFDR in Storey, 2002) differ slightly, and our procedure applies to both. The proposed method is straightforward to apply and requires minimal computation, as illustrated with two-sample t-tests and F-tests. We also demonstrate by simulation that, with the calculated sample size, the desired level of power is achievable by the q-value procedure (Storey, Taylor and Siegmund, 2004) whether gene expression levels are independent or dependent.
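
    To make the flavor of such a quick calculation concrete, here is a minimal Python sketch (not the authors' exact procedure): it solves for the per-test significance level that yields a target FDR under the standard approximation FDR ≈ m0·α/(m0·α + m1·power), then plugs that α into the usual normal-approximation sample size formula for a two-sided two-sample t-test. The inputs m, m1, delta, and sigma are illustrative assumptions.

```python
import math

from scipy.stats import norm

def per_test_alpha(fdr, power, m, m1):
    """Per-test significance level alpha such that the expected FDR is
    roughly `fdr` when m1 of m genes are truly differentially expressed
    and each is detected with probability `power`."""
    m0 = m - m1
    return fdr * m1 * power / (m0 * (1.0 - fdr))

def n_per_group(delta, sigma, alpha, power):
    """Normal-approximation sample size per group for a two-sided
    two-sample t-test detecting mean difference `delta` with standard
    deviation `sigma`."""
    z_a = norm.ppf(1.0 - alpha / 2.0)
    z_b = norm.ppf(power)
    return math.ceil(2.0 * ((z_a + z_b) * sigma / delta) ** 2)

# Illustrative inputs: 10,000 genes, 100 truly changed, FDR 5%,
# average power 80%, a one-standard-deviation effect.
alpha = per_test_alpha(fdr=0.05, power=0.80, m=10_000, m1=100)
print(alpha)                                  # about 4.3e-4
print(n_per_group(delta=1.0, sigma=1.0, alpha=alpha, power=0.80))  # about 39
```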

    A simulation–approximation approach to sample size planning for high-dimensional classification studies

    Classification studies with high-dimensional measurements and relatively small sample sizes are increasingly common. Prospective analysis of the role of sample size in the performance of such studies is important for study design and interpretation of results, but the complexity of typical pattern discovery methods makes this problem challenging. The approach developed here combines Monte Carlo methods and new approximations for linear discriminant analysis, assuming multivariate normal distributions. Monte Carlo methods are used to sample the distribution of which features are selected for a classifier and the mean and variance of features given that they are selected. Given selected features, the linear discriminant problem involves different distributions of training data and generalization data, for which two approximations are compared: one based on a Taylor series approximation of the generalization error and the other on approximating the discriminant scores as normally distributed. Combining the Monte Carlo and approximation approaches to different aspects of the problem allows efficient estimation of expected generalization error without full simulations of the entire sampling and analysis process. To evaluate the method and investigate realistic study design questions, full simulations are used to ask how validation error rate depends on the strength and number of informative features, the number of noninformative features, the sample size, and the number of features allowed into the pattern. Both approximation methods perform well in most cases, but only the normal discriminant score approximation performs well when there are very many weakly informative or uninformative dimensions. The simulated cases show that many realistic study designs will typically estimate substantially suboptimal patterns and may have a low probability of statistically significant validation results.
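
    As a rough illustration of the combined strategy, the Python sketch below draws Monte Carlo training sets, selects the top-k features by a two-sample t-like statistic, forms a simple mean-difference linear rule on the selected features, and then evaluates generalization error in closed form using the normality of the discriminant scores under the assumed Gaussian model, so no test data need to be simulated. It assumes identity covariance and is far cruder than the approximations developed in the paper.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def expected_error(n_per_class, p, p_info, effect, k, n_mc=200):
    """Monte Carlo over training draws; the generalization error of each
    trained rule is computed in closed form under the true Gaussian
    model (identity covariance), so no test data are simulated."""
    mu0, mu1 = np.zeros(p), np.zeros(p)
    mu1[:p_info] = effect                     # informative features
    errs = []
    for _ in range(n_mc):
        x0 = rng.normal(mu0, 1.0, size=(n_per_class, p))
        x1 = rng.normal(mu1, 1.0, size=(n_per_class, p))
        # select the k features with the largest two-sample t-like statistic
        t = (x1.mean(0) - x0.mean(0)) / np.sqrt(
            (x1.var(0, ddof=1) + x0.var(0, ddof=1)) / n_per_class)
        sel = np.argsort(-np.abs(t))[:k]
        # linear rule from the mean difference (identity covariance assumed)
        w = x1[:, sel].mean(0) - x0[:, sel].mean(0)
        b = w @ (x1[:, sel].mean(0) + x0[:, sel].mean(0)) / 2.0
        s = np.sqrt(w @ w)                    # sd of the discriminant score
        # discriminant scores are exactly normal under the Gaussian model
        err = 0.5 * (norm.cdf((b - w @ mu1[sel]) / s)
                     + norm.cdf((w @ mu0[sel] - b) / s))
        errs.append(err)
    return float(np.mean(errs))

print(expected_error(n_per_class=20, p=1000, p_info=20, effect=0.8, k=10))
```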

    Gene expression profiling in whole blood of patients with coronary artery disease

    Owing to the dynamic nature of the transcriptome, gene expression profiling is a promising tool for discovery of disease-related genes and biological pathways. In the present study, we examined gene expression in whole blood of 12 patients with CAD (coronary artery disease) and 12 healthy control subjects. Furthermore, ten patients with CAD underwent whole-blood gene expression analysis before and after the completion of a cardiac rehabilitation programme following surgical coronary revascularization. mRNA and miRNA (microRNA) were isolated for expression profiling. Gene expression analysis identified 365 differentially expressed genes in patients with CAD compared with healthy controls (175 up- and 190 down-regulated in CAD), and 645 in CAD rehabilitation patients (196 up- and 449 down-regulated post-rehabilitation). Biological pathway analysis identified a number of canonical pathways, including oxidative phosphorylation and mitochondrial function, as being significantly and consistently modulated across the groups. Analysis of miRNA expression revealed a number of differentially expressed miRNAs, including hsa-miR-140-3p (control compared with CAD, P=0.017), hsa-miR-182 (control compared with CAD, P=0.093), hsa-miR-92a and hsa-miR-92b (post- compared with pre-exercise, P<0.01). Global analysis of predicted miRNA targets found significantly reduced expression of genes with target regions compared with those without: hsa-miR-140-3p (P=0.002), hsa-miR-182 (P=0.001), hsa-miR-92a and hsa-miR-92b (P=2.2×10⁻¹⁶). In conclusion, using whole blood as a 'surrogate tissue' in patients with CAD, we have identified differentially expressed miRNAs, differentially regulated genes and modulated pathways which warrant further investigation in the setting of cardiovascular function. This approach may represent a novel non-invasive strategy to unravel potentially modifiable pathways and possible therapeutic targets in cardiovascular disease.
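
    The abstract does not state which test underlies the global target-set comparison; as a generic illustration of that kind of analysis, the following Python sketch compares (simulated) log2 fold changes of predicted miRNA target genes against non-targets with a one-sided Mann-Whitney U test.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical inputs: per-gene log2 fold changes and a boolean mask of
# genes carrying a predicted target site for a given miRNA (simulated here).
rng = np.random.default_rng(1)
log2_fc = rng.normal(0.0, 1.0, size=5000)
is_target = rng.random(5000) < 0.1
log2_fc[is_target] -= 0.3          # simulate the expected repression

# One-sided test: are predicted targets expressed lower than non-targets?
stat, p = mannwhitneyu(log2_fc[is_target], log2_fc[~is_target],
                       alternative="less")
print(f"U={stat:.0f}, P={p:.3g}")
```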

    A mixture model approach to sample size estimation in two-sample comparative microarray experiments

    Background: Choosing an appropriate sample size is an important step in the design of a microarray experiment, and methods have recently been proposed that estimate sample sizes for control of the False Discovery Rate (FDR). Many of these methods require knowledge of the distribution of effect sizes among the differentially expressed genes; if this distribution can be determined, then accurate sample size requirements can be calculated. Results: We present a mixture model approach to estimating the distribution of effect sizes in data from two-sample comparative studies. Specifically, we present a novel closed-form algorithm for estimating the noncentrality parameters in the test statistic distributions of differentially expressed genes. We then show how our model can be used to estimate sample sizes that control the FDR together with other statistical measures, such as average power or the false nondiscovery rate. Method performance is evaluated through comparison with existing methods for sample size estimation and is found to be very good. Conclusion: A novel method for estimating the appropriate sample size for a two-sample comparative microarray study is presented. The method is shown to perform very well when compared with existing methods.
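
    The closed-form estimator itself is the paper's contribution and is not reproduced here; as a generic illustration of the mixture idea, the Python sketch below fits a two-component normal mixture to gene-level z-statistics by EM, with the null component fixed at N(0,1), so the fitted alternative component summarizes the distribution of noncentrality parameters (one-sided alternatives assumed for simplicity).

```python
import numpy as np
from scipy.stats import norm

def em_null_alt_mixture(z, n_iter=200):
    """EM for z_i ~ pi0*N(0,1) + (1-pi0)*N(mu, sigma^2). The null
    component is fixed at N(0,1); the fitted alternative component
    summarizes the noncentrality of differentially expressed genes."""
    pi0, mu, sigma = 0.9, 1.0, 1.0            # crude starting values
    for _ in range(n_iter):
        f0 = pi0 * norm.pdf(z, 0.0, 1.0)
        f1 = (1.0 - pi0) * norm.pdf(z, mu, sigma)
        w = f1 / (f0 + f1)                    # posterior P(alternative | z_i)
        pi0 = 1.0 - float(np.mean(w))
        mu = float(np.sum(w * z) / np.sum(w))
        sigma = float(np.sqrt(np.sum(w * (z - mu) ** 2) / np.sum(w)))
    return pi0, mu, sigma

# Simulated check: 90% nulls, alternatives centered at noncentrality 2.5.
rng = np.random.default_rng(2)
null = rng.random(10_000) < 0.9
z = np.where(null, rng.normal(0.0, 1.0, 10_000), rng.normal(2.5, 1.2, 10_000))
print(em_null_alt_mixture(z))                 # roughly (0.9, 2.5, 1.2)
```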

    Sample Size Calculation for Controlling False Discovery Proportion

    The false discovery proportion (FDP), the proportion of incorrect rejections among all rejections, is a direct measure of the abundance of false positive findings in multiple testing. Many methods have been proposed to control the FDP, but they are too conservative to be useful for power analysis. Study designs controlling the mean of the FDP, which is the false discovery rate, have been commonly used, but there has been little attempt to design studies with direct FDP control so as to achieve a specified level of efficiency. We provide a sample size calculation method that uses a variance formula for the FDP under weak-dependence assumptions to achieve the desired overall power. The relationship between design parameters and sample size is explored, and the adequacy of the procedure is assessed by simulation. We illustrate the method using estimated correlations from a prostate cancer dataset.
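
    The paper's analytic variance formula is not reproduced here; instead, the Python sketch below approximates the FDP distribution by brute-force simulation under independence, then searches for the smallest per-group sample size meeting both an average-power target and a bound on the 95th percentile of the FDP. All design parameters (m, m1, delta, q) are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

def bh_reject(p, q):
    """Benjamini-Hochberg step-up: boolean mask of rejected hypotheses."""
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, m + 1) / m
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    rej = np.zeros(m, dtype=bool)
    rej[order[:k]] = True
    return rej

def power_and_fdp95(n, m=2000, m1=100, delta=1.0, q=0.05, n_sim=200):
    """Average power and the 95th percentile of the FDP for a two-sample
    design with n subjects per group (independent z-tests)."""
    nc = delta * np.sqrt(n / 2.0)             # noncentrality of the z-test
    fdps, powers = [], []
    for _ in range(n_sim):
        z = rng.normal(0.0, 1.0, m)
        z[:m1] += nc                          # the m1 true alternatives
        rej = bh_reject(2 * norm.sf(np.abs(z)), q)
        fdps.append(rej[m1:].sum() / max(rej.sum(), 1))
        powers.append(rej[:m1].mean())
    return np.mean(powers), np.quantile(fdps, 0.95)

# Smallest n with average power >= 0.8 and 95th FDP percentile <= 0.10.
for n in range(5, 65, 5):
    pw, fdp95 = power_and_fdp95(n)
    if pw >= 0.8 and fdp95 <= 0.10:
        print("n per group:", n)
        break
```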

    Sample size and statistical power considerations in high-dimensionality data settings: a comparative study of classification algorithms

    Background: Data generated using 'omics' technologies are characterized by high dimensionality, where the number of features measured per subject vastly exceeds the number of subjects in the study. In this paper, we consider issues relevant in the design of biomedical studies in which the goal is the discovery of a subset of features and an associated algorithm that can predict a binary outcome, such as disease status. We compare the performance of four commonly used classifiers (K-Nearest Neighbors, Prediction Analysis for Microarrays, Random Forests and Support Vector Machines) in high-dimensionality data settings. We evaluate the effects of varying levels of signal-to-noise ratio in the dataset, imbalance in class distribution and choice of metric for quantifying classifier performance. To guide study design, we present a summary of the key characteristics of 'omics' data profiled in several human or animal model experiments utilizing high-content mass spectrometry and multiplexed immunoassay-based techniques. Results: The analysis of data from seven 'omics' studies revealed that the average magnitude of effect size observed in human studies was markedly lower than that in animal studies. The data measured in human studies were characterized by higher biological variation and the presence of outliers. The results from simulation studies indicated that the classifier Prediction Analysis for Microarrays (PAM) had the highest power when the class conditional feature distributions were Gaussian and outcome distributions were balanced. Random Forests was optimal when feature distributions were skewed and when class distributions were unbalanced. We provide a free open-source R statistical software library (MVpower) that implements the simulation strategy proposed in this paper. Conclusion: No single classifier had optimal performance under all settings. Simulation studies provide useful guidance for the design of biomedical studies involving high-dimensionality data.
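
    MVpower is an R library; as a rough Python analogue of the simulation comparison, the sketch below benchmarks the four classifier families on simulated imbalanced high-dimensional data with scikit-learn, using NearestCentroid with a shrinkage threshold as a stand-in for PAM's nearest shrunken centroids and balanced accuracy as the performance metric. All simulation parameters are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# High-dimensional setting: 40 subjects, 1,000 features, 10 informative,
# imbalanced classes (roughly 75/25).
X, y = make_classification(n_samples=40, n_features=1000, n_informative=10,
                           n_redundant=0, weights=[0.75, 0.25],
                           class_sep=1.0, random_state=0)

models = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(3)),
    "PAM-like": NearestCentroid(shrink_threshold=0.5),  # nearest shrunken centroids
    "Random Forest": RandomForestClassifier(n_estimators=500, random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="linear")),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="balanced_accuracy")
    print(f"{name}: balanced accuracy = {scores.mean():.2f} +/- {scores.std():.2f}")
```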

    Molecular Profile of Women With and Without Secondary Breast Cancer After the Treatment of Pediatric Hodgkin Lymphoma

    Identification of genetic risk factors associated with the development of secondary cancers would facilitate identification of at-risk patients and permit modification of therapy and heightened surveillance that may reduce cancer-related morbidity and mortality. Women survivors of pediatric Hodgkin lymphoma (HL) have an increased risk of morbidity and mortality associated with secondary effects of therapy, with a 35- to 75-fold excess risk of developing breast cancer over the general population. The mechanism for secondary breast cancer among Hodgkin survivors is not understood. Researchers have postulated that the familial characteristics of HL could be associated with mutations found within familial cancer syndromes; however, such mutations have not been identified. This has led to the exploration of inherent polymorphisms that might impair a patient's capability to detoxify chemotherapy and/or repair DNA damage produced by irradiation. Examinations of candidate polymorphisms indicate that single nucleotide changes may have only a small effect on the development of subsequent cancers. However, multiple studies support the idea that sensitivity to irradiation and the subsequent development of breast cancer are mediated through the interaction of multiple genes or gene complexes. The objective of this case-control study was to identify potential candidate genes and polymorphisms that may be risk factors for the development of secondary breast cancer among women who are pediatric HL survivors. Global gene expression and genotyping of women with (n=13) and without (n=36) secondary breast cancer after the treatment of pediatric HL were compared. Differences in global gene expression and genotype were found between cases and controls. Additionally, analysis of copy number variation in association with gene expression identified a locus of interest at 15q11.2 associated with the development of secondary breast cancer.