25 research outputs found

    Adjusting for Gene-Specific Covariates to Improve RNA-seq Analysis

    Summary: This paper suggests a novel positive false discovery rate (pFDR) controlling method for testing gene-specific hypotheses using a gene-specific covariate variable, such as gene length. We suppose the null probability depends on the covariate variable. In this context, we propose a rejection rule that accounts for heterogeneity among tests by employing two distinct types of null probabilities. We establish a pFDR estimator for a given rejection rule by following Storey's q-value framework. A condition on a type 1 error posterior probability is provided that equivalently characterizes our rejection rule. We also present a suitable procedure for selecting a tuning parameter through cross-validation that maximizes the expected number of hypotheses declared significant. A simulation study demonstrates that our method is comparable to or better than existing methods across realistic scenarios. In data analysis, we find support for our method's premise that the null probability varies with a gene-specific covariate variable.
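The abstract above builds on Storey's q-value framework. As a rough illustration of that baseline only (not the covariate-adjusted rejection rule the paper proposes), the sketch below computes plug-in q-values with a fixed tuning parameter. The function name `storey_qvalues`, the default `lam=0.5`, and the example p-values are assumptions made for illustration; the finite-sample pFDR correction factor is omitted for simplicity.

```python
def storey_qvalues(pvalues, lam=0.5):
    """Plug-in q-values in the style of Storey, with a fixed lam (assumed 0.5)."""
    m = len(pvalues)
    if m == 0:
        return []
    # Estimate the proportion of true nulls from p-values above lam, capped at 1.
    pi0 = min(1.0, sum(p > lam for p in pvalues) / (m * (1.0 - lam)))
    # Visit p-values from largest to smallest so a running minimum
    # makes the q-values monotone in the p-values.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    qvalues = [0.0] * m
    running_min = 1.0
    for position, i in enumerate(order):
        n_rejected = m - position  # hypotheses rejected at threshold pvalues[i]
        estimate = pi0 * m * pvalues[i] / n_rejected
        running_min = min(running_min, estimate)
        qvalues[i] = running_min
    return qvalues


# Hypothetical p-values, chosen only to exercise the function.
example_q = storey_qvalues([0.01, 0.04, 0.03, 0.9])
```

A covariate-aware method such as the one summarized above would, presumably, replace the single `pi0` with a null probability that depends on the gene-specific covariate; that step is not attempted here.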

    Advances in random forest tuning and improvements in false discovery rate controlling procedures via test-specific covariate adjustments

    Each chapter of this dissertation is devoted to one of three topics. The first two are novel false discovery rate (FDR) controlling methods in different situations, and the third deals with a new tuning parameter selection approach for the random forest method in regression problems. The primary research of this dissertation is to develop methods for controlling FDR while conducting multiple hypothesis tests with gene expression data. The first topic of this dissertation is a gene-specific covariate-based FDR-controlling method. We propose gene length as a potential gene-specific covariate. We develop a method based on covariate-specific conditional null probability for promising hypotheses with low p-values. We prove that the method controls positive FDR (pFDR) and provide an equivalent statement producing the method's rejection rule. Simulations demonstrate that our method controls pFDR and that it is better than existing methods in terms of true positive rate and summary statistics for the receiver operating characteristic (ROC) curve. Using data provided by Dr. Lim, we observe that our method rejects more null hypotheses at most target levels than existing methods. Another topic of this dissertation is developing an FDR-controlling method for circumstances where data are obtained from the pilot and main studies. We assume each study has unique properties such as sample size and error variance. Our method's rejection rule permits a higher p-value rejection threshold for the main study when the p-value for the pilot study is relatively low. This relationship enables us to evaluate fewer rejection rules than a competing method, resulting in more inference power. Our simulation study demonstrates that our approach for combining results from two studies is superior to existing methods and controls FDR at a predetermined level. The number of rejected null hypotheses in the data analysis was greater than that of competing methods.
The last topic of the dissertation is a unique tuning approach for random forest (RF) regression. We propose a case-specific tuning strategy for selecting the RF tuning parameter values of mtry and nodesize. We provide an example showing case-specific tuning parameters can be useful by demonstrating that the best choice for tuning parameter values varies across the predictor space. The tuning algorithm is then outlined mathematically. In a simulation study, our approach outperforms the conventional algorithms implemented in various R packages to minimize mean squared prediction error. Moreover, this method outperforms competing methods for the majority of the datasets we examined.

    Adjusting for gene-specific covariates to improve RNA-seq analysis

    Summary: This article suggests a novel positive false discovery rate (pFDR) controlling method for testing gene-specific hypotheses using a gene-specific covariate variable, such as gene length. We suppose the null probability depends on the covariate variable. In this context, we propose a rejection rule that accounts for heterogeneity among tests by using two distinct types of null probabilities. We establish a pFDR estimator for a given rejection rule by following Storey’s q-value framework. A condition on a type 1 error posterior probability is provided that equivalently characterizes our rejection rule. We also present a suitable procedure for selecting a tuning parameter through cross-validation that maximizes the expected number of hypotheses declared significant. A simulation study demonstrates that our method is comparable to or better than existing methods across realistic scenarios. In data analysis, we find support for our method’s premise that the null probability varies with a gene-specific covariate variable. This article is published as Hyeongseon Jeon, Kyu-Sang Lim, Yet Nguyen, Dan Nettleton, Adjusting for gene-specific covariates to improve RNA-seq analysis, Bioinformatics, Volume 39, Issue 8, August 2023, btad498, https://doi.org/10.1093/bioinformatics/btad498. © The Author(s) 2023. Posted with permission. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

    Statistical Power Analysis for Designing Bulk, Single-Cell, and Spatial Transcriptomics Experiments: Review, Tutorial, and Perspectives

    Gene expression profiling technologies have been used in various applications such as cancer biology. The development of gene expression profiling has expanded the scope of target discovery in transcriptomic studies, and each technology produces data with distinct characteristics. To support biologically meaningful findings from transcriptomic experiments, it is important to consider various experimental factors systematically through statistical power analysis. In this paper, we review and discuss power analysis for three types of gene expression profiling technologies from a practical standpoint: bulk RNA-seq, single-cell RNA-seq (scRNA-seq), and high-throughput spatial transcriptomics. Specifically, we describe the existing power analysis tools for each research objective in bulk RNA-seq and scRNA-seq experiments, along with recommendations. Since there are no power analysis tools for high-throughput spatial transcriptomics at this point, we instead investigate the factors that can influence power analysis.

    Predicting spatial distribution of stable isotopes in precipitation by classical geostatistical- and machine learning methods

    Stable isotopes of precipitation are important natural tracers in hydrology, ecology, and forensics. Spatially explicit predictions of oxygen and hydrogen isotopes in precipitation are obtained through different interpolation techniques. In the present study we examine the performance of various interpolation techniques when predicting the spatial distribution of precipitation stable isotopes. The efficiency of combined geostatistical tools (i.e. regression kriging; RK) and various machine learning methods (including regression enhanced random forest methods: MRRF, RERF) is compared in interpolating the spatial variability of precipitation stable oxygen isotope values from two different sampling networks in Europe. To assess the performance of the models, mean squared error (MSE), nonparametric Kling-Gupta efficiency (KGE), absolute differences, and relative mean absolute error metrics were employed. It was found that combining different regression techniques with random forest can produce estimations of comparable accuracy; ranked from lowest to highest overall average MSE: MRRF: 2.61, RK: 2.77, RERF: 2.99, RF: 3.08. The best performing combined random forest model variant (MRRF) outperformed regression kriging in terms of a hybrid error metric (KGE) by 7.5%. Sequential random rarefying of the station networks showed that machine-learning methods are more capable of maintaining high prediction accuracy even with fewer input data. This can be a great advantage when a suitable method is needed to predict the stable isotope composition of precipitation for large spatial domains where the spatial density of data stations shows large differences.
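Two of the error metrics named in the abstract above can be sketched in a few lines. The block below computes MSE and the standard (parametric) Kling-Gupta efficiency; the study itself reports a nonparametric KGE variant, which is not reproduced here. The function names and the isotope-like sample values are hypothetical and serve only as an illustration.

```python
import math


def mse(observed, predicted):
    """Mean squared error between paired observations and predictions."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def kge(observed, predicted):
    """Standard Kling-Gupta efficiency (parametric form, assumed here).

    Combines linear correlation r, variability ratio alpha, and bias ratio
    beta; 1.0 is a perfect score. Undefined when either series is constant
    or when the observed mean is zero.
    """
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    sd_o = math.sqrt(sum((o - mean_o) ** 2 for o in observed) / n)
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted) / n)
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted)) / n
    r = cov / (sd_o * sd_p)
    alpha = sd_p / sd_o
    beta = mean_p / mean_o
    return 1.0 - math.sqrt((r - 1.0) ** 2 + (alpha - 1.0) ** 2 + (beta - 1.0) ** 2)


# Made-up oxygen isotope values (per mil), for illustration only.
obs = [-8.0, -10.0, -12.0]
pred = [-8.5, -10.0, -11.0]
example_mse = mse(obs, pred)
example_kge = kge(obs, pred)
```

Lower MSE is better, while KGE is better the closer it is to 1, so the two metrics rank models in opposite directions.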