13 research outputs found

    Evaluation of statistical methods, modeling, and multiple testing in RNA-seq studies

    Recent next-generation sequencing methods provide a count of RNA molecules in the form of short reads, yielding discrete, often highly non-normally distributed gene expression measurements. Due to this feature of RNA sequencing (RNA-seq) data, appropriate statistical inference methods are required. Although Negative Binomial (NB) regression has been generally accepted in the analysis of RNA-seq data, its appropriateness in the application to genetic studies has not been exhaustively evaluated. Additionally, adjusting for covariates that have an unknown relationship with expression of a gene has not been extensively evaluated in RNA-seq studies using the NB framework. Finally, the dependence structures in RNA-seq data may violate the assumptions of some multiple testing correction methods. In this dissertation, we suggest an alternative regression method, evaluate the effect of covariates, and compare various multiple testing correction methods. We conduct simulation studies and apply these methods to a real data set. First, we suggest Firth’s logistic regression for detecting differentially expressed genes in RNA-seq data. We also recommend the data adaptive method that estimates a recalibrated distribution of test statistics. Firth’s logistic regression exhibits an appropriately controlled Type-I error rate using the data adaptive method and shows comparable power to NB regression in simulation studies. Next, we evaluate the effect of disease-associated covariates where the relationship between the covariate and gene expression is unknown. Although the power of NB and Firth’s logistic regression decreases as disease-associated covariates are added to a model, Type-I error rates are well controlled in Firth’s logistic regression if the relationship between a covariate and disease is not strong. Finally, we compare multiple testing correction methods that control family-wise error rates and false discovery rates. The evaluation reveals that an understanding of study designs, RNA-seq data, and the consequences of applying specific regression and multiple testing correction methods is very important for controlling family-wise error rates or false discovery rates. We believe our statistical investigations will enrich gene expression studies and influence related statistical methods.
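
As a rough illustration of the two error-rate targets named in the abstract above, the sketch below contrasts one correction that controls the family-wise error rate (Bonferroni) with one that controls the false discovery rate (Benjamini-Hochberg). These two procedures are chosen here only as familiar representatives; the abstract does not say which correction methods the dissertation compares, and the function names are hypothetical.

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted p-values (family-wise error rate control)."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (false discovery rate control,
    valid under independence or positive dependence of the tests)."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Illustrative p-values only, not taken from the dissertation.
example = [0.01, 0.04, 0.03, 0.005]
fwer_adjusted = bonferroni(example)
fdr_adjusted = benjamini_hochberg(example)
```

The Benjamini-Hochberg guarantee rests on a dependence assumption, which is the kind of condition the abstract warns that RNA-seq data may violate.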

    MACHINE LEARNING AND DEEP LEARNING APPROACHES FOR GENE REGULATORY NETWORK INFERENCE IN PLANT SPECIES

    The construction of gene regulatory networks (GRNs) is vital for understanding the regulation of metabolic pathways, biological processes, and complex traits during plant growth and responses to environmental cues and stresses. The increasing availability of public databases has facilitated the development of numerous methods for inferring gene regulatory relationships between transcription factors and their targets. However, there is limited research on supervised learning techniques that utilize the regulatory relationships of plant species available in public databases. This study investigates the potential of machine learning (ML), deep learning (DL), and hybrid approaches for constructing GRNs in plant species, specifically Arabidopsis thaliana, poplar, and maize. Challenges arise due to limited training data for gene regulatory pairs, especially in less-studied species such as poplar and maize. Nonetheless, our results demonstrate that hybrid models integrating ML and artificial neural network (ANN) techniques significantly outperformed traditional methods in predicting gene regulatory relationships. The best-performing hybrid models achieved over 95% accuracy on holdout test datasets, surpassing traditional ML and ANN models, and also showed good accuracy in lignin biosynthesis pathway analysis. Employing transfer learning techniques, this study also successfully transferred known gene regulation knowledge from one species to another, substantially improving performance and demonstrating the viability of cross-species learning using deep learning-based approaches. This study contributes to the growing body of knowledge on GRN prediction and construction for plant species, highlighting the value of adopting hybrid models and transfer learning techniques. This study and its results will help pave the way for future research on how to learn from the known to the unknown and will be conducive to the advancement of modern genomics and bioinformatics.

    Childhood Blood Lead Levels and Adolescent Crime Rates in the United States

    Juvenile violent crime rates in the United States have been on a continuous decline since 1996. Despite this decrease, youth violence and racial differences in crime rates continue to be public health issues in the United States. Researchers have linked externalizing behavior in children to factors including genetics, parental upbringing, abuse, school environment, and media exposure but have not fully considered the relationship between early childhood lead contamination and youth violence. This was an ecologic study of the relationship between early childhood blood lead levels (BLLs; ≥ 10 µg/dL before 2012 or ≥ 5 µg/dL after 2012) and crime arrest rates in the United States. A secondary analysis was conducted of existing data on youth violence and BLLs obtained from the Office of Juvenile Justice and Delinquency Prevention and the Centers for Disease Control and Prevention, respectively. Results of multiple linear regression analysis showed a significant positive correlation between the percentage of confirmed childhood BLLs ≥ 10 µg/dL in states from 1999 to 2001 and robbery, weapon, and drug abuse arrest rates in 2016. Further analysis indicated that the total crime rate per 100,000 population in states was significantly correlated with the 2012-2016 mean percentage of confirmed childhood BLLs ≥ 5 µg/dL in states (B = 35.17, p = 0.03). Results may help public health professionals, medical care providers, and policy makers to make informed decisions and better target interventions to further alleviate the effects of childhood lead poisoning at home and abroad. Improvements in children’s health may benefit individuals, families, organizations, and society through the promotion of public health and the reduction of adverse impacts associated with lead contamination in childhood.

    Additional file 16: Table S7. of Evaluation of logistic regression models and effect of covariates for case–control study in RNA-Seq analysis

    Bias with covariate models from the balanced design of N_{D=1} = 10 and μ_{D=0} = 1000. Disp: Dispersion, CovOR: Odds ratios between covariates and case–control status, Ncov: The number of covariates in a model, NB_TD: Negative binomial regression with the dispersion used for the sampling, FL: Firth’s logistic regression. (DOCX 47 kb)

    Additional file 4: Table S2. of Evaluation of logistic regression models and effect of covariates for case–control study in RNA-Seq analysis

    Type-I error rates of regression methods from the unbalanced design with μ_{D=0} = 1000. Alpha: Significance levels, N_{D=1}: The number of cases, N_{D=0}: The number of controls, Disp: Dispersion, NB: Negative binomial regression with true dispersion, CL: Classical logistic regression, BL: Bayes logistic regression, FL: Firth’s logistic regression. (DOCX 65 kb)

    Additional file 11: Figure S7. of Evaluation of logistic regression models and effect of covariates for case–control study in RNA-Seq analysis

    Q-Q plots of the HD analyses. Figure S7 exhibits the Q-Q plots from the HD analysis adjusting for age at death and RIN from DESeq2 (A), and Classical (B), Bayes (C), and Firth’s (D) logistic regressions. Each regression method contains three different ways of calculating p-values (Original, DA, and Perm). “Original” p-values (blue dots) are estimated from the asymptotic distribution. “DA” p-values (black dots) are evaluated from the data adaptive asymptotic distribution using 1,000 permutations. “Perm” p-values (yellow dots) are calculated using 10,000 permutations. (PNG 157 kb)
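
The idea behind the “Perm” p-values in the caption above can be sketched as follows. This is a minimal illustration only: it uses an absolute difference in group means as the test statistic, whereas the paper permutes regression-based statistics, and the function name, the `(hits + 1) / (n_perm + 1)` estimator, and the example values are all assumptions made for this sketch.

```python
import random


def permutation_pvalue(values, labels, n_perm=10_000, seed=0):
    """Two-sided permutation p-value for a case/control difference in means.

    values: expression values for one gene, one per sample.
    labels: 1 for cases, 0 for controls, in the same order as values.
    """
    rng = random.Random(seed)

    def statistic(lbls):
        cases = [v for v, l in zip(values, lbls) if l == 1]
        controls = [v for v, l in zip(values, lbls) if l == 0]
        return abs(sum(cases) / len(cases) - sum(controls) / len(controls))

    observed = statistic(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        # Shuffling labels keeps group sizes fixed and breaks any real
        # association between case status and expression.
        rng.shuffle(shuffled)
        if statistic(shuffled) >= observed:
            hits += 1
    # Adding one to numerator and denominator avoids a p-value of exactly 0.
    return (hits + 1) / (n_perm + 1)


# Made-up values for illustration: five cases followed by five controls.
p = permutation_pvalue(
    [10, 11, 12, 13, 14, 1, 2, 3, 4, 5],
    [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    n_perm=2000,
)
```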

    Additional file 14: Table S5. of Evaluation of logistic regression models and effect of covariates for case–control study in RNA-Seq analysis

    All significant genes in FL regressions using the DA method. Mean.Exp.Case: Normalized mean expression value in cases, Mean.Exp.Cont: Normalized mean expression value in controls, Disp: Dispersion, NB, CL, BL, FL: P-values from negative binomial regression, classical logistic regression, Bayes logistic regression, Firth’s logistic regression. (XLS 1013 kb)

    Additional file 10: Figure S6. of Evaluation of logistic regression models and effect of covariates for case–control study in RNA-Seq analysis

    Bias from regression methods using the permuted HD data with μ_g > 3. Figure S6 contains bias from Negative Binomial regression using DESeq2, Classical Logistic regression (CL), Bayes Logistic regression (BL), and Firth’s Logistic regression (FL). Each black empty dot represents the bias of a gene. The black dotted horizontal line marks the no-bias point. The bias of each gene is calculated using effect sizes of 10,000 permutations. (PNG 53 kb)