
    Statistical Approaches for Functional Annotation Tree Guided Prioritization of Genome-wide Association Studies (GWAS) Results

    Genome-wide association studies (GWAS) have successfully identified over two hundred thousand trait risk-associated genetic variants; however, several challenges remain. First, a complex trait is associated with many single nucleotide polymorphisms (SNPs), each with a small or moderate effect size that is hard to detect with a limited sample size, a phenomenon called polygenicity. Additionally, currently available statistical methods are limited in explaining the functional mechanisms through which genetic variants are associated with complex traits. In the first dissertation aim, we address these challenges by proposing a statistical approach called GPA-Tree. GPA-Tree integrates GWAS summary statistics and functional annotation information for a single trait within a unified framework. Specifically, by combining a decision tree algorithm with a hierarchical modeling framework, GPA-Tree simultaneously implements association mapping and identifies key combinations of functional annotations related to the trait risk-associated SNPs. We evaluate the proposed GPA-Tree approach using simulation studies and demonstrate that, in most scenarios, GPA-Tree shows greater area under the curve (AUC) and power relative to existing statistical approaches in detecting risk-associated SNPs and greater accuracy in identifying the true combinations of functional annotations. We applied GPA-Tree to a systemic lupus erythematosus (SLE) GWAS and functional annotation data including GenoSkyline and GenoSkylinePlus. The results from GPA-Tree highlight the dysregulation of blood immune cells, including but not limited to primary B, memory helper T, regulatory T, neutrophils and CD8+ memory T cells. The second dissertation aim exploits the phenomenon called pleiotropy, the shared genetic basis among multiple traits, to improve statistical power to detect SNPs associated with one or more traits. We extend GPA-Tree to develop Multi-GPA-Tree so that GWAS summary statistics for multiple traits and functional annotation information can be integrated within a unified framework. Specifically, by combining a multivariate decision tree algorithm with a hierarchical modeling framework, Multi-GPA-Tree simultaneously implements association mapping and identifies key combinations of functional annotations related to the SNPs associated with one or more traits. We evaluate the proposed Multi-GPA-Tree approach using simulation studies and demonstrate that, in most scenarios, Multi-GPA-Tree outperforms existing statistical approaches in detecting SNPs associated with one or more traits and identifying the true combinations of functional annotations with high accuracy. We utilize Multi-GPA-Tree to integrate GWAS from two rheumatic diseases, SLE and rheumatoid arthritis (RA), and GWAS from two inflammatory bowel diseases, Crohn’s disease (CD) and ulcerative colitis (UC), with GenoSkyline and GenoSkylinePlus annotations. The results from Multi-GPA-Tree highlight the dysregulation of blood immune cells for both joint analyses, including dysregulation of primary B cells for SLE and RA, and dysregulation of primary T regulatory cells for UC and CD. In the third dissertation aim, we develop the R package GPATree and the R Shiny app ShinyGPATree. The R package and Shiny app facilitate users’ convenience and make the GPA-Tree and Multi-GPA-Tree approaches easily accessible. The package includes an example dataset and a vignette to facilitate seamless step-by-step implementation of the proposed methods. In addition, the Shiny app allows interactive and dynamic investigation of association mapping results and functional annotation trees.

    Statistical and machine learning methods evaluated for incorporating soil and weather into corn nitrogen recommendations

    Nitrogen (N) fertilizer recommendation tools could be improved for estimating corn (Zea mays L.) N needs by incorporating site-specific soil and weather information. However, an evaluation of analytical methods is needed to determine the success of incorporating this information. The objective of this research was to evaluate statistical and machine learning (ML) algorithms for utilizing soil and weather information to improve corn N recommendation tools. Eight algorithms [stepwise, ridge regression, least absolute shrinkage and selection operator (Lasso), elastic net regression, principal component regression (PCR), partial least squares regression (PLSR), decision tree, and random forest] were evaluated using a dataset containing measured soil and weather variables from a regional database. The performance was evaluated based on how well these algorithms predicted corn economically optimal N rates (EONR) from 49 sites in the U.S. Midwest. Multiple algorithm modeling scenarios were examined with and without adjustment for multicollinearity and inclusion of two-way interaction terms to identify the soil and weather variables that could improve three dissimilar N recommendation tools. Results showed that the out-of-sample root-mean-square error (RMSE) for the decision tree and some random forest modeling scenarios was better than for stepwise or ridge regression, but not significantly different from any other algorithm. The best ML algorithm for adjusting N recommendation tools was the random forest approach (r² increased between 0.72 and 0.84 and the RMSE decreased between 41 and 94 kg N ha⁻¹). However, the ML algorithm that best adjusted tools while using a minimal number of variables was the decision tree. This method was simple, needing only one or two variables (regardless of modeling scenario), and provided moderate improvement, as r² values increased between 0.15 and 0.51 and RMSE decreased between 16 and 66 kg N ha⁻¹. Using ML algorithms to adjust N recommendation tools with soil and weather information shows promising results for better N management in the U.S. Midwest.

    Directly modeling speech waveforms by neural networks for statistical parametric speech synthesis

    This paper proposes a novel approach for directly modeling speech at the waveform level using a neural network. This approach uses the neural network-based statistical parametric speech synthesis framework with a specially designed output layer. As acoustic feature extraction is integrated into acoustic model training, it can overcome the limitations of conventional approaches, such as two-step (feature extraction and acoustic modeling) optimization, use of spectra rather than waveforms as targets, use of overlapping and shifting frames as the unit, and fixed decision tree structure. Experimental results show that the proposed approach can directly maximize the likelihood defined in the waveform domain. Index Terms: statistical parametric speech synthesis; neural network; adaptive cepstral analysis.

    Modeling the Impacts of Climate Change on Phytogeographical Units: A Case Study of the Moesz Line

    Regional climate models (RCMs) provide reliable climatic predictions for the next 90 years with high horizontal and temporal resolution. In the 21st century, a northward latitudinal and upward altitudinal shift of the distribution of plant species and phytogeographical units is expected. It is discussed how the modeling of phytogeographical units can be reduced to modeling plant distributions. The predicted shift of the Moesz line is studied as a case study (with three different modeling approaches) using 36 parameters of the REMO regional climate dataset, ArcGIS geographic information software, and the periods 1961-1990 (reference period), 2011-2040, and 2041-2070. The disadvantages of this relatively simple climate envelope modeling (CEM) approach are then discussed and several ways of improving the model are suggested. Some statistical and artificial intelligence (AI) methods (logistic regression, cluster analysis and other clustering methods, decision tree, evolutionary algorithm, artificial neural network) could be used to develop the model further. Among them, artificial neural networks (ANN) seem to be the most suitable algorithm for this purpose, providing a black box method for distribution modeling.

    Decision Trees and Their Application for Classification and Regression Problems

    Tree methods are some of the best and most commonly used methods in the field of statistical learning. They are widely used in classification and regression modeling. This thesis introduces the concept and focuses on decision trees such as Classification and Regression Trees (CART), used for classification and regression predictive modeling problems. We also introduce some ensemble methods such as bagging, random forest and boosting. These methods were introduced to improve the performance and accuracy of the models constructed by classification and regression trees. This work also provides an in-depth understanding of how CART models are constructed, the algorithm behind the construction, the cost-complexity approach to pruning regression trees, and the classification error rate approach to pruning classification trees. We use two real-life examples: a classification problem, classifying the type of cancer based on tumor type, size and other parameters present in the dataset, and a regression problem, predicting the first-year GPA of a college student based on high school GPA, SAT scores and other parameters present in the dataset.
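The split search at the heart of CART construction, as summarized in the abstract above, can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the function names `gini` and `best_threshold` and the toy tumor-size data are hypothetical, and Gini impurity is assumed as the splitting criterion for a single numeric feature.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_threshold(xs, ys):
    """Return (threshold, weighted_gini) for the best binary split x <= t.

    Candidate thresholds are midpoints between consecutive distinct
    feature values, as in a typical CART-style split search.
    """
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best = (None, gini(ys))  # no split: impurity of the parent node
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical feature values cannot be separated
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (t, score)
    return best


# Hypothetical toy data: tumor sizes and their classes.
sizes = [1.0, 1.5, 2.0, 3.5, 4.0, 4.5]
kinds = ["benign", "benign", "benign", "malignant", "malignant", "malignant"]
threshold, impurity = best_threshold(sizes, kinds)
```

A full tree would apply this search recursively to each child node and over every feature; pruning then trades the tree's error against its size.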


    MULTIVALUED SUBSETS UNDER INFORMATION THEORY

    In the fields of finance, engineering and varied sciences, Data Mining/Machine Learning has held an eminent position in predictive analysis. Complex algorithms and adaptive decision models have contributed towards streamlining directed research as well as improving the accuracy of forecasting. Researchers in the fields of mathematics and computer science have made significant contributions towards the development of this field. Classification-based modeling, which holds a significant position amongst the different rule-based algorithms, is one of the most widely used decision-making tools. The decision tree has a place of profound significance in classification-based modeling. A number of heuristics have been developed over the years to prune the decision-making process. Some key benchmarks in the evolution of the decision tree can be attributed to researchers like Quinlan (ID3 and C4.5), Fayyad (GID3/3*, continuous value discretization), etc. The most common heuristic applied for these trees is the entropy discussed under information theory by Shannon. The current application of entropy, covered under the term 'Information Gain', is directed towards individual assessment of the attribute-value sets. The proposed study takes a look at the effects of combining the attribute-value sets, aimed at improving the information gain. A couple of key applications have been tested and presented with statistical conclusions. The first is the application towards the feature selection process, a key step in the data mining process, while the second is targeted towards the discretization of data. A search-based heuristic tool is applied towards identifying the subsets sharing a better gain value than the ones presented in the GID approach.
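The idea in the abstract above, that a combined set of attribute values can yield a better information gain than any value assessed individually, can be illustrated with a small sketch. This is a toy example under stated assumptions, not the study's search heuristic: the helper names `entropy` and `subset_gain` and the data are hypothetical, and each test is assumed to be a binary split of the form "value is in the subset" versus "value is not".

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def subset_gain(values, labels, subset):
    """Information gain of the binary test `value in subset`."""
    inside = [y for v, y in zip(values, labels) if v in subset]
    outside = [y for v, y in zip(values, labels) if v not in subset]
    n = len(labels)
    # Weighted entropy of the two branches; empty branches contribute nothing.
    remainder = sum(len(part) / n * entropy(part)
                    for part in (inside, outside) if part)
    return entropy(labels) - remainder


# Hypothetical toy data: one attribute with three values, two classes.
values = ["a", "a", "b", "b", "c", "c"]
labels = ["yes", "yes", "yes", "yes", "no", "no"]

gain_single = subset_gain(values, labels, {"a"})         # one value on its own
gain_combined = subset_gain(values, labels, {"a", "b"})  # two values merged
```

In this toy data, testing the value `a` alone leaves a mixed branch, while the merged subset `{a, b}` separates the classes completely, so the combined test has the higher gain.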

    Mediation analysis in partial least squares path modeling: Helping researchers discuss more sophisticated models

    Purpose – Indirect or mediated effects constitute a type of relationship between constructs that often occurs in partial least squares (PLS) path modeling. Over the past few years, the methods for testing mediation have become more sophisticated. However, many researchers continue to use outdated methods to test mediating effects in PLS, which can lead to erroneous results. One reason for the use of outdated methods, or even for not testing mediation at all, is that no systematic tutorials on PLS exist that draw on the newest statistical findings. The paper aims to discuss these issues. Design/methodology/approach – This study illustrates the state-of-the-art use of mediation analysis in the context of PLS-structural equation modeling (SEM). Findings – This study facilitates the adoption of modern procedures in PLS-SEM by challenging the conventional approach to mediation analysis and providing more accurate alternatives. In addition, the authors propose a decision tree and classification of mediation effects. Originality/value – The recommended approach offers a wide range of testing options (e.g. multiple mediators) that go beyond simple mediation analysis alternatives, helping researchers discuss their studies in a more accurate way.

    Drug Repurposing Targeting COVID-19 3CL Protease using Molecular Docking and Machine Learning Regression Approach

    The COVID-19 pandemic has created a global health crisis, driving the need for the rapid identification of potential therapeutics. In this study, we used the ZINC database to screen 5903 world-approved drugs, including FDA-approved drugs, for repurposing as potential COVID-19 treatments targeting the main protease 3CL of SARS-CoV-2. We performed molecular docking using AutoDock Vina to check the efficacy of the drug molecules. To enhance the efficiency of the drug repurposing approach, we modeled the binding affinities using several machine learning regression approaches for QSAR modeling, such as decision tree, extra trees, MLP, KNN, XGBoost, and gradient boosting. The computational results demonstrated that the Decision Tree Regression (DTR) model has improved statistical measures of R² and RMSE. These simulated results helped to identify drugs with high binding affinity and favorable binding energies. From the statistical analysis, we shortlisted 13 promising drugs with their respective ZINC IDs (ZINC000003873365, ZINC000085432544, ZINC000203757351, ZINC000085536956, ZINC000085536990, ZINC000008214470, ZINC000261494640, ZINC000169344691, ZINC000094303244, ZINC000095618608, ZINC000095618689, ZINC000095618743, and ZINC000253684767) within the range of -15.1 kcal/mol to -12.7 kcal/mol. Further, we analyzed the physicochemical properties of these selected drugs with respect to their best binding interaction with the specific target protease. Our study has provided an efficient framework for drug repurposing against COVID-19. This highlights the potential of combining molecular docking with machine learning regression approaches to accelerate the identification of potential therapeutic candidates. Comment: 30 Page