948 research outputs found

    Application of a spatially-weighted Relief algorithm for ranking genetic predictors of disease

    Background: Identification of genetic variants that are associated with disease is an important goal in elucidating the genetic causes of diseases. The genetic patterns that are associated with common diseases are complex and may involve multiple interacting genetic variants. The Relief family of algorithms is a powerful tool for efficiently identifying genetic variants that are associated with disease, even if the variants have nonlinear interactions without significant main effects. Many variations of Relief have been developed over the past two decades and several of them have been applied to single nucleotide polymorphism (SNP) data. Results: We developed a new spatially weighted variation of Relief called Sigmoid Weighted ReliefF Star (SWRF*), and applied it to synthetic SNP data. When compared to ReliefF and SURF*, which are two algorithms that have been applied to SNP data for identifying interactions, SWRF* had significantly greater power. Furthermore, we developed a framework called the Modular Relief Framework (MoRF) that can be used to develop novel variations of the Relief algorithm, and we used MoRF to develop the SWRF* algorithm. Conclusions: MoRF allows easy development of new Relief algorithms by specifying different interchangeable functions for the component terms. Using MoRF, we developed a new Relief algorithm called SWRF* that had greater ability to identify interacting genetic variants in synthetic data compared to existing Relief algorithms. © 2012 Stokes and Visweswaran; licensee BioMed Central Ltd.
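    The idea behind the Relief family can be illustrated with a minimal sketch of the classic nearest-hit / nearest-miss update. This is plain Relief, not the SWRF* algorithm or the MoRF framework from the paper; the function name `relief_weights`, the Hamming distance, and the toy XOR data are assumptions chosen for illustration.

```python
from itertools import product


def relief_weights(X, y):
    """Classic Relief feature weights for discrete features and binary labels.

    Illustrative sketch only: every instance is visited once, and a single
    nearest hit and nearest miss are used (no spatial weighting).
    """
    n, m = len(X), len(X[0])
    weights = [0.0] * m

    def distance(a, b):
        # Hamming distance: number of features on which two instances differ.
        return sum(ai != bi for ai, bi in zip(a, b))

    for i in range(n):
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        if not hits or not misses:
            continue
        hit = min(hits, key=lambda j: distance(X[i], X[j]))
        miss = min(misses, key=lambda j: distance(X[i], X[j]))
        for f in range(m):
            # Reward features that differ from the nearest miss,
            # penalise features that differ from the nearest hit.
            diff_miss = X[i][f] != X[miss][f]
            diff_hit = X[i][f] != X[hit][f]
            weights[f] += (diff_miss - diff_hit) / n
    return weights


# Toy data (hypothetical): label = feature0 XOR feature1, feature2 is noise.
# Neither feature0 nor feature1 has a main effect on its own.
X = [list(row) for row in product([0, 1], repeat=3)]
y = [row[0] ^ row[1] for row in X]
weights = relief_weights(X, y)
```

    On this toy data the two interacting features receive higher weights than the noise feature, even though neither is individually correlated with the label.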

    The application of network label propagation to rank biomarkers in genome-wide Alzheimer's data

    Background: Ranking and identifying biomarkers that are associated with disease from genome-wide measurements holds significant promise for understanding the genetic basis of common diseases. The large number of single nucleotide polymorphisms (SNPs) in genome-wide association studies (GWAS), however, makes this task computationally challenging when the ranking is to be done in a multivariate fashion. This paper evaluates the performance of a multivariate graph-based method called label propagation (LP) that efficiently ranks SNPs in genome-wide data. Results: The performance of LP was evaluated on a synthetic dataset and two late onset Alzheimer's disease (LOAD) genome-wide datasets, and the performance was compared to that of three control methods. The control methods included chi squared, which is a commonly used univariate method, as well as a Relief method called SWRF and a sparse logistic regression (SLR) method, which are both multivariate ranking methods. Performance was measured by evaluating the top-ranked SNPs in terms of classification performance, reproducibility between the two datasets, and prior evidence of being associated with LOAD. On the synthetic data LP performed comparably to the control methods. On GWAS data, LP performed significantly better than chi squared and SWRF in classification performance in the range from 10 to 1000 top-ranked SNPs for both datasets, and not significantly different from SLR. LP also had greater ranking reproducibility than chi squared, SWRF, and SLR. Among the 25 top-ranked SNPs that were identified by LP, there were 14 SNPs in one dataset that had evidence in the literature of being associated with LOAD, and 10 SNPs in the other, which was higher than for the other methods. Conclusion: LP performed considerably better in ranking SNPs in two high-dimensional genome-wide datasets when compared to three control methods. It had better performance in the evaluation measures we used, and is computationally efficient enough to be applied in practice to data from genome-wide studies. These results provide support for including LP in the methods that are used to rank SNPs in genome-wide datasets. © 2014 Stokes et al.; licensee BioMed Central Ltd.

    Ensemble learning for detecting gene-gene interactions in colorectal cancer

    The fundamental task of human genetics is to detect genetic variations that primarily contribute to a disease phenotype. The most popular method for understanding the etiology of heritable human diseases (e.g., cancer) is to use genome-wide association studies (GWAS). Colorectal cancer (CRC) is a common cause of death in developed countries; specifically, it has a high incidence rate in the province of Newfoundland and Labrador. Therefore, finding the genetic factors associated with CRC can help us better understand the disease in order to treat and prevent it more effectively. This study seeks to identify genetic variations associated with CRC using machine learning, including feature selection and ensemble learning algorithms. In this study, we analyze a GWAS dataset on CRC collected from the Newfoundland population. First, we perform quality control steps on the raw genetic data and prepare it for the machine learning methods. Second, we investigate six feature selection methods through a comparative study by applying them to a simulated dataset and the CRC GWAS data. The best feature selection method, in terms of detecting gene-gene interactions, is then used to choose a subset of more relevant features for the next analysis step. Subsequently, two ensemble algorithms, Random Forests and Gradient Boosting Machine, are applied to the reduced data to identify significant interacting genetic markers associated with CRC. Last, the findings from the machine learning methods are biologically validated using online databases and enrichment analysis tools. From the results of the ensemble algorithms, 44 significant genetic markers are detected, 29 of which have corresponding genes. Among them, the genes DCC, ALK and ITGA1 have previously been found to be associated with CRC. In addition, the genes E2F3 and NID2 are potentially associated with CRC because of their known associations with other types of cancer. Moreover, the biological interpretations of these genes reveal biological pathways that may help predict the risk of the disease and better understand its etiology.

    Novel Extensions of Label Propagation for Biomarker Discovery in Genomic Data

    One primary goal of analyzing genomic data is the identification of biomarkers which may be causative of, correlated with, or otherwise biologically relevant to disease phenotypes. In this work, I implement and extend a multivariate feature ranking algorithm called label propagation (LP) for biomarker discovery in genome-wide single-nucleotide polymorphism (SNP) data. This graph-based algorithm utilizes an iterative propagation method to efficiently compute the strength of association between a SNP and a phenotype. I developed three extensions to the LP algorithm, with the goal of tailoring it to genomic data. The first extension is a modification to the LP score which yields a variable-level score for each SNP, rather than a score for each SNP genotype. The second extension incorporates prior biological knowledge that is encoded as a prior value for each SNP. The third extension enables the combination of rankings produced by LP and another feature ranking algorithm. The LP algorithm, its extensions, and two control algorithms (chi squared and sparse logistic regression) were applied to 11 genomic datasets, including a synthetic dataset, a semi-synthetic dataset, and nine genome-wide association study (GWAS) datasets covering eight diseases. The quality of each feature ranking algorithm was evaluated by using a subset of top-ranked SNPs to construct a classifier, whose predictive power was evaluated in terms of the area under the Receiver Operating Characteristic curve. Top-ranked SNPs were also evaluated for prior evidence of being associated with disease using evidence from the literature. The LP algorithm was found to be effective at identifying predictive and biologically meaningful SNPs. The single-score extension performed significantly better than the original algorithm on the GWAS datasets. The prior knowledge extension did not improve on the feature ranking results, and in some cases it reduced the predictive power of top-ranked variants. The ranking combination method was effective for some pairs of algorithms, but not for others. Overall, this work’s main results are the formulation and evaluation of several algorithmic extensions of LP for use in the analysis of genomic data, as well as the identification of several disease-associated SNPs.
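    The evaluation metric mentioned above, the area under the ROC curve, can be sketched in a few lines. This is a generic pairwise (Mann-Whitney) formulation assuming binary 0/1 labels; the function name `roc_auc` and the example values are illustrative and are not taken from the thesis.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve for binary labels (1 = case, 0 = control).

    Computed as the fraction of case/control pairs in which the case
    receives the higher score, with ties counted as half a win.
    """
    cases = [s for label, s in zip(labels, scores) if label == 1]
    controls = [s for label, s in zip(labels, scores) if label == 0]
    if not cases or not controls:
        raise ValueError("AUC needs at least one case and one control")
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


# Hypothetical classifier scores for four individuals.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

    An AUC of 0.5 corresponds to random scoring and 1.0 to perfect separation of cases from controls, which is why it is a convenient single number for comparing classifiers built from different sets of top-ranked SNPs.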

    Benchmarking environmental machine-learning models: methodological progress and an application to forest health

    Geospatial machine learning is a versatile approach to analyzing environmental data and can help to better understand the interactions and current state of our environment. Because these algorithms can learn patterns from data, they may discover complex relationships that other analysis methods miss. Modeling the interaction of creatures with their environment is referred to as ecological modeling, which is a subcategory of environmental modeling. A subfield of ecological modeling is species distribution modeling (SDM), which aims to understand the relation between the presence or absence of certain species and their environments. SDM is different from classical mapping/detection analysis. While the latter primarily aims for a visual representation of a species' spatial distribution, the former focuses on using the available data to build models and interpret them. Because no single best option exists to build such models, different settings need to be evaluated and compared against each other. When conducting such modeling comparisons, which are commonly referred to as benchmarking, care needs to be taken throughout the analysis steps to achieve meaningful and unbiased results. These steps consist of data preprocessing, model optimization and performance assessment. While these general principles apply to any modeling analysis, their application in an environmental context often requires additional care with respect to data handling, possibly hidden underlying data effects and model selection. To conduct all of this in a programmatic (and efficient) way, toolboxes in the form of programming modules or packages are needed. This work makes methodological contributions which focus on efficient, machine-learning based analysis of environmental data. In addition, research software to generalize and simplify the described process has been created throughout this work.

    Discovering genetic drivers in acute graft-versus-host disease after allogeneic hematopoietic stem cell transplantation

    University of Minnesota Ph.D. dissertation. May 2019. Major: Biomedical Informatics and Computational Biology. Advisors: Caleb Kennedy, Claudia Neuhauser. 1 computer file (PDF); x, 128 pages + 2 supplementary tables. Acute graft-versus-host disease (GVHD) is one of the major complications after allogeneic hematopoietic stem cell transplantation (allo-HCT) that cause non-relapse morbidity and mortality. Although the increasing matching rate of the human leukocyte antigen (HLA) genes between donor and recipient (DR) has significantly reduced the risk of GVHD, clinically significant GVHD remains a transplantation challenge, even in HLA-identical transplants. Candidate gene studies and genome-wide association studies have revealed susceptible individual genes and gene pairs from DR pairs that are associated with acute GVHD; however, the roles of genetic disparities between donor and recipient remain to be understood. To identify genetic factors linked to acute GVHD, we investigated the classical HLA and non-HLA genes and conducted a genome-wide clinical outcome association study. Assessment of 4,646 antigen recognition domain (ARD)-matched unrelated donor allo-HCT cases showed that the frequency of mismatches outside the ARD in HLA genes is very low when the DR pairs are matched at the ARD. Due to the low frequency of amino acid mismatches in the non-ARD region and their reportedly weak alloimmune reactions, we suggest that the non-ARD sequence mismatches within the ARD-matched DR pairs have limited influence on the development of acute GVHD, and may not be a primary factor. The genome-wide clinical outcome association study between DR pairs observed multiple autosomal minor histocompatibility antigens (MiHAs) restricted by HLA typing, though their association with acute GVHD outcome was not statistically significant. This result suggests that HLA mismatching outweighs other genetic mismatches as contributors to acute GVHD risk. In the cases of female donors to male recipients, we identified a significant association of the Y chromosome-specific peptides encoded by PCDH11Y, USP9Y, UTY, and NLGN4Y with the acute GVHD outcome. Additionally, we developed a machine learning-based genetic variant selection algorithm for ultra-high dimensional transplant genomic studies. The algorithm successfully selected a set of genes from over 1 M genetic variants, all of which have evidence linking them to transplant-related complications. This work offers evidence and guidance for further research in acute GVHD and allo-HCT and provides useful bioinformatics and data mining tools for transplant genomic studies.