1,750 research outputs found

    Evaluation of random forests performance for genome-wide association studies in the presence of interaction effects

    Random forests (RF) is one of a broad class of machine learning methods that are able to deal with large-scale data without model specification, which makes it an attractive method for genome-wide association studies (GWAS). The performance of RF and other association methods in the presence of interactions was evaluated using the simulated data from Genetic Analysis Workshop 16 Problem 3, with knowledge of the major causative markers, risk factors, and their interactions in the simulated traits. There was good power to detect the environmental risk factors using RF, trend tests, or regression analyses but the power to detect the effects of the causal markers was poor for all methods. The causal marker that had an interactive effect with smoking did show moderate evidence of association in the RF and regression analyses, suggesting that RF may perform well at detecting such interactions in larger, more highly powered datasets

    Deciphering the genetic background of quantitative traits using machine learning and bioinformatics frameworks

    In this thesis, I developed two frameworks that can help highlight the genetic mechanisms underlying quantitative traits.
In this regard, my focus was to design efficient methodologies to discover genotype-phenotype associations and then use these identified associations to describe the regulatory mechanism that affects the manifestation of phenotypic differences among the individuals. In the first framework, I investigated key regulatory mechanisms governing the development of eggshell strength. The aim was to highlight the temporal changes in the signaling cascades governing the dynamic eggshell strength during the life of birds. I considered chicken eggshell strength at two different time points during the egg production cycle and studied the genotype-phenotype associations by employing the Random Forest algorithm on genotypic data. For the analysis of corresponding genes, a well established systems biology approach was adopted to delineate gene regulatory pathways and master regulators underlying this important trait. My results indicate that, while some of the master regulators (Slc22a1 and Sox11) and pathways are common at different laying stages of chicken, others (e.g., Scn11a, St8sia2, or the TGF-beta pathway) represent age-specific functions. Overall, my results provide: (i) significant insights into age-specific and common molecular mechanisms underlying the regulation of eggshell strength; and (ii) new breeding targets to improve the eggshell quality during the later stages of the chicken production cycle. In my second framework, I combined the Random Forests and a signal detection strategy to identify robust genotype-phenotype associations. The objective of this framework was to improve on the efficiency of single-SNP based association analysis. Genome wide association studies (GWAS) are a well established methodology to identify genomic variants and genes that are responsible for traits of interest in all branches of the life sciences. 
Despite the long time this methodology has had to mature, the reliable detection of genotype-phenotype associations is still a challenge for many quantitative traits, mainly because of the large number of genomic loci with weak individual effects on the trait under investigation. Thus, it can be hypothesized that many genomic variants that have a small but real effect remain unnoticed in many GWAS approaches. Here, we propose a two-step procedure to address this problem. In the first step, cubic splines are fitted to the test statistic values, and genomic regions with spline peaks that are higher than expected by chance are considered quantitative trait loci (QTL). Then the SNPs in these QTLs are prioritized with respect to the strength of their association with the phenotype using a Random Forests approach. As a case study, we apply our procedure to real data sets and find a plausible number of partly novel genomic variants and genes involved in various egg quality traits. 2021-10-1
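
The first step of the two-step procedure described in this abstract (smooth the per-SNP test statistics, then keep regions whose peaks exceed what chance would produce) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's implementation: a centered moving average stands in for the cubic-spline fit, the chance threshold comes from shuffling SNP positions, and all function names, parameters, and the toy data are hypothetical. The second step (ranking SNPs inside each region with Random Forests) is not shown.

```python
import random


def smooth(values, half_window=2):
    """Centered moving average; a simple stand-in for a cubic-spline fit."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - half_window)
        hi = min(len(values), i + half_window + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def null_peak_threshold(values, n_perm=200, quantile=0.95, seed=0):
    """Quantile of the highest smoothed value after shuffling SNP order."""
    rng = random.Random(seed)
    shuffled = list(values)
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        maxima.append(max(smooth(shuffled)))
    maxima.sort()
    return maxima[int(quantile * (n_perm - 1))]


def candidate_regions(values):
    """Index ranges (start, end) where the smoothed curve beats the threshold."""
    curve = smooth(values)
    threshold = null_peak_threshold(values)
    regions, start = [], None
    for i, v in enumerate(curve):
        if v > threshold and start is None:
            start = i
        elif v <= threshold and start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, len(curve) - 1))
    return regions


# Hypothetical toy data: 100 SNPs with background noise and one planted
# cluster of elevated test statistics at positions 40..49.
rng = random.Random(1)
stats = [rng.random() for _ in range(100)]
for i in range(40, 50):
    stats[i] = 5.0

regions = candidate_regions(stats)
print(regions)  # a single region around the planted cluster
```

Shuffling positions destroys the clustering of signals while keeping the distribution of the statistics, so only peaks that arise from neighbouring SNPs acting together survive the threshold.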

    A random forest approach to the detection of epistatic interactions in case-control studies

    Background: Despite the key roles of epistatic interactions between multiple genetic variants in the pathogenesis of complex diseases, the detection of such interactions remains a great challenge in genome-wide association studies. Although some existing multi-locus approaches have succeeded on small-scale case-control data, the curse of "combination explosion" prohibits their application to genome-wide analysis. It is therefore indispensable to develop new methods that can reduce the search space for epistatic interactions from an astronomical number of all possible combinations of genetic variants to a manageable set of candidates. Results: We studied case-control data from the viewpoint of binary classification. More precisely, we treated single nucleotide polymorphism (SNP) markers as categorical features and used a random forest to discriminate cases from controls. On the basis of the Gini importance given by the random forest, we designed a sliding window sequential forward feature selection (SWSFS) algorithm to select a small set of candidate SNPs that could minimize the classification error, and then statistically tested up to three-way interactions of the candidates. We compared this approach with three existing methods on three simulated disease models and showed that our approach is comparable to, and sometimes more powerful than, the other methods. We applied our approach to a genome-wide case-control dataset for age-related macular degeneration (AMD) and successfully identified two SNPs that were reported to be associated with this disease. Conclusion: Besides existing purely statistical approaches, we demonstrated the feasibility of incorporating machine learning methods into genome-wide case-control studies. The Gini importance offers yet another measure of the association between SNPs and complex diseases, thereby complementing existing statistical measures to facilitate the identification of epistatic interactions and the understanding of epistasis in the pathogenesis of complex diseases.
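
The Gini importance mentioned in this abstract is accumulated from the impurity decrease that each split achieves inside the trees. The sketch below computes that decrease for a single split on a categorical SNP feature; it is a minimal illustration, not the SWSFS algorithm, and the function names and toy genotypes are hypothetical. The toy data are built so that disease status depends only on the combination of two SNPs, which shows why candidate SNPs are afterwards tested jointly for interactions.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (1 = case, 0 = control)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(feature, labels):
    """Impurity decrease obtained by splitting on a categorical feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    weighted = sum(len(g) / n * gini(g) for g in groups.values())
    return gini(labels) - weighted


# Hypothetical toy data: two SNPs coded 0/1 (non-carrier / carrier) whose
# combination fully determines case status (an XOR-like epistatic model).
snp_a = [0, 0, 1, 1] * 10
snp_b = [0, 1, 0, 1] * 10
status = [a ^ b for a, b in zip(snp_a, snp_b)]

single_a = gini_decrease(snp_a, status)                 # marginal effect of SNP A
single_b = gini_decrease(snp_b, status)                 # marginal effect of SNP B
joint = gini_decrease(list(zip(snp_a, snp_b)), status)  # both SNPs together

print(single_a, single_b, joint)
```

In this constructed case each SNP alone yields no impurity decrease, while the pair separates cases from controls completely. Real data rarely show such a pure interaction, so the example should be read as a limiting case.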

    Investigation of methods for machine learning associations between genetic variations and phenotype

    The relationship between genetics and phenotype is a complex one that remains poorly understood. Many factors contribute to the relationship between genetic variations and differences in phenotype. An improved understanding of the genetic underpinnings of various phenotypes can help us make important advances in testing for, preventing, treating, and curing a number of diseases and disorders. The recent popularization of direct-to-consumer sequencing services, coupled with consumers releasing their genetic information for public use, has led to an unprecedented level of access to genetic information. Crowd-sourcing the problem of developing robust genome-wide association techniques for ever larger amounts of data is a promising trend. This thesis explores likely methods to data mine one such public genetic data repository, openSNP, for correlated genotypes and phenotypes. Particular care is given to data clean-up and the steps required to preprocess public data for machine learning. The preprocessing methods are detailed in such a way that they may be applied to other genetic data repositories that already exist, for example the Personal Genome Project, as well as genetic data repositories that may become available in the future. Following data clean-up, a number of machine learning techniques are investigated, applied, and assessed for their utility in such a big-data problem. No single machine learning approach was found to be sufficient; the combination of imbalanced phenotype response classes and an underdetermined system led to a difficult machine learning challenge. Additional techniques must be explored or developed in order to make such genome-wide association studies possible and meaningful

    Capturing the Spectrum of Interaction Effects in Genetic Association Studies by Simulated Evaporative Cooling Network Analysis

    Evidence from human genetic studies of several disorders suggests that interactions between alleles at multiple genes play an important role in influencing phenotypic expression. Analytical methods for identifying Mendelian disease genes are not appropriate when applied to common multigenic diseases, because such methods investigate association with the phenotype only one genetic locus at a time. New strategies are needed that can capture the spectrum of genetic effects, from Mendelian to multifactorial epistasis. Random Forests (RF) and Relief-F are two powerful machine-learning methods that have been studied as filters for genetic case-control data due to their ability to account for the context of alleles at multiple genes when scoring the relevance of individual genetic variants to the phenotype. However, when variants interact strongly, the independence assumption of RF in the tree node-splitting criterion leads to diminished importance scores for relevant variants. Relief-F, on the other hand, was designed to detect strong interactions but is sensitive to large backgrounds of variants that are irrelevant to classification of the phenotype, which is an acute problem in genome-wide association studies. To overcome the weaknesses of these data mining approaches, we develop Evaporative Cooling (EC) feature selection, a flexible machine learning method that can integrate multiple importance scores while removing irrelevant genetic variants. To characterize detailed interactions, we construct a genetic-association interaction network (GAIN), whose edges quantify the synergy between variants with respect to the phenotype. We use simulation analysis to show that EC is able to identify a wide range of interaction effects in genetic association data. We apply the EC filter to a smallpox vaccine cohort study of single nucleotide polymorphisms (SNPs) and infer a GAIN for a collection of SNPs associated with adverse events. 
Our results suggest an important role for hubs in SNP disease susceptibility networks. The software is available at http://sites.google.com/site/McKinneyLab/software
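
To illustrate how a Relief-type score can account for the context of other variants, the following sketch implements a simplified Relief-style weighting. It is neither Relief-F itself nor the Evaporative Cooling method of this abstract: it uses Hamming distance on binary genotypes, visits every sample once, and averages over all tied nearest neighbours, all of which are assumptions made for this example. The function names and toy data are hypothetical.

```python
from itertools import product


def hamming(x, y):
    """Number of positions at which two genotype vectors differ."""
    return sum(xi != yi for xi, yi in zip(x, y))


def relief_weights(rows, labels):
    """Simplified Relief: reward features that differ on the nearest
    other-class samples and penalize those that differ on the nearest
    same-class samples."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    for i, (row, label) in enumerate(zip(rows, labels)):
        for same_class, sign in ((True, -1.0), (False, 1.0)):
            others = [r for j, (r, l) in enumerate(zip(rows, labels))
                      if j != i and (l == label) == same_class]
            if not others:
                continue
            nearest = min(hamming(row, r) for r in others)
            tied = [r for r in others if hamming(row, r) == nearest]
            for f in range(n_features):
                diff = sum(row[f] != r[f] for r in tied) / len(tied)
                weights[f] += sign * diff / len(rows)
    return weights


# Hypothetical toy data: every combination of three binary SNPs; case status
# depends only on the interaction of the first two, the third is irrelevant.
rows = list(product([0, 1], repeat=3))
labels = [a ^ b for a, b, _ in rows]

weights = relief_weights(rows, labels)
print(weights)
```

Here the two interacting SNPs receive higher scores than the irrelevant one even though neither has a marginal effect. With many irrelevant variants the nearest-neighbour distances become noisy, which is the sensitivity the abstract attributes to Relief-F.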