583 research outputs found

    Discovering Higher-order SNP Interactions in High-dimensional Genomic Data

    In this thesis, a method based on multifactor dimensionality reduction and built on associative classification is employed to identify higher-order SNP interactions, with the aim of improving understanding of the genetic architecture of complex diseases. The thesis also explores the application of deep learning techniques to interaction analysis. The performance of the deep learning method is maximized by combining deep neural networks with a random forest to identify interactions reliably in the presence of noise.

    Genetic Interactions and Gene-by-Environment Interactions in Evolution

    The phenotypic effect of a mutation depends on both genetic interactions (G×G) and gene-by-environment interactions (G×E). G×G and G×E can distort the additive relationship between genotypes and phenotypes and complicate biological and biomedical studies. Understanding the patterns and mechanisms of these interactions is important for predicting evolutionary trajectories, designing plant and animal breeding strategies, detecting “missing heritability”, and guiding “personalized medicine”. In this thesis, I study how G×G and G×E affect mutational effects, including developing new methods and new models. Recent advancements in high-throughput DNA sequencing and high-throughput phenotyping provide powerful tools to study the relationships among genotypes, phenotypes, and the environment at unprecedented scales. Therefore, I take advantage of several published large datasets in my study, each containing hundreds to thousands of different genotypes of model organisms and their corresponding phenotypes in tens of environments. In Chapter 2, I report some general patterns of G×E and demonstrate the importance of considering potential environmental variations in mapping quantitative trait loci. In Chapter 3, I report how the environment affects diminishing returns epistasis and propose a modular life model to explain the patterns of diminishing returns. In Chapter 4, I propose and demonstrate that genetic dominance is a special case of diminishing returns epistasis. In Chapter 5, I report how and why the relationship between growth rate (r) and carrying capacity (K) in density-dependent population growth varies across environments. In Chapter 6, I demonstrate the existence of an intermediate optimal mating distance for hybrid performance in three model organisms. 
Overall, I find that large genomic and phenomic data are useful resources to address classical genetic questions, such as the origin of dominance (Chapter 4), the relationship between r and K (Chapter 5), and the presence of an optimal mating distance (Chapter 6). The environment is a key player in the phenotypic effects of mutations, but it is also a high-dimensional, complex system that is hard to quantify. In this thesis, I define environment quality (Q) as the average fitness of many different genotypes measured in the environment. I demonstrate that Q is useful in studying how the environment affects additive (Chapter 3), interactive (Chapters 3 and 4), and pleiotropic mutational effects (Chapter 5). Many classical theories and models were developed based on observations made in a single environment, and they are often insufficient to explain across-environment observations. Studying across-environment effects provides valuable information for testing old models and for designing new models when old models fail. I conclude that studying G×G and G×E sheds light on underlying biological mechanisms.
PhD, Ecology and Evolutionary Biology, University of Michigan, Horace H. Rackham School of Graduate Studies
https://deepblue.lib.umich.edu/bitstream/2027.42/144160/1/xinzhuw_1.pd
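The definition of environment quality (Q) given in this abstract, the average fitness of many genotypes measured in one environment, can be illustrated with a minimal Python sketch. The function name, the environment and genotype labels, and the fitness values are hypothetical and are not taken from the thesis.

```python
from statistics import mean


def environment_quality(fitness_by_env):
    """Return Q per environment: the mean fitness of all genotypes measured there."""
    return {
        env: mean(genotype_fitness.values())
        for env, genotype_fitness in fitness_by_env.items()
    }


# Hypothetical measurements: environment -> genotype -> fitness (e.g. growth rate)
fitness_by_env = {
    "rich_medium": {"g1": 1.0, "g2": 0.8, "g3": 0.9},
    "high_salt": {"g1": 0.4, "g2": 0.5, "g3": 0.3},
}

quality = environment_quality(fitness_by_env)
for env, q in quality.items():
    print(f"{env}: Q = {q:.2f}")
```

Under this reading, Q reduces each environment to a single number, so environments can be ordered on one axis when comparing mutational effects across them.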

    SAERMA: Stacked Autoencoder Rule Mining Algorithm for the Interpretation of Epistatic Interactions in GWAS for Extreme Obesity

    One of the most important challenges in the analysis of high-throughput genetic data is the development of efficient computational methods to identify statistically significant single nucleotide polymorphisms (SNPs). Genome-wide association studies (GWAS) use single-locus analysis, in which each SNP is independently tested for association with phenotypes. The limitation of this approach, however, is its inability to explain genetic variation in complex diseases. Alternative approaches are required to model the intricate relationships between SNPs. Our proposed approach extends GWAS by combining deep learning stacked autoencoders (SAEs) and association rule mining (ARM) to identify epistatic interactions between SNPs. Following traditional GWAS quality control and association analysis, the most significant SNPs are selected and used in the subsequent analysis to investigate epistasis. SAERMA controls the classification results produced in the final fully connected multi-layer feedforward artificial neural network (MLP) by manipulating the interestingness measures, support and confidence, in the rule generation process. The best classification results were achieved with 204 SNPs compressed to 100 units (77% AUC, 77% SE, 68% SP, 53% Gini, logloss=0.58, and MSE=0.20), although it was possible to achieve 73% AUC (77% SE, 63% SP, 45% Gini, logloss=0.62, and MSE=0.21) with 50 hidden units, both supported by close model interpretation.
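The two interestingness measures named in this abstract, support and confidence, have standard definitions in association rule mining. The sketch below computes them for a single rule and does not reproduce SAERMA itself. The genotype labels and the four toy records are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of (antecedent and consequent) divided by support of antecedent."""
    antecedent_support = support(transactions, antecedent)
    if antecedent_support == 0:
        return 0.0  # rule never fires, so confidence is reported as 0
    joint = set(antecedent) | set(consequent)
    return support(transactions, joint) / antecedent_support


# Hypothetical records: each individual is a set of genotype calls plus a phenotype label
transactions = [
    frozenset({"snpA=AA", "snpB=GT", "case"}),
    frozenset({"snpA=AA", "snpB=GT", "case"}),
    frozenset({"snpA=AA", "snpB=GG", "control"}),
    frozenset({"snpA=AG", "snpB=GT", "control"}),
]

# Rule: {snpA=AA, snpB=GT} -> {case}
rule_support = support(transactions, {"snpA=AA", "snpB=GT", "case"})
rule_confidence = confidence(transactions, {"snpA=AA", "snpB=GT"}, {"case"})
print(rule_support, rule_confidence)
```

Raising the minimum support or confidence threshold prunes more candidate rules, which is the kind of control over rule generation the abstract refers to.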

    Experimental Illumination of Comprehensive Fitness Landscapes: A Dissertation

    Evolution is the single cohesive logical framework in which all biological processes may exist simultaneously. Incremental changes in phenotype over imperceptibly large timescales have given rise to the enormous diversity of life we witness on earth both presently and through the natural record. The basic unit of evolution is mutation, and by perturbing biological processes, mutations may alter the fitness of an individual. However, the fitness effect of a mutation is difficult to infer from the historical record, and complex to obtain experimentally in an efficient and accurate manner. We have recently developed a high-throughput method, termed Exceedingly Methodical and Parallel Investigation of Randomized Individual Codons (EMPIRIC), to iteratively mutagenize regions of essential genes in yeast and subsequently analyze individual mutant fitness. Utilizing this technique as exemplified in Chapters II and III, it is possible to determine the fitness effects of all possible point mutations in parallel through growth competition followed by a high-throughput sequencing readout. We have employed this technique to determine the distribution of fitness effects in a nine amino acid region of the Hsp90 gene of S. cerevisiae under elevated temperature, and found the bimodal distribution of fitness effects to be remarkably consistent with near-neutral theory. Comparing the measured fitness effects of mutants to the natural record, phylogenetic alignments appear to be a poor predictor of experimental fitness. In Chapter IV, to further interrogate the properties of this region, library competition under conditions of elevated temperature and salinity was performed to study the potential of protein adaptation. Strikingly, whereas both optimal and elevated temperatures produced no statistically significant beneficial mutations, under conditions of elevated salinity, adaptive mutations appear with fitness advantages up to 8% greater than wild type.
Of particular interest, mutations conferring fitness benefits under conditions of elevated salinity almost always experience a fitness defect in other experimental conditions, indicating these mutations are environmentally specialized. Applying the experimental fitness measurements to long standing theoretical predictions of adaptation, our results are remarkably consistent with Fisher’s Geometric Model of protein evolution. Epistasis between mutations can have profound effects on evolutionary trajectories. Although the importance of epistasis has been realized since the early 1900s, the interdependence of mutations is difficult to study in vivo due to the stochastic and constant nature of background mutations. In Chapter V, utilizing the EMPIRIC methodology allows us to study the distribution of fitness effects in the context of mutant genetic backgrounds with minimal influence from unintended background mutations. By analyzing intragenic epistatic interactions, we uncovered a complex interplay between solvent shielded structural residues and solvent exposed hydrophobic surface in the amino acid 582-590 region of Hsp90. Additionally, negative epistasis appears to be negatively correlated with mutational promiscuity while additive interactions are positively correlated, indicating potential avenues for proteins to navigate fitness ‘valleys’. In summary, the work presented in this dissertation is focused on applying experimental context to the theory-rich field of evolutionary biology. The development and implementation of a novel methodology for the rapid and accurate assessment of organismal fitness has allowed us to address some of the most basic processes of evolution including adaptation and protein expression level. 
Through the work presented here and by investigators across the world, the application of experimental data to evolutionary theory has the potential to improve drug design and human health in general, as well as allow for predictive medicine in the coming era of personalized medicine.
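The abstract describes measuring fitness through growth competition followed by a sequencing readout, but it does not state the estimator used. One common convention, assumed here and not confirmed by the source, is to take the slope of ln(mutant/wild type) read counts against generations as the selection coefficient. The function name and all counts below are hypothetical.

```python
import math


def selection_coefficient(generations, mutant_counts, wildtype_counts):
    """Least-squares slope of ln(mutant / wild type) against generations."""
    log_ratio = [math.log(m / w) for m, w in zip(mutant_counts, wildtype_counts)]
    n = len(generations)
    x_mean = sum(generations) / n
    y_mean = sum(log_ratio) / n
    sxx = sum((x - x_mean) ** 2 for x in generations)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(generations, log_ratio))
    return sxy / sxx


# Hypothetical time course: the mutant declines by a factor exp(-0.1) per generation
generations = [0, 2, 4, 6]
wildtype_counts = [1000.0] * len(generations)
mutant_counts = [1000.0 * math.exp(-0.1 * g) for g in generations]

s = selection_coefficient(generations, mutant_counts, wildtype_counts)
print(f"s = {s:.3f} per generation")
```

A negative slope marks a mutant that is depleted relative to wild type during the competition, and a positive slope marks one that is enriched.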

    Designing Data-Driven Learning Algorithms: A Necessity to Ensure Effective Post-Genomic Medicine and Biomedical Research

    Advances in sequencing technology have significantly shaped the field of genetics and enabled the identification of genetic variants associated with complex traits through genome-wide association studies. This has provided insights into genetic medicine, in which genetic factors influence variability in disease and treatment outcomes. On the other hand, the missing or hidden heritability suggests that the host's quality of life and other environmental factors may also influence differences in disease risk and drug/treatment responses in genomic medicine, and orient biomedical research, even though this may be highly constrained by genetic capabilities. Combining these different factors is expected to yield a paradigm shift in personalized medicine and lead to more effective medical treatment. With existing “big data” initiatives and high-performance computing infrastructures, there is a need for data-driven learning algorithms and models that enable the selection and prioritization of relevant genetic variants (post-genomic medicine) and trigger effective translation into clinical practice. In this chapter, we survey and discuss existing machine learning algorithms and post-genomic analysis models supporting the process of identifying valuable markers.

    Mitmekesiste bioloogiliste andmete ühendamine ja analüüs (Integration and Analysis of Diverse Biological Data)

    The electronic version of the dissertation does not include the publications.
Thanks to advances in technology, the volume of biological data has multiplied in recent years. These data cover different areas of biology. With only a single data set, biological processes or diseases can be studied from just one aspect at a time. There is therefore a growing need for machine learning methods that help combine data from different domains in order to study biological processes as a whole. In addition, there is demand for reliable collections of disease-specific data sets that would allow such analyses to be carried out more efficiently. This dissertation describes how to apply machine-learning-based integration methods to study a range of biological questions. We show how analysis based on integrated data leads to a better understanding of biological processes in three domains: Alzheimer’s disease, toxicology, and immunology. Alzheimer’s disease is an age-related neurodegenerative disease with no effective treatment. In the dissertation we show how to integrate different Alzheimer’s-specific data sets to form a heterogeneous, graph-based Alzheimer’s-specific data set, HENA. We then demonstrate the application of a deep learning method, the graph convolutional neural network, to HENA in order to find genes potentially associated with the disease. Second, we studied psoriasis, a chronic immune-mediated inflammatory disease. To do so, we combined laboratory measurements from patients’ blood and skin with clinical information and integrated the results of the corresponding analyses using domain-specific knowledge. The last part of the work focuses on advancing toxicity testing strategies. Toxicity testing is the process of assessing whether the chemicals under study have harmful effects on an organism. It is needed, for example, in assessing the safety of drugs. In this work we identified groups of toxic compounds with a similar mechanism of action. In addition, we developed a classification model that makes it possible to assess the toxicity of new compounds.
Rapid advances in biotechnological innovation and decreasing production costs have led to an explosion of experimental data being produced in laboratories around the world. Individual experiments allow us to understand biological processes, e.g. diseases, from different angles. However, to get a systematic view of a disease it is necessary to combine these heterogeneous data. The large amounts of diverse data require building machine learning models that can help, e.g., to identify which genes are related to disease. Additionally, there is a need to compose reliable integrated data sets that researchers can work with effectively. In this thesis we demonstrate how to combine and analyze different types of biological data using examples from three biological domains: Alzheimer’s disease, immunology, and toxicology. More specifically, we combine data sets related to Alzheimer’s disease into a novel heterogeneous network-based data set for Alzheimer’s disease (HENA). We then apply graph convolutional networks, state-of-the-art deep learning methods, to a node classification task in HENA to find genes that are potentially associated with the disease. Combining patient data related to immune disease helps to uncover its pathological mechanisms and to find better treatments in the future. We analyse laboratory data from patients’ skin and blood samples by combining them with clinical information. Subsequently, we bring together the results of individual analyses using available domain knowledge to form a more systematic view of the disease pathogenesis. Toxicity testing is the process of defining the harmful effects of substances on living organisms. One of its applications is the safety assessment of drugs or other chemicals for the human organism. In this work we identify groups of toxicants that have similar mechanisms of action. Additionally, we develop a classification model that makes it possible to assess the toxic actions of unknown compounds.
https://www.ester.ee/record=b523255