153 research outputs found

    Sparse multivariate factor analysis regression models and its applications to integrative genomics analysis

    The multivariate regression model is a useful tool to explore complex associations between two kinds of molecular markers, which enables the understanding of the biological pathways underlying disease etiology. For a set of correlated response variables, accounting for such dependency can increase statistical power. Motivated by integrative genomic data analyses, we propose a new methodology, the sparse multivariate factor analysis regression model (smFARM), in which correlations of response variables are assumed to follow a factor analysis model with latent factors. This proposed method not only allows us to address the challenge that the number of association parameters is larger than the sample size, but also to adjust for unobserved genetic and/or nongenetic factors that potentially conceal the underlying response-predictor associations. The proposed smFARM is implemented by the EM algorithm and the blockwise coordinate descent algorithm. The proposed methodology is evaluated and compared to the existing methods through extensive simulation studies. Our results show that accounting for latent factors through the proposed smFARM can improve sensitivity of signal detection and accuracy of sparse association map estimation. We illustrate smFARM by two integrative genomics analysis examples, a breast cancer dataset, and an ovarian cancer dataset, to assess the relationship between DNA copy numbers and gene expression arrays to understand genetic regulatory patterns relevant to the disease. 
We identify two trans-hub regions: one in cytoband 17q12 whose amplification influences the RNA expression levels of important breast cancer genes, and the other in cytoband 9q21.32-33, which is associated with chemoresistance in ovarian cancer.
Peer Reviewed
http://deepblue.lib.umich.edu/bitstream/2027.42/135396/1/gepi22018.pdf
http://deepblue.lib.umich.edu/bitstream/2027.42/135396/2/gepi22018_am.pdf
http://deepblue.lib.umich.edu/bitstream/2027.42/135396/3/gepi22018-sup-0001-SuppMat.pd

    Statistical analysis of genomic data : a new model for class prediction and inference

    Genomics is a major scientific revolution in this century. High-throughput genomic data provide an opportunity for identifying genes and SNPs (single-nucleotide polymorphisms) that are related to various clinical phenotypes. Dealing with the sheer volume of genetic data being produced requires advanced methodological development in biostatistics, which is lagging behind the technical capability to generate genomic data. SNPs have great importance in biomedical research for comparing regions of the genome between cohorts (such as in case-control studies). Within a population, SNPs can be assigned a minor allele frequency, the lowest allele frequency at a locus that is observed in a particular population, and be recoded to binary datasets. Therefore, it is important to develop suitable statistical methods for SNP analysis of genome alteration with the goal of contributing to the understanding of complex human diseases or traits such as mental health. In this thesis, we develop new statistical methodologies for the analysis of schizophrenia genomic data from the WA Genetic Epidemiology Resource (WAGER). The motivation is schizophrenia class prediction (i.e., the prediction of individuals’ disease status through their genotype and quantitative traits). In general, an individual’s disease status is a nominal variable, while genotypes can be converted into ordinal variables but are of high dimension. Note that the usual nonparametric regression developed for continuous variables cannot be applied here. Some methodologies are available in the literature, such as the tree-based logistic Non-parametric Pathway-based Regression (NPR) model proposed by Wei and Li (2007). However, this model does not adapt well to the data set that we are analyzing; it performs even worse than the (generalized) linear logistic regression model. 
Using the logistic discrimination rule, together with adding quantitative traits, some important results have been obtained. However, some shortcomings remain. Firstly, the generalized linear logistic model has a high type I error rate for schizophrenia classification. Secondly, the quantitative traits required for schizophrenia class prediction are performance assessments which demand several hours of on-site participation by both assessor and assessee. These traits are generally quite difficult to obtain even for a medium-size sample. Meanwhile, though the laboratory analysis cost is high, a person’s genotype can be obtained by merely collecting a drop of blood. Thus, two kinds of nonlinear models are proposed to capture the nonlinear effects in SNP datasets, which are categorical. The main contributions of this thesis are summarized as follows:
• Two kinds of nonlinear threshold index logistic regression models are proposed to capture the nonlinear effects by applying the idea of threshold models (Tong, 1983, 1990), which are parametric and therefore applicable to categorical data. One of the proposed models, called the partially linear threshold index logistic regression (PL-TILoR) model, is given by
\[ \log\left( \frac{P(Y_i = 1 \mid X_i)}{1 - P(Y_i = 1 \mid X_i)} \right) = \alpha^T X_i + g(\beta^T X_i), \tag{0.1} \]
where $Y_i$ is the disease status of the $i$th person under the case-control study, taking on values of 1 (case) or 0 (control), $X_i$ is the $p$-dimensional vector of genotype variables, and the superscript $T$ stands for the transpose of a vector or matrix. Here, $\alpha$ and $\beta$ are $p$-dimensional unknown parameters, with $\beta$ being an index vector used for the reduction of dimension, satisfying $\|\beta\| = 1$ and $\alpha^T \beta = 0$ for model identifiability, and $g$ is, therefore, a one-dimensional nonlinear function, which is modelled as a stepwise linear function through a threshold effect (Tong, 1990):
\[ g(z) = (b_1 z + b_2) I\{z \le c\} + (b_3 z + b_4) I\{z > c\}, \tag{0.2} \]
where the $b_i$'s and $c$ are unknown parameters to be estimated and $I_A$ is the indicator function of the set $A$. In practice, the first component in model (0.1) could also be nonlinear. In this case, model (0.1) becomes
\[ \log\left( \frac{P(Y_i = 1 \mid X_i)}{1 - P(Y_i = 1 \mid X_i)} \right) = g_1(\alpha^T X_i) + g_2(\beta^T X_i), \tag{0.3} \]
where $\|\alpha\| = 1$, $\|\beta\| = 1$ and $\alpha^T \beta = 0$ for model identifiability, and $g_1$ and $g_2$ are two one-dimensional nonlinear functions which are modelled by stepwise linear functions through threshold effects as follows:
\[ g_k(z) = (b_{k1} z + b_{k2}) I\{z \le c_k\} + (b_{k3} z + b_{k4}) I\{z > c_k\}, \quad k = 1, 2, \tag{0.4} \]
where the $b_{ki}$'s and $c_k$'s are unknown parameters to be estimated. Thus, (0.3) and (0.4) form an additive threshold index logistic regression (A-TILoR) model.
• A maximum likelihood methodology is developed to estimate the unknown parameters in the PL-TILoR and A-TILoR models. Simulation studies have found that the proposed methodology works well for finite-size samples.
• Empirical studies of the proposed models applied to the analysis of schizophrenia genomic data from the WA Genetic Epidemiology Resource (WAGER) have shown that the A-TILoR model is very successful in reducing the type I error rate in schizophrenia classification without even using quantitative traits. It outperforms the generalized linear logistic model that is widely used in the literature.
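As an illustration of how the PL-TILoR model (0.1) with the threshold function (0.2) turns a genotype vector into a case probability, here is a minimal Python sketch. It only evaluates the model for given parameters and does not implement the thesis's maximum likelihood estimation. The function names and all numeric values are hypothetical, chosen so that the identifiability constraints (unit-norm index vector, orthogonal to the linear coefficients) hold.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def threshold_piece(z, b1, b2, b3, b4, c):
    """Stepwise linear g(z) as in (0.2): one line for z <= c, another for z > c."""
    return b1 * z + b2 if z <= c else b3 * z + b4


def pl_tilor_probability(x, alpha, beta, g_params):
    """P(Y = 1 | X = x) when the log-odds equal alpha'x + g(beta'x), as in (0.1)."""
    log_odds = dot(alpha, x) + threshold_piece(dot(beta, x), *g_params)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical parameters (not taken from the thesis).
alpha = [1.0, 0.0]                      # linear part
beta = [0.0, 1.0]                       # index vector: norm 1, orthogonal to alpha
g_params = (0.5, 0.0, 2.0, -1.0, 1.0)   # b1, b2, b3, b4, threshold c

x = [1, 2]  # two SNPs coded as ordinal genotypes 0/1/2 (assumed coding)
p = pl_tilor_probability(x, alpha, beta, g_params)
print(p)
```

Here the index value is 2, which lies above the threshold c = 1, so the second linear piece applies and the log-odds equal 1 + (2·2 − 1) = 4.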

    Genome-Wide Datasets of Chicories (Cichorium intybus L.) for Marker-Assisted Crop Breeding Applications: A Systematic Review and Meta-Analysis

    Cichorium intybus L. is the most economically important species of its genus and among the most important of the Asteraceae family. In chicory, many linkage maps have been produced, several sets of mapped and unmapped markers have been developed, and dozens of genes linked to traits of agronomic interest have been investigated. This treasure trove of information, properly cataloged and organized, is of pivotal importance for the development of superior commercial products with valuable agronomic potential in terms of yield and quality, including reduced bitter taste and increased inulin production, as well as resistance or tolerance to pathogens and resilience to environmental stresses. For this reason, a systematic review was conducted based on the scientific literature published on chicory during 1980-2023. Based on the results obtained from the meta-analysis, we created two consensus maps capable of supporting marker-assisted breeding (MAB) and marker-assisted selection (MAS) programs. By taking advantage of the recently released genome of C. intybus, we built a consensus map based on 639 molecular markers, collecting all the mapped and unmapped SNP and SSR loci available for this species. In the following section, after summarizing and discussing all the genes investigated in chicory and related to traits of interest such as reproductive barriers, sesquiterpene lactone biosynthesis, inulin metabolism and stress response, we produced a second map encompassing 64 loci that could be useful for MAS purposes. With the advent of omics technologies, molecular data chaos (namely, the situation where the amount of molecular data is so complex and unmanageable that their use becomes challenging) is becoming far from a negligible issue. In this review, we have therefore tried to contribute by standardizing and organizing the molecular data produced thus far in chicory to facilitate the work of breeders.

    Mining for genotype-phenotype relations in Saccharomyces using partial least squares

    Background: Multivariate approaches are important due to their versatility and applications in many fields, as they provide decisive advantages over univariate analysis in many ways. Genome-wide association studies are rapidly emerging, but the approaches at hand pay less attention to the multivariate relation between genotype and phenotype. We introduce a methodology based on a BLAST approach for extracting information from genomic sequences and Soft-Thresholding Partial Least Squares (ST-PLS) for mapping genotype-phenotype relations.
Results: Applying this methodology to an extensive data set for the model yeast Saccharomyces cerevisiae, we found that the genotype-phenotype relationship involves surprisingly few genes, in the sense that an overwhelmingly large fraction of the phenotypic variation can be explained by variation in less than 1% of the full gene reference set containing 5791 genes. These phenotype-influencing genes were evolving 20% faster than non-influential genes and were unevenly distributed over cellular functions, with strong enrichments in functions such as cellular respiration and transposition. These genes were also enriched with known paralogs, stop codon variations and copy number variations, suggesting that such molecular adjustments have had a disproportionate influence on the recent adaptation of Saccharomyces yeasts to environmental changes in their ecological niche.
Conclusions: The BLAST- and PLS-based multivariate approach derived results that adhere to the known yeast phylogeny and gene ontology, and thus verify that the methodology extracts a set of fast-evolving genes that capture the phylogeny of the yeast strains. The approach is worth pursuing, and future investigations should be made to improve the computation of genotype signals as well as the variable selection procedure within the PLS framework.
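The abstract does not spell out the soft-thresholding step, so the following is only a sketch of the generic soft-thresholding operator, which shrinks every weight toward zero by a fixed amount and sets small weights exactly to zero. How ST-PLS scales the loading weights before and after this step is not stated above; the function name, the weights and the threshold value are hypothetical.

```python
def soft_threshold(weights, delta):
    """Shrink each weight toward zero by delta; weights with |w| <= delta become 0."""
    shrunk = []
    for w in weights:
        magnitude = abs(w) - delta
        if magnitude <= 0:
            shrunk.append(0.0)          # weak variable is dropped
        elif w > 0:
            shrunk.append(magnitude)    # positive weight, reduced
        else:
            shrunk.append(-magnitude)   # negative weight, reduced in size
    return shrunk


# Hypothetical loading weights for four genes.
weights = [0.9, -0.05, 0.3, -0.6]
sparse = soft_threshold(weights, 0.2)
print(sparse)
```

With a threshold of 0.2 the second gene's weight is set to zero while the others keep their sign with a smaller magnitude, which is the sense in which such a step can select a small subset of influential genes.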

    Improved Prediction of Bacterial Genotype-Phenotype Associations Using Interpretable Pangenome-Spanning Regressions

    Discovery of genetic variants underlying bacterial phenotypes and the prediction of phenotypes such as antibiotic resistance are fundamental tasks in bacterial genomics. Genome-wide association study (GWAS) methods have been applied to study these relations, but the plastic nature of bacterial genomes and the clonal structure of bacterial populations create challenges. We introduce an alignment-free method which finds sets of loci associated with bacterial phenotypes, quantifies the total effect of genetics on the phenotype, and allows accurate phenotype prediction, all within a single computationally scalable joint modeling framework. Genetic variants covering the entire pangenome are compactly represented by extended DNA sequence words known as unitigs, and model fitting is achieved using elastic net penalization, an extension of standard multiple regression. Using an extensive set of state-of-the-art bacterial population genomic data sets, we demonstrate that our approach performs accurate phenotype prediction, comparable to popular machine learning methods, while retaining both interpretability and computational efficiency. Compared to those of previous approaches, which test each genotype-phenotype association separately for each variant and apply a significance threshold, the variants selected by our joint modeling approach overlap substantially.
Importance: Being able to identify the genetic variants responsible for specific bacterial phenotypes has been the goal of bacterial genetics since its inception and is fundamental to our current level of understanding of bacteria. This identification has been based primarily on painstaking experimentation, but the availability of large data sets of whole genomes with associated phenotype metadata promises to revolutionize this approach, not least for important clinical phenotypes that are not amenable to laboratory analysis. These models of phenotype-genotype association can in the future be used for rapid prediction of clinically important phenotypes such as antibiotic resistance and virulence by rapid-turnaround or point-of-care tests. However, despite much effort being put into adapting genome-wide association study (GWAS) approaches to cope with bacterium-specific problems, such as strong population structure and horizontal gene exchange, current approaches are not yet optimal. We describe a method that advances methodology for both association and generation of portable prediction models.
Peer reviewed
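To make "elastic net penalization" concrete, the sketch below evaluates a penalized least-squares objective on a toy unitig presence/absence matrix. It assumes the common parameterization in which a mixing weight balances the L1 and squared L2 terms, and it assumes a continuous phenotype with squared-error loss; the authors' actual loss, software and tuning are not given in the abstract. All names and numbers are hypothetical.

```python
def elastic_net_penalty(coefs, lam, mix):
    """lam * (mix * sum|b| + (1 - mix) / 2 * sum b^2); mix=1 is lasso, mix=0 is ridge."""
    l1 = sum(abs(b) for b in coefs)
    l2_squared = sum(b * b for b in coefs)
    return lam * (mix * l1 + 0.5 * (1.0 - mix) * l2_squared)


def penalized_objective(X, y, coefs, lam, mix):
    """Mean squared-error loss (halved) plus the elastic net penalty."""
    n = len(y)
    residuals = [
        yi - sum(xij * bj for xij, bj in zip(row, coefs))
        for row, yi in zip(X, y)
    ]
    loss = sum(r * r for r in residuals) / (2 * n)
    return loss + elastic_net_penalty(coefs, lam, mix)


# Toy data: rows are isolates, columns are unitigs (1 = present, 0 = absent).
X = [[1, 0],
     [0, 1],
     [1, 1]]
y = [1.0, 0.0, 1.0]   # hypothetical continuous phenotype
coefs = [1.0, 0.0]    # only the first unitig has an effect

objective = penalized_objective(X, y, coefs, lam=0.1, mix=0.5)
print(objective)
```

In this toy case the coefficients fit the phenotype exactly, so the objective consists only of the penalty term. Fitting all unitigs jointly under such a penalty, rather than testing each variant separately, is the contrast the abstract draws with per-variant GWAS.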

    Uncovering the genetic basis of seed amino acid composition in arabidopsis using a multi-omics integrative approach

    Seeds are a vital source of protein in the diet of humans and livestock. However, protein content in the seed is low, comprising about 10 percent of the total composition of the seed. Additionally, protein quality in the seed is poor due to low concentrations of certain essential amino acids (EAA). Since the body is unable to produce EAA, they must be consumed in the diet, and failure to do so has detrimental, potentially irreversible, health implications that can result in death. In developing countries where meat and dairy are lacking, protein-energy malnutrition frequently occurs. In contrast, in developed countries large portions of seeds are used in the diet of livestock, which must be supplemented with costly synthetic amino acids. Collectively, the seed amino acid composition of major crops is not sufficient to meet dietary requirements. Protein in the seed is comprised of free amino acids (FAA) and protein-bound amino acids (PBAA), which have both been the targets of manipulation in order to create a seed with a more balanced amino acid profile. However, upon perturbations to the proteome, mutant seeds have demonstrated a rebalancing phenomenon where even large alterations to the amino acid composition activate a compensation mechanism that returns amino acid levels to a composition comparable to the wild type. Although a lot is known about amino acid metabolic pathways, what regulates such rebalancing mechanisms is still unknown. However, despite the tight regulation, natural variation does exist in seed FAA and PBAA across Arabidopsis ecotypes, with a unique composition specific to each ecotype; this suggests rebalancing has a genetic basis. Thus, the first step in seed biofortification efforts must be to increase the fundamental understanding of the genetic basis of both FAA and PBAA composition in the seed. 
Chapter One of this dissertation gives a more in-depth introduction that elaborates on amino acid composition in the seed, the challenges identified in previous experimentation, and how the content of Chapter Two through Chapter Four builds upon and adds value to the area of seed amino acid research as a whole. Chapter Two focuses on uncovering the genes and biological processes that underlie the regulation of free Glutamine, which belongs to the Glutamate Family (Arginine, Proline, Glutamine, and Glutamate). Although Glutamine is not an EAA, it is a major nitrogen-containing amino acid that is transported to the seed; thus its regulatory control is of particular interest. I harness the natural variation of Glutamine in a 360 Arabidopsis diversity panel to uncover key regulatory genes. Later, I validate observations from GWAS using both a quantitative trait locus (QTL) analysis and reverse genetic approaches to identify a unique, seed-specific Glutamine-glucosinolate relationship that alters nitrogen and sulfur homeostasis in the seed in the Arabidopsis 360 population. Such findings were substantial as they link primary and secondary metabolism in the seed. Chapter Three focuses on uncovering the genetic basis underlying PBAA composition in dry Arabidopsis seeds while expanding upon the work completed in Chapter Two. 576 high-confidence candidate genes (HCCGs) are found through integration of GWAS using PBAA traits and transcriptomic analysis across seed development of two mutants showing active rebalancing. To reveal the underlying biological process, I further subject the HCCGs to a protein-protein interaction (PPI) network analysis, which strongly suggests that ribosomal genes and potentially other translational machinery may be at the heart of PBAA composition homeostasis and the proteomic rebalancing response. 
Chapter Four addresses the need for a comprehensive tool to efficiently and automatically analyze many biochemical derived traits in GWAS, while also completing pre- and post-GWAS analysis. Here, I present the R tool HAPPI GWAS, describing each step in the pipeline and giving an example of its implementation. Lastly, Chapter Five reiterates the contributions of this dissertation to the field of seed amino acid research and provides insight into future directions and research projects. The results from this work are vital steps in understanding the complex regulatory mechanisms underlying amino acid composition in the seed, which can be used in manipulating the amino acid pools in future translational crop research.