10 research outputs found

    netgwas: An R Package for Network-Based Genome-Wide Association Studies

    Graphical models are powerful tools for modeling and making statistical inferences regarding complex associations among variables in multivariate data. In this paper we introduce the R package netgwas, which is designed based on undirected graphical models to accomplish three important and interrelated goals in genetics: constructing linkage maps, reconstructing linkage disequilibrium (LD) networks from multi-loci genotype data, and detecting high-dimensional genotype-phenotype networks. The netgwas package deals with species with any chromosome copy number in a unified way, unlike other software. It implements recent improvements in both linkage map construction (Behrouzi and Wit, 2018) and reconstruction of conditional independence networks for non-Gaussian continuous data, discrete data, and mixed discrete-and-continuous data (Behrouzi and Wit, 2017). Such datasets routinely occur in genetics and genomics, for example as genotype data and genotype-phenotype data. We demonstrate the value of our package functionality by applying it to various multivariate example datasets taken from the literature. We show, in particular, that our package allows a more realistic analysis of data, as it adjusts for the effect of all other variables while performing pairwise associations. This feature controls for spurious associations between variables that can arise from the classical multiple testing approach. This paper includes a brief overview of the statistical methods which have been implemented in the package. The main body of the paper explains how to use the package. The package uses a parallelization strategy on multi-core processors to speed up computations for large datasets. In addition, it contains several functions for simulation and visualization.
The netgwas package is freely available at https://cran.r-project.org/web/packages/netgwas
Comment: 32 pages, 9 figures; due to the limitation "The abstract field cannot be longer than 1,920 characters", the abstract appearing here is slightly shorter than that in the PDF fil

    Boosting heritability : estimating the genetic component of phenotypic variation with multiple sample splitting

    Background: Heritability is a central measure in genetics quantifying how much of the variability observed in a trait is attributable to genetic differences. Existing methods for estimating heritability are most often based on random-effect models, typically for computational reasons. The alternative of using a fixed-effect model has received much more limited attention in the literature. Results: In this paper, we propose a generic strategy for heritability inference, termed "boosting heritability", which combines the advantageous features of different recent methods to produce an estimate of the heritability with a high-dimensional linear model. Boosting heritability uses in particular a multiple sample splitting strategy, which in general leads to a stable and accurate estimate. We use both simulated data and real antibiotic resistance data from a major human pathogen, Streptococcus pneumoniae, to demonstrate the attractive features of our inference strategy. Conclusions: Boosting is shown to offer a reliable and practically useful tool for inference about heritability.

    Clinical Significance of Pathogenicity of Somatic Mutations in Oral Leukoplakia: a Prospective Observational Study

    Background. The vast majority of malignant neoplasms of the oral mucosa are squamous cell carcinomas. The development of squamous cell carcinoma of the oral mucosa is often promoted by previous potentially malignant diseases, with oral leukoplakia dominating among them. Objective. To determine the clinical significance of the pathogenicity of somatic mutations in oral mucosal leukoplakia. Methods. The study material included 24 samples of abnormal epithelium of the oral mucosa from leukoplakia patients. QIAamp DNA FFPE Tissue Kit (Qiagen, Germany) was used for deoxyribonucleic acid (DNA) extraction from the samples. DNA sequencing was performed using an Illumina NextSeq 550 sequencer and TruSight™ Oncology 500 DNA Kit For Use with NextSeq (Illumina, USA). All DNA extractions from biological samples, preparation and sequencing of DNA libraries were performed step-by-step in strict accordance with the guidelines provided with the respective reagent kits. Bioinformatics analysis was carried out using the software Illumina Base Space (Illumina, USA) and Galaxy Project (The Galaxy Community, a non-profit international project) according to current guidelines. The desired power of the study was 90%. A Two Proportions Z test was performed by means of The Sample Size Calculation of Statistica 12 (StatSoft, Inc.) with the option “one-tailed hypothesis” set, because it was initially assumed that pathogenic (oncogenic) genetic variants occur in the tissue of oral leukoplakia much more frequently than in the human reference genome used for sequence alignment. Results. The pathogenic somatic mutations in the TP53, KRAS, APC, NRAS and BRAF genes, identified in this study, alone or in combination, are highly likely (hazard ratio 3000–11000) to be associated with the development of oral mucosal leukoplakia and low-grade epithelial dysplasia.
The multiplicity of pathogenic and likely pathogenic genetic variants associated with epithelial dysplasia, as well as the fact that a number of variants do not occur in all patients, suggests that the same histotype of oral mucosal dysplasia may develop under the influence of different mutations. Conclusion. The pathogenic and likely pathogenic variants of the TP53, KRAS, APC, NRAS and BRAF genes, identified in this study, alone or in combination, are highly likely (hazard ratio 3000–11000) to be associated with the development of leukoplakia and low-grade epithelial dysplasia

    Group Inference in High Dimensions with Applications to Hierarchical Testing

    High-dimensional group inference is an essential part of statistical methods for analysing complex data sets, including hierarchical testing, tests of interaction, detection of heterogeneous treatment effects and inference for local heritability. Group inference in regression models can be measured with respect to a weighted quadratic functional of the regression sub-vector corresponding to the group. Asymptotically unbiased estimators of these weighted quadratic functionals are constructed and a novel procedure using these estimators for inference is proposed. We derive its asymptotic Gaussian distribution, which enables the construction of asymptotically valid confidence intervals and tests that perform well in terms of length or power. The proposed test is computationally efficient even for a large group, statistically valid for any group size, and achieves good power for testing large groups with many small regression coefficients. We apply the methodology to several interesting statistical problems and demonstrate its strength and usefulness on simulated and real data
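The "weighted quadratic functional" mentioned in this abstract can be written out as a short sketch. The notation here is assumed for illustration and is not taken from the listing: a linear model with coefficient vector $\beta$, a group of indices $G$, and a positive definite weight matrix $A$ chosen by the analyst.

```latex
y = X\beta + \varepsilon, \qquad
Q_A = \beta_G^{\top} A \, \beta_G, \qquad
H_0 : \beta_G = 0 \;\Longleftrightarrow\; Q_A = 0
\quad \text{(for positive definite } A\text{)}.
```

Under this reading, testing whether the whole group is null reduces to inference on the single scalar $Q_A$, which is what makes a confidence interval for the group meaningful.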

    Worldwide Distribution of the Human Apolipoprotein E Gene - The Association between APOE, Subsistence, and Latitude

    The human apolipoprotein E gene (APOE) plays an important role in metabolizing lipids, regulating plasma cholesterol, and maintaining biological function. Structural differences in APOE variants impact cholesterol absorption and health risk, so that alleles serve as biomarkers for numerous cardiovascular and neurological diseases (Lai 2015). Variant differences are determined by changes in two single nucleotide polymorphisms (SNPs), rs429358 and rs7412. Distribution of alleles varies across populations. Allele frequencies in populations have been shown to be associated with cultural and environmental factors, including subsistence strategy and latitude (Eisenberg 2010). This study aims to provide a cross-population genetic association study between APOE, subsistence strategy, and latitude. The objective of the study is to examine the roles that subsistence and latitude have in shaping APOE allele frequencies within populations. The study hypothesizes that E3 correlates with agriculture / post-agriculture and low latitude, and E4 correlates with non-agricultural subsistence and high latitude. The study further predicts that E2 is not linked to either subsistence or latitude. To test these hypotheses, genotype data on 124 APOE SNPs, together with subsistence and latitude data, were compiled for 26 populations. The data were adjusted for population stratification, and remaining SNPs were tested for significance based on linkage between loci. Afterward, subsistence and latitude were first tested as independent variables for an association with each SNP / haplotype, then as covariates. Results on the associations between APOE and subsistence and latitude were mixed. SNPs rs429358 and rs7412 were confirmed to be significant in determining APOE variation. Association results on each SNP showed a link between rs429358 and both subsistence and latitude, as well as between rs7412 and latitude, but not between rs7412 and subsistence.
Association results on haplotypes confirmed the hypothesis that subsistence and latitude each play a role in APOE distribution, although this role lessened when considering the other variable. When subsistence and latitude were treated as independent variables, E3 showed an association with both subsistence and latitude. Yet the correlation between E3 and subsistence disappeared when latitude was a covariate. Further, while E4 was confirmed to be associated with subsistence, this association decreased when latitude was a covariate. The study also confirmed the subsistence hypotheses, with E3 linked to post-agriculture (when subsistence was an independent variable) and E4 linked to non-agriculture. However, the study refuted the latitude hypotheses by showing the reverse of the predicted association, with E3 being associated with high latitude and E4 being associated with low latitude. Also, contrary to the hypotheses, E2 was shown to be associated with both subsistence and latitude. In summary, results from the study support an association between APOE, subsistence, and latitude; however, the results do not support the direction of association between specific APOE alleles and these variables
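The statement that the E2/E3/E4 variants are determined by rs429358 and rs7412 can be illustrated with a small lookup. This is a minimal sketch and not the study's pipeline: it assumes phased haplotypes with alleles reported on the forward strand, and the names `APOE_HAPLOTYPES`, `apoe_allele` and `allele_frequencies` are hypothetical.

```python
from collections import Counter

# Alleles at (rs429358, rs7412), forward strand (assumption), mapped to
# the common APOE variants.
APOE_HAPLOTYPES = {
    ("T", "T"): "E2",
    ("T", "C"): "E3",
    ("C", "C"): "E4",
}


def apoe_allele(rs429358: str, rs7412: str) -> str:
    """Return the APOE variant for one phased haplotype, or 'unknown'."""
    key = (rs429358.upper(), rs7412.upper())
    # The remaining combination ("C", "T") is very rare and is not one of
    # the three common variants, so it is reported as 'unknown' here.
    return APOE_HAPLOTYPES.get(key, "unknown")


def allele_frequencies(haplotypes):
    """Relative frequency of each APOE variant in a list of haplotypes."""
    counts = Counter(apoe_allele(a, b) for a, b in haplotypes)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


if __name__ == "__main__":
    # Toy population of four haplotypes (made-up data for illustration).
    sample = [("T", "C"), ("T", "C"), ("C", "C"), ("T", "T")]
    print(allele_frequencies(sample))
```

Per-population frequencies computed this way are the kind of quantity one would then relate to subsistence and latitude.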

    Strategies For Improving Epistasis Detection And Replication

    Genome-wide association studies (GWAS) have been extensively critiqued for their perceived inability to adequately elucidate the genetic underpinnings of complex disease. Of particular concern is “missing heritability,” or the difference between the total estimated heritability of a phenotype and that explained by GWAS-identified loci. There are numerous proposed explanations for this missing heritability, but a frequently ignored and potentially vastly informative alternative explanation is the ubiquity of epistasis underlying complex phenotypes. Given our understanding of how biomolecules interact in networks and pathways, it is not unreasonable to conclude that the effect of variation at individual genetic loci may non-additively depend on and should be analyzed in the context of their interacting partners. It has been recognized for over a century that deviation from expected Mendelian proportions can be explained by the interaction of multiple loci, and the epistatic underpinnings of phenotypes in model organisms have been extensively experimentally quantified. Therefore, the dearth of inspiring single locus GWAS hits for complex human phenotypes (and the inconsistent replication of these between populations) should not be surprising, as one might expect the joint effect of multiple perturbations to interacting partners within a functional biological module to be more important than individual main effects. Current methods for analyzing data from GWAS are not well-equipped to detect epistasis or replicate significant interactions. The multiple testing burden associated with testing each pairwise interaction quickly becomes nearly insurmountable with increasing numbers of loci. Statistical and machine learning approaches that have worked well for other types of high-dimensional data are appealing and may be useful for detecting epistasis, but potentially require tweaks to function appropriately. 
Biological knowledge may also be leveraged to guide the search for epistasis candidates, but requires context-appropriate application (as, for example, two loci with significant main effects may not have a significant interaction, and vice versa). Rather than renouncing GWAS and the wealth of associated data that has been accumulated as a failure, I propose the development of new techniques and incorporation of diverse data sources to analyze GWAS data in an epistasis-centric framework
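The multiple testing burden described in this abstract can be made concrete with a quick calculation. The sketch below counts the pairwise interaction tests among a given number of loci and applies a Bonferroni correction; Bonferroni is an assumption chosen for simplicity (the abstract does not name a correction), and the function names are hypothetical.

```python
from math import comb


def pairwise_tests(n_loci: int) -> int:
    """Number of distinct locus pairs, i.e. n * (n - 1) / 2."""
    return comb(n_loci, 2)


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance level that keeps the family-wise error at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


if __name__ == "__main__":
    for n_loci in (1_000, 100_000, 500_000):
        n_tests = pairwise_tests(n_loci)
        print(n_loci, n_tests, bonferroni_threshold(n_tests))
```

Because the number of pairs grows quadratically, 500,000 loci already give roughly 1.25e11 tests, which is why an exhaustive pairwise search loses power so quickly.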

    Assessing statistical significance in multivariable genome wide association analysis

    Motivation: Although Genome Wide Association Studies (GWAS) genotype a very large number of single nucleotide polymorphisms (SNPs), the data are often analyzed one SNP at a time. The low predictive power of single SNPs, coupled with the high significance threshold needed to correct for multiple testing, greatly decreases the power of GWAS. Results: We propose a procedure in which all the SNPs are analyzed in a multiple generalized linear model, and we show its use for extremely high-dimensional datasets. Our method yields P-values for assessing significance of single SNPs or groups of SNPs while controlling for all other SNPs and the family wise error rate (FWER). Thus, our method tests whether or not a SNP carries any additional information about the phenotype beyond that available by all the other SNPs. This rules out spurious correlations between phenotypes and SNPs that can arise from marginal methods because the ‘spuriously correlated’ SNP merely happens to be correlated with the ‘truly causal’ SNP. In addition, the method offers a data driven approach to identifying and refining groups of SNPs that jointly contain informative signals about the phenotype. We demonstrate the value of our method by applying it to the seven diseases analyzed by the Wellcome Trust Case Control Consortium (WTCCC). We show, in particular, that our method is also capable of finding significant SNPs that were not identified in the original WTCCC study, but were replicated in other independent studies. Availability and implementation: Reproducibility of our research is supported by the open-source Bioconductor package hierGWAS. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online

    Assessing statistical significance in multivariable genome wide association analysis

    Motivation: Although Genome Wide Association Studies (GWAS) genotype a very large number of single nucleotide polymorphisms (SNPs), the data are often analyzed one SNP at a time. The low predictive power of single SNPs, coupled with the high significance threshold needed to correct for multiple testing, greatly decreases the power of GWAS. Results: We propose a procedure in which all the SNPs are analyzed in a multiple generalized linear model, and we show its use for extremely high-dimensional datasets. Our method yields P-values for assessing significance of single SNPs or groups of SNPs while controlling for all other SNPs and the family wise error rate (FWER). Thus, our method tests whether or not a SNP carries any additional information about the phenotype beyond that available by all the other SNPs. This rules out spurious correlations between phenotypes and SNPs that can arise from marginal methods because the ‘spuriously correlated’ SNP merely happens to be correlated with the ‘truly causal’ SNP. In addition, the method offers a data driven approach to identifying and refining groups of SNPs that jointly contain informative signals about the phenotype. We demonstrate the value of our method by applying it to the seven diseases analyzed by the Wellcome Trust Case Control Consortium (WTCCC). We show, in particular, that our method is also capable of finding significant SNPs that were not identified in the original WTCCC study, but were replicated in other independent studies. Availability and implementation: Reproducibility of our research is supported by the open-source Bioconductor package hierGWAS. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online. E.F. and L.B. gratefully acknowledge financial support from the European Research Council (grant 295642, The Foundations of Economic Preferences, FEP). D.S.
gratefully acknowledges financial support from the German National Science Foundation (DFG, grant SCHU 2828/2-1, Inference statistical methods for behavioral genetics and neuroeconomics). A.N. gratefully acknowledges support from the Instituto de Salud Carlos III (grants RD12/0032/0011 and PT13/0001/0026) and the Spanish Government Grant (BFU2012-38236) and from FEDER