Search CORE

Directory of Open Access Journals

Journal of Statistical Software

Multi-allelic exact tests for Hardy-Weinberg equilibrium that account for gender

Author: Graffelman Jan
Weir B.S.
Publication venue: 'Wiley'
Publication date: 06/08/2017
Field of study

Statistical tests for Hardy–Weinberg equilibrium are important elementary tools in genetic data analysis. X-chromosomal variants have long been tested by applying autosomal test procedures to females only, and gender is usually not considered when testing autosomal variants for equilibrium. Recently, we proposed specific Xchromosomal exact test procedures for bi-allelic variants that include the hemizygous males, as well as autosomal tests that consider gender. In this study, we present the extension of the previous work for variants with multiple alleles. A full enumeration algorithm is used for the exact calculations of tri-allelic variants. For variants with many alternate alleles, we use a permutation test. Some empirical examples with data from the 1,000 genomes project are discussed.Peer ReviewedPostprint (published version

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

On the testing of Hardy-Weinberg proportions and equality of allele frequencies in males and females at biallelic genetic markers

Author: Graffelman Jan
Weir B.S.
Publication venue: 'Wiley'
Publication date: 01/01/2018
Field of study

Standard statistical tests for equality of allele frequencies in males and females and tests for Hardy-Weinberg equilibrium are tightly linked by their assumptions. Tests for equality of allele frequencies assume Hardy-Weinberg equilibrium, whereas the usual chi-square or exact test for Hardy-Weinberg equilibrium assume equality of allele frequencies in the sexes. In this paper, we propose ways to break this interdependence in assumptions of the two tests by proposing an omnibus exact test that can test both hypotheses jointly, as well as a likelihood ratio approach that permits these phenomena to be tested both jointly and separately. The tests are illustrated with data from the 1000 Genomes project.Peer ReviewedPostprint (author's final draft

Diposit Digital de la Universitat de Barcelona

Statistical inference for hardy-weinberg proportions in the presence of missing genotype information

Author: Cook Samantha
Graffelman Jan
Moreno Aguado Víctor
Sánchez Milagros
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 09/02/2016
Field of study

In genetic association studies, tests for Hardy-Weinberg proportions are often employed as a quality control checking procedure. Missing genotypes are typically discarded prior to testing. In this paper we show that inference for Hardy-Weinberg proportions can be biased when missing values are discarded. We propose to use multiple imputation of missing values in order to improve inference for Hardy-Weinberg proportions. For imputation we employ a multinomial logit model that uses information from allele intensities and/or neighbouring markers. Analysis of an empirical data set of single nucleotide polymorphisms possibly related to colon cancer reveals that missing genotypes are not missing completely at random. Deviation from Hardy-Weinberg proportions is mostly due to a lack of heterozygotes. Inbreeding coefficients estimated by multiple imputation of the missings are typically lowered with respect to inbreeding coefficients estimated by discarding the missings. Accounting for missings by multiple imputation qualitatively changed the results of 10 to 17% of the statistical tests performed. Estimates of inbreeding coefficients obtained by multiple imputation showed high correlation with estimates obtained by single imputation using an external reference panel. Our conclusion is that imputation of missing data leads to improved statistical inference for Hardy-Weinberg proportions

A genome-wide study of Hardy–Weinberg equilibrium with next generation sequence data

Author: AM Veerappa
Bruce Weir
BS Weir
Deepti Jain
DL Hartl
G Benson
I Gomes
J Graffelman
J Graffelman
J Graffelman
JA Bailey
JA Traherne
Jan Graffelman
JE Wigginton
JF Crow
JS Beckmann
L Crooks
L Hosking
R Nielsen
RV Rohlfs
SM Leal
TH Emigh
The MHC sequencing consortium
YY Teo
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2017
Field of study

Statistical tests for Hardy–Weinberg equilibrium have been an important tool for detecting genotyping errors in the past, and remain important in the quality control of next generation sequence data. In this paper, we analyze complete chromosomes of the 1000 genomes project by using exact test procedures for autosomal and X-chromosomal variants. We find that the rate of disequilibrium largely exceeds what might be expected by chance alone for all chromosomes. Observed disequilibrium is, in about 60% of the cases, due to heterozygote excess. We suggest that most excess disequilibrium can be explained by sequencing problems, and hypothesize mechanisms that can explain exceptional heterozygosities. We report higher rates of disequilibrium for the MHC region on chromosome 6, regions flanking centromeres and p-arms of acrocentric chromosomes. We also detected long-range haplotypes and areas with incidental high disequilibrium. We report disequilibrium to be related to read depth, with variants having extreme read depths being more likely to be out of equilibrium. Disequilibrium rates were found to be 11 times higher in segmental duplications and simple tandem repeat regions. The variants with significant disequilibrium are seen to be concentrated in these areas. For next generation sequence data, Hardy–Weinberg disequilibrium seems to be a major indicator for copy number variation.Peer ReviewedPostprint (published version

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Springer - Publisher Connector

arXiv.org e-Print Archive

On the visualisation of the correlation matrix

Author: Graffelman Jan
Publication venue
Publication date: 23/01/2024
Field of study

Extensions of earlier algorithms and enhanced visualization techniques for approximating a correlation matrix are presented. The visualization problems that result from using column or colum--and--row adjusted correlation matrices, which give numerically a better fit, are addressed. For visualization of a correlation matrix a weighted alternating least squares algorithm is used, with either a single scalar adjustment, or a column-only adjustment with symmetric factorization; these choices form a compromise between the numerical accuracy of the approximation and the comprehensibility of the obtained correlation biplots. Some illustrative examples are discussed.Comment: 23 pages, 5 figure

Contributions to the multivariate Analysis of Marine Environmental Monitoring

Author: Graffelman Jan
Publication venue: Universitat Politècnica de Catalunya
Publication date: 01/01/2000
Field of study

The thesis parts from the view that statistics starts with data, and starts by introducing the data sets studied: marine benthic species counts and chemical measurements made at a set of sites in the Norwegian Ekofisk oil field, with replicates and annually repeated. An introductory chapter details the sampling procedure and shows with reliability calculations that the (transformed) chemical variables have excellent reliability, whereas the biological variables have poor reliability, except for a small subset of abundant species. Transformed chemical variables are shown to be approximately normal. Bootstrap methods are used to assess whether the biological variables follow a Poisson distribution, and lead to the conclusion that the Poisson distribution must be rejected, except for rare species. A separate chapter details more work on the distribution of the species variables: truncated and zero-inflated Poisson distributions as well as Poisson mixtures are used in order to account for sparseness and overdispersion. Species are thought to respond to environmental variables, and regressions of the abundance of a few selected species onto chemical variables are reported. For rare species, logistic regression and Poisson regression are the tools considered, though there are problems of overdispersion. For abundant species, random coefficient models are needed in order to cope with intraclass correlation. The environmental variables, mainly heavy metals, are highly correlated, leading to multicollinearity problems. The next chapters use a multivariate approach, where all species data is now treated simultaneously. The theory of correspondence analysis is reviewed, and some theoretical results on this method are reported (bounds for singular values, centring matrices). An applied chapter discusses the correspondence analysis of the species data in detail, detects outliers, addresses stability issues, and considers different ways of stacking data matrices to obtain an integrated analysis of several years of data, and to decompose variation into a within-sites and between-sites component. More than 40 % of the total inertia is due to variation within stations. Principal components analysis is used to analyse the set of chemical variables. Attempts are made to integrate the analysis of the biological and chemical variables. A detailed theoretical development shows how continuous variables can be mapped in an optimal manner as supplementary vectors into a correspondence analysis biplot. Geometrical properties are worked out in detail, and measures for the quality of the display are given, whereas artificial data and data from the monitoring survey are used to illustrate the theory developed. The theory of display of supplementary variables in biplots is also worked out in detail for principal component analysis, with attention for the different types of scaling, and optimality of displayed correlations. A theoretical chapter follows that gives an in depth theoretical treatment of canonical correspondence analysis, (linearly constrained correspondence analysis, CCA for short) detailing many mathematical properties and aspects of this multivariate method, such as geometrical properties, biplots, use of generalized inverses, relationships with other methods, etc. Some applications of CCA to the survey data are dealt with in a separate chapter, with their interpretation and indication of the quality of the display of the different matrices involved in the analysis. Weighted principal component analysis of weighted averages is proposed as an alternative for CCA. This leads to a better display of the weighted averages of the species, and in the cases so far studied, also leads to biplots with a higher amount of explained variance for the environmental data. The thesis closes with a bibliography and outlines some suggestions for further research, such as a the generalization of canonical correlation analysis for working with singular covariance matrices, the use partial least squares methods to account for the excess of predictors, and data fusion problems to estimate missing biological data.Postprint (published version

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Secretaría de Estado de Cultura

Diposit Digital de Documents de la UAB

Book review

Author: Graffelman Jan
Publication venue
Publication date: 01/01/2010
Field of study

Obra ressenyada: Michael GREENACRE, Biplots in Practice. Rubes Editorial, 2009

Maximum-likelihood estimation of the geometric niche preemption model

Author: Graffelman Jan
Publication venue: 'Wiley'
Publication date: 01/12/2021
Field of study

This is the peer reviewed version of the following article: Graffelman, J. Maximum-likelihood estimation of the geometric niche preemption model. "Ecosphere", Desembre 2021, vol. 12, núm. 12, p. e03834:1-e03834:12., which has been published in final form at https://esajournals.onlinelibrary.wiley.com/doi/10.1002/ecs2.3834. This article may be used for non-commercial purposes in accordance with Wiley Terms and Conditions for Self-Archiving.The geometric series or niche preemption model is an elementary ecological model in biodiversity studies. The preemption parameter of this model is usually estimated by regression or iteratively by using May’s equation. This article proposes a maximum-likelihood estimator for the niche preemption model, assuming a known number of species and multinomial sampling. A simulation study shows that the maximum-likelihood estimator outperforms the classical estimators in this context in terms of bias and precision. We obtain the distribution of the maximum-likelihood estimator and use it to obtain confidence intervals for the preemption parameter and to develop a preemption t test that can address the hypothesis of equal geometric decay in two samples. We illustrate the use of the new estimator with some empirical data sets taken from the literature and provide software for its use.This work was partially supported by grantsRTI2018-095518-B-C22 of the Spanish Ministry ofScience, Innovation and Universities and the Euro-pean Regional Development Fund and by grant R01GM075091 from the United States National Institutesof Health. I thank two anonymous reviewers whosecomments on the manuscript have helped toimprove it. The author declares there are no conflictsof interest.Peer ReviewedPostprint (author's final draft