253 research outputs found
Exploring Diallelic genetic markers: The Hardy Weinberg package
Testing genetic markers for Hardy-Weinberg equilibrium is an important issue in genetic association studies. The HardyWeinberg package o ers the classical tests for equilibrium, functions for power computation and for the simulation of marker data under equilibrium and disequilibrium. Functions for testing equilibrium in the presence of missing data by using multiple imputation are provided. The package also supplies various graphical tools such as ternary plots with acceptance regions, log-ratio plots and Q-Q plots for exploring the equilibrium status of a large set of diallelic markers. Classical
tests for equilibrium and graphical representations for diallelic marker data are reviewed. Several data sets illustrate the use of the package.Postprint (published version
Multi-allelic exact tests for Hardy-Weinberg equilibrium that account for gender
Statistical tests for Hardy–Weinberg equilibrium are important elementary tools in genetic data analysis. X-chromosomal variants have long been tested by applying autosomal test procedures to females only, and gender is usually not considered when testing autosomal variants for equilibrium. Recently, we proposed specific Xchromosomal exact test procedures for bi-allelic variants that include the hemizygous males, as well as autosomal tests that consider gender. In this study, we present the extension of the previous work for variants with multiple alleles. A full enumeration algorithm is used for the exact calculations of tri-allelic variants. For variants with many alternate alleles, we use a permutation test. Some empirical examples with data from the 1,000 genomes project are discussed.Peer ReviewedPostprint (published version
On the testing of Hardy-Weinberg proportions and equality of allele frequencies in males and females at biallelic genetic markers
Standard statistical tests for equality of allele frequencies in males and females and tests for Hardy-Weinberg equilibrium are tightly linked by their assumptions. Tests for equality of allele frequencies assume Hardy-Weinberg equilibrium, whereas the usual chi-square or exact test for Hardy-Weinberg equilibrium assume equality of allele frequencies in the sexes. In this paper, we propose ways to break this interdependence in assumptions of the two tests by proposing an omnibus exact test that can test both hypotheses jointly, as well as a likelihood ratio approach that permits these phenomena to be tested both jointly and separately. The tests are illustrated with data from the 1000 Genomes project.Peer ReviewedPostprint (author's final draft
Statistical inference for hardy-weinberg proportions in the presence of missing genotype information
In genetic association studies, tests for Hardy-Weinberg proportions are often employed as a quality control checking procedure. Missing genotypes are typically discarded prior to testing. In this paper we show that inference for Hardy-Weinberg proportions can be biased when missing values are discarded. We propose to use multiple imputation of missing values in order to improve inference for Hardy-Weinberg proportions. For imputation we employ a multinomial logit model that uses information from allele intensities and/or neighbouring markers. Analysis of an empirical data set of single nucleotide polymorphisms possibly related to colon cancer reveals that missing genotypes are not missing completely at random. Deviation from Hardy-Weinberg proportions is mostly due to a lack of heterozygotes. Inbreeding coefficients estimated by multiple imputation of the missings are typically lowered with respect to inbreeding coefficients estimated by discarding the missings. Accounting for missings by multiple imputation qualitatively changed the results of 10 to 17% of the statistical tests performed. Estimates of inbreeding coefficients obtained by multiple imputation showed high correlation with estimates obtained by single imputation using an external reference panel. Our conclusion is that imputation of missing data leads to improved statistical inference for Hardy-Weinberg proportions
A genome-wide study of Hardy–Weinberg equilibrium with next generation sequence data
Statistical tests for Hardy–Weinberg equilibrium have been an important tool for detecting genotyping errors in the past, and remain important in the quality control of next generation sequence data. In this paper, we analyze complete chromosomes of the 1000 genomes project by using exact test procedures for autosomal and X-chromosomal variants. We find that the rate of disequilibrium largely exceeds what might be expected by chance alone for all chromosomes. Observed disequilibrium is, in about 60% of the cases, due to heterozygote excess. We suggest that most excess disequilibrium can be explained by sequencing problems, and hypothesize mechanisms that can explain exceptional heterozygosities. We report higher rates of disequilibrium for the MHC region on chromosome 6, regions flanking centromeres and p-arms of acrocentric chromosomes. We also detected long-range haplotypes and areas with incidental high disequilibrium. We report disequilibrium to be related to read depth, with variants having extreme read depths being more likely to be out of equilibrium. Disequilibrium rates were found to be 11 times higher in segmental duplications and simple tandem repeat regions. The variants with significant disequilibrium are seen to be concentrated in these areas. For next generation sequence data, Hardy–Weinberg disequilibrium seems to be a major indicator for copy number variation.Peer ReviewedPostprint (published version
On the visualisation of the correlation matrix
Extensions of earlier algorithms and enhanced visualization techniques for
approximating a correlation matrix are presented. The visualization problems
that result from using column or colum--and--row adjusted correlation matrices,
which give numerically a better fit, are addressed. For visualization of a
correlation matrix a weighted alternating least squares algorithm is used, with
either a single scalar adjustment, or a column-only adjustment with symmetric
factorization; these choices form a compromise between the numerical accuracy
of the approximation and the comprehensibility of the obtained correlation
biplots. Some illustrative examples are discussed.Comment: 23 pages, 5 figure
Contributions to the multivariate Analysis of Marine Environmental Monitoring
The thesis parts from the view that statistics starts with data, and starts by introducing the data sets studied: marine benthic species counts and chemical measurements made at a set of sites in the Norwegian Ekofisk oil field, with replicates and annually repeated. An introductory chapter details the sampling procedure and shows with reliability calculations that the (transformed) chemical variables have excellent reliability, whereas the biological variables have poor reliability, except for a small subset of abundant species. Transformed chemical variables are shown to be approximately normal. Bootstrap methods are used to assess whether the biological variables follow a Poisson distribution, and lead to the conclusion that the Poisson distribution must be rejected, except for rare species. A separate chapter details more work on the distribution of the species variables: truncated and zero-inflated Poisson distributions as well as Poisson mixtures are used in order to account for sparseness and overdispersion. Species are thought to respond to environmental variables, and regressions of the abundance of a few selected species onto chemical variables are reported. For rare species, logistic regression and Poisson regression are the tools considered, though there are problems of overdispersion. For abundant species, random coefficient models are needed in order to cope with intraclass correlation. The environmental variables, mainly heavy metals, are highly correlated, leading to multicollinearity problems. The next chapters use a multivariate approach, where all species data is now treated simultaneously. The theory of correspondence analysis is reviewed, and some theoretical results on this method are reported (bounds for singular values, centring matrices). An applied chapter discusses the correspondence analysis of the species data in detail, detects outliers, addresses stability issues, and considers different ways of stacking data matrices to obtain an integrated analysis of several years of data, and to decompose variation into a within-sites and between-sites component. More than 40 % of the total inertia is due to variation within stations. Principal components analysis is used to analyse the set of chemical variables. Attempts are made to integrate the analysis of the biological and chemical variables. A detailed theoretical development shows how continuous variables can be mapped in an optimal manner as supplementary vectors into a correspondence analysis biplot. Geometrical properties are worked out in detail, and measures for the quality of the display are given, whereas artificial data and data from the monitoring survey are used to illustrate the theory developed. The theory of display of supplementary variables in biplots is also worked out in detail for principal component analysis, with attention for the different types of scaling, and optimality of displayed correlations. A theoretical chapter follows that gives an in depth theoretical treatment of canonical correspondence analysis, (linearly constrained correspondence analysis, CCA for short) detailing many mathematical properties and aspects of this multivariate method, such as geometrical properties, biplots, use of generalized inverses, relationships with other methods, etc. Some applications of CCA to the survey data are dealt with in a separate chapter, with their interpretation and indication of the quality of the display of the different matrices involved in the analysis. Weighted principal component analysis of weighted averages is proposed as an alternative for CCA. This leads to a better display of the weighted averages of the species, and in the cases so far studied, also leads to biplots with a higher amount of explained variance for the environmental data. The thesis closes with a bibliography and outlines some suggestions for further research, such as a the generalization of canonical correlation analysis for working with singular covariance matrices, the use partial least squares methods to account for the excess of predictors, and data fusion problems to estimate missing biological data.Postprint (published version
Maximum-likelihood estimation of the geometric niche preemption model
This is the peer reviewed version of the following article: Graffelman, J. Maximum-likelihood estimation of the geometric niche preemption model. "Ecosphere", Desembre 2021, vol. 12, núm. 12, p. e03834:1-e03834:12., which has been published in final form at https://esajournals.onlinelibrary.wiley.com/doi/10.1002/ecs2.3834. This article may be used for non-commercial purposes in accordance with Wiley Terms and Conditions for Self-Archiving.The geometric series or niche preemption model is an elementary ecological model in biodiversity studies. The preemption parameter of this model is usually estimated by regression or iteratively by using May’s equation. This article proposes a maximum-likelihood estimator for the niche preemption model, assuming a known number of species and multinomial sampling. A simulation study shows that the maximum-likelihood estimator outperforms the classical estimators in this context in terms of bias and precision. We obtain the distribution of the maximum-likelihood estimator and use it to obtain confidence intervals for the preemption parameter and to develop a preemption t test that can address the hypothesis of equal geometric decay in two samples. We illustrate the use of the new estimator with some empirical data sets taken from the literature and provide software for its use.This work was partially supported by grantsRTI2018-095518-B-C22 of the Spanish Ministry ofScience, Innovation and Universities and the Euro-pean Regional Development Fund and by grant R01GM075091 from the United States National Institutesof Health. I thank two anonymous reviewers whosecomments on the manuscript have helped toimprove it. The author declares there are no conflictsof interest.Peer ReviewedPostprint (author's final draft
- …
