Contributions to the multivariate Analysis of Marine Environmental Monitoring
The thesis starts from the view that statistics begins with data, and opens by introducing the data sets studied: marine benthic species counts and chemical measurements made at a set of sites in the Norwegian Ekofisk oil field, with replicates and annually repeated. An introductory chapter details the sampling procedure and shows with reliability calculations that the (transformed) chemical variables have excellent reliability, whereas the biological variables have poor reliability, except for a small subset of abundant species. Transformed chemical variables are shown to be approximately normal. Bootstrap methods are used to assess whether the biological variables follow a Poisson distribution, and lead to the conclusion that the Poisson distribution must be rejected, except for rare species. A separate chapter details more work on the distribution of the species variables: truncated and zero-inflated Poisson distributions as well as Poisson mixtures are used in order to account for sparseness and overdispersion. Species are thought to respond to environmental variables, and regressions of the abundance of a few selected species onto chemical variables are reported. For rare species, logistic regression and Poisson regression are the tools considered, though there are problems of overdispersion. For abundant species, random coefficient models are needed in order to cope with intraclass correlation. The environmental variables, mainly heavy metals, are highly correlated, leading to multicollinearity problems. The next chapters use a multivariate approach, in which all species data are treated simultaneously. The theory of correspondence analysis is reviewed, and some theoretical results on this method are reported (bounds for singular values, centring matrices).
An applied chapter discusses the correspondence analysis of the species data in detail, detects outliers, addresses stability issues, and considers different ways of stacking data matrices to obtain an integrated analysis of several years of data, and to decompose variation into a within-sites and a between-sites component. More than 40 % of the total inertia is due to variation within stations. Principal components analysis is used to analyse the set of chemical variables. Attempts are made to integrate the analysis of the biological and chemical variables. A detailed theoretical development shows how continuous variables can be mapped in an optimal manner as supplementary vectors into a correspondence analysis biplot. Geometrical properties are worked out in detail, and measures for the quality of the display are given; artificial data and data from the monitoring survey are used to illustrate the theory developed. The theory of display of supplementary variables in biplots is also worked out in detail for principal component analysis, with attention to the different types of scaling and the optimality of displayed correlations. A theoretical chapter follows that gives an in-depth treatment of canonical correspondence analysis (linearly constrained correspondence analysis, CCA for short), detailing many mathematical properties and aspects of this multivariate method, such as geometrical properties, biplots, use of generalized inverses, relationships with other methods, etc. Some applications of CCA to the survey data are dealt with in a separate chapter, with their interpretation and an indication of the quality of the display of the different matrices involved in the analysis. Weighted principal component analysis of weighted averages is proposed as an alternative to CCA.
This leads to a better display of the weighted averages of the species and, in the cases studied so far, also leads to biplots with a higher amount of explained variance for the environmental data. The thesis closes with a bibliography and outlines some suggestions for further research, such as the generalization of canonical correlation analysis for working with singular covariance matrices, the use of partial least squares methods to account for the excess of predictors, and data fusion problems to estimate missing biological data. Postprint (published version)
On the visualisation of the correlation matrix
Extensions of earlier algorithms and enhanced visualization techniques for
approximating a correlation matrix are presented. The visualization problems
that result from using column or column-and-row adjusted correlation matrices,
which give numerically a better fit, are addressed. For visualization of a
correlation matrix a weighted alternating least squares algorithm is used, with
either a single scalar adjustment, or a column-only adjustment with symmetric
factorization; these choices form a compromise between the numerical accuracy
of the approximation and the comprehensibility of the obtained correlation
biplots. Some illustrative examples are discussed. Comment: 23 pages, 5 figures
Book review
Reviewed work: Michael GREENACRE, Biplots in Practice. Rubes Editorial, 2009
Statistical inference for Hardy-Weinberg equilibrium using log-ratio coordinates
Testing markers for Hardy-Weinberg equilibrium (HWE) is an important step in the analysis of
large databases used in genetic association studies. Gross deviation from HWE can be indicative of
genotyping error. There are many approaches to testing markers for HWE. The classical chi-square
test was, until recently, the most widely used approach to HWE-testing. Over the last decade, the
computationally more demanding exact test has become more popular. Bayesian approaches, where
the full posterior distribution of a disequilibrium parameter is obtained, have also been developed.
As far as CODA is concerned, Aitchison described how the HWE law can be “discovered” when
a set of samples, all genotyped for the same marker, is analyzed by log-ratio principal component
analysis. A well-known tool in CODA, the ternary plot, is known in genetics as a de Finetti
diagram. The Hardy-Weinberg law defines a parabola in a ternary plot of the three genotype
frequencies of a bi-allelic marker. Ternary plots of bi-allelic genetic markers typically show points
that “follow” the parabola, though with certain scatter that depends on the sample size. When
represented in additive, centered or isometric log-ratio coordinates, the HW parabola becomes a
straight line. Much of CODA is concerned with data sets where each individual row in the data
set (an individual, a sample, an object) constitutes a composition. In data sets comprising genetic
markers, individual rows (persons) are not really compositions, but it is the total sample of all
individuals that constitutes a composition. The CODA approach to genetic data has proven useful
in supplying interesting graphics, but to date CODA seems not to have provided formal statistical
inference for HWE, probably because the distribution of the log-ratio coordinates is not known.
Nevertheless, the log-ratio approach directly suggests some statistics that can be used for measuring
disequilibrium: the second clr and the second ilr coordinate of the sample. Similar statistics have
been used in the genetics literature. In this contribution, we will use the multivariate delta method
to derive the asymptotic distribution of the isometric log-ratio coordinates. This allows hypothesis
testing for HWE and the construction of confidence intervals for large samples that contain no
zeros. The type 1 error rate of the test is compared with the classical chi-square test.
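As a rough illustration of the quantities mentioned in this abstract, the following Python sketch computes the classical chi-square statistic for HWE (without continuity correction) and one possible isometric log-ratio coordinate that contrasts the heterozygote with the two homozygotes. The function names and the particular ilr basis are assumptions made for illustration and are not taken from the contribution itself. Under exact Hardy-Weinberg proportions the heterozygote frequency is 2pq and the homozygote frequencies are p^2 and q^2, so this coordinate equals sqrt(2/3) ln 2 for any allele frequency, which is the straight-line property described above.

```python
import math


def hwe_chisq(n_aa, n_ab, n_bb):
    """Classical chi-square test for HWE (1 df, no continuity correction)."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # sample frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    pval = math.erfc(math.sqrt(stat / 2.0))
    return stat, pval


def ilr_heterozygote_balance(n_aa, n_ab, n_bb):
    """One ilr coordinate (assumed basis): AB against the pair (AA, BB).

    The coordinate is scale invariant, so counts can be used directly.
    It is undefined when any genotype count is zero.
    """
    if min(n_aa, n_ab, n_bb) <= 0:
        raise ValueError("log-ratio coordinates require non-zero counts")
    return math.sqrt(2.0 / 3.0) * math.log(n_ab / math.sqrt(n_aa * n_bb))


# Value of the coordinate under exact Hardy-Weinberg proportions.
HWE_REFERENCE = math.sqrt(2.0 / 3.0) * math.log(2.0)
```

Values of the coordinate above `HWE_REFERENCE` indicate heterozygote excess and values below it indicate heterozygote deficit; formal inference would require the asymptotic distribution that the contribution derives with the delta method, which this sketch does not reproduce.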
Exploring Diallelic Genetic Markers: The HardyWeinberg Package
Testing genetic markers for Hardy-Weinberg equilibrium is an important issue in genetic association studies. The HardyWeinberg package offers the classical tests for equilibrium, functions for power computation and for the simulation of marker data under equilibrium and disequilibrium. Functions for testing equilibrium in the presence of missing data by using multiple imputation are provided. The package also supplies various graphical tools such as ternary plots with acceptance regions, log-ratio plots and Q-Q plots for exploring the equilibrium status of a large set of diallelic markers. Classical tests for equilibrium and graphical representations for diallelic marker data are reviewed. Several data sets illustrate the use of the package. Postprint (published version)
Maximum-likelihood estimation of the geometric niche preemption model
This is the peer reviewed version of the following article: Graffelman, J. Maximum-likelihood estimation of the geometric niche preemption model. "Ecosphere", December 2021, vol. 12, no. 12, p. e03834:1-e03834:12, which has been published in final form at https://esajournals.onlinelibrary.wiley.com/doi/10.1002/ecs2.3834. This article may be used for non-commercial purposes in accordance with Wiley Terms and Conditions for Self-Archiving. The geometric series or niche preemption model is an elementary ecological model in biodiversity studies. The preemption parameter of this model is usually estimated by regression or iteratively by using May’s equation. This article proposes a maximum-likelihood estimator for the niche preemption model, assuming a known number of species and multinomial sampling. A simulation study shows that the maximum-likelihood estimator outperforms the classical estimators in this context in terms of bias and precision. We obtain the distribution of the maximum-likelihood estimator and use it to obtain confidence intervals for the preemption parameter and to develop a preemption t test that can address the hypothesis of equal geometric decay in two samples. We illustrate the use of the new estimator with some empirical data sets taken from the literature and provide software for its use. This work was partially supported by grant RTI2018-095518-B-C22 of the Spanish Ministry of Science, Innovation and Universities and the European Regional Development Fund and by grant R01GM075091 from the United States National Institutes of Health. I thank two anonymous reviewers whose comments on the manuscript have helped to improve it. The author declares there are no conflicts of interest. Peer Reviewed. Postprint (author's final draft)
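To make the estimation problem concrete, here is a minimal Python sketch that maximizes a multinomial log-likelihood for the geometric series numerically. It assumes the common textbook form of the model, in which the relative abundance of the species of rank i (i = 1, ..., S) is k(1-k)^(i-1) / (1 - (1-k)^S), and it assumes that species are assigned to ranks by sorting the counts in decreasing order. The function names and the golden-section search are illustrative choices; the article's own estimator, its distribution theory and its software are not reproduced here.

```python
import math


def geometric_series_probs(k, s):
    """Relative abundances of ranks 1..s under the (assumed) truncated form."""
    norm = -math.expm1(s * math.log1p(-k))  # 1 - (1 - k)**s, computed stably
    return [k * (1.0 - k) ** i / norm for i in range(s)]


def log_likelihood(k, counts):
    """Multinomial log-likelihood up to an additive constant.

    `counts` must already be sorted in decreasing order (rank 1 first).
    """
    s = len(counts)
    n = sum(counts)
    rank_sum = sum(i * c for i, c in enumerate(counts))  # sum of (rank-1)*count
    norm = -math.expm1(s * math.log1p(-k))
    return n * math.log(k) + rank_sum * math.log1p(-k) - n * math.log(norm)


def fit_preemption(counts, tol=1e-10):
    """Golden-section search for the k in (0, 1) maximizing the likelihood."""
    ranked = sorted(counts, reverse=True)
    lo, hi = 1e-9, 1.0 - 1e-9
    invphi = (math.sqrt(5.0) - 1.0) / 2.0
    while hi - lo > tol:
        a = hi - invphi * (hi - lo)
        b = lo + invphi * (hi - lo)
        if log_likelihood(a, ranked) < log_likelihood(b, ranked):
            lo = a  # maximum lies to the right of a
        else:
            hi = b  # maximum lies to the left of b
    return (lo + hi) / 2.0
```

For example, `fit_preemption([80, 40, 20, 10])` searches for the preemption parameter of a four-species sample. A one-dimensional search is adequate here because the truncated geometric distribution is an exponential family in log(1-k), so the log-likelihood has a single maximum in k.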
Improved approximation and visualization of the correlation matrix
This is an Accepted Manuscript of an article published by Taylor & Francis Group in American Statistician on 11/04/2023, available online at: http://www.tandfonline.com/10.1080/00031305.2023.2186952 The graphical representation of the correlation matrix by means of different multivariate statistical methods is reviewed, a comparison of the different procedures is presented with the use of an example dataset, and an improved representation with better fit is proposed. Principal component analysis is widely used for making pictures of correlation structure, though, as shown, a weighted alternating least squares approach that avoids the fitting of the diagonal of the correlation matrix outperforms both principal component analysis and principal factor analysis in approximating a correlation matrix. Weighted alternating least squares is a very strong competitor for principal component analysis, in particular if the correlation matrix is the focus of the study, because it improves the representation of the correlation matrix, often at the expense of only a minor percentage of explained variance for the original data matrix, if the latter is mapped onto the correlation biplot by regression. In this article, we propose to combine weighted alternating least squares with an additive adjustment of the correlation matrix, and this is seen to lead to further improved approximation of the correlation matrix. This work was supported by the Spanish Ministry of Science and Innovation and the European Regional Development Fund under grant PID2021-125380OB-I00 (MCIN/AEI/FEDER); and the National Institutes of Health under Grant GM075091. Peer Reviewed. Postprint (published version)
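The idea of fitting a correlation matrix while giving zero weight to its diagonal can be sketched in a few lines of Python. The example below is a deliberately simplified rank-one version with 0/1 weights, written for illustration only: the function names are hypothetical, the article works with higher-rank biplots and an additional additive adjustment, and neither is reproduced here.

```python
def wals_rank1(r, n_iter=200):
    """Rank-one alternating least squares fit r[i][j] ~ a[i] * b[j].

    Diagonal cells receive zero weight, so only off-diagonal
    correlations influence the fit.
    """
    p = len(r)
    a = [1.0] * p
    b = [1.0] * p
    for _ in range(n_iter):
        # With b fixed, each a[i] is a least-squares slope over cells j != i.
        a = [sum(r[i][j] * b[j] for j in range(p) if j != i)
             / sum(b[j] ** 2 for j in range(p) if j != i)
             for i in range(p)]
        # With a fixed, each b[j] is a least-squares slope over cells i != j.
        b = [sum(r[i][j] * a[i] for i in range(p) if i != j)
             / sum(a[i] ** 2 for i in range(p) if i != j)
             for j in range(p)]
    return a, b


def offdiag_rss(r, a, b):
    """Residual sum of squares over the off-diagonal cells only."""
    p = len(r)
    return sum((r[i][j] - a[i] * b[j]) ** 2
               for i in range(p) for j in range(p) if i != j)
```

Because the diagonal is ignored, a matrix whose off-diagonal entries have an exact one-factor structure can be reproduced without error off the diagonal, whereas a method that also has to fit the unit diagonal generally cannot achieve this with a single dimension.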
Multi-allelic exact tests for Hardy-Weinberg equilibrium that account for gender
Statistical tests for Hardy–Weinberg equilibrium are important elementary tools in genetic data analysis. X-chromosomal variants have long been tested by applying autosomal test procedures to females only, and gender is usually not considered when testing autosomal variants for equilibrium. Recently, we proposed specific X-chromosomal exact test procedures for bi-allelic variants that include the hemizygous males, as well as autosomal tests that consider gender. In this study, we present the extension of the previous work for variants with multiple alleles. A full enumeration algorithm is used for the exact calculations of tri-allelic variants. For variants with many alternate alleles, we use a permutation test. Some empirical examples with data from the 1000 Genomes Project are discussed. Peer Reviewed. Postprint (published version)
On the testing of Hardy-Weinberg proportions and equality of allele frequencies in males and females at biallelic genetic markers
Standard statistical tests for equality of allele frequencies in males and females and tests for Hardy-Weinberg equilibrium are tightly linked by their assumptions. Tests for equality of allele frequencies assume Hardy-Weinberg equilibrium, whereas the usual chi-square or exact test for Hardy-Weinberg equilibrium assumes equality of allele frequencies in the sexes. In this paper, we propose ways to break this interdependence in the assumptions of the two tests by proposing an omnibus exact test that can test both hypotheses jointly, as well as a likelihood ratio approach that permits these phenomena to be tested both jointly and separately. The tests are illustrated with data from the 1000 Genomes Project. Peer Reviewed. Postprint (author's final draft)
- …