Revisiting Guerry's data: Introducing spatial constraints in multivariate analysis

Author: Dray Stéphane
Jombart Thibaut
Publication venue: 'Institute of Mathematical Statistics'
Publication date: 01/01/2011

Standard multivariate analysis methods aim to identify and summarize the main structures in large data sets containing the description of a number of observations by several variables. In many cases, spatial information is also available for each observation, so that a map can be associated with the multivariate data set. Two main objectives are relevant in the analysis of spatial multivariate data: summarizing covariation structures and identifying spatial patterns. In practice, achieving both goals simultaneously is a statistical challenge, and a range of methods have been developed that offer trade-offs between these two objectives. In an applied context, this methodological question has been and remains a major issue in community ecology, where species assemblages (i.e., covariation between species abundances) are often driven by spatial processes (and thus exhibit spatial patterns). In this paper we review a variety of methods developed in community ecology to investigate multivariate spatial patterns. We present different ways of incorporating spatial constraints in multivariate analysis and illustrate these different approaches using the famous data set on moral statistics in France published by André-Michel Guerry in 1833. We discuss and compare the properties of these different approaches from both a practical and a theoretical viewpoint.

Comment: Published at http://dx.doi.org/10.1214/10-AOAS356 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

arXiv.org e-Print Archive

Crossref

INRIA a CCSD electronic archive server

HAL Descartes

Clustering student skill set profiles in a unit hypercube using mixtures of multivariate betas

Author: Dean Nema
Nugent Rebecca
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 22/08/2013

This paper presents a finite mixture of multivariate betas as a new model-based clustering method tailored to applications where the feature space is constrained to the unit hypercube. The mixture component densities are taken to be conditionally independent, univariate unimodal beta densities (from the subclass of reparameterized beta densities given by Bagnato and Punzo 2013). The EM algorithm used to fit this mixture is discussed in detail, and results from both this beta mixture model and the more standard Gaussian model-based clustering are presented for simulated skill mastery data from a common cognitive diagnosis model and for real data from the Assistment System online mathematics tutor (Feng et al. 2009). The multivariate beta mixture appears to outperform the standard Gaussian model-based clustering approach, as would be expected on the constrained space. Fewer components are selected (by BIC-ICL) in the beta mixture than in the Gaussian mixture, and the resulting clusters seem more reasonable and interpretable.

This article is in technical report form; the final publication is available at http://www.springerlink.com/openurl.asp?genre=article&id=doi:10.1007/s11634-013-0149-z

Crossref

Enlighten

Autoregressive Kernels For Time Series

Author: Cuturi Marco
Doucet Arnaud
Publication date: 01/01/2011

We propose in this work a new family of kernels for variable-length time series. Our work builds upon the vector autoregressive (VAR) model for multivariate stochastic processes: given a multivariate time series x, we consider the likelihood function p_{\theta}(x) of different parameters \theta in the VAR model as features to describe x. To compare two time series x and x', we form the product of their features p_{\theta}(x) p_{\theta}(x'), which is integrated out w.r.t. \theta using a matrix normal-inverse Wishart prior. Among other properties, this kernel can be easily computed when the dimension d of the time series is much larger than the lengths of the considered time series x and x'. It can also be generalized to time series taking values in arbitrary state spaces, as long as the state space itself is endowed with a kernel \kappa. In that case, the kernel between x and x' is a function of the Gram matrices produced by \kappa on observations and subsequences of observations enumerated in x and x'. We describe a computationally efficient implementation of this generalization that uses low-rank matrix factorization techniques. These kernels are compared to other known kernels using a set of benchmark classification tasks carried out with support vector machines.

arXiv.org e-Print Archive

CiteSeerX

Uncertainty in phylogenetic tree estimates

Author: Bell Rayna C.
Willis Amy D.
Publication date: 12/10/2017

Estimating phylogenetic trees is an important problem in evolutionary biology, environmental policy and medicine. Although trees are estimated, their uncertainties are discarded by mathematicians working in tree space. Here we explicitly model the multivariate uncertainty of tree estimates. We consider both the cases where uncertainty information arises extrinsically (through covariate information) and intrinsically (through the tree estimates themselves). The importance of accounting for tree uncertainty in tree space is demonstrated in two case studies. In the first instance, differences between gene trees are small relative to their uncertainties, while in the second, the differences are relatively large. Our main goal is visualization of tree uncertainty, and we demonstrate advantages of our method with respect to reproducibility, speed and preservation of topological differences compared to visualization based on multidimensional scaling. The proposal highlights that phylogenetic trees are estimated in an extremely high-dimensional space, resulting in uncertainty information that cannot be discarded. Most importantly, it is a method that allows biologists to diagnose whether differences between gene trees are biologically meaningful, or due to uncertainty in estimation.

Comment: Final version accepted to Journal of Computational and Graphical Statistics

arXiv.org e-Print Archive

FigShare

Predicting the distribution of canine leishmaniasis in western Europe based on environmental variables.

Author: ADRIAN MYLNE
ANA O. FRANCO
CARLOS ALVES PIRES
CLIVE R. DAVIES
CRISTINA BALLART
FRANCISCO MORILLAS-MÁRQUEZ
Hastie
JEAN-PIERRE DEDET
JONATHAN COX
LUIGI GRADONI
MARIA ODETE AFONSO
MARINA GRAMICCIA
MONTSERRAT GÁLLEGO
PAUL D. READY
Rabe-Hesketh
Ready
Ribeiro
RICARDO MOLINA
Rioux
Rogers
ROSA GÁLVEZ
SERGIO BARÓN-LÓPEZ
Publication venue: 'Cambridge University Press (CUP)'
Publication date: 14/09/2011

The domestic dog is the reservoir host of Leishmania infantum, the causative agent of zoonotic visceral leishmaniasis endemic in Mediterranean Europe. Targeted control requires predictive risk maps of canine leishmaniasis (CanL), which are now explored. We databased 2187 published and unpublished surveys of CanL in southern Europe. A total of 947 western surveys met inclusion criteria for analysis, including serological identification of infection (504,369 dogs tested 1971-2006). Seroprevalence was 23.2% overall (median 10%). Logistic regression models within a GIS framework identified the main environmental predictors of CanL seroprevalence in Portugal, Spain, France and Italy, or in France alone. A 10-fold cross-validation approach determined model capacity to predict point-values of seroprevalence and the correct seroprevalence class (20%). Both the four-country and France-only models performed reasonably well for predicting correctly the 20% seroprevalence classes (AUC > 0.70). However, the France-only model performed much better for France than the four-country model. The four-country model adequately predicted regions of CanL emergence in northern Italy (<5% seroprevalence). Both models poorly predicted intermediate point seroprevalences (5-20%) within regional foci, because surveys were biased towards known rural foci and Mediterranean bioclimates. Our recommendations for standardizing surveys would permit higher-resolution risk mapping.

Crossref

LSHTM Research Online

Hal-Diderot

The Population Genetic Signature of Polygenic Local Adaptation

Author: Berg Jeremy J.
Coop Graham
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2014

Adaptation in response to selection on polygenic phenotypes may occur via subtle allele frequency shifts at many loci. Current population genomic techniques are not well posed to identify such signals. In the past decade, detailed knowledge about the specific loci underlying polygenic traits has begun to emerge from genome-wide association studies (GWAS). Here we combine this knowledge from GWAS with robust population genetic modeling to identify traits that may have been influenced by local adaptation. We exploit the fact that GWAS provide an estimate of the additive effect size of many loci to estimate the mean additive genetic value for a given phenotype across many populations as simple weighted sums of allele frequencies. We first describe a general model of neutral genetic value drift for an arbitrary number of populations with an arbitrary relatedness structure. Based on this model we develop methods for detecting unusually strong correlations between genetic values and specific environmental variables, as well as a generalization of Q_{ST}/F_{ST} comparisons to test for over-dispersion of genetic values among populations. Finally we lay out a framework to identify the individual populations or groups of populations that contribute to the signal of overdispersion. These tests have considerably greater power than their single locus equivalents due to the fact that they look for positive covariance between like effect alleles, and also significantly outperform methods that do not account for population structure. We apply our tests to the Human Genome Diversity Panel (HGDP) dataset using GWAS data for height, skin pigmentation, type 2 diabetes, body mass index, and two inflammatory bowel disease datasets. This analysis uncovers a number of putative signals of local adaptation, and we discuss the biological interpretation and caveats of these results.

Comment: 42 pages including 8 figures and 3 tables; supplementary figures and tables not included on this upload, but are mostly unchanged from v
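The abstract describes the mean additive genetic value of a population as a simple weighted sum of allele frequencies, with GWAS effect sizes as the weights. A minimal sketch of that arithmetic follows. The function name `mean_genetic_values`, the toy numbers, and the `ploidy=2` scaling (two allele copies per diploid individual) are illustrative assumptions, not taken from the paper's code or data.

```python
def mean_genetic_values(allele_freqs, effect_sizes, ploidy=2):
    """Return one estimated mean genetic value per population.

    allele_freqs: list of populations, each a list of per-locus frequencies
                  of the effect allele (hypothetical toy input).
    effect_sizes: per-locus additive effect sizes, as a GWAS would report.
    ploidy:       assumed number of allele copies per individual.
    """
    values = []
    for pop_freqs in allele_freqs:
        if len(pop_freqs) != len(effect_sizes):
            raise ValueError("each population needs one frequency per locus")
        # Weighted sum of allele frequencies, weights = effect sizes.
        weighted_sum = sum(a * p for a, p in zip(effect_sizes, pop_freqs))
        values.append(ploidy * weighted_sum)
    return values


# Toy example: three loci, two populations (made-up numbers).
effect_sizes = [0.5, -0.2, 0.1]
allele_freqs = [
    [0.1, 0.5, 0.9],
    [0.4, 0.5, 0.2],
]
genetic_values = mean_genetic_values(allele_freqs, effect_sizes)
print(genetic_values)
```

The tests for correlation with environmental variables and for over-dispersion mentioned in the abstract operate on values like these, but additionally require a model of relatedness among populations, which this sketch does not attempt.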

arXiv.org e-Print Archive

Directory of Open Access Journals

PubMed Central

eScholarship - University of California

FigShare