
68 research outputs found

Populations in statistical genetic modelling and inference

Author: Lawson Daniel John
Publication date: 04/06/2013

What is a population? This review considers how a population may be defined in terms of understanding the structure of the underlying genetics of the individuals involved. The main approach is to consider statistically identifiable groups of randomly mating individuals, which is well defined in theory for any type of (sexual) organism. We discuss generative models using drift, admixture and spatial structure, and the ancestral recombination graph. These are contrasted with statistical models for inference, principal component analysis and other 'non-parametric' methods. The relationships between these approaches are explored with both simulated and real-data examples. The state-of-the-art practical software tools are discussed and contrasted. We conclude that populations are a useful theoretical construct that can be well defined in theory and often approximately exist in practice.

arXiv.org e-Print Archive

CiteSeerX

A New Method to Reconstruct Recombination Events at a Genomic Scale

Author: A Auton
AJ Jeffreys
Asif Javed
Chris P. Ponting
D Posada
DC Crawford
ED Parvanov
EO Wilson
F Baudat
Francesc Calafell
GAT McVean
J Felsenstein
J Rozas
Jaume Bertranpetit
JZ Li
K Paigen
K Sturrock
KK Kidd
L Excoffier
L Parida
Laxmi Parida
M Jakobsson
M Stephens
Marc Pybus
Marta Melé
N Li
NA Rosenberg
P Scheet
RA Fisher
RR Hudson
S Myers
SF Schaffner
SJE Baird
Publication venue: Public Library of Science
Publication date: 01/01/2010

Recombination is one of the main forces shaping genome diversity, but the information it generates is often overlooked. A recombination event creates a junction between two parental sequences that may be transmitted to subsequent generations. Just like mutations, these junctions carry evidence of the shared past of the sequences. We present the IRiS algorithm, which detects past recombination events from extant sequences and specifies the location of each recombination and which sequences are the recombinants. We have validated and calibrated IRiS for the human genome using coalescent simulations replicating standard human demographic history and a variable recombination rate model, and we have fine-tuned IRiS parameters to simultaneously optimize for false discovery rate, sensitivity, and accuracy in placing the recombination events in the sequence. Newer recombinations overwrite traces of past ones, and our results indicate that more recent recombinations are detected by IRiS with greater sensitivity. IRiS analysis of the MS32 region, previously studied using sperm typing, showed good concordance with estimated recombination rates. We also applied IRiS to haplotypes for 18 X-chromosome regions in HapMap Phase III populations. Recombination events detected for each individual were recoded as binary allelic states and combined into recotypes. Principal component analysis and multidimensional scaling based on recotypes reproduced the relationships between the eleven HapMap Phase III populations that can be expected from known human population history, thus further validating IRiS. We believe that our new method will contribute to the study of the distribution of recombination events across genomes and, for the first time, will allow the use of recombination as a genetic marker to study human genetic variation.

Crossref

Directory of Open Access Journals

PubMed Central

UPF Digital Repository

ScholarlyCommons@Penn

Digital.CSIC

Bayesian Statistical Methods for Genetic Association Studies with Case-Control and Cohort Design

Author: Tachmazidou Ioanna
Publication venue: Epidemiology and Public Health, Imperial College London
Publication date: 01/03/2009

Large-scale genetic association studies are carried out with the hope of discovering single nucleotide polymorphisms involved in the etiology of complex diseases. We propose a coalescent-based model for association mapping which potentially increases the power to detect disease-susceptibility variants in genetic association studies with case-control and cohort design. The approach uses Bayesian partition modelling to cluster haplotypes with similar disease risks by exploiting evolutionary information. We focus on candidate gene regions and split the chromosomal region of interest into sub-regions or windows of high linkage disequilibrium (LD), assuming a perfect phylogeny within each. The haplotype space is then partitioned into disjoint clusters within which the phenotype-haplotype association is assumed to be the same. The novelty of our approach is that the distance used for clustering haplotypes has an evolutionary interpretation, as haplotypes are clustered according to the time to their most recent common mutation. Our approach is fully Bayesian and we develop Markov chain Monte Carlo algorithms to sample efficiently over the space of possible partitions. We have also developed a Bayesian survival regression model for high-dimensional, small-sample-size settings. We provide a Bayesian variable selection procedure and shrinkage tool by imposing shrinkage priors on the regression coefficients. We have developed a computationally efficient optimization algorithm to explore the posterior surface and find the maximum a posteriori estimates of the regression coefficients. Using simulation studies and real datasets, we compare the performance of the proposed methods with both single-marker analyses and recently proposed multi-marker methods, and show that our methods perform similarly in localizing the causal allele while yielding lower false positive rates. Moreover, our methods offer computational advantages over other multi-marker approaches.

Spiral - Imperial College Digital Repository

Multiple Advantageous Amino Acid Variants in the NAT2 Gene in Human Populations

Author: A Di Rienzo
A Husain
A Kawamura
AC Deitz
AG Clark
AM Adams
Andrea Novelletto
Andrey I. Kozlov
BF Voight
CD Bustamante
D Charlesworth
D Garrigan
DW Hein
E Patin
E Sim
Emmanuel Michalodimitrakis
ET Wang
F Di Giacomo
F Luca
F Tajima
FJ Fernandes-Costa
Francesca Luca
GA McVean
Galina Vershubsky
GF De Stefano
GH Perry
Giuseppina Bubba
GK Wong
H Liu
H Magalon
HJ Bandelt
I Cascorbi
J Rozas
J Vander Molen
J Wakeley
JC Fay
JC Stephens
JF Solus
JF Wilson
JH McDonald
JM Akey
KL Bubb
L Excoffier
LE Jensen
Lluis Quintana-Murci
M Bamshad
M Currat
M Przeworski
M Stephens
Massimo Basile
MH Schierup
Olga Rickards
PC Sabeti
PD Soloway
PS Pennings
R Nielsen
R Scozzari
Radim Brdicka
RF Minchin
RM Harding
RR Hudson
S Biswas
S Boukouvala
S Fuselli
T Tamura
Vincent Macaulay
VV Bakayev
WW Weber
Y Zang
Publication venue: Public Library of Science
Publication date: 01/01/2008

Background: Genetic variation at NAT2 has long been recognized as the cause of differential ability to metabolize a wide variety of drugs of therapeutic use. Here, we explore the pattern of genetic variation in 12 human populations that significantly extend the geographic range and resolution of previous surveys, to test the hypothesis that different dietary regimens and lifestyles may explain inter-population differences in NAT2 variation. Methodology/Principal Findings: The entire coding region was resequenced in 98 subjects and six polymorphic positions were genotyped in 150 additional subjects. A single previously undescribed variant was found (34T>C; 12Y>H). Several aspects of the data do not fit the expectations of a neutral model, as assessed by coalescent simulations. Tajima's D is positive in all populations, indicating an excess of intermediate-frequency alleles. The level of between-population differentiation is low, and is mainly accounted for by the proportion of fast vs. slow acetylators. However, haplotype frequencies differ significantly across groups of populations with different subsistence modes. Conclusions/Significance: Data on the structure of haplotypes and their frequencies are compatible with a model in which slow-causing variants were present in widely dispersed populations before major shifts to pastoralism and/or agriculture. In this model, slow-causing mutations gained a selective advantage in populations shifting from hunting-gathering to pastoralism/agriculture. We suggest the diminished dietary availability of folates resulting from the nutritional shift as the possible cause of the fitness increase associated with haplotypes carrying mutations that reduce enzymatic activity. © 2008 Luca et al.
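The abstract above relies on the sign of Tajima's D, which compares the mean number of pairwise differences with the number of segregating sites. The following is a minimal sketch of the standard definition of the statistic, not the authors' pipeline. It assumes aligned, gap-free sequences of equal length, and the function name `tajimas_d` and the toy alignments are hypothetical.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(sequences):
    """Return Tajima's D for aligned, equal-length sequences.

    Returns None when the statistic is undefined (fewer than four
    sequences, or no segregating sites).
    """
    n = len(sequences)
    if n < 4:
        return None
    length = len(sequences[0])

    # S: number of sites at which more than one allele is observed.
    seg_sites = sum(
        1 for pos in range(length) if len({s[pos] for s in sequences}) > 1
    )
    if seg_sites == 0:
        return None

    # pi: mean number of differences over all pairs of sequences.
    pairs = list(combinations(sequences, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)

    # Constants from the standard definition of the statistic.
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / sqrt(variance)


# Toy alignments (hypothetical data, for illustration only).
balanced = ["AAAA", "AAAA", "TTTT", "TTTT"]    # variants at intermediate frequency
singletons = ["AAAA", "TAAA", "ATAA", "AATA"]  # every variant is rare

print(tajimas_d(balanced))    # positive: excess of intermediate-frequency alleles
print(tajimas_d(singletons))  # negative: excess of rare alleles
```

In this sketch a positive value arises when variants sit at intermediate frequency, which is the pattern the abstract reports for all sampled populations.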

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

ART

A hierarchical Dirichlet process mixture model for haplotype reconstruction from multi-population data

Author: Sohn Kyung-Ah
Xing Eric P.
Publication venue: Institute of Mathematical Statistics
Publication date: 01/01/2008

The perennial problem of "how many clusters?" remains an issue of substantial interest in the data mining and machine learning communities, and becomes particularly salient in large data sets such as population genomic data, where the number of clusters needs to be relatively large and open-ended. This problem is further complicated in a co-clustering scenario in which one needs to solve multiple clustering problems simultaneously because of the presence of common centroids (e.g., ancestors) shared by clusters (e.g., possible descendants of a certain ancestor) from different multiple-cluster samples (e.g., different human subpopulations). In this paper we present a hierarchical nonparametric Bayesian model to address this problem in the context of multi-population haplotype inference. Uncovering the haplotypes of single nucleotide polymorphisms is essential for many biological and medical applications. While it is uncommon for the genotype data to be pooled from multiple ethnically distinct populations, few existing programs have explicitly leveraged the individual ethnic information for haplotype inference. In this paper we present a new haplotype inference program, Haploi, which makes use of such information and is readily applicable to genotype sequences with thousands of SNPs from heterogeneous populations, with competent and sometimes superior speed and accuracy compared to the state-of-the-art programs. Underlying Haploi is a new haplotype distribution model based on a nonparametric Bayesian formalism known as the hierarchical Dirichlet process, which represents a tractable surrogate to the coalescent process. The proposed model is exchangeable, unbounded, and capable of coupling demographic information of different populations. Comment: Published at http://dx.doi.org/10.1214/08-AOAS225 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

arXiv.org e-Print Archive

CiteSeerX

Crossref

Decay of linkage disequilibrium within genes across HGDP-CEPH human samples: most population isolates do not show increased LD

Author: Bertranpetit Jaume
Bosch Elena
Calafell Francesc
Casals Ferran
Comas David
Ferrer Admetlla Anna
Gardner Michelle
Graffelman Jan
Laayouni Hafid
Morcillo-Suárez Carlos
Moreno Estrada Andrés
Navarro Arcadi
Rosa Araceli
Publication venue: BioMed Central
Publication date: 28/07/2009

9 pages, 2 figures, 4 additional files. [Background] It is well known that the pattern of linkage disequilibrium varies between human populations, with remarkable geographical stratification. Indirect association studies routinely exploit linkage disequilibrium around genes, particularly in isolated populations where it is assumed to be higher. Here, we explore both the amount and the decay of linkage disequilibrium with physical distance along 211 gene regions, most of them related to complex diseases, across 39 HGDP-CEPH population samples, focusing particularly on the populations defined as isolates. Within each gene region and population we use r² between all possible single nucleotide polymorphism (SNP) pairs as a measure of linkage disequilibrium and focus on the proportion of SNP pairs with r² greater than 0.8. [Results] Although the average r² was found to be significantly different both between and within continental regions, a much higher proportion of r² variance could be attributed to differences between continental regions (2.8% vs. 0.5%, respectively). Similarly, while the proportion of SNP pairs with r² > 0.8 was significantly different across continents for all distance classes, it was generally much more homogeneous within continents, except in the case of Africa and the Americas. The only isolated populations with consistently higher LD in all distance classes with respect to their continent are the Kalash (Central South Asia) and the Surui (America). Moreover, isolated populations showed only slightly higher proportions of SNP pairs with r² > 0.8 per gene region than non-isolated populations in the same continent. Thus, the number of SNPs in isolated populations that need to be genotyped may be only slightly less than in non-isolates. [Conclusion] The "isolated population" label by itself does not guarantee a greater genotyping efficiency in association studies, and properties other than increased linkage disequilibrium may make these populations interesting in genetic epidemiology. This research was supported by "Fundación Genoma España" (proyectos piloto CEGEN 2004–2005), Dirección General de Investigación, Ministerio de Educación y Ciencia of Spain (grants BFU2005-00243, BFU2006-01235, BFU2006-15413-CO2-01, SEJ2006-13537) and Direcció General de Recerca, Generalitat de Catalunya (2005SGR00608). SNP genotyping services were provided by the Spanish "Centro Nacional de Genotipado". Peer reviewed
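The summary statistic described in this abstract, the proportion of SNP pairs whose squared correlation exceeds 0.8, can be illustrated as follows. This is a minimal sketch assuming phased, biallelic haplotypes coded 0/1; the study itself may have estimated r² differently (for example from unphased genotypes). The function names and the toy data are hypothetical.

```python
from itertools import combinations


def r_squared(snp_a, snp_b):
    """Squared correlation between two biallelic SNPs.

    Each argument lists the allele (0 or 1) carried by every haplotype,
    in the same haplotype order. Returns None if either SNP is monomorphic.
    """
    n = len(snp_a)
    p_a = sum(snp_a) / n
    p_b = sum(snp_b) / n
    # Frequency of haplotypes carrying allele 1 at both SNPs.
    p_ab = sum(1 for a, b in zip(snp_a, snp_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return None
    return d * d / denom


def proportion_high_ld(snps, threshold=0.8):
    """Proportion of SNP pairs with r^2 above the threshold."""
    values = [r_squared(x, y) for x, y in combinations(snps, 2)]
    values = [v for v in values if v is not None]
    if not values:
        return None
    return sum(v > threshold for v in values) / len(values)


# Toy region: three SNPs typed on four haplotypes (hypothetical data).
region = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],  # identical to the first SNP, so the pair is in complete LD
    [0, 1, 0, 1],  # statistically independent of the other two SNPs
]
print(proportion_high_ld(region))
```

Applying `proportion_high_ld` separately per gene region, population, and distance class would give values comparable in kind to those the abstract contrasts across continents.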

Springer - Publisher Connector

PubMed Central

Digital.CSIC

Diposit Digital de la Universitat de Barcelona

A log-ratio biplot approach for exploring genetic relatedness based on identity by state

Author: Barceló Vidal Carles
de Cid Rafael
Galván Femenía Iván
Graffelman Jan
Publication venue: Frontiers Media SA
Publication date: 01/01/2019

The detection of cryptic relatedness in large population-based cohorts is of great importance in genome research. The usual approach for detecting closely related individuals is to plot allele sharing statistics, based on identity-by-state or identity-by-descent, in a two-dimensional scatterplot. This approach ignores the fact that allele sharing data across individuals has in reality a higher dimensionality, and it also disregards the compositional nature of the underlying counts of shared genotypes. In this paper we develop biplot methodology based on log-ratio principal component analysis that overcomes these restrictions. This leads to entirely new graphics that are essentially useful for exploring relatedness in genetic databases from homogeneous populations. The proposed method can be applied in an iterative manner, acting as a looking glass for more remote relationships that are harder to classify. Datasets from the 1000 Genomes Project and the Genomes For Life-GCAT Project are used to illustrate the proposed method. The discriminatory power of the log-ratio biplot approach is compared with the classical plots in a simulation study. In a non-inbred homogeneous population the classification rate of the log-ratio principal component approach outperforms the classical graphics across the whole allele frequency spectrum, using only identity by state. In these circumstances, simulations show that with 35,000 independent bi-allelic variants, log-ratio principal component analysis, combined with discriminant analysis, can correctly classify relationships up to and including the fourth degree. Postprint (published version)
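Log-ratio principal component analysis is commonly carried out by applying a centred log-ratio (clr) transform to each composition and then running ordinary PCA on the transformed rows. The sketch below shows only the clr step. It assumes a hypothetical three-part composition per pair of individuals (counts of variants sharing 0, 1 or 2 alleles identical by state) and a hypothetical pseudocount for zero counts. The paper's actual compositions and zero handling may differ.

```python
from math import log


def clr(counts, pseudocount=0.5):
    """Centred log-ratio transform of one composition.

    The pseudocount (an assumption of this sketch) avoids log(0).
    The result is each log value minus the mean of the log values,
    so the transformed parts sum to zero.
    """
    logs = [log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Hypothetical IBS counts (0, 1, 2 shared alleles) for two pairs of individuals.
pair_one = [9500, 17000, 8500]
pair_two = [200, 15000, 19800]

for counts in (pair_one, pair_two):
    print([round(x, 3) for x in clr(counts)])
```

Because the transform depends only on ratios between parts, rescaling all counts of a pair by the same factor leaves the clr coordinates unchanged when no pseudocount is added, which is what makes it suited to compositional data.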

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC