Haplotype reconstruction error as a classical misclassification problem
Statistically reconstructing haplotypes from single nucleotide polymorphism (SNP) genotypes can lead to falsely classified haplotypes. This can be an issue when interpreting haplotype association results or when selecting subjects with certain haplotypes for subsequent functional studies. Our aim was to quantify haplotype reconstruction error and to provide tools for doing so.
Across numerous simulation scenarios, we systematically investigated several error measures, including discrepancy, error rate, and R², and introduced sensitivity and specificity to this context. We exemplified several measures in the KORA study, a large population-based study from Southern Germany. We found that specificity was slightly reduced only for common haplotypes, while sensitivity was decreased for some, but not all, rare haplotypes. The overall error rate generally increased with increasing number of loci, increasing minor allele frequency of SNPs, decreasing correlation between the alleles, and increasing ambiguity.
We conclude that, with the analytical approach presented here, haplotype-specific error measures can be computed to gain insight into haplotype uncertainty. The method indicates whether a specific risk haplotype can be expected to be reconstructed with little or with high misclassification, and thus the magnitude of expected bias in association estimates. We also illustrate that sensitivity and specificity separate two dimensions of the haplotype reconstruction error, which completely describe the misclassification matrix and thus provide the prerequisite for methods accounting for misclassification.
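The haplotype-specific measures described in this abstract can be illustrated with a minimal sketch. It is not the authors' analytical approach: it assumes that true and reconstructed haplotypes are available as paired labels (as they would be in a simulation), and the function name and toy data are hypothetical.

```python
from collections import Counter


def haplotype_sens_spec(true_haps, inferred_haps):
    """Per-haplotype (sensitivity, specificity) from paired true/inferred labels."""
    if len(true_haps) != len(inferred_haps):
        raise ValueError("inputs must have the same length")
    # Each (true, inferred) pair is one cell of the misclassification matrix.
    pairs = Counter(zip(true_haps, inferred_haps))
    n = len(true_haps)
    result = {}
    for h in sorted(set(true_haps) | set(inferred_haps)):
        tp = pairs[(h, h)]
        fn = sum(c for (t, i), c in pairs.items() if t == h and i != h)
        fp = sum(c for (t, i), c in pairs.items() if t != h and i == h)
        tn = n - tp - fn - fp
        sens = tp / (tp + fn) if tp + fn else float("nan")
        spec = tn / (tn + fp) if tn + fp else float("nan")
        result[h] = (sens, spec)
    return result


# Toy data (hypothetical): one common haplotype "AC" and two rarer ones.
true_haps = ["AC", "AC", "AC", "GT", "GT", "AT"]
inferred_haps = ["AC", "AC", "AC", "GT", "AC", "GT"]
measures = haplotype_sens_spec(true_haps, inferred_haps)
```

In this toy example the common haplotype "AC" is always recovered but also absorbs one misclassified chromosome, so its specificity drops, while the rare haplotype "AT" is never recovered, so its sensitivity is zero.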
Evolutionary and demographic correlates of Pleistocene coastline changes in the Sicilian wall lizard Podarcis wagleriana
Aim
Emergence of coastal lowlands during Pleistocene ice ages might have provided conditions for glacial expansions (demographic and spatial), rather than contraction, of coastal populations of temperate species. Here, we tested these predictions in the insular endemic Sicilian wall lizard Podarcis wagleriana.
Location
Sicily and neighbouring islands.
Methods
We sampled 179 individuals from 45 localities across the whole range of P. wagleriana. We investigated demographic and spatial variations through time using Bayesian coalescent models (Bayesian phylogeographic reconstruction, Extended Bayesian Skyline plots, Isolation‐with‐migration models) based on multilocus DNA sequence data. We used species distribution modelling to reconstruct present and past habitat suitability.
Results
We found two main lineages distributed in the east and west portions of the current species range and a third lineage restricted to a small area in the north of Sicily. Multiple lines of evidence from palaeogeographic (shorelines), palaeoclimatic (species distribution models), and multilocus genetic data (demographic and spatial Bayesian reconstructions) indicate that these lineages originated in distinct refugia, located in the north‐western and south‐eastern coastal lowlands, during Middle Pleistocene interglacial phases, and came into secondary contact following demographic and spatial expansions during the last glacial phase.
Main conclusions
This scenario of interglacial contraction and glacial expansion is in sharp contrast with patterns commonly observed in temperate species on the continent but parallels recent findings on other Mediterranean island endemics. Such a reverse expansion–contraction (EC) dynamic has likely been associated with glacial increases of climatically suitable coastal lowlands, suggesting this might be a general pattern in Mediterranean island species and also in other coastal regions strongly affected by marine regressions during glacial episodes. This study provides explicit predictions and some methodological recommendations for testing the reverse EC model in other regions and taxa.
Conflation of short identity-by-descent segments bias their inferred length distribution
Identity-by-descent (IBD) is a fundamental concept in genetics with many
applications. In a common definition, two haplotypes are said to contain an IBD
segment if they share a segment that is inherited from a recent shared common
ancestor without intervening recombination. Long IBD segments (> 1cM) can be
efficiently detected by a number of algorithms using high-density SNP array
data from a population sample. However, these approaches detect IBD based on
contiguous segments of identity-by-state, and such segments may exist due to
the conflation of smaller, nearby IBD segments. We quantified this effect using
coalescent simulations, finding that nearly 40% of inferred segments 1-2cM long
are results of conflations of two or more shorter segments, under demographic
scenarios typical for modern humans. This biases the inferred IBD segment
length distribution, and so can affect downstream inferences. We observed this
conflation effect universally across different IBD detection programs and human
demographic histories, and found inference of segments longer than 2cM to be
much more reliable (less than 5% conflation rate). As an example of how this
can negatively affect downstream analyses, we present and analyze a novel
estimator of the de novo mutation rate using IBD segments, and demonstrate that
the biased length distribution of the IBD segments due to conflation can lead
to inflated estimates if the conflation is not modeled. Understanding the
conflation effect in detail will make its correction in future methods more
tractable.
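The conflation effect described in this abstract can be illustrated with a toy sketch. It is not the authors' simulation pipeline: real detectors work from identity-by-state in genotype data, whereas this sketch simply assumes that true segments separated by a gap of at most `max_gap` cM are reported as one segment. The function names, the gap threshold, and the segment coordinates are all hypothetical.

```python
def conflate(segments, max_gap):
    """Merge true IBD segments (start, end in cM) whose gap is <= max_gap.

    Returns (start, end, n_merged) tuples, where n_merged counts the true
    segments that were fused into each reported segment.
    """
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= max_gap:
            prev_start, prev_end, k = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), k + 1)
        else:
            merged.append((start, end, 1))
    return merged


def conflation_rate(inferred, lo, hi):
    """Fraction of reported segments with length in [lo, hi) built from >1 true segment."""
    in_bin = [k for start, end, k in inferred if lo <= end - start < hi]
    if not in_bin:
        return float("nan")
    return sum(k > 1 for k in in_bin) / len(in_bin)


# Toy true segments in cM (hypothetical values).
true_segments = [(0.0, 0.6), (0.65, 1.3), (5.0, 6.5), (10.0, 10.4), (20.0, 21.2)]
inferred = conflate(true_segments, max_gap=0.1)
rate = conflation_rate(inferred, 1.0, 2.0)
```

Here two true segments shorter than 1 cM are fused into a single reported 1.3 cM segment, so one of the three reported segments in the 1-2 cM bin is a conflation and the reported length distribution is shifted toward longer segments.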
Haplotype Reconstruction Error as a Classical Misclassification Problem: Introducing Sensitivity and Specificity as Error Measures
BACKGROUND: Statistically reconstructing haplotypes from single nucleotide polymorphism (SNP) genotypes can lead to falsely classified haplotypes. This can be an issue when interpreting haplotype association results or when selecting subjects with certain haplotypes for subsequent functional studies. Our aim was to quantify haplotype reconstruction error and to provide tools for doing so. METHODS AND RESULTS: Across numerous simulation scenarios, we systematically investigated several error measures, including discrepancy, error rate, and R², and introduced sensitivity and specificity to this context. We exemplified several measures in the KORA study, a large population-based study from Southern Germany. We found that specificity was slightly reduced only for common haplotypes, while sensitivity was decreased for some, but not all, rare haplotypes. The overall error rate generally increased with increasing number of loci, increasing minor allele frequency of SNPs, decreasing correlation between the alleles, and increasing ambiguity. CONCLUSIONS: We conclude that, with the analytical approach presented here, haplotype-specific error measures can be computed to gain insight into haplotype uncertainty. The method indicates whether a specific risk haplotype can be expected to be reconstructed with little or with high misclassification, and thus the magnitude of expected bias in association estimates. We also illustrate that sensitivity and specificity separate two dimensions of the haplotype reconstruction error, which completely describe the misclassification matrix and thus provide the prerequisite for methods accounting for misclassification.
Populations in statistical genetic modelling and inference
What is a population? This review considers how a population may be defined
in terms of understanding the structure of the underlying genetics of the
individuals involved. The main approach is to consider statistically
identifiable groups of randomly mating individuals, which is well defined in
theory for any type of (sexual) organism. We discuss generative models using
drift, admixture and spatial structure, and the ancestral recombination graph.
These are contrasted with statistical models for inference, principal component
analysis and other 'non-parametric' methods. The relationships between these
approaches are explored with both simulated and real-data examples. The
state-of-the-art practical software tools are discussed and contrasted. We
conclude that populations are a useful theoretical construct that can be well
defined in theory and often approximately exist in practice.
Incorporating Single-Locus Tests into Haplotype Cladistic Analysis in Case-Control Studies
In case-control studies, genetic associations for complex diseases may be probed either with single-locus tests or with haplotype-based tests. Although there are different views on the relative merits and preferences of the two test strategies, haplotype-based analyses are generally believed to be more powerful to detect genes with modest effects. However, a main drawback of haplotype-based association tests is the large number of distinct haplotypes, which increases the degrees of freedom for corresponding test statistics and thus reduces the statistical power. To decrease the degrees of freedom and enhance the efficiency and power of haplotype analysis, we propose an improved haplotype clustering method that is based on the haplotype cladistic analysis developed by Durrant et al. In our method, we attempt to combine the strengths of single-locus analysis and haplotype-based analysis into a single test framework. The novelty of our method is a more informative haplotype similarity measurement: p-values obtained from single-locus association tests are used to construct a measure of weight, which to some extent incorporates the information of disease outcomes. The weights are then used in the computation of similarity measures to construct distance metrics between haplotype pairs in haplotype cladistic analysis. To assess our proposed method, we performed simulation analyses to compare the relative performances of (1) conventional haplotype-based analysis using original haplotypes, (2) single-locus allele-based analysis, (3) original haplotype cladistic analysis (CLADHC) by Durrant et al., and (4) our weighted haplotype cladistic analysis method, under different scenarios. Our weighted cladistic analysis method shows increased statistical power and robustness compared with haplotype cladistic analysis, single-locus tests, and traditional haplotype-based analyses. The real data analyses also show that our proposed method has practical significance in the human genetics field.
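The idea of weighting haplotype similarity by single-locus evidence can be sketched as follows. The abstract does not state the weight formula, so this sketch assumes, purely for illustration, that each locus receives a weight proportional to 1 - p and that similarity is the weighted fraction of matching alleles. The function names and toy values are hypothetical and are not taken from the paper or from CLADHC.

```python
def locus_weights(p_values):
    """Turn single-locus p-values into normalized weights (assumed form: 1 - p)."""
    raw = [1.0 - p for p in p_values]
    total = sum(raw)
    if total == 0:
        raise ValueError("all p-values equal 1; weights are undefined")
    return [r / total for r in raw]


def weighted_similarity(hap_a, hap_b, weights):
    """Weighted fraction of loci at which two haplotypes carry the same allele."""
    if not (len(hap_a) == len(hap_b) == len(weights)):
        raise ValueError("haplotypes and weights must have the same length")
    return sum(w for a, b, w in zip(hap_a, hap_b, weights) if a == b)


# Toy example (hypothetical): the first locus is strongly associated with disease.
p_values = [0.001, 0.5, 0.5, 0.999]
weights = locus_weights(p_values)
similarity = weighted_similarity("ACGT", "TCGT", weights)
distance = 1.0 - similarity
```

Under this assumed weighting, the two haplotypes match at three of four loci (unweighted similarity 0.75), but because they differ at the most strongly associated locus their weighted similarity is only about 0.50, so a clustering step would place them further apart.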
Evaluation of Haplotype Inference Using Definitive Haplotype Data Obtained from Complete Hydatidiform Moles, and Its Significance for the Analyses of Positively Selected Regions
The haplotype map constructed by the HapMap Project is a valuable resource in the genetic studies of disease genes, population structure, and evolution. In the Project, Caucasian and African haplotypes are fairly accurately inferred, based mainly on the rules of Mendelian inheritance using the genotypes of trios. However, the Asian haplotypes are inferred from the genotypes of unrelated individuals based on population genetics, and are less accurate. Thus, the effects of this inaccuracy on downstream analyses need to be assessed. We determined true Japanese haplotypes by genotyping 100 complete hydatidiform moles (CHM), each carrying a genome derived from a single sperm, using Affymetrix 500 K Arrays. We then assessed how inferred haplotypes can differ from true haplotypes, by phasing pseudo-individualized true haplotypes using the programs PHASE, fastPHASE, and Beagle. We found that, at various genomic regions, especially the MHC locus, the expansion of extended haplotype homozygosity (EHH), which is a measure of positive selection, is obscured when inferred Asian haplotype data are used to detect the expansion. We then mapped the genome using a new statistic, XDiHH, which directly detects the difference between the true and inferred haplotypes, in the determination of EHH expansion. We also show that the true haplotype data presented here are useful to assess and improve the accuracy of phasing of Asian genotypes.
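For readers unfamiliar with EHH, the following minimal sketch computes it in its commonly used form: the probability that two chromosomes drawn at random from those carrying a core allele are identical over the whole interval from the core to a given marker. It is not the authors' pipeline and does not implement XDiHH. The function name, the single-direction interval, and the toy haplotypes are assumptions made for illustration.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, core, end):
    """EHH between marker indices core and end (inclusive).

    All haplotypes are assumed to carry the same core allele and to be
    strings of equal length, one character per marker.
    """
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two haplotypes")
    lo, hi = sorted((core, end))
    # Group chromosomes that are identical over the whole interval.
    counts = Counter(h[lo:hi + 1] for h in haplotypes)
    return sum(comb(c, 2) for c in counts.values()) / comb(n, 2)


# Toy haplotypes (hypothetical) sharing the core allele at index 0.
haps = ["AAAA", "AAAA", "AAAT", "AACT"]
decay = [ehh(haps, 0, end) for end in range(4)]
```

In this toy example EHH stays at 1.0 over the first two markers and then decays to 0.5 and about 0.17. A phasing error that swaps an allele onto the wrong chromosome splits a group of identical haplotypes in the same way a real recombination would, which is one plausible reason why the EHH signal is obscured in inferred data.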
Understanding the accuracy of statistical haplotype inference with sequence data of known phase
Statistical methods for haplotype inference from multi-site genotypes of unrelated individuals have important applications in association studies and population genetics. Understanding the factors that affect the accuracy of this inference is important, but their assessment has been restricted by the limited availability of biological data with known phase. We created hybrid cell lines monosomic for human chromosome 19 and produced single-chromosome complete sequences of a 48 kb genomic region in 39 individuals of African American (AA) and European American (EA) origin. We employed these phase-known genotypes and coalescent simulations to assess the accuracy of statistical haplotype reconstruction by several algorithms. Accuracy of phase inference was low in our biological data even for regions as short as 25–50 kb, suggesting that caution is needed when analyzing reconstructed haplotypes. Moreover, the estimated confidence in phase inference is not reliable enough to allow site-specific uncertainty information to be incorporated in subsequent analyses. We show that, in samples of certain mixed ancestry (AA and EA populations), the most accurate haplotypes are probably obtained by increasing sample size and analyzing the largest, pooled sample, despite the hypothetical problems associated with pooling across those heterogeneous samples. Strategies to improve confidence in reconstructed haplotypes, and realistic alternatives to the analysis of inferred haplotypes, are discussed. Genet. Epidemiol. © 2007 Wiley-Liss, Inc. Peer Reviewed. http://deepblue.lib.umich.edu/bitstream/2027.42/57366/1/20185_ftp.pd