The quest for a donor: probability based methods offer help
When a patient in need of a stem cell transplant has no compatible donor within his or her closest family, and no matched unrelated donor can be found, a remaining option is to search within the patient’s extended family. This situation often arises when the patient is of an ethnic minority, originating from a country that lacks a well-developed stem cell donor program, and has HLA haplotypes that are rare in his or her country of residence. Searching within the extended family may be time-consuming and expensive, and tools to calculate the probability of a match within groups of untested relatives would facilitate the search. We present a general approach to calculating the probability of a match in a given relative, or group of relatives, based on the pedigree, and on knowledge of the genotypes of some of the individuals. The method extends previous approaches by allowing the pedigrees to be consanguineous and arbitrarily complex, with deviations from Hardy-Weinberg equilibrium. We show how this extension has a considerable effect on results, in particular for rare haplotypes. The methods are exemplified using freeware programs to solve a case of practical importance.
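The simplest instance of such a match-probability calculation can be illustrated in miniature: with fully typed parents and an untested full sibling, the probability follows from Mendelian segregation alone. Below is a minimal sketch under that assumption (the function name and haplotype labels are hypothetical; the paper's method additionally handles consanguineous and arbitrarily complex pedigrees and Hardy-Weinberg deviations, which this toy ignores):

```python
from itertools import product

def sibling_match_probability(father, mother, patient):
    """Probability that an untested full sibling carries the same
    unordered HLA haplotype pair as the patient, assuming Mendelian
    segregation and fully typed parents."""
    target = frozenset(patient)
    # Each sibling inherits one haplotype from each parent, uniformly.
    outcomes = [frozenset((p, m)) for p, m in product(father, mother)]
    return sum(o == target for o in outcomes) / len(outcomes)

# Parents with four distinct haplotypes: the classic 1-in-4 sibling match.
p = sibling_match_probability(("A", "B"), ("C", "D"), ("A", "C"))  # → 0.25
```

For untyped relatives further out in the pedigree, the same idea generalizes by summing over the possible transmissions along each path, which is what makes a principled computational tool valuable.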
Quantifying the Fraction of Missing Information for Hypothesis Testing in Statistical and Genetic Studies
Many practical studies rely on hypothesis testing procedures applied to data
sets with missing information. An important part of the analysis is to
determine the impact of the missing data on the performance of the test, and
this can be done by properly quantifying the relative (to complete data) amount
of available information. The problem is directly motivated by applications to
studies, such as linkage analyses and haplotype-based association projects,
designed to identify genetic contributions to complex diseases. In the genetic
studies the relative information measures are needed for the experimental
design, technology comparison, interpretation of the data, and for
understanding the behavior of some of the inference tools. The central
difficulties in constructing such information measures arise from the multiple,
and sometimes conflicting, aims in practice. For large samples, we show that a
satisfactory, likelihood-based general solution exists by using appropriate
forms of the relative Kullback--Leibler information, and that the proposed
measures are computationally inexpensive given the maximized likelihoods with
the observed data. Two measures are introduced, under the null and alternative
hypothesis respectively. We exemplify the measures on data coming from mapping
studies on the inflammatory bowel disease and diabetes. For small-sample
problems, which appear rather frequently in practice and sometimes in disguised
forms (e.g., measuring individual contributions to a large study), the robust
Bayesian approach holds great promise, though the choice of a general-purpose
"default prior" is a very challenging problem.Comment: Published in at http://dx.doi.org/10.1214/07-STS244 the Statistical
Science (http://www.imstat.org/sts/) by the Institute of Mathematical
Statistics (http://www.imstat.org
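The flavor of a relative-information measure can be shown with a toy computation: the Kullback–Leibler separation between null and alternative achievable with the observed (incomplete) data, as a fraction of the separation achievable with complete data. This is a simplified stand-in for the paper's likelihood-based measures; the distributions and function names are illustrative:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def relative_information(alt_complete, null_complete, alt_observed, null_observed):
    """Fraction of the complete-data KL separation between the hypotheses
    that survives in the observed-data distributions (toy measure)."""
    return kl(alt_observed, null_observed) / kl(alt_complete, null_complete)

# Missing data blurs the alternative toward the null, shrinking the ratio.
r = relative_information((0.7, 0.3), (0.5, 0.5), (0.6, 0.4), (0.5, 0.5))
```

A ratio near 1 indicates little information lost to missingness; a ratio near 0 indicates the observed data can barely distinguish the hypotheses.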
Haplotype Inference on Pedigrees with Recombinations, Errors, and Missing Genotypes via SAT solvers
The Minimum-Recombinant Haplotype Configuration problem (MRHC) has been
highly successful in providing a sound combinatorial formulation for the
important problem of genotype phasing on pedigrees. Despite several algorithmic
advances and refinements that led to some efficient algorithms, its
applicability to real datasets has been limited by the absence of some
important characteristics of these data in its formulation, such as mutations,
genotyping errors, and missing data.
In this work, we propose the Haplotype Configuration with Recombinations and
Errors problem (HCRE), which generalizes the original MRHC formulation by
incorporating the two most common characteristics of real data: errors and
missing genotypes (including untyped individuals). Although HCRE is
computationally hard, we propose an exact algorithm for the problem based on a
reduction to the well-known Satisfiability problem. Our reduction exploits
recent progresses in the constraint programming literature and, combined with
the use of state-of-the-art SAT solvers, provides a practical solution for the
HCRE problem. Biological soundness of the phasing model and effectiveness (on
both accuracy and performance) of the algorithm are experimentally demonstrated
under several simulated scenarios and on a real dairy cattle population.

Comment: 14 pages, 1 figure, 4 tables; the associated software reHCstar is available at http://www.algolab.eu/reHCsta
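The flavor of such a reduction can be sketched at a single biallelic marker: the Mendelian constraint becomes a clause over boolean transmission variables, with an extra error variable that can satisfy the clause at a cost. The encoding below is a deliberately tiny, brute-force stand-in for reHCstar's actual CNF reduction and SAT-solver back end (all names are illustrative):

```python
from itertools import product

def consistent_assignments(father, mother, child):
    """Enumerate transmission choices, a toy analogue of a SAT model:
    booleans ft/mt pick which parental allele is transmitted, and et
    flags a genotyping error that waives the Mendelian constraint."""
    sols = []
    for ft, mt, et in product((0, 1), repeat=3):
        mendelian = sorted((father[ft], mother[mt])) == sorted(child)
        if mendelian or et:          # clause: mendelian OR error
            sols.append((ft, mt, et))
    return sols

# Prefer the assignment with fewest error flags, as an exact solver would
# when error variables carry a cost.
best = min(consistent_assignments((0, 1), (0, 0), (1, 0)),
           key=lambda s: s[2])
```

In the real encoding, one such clause set is generated per individual and marker, recombination and error variables are shared across loci, and a state-of-the-art SAT solver searches the exponential assignment space that this sketch enumerates by brute force.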
Modelling dependencies in genetic-marker data and its application to haplotype analysis
The objective of this thesis is to develop new methods to reconstruct haplotypes from phase-unknown genotypes. The need for new methodologies is motivated by the increasing availability of high-resolution marker data for many species. Such markers typically exhibit correlations, a phenomenon known as Linkage Disequilibrium (LD). It is believed that reconstructed haplotypes for markers in high LD can be valuable for a variety of application areas in population genetics, including reconstructing population history and identifying genetic disease variants.

Traditionally, haplotype reconstruction methods can be categorized according to whether
they operate on a single pedigree or a collection of unrelated individuals. The thesis begins
with a critical assessment of the limitations of existing methods, and then presents a unified statistical framework that can accommodate pedigree data, unrelated individuals and tightly linked markers. The framework makes use of graphical models, where inference entails representing the relevant joint probability distribution as a graph and then using associated algorithms to facilitate computation. The graphical model formalism provides invaluable tools to facilitate model specification, visualization, and inference.

Once the unified framework is developed, a broad range of simulation studies are conducted
using previously published haplotype data. Important contributions include demonstrating
the different ways in which the haplotype frequency distribution can impact the accuracy of
both the phase assignments and haplotype frequency estimates; evaluating the effectiveness
of using family data to improve accuracy for different frequency profiles; and, assessing the
dangers of treating related individuals as unrelated in an association study.
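The combinatorial object underlying phase reconstruction can be made concrete: an unphased multilocus genotype is compatible with a set of haplotype pairs, and statistical phasing methods place a probability distribution over exactly this set. A minimal enumeration sketch (the representation and names are illustrative, not the thesis's graphical-model machinery):

```python
from itertools import product

def compatible_haplotype_pairs(genotype):
    """All unordered haplotype pairs consistent with an unphased
    multilocus genotype, where each site is the set of its one
    (homozygous) or two (heterozygous) alleles."""
    pairs = set()
    for hap1 in product(*genotype):
        # The second haplotype is forced: the complementary allele at
        # each heterozygous site, the same allele at homozygous sites.
        hap2 = tuple(next(iter(site - {a})) if len(site) > 1 else a
                     for site, a in zip(genotype, hap1))
        pairs.add(frozenset((hap1, hap2)))
    return pairs
```

With k heterozygous sites there are 2^(k-1) compatible unordered pairs, which is why exhaustive enumeration breaks down for dense markers and graph-based inference becomes necessary.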
Methods and Algorithms for Inference Problems in Population Genetics
Inference of population history is a central problem of population genetics. The advent of large genetic datasets brings not only opportunities to develop more accurate methods for inference problems, but also computational challenges. Thus, we aim to develop accurate methods and fast algorithms for problems in population genetics.
Inference of admixture proportions is a classical statistical problem. We particularly focus on the problem of ancestry inference for ancestors. Standard methods implicitly assume that both parents of an individual have the same admixture fraction. However, this is rarely the case in real data. We develop a Hidden Markov Model (HMM) framework for estimating the admixture proportions of the immediate ancestors of an individual, i.e., an apportionment of an individual's admixture proportions into further subsets of ancestral proportions in the ancestors. Based on a genealogical model for admixture tracts, we develop an efficient algorithm for computing the sampling probability of the genome of a single individual as a function of the admixture proportions of that individual's ancestors. We show that the distribution and lengths of admixture tracts in a genome contain information about the admixture proportions of an individual's ancestors. This allows us to perform probabilistic inference of the admixture proportions of ancestors using only the genome of an extant individual.
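The computational core of such an HMM framework is the forward algorithm, which yields the likelihood of an observed sequence as a function of the model parameters. The sketch below runs on a generic discrete HMM; the state space, transitions, and emissions are illustrative placeholders for the paper's admixture-tract model:

```python
import math

def forward_loglik(obs, init, trans, emit):
    """Scaled forward algorithm: log P(obs) under a discrete HMM whose
    hidden states play the role of local ancestry (population 0 vs 1)."""
    n = len(init)
    alpha = [init[s] * emit[s][obs[0]] for s in range(n)]
    c = sum(alpha)
    loglik, alpha = math.log(c), [a / c for a in alpha]
    for o in obs[1:]:
        # Propagate through transitions, then weight by emissions.
        alpha = [sum(alpha[t] * trans[t][s] for t in range(n)) * emit[s][o]
                 for s in range(n)]
        c = sum(alpha)
        loglik, alpha = loglik + math.log(c), [a / c for a in alpha]
    return loglik
```

Maximizing this log-likelihood over the parameters that encode ancestral admixture proportions is what turns tract lengths and positions into estimates for the ancestors.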
To better understand populations, we further study the species delimitation problem: the problem of determining the boundary between populations and species. We propose a classification-based method to assign a set of populations to a number of species. Our new method uses summary statistics generated from genetic data to classify pairs of populations as either 'same species' or 'different species'. We show that machine learning can be used for species delimitation and scaled to large genomic data. It can also outperform Bayesian approaches, especially when gene flow is involved in the evolutionary process.
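As a heavily simplified stand-in for the classification step, even a nearest-centroid rule over pairwise summary statistics captures the idea of mapping population pairs to 'same species' vs 'different species'. The abstract does not specify the classifier or features, so everything below, including the two toy statistics, is illustrative:

```python
def fit_centroids(features, labels):
    """Mean summary-statistic vector per class label."""
    by_label = {}
    for vec, lab in zip(features, labels):
        by_label.setdefault(lab, []).append(vec)
    return {lab: tuple(sum(col) / len(vecs) for col in zip(*vecs))
            for lab, vecs in by_label.items()}

def predict(centroids, x):
    """Assign a population pair to the class with the nearest centroid."""
    sq = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(centroids, key=lambda lab: sq(centroids[lab], x))

# Toy features per population pair: (shared-allele fraction, divergence).
cents = fit_centroids([(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)],
                      ["same species", "same species",
                       "different species", "different species"])
```

The appeal of this family of methods is exactly what the abstract claims: once training data exist, classifying a new pair costs only a summary-statistic computation, which scales far better than a full Bayesian model comparison.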
Rapid haplotype inference for nuclear families
Hapi is a new dynamic programming algorithm that ignores uninformative states and state transitions in order to efficiently compute minimum-recombinant and maximum likelihood haplotypes. When applied to a dataset containing 103 families, Hapi performs 3.8 and 320 times faster than state-of-the-art algorithms. Because Hapi infers both minimum-recombinant and maximum likelihood haplotypes and applies to related individuals, the haplotypes it infers are highly accurate over extended genomic distances.

Funding: National Institutes of Health (U.S.) (NIH grant 5-T90-DK070069); National Institutes of Health (U.S.) (Grant 5-P01-NS055923); National Science Foundation (U.S.) (Graduate Research Fellowship)
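The dynamic-programming idea behind minimum-recombinant haplotyping can be shown in miniature: track, marker by marker, the cheapest number of phase switches consistent with the data. This sketch is not Hapi's algorithm (which additionally prunes uninformative states and computes likelihoods); the names and the origin-set representation are illustrative:

```python
def min_recombinations(origins):
    """Minimum number of phase switches explaining per-marker origin
    constraints, where each entry is the set of parental-haplotype
    origins (0 or 1) compatible with that marker's genotype data."""
    INF = float("inf")
    # cost[s]: fewest switches so far ending in origin state s.
    cost = {s: (0 if s in origins[0] else INF) for s in (0, 1)}
    for allowed in origins[1:]:
        # Stay in the same state for free, or switch at cost 1.
        cost = {s: (min(cost[s], cost[1 - s] + 1) if s in allowed else INF)
                for s in (0, 1)}
    return min(cost.values())

# A forced switch between the first and last informative markers.
k = min_recombinations([{0}, {0, 1}, {1}])  # → 1
```

Markers whose origin set is {0, 1} are uninformative; Hapi's speedup comes partly from collapsing runs of such states rather than iterating over them as this toy does.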