An FPT haplotyping algorithm on pedigrees with a small number of sites
<p>Abstract</p> <p>Background</p> <p>Genetic disease studies investigate relationships between changes in chromosomes and genetic diseases. Single haplotypes provide useful information for these studies, but extracting single haplotypes directly by biochemical methods is expensive. A computational method to infer haplotypes from genotype data is therefore important. We investigate the problem of computing the minimum number of recombination events for general pedigrees with a small number of sites for all members.</p> <p>Results</p> <p>We show that this NP-hard problem can be parametrically reduced to the Bipartization by Edge Removal problem with additional parity constraints. We solve this problem with an exact algorithm that runs in <inline-formula><graphic file="1748-7188-6-8-i1.gif"/></inline-formula> time, where <it>n</it> is the number of members, <it>m</it> is the number of sites, and <it>k</it> is the number of recombination events.</p> <p>Conclusions</p> <p>This algorithm infers haplotypes for a small number of sites, which can be useful for genetic disease studies to track down how changes in haplotypes, such as recombinations, relate to genetic disease.</p>
Haplotype inference in general pedigrees with two sites
<p>Abstract</p> <p>Background</p> <p>Genetic disease studies investigate relationships between changes in chromosomes and genetic diseases. Single haplotypes provide useful information for these studies, but extracting single haplotypes directly by biochemical methods is expensive. A computational method to infer haplotypes from genotype data is therefore important. We investigate the problem of computing the minimum number of recombination events for general pedigrees with two sites for all members.</p> <p>Results</p> <p>We show that this NP-hard problem can be parametrically reduced to the Bipartization by Edge Removal problem and therefore can be solved by an <it>O</it>(2<it><sup>k</sup></it> · <it>n</it><sup>2</sup>) exact algorithm, where <it>n</it> is the number of members and <it>k</it> is the number of recombination events.</p> <p>Conclusions</p> <p>Our work can therefore be useful for genetic disease studies to track down how changes in haplotypes, such as recombinations, relate to genetic disease.</p>
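Both abstracts above reduce haplotyping to Bipartization by Edge Removal: delete at most <it>k</it> edges so the remaining graph contains no odd cycle. A minimal brute-force sketch of that target problem (illustrative only — the papers' parameterized algorithms are far more efficient than trying all edge subsets, and the graph construction from a pedigree is not shown here):

```python
from itertools import combinations

def is_bipartite(n, edges):
    """2-color the graph via DFS; return True iff there is no odd cycle."""
    adj = {v: [] for v in range(n)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    color = {}
    for start in range(n):
        if start in color:
            continue
        color[start] = 0
        stack = [start]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in color:
                    color[v] = 1 - color[u]
                    stack.append(v)
                elif color[v] == color[u]:
                    return False  # odd cycle found
    return True

def min_edge_bipartization(n, edges, k_max):
    """Smallest set of at most k_max edges whose removal leaves the graph
    bipartite, or None if no such set exists within the budget."""
    for k in range(k_max + 1):
        for removed in combinations(range(len(edges)), k):
            kept = [e for i, e in enumerate(edges) if i not in removed]
            if is_bipartite(n, kept):
                return [edges[i] for i in removed]
    return None
```

For example, a triangle needs one edge removed, while a 4-cycle is already bipartite and needs none.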
Haplotype Inference on Pedigrees with Recombinations, Errors, and Missing Genotypes via SAT solvers
The Minimum-Recombinant Haplotype Configuration problem (MRHC) has been highly successful in providing a sound combinatorial formulation for the important problem of genotype phasing on pedigrees. Despite several algorithmic advances and refinements that led to efficient algorithms, its applicability to real datasets has been limited because its formulation omits important characteristics of these data, such as mutations, genotyping errors, and missing data.

In this work, we propose the Haplotype Configuration with Recombinations and Errors problem (HCRE), which generalizes the original MRHC formulation by incorporating the two most common characteristics of real data: errors and missing genotypes (including untyped individuals). Although HCRE is computationally hard, we propose an exact algorithm for the problem based on a reduction to the well-known Satisfiability problem. Our reduction exploits recent progress in the constraint programming literature and, combined with the use of state-of-the-art SAT solvers, provides a practical solution for the HCRE problem. Biological soundness of the phasing model and effectiveness (in both accuracy and performance) of the algorithm are experimentally demonstrated under several simulated scenarios and on a real dairy cattle population.

Comment: 14 pages, 1 figure, 4 tables; the associated software reHCstar is available at http://www.algolab.eu/reHCsta
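The core of such a reduction is encoding the combinatorial constraints as CNF clauses and handing them to a SAT solver. A toy sketch (this is not reHCstar's actual encoding, and reHCstar uses industrial solvers rather than this didactic DPLL loop):

```python
def dpll(clauses, assignment=None):
    """Tiny DPLL SAT solver. Clauses are lists of nonzero ints in DIMACS
    style (positive = true literal, negative = negated). Returns a
    satisfying assignment {var: bool} or None if unsatisfiable."""
    if assignment is None:
        assignment = {}
    simplified = []
    for clause in clauses:
        if any(assignment.get(abs(l)) == (l > 0) for l in clause):
            continue  # clause already satisfied
        rest = [l for l in clause if abs(l) not in assignment]
        if not rest:
            return None  # clause falsified under this assignment
        simplified.append(rest)
    if not simplified:
        return assignment
    lit = simplified[0][0]
    for value in (lit > 0, lit <= 0):  # try satisfying the literal first
        result = dpll(simplified, {**assignment, abs(lit): value})
        if result is not None:
            return result
    return None

# Toy encoding in the spirit of the reduction: variables 1 and 2 stand for
# two phase choices of a child's alleles; the clauses force them to
# disagree (one paternal, one maternal) and fix variable 1 to true.
clauses = [[1, 2], [-1, -2], [1]]
model = dpll(clauses)
```

Here `model` assigns variable 1 true and variable 2 false, the only configuration consistent with all three clauses.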
Design and Association Methods for Next-generation Sequencing Studies for Quantitative Traits
Advances in exome sequencing and the development of exome genotyping arrays are enabling exploration of associations between rare coding variants and complex traits through sequencing-based GWAS. However, the cost of sequencing remains high, the optimal design for sequencing-based association studies is an open question, and powerful association methods and software to detect trait-associated rare and low-frequency variants are greatly needed. Although it contains 5% of the information in the human genome, chromosome X has been largely neglected in routine GWAS analysis. In this dissertation, I focus on three topics:
First, I describe a computationally efficient approach to reconstruct gene-level association test statistics from single-variant summary statistics and their covariance matrices for single studies and meta-analyses. By simulation and real data examples, I evaluate our methods under the null, investigate scenarios in which family samples have larger power than population samples, compare the power of different types of gene-level tests under various trait-generating models, and demonstrate the usage of our methods and the C++ software, RAREMETAL, by meta-analyzing SardiNIA and HUNT data on lipid levels.
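The general recipe behind reconstructing a gene-level test from single-variant summaries can be sketched as a weighted burden statistic (a hedged illustration of the standard approach, not RAREMETAL's implementation; the weights and variant set here are invented for the example):

```python
def burden_test(scores, cov, weights):
    """Gene-level burden statistic from single-variant score statistics
    (scores, the vector U) and their covariance matrix (cov, the matrix V):
    T = (w'U)^2 / (w'Vw), asymptotically chi-square with 1 df under the
    null hypothesis of no association."""
    num = sum(w * u for w, u in zip(weights, scores)) ** 2
    den = sum(weights[i] * cov[i][j] * weights[j]
              for i in range(len(weights))
              for j in range(len(weights)))
    return num / den

# Two rare variants in one gene, unit weights, independent score statistics.
stat = burden_test(scores=[2.0, 1.0],
                   cov=[[1.0, 0.0], [0.0, 1.0]],
                   weights=[1.0, 1.0])
```

With these toy numbers the statistic is (2 + 1)² / 2 = 4.5. The key point of the approach is that only per-variant summaries and their covariance are needed, never the individual-level genotypes, which is what makes meta-analysis across studies tractable.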
Second, I describe a variance component approach and a series of gene-level tests for X-linked rare variant analysis. By simulations, I demonstrate that our methods are well controlled under the null. I evaluate the power to detect an autosomal or X-linked gene of the same effect size, and investigate the effect of a sample's sex ratio on the power to detect an X-linked gene. Finally, I demonstrate the usage of our method and the C++ software by analyzing various quantitative traits measured in the SardiNIA study, and report detected X-linked variants and genes.
Third, I describe a novel likelihood-based approach and the C++ software, RAREFY, to prioritize, under a limited budget, the samples in a study most likely to carry trait-associated variants. I first describe the statistical method for small pedigrees and then describe an MCMC approach that makes our method computationally feasible for large pedigrees. By simulations and real data analysis, I compare our approach with other methods in both trait-associated allele discovery power and association power, and demonstrate the usage of our method on pedigrees from the SardiNIA study.

PhD, Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/113521/1/sfengsph_1.pd
KELVIN: A Software Package for Rigorous Measurement of Statistical Evidence in Human Genetics
This paper describes the software package KELVIN, which supports the PPL (posterior probability of linkage) framework for the measurement of statistical evidence in human (or more generally, diploid) genetic studies. In terms of scope, KELVIN supports two-point (trait-marker or marker-marker) and multipoint linkage analysis, based on either sex-averaged or sex-specific genetic maps, with an option to allow for imprinting; trait-marker linkage disequilibrium (LD), or association analysis, in case-control data, trio data, and/or multiplex family data, with options for joint linkage and trait-marker LD or conditional LD given linkage; dichotomous trait, quantitative trait and quantitative trait threshold models; and certain types of gene-gene interactions and covariate effects. Features and data (pedigree) structures can be freely mixed and matched within analyses. The statistical framework is specifically tailored to accumulate evidence in a mathematically rigorous way across multiple data sets or data subsets while allowing for multiple sources of heterogeneity, and KELVIN itself utilizes sophisticated software engineering to provide a powerful and robust platform for studying the genetics of complex disorders
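The accumulation of evidence across data sets that the PPL framework formalizes can be caricatured as sequential Bayesian updating (a cartoon only: KELVIN's actual computation integrates over trait-model parameters, and the prior and Bayes ratios below are invented for illustration):

```python
def update_posterior(prior, bayes_ratio):
    """One step of Bayesian updating: fold one data set's integrated
    likelihood ratio (linkage vs. no linkage) into the posterior
    probability of linkage at a locus."""
    return prior * bayes_ratio / (prior * bayes_ratio + (1 - prior))

# Start from a small prior probability of linkage and accumulate evidence
# from three data sets, each summarized by a (hypothetical) Bayes ratio.
# Data sets supporting linkage (ratio > 1) push the posterior up; data
# sets against it (ratio < 1) push it back down.
posterior = 0.02
for bayes_ratio in (3.0, 5.0, 0.8):
    posterior = update_posterior(posterior, bayes_ratio)
```

The design rationale is that posteriors, unlike p-values, compose across heterogeneous data sets: yesterday's posterior becomes today's prior, so evidence for and against linkage both accumulate on the same scale.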
Conflation of short identity-by-descent segments biases their inferred length distribution
Identity-by-descent (IBD) is a fundamental concept in genetics with many applications. In a common definition, two haplotypes are said to contain an IBD segment if they share a segment that is inherited from a recent shared common ancestor without intervening recombination. Long IBD segments (>1 cM) can be efficiently detected by a number of algorithms using high-density SNP array data from a population sample. However, these approaches detect IBD based on contiguous segments of identity-by-state, and such segments may exist due to the conflation of smaller, nearby IBD segments.

We quantified this effect using coalescent simulations, finding that nearly 40% of inferred segments 1-2 cM long are results of conflations of two or more shorter segments, under demographic scenarios typical for modern humans. This biases the inferred IBD segment length distribution, and so can affect downstream inferences. We observed this conflation effect universally across different IBD detection programs and human demographic histories, and found inference of segments longer than 2 cM to be much more reliable (less than 5% conflation rate). As an example of how this can negatively affect downstream analyses, we present and analyze a novel estimator of the de novo mutation rate using IBD segments, and demonstrate that the biased length distribution of the IBD segments due to conflation can lead to inflated estimates if the conflation is not modeled. Understanding the conflation effect in detail will make its correction in future methods more tractable.