The minimum-entropy set cover problem
Abstract: We consider the minimum entropy principle for learning data generated by a random source and observed with random noise. In our setting we have a sequence of observations of objects drawn uniformly at random from a population. Each object in the population belongs to one class. We perform an observation for each object which determines that it belongs to one of a given set of classes. Given these observations, we are interested in assigning the most likely class to each of the objects. This scenario is a very natural one that appears in many real-life situations. We show that under reasonable assumptions finding the most likely assignment is equivalent to the following variant of the set cover problem. Given a universe U and a collection S=(S1,…,St) of subsets of U, we wish to find an assignment f:U→S such that u∈f(u) and the entropy of the distribution defined by the values |f⁻¹(Si)| is minimized. We show that this problem is NP-hard and that the greedy algorithm for set cover approximates the optimal cover to within an additive constant error. This sheds new light on the behavior of the greedy set cover algorithm. We further enhance the greedy algorithm and show that the problem admits a polynomial-time approximation scheme (PTAS). Finally, we demonstrate how this model and the greedy algorithm can be useful in real-life scenarios, and in particular in problems arising naturally in computational biology.
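The greedy step the abstract refers to can be sketched as follows: repeatedly pick the set covering the most uncovered elements, assign those elements to it, and finally measure the entropy of the resulting class sizes. This is a minimal illustrative sketch, not the paper's enhanced algorithm; all names and the toy instance are invented for the example.

```python
import math

def greedy_min_entropy_cover(universe, sets):
    """Greedy cover: repeatedly choose the set covering the most
    uncovered elements and assign those elements to it."""
    uncovered = set(universe)
    assignment = {}
    while uncovered:
        name, best = max(sets.items(), key=lambda kv: len(kv[1] & uncovered))
        for u in best & uncovered:
            assignment[u] = name
        uncovered -= best
    return assignment

def assignment_entropy(assignment):
    """Entropy (in bits) of the distribution given by the sizes |f^-1(Si)|."""
    n = len(assignment)
    counts = {}
    for cls in assignment.values():
        counts[cls] = counts.get(cls, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Toy instance (hypothetical): the cover {A: 1-4, C: 5-6} has entropy ~0.918 bits.
U = {1, 2, 3, 4, 5, 6}
S = {"A": {1, 2, 3, 4}, "B": {4, 5}, "C": {5, 6}}
f = greedy_min_entropy_cover(U, S)
```

The paper's result says this greedy assignment is within an additive constant of the minimum-entropy assignment; the PTAS refines it further.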
The Binary Perfect Phylogeny with Persistent Characters
The binary perfect phylogeny model is too restrictive to model biological events such as back mutations. In this paper we consider a natural generalization of the model that allows a special type of back mutation. We investigate the problem of reconstructing a near-perfect phylogeny over a binary set of characters where characters are persistent: characters can be gained and lost at most once. Based on this notion, we define the problem of the Persistent Perfect Phylogeny (referred to as P-PP). We restate the P-PP problem as a special case of the Incomplete Directed Perfect Phylogeny, called Incomplete Perfect Phylogeny with Persistent Completion (referred to as IP-PP), where the instance is an incomplete binary matrix M having some missing entries, denoted by the symbol ?, that must be determined (or completed) as 0 or 1 so that M admits a binary perfect phylogeny. We show that the IP-PP problem can be reduced to a problem over an edge-colored graph, since the completion of each column of the input matrix can be represented by a graph operation. Based on this graph formulation, we develop an exact algorithm for solving the P-PP problem that is exponential in the number of characters and polynomial in the number of species.
Comment: 13 pages, 3 figures
Haplotyping a Quantitative Trait with a High-Density Map in Experimental Crosses
BACKGROUND: The ultimate goal of genetic mapping of quantitative trait loci (QTL) is the positional cloning of genes involved in any agriculturally or medically important phenotype. However, only a small portion (≤ 1%) of the QTL detected have been characterized at the molecular level, despite reports of hundreds of thousands of QTL for different traits and populations. METHODS/RESULTS: We develop a statistical model for detecting and characterizing the nucleotide structure and organization of haplotypes that underlie QTL responsible for a quantitative trait in an F2 pedigree. The discovery of such haplotypes by the new model will facilitate the molecular cloning of a QTL. Our model is founded on population genetic properties of genes that are segregating in a pedigree, constructed within the mixture-based maximum likelihood context, and implemented with the EM algorithm. Closed forms have been derived to estimate the linkage and linkage disequilibria among different molecular markers, such as single nucleotide polymorphisms, and the quantitative genetic effects of haplotypes constructed by non-alleles of these markers. Results from the analysis of a real example in mouse have validated the usefulness of the proposed model. CONCLUSION: The model can be flexibly extended to model a complex network of genetic regulation that includes the interactions between different haplotypes and between haplotypes and environments.
Efficient analysis and storage of large-scale genomic data
The impending advent of population-scale sequencing cohorts involving tens of millions of individuals with matched phenotypic measurements will produce unprecedented volumes of genetic data. Storing and analysing such gargantuan datasets places computational performance at a pivotal position in medical genomics. In this thesis, I explore the potential for accelerating and parallelizing standard genetics workflows, file formats, and algorithms using hardware-accelerated vectorization, parallel and distributed algorithms, and heterogeneous computing.
First, I describe a novel bit-counting operation termed the positional population count, which can be used together with succinct representations and standard efficient operations to accelerate many genetic calculations. To enable the use of this new operator and the canonical population count on any target machine, I developed a unified low-level library using CPU dispatching to select the optimal method contingent on the available instruction set architecture and the given input size at run-time. As a proof-of-principle application, I apply the positional population-count operator to computing quality control-related summary statistics for terabyte-scale sequencing readsets with >3,800-fold speed improvements. As another application, I describe a framework for efficiently computing the cardinality of set intersections using these operators, and I applied this framework to compute genome-wide linkage disequilibrium in datasets with up to 67 million samples, resulting in up to >60-fold improvements in speed for dense genotypic vectors, and up to >250,000-fold savings in memory and >100,000-fold improvements in speed for sparse genotypic vectors. I next describe a framework for handling the terabytes of compressed output data, along with graphical routines for visualizing long-range linkage-disequilibrium blocks as seen over many human centromeres. Finally, I describe efficient algorithms for storing and querying very large genetic datasets, and specialized algorithms for the genotype component of such datasets with >10,000-fold savings in memory compared to the current interchange format.
Wellcome Trust
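The positional population count differs from the canonical popcount in that it returns one counter per bit position rather than a single total: for each bit position it counts how many input words have that bit set. The following is a scalar reference sketch of the operation (the thesis describes SIMD-accelerated variants selected by CPU dispatching; this plain-Python version is only for illustration).

```python
def positional_popcount(words, width=16):
    """For each bit position 0..width-1, count how many of the input
    words have that bit set. Scalar reference implementation."""
    counts = [0] * width
    for w in words:
        for i in range(width):
            counts[i] += (w >> i) & 1
    return counts

# Example: three 4-bit words; bit 1 is set in two of them.
result = positional_popcount([0b0011, 0b0010, 0b1000], width=4)
```

Applied to bit-transposed quality or genotype data, one pass over the words yields a full per-position histogram, which is what makes the quality-control summary statistics described above cheap to compute.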
Shape-IT: new rapid and accurate algorithm for haplotype inference
BACKGROUND: We have developed a new computational algorithm, Shape-IT, to infer haplotypes under the genetic model of coalescence with recombination developed by Stephens et al. in Phase v2.1. It runs much faster than Phase v2.1 while exhibiting the same accuracy. The major algorithmic improvements rely on the use of binary trees to represent the sets of candidate haplotypes for each individual. These binary tree representations: (1) speed up the computation of posterior probabilities of the haplotypes by avoiding the redundant operations made in Phase v2.1, and (2) overcome the exponential nature of the haplotype inference problem by smart exploration of the most plausible pathways (i.e., haplotypes) in the binary trees. RESULTS: Our results show that Shape-IT is several orders of magnitude faster than Phase v2.1 while being as accurate. For instance, Shape-IT runs 50 times faster than Phase v2.1 to compute the haplotypes of 200 subjects on 6,000 segments of 50 SNPs extracted from a standard Illumina 300K chip (13 days instead of 630 days). We also compared Shape-IT with other widely used software, Gerbil, PL-EM, Fastphase, 2SNP, and Ishape, in various tests: Shape-IT and Phase v2.1 were the most accurate in all cases, followed by Ishape and Fastphase. In terms of speed, Shape-IT was faster than Ishape and Fastphase for datasets smaller than 100 SNPs, but Fastphase became faster, though still less accurate, at inferring haplotypes on larger SNP datasets. CONCLUSION: Shape-IT deserves to be extensively used for regular haplotype inference, but also in the context of the new high-throughput genotyping chips, since it makes it possible to fit the genetic model of Phase v2.1 on large datasets. This new algorithm based on tree representations could be used in other HMM-based haplotype inference software and may apply more broadly to other fields using HMMs.
Polygenic Adaptation to an Environmental Shift: Temporal Dynamics of Variation Under Gaussian Stabilizing Selection and Additive Effects on a Single Trait.
Predictions about the effect of natural selection on patterns of linked neutral variation are largely based on models involving the rapid fixation of unconditionally beneficial mutations. However, when phenotypes adapt to a new optimum trait value, the strength of selection on individual mutations decreases as the population adapts. Here, I use explicit forward simulations of a single trait with additive-effect mutations adapting to an "optimum shift." Detectable "hitchhiking" patterns are only apparent if (i) the optimum shifts are large with respect to equilibrium variation for the trait, (ii) mutation rates to large-effect mutations are low, and (iii) large-effect mutations rapidly increase in frequency and eventually reach fixation, which typically occurs after the population reaches the new optimum. For the parameters simulated here, partial sweeps do not appreciably affect patterns of linked variation, even when the mutations are strongly selected. The contribution of new mutations vs. standing variation to fixation depends on the mutation rate affecting trait values. Given the fixation of a strongly selected variant, patterns of hitchhiking are similar on average for the two classes of sweeps, because sweeps from standing variation involving large-effect mutations are rare when the optimum shifts. The distribution of effect sizes of new mutations has little effect on the time to reach the new optimum, but reducing the mutational variance increases the magnitude of hitchhiking patterns. In general, populations reach the new optimum prior to the completion of any sweeps, and the times to fixation are longer for this model than for standard models of directional selection. The long fixation times are due to a combination of declining selection pressures during adaptation and the possibility of interference among weakly selected sites for traits with high mutation rates.
Haplotype-based quantitative trait mapping using a clustering algorithm
BACKGROUND: With the availability of large-scale, high-density single-nucleotide polymorphism (SNP) markers, substantial effort has been made to identify disease-causing genes using linkage disequilibrium (LD) mapping by haplotype analysis of unrelated individuals. In addition to complex diseases, many continuously distributed quantitative traits are of primary clinical and health significance. However, the development of association mapping methods using unrelated individuals for quantitative traits has received relatively less attention. RESULTS: We recently developed an association mapping method for complex diseases by mining the sharing of haplotype segments (i.e., phased genotype pairs) in affected individuals that are rarely present in normal individuals. In this paper, we extend our previous work to address the problem of quantitative trait mapping from unrelated individuals. The method is non-parametric in nature, and statistical significance can be obtained by a permutation test. It can also be incorporated into the one-way ANCOVA (analysis of covariance) framework so that other factors and covariates can be easily accounted for. The effectiveness of the approach is demonstrated by extensive experimental studies using both simulated and real data sets. The results show that our haplotype-based approach is more robust than two statistical methods based on single markers: a single SNP association test (SSA) and the Mann-Whitney U-test (MWU). The algorithm has been incorporated into our existing software package, HapMiner, which is available from our website. CONCLUSION: For QTL (quantitative trait loci) fine mapping, to identify QTNs (quantitative trait nucleotides) with realistic effects (the contribution of each QTN less than 10% of the total variance of the trait), large sample sizes (≥ 500) are needed for all the methods. The overall performance of HapMiner is better than that of the other two methods. Its effectiveness further depends on other factors such as recombination rates and the density of typed SNPs. Haplotype-based methods might provide higher power than methods based on a single SNP when using tag SNPs selected from a small number of samples or other sources (such as HapMap data). Rank-based statistics usually have much lower power, as shown in our study.
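The permutation-test idea the abstract relies on can be sketched generically: compare the trait values of two groups (e.g. carriers vs. non-carriers of a shared haplotype segment), then re-randomize group labels many times to see how often a difference at least as large arises by chance. The statistic below is a plain difference in means for illustration only; HapMiner's actual clustering-based statistic is different, and all names here are invented.

```python
import random

def permutation_pvalue(values_a, values_b, n_perm=10000, seed=0):
    """Two-group permutation test on the absolute difference in means.
    Returns an estimated p-value with the standard +1 correction."""
    rng = random.Random(seed)
    observed = abs(sum(values_a) / len(values_a) - sum(values_b) / len(values_b))
    pooled = list(values_a) + list(values_b)
    k = len(values_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # re-randomize group labels
        diff = abs(sum(pooled[:k]) / k - sum(pooled[k:]) / (len(pooled) - k))
        if diff >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical trait values for haplotype carriers vs. non-carriers.
p = permutation_pvalue([10.0, 11.0, 12.0, 13.0], [1.0, 2.0, 3.0, 4.0], n_perm=2000)
```

Because no distributional assumption is made about the trait, this kind of test stays valid where the parametric single-marker tests it is compared against may not.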