107 research outputs found

    An analytical comparison of coalescent-based multilocus methods: The three-taxon case

    Full text link
    Incomplete lineage sorting (ILS) is a common source of gene tree incongruence in multilocus analyses. A large number of methods have been developed to infer species trees in the presence of ILS. Here we provide a mathematical analysis of several coalescent-based methods. Our analysis is performed on a three-taxon species tree and assumes that the gene trees are correctly reconstructed along with their branch lengths

    Phylogenetic mixtures: Concentration of measure in the large-tree limit

    Get PDF
    The reconstruction of phylogenies from DNA or protein sequences is a major task of computational evolutionary biology. Common phenomena, notably variations in mutation rates across genomes and incongruences between gene lineage histories, often make it necessary to model molecular data as originating from a mixture of phylogenies. Such mixed models play an increasingly important role in practice. Using concentration of measure techniques, we show that mixtures of large trees are typically identifiable. We also derive sequence-length requirements for high-probability reconstruction.Comment: Published in at http://dx.doi.org/10.1214/11-AAP837 the Annals of Applied Probability (http://www.imstat.org/aap/) by the Institute of Mathematical Statistics (http://www.imstat.org

    Generalized least squares can overcome the critical threshold in respondent-driven sampling

    Full text link
    In order to sample marginalized and/or hard-to-reach populations, respondent-driven sampling (RDS) and similar techniques reach their participants via peer referral. Under a Markov model for RDS, previous research has shown that if the typical participant refers too many contacts, then the variance of common estimators does not decay like O(nβˆ’1)O(n^{-1}), where nn is the sample size. This implies that confidence intervals will be far wider than under a typical sampling design. Here we show that generalized least squares (GLS) can effectively reduce the variance of RDS estimates. In particular, a theoretical analysis indicates that the variance of the GLS estimator is O(nβˆ’1)O(n^{-1}). We then derive two classes of feasible GLS estimators. The first class is based upon a Degree Corrected Stochastic Blockmodel for the underlying social network. The second class is based upon a rank-two model. It might be of independent interest that in both model classes, the theoretical results show that it is possible to estimate the spectral properties of the population network from the sampled observations. Simulations on empirical social networks show that the feasible GLS (fGLS) estimators can have drastically smaller error and rarely increase the error. A diagnostic plot helps to identify where fGLS will aid estimation. The fGLS estimators continue to outperform standard estimators even when they are built from a misspecified model and when there is preferential recruitment.Comment: Submitte

    On the inference of large phylogenies with long branches: How long is too long?

    Get PDF
    Recent work has highlighted deep connections between sequence-length requirements for high-probability phylogeny reconstruction and the related problem of the estimation of ancestral sequences. In [Daskalakis et al.'09], building on the work of [Mossel'04], a tight sequence-length requirement was obtained for the CFN model. In particular the required sequence length for high-probability reconstruction was shown to undergo a sharp transition (from O(log⁑n)O(\log n) to poly(n)\hbox{poly}(n), where nn is the number of leaves) at the "critical" branch length \critmlq (if it exists) of the ancestral reconstruction problem. Here we consider the GTR model. For this model, recent results of [Roch'09] show that the tree can be accurately reconstructed with sequences of length O(log⁑(n))O(\log(n)) when the branch lengths are below \critksq, known as the Kesten-Stigum (KS) bound. Although for the CFN model \critmlq = \critksq, it is known that for the more general GTR models one has \critmlq \geq \critksq with a strict inequality in many cases. Here, we show that this phenomenon also holds for phylogenetic reconstruction by exhibiting a family of symmetric models QQ and a phylogenetic reconstruction algorithm which recovers the tree from O(log⁑n)O(\log n)-length sequences for some branch lengths in the range (\critksq,\critmlq). Second we prove that phylogenetic reconstruction under GTR models requires a polynomial sequence-length for branch lengths above \critmlq
    • …