
    Efficient Quartet Representations of Trees and Applications to Supertree and Summary Methods

    Quartet trees displayed by larger phylogenetic trees have long been used as inputs for species tree and supertree reconstruction. Computational constraints prevent the use of all displayed quartets in many practical problems due to the number of taxa. We introduce the notion of an Efficient Quartet System (EQS) to represent a phylogenetic tree with a subset of the quartets displayed by the tree. We show mathematically that the set of quartets obtained from a tree via an EQS contains all of the combinatorial information of the tree itself. Using performance tests on simulated datasets, we also demonstrate that using an EQS to reduce the number of quartets in pipelines for summary methods of species tree inference and supertree inference results in only small reductions in accuracy. Comment: 7 pages, minor revisions
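    As a hedged illustration of the quartet setting (not the paper's EQS construction), the sketch below enumerates every quartet topology displayed by a small unrooted tree given as a set of splits; the tree and leaf names are made up.

```python
from itertools import combinations

leaves = {"a", "b", "c", "d", "e"}
# Nontrivial splits of the 5-taxon caterpillar tree ((a,b),c,(d,e)):
splits = [({"a", "b"}, {"c", "d", "e"}),
          ({"a", "b", "c"}, {"d", "e"})]

def displayed_quartets(leaves, splits):
    """Yield each quartet topology (pair | pair) induced by the tree's splits."""
    for quartet in combinations(sorted(leaves), 4):
        for side_a, side_b in splits:
            left = [t for t in quartet if t in side_a]
            right = [t for t in quartet if t in side_b]
            if len(left) == 2 and len(right) == 2:
                yield tuple(left), tuple(right)
                break  # a tree induces at most one resolved topology per quartet

for q in displayed_quartets(leaves, splits):
    print(q)  # e.g. (('a', 'b'), ('c', 'd'))
```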

    Sequence-Length Requirement of Distance-Based Phylogeny Reconstruction: Breaking the Polynomial Barrier

    We introduce a new distance-based phylogeny reconstruction technique which provably achieves, at sufficiently short branch lengths, a polylogarithmic sequence-length requirement -- improving significantly over previous polynomial bounds for distance-based methods. The technique is based on an averaging procedure that implicitly reconstructs ancestral sequences. By the same token, we extend previous results on phase transitions in phylogeny reconstruction to general time-reversible models. More precisely, we show that in the so-called Kesten-Stigum zone (roughly, a region of the parameter space where ancestral sequences are well approximated by "linear combinations" of the observed sequences) sequences of length poly(log n) suffice for reconstruction when branch lengths are discretized. Here n is the number of extant species. Our results challenge, to some extent, the conventional wisdom that estimates of evolutionary distances alone carry significantly less information about phylogenies than full sequence datasets.
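    A minimal sketch of the distance-based setting the abstract builds on (not the authors' ancestral-averaging technique): estimating pairwise Jukes-Cantor distances from aligned sequences, the basic input to any distance-based reconstruction. The sequences below are made up.

```python
import math

def jc_distance(seq1, seq2):
    """Jukes-Cantor distance d = -(3/4) * ln(1 - 4p/3), with p the mismatch fraction."""
    assert len(seq1) == len(seq2)
    p = sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)
    if p >= 0.75:
        return float("inf")  # saturated: the corrected distance is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# One mismatch in ten aligned sites (made-up sequences).
print(jc_distance("ACGTACGTAC", "ACGTACGTAA"))  # ~0.107
```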

    Dynamic Ordered Sets with Exponential Search Trees

    We introduce exponential search trees as a novel technique for converting static polynomial space search structures for ordered sets into fully-dynamic linear space data structures. This leads to an optimal bound of O(sqrt(log n/loglog n)) for searching and updating a dynamic set of n integer keys in linear space. Here searching an integer y means finding the maximum key in the set which is smaller than or equal to y. This problem is equivalent to the standard textbook problem of maintaining an ordered set (see, e.g., Cormen, Leiserson, Rivest, and Stein: Introduction to Algorithms, 2nd ed., MIT Press, 2001). The best previous deterministic linear space bound was O(log n/loglog n) due to Fredman and Willard from STOC 1990. No better deterministic search bound was known using polynomial space. We also get the following worst-case linear space trade-offs between the number n, the word length w, and the maximal key U < 2^w: O(min{loglog n+log n/log w, (loglog n)(loglog U)/(logloglog U)}). These trade-offs are, however, not likely to be optimal. Our results are generalized to finger searching and string searching, providing optimal results for both in terms of n. Comment: Revision corrects some typos and states things better for applications in a subsequent paper
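    For intuition, here is a plain sorted-array baseline for the operation defined above -- finding the maximum key smaller than or equal to y (predecessor search) -- not the exponential search tree itself; the keys are made up.

```python
from bisect import bisect_right

def predecessor(sorted_keys, y):
    """Return the largest key <= y, or None if every key exceeds y."""
    i = bisect_right(sorted_keys, y)
    return sorted_keys[i - 1] if i else None

keys = [3, 7, 19, 42, 100]
print(predecessor(keys, 50))  # 42
print(predecessor(keys, 2))   # None
```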

    A Fast and Accurate Unconstrained Face Detector

    We propose a method to address challenges in unconstrained face detection, such as arbitrary pose variations and occlusions. First, a new image feature called Normalized Pixel Difference (NPD) is proposed. The NPD feature is computed as the difference-to-sum ratio between two pixel values, inspired by the Weber Fraction in experimental psychology. The new feature is scale invariant, bounded, and able to reconstruct the original image. Second, we propose a deep quadratic tree to learn the optimal subset of NPD features and their combinations, so that complex face manifolds can be partitioned by the learned rules. This way, only a single soft-cascade classifier is needed to handle unconstrained face detection. Furthermore, we show that the NPD features can be efficiently obtained from a lookup table, and the detection template can be easily scaled, making the proposed face detector very fast. Experimental results on three public face datasets (FDDB, GENKI, and CMU-MIT) show that the proposed method achieves state-of-the-art performance in detecting unconstrained faces with arbitrary pose variations and occlusions in cluttered scenes. Comment: This paper has been accepted by TPAMI. The source code is available on the project page http://www.cbsr.ia.ac.cn/users/scliao/projects/npdface/index.htm
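    A small sketch of the NPD feature as described (the difference-to-sum ratio of two pixel values); treating the 0/0 case as 0 is an assumption here.

```python
def npd(x, y):
    """NPD(x, y) = (x - y) / (x + y): bounded in [-1, 1] and scale invariant."""
    if x == 0 and y == 0:
        return 0.0  # assumed convention for the degenerate case
    return (x - y) / (x + y)

print(npd(200, 100))  # 0.333...
print(npd(50, 25))    # same value: scaling both pixels leaves NPD unchanged
print(npd(100, 200))  # -0.333...
```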

    Coalescent-based species tree estimation: a stochastic Farris transform

    The reconstruction of a species phylogeny from genomic data faces two significant hurdles: 1) the trees describing the evolution of each individual gene--i.e., the gene trees--may differ from the species phylogeny and 2) the molecular sequences corresponding to each gene often provide limited information about the gene trees themselves. In this paper we consider an approach to species tree reconstruction that addresses both these hurdles. Specifically, we propose an algorithm for phylogeny reconstruction under the multispecies coalescent model with a standard model of site substitution. The multispecies coalescent is commonly used to model gene tree discordance due to incomplete lineage sorting, a well-studied population-genetic effect. In previous work, an information-theoretic trade-off was derived in this context between the number of loci, m, needed for an accurate reconstruction and the length of the locus sequences, k. It was shown that to reconstruct an internal branch of length f, one needs m to be of the order of 1/[f^2 sqrt(k)]. That previous result was obtained under the molecular clock assumption, i.e., under the assumption that mutation rates (as well as population sizes) are constant across the species phylogeny. Here we generalize this result beyond the restrictive molecular clock assumption, and obtain a new reconstruction algorithm that has the same data requirement (up to log factors). Our main contribution is a novel reduction to the molecular clock case under the multispecies coalescent. As a corollary, we also obtain a new identifiability result of independent interest: for any species tree with n ≥ 3 species, the rooted species tree can be identified from the distribution of its unrooted weighted gene trees even in the absence of a molecular clock. Comment: Submitted. 49 pages
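    As a back-of-the-envelope reading of the stated trade-off, with made-up values of f and k and all constants and log factors dropped:

```latex
% Illustrative only: the branch length f and locus length k below are invented.
\[
  f = 0.01,\quad k = 400
  \;\Longrightarrow\;
  m \;\approx\; \frac{1}{f^{2}\sqrt{k}}
  \;=\; \frac{1}{(0.01)^{2}\cdot 20}
  \;=\; 500 \ \text{loci (up to constants and log factors).}
\]
```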

    The hard-core model on random graphs revisited

    We revisit the classical hard-core model, also known as the independent set model and dual to the vertex cover problem, where one puts particles with a first-neighbor hard-core repulsion on the vertices of a random graph. Although the cases of random graphs with small and with very large average degrees are quite well understood, they yield qualitatively different results, and our aim here is to reconcile these two cases. We revisit results that can be obtained using the (heuristic) cavity method and show that it provides a closed-form conjecture for the exact density of the densest packing on random regular graphs with degree K >= 20, and that for K > 16 the nature of the phase transition is the same as for large K. This also shows that the hard-core model is the simplest mean-field lattice model for structural glasses and jamming. Comment: 9 pages, 2 figures, International Meeting on "Inference, Computation, and Spin Glasses" (ICSG2013), Sapporo, Japan
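    A hedged sketch of the model itself: sampling hard-core (independent set) configurations on a random regular graph with single-site Glauber dynamics and reporting the packing density. This is a plain Monte Carlo baseline, not the cavity-method calculation; the graph size, degree, and activity are made-up parameters.

```python
import random
import networkx as nx

def hard_core_density(n=1000, degree=3, lam=1.0, sweeps=200, seed=0):
    """Sample an independent set at activity lam via Glauber dynamics; return its density."""
    rng = random.Random(seed)
    g = nx.random_regular_graph(degree, n, seed=seed)
    occupied = set()
    for _ in range(sweeps * n):
        v = rng.randrange(n)
        if any(u in occupied for u in g[v]):
            occupied.discard(v)               # a neighbor is occupied: v must stay empty
        elif rng.random() < lam / (1.0 + lam):
            occupied.add(v)                   # heat-bath update: occupy with prob lam/(1+lam)
        else:
            occupied.discard(v)
    return len(occupied) / n

print(hard_core_density())  # packing density of the sampled configuration
```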

    The Complexity of Phylogeny Constraint Satisfaction Problems

    We systematically study the computational complexity of a broad class of computational problems in phylogenetic reconstruction. The class contains for example the rooted triple consistency problem, forbidden subtree problems, the quartet consistency problem, and many other problems studied in the bioinformatics literature. The studied problems can be described as constraint satisfaction problems where the constraints have a first-order definition over the rooted triple relation. We show that every such phylogeny problem can be solved in polynomial time or is NP-complete. On the algorithmic side, we generalize a well-known polynomial-time algorithm of Aho, Sagiv, Szymanski, and Ullman for the rooted triple consistency problem. Our algorithm repeatedly solves linear equation systems to construct a solution in polynomial time. We then show that every phylogeny problem that cannot be solved by our algorithm is NP-complete. Our classification establishes a dichotomy for a large class of infinite structures that we believe is of independent interest in universal algebra, model theory, and topology. The proof of our main result combines results and techniques from various research areas: a recent classification of the model-complete cores of the reducts of the homogeneous binary branching C-relation, Leeb's Ramsey theorem for rooted trees, and universal algebra. Comment: 48 pages, 2 figures. In this version we fix several bugs in the proofs of the previous version
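    Since the abstract generalizes the Aho-Sagiv-Szymanski-Ullman algorithm, a compact sketch of the textbook version (BUILD) for rooted triple consistency may help; the example triples are made up, and this is not the paper's generalized linear-equation algorithm.

```python
def build(leaves, triples):
    """Return a rooted tree (nested tuples) displaying every triple ab|c, or None if inconsistent."""
    leaves = set(leaves)
    if len(leaves) == 1:
        return next(iter(leaves))
    # Aho graph: connect a and b (union-find) for every triple ab|c inside the current leaf set.
    parent = {x: x for x in leaves}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b, c in triples:
        if a in leaves and b in leaves and c in leaves:
            parent[find(a)] = find(b)
    components = {}
    for x in leaves:
        components.setdefault(find(x), set()).add(x)
    if len(components) < 2:
        return None  # the triples force a single component: no tree displays them all
    children = []
    for comp in components.values():
        subtree = build(comp, [t for t in triples if set(t) <= comp])
        if subtree is None:
            return None
        children.append(subtree)
    return tuple(children)

# ("a", "b", "c") encodes the rooted triple ab|c ("a and b are closer to each other than to c").
print(build({"a", "b", "c", "d"}, [("a", "b", "c"), ("c", "d", "a")]))
# prints a tree equivalent to (('a', 'b'), ('c', 'd')); child order may vary
```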

    On Learning a Hidden Directed Graph with Path Queries

    In this paper, we consider the problem of reconstructing a directed graph using path queries. In this query model of learning, a graph is hidden from the learner, and the learner can access information about it with path queries. For a source and destination node, a path query returns whether there is a directed path from the source to the destination node in the hidden graph. In this paper we first give bounds for learning graphs on n vertices and k strongly connected components. We then study the case of bounded degree directed trees and give new algorithms for learning "almost-trees" -- directed trees to which extra edges have been added. We also give some lower bound constructions justifying our approach. Comment: 11 pages
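    A minimal sketch of the path-query model for intuition: the learner only sees answers to "is there a directed path from u to v?". The brute-force recovery of the reachability relation below is a naive baseline, not one of the paper's algorithms, and the hidden graph is a made-up example.

```python
import networkx as nx

def make_path_oracle(hidden_graph):
    """Wrap a hidden digraph as an oracle answering 'is v reachable from u?'."""
    def query(u, v):
        return nx.has_path(hidden_graph, u, v)
    return query

# Hidden example graph, unknown to the learner.
g = nx.DiGraph([(0, 1), (1, 2), (0, 3)])
oracle = make_path_oracle(g)

# Naive learner: recover the whole reachability relation with O(n^2) queries.
nodes = list(g.nodes)
reachable = {(u, v) for u in nodes for v in nodes if u != v and oracle(u, v)}
print(sorted(reachable))  # [(0, 1), (0, 2), (0, 3), (1, 2)]
```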

    ITCM: A Real Time Internet Traffic Classifier Monitor

    The continual growth of high speed networks is a challenge for real-time network analysis systems. Real-time traffic classification is an issue for corporations and ISPs (Internet Service Providers). This work presents the design and implementation of a real-time flow-based network traffic classification system. The classifier monitor acts as a pipeline consisting of three modules: packet capture and pre-processing, flow reassembly, and classification with Machine Learning (ML). The modules are built as concurrent processes with well-defined data interfaces between them so that any module can be improved and updated independently. In this pipeline, the flow reassembly function is the performance bottleneck. In this implementation, an efficient reassembly method was used, resulting in an average delivery delay of approximately 0.49 seconds. For the classification module, the performance of the K-Nearest Neighbor (KNN), C4.5 Decision Tree, Naive Bayes (NB), Flexible Naive Bayes (FNB) and AdaBoost Ensemble Learning algorithms is compared in order to validate our approach. Comment: 16 pages, 3 figures, 7 tables, International Journal of Computer Science & Information Technology (IJCSIT) Vol 6, No 6, December 201
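    A hedged sketch of the classification-stage comparison on made-up flow features, using scikit-learn stand-ins (KNN, an entropy-based decision tree in place of C4.5, Gaussian Naive Bayes, and AdaBoost); Flexible Naive Bayes has no direct scikit-learn equivalent and is omitted here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import AdaBoostClassifier

# Synthetic stand-in for per-flow features (packet sizes, inter-arrival statistics, ...).
X, y = make_classification(n_samples=2000, n_features=12, n_informative=6,
                           n_classes=3, random_state=0)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "C4.5-like tree": DecisionTreeClassifier(criterion="entropy"),
    "Naive Bayes": GaussianNB(),
    "AdaBoost": AdaBoostClassifier(n_estimators=100),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold accuracy
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```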

    Particle Control in Phase Space by Global K-Means Clustering

    We devise and explore an iterative optimization procedure for controlling particle populations in particle-in-cell (PIC) codes via merging and splitting of computational macro-particles. Our approach is to compute an optimal representation of the global particle phase space structure while decreasing or increasing the entire particle population, based on k-means clustering of the data. In essence, the procedure amounts to merging or splitting particles by statistical means, throughout the entire simulation volume in question, while minimizing a 6-dimensional total distance measure to preserve the physics. Particle merging is by far the most demanding procedure when considering conservation laws of physics; it amounts to lossy compression of particle phase space data. We demonstrate that our k-means approach conserves energy and momentum to high accuracy, even for high compression ratios, R ≈ 3, i.e., N_f ≲ 0.33 N_i. Interestingly, we find that an accurate particle splitting step can be performed using k-means as well; this follows from a symmetry argument. The split solution, using k-means, places split particles optimally, to obtain maximal spanning on the phase space manifold. Implementation and testing are done using an electromagnetic PIC code, the \ppcode. Nonetheless, the k-means framework is general; it is not limited to Vlasov-Maxwell type PIC codes. We discuss advantages and drawbacks of this optimal phase space reconstruction. Comment: Revision 1. Major revisions. Added discussion. 18 pages, 22 figures, submitted to Journal of Computational Physics
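    A hedged sketch of k-means-based particle merging in 6D phase space: each cluster of macro-particles is replaced by one particle at the weight-averaged phase-space position, which conserves total weight and momentum by construction (energy only approximately, unlike the more careful scheme discussed above). The particle data and parameters are made up, and this is not the \ppcode implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_particles, n_merged = 3000, 1000                 # compression ratio R ~ 3
phase = rng.normal(size=(n_particles, 6))          # columns: x, y, z, vx, vy, vz
weight = rng.uniform(0.5, 1.5, size=n_particles)   # macro-particle weights

# Cluster the 6D phase-space coordinates and merge each cluster into one particle.
labels = KMeans(n_clusters=n_merged, n_init=4, random_state=0).fit_predict(phase)

merged_phase = np.zeros((n_merged, 6))
merged_weight = np.zeros(n_merged)
for c in range(n_merged):
    idx = labels == c
    merged_weight[c] = weight[idx].sum()
    merged_phase[c] = np.average(phase[idx], axis=0, weights=weight[idx])

# Total weight and momentum (weight * velocity) match the original population.
print(np.isclose(weight.sum(), merged_weight.sum()))
print(np.allclose((weight[:, None] * phase[:, 3:]).sum(axis=0),
                  (merged_weight[:, None] * merged_phase[:, 3:]).sum(axis=0)))
```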