7,549 research outputs found
Efficient Quartet Representations of Trees and Applications to Supertree and Summary Methods
Quartet trees displayed by larger phylogenetic trees have long been used as
inputs for species tree and supertree reconstruction. Computational constraints
prevent the use of all displayed quartets in many practical problems due to the
number of taxa. We introduce the notion of an Efficient Quartet System (EQS) to
represent a phylogenetic tree with a subset of the quartets displayed by the
tree. We show mathematically that the set of quartets obtained from a tree via
an EQS contains all of the combinatorial information of the tree itself. Using
performance tests on simulated datasets, we also demonstrate that using an EQS
to reduce the number of quartets in pipelines for summary methods of species
tree inference and supertree inference results in only small reductions in
accuracy.
Comment: 7 pages, minor revision
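As a rough illustration of the computational constraint mentioned above: a fully resolved tree displays one quartet for every 4-element subset of its taxa, so the number of displayed quartets grows as "n choose 4". The sketch below only does this arithmetic. The function name is hypothetical, and this is not the EQS construction from the paper.

```python
from math import comb

def num_displayed_quartets(n_taxa: int) -> int:
    """Number of 4-taxon subsets of n_taxa, i.e. the number of quartets
    displayed by a fully resolved tree on n_taxa leaves."""
    return comb(n_taxa, 4)

if __name__ == "__main__":
    # The count grows quartically in the number of taxa.
    for n in (10, 100, 1000):
        print(n, num_displayed_quartets(n))
```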
Sequence-Length Requirement of Distance-Based Phylogeny Reconstruction: Breaking the Polynomial Barrier
We introduce a new distance-based phylogeny reconstruction technique which
provably achieves, at sufficiently short branch lengths, a polylogarithmic
sequence-length requirement -- improving significantly over previous polynomial
bounds for distance-based methods. The technique is based on an averaging
procedure that implicitly reconstructs ancestral sequences.
By the same token, we extend previous results on phase transitions in
phylogeny reconstruction to general time-reversible models. More precisely, we
show that in the so-called Kesten-Stigum zone (roughly, a region of the
parameter space where ancestral sequences are well approximated by ``linear
combinations'' of the observed sequences) sequences of length poly(log n)
suffice for reconstruction when branch lengths are discretized. Here n is the
number of extant species.
Our results challenge, to some extent, the conventional wisdom that estimates
of evolutionary distances alone carry significantly less information about
phylogenies than full sequence datasets.
Dynamic Ordered Sets with Exponential Search Trees
We introduce exponential search trees as a novel technique for converting
static polynomial space search structures for ordered sets into fully-dynamic
linear space data structures.
This leads to an optimal bound of O(sqrt(log n/loglog n)) for searching and
updating a dynamic set of n integer keys in linear space. Here searching an
integer y means finding the maximum key in the set which is smaller than or
equal to y. This problem is equivalent to the standard text book problem of
maintaining an ordered set (see, e.g., Cormen, Leiserson, Rivest, and Stein:
Introduction to Algorithms, 2nd ed., MIT Press, 2001).
The best previous deterministic linear space bound was O(log n/loglog n) due to
Fredman and Willard from STOC 1990. No better deterministic search bound was
known using polynomial space.
We also get the following worst-case linear space trade-offs between the
number n, the word length w, and the maximal key U < 2^w: O(min{loglog n+log
n/log w, (loglog n)(loglog U)/(logloglog U)}). These trade-offs are, however,
not likely to be optimal.
Our results are generalized to finger searching and string searching,
providing optimal results for both in terms of n.
Comment: Revision corrects some typos and states things better for
applications in subsequent paper
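The search operation defined in the abstract (find the maximum key that is smaller than or equal to y) can be illustrated with a plain sorted list. This is only a sketch of the query semantics using Python's standard `bisect` module, with O(log n) search and O(n) insertion. It is not an exponential search tree and does not achieve the bounds stated above. The function names are hypothetical.

```python
from bisect import bisect_right, insort

def predecessor(sorted_keys, y):
    """Return the largest key <= y in sorted_keys, or None if no such key."""
    i = bisect_right(sorted_keys, y)  # index of the first key strictly greater than y
    return sorted_keys[i - 1] if i > 0 else None

def insert_key(sorted_keys, key):
    """Insert key while keeping the list sorted (linear-time shift)."""
    insort(sorted_keys, key)

if __name__ == "__main__":
    keys = [3, 8, 15]
    insert_key(keys, 11)
    print(predecessor(keys, 12))  # largest key not exceeding 12
```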
A Fast and Accurate Unconstrained Face Detector
We propose a method to address challenges in unconstrained face detection,
such as arbitrary pose variations and occlusions. First, a new image feature
called Normalized Pixel Difference (NPD) is proposed. The NPD feature is
computed as the difference-to-sum ratio between two pixel values, inspired by
the Weber
Fraction in experimental psychology. The new feature is scale invariant,
bounded, and is able to reconstruct the original image. Second, we propose a
deep quadratic tree to learn the optimal subset of NPD features and their
combinations, so that complex face manifolds can be partitioned by the learned
rules. This way, only a single soft-cascade classifier is needed to handle
unconstrained face detection. Furthermore, we show that the NPD features can be
efficiently obtained from a look up table, and the detection template can be
easily scaled, making the proposed face detector very fast. Experimental
results on three public face datasets (FDDB, GENKI, and CMU-MIT) show that the
proposed method achieves state-of-the-art performance in detecting
unconstrained faces with arbitrary pose variations and occlusions in cluttered
scenes.
Comment: This paper has been accepted by TPAMI. The source code is available
on the project page
http://www.cbsr.ia.ac.cn/users/scliao/projects/npdface/index.htm
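A minimal sketch of the difference-to-sum ratio and the look-up table idea described above, assuming 8-bit pixel intensities. The sign convention (x - y over x + y) and the choice to define the value as 0 when both pixels are 0 are assumptions made here for illustration. The function names are hypothetical and this is not the released detector code.

```python
def npd(x: int, y: int) -> float:
    """Difference-to-sum ratio of two pixel intensities.
    Assumption: the value is defined as 0 when both pixels are 0."""
    if x == 0 and y == 0:
        return 0.0
    return (x - y) / (x + y)

def build_npd_table(levels: int = 256):
    """Precompute npd for every pair of intensities so that feature
    evaluation becomes a single table look-up."""
    return [[npd(x, y) for y in range(levels)] for x in range(levels)]

if __name__ == "__main__":
    table = build_npd_table()
    # Scaling both pixels by the same factor leaves the feature unchanged.
    print(npd(2, 1), npd(200, 100), table[200][100])
```

For nonnegative intensities the value always lies in [-1, 1], which matches the "bounded" property claimed in the abstract.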
Coalescent-based species tree estimation: a stochastic Farris transform
The reconstruction of a species phylogeny from genomic data faces two
significant hurdles: 1) the trees describing the evolution of each individual
gene--i.e., the gene trees--may differ from the species phylogeny and 2) the
molecular sequences corresponding to each gene often provide limited
information about the gene trees themselves. In this paper we consider an
approach to species tree reconstruction that addresses both these hurdles.
Specifically, we propose an algorithm for phylogeny reconstruction under the
multispecies coalescent model with a standard model of site substitution. The
multispecies coalescent is commonly used to model gene tree discordance due to
incomplete lineage sorting, a well-studied population-genetic effect.
In previous work, an information-theoretic trade-off was derived in this
context between the number of loci, m, needed for an accurate reconstruction
and the length of the locus sequences, k. It was shown that to reconstruct an
internal branch of length f, one needs m to be of the order of
1/[f^2 sqrt(k)]. That previous result was obtained under the molecular clock
assumption, i.e., under the assumption that mutation rates (as well as
population sizes) are constant across the species phylogeny.
Here we generalize this result beyond the restrictive molecular clock
assumption, and obtain a new reconstruction algorithm that has the same data
requirement (up to log factors). Our main contribution is a novel reduction to
the molecular clock case under the multispecies coalescent. As a corollary, we
also obtain a new identifiability result of independent interest: for any
species tree with n >= 3 species, the rooted species tree can be identified
from the distribution of its unrooted weighted gene trees even in the absence
of a molecular clock.
Comment: Submitted. 49 pages
The hard-core model on random graphs revisited
We revisit the classical hard-core model, also known as independent set and
dual to vertex cover problem, where one puts particles with a first-neighbor
hard-core repulsion on the vertices of a random graph. Although the cases of
random graphs with small and with very large average degree are each quite
well understood, they yield qualitatively different results, and our aim here is
to reconcile these two cases. We revisit results that can be obtained using
the (heuristic) cavity method and show that it provides a closed-form
conjecture for the exact density of the densest packing on random regular
graphs with degree K>=20, and that for K>16 the nature of the phase transition
is the same as for large K. This also shows that the hard-core model is the
simplest mean-field lattice model for structural glasses and jamming.
Comment: 9 pages, 2 figures, International Meeting on "Inference, Computation,
and Spin Glasses" (ICSG2013), Sapporo, Japan
The Complexity of Phylogeny Constraint Satisfaction Problems
We systematically study the computational complexity of a broad class of
computational problems in phylogenetic reconstruction. The class contains for
example the rooted triple consistency problem, forbidden subtree problems, the
quartet consistency problem, and many other problems studied in the
bioinformatics literature. The studied problems can be described as
\emph{constraint satisfaction problems} where the constraints have a
first-order definition over the rooted triple relation. We show that every such
phylogeny problem can be solved in polynomial time or is NP-complete. On the
algorithmic side, we generalize a well-known polynomial-time algorithm of Aho,
Sagiv, Szymanski, and Ullman for the rooted triple consistency problem. Our
algorithm repeatedly solves linear equation systems to construct a solution in
polynomial time. We then show that every phylogeny problem that cannot be
solved by our algorithm is NP-complete. Our classification establishes a
dichotomy for a large class of infinite structures that we believe is of
independent interest in universal algebra, model theory, and topology. The
proof of our main result combines results and techniques from various research
areas: a recent classification of the model-complete cores of the reducts of
the homogeneous binary branching C-relation, Leeb's Ramsey theorem for rooted
trees, and universal algebra.
Comment: 48 pages, 2 figures. In this version we fix several bugs in the
proofs of the previous version
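For orientation, the classical algorithm of Aho, Sagiv, Szymanski, and Ullman for rooted triple consistency can be sketched as follows. This is the well-known base algorithm, not the generalization developed in the paper. A triple (a, b, c) is read here as "a and b are more closely related to each other than either is to c", and the function name and the nested-tuple tree representation are choices made for this illustration.

```python
def build(leaves, triples):
    """Return a rooted tree (nested tuples) displaying every triple whose
    three leaves all lie in `leaves`, or None if the triples are inconsistent."""
    leaves = set(leaves)
    if len(leaves) == 1:
        return next(iter(leaves))

    # Union-find over the leaves: a and b are joined for every triple ab|c
    # whose three leaves are all present.
    parent = {x: x for x in leaves}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    relevant = [t for t in triples if set(t) <= leaves]
    for a, b, _c in relevant:
        parent[find(a)] = find(b)

    components = {}
    for x in leaves:
        components.setdefault(find(x), set()).add(x)

    # A single component means no valid split at the root exists.
    if len(components) == 1:
        return None

    children = []
    for comp in sorted(components.values(), key=min):  # deterministic order
        subtree = build(comp, relevant)
        if subtree is None:
            return None
        children.append(subtree)
    return tuple(children)

if __name__ == "__main__":
    print(build("abcd", [("a", "b", "c"), ("c", "d", "a")]))
    print(build("abc", [("a", "b", "c"), ("a", "c", "b")]))
```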
On Learning a Hidden Directed Graph with Path Queries
In this paper, we consider the problem of reconstructing a directed graph
using path queries. In this query model of learning, a graph is hidden from the
learner, and the learner can access information about it with path queries. For
a source and destination node, a path query returns whether there is a directed
path from the source to the destination node in the hidden graph. In this paper
we first give bounds for learning graphs on n vertices and k strongly
connected components. We then study the case of bounded degree directed trees
and give new algorithms for learning "almost-trees" -- directed trees to which
extra edges have been added. We also give some lower bound constructions
justifying our approach.
Comment: 11 pages
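The query model described above can be made concrete with a small sketch: an oracle hides a directed graph and answers only path queries, and a naive learner asks about every ordered pair of nodes to recover the reachability relation. This baseline uses n(n-1) queries and is not one of the algorithms from the paper. The class and function names are hypothetical.

```python
from collections import deque
from itertools import permutations

class PathOracle:
    """Hides a directed graph (adjacency dict) and answers path queries."""

    def __init__(self, adjacency):
        self._adj = adjacency
        self.queries = 0  # number of path queries asked so far

    def path_query(self, source, target):
        """Return True iff a directed path from source to target exists."""
        self.queries += 1
        seen = {source}
        frontier = deque([source])
        while frontier:
            u = frontier.popleft()
            if u == target:
                return True
            for v in self._adj.get(u, ()):
                if v not in seen:
                    seen.add(v)
                    frontier.append(v)
        return False

def learn_reachability(nodes, oracle):
    """Naive baseline: query every ordered pair of distinct nodes."""
    return {(s, t) for s, t in permutations(nodes, 2) if oracle.path_query(s, t)}

if __name__ == "__main__":
    oracle = PathOracle({1: [2], 2: [3], 3: []})
    print(learn_reachability([1, 2, 3], oracle), oracle.queries)
```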
ITCM: A Real Time Internet Traffic Classifier Monitor
The continual growth of high speed networks is a challenge for real-time
network analysis systems. Real-time traffic classification is an issue for
corporations and ISPs (Internet Service Providers). This work presents the
design and implementation of a real time flow-based network traffic
classification system. The classifier monitor acts as a pipeline consisting of
three modules: packet capture and pre-processing, flow reassembly, and
classification with Machine Learning (ML). The modules are built as concurrent
processes with well defined data interfaces between them so that any module can
be improved and updated independently. In this pipeline, the flow reassembly
function becomes the performance bottleneck. This implementation uses an
efficient reassembly method that results in an average delivery delay of
approximately 0.49 seconds. For the classification module, the performances
of the K-Nearest Neighbor (KNN), C4.5 Decision Tree, Naive Bayes (NB), Flexible
Naive Bayes (FNB) and AdaBoost Ensemble Learning Algorithm are compared in
order to validate our approach.
Comment: 16 pages, 3 figures, 7 tables, International Journal of Computer
Science & Information Technology (IJCSIT) Vol 6, No 6, December 201
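A minimal sketch of the flow reassembly stage in such a pipeline, assuming that a flow is identified by the usual unidirectional 5-tuple (source address, destination address, source port, destination port, protocol). The `Packet` record and the function names are hypothetical and do not reflect the actual ITCM implementation.

```python
from collections import defaultdict, namedtuple

# Hypothetical pre-processed packet record handed over by the capture module.
Packet = namedtuple("Packet", "src_ip dst_ip src_port dst_port proto size")

def flow_key(packet):
    """Unidirectional 5-tuple that identifies the flow a packet belongs to."""
    return (packet.src_ip, packet.dst_ip,
            packet.src_port, packet.dst_port, packet.proto)

def reassemble(packets):
    """Group packets into flows, preserving arrival order within each flow."""
    flows = defaultdict(list)
    for packet in packets:
        flows[flow_key(packet)].append(packet)
    return dict(flows)

if __name__ == "__main__":
    packets = [
        Packet("10.0.0.1", "10.0.0.2", 1234, 80, "tcp", 60),
        Packet("10.0.0.3", "10.0.0.2", 5555, 53, "udp", 80),
        Packet("10.0.0.1", "10.0.0.2", 1234, 80, "tcp", 1500),
    ]
    for key, flow in reassemble(packets).items():
        print(key, len(flow), sum(p.size for p in flow))
```

Per-flow statistics such as packet count and total bytes, as printed here, are the kind of features a downstream classifier could consume.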
Particle Control in Phase Space by Global K-Means Clustering
We devise and explore an iterative optimization procedure for controlling
particle populations in particle-in-cell (PIC) codes via merging and splitting
of computational macro-particles. Our approach is to compute an optimal
representation of the global particle phase space structure while decreasing or
increasing the entire particle population, based on k-means clustering of the
data. In essence the procedure amounts to merging or splitting particles by
statistical means, throughout the entire simulation volume in question, while
minimizing a 6-dimensional total distance measure to preserve the physics.
Particle merging is by far the most demanding procedure when considering
conservation laws of physics; it amounts to lossy compression of particle phase
space data. We demonstrate that our k-means approach conserves energy and
momentum to high accuracy, even for high compression ratios. Interestingly, we find
that an accurate particle splitting step can be performed using k-means as
well; this from an argument of symmetry. The split solution, using k-means,
places split particles optimally, to obtain maximal spanning on the phase
space manifold. Implementation and testing are done using an electromagnetic PIC
code, the \ppcode. Nonetheless, the k-means framework is general; it is not
limited to Vlasov-Maxwell type PIC codes. We discuss advantages and drawbacks
of this optimal phase space reconstruction.
Comment: Revision 1. Major revisions. Added discussion. 18 pages, 22 figures,
submitted to Journal of Computational Physics