A Fast Quartet Tree Heuristic for Hierarchical Clustering
The Minimum Quartet Tree Cost problem is to construct an optimal weight tree
from the $3\binom{n}{4}$ weighted quartet topologies on $n$ objects, where
optimality means that the summed weight of the embedded quartet topologies is
optimal (so it can be the case that the optimal tree embeds all quartets as
nonoptimal topologies). We present a Monte Carlo heuristic, based on randomized
hill climbing, for approximating the optimal weight tree, given the quartet
topology weights. The method repeatedly transforms a dendrogram, with all
objects involved as leaves, achieving a monotonic approximation to the exact
single globally optimal tree. The problem and the solution heuristic have been
extensively used for general hierarchical clustering of nontree-like
(non-phylogeny) data in various domains and across domains with heterogeneous
data. We also present a greatly improved heuristic, reducing the running time
by a factor of order a thousand to ten thousand. All this is implemented and
available, as part of the CompLearn package. We compare performance and running
time of the original and improved versions with those of UPGMA, BioNJ, and NJ,
as implemented in the SplitsTree package on genomic data for which the latter
are optimized.
Keywords: Data and knowledge visualization, Pattern
matching--Clustering--Algorithms/Similarity measures, Hierarchical clustering,
Global optimization, Quartet tree, Randomized hill-climbing
Comment: LaTeX, 40 pages, 11 figures; this paper has substantial overlap with
arXiv:cs/0606048 in cs.D
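The objective in the quartet tree abstract above can be illustrated with a small sketch. It is not the CompLearn implementation. It assumes an unrooted tree given as an adjacency dict in which every quartet of leaves is resolved (as in a ternary tree), and a hypothetical `weight` dict keyed by pairings such as `(('a', 'b'), ('c', 'd'))` with leaves in sorted order. The function names are illustrative only. The embedded topology of each quartet is found with the four-point condition on hop distances, and the tree cost is the sum of the weights of the embedded topologies. A randomized hill-climbing heuristic would repeatedly mutate the tree and keep the mutation whenever this cost improves.

```python
from collections import deque
from itertools import combinations

def leaf_distances(adj, leaves):
    """Hop distances from every leaf to every node of an unrooted tree."""
    dist = {}
    for src in leaves:
        seen = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        dist[src] = seen
    return dist

def embedded_topology(dist, a, b, c, d):
    """Return the pairing ab|cd, ac|bd or ad|bc embedded by the tree.

    By the four-point condition, the embedded pairing has the smallest
    sum of within-pair distances (assumes the quartet is resolved).
    """
    options = {
        ((a, b), (c, d)): dist[a][b] + dist[c][d],
        ((a, c), (b, d)): dist[a][c] + dist[b][d],
        ((a, d), (b, c)): dist[a][d] + dist[b][c],
    }
    return min(options, key=options.get)

def tree_cost(adj, leaves, weight):
    """Summed weight of the quartet topologies embedded in the tree."""
    dist = leaf_distances(adj, leaves)
    return sum(
        weight[embedded_topology(dist, *quartet)]
        for quartet in combinations(sorted(leaves), 4)
    )
```

For four leaves there is a single quartet, so the cost of a tree is just the weight of the one pairing it embeds. With more leaves, the sum runs over all $\binom{n}{4}$ quartets, which is why evaluating candidate trees dominates the running time of a naive search.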
Symbolic Partial-Order Execution for Testing Multi-Threaded Programs
We describe a technique for systematic testing of multi-threaded programs. We
combine Quasi-Optimal Partial-Order Reduction, a state-of-the-art technique
that tackles path explosion due to interleaving non-determinism, with symbolic
execution to handle data non-determinism. Our technique iteratively and
exhaustively finds all executions of the program. It represents program
executions using partial orders and finds the next execution using an
underlying unfolding semantics. We avoid the exploration of redundant program
traces using cutoff events. We implemented our technique as an extension of
KLEE and evaluated it on a set of large multi-threaded C programs. Our
experiments found several previously undiscovered bugs and undefined behaviors
in memcached and GNU sort, showing that the new method is capable of finding
bugs in industrial-size benchmarks.
Comment: Extended version of a paper presented at CAV'2
Unfolding-based Partial Order Reduction
Partial order reduction (POR) and net unfoldings are two alternative methods
to tackle state-space explosion caused by concurrency. In this paper, we
propose the combination of both approaches in an effort to combine their
strengths. We first define, for an abstract execution model, unfolding
semantics parameterized over an arbitrary independence relation. Based on it,
our main contribution is a novel stateless POR algorithm that explores at most
one execution per Mazurkiewicz trace, and in general, can explore exponentially
fewer, thus achieving a form of super-optimality. Furthermore, our
unfolding-based POR copes with non-terminating executions and incorporates
state-caching. Over benchmarks with busy-waits, among others, our experiments
show a dramatic reduction in the number of executions when compared to a
state-of-the-art DPOR.
Comment: Long version of a paper with the same title that appeared in the
proceedings of CONCUR 201
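The notion of a Mazurkiewicz trace used in the abstract above can be made concrete with a short sketch. It does not implement the unfolding-based algorithm. It only shows when two executions count as "the same" for partial order reduction: one can be turned into the other by swapping adjacent independent events. The sketch assumes events are plain labels and that independence is given as a set of unordered label pairs; the function name `foata_normal_form` and the example labels are illustrative. Two executions belong to the same trace exactly when their Foata normal forms are equal.

```python
def foata_normal_form(execution, independent):
    """Canonical representative of the Mazurkiewicz trace of `execution`.

    `independent` is a set of frozensets {a, b} of labels that commute.
    Each event goes into the level right after the last level that holds
    an event it depends on; an event always depends on its own label.
    """
    levels = []
    for event in execution:
        target = 0
        for index, level in enumerate(levels):
            if any(frozenset((event, other)) not in independent for other in level):
                target = index + 1
        if target == len(levels):
            levels.append([])
        levels[target].append(event)
    return tuple(tuple(sorted(level)) for level in levels)
```

For example, if `a` (a write to `x`) and `b` (a write to `y`) are independent while `c` (a read of `x` and `y`) depends on both, then the executions `abc` and `bac` have the same normal form and need to be explored only once, whereas `bca` belongs to a different trace.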
Dynamic load balancing for the distributed mining of molecular structures
In molecular biology, it is often desirable to find common properties in large numbers of drug candidates. One family of
methods stems from the data mining community, where algorithms to find frequent graphs have received increasing attention over the
past years. However, the computational complexity of the underlying problem and the large amount of data to be explored essentially
render sequential algorithms useless. In this paper, we present a distributed approach to the frequent subgraph mining problem to
discover interesting patterns in molecular compounds. This problem is characterized by a highly irregular search tree, whereby no
reliable workload prediction is available. We describe the three main aspects of the proposed distributed algorithm, namely, a dynamic
partitioning of the search space, a distribution process based on a peer-to-peer communication framework, and a novel receiver-initiated
load balancing algorithm. The effectiveness of the distributed method has been evaluated on the well-known National Cancer
Institute’s HIV-screening data set, where we were able to show close-to-linear speedup in a network of workstations. The proposed
approach also allows for dynamic resource aggregation in a non-dedicated computational environment. These features make it suitable
for large-scale, multi-domain, heterogeneous environments, such as computational grids.
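The idea of receiver-initiated load balancing mentioned in the abstract above can be sketched as a single-process simulation. This is not the algorithm from the paper. It assumes a hypothetical irregular search tree (`children`), round-robin scheduling of workers, and a donor policy in which an idle worker asks the currently most loaded peer for half of its pending nodes. A real peer-to-peer system would not have this global view of queue lengths.

```python
from collections import deque

def children(node):
    """Hypothetical irregular search tree; a node is (depth, label)."""
    depth, label = node
    if depth >= 6:
        return []
    # The branching factor depends on the label, so subtree sizes vary.
    return [(depth + 1, label * 3 + i) for i in range(label % 3 + 1)]

def simulate(num_workers, root=(0, 1)):
    """Return how many nodes each worker processed."""
    queues = [deque() for _ in range(num_workers)]
    queues[0].append(root)
    processed = [0] * num_workers
    while any(queues):
        for worker, queue in enumerate(queues):
            if not queue:
                # Receiver-initiated: the idle worker requests work itself.
                donor = max(range(num_workers), key=lambda i: len(queues[i]))
                for _ in range(len(queues[donor]) // 2):
                    # The donor gives away its oldest pending nodes.
                    queue.append(queues[donor].popleft())
                if not queue:
                    continue  # nothing to share yet; stay idle this round
            node = queue.pop()
            processed[worker] += 1
            queue.extend(children(node))
    return processed
```

Every node is processed exactly once regardless of the number of workers, so `sum(simulate(1))` and `sum(simulate(4))` are equal; only the split across workers changes. Donating the oldest nodes is a common choice because nodes near the root tend to carry larger subtrees, which reduces how often work has to be requested.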