NP-hardness of hypercube 2-segmentation
The hypercube 2-segmentation problem is a certain biclustering problem that
was previously claimed to be NP-hard, but for which there does not appear to be
a publicly available proof of NP-hardness. This manuscript provides such a
proof.
Clustering Boolean Tensors
Tensor factorizations are computationally hard problems, and in particular,
are often significantly harder than their matrix counterparts. In case of
Boolean tensor factorizations -- where the input tensor and all the factors are
required to be binary and we use Boolean algebra -- much of that hardness comes
from the possibility of overlapping components. Yet, in many applications we
are perfectly happy to partition at least one of the modes. In this paper we
investigate what consequences this partitioning has on the computational
complexity of the Boolean tensor factorizations and present a new algorithm for
the resulting clustering problem. This algorithm can alternatively be seen as a
particularly regularized clustering algorithm that can handle extremely
high-dimensional observations. We analyse our algorithms with the goal of
maximizing the similarity and argue that this is more meaningful than
minimizing the dissimilarity. As a by-product we obtain a PTAS and an efficient
0.828-approximation algorithm for rank-1 binary factorizations. Our algorithm
for Boolean tensor clustering achieves high scalability, high similarity, and
good generalization to unseen data with both synthetic and real-world data
sets.
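To make the similarity objective concrete, here is a minimal sketch for the rank-1 matrix case. It is not the paper's algorithm: the function name `rank1_similarity` and the toy data are illustrative assumptions. It only evaluates a given pair of binary factors by counting the entries on which their Boolean outer product agrees with the input. Similarity and dissimilarity always sum to the number of entries, so they share optimal solutions, but an approximation ratio for one does not carry over to the other.

```python
def rank1_similarity(matrix, a, b):
    """Count entries where the Boolean outer product of a and b agrees with matrix.

    matrix is a list of 0/1 rows; a has one bit per row, b one bit per column.
    (Hypothetical helper, for illustration only.)
    """
    return sum(
        1
        for i, row in enumerate(matrix)
        for j, x in enumerate(row)
        if x == (a[i] & b[j])
    )


# Toy 3x3 binary matrix and one candidate rank-1 factorization.
matrix = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
]
a = [1, 1, 0]  # row factor
b = [1, 1, 0]  # column factor

similarity = rank1_similarity(matrix, a, b)
dissimilarity = len(matrix) * len(matrix[0]) - similarity
print(similarity, dissimilarity)
```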
Correlation Clustering with Low-Rank Matrices
Correlation clustering is a technique for aggregating data based on
qualitative information about which pairs of objects are labeled 'similar' or
'dissimilar.' Because the optimization problem is NP-hard, much of the previous
literature focuses on finding approximation algorithms. In this paper we
explore how to solve the correlation clustering objective exactly when the data
to be clustered can be represented by a low-rank matrix. We prove in particular
that correlation clustering can be solved in polynomial time when the
underlying matrix is positive semidefinite with small constant rank, but that
the task remains NP-hard in the presence of even one negative eigenvalue. Based
on our theoretical results, we develop an algorithm for efficiently "solving"
low-rank positive semidefinite correlation clustering by employing a procedure
for zonotope vertex enumeration. We demonstrate the effectiveness and speed of
our algorithm by using it to solve several clustering problems on both
synthetic and real-world data.
Weakly Submodular Functions
Submodular functions are well-studied in combinatorial optimization, game
theory and economics. The natural diminishing returns property makes them
suitable for many applications. We study an extension of monotone submodular
functions, which we call weakly submodular functions. Our extension
includes some (mildly) supermodular functions. We show that several natural
functions belong to this class and relate our class to some other recent
submodular function extensions.
We consider the optimization problem of maximizing a weakly submodular
function subject to uniform and general matroid constraints. For a uniform
matroid constraint, the "standard greedy algorithm" achieves a constant
approximation ratio where the constant (experimentally) converges to 5.95 as
the cardinality constraint increases. For a general matroid constraint, a
simple local search algorithm achieves a constant approximation ratio where the
constant (analytically) converges to 10.22 as the rank of the matroid
increases.
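The "standard greedy algorithm" for a uniform matroid (cardinality) constraint can be sketched as follows. This is a generic illustration, not code from the paper: the name `greedy_max` is an assumption, and the objective used here is a plain coverage function (monotone submodular) standing in for a weakly submodular one.

```python
def greedy_max(f, ground_set, k):
    """Pick up to k elements, each time adding the one with the largest marginal gain.

    f maps a list of elements to a number. (Hypothetical helper name.)
    """
    chosen = []
    remaining = list(ground_set)
    for _ in range(min(k, len(remaining))):
        base = f(chosen)
        best = max(remaining, key=lambda e: f(chosen + [e]) - base)
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Stand-in objective: number of items covered by the chosen sets.
sets = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}


def coverage(selection):
    covered = set()
    for name in selection:
        covered |= sets[name]
    return len(covered)


chosen = greedy_max(coverage, list(sets), k=2)
print(chosen, coverage(chosen))
```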
A QPTAS for Gapless MEC
We consider the problem Minimum Error Correction (MEC). An MEC instance is an n x m matrix M with entries from {0,1,-}. A feasible solution consists of two binary m-bit strings, together with an assignment of each row of M to one of the two strings. The objective is to minimize the number of mismatches (errors), that is, positions where a row has a value that differs from its assigned solution string. The symbol "-" is a wildcard that matches both 0 and 1. An MEC instance is gapless if in each row of M all binary entries are consecutive.
Gapless-MEC is a relevant problem in computational biology, and it is closely related to segmentation problems that were introduced by [Kleinberg-Papadimitriou-Raghavan STOC'98] in the context of data mining.
Without restrictions, it is known to be UG-hard to compute an O(1)-approximate solution to MEC. For both MEC and Gapless-MEC, the best polynomial time approximation algorithm has a logarithmic performance guarantee. We partially settle the approximation status of Gapless-MEC by providing a quasi-polynomial time approximation scheme (QPTAS). Additionally, for the relevant case where the binary part of a row is not contained in the binary part of another row, we provide a polynomial time approximation scheme (PTAS).
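The MEC objective and the gapless condition defined above can be checked with a short sketch. It only evaluates a given solution and does not implement the QPTAS or the PTAS. The function names `mec_cost` and `is_gapless`, the string encoding of rows, and the toy instance are assumptions made for illustration.

```python
def mec_cost(rows, s0, s1, assignment):
    """Number of mismatches when row i is assigned to s0 (0) or s1 (1).

    Rows are strings over '0', '1', '-'; '-' matches anything.
    """
    cost = 0
    for row, which in zip(rows, assignment):
        target = s1 if which else s0
        cost += sum(1 for r, t in zip(row, target) if r != "-" and r != t)
    return cost


def is_gapless(row):
    """True if all binary entries of the row are consecutive (no '-' between them)."""
    return "-" not in row.strip("-")


# Toy instance with m = 4 columns.
rows = ["0011", "-11-", "11--", "1-0-"]
s0, s1 = "0011", "1100"
assignment = [0, 0, 1, 1]

cost = mec_cost(rows, s0, s1, assignment)
gapless = [is_gapless(r) for r in rows]
print(cost, gapless)
```

In this toy instance only the last row has a wildcard between two binary entries, so the instance as a whole is not gapless.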
Recent Advances in Graph Partitioning
We survey recent trends in practical algorithms for balanced graph
partitioning together with applications and future research directions.