5,575 research outputs found
On Computing Centroids According to the p-Norms of Hamming Distance Vectors
In this paper we consider the p-Norm Hamming Centroid problem which asks to determine whether some given strings have a centroid with a bound on the p-norm of its Hamming distances to the strings. Specifically, given a set S of strings and a real k, we consider the problem of determining whether there exists a string s^* with (sum_{s in S} d^{p}(s^*,s))^(1/p) <=k, where d(,) denotes the Hamming distance metric. This problem has important applications in data clustering and multi-winner committee elections, and is a generalization of the well-known polynomial-time solvable Consensus String (p=1) problem, as well as the NP-hard Closest String (p=infty) problem.
Our main result shows that the problem is NP-hard for all fixed rational p > 1, closing the gap for all rational values of p between 1 and infty. Under standard complexity assumptions the reduction also implies that the problem has no 2^o(n+m)-time or 2^o(k^(p/(p+1)))-time algorithm, where m denotes the number of input strings and n denotes the length of each string, for any fixed p > 1. The first bound matches a straightforward brute-force algorithm. The second bound is tight in the sense that for each fixed epsilon > 0, we provide a 2^(k^(p/((p+1))+epsilon))-time algorithm. In the last part of the paper, we complement our hardness result by presenting a fixed-parameter algorithm and a factor-2 approximation algorithm for the problem
A directed isoperimetric inequality with application to Bregman near neighbor lower bounds
Bregman divergences are a class of divergences parametrized by a
convex function and include well known distance functions like
and the Kullback-Leibler divergence. There has been extensive
research on algorithms for problems like clustering and near neighbor search
with respect to Bregman divergences, in all cases, the algorithms depend not
just on the data size and dimensionality , but also on a structure
constant that depends solely on and can grow without bound
independently.
In this paper, we provide the first evidence that this dependence on
might be intrinsic. We focus on the problem of approximate near neighbor search
for Bregman divergences. We show that under the cell probe model, any
non-adaptive data structure (like locality-sensitive hashing) for
-approximate near-neighbor search that admits probes must use space
. In contrast, for LSH under the best
bound is .
Our new tool is a directed variant of the standard boolean noise operator. We
show that a generalization of the Bonami-Beckner hypercontractivity inequality
exists "in expectation" or upon restriction to certain subsets of the Hamming
cube, and that this is sufficient to prove the desired isoperimetric inequality
that we use in our data structure lower bound.
We also present a structural result reducing the Hamming cube to a Bregman
cube. This structure allows us to obtain lower bounds for problems under
Bregman divergences from their analog. In particular, we get a
(weaker) lower bound for approximate near neighbor search of the form
for an -query non-adaptive data structure,
and new cell probe lower bounds for a number of other near neighbor questions
in Bregman space.Comment: 27 page
Clustering, Hamming Embedding, Generalized LSH and the Max Norm
We study the convex relaxation of clustering and hamming embedding, focusing
on the asymmetric case (co-clustering and asymmetric hamming embedding),
understanding their relationship to LSH as studied by (Charikar 2002) and to
the max-norm ball, and the differences between their symmetric and asymmetric
versions.Comment: 17 page
A sticky HDP-HMM with application to speaker diarization
We consider the problem of speaker diarization, the problem of segmenting an
audio recording of a meeting into temporal segments corresponding to individual
speakers. The problem is rendered particularly difficult by the fact that we
are not allowed to assume knowledge of the number of people participating in
the meeting. To address this problem, we take a Bayesian nonparametric approach
to speaker diarization that builds on the hierarchical Dirichlet process hidden
Markov model (HDP-HMM) of Teh et al. [J. Amer. Statist. Assoc. 101 (2006)
1566--1581]. Although the basic HDP-HMM tends to over-segment the audio
data---creating redundant states and rapidly switching among them---we describe
an augmented HDP-HMM that provides effective control over the switching rate.
We also show that this augmentation makes it possible to treat emission
distributions nonparametrically. To scale the resulting architecture to
realistic diarization problems, we develop a sampling algorithm that employs a
truncated approximation of the Dirichlet process to jointly resample the full
state sequence, greatly improving mixing rates. Working with a benchmark NIST
data set, we show that our Bayesian nonparametric architecture yields
state-of-the-art speaker diarization results.Comment: Published in at http://dx.doi.org/10.1214/10-AOAS395 the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
- …