Min-Sum Clustering (With Outliers)
We give a constant-factor polynomial-time pseudo-approximation algorithm for min-sum clustering with or without outliers. The algorithm is allowed to exclude an arbitrarily small constant fraction of the points. For instance, we show how to compute a solution that clusters 98% of the input data points and pays no more than a constant factor times the optimal solution that clusters 99% of the input data points. More generally, we give the following bicriteria approximation: for any ε > 0, for any instance with n input points and for any positive integer n' ≤ n, we compute in polynomial time a clustering of at least (1-ε)n' points of cost at most a constant factor greater than the optimal cost of clustering n' points. The approximation guarantee grows with 1/ε. Our results apply to instances of points in real space endowed with squared Euclidean distance, as well as to points in a metric space, where the number of clusters, and also the dimension if relevant, is arbitrary (part of the input, not an absolute constant).
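As an illustration of the objective being approximated, here is a minimal NumPy sketch of the min-sum cost under squared Euclidean distance (the function name and layout are ours, not from the paper):

```python
import numpy as np

def min_sum_cost(points, labels):
    """Min-sum clustering cost: for each cluster, the sum of all
    pairwise squared Euclidean distances among its points."""
    cost = 0.0
    for c in np.unique(labels):
        P = points[labels == c]
        # pairwise squared distances within the cluster
        d2 = ((P[:, None, :] - P[None, :, :]) ** 2).sum(axis=-1)
        cost += d2.sum() / 2.0  # each unordered pair is counted twice
    return cost
```

Outliers are handled by simply omitting the excluded points from `labels` before computing the cost.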
Inertial particles distribute in turbulence as Poissonian points with random intensity inducing clustering and supervoiding
This work considers the distribution of inertial particles in turbulence
using the point-particle approximation. We demonstrate that the random point
process formed by the positions of particles in space is a Poisson point
process with log-normal random intensity ("log Gaussian Cox process" or LGCP).
The probability of having a finite number of particles in a small volume is
given in terms of the characteristic function of a log-normal distribution.
We provide corrections, due to the discreteness of the number of particles, to
the previously derived continuum-limit statistics of particle concentration.
These are relevant when dealing with experimental or numerical data.
The probability of having regions without particles, i.e. voids, is larger for
inertial particles than for tracer particles where voids are distributed
according to Poisson processes. Further, the probability of having large voids
decays only log-normally with size. This shows that particles cluster, leaving
voids behind. At scales where there is no clustering there can still be an
increase of the void probability so that turbulent voiding is stronger than
clustering. The demonstrated double stochasticity of the distribution
originates in the two-step formation of fluctuations. First, turbulence brings
the particles randomly close together which happens with Poisson-type
probability. Then, turbulence compresses the particles' volume in the
observation volume. We confirm the theory of the statistics of the number of
particles in small volumes by numerical observations of inertial particle
motion in a chaotic ABC flow. The improved understanding of clustering
processes can be applied to predict the long-time survival probability of
reacting particles. Our work implies that the particle distribution in weakly
compressible flow with finite time correlations is an LGCP, independently of the
details of the flow statistics.
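A toy Monte-Carlo sketch of this double stochasticity (the log-normal parameters `mu` and `sigma` are ours, purely illustrative): sample a log-normal intensity first, then a Poisson count given that intensity, and compare the resulting void probability with a pure Poisson process of the same mean.

```python
import numpy as np

rng = np.random.default_rng(0)

def lgcp_counts(mu, sigma, n_samples=200_000):
    """Counts in a small volume under a log-Gaussian Cox process:
    the intensity Lambda is log-normal, and given Lambda the count
    is Poisson(Lambda) -- the double stochasticity described above."""
    lam = rng.lognormal(mean=mu, sigma=sigma, size=n_samples)
    return rng.poisson(lam)

counts = lgcp_counts(mu=0.0, sigma=1.0)
void_prob = (counts == 0).mean()
# A Poisson process with the same mean intensity has a strictly smaller
# void probability exp(-E[Lambda]) (Jensen's inequality, e^{-x} convex),
# consistent with the enhanced voiding of inertial particles above.
poisson_void = np.exp(-np.exp(0.5))  # E[Lambda] = exp(sigma^2 / 2)
```

The same two-step sampling mirrors the mechanism in the abstract: a Poisson-type arrival of particles followed by turbulent compression of their volume.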
Deep clustering: Discriminative embeddings for segmentation and separation
We address the problem of acoustic source separation in a deep learning
framework we call "deep clustering." Rather than directly estimating signals or
masking functions, we train a deep network to produce spectrogram embeddings
that are discriminative for partition labels given in training data. Previous
deep network approaches provide great advantages in terms of learning power and
speed, but previously it has been unclear how to use them to separate signals
in a class-independent way. In contrast, spectral clustering approaches are
flexible with respect to the classes and number of items to be segmented, but
it has been unclear how to leverage the learning power and speed of deep
networks. To obtain the best of both worlds, we use an objective function
to train embeddings that yield a low-rank approximation to an ideal pairwise
affinity matrix, in a class-independent way. This avoids the high cost of
spectral factorization and instead produces compact clusters that are amenable
to simple clustering methods. The segmentations are therefore implicitly
encoded in the embeddings, and can be "decoded" by clustering. Preliminary
experiments show that the proposed method can separate speech: when trained on
spectrogram features containing mixtures of two speakers, and tested on
mixtures of a held-out set of speakers, it can infer masking functions that
improve signal quality by around 6dB. We show that the model can generalize to
three-speaker mixtures despite training only on two-speaker mixtures. The
framework can be used without class labels, and therefore has the potential to
be trained on a diverse set of sound types, and to generalize to novel sources.
We hope that future work will lead to segmentation of arbitrary sounds, with
extensions to microphone array methods as well as image segmentation and other
domains.

Comment: Originally submitted on June 5, 201
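The objective sketched above, training embeddings whose Gram matrix approximates an ideal pairwise affinity matrix, can be written compactly in NumPy. This assumes the formulation ||VV^T - YY^T||_F^2 with embeddings V (frames x embedding dim) and one-hot partition labels Y (frames x sources); the expanded form below avoids ever materialising the N x N affinity matrices.

```python
import numpy as np

def deep_clustering_loss(V, Y):
    """Low-rank affinity objective ||V V^T - Y Y^T||_F^2, expanded as
    ||V^T V||_F^2 - 2 ||V^T Y||_F^2 + ||Y^T Y||_F^2 so that only small
    (dim x dim)-sized products are formed, never the N x N matrices."""
    return (np.sum((V.T @ V) ** 2)
            - 2 * np.sum((V.T @ Y) ** 2)
            + np.sum((Y.T @ Y) ** 2))
```

In training, V would be the network output and the loss backpropagated through it; clustering the rows of V at test time "decodes" the segmentation.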
Deterministic Clustering in High Dimensional Spaces: Sketches and Approximation
In all state-of-the-art sketching and coreset techniques for clustering, as
well as in the best known fixed-parameter tractable approximation algorithms,
randomness plays a key role. For the classic k-median and k-means problems,
there are no known deterministic dimensionality reduction procedures or coreset
constructions that avoid an exponential dependency on the input dimension d,
the precision parameter ε, or the number of clusters k. Furthermore, there is
no coreset construction that succeeds with probability 1 and whose size does
not depend on the number of input points, n. This has led researchers in the
area to ask what is the power of randomness for clustering sketches [Feldman,
WIREs Data Mining Knowl. Discov'20]. Similarly, the best approximation ratios
achievable deterministically without a complexity exponential in the dimension
are much larger for both k-median and k-means, even when allowing a complexity
FPT in the number of clusters k. This stands in sharp contrast with the
(1+ε)-approximation achievable in that case when allowing randomization.
In this paper, we provide deterministic sketch constructions for clustering
whose size bounds are close to the best-known randomized ones. We also give a
deterministic algorithm for computing a (1+ε)-approximation to k-median and
k-means in high-dimensional Euclidean spaces, with running time close to the
best randomized complexity.
Furthermore, our new insights on sketches also yield a randomized coreset
construction that uses uniform sampling and immediately improves over the
recent results of [Braverman et al., FOCS '22].

Comment: FOCS 2023. Abstract reduced for arXiv requirements.
Resolving Conflicts for Lower-Bounded Clustering
This paper considers the effect of non-metric distances on lower-bounded clustering, i.e., the problem of computing a partition of a given set of objects with pairwise distances such that each cluster has a certain minimum cardinality (as required for anonymisation or balanced facility location problems). We discuss lower-bounded clustering with the objective of minimising the maximum radius or diameter of the clusters. For these problems there exists a 2-approximation, but only if the pairwise distances on the objects satisfy the triangle inequality; without this property, no polynomial-time constant-factor approximation is possible unless P=NP. We try to resolve, or at least soften, this effect of non-metric distances by devising particular strategies to deal with violations of the triangle inequality (conflicts). Using parameterised algorithmics, we find that if the number of such conflicts is not too large, constant-factor approximations can still be computed efficiently.
In particular, we introduce parameterised approximations with respect not just to the number of conflicts but also to the vertex cover number of the conflict graph (the graph induced by the conflicts). Interestingly, we salvage the approximation ratio of 2 for diameter, while for radius it is only possible to show a ratio of 3. For the parameter vertex cover number of the conflict graph, this worsening in ratio is shown to be unavoidable unless FPT=W[2]. We further discuss improvements for diameter by choosing the (induced) P_3-cover number of the conflict graph as parameter, and complement these by showing that, unless FPT=W[1], there exists no constant-factor parameterised approximation with respect to the parameter split vertex deletion set.
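The conflicts the paper parameterises over are triples of objects whose distances violate the triangle inequality. A minimal sketch of detecting them from a distance matrix (how the conflicts then induce the conflict graph is defined in the paper; this only finds the violations themselves):

```python
from itertools import combinations

def triangle_violations(D):
    """Return all triples (i, j, k) whose pairwise distances violate
    the triangle inequality, i.e. one side strictly exceeds the sum
    of the other two. D is a symmetric distance matrix."""
    n = len(D)
    bad = []
    for i, j, k in combinations(range(n), 3):
        a, b, c = D[i][j], D[i][k], D[j][k]
        if a > b + c or b > a + c or c > a + b:
            bad.append((i, j, k))
    return bad
```

If this list is empty the instance is metric and the known 2-approximation applies directly; otherwise its size (or the vertex cover number of the induced conflict graph) is the parameter.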
Light-Based Sample Reduction Methods for Interactive Relighting of Scenes with Minute Geometric Scale
Rendering production-quality cinematic scenes requires high computational and temporal costs. From an artist's perspective, one must wait several hours for feedback on even minute changes to light positions and parameters. Previous work approximates scenes so that adjustments to lights may be carried out with interactive feedback, so long as geometry and materials remain constant. We build on these methods by proposing means by which objects with high geometric complexity at the subpixel level, such as hair and foliage, can be approximated for real-time cinematic relighting. Our methods make no assumptions about the geometry or shaders in a scene, and as such are fully generalized. We show that clustering techniques can greatly reduce multisampling while still maintaining image fidelity, at an error significantly lower than sparse sampling without clustering, provided that no shadows are computed. Scenes that produce noise-like shadow patterns when sparse shadow samples are taken suffer from additional error introduced by those shadows. We present a viable solution to scalable scene approximation for lower sampling resolutions, provided a robust solution to shadow approximation for sub-pixel geometry can be provided in the future.
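The sample-reduction idea can be sketched generically: cluster a pixel's many subpixel shading samples and keep one weighted representative per cluster, so only k shader evaluations are needed instead of one per sample. This is an illustrative Lloyd's k-means sketch under assumptions of ours (feature vectors per sample, function names), not the paper's specific method.

```python
import numpy as np

rng = np.random.default_rng(2)

def reduce_samples(samples, k, iters=20):
    """Toy light-sample reduction: cluster per-pixel shading samples
    (rows are feature vectors, e.g. position + normal) with Lloyd's
    k-means and return one weighted representative per cluster."""
    centers = samples[rng.choice(len(samples), k, replace=False)]
    for _ in range(iters):
        # assign each sample to its nearest center
        d2 = ((samples[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # move each center to the mean of its assigned samples
        for c in range(k):
            if (labels == c).any():
                centers[c] = samples[labels == c].mean(axis=0)
    # weight each representative by the fraction of samples it stands for
    weights = np.bincount(labels, minlength=k) / len(samples)
    return centers, weights
```

Shading only the k representatives and blending by weight is what trades a small, clustered error for a large reduction in multisampling cost.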