
    Min-Sum Clustering (With Outliers)

    We give a constant factor polynomial time pseudo-approximation algorithm for min-sum clustering with or without outliers. The algorithm is allowed to exclude an arbitrarily small constant fraction of the points. For instance, we show how to compute a solution that clusters 98% of the input data points and pays no more than a constant factor times the optimal solution that clusters 99% of the input data points. More generally, we give the following bicriteria approximation: For any ε > 0, for any instance with n input points and for any positive integer n' ≤ n, we compute in polynomial time a clustering of at least (1-ε)n' points of cost at most a constant factor greater than the optimal cost of clustering n' points. The approximation guarantee grows with 1/ε. Our results apply to instances of points in real space endowed with squared Euclidean distance, as well as to points in a metric space, where the number of clusters, and also the dimension if relevant, is arbitrary (part of the input, not an absolute constant).
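
    The bicriteria guarantee concerns the min-sum objective: the total of all intra-cluster pairwise distances. Below is a minimal Python sketch of that objective and of the outlier trade-off, using squared Euclidean distance as in the abstract; the helper names and the greedy trimming heuristic are illustrative assumptions, not the paper's algorithm.

    import numpy as np

    def min_sum_cost(points, labels):
        """Min-sum objective: sum over unordered intra-cluster pairs of squared distances."""
        cost = 0.0
        for c in np.unique(labels):
            P = points[labels == c]
            diffs = P[:, None, :] - P[None, :, :]
            cost += 0.5 * np.sum(diffs ** 2)  # 0.5 since each pair appears twice
        return cost

    def trim_outliers(points, labels, eps):
        """Greedily drop the eps-fraction of points contributing most to the cost."""
        contrib = np.zeros(len(points))
        for c in np.unique(labels):
            idx = np.where(labels == c)[0]
            P = points[idx]
            diffs = P[:, None, :] - P[None, :, :]
            # Each point's total squared distance to its cluster mates.
            contrib[idx] = np.sum(diffs ** 2, axis=(1, 2))
        keep = np.argsort(contrib)[: int(np.ceil((1 - eps) * len(points)))]
        return keep

    rng = np.random.default_rng(0)
    pts = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
    lab = np.array([0] * 50 + [1] * 50)
    keep = trim_outliers(pts, lab, eps=0.02)   # keep 98% of the points
    print(min_sum_cost(pts, lab), min_sum_cost(pts[keep], lab[keep]))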

    Inertial particles distribute in turbulence as Poissonian points with random intensity inducing clustering and supervoiding

    This work considers the distribution of inertial particles in turbulence using the point-particle approximation. We demonstrate that the random point process formed by the positions of particles in space is a Poisson point process with log-normal random intensity (a "log Gaussian Cox process", or LGCP). The probability of having a finite number of particles in a small volume is given in terms of the characteristic function of a log-normal distribution. We provide corrections, due to the discreteness of the number of particles, to the previously derived continuum-limit statistics of particle concentration; these are relevant when dealing with experimental or numerical data. The probability of having regions without particles, i.e. voids, is larger for inertial particles than for tracer particles, whose voids are distributed according to Poisson processes. Further, the probability of having large voids decays only log-normally with size. This shows that particles cluster, leaving voids behind. Even at scales where there is no clustering, the void probability can still increase, so that turbulent voiding is stronger than clustering. The demonstrated double stochasticity of the distribution originates in the two-step formation of fluctuations. First, turbulence brings the particles randomly close together, which happens with Poisson-type probability. Then, turbulence compresses the particles' volume inside the observation volume. We confirm the theory of the statistics of the number of particles in small volumes by numerical observations of inertial particle motion in a chaotic ABC flow. The improved understanding of clustering processes can be applied to predict the long-time survival probability of reacting particles. Our work implies that the particle distribution in weakly compressible flow with finite time correlations is an LGCP, independently of the details of the flow statistics.
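
    A log Gaussian Cox process is straightforward to simulate for a single small volume: draw a log-normal intensity, then draw a Poisson count with that intensity. The following sketch (the parameters mu and sigma are illustrative assumptions) reproduces the abstract's qualitative claim that voids are more likely under an LGCP than under a plain Poisson process of the same mean intensity.

    import numpy as np

    rng = np.random.default_rng(1)

    def lgcp_counts(mu, sigma, volume, n_samples):
        """Doubly stochastic counts: Poisson with a log-normal random intensity."""
        intensity = np.exp(rng.normal(mu, sigma, n_samples))  # log-normal intensity
        return rng.poisson(intensity * volume)

    mu, sigma, volume, n = 0.0, 1.0, 1.0, 200_000
    counts = lgcp_counts(mu, sigma, volume, n)
    mean_intensity = np.exp(mu + sigma**2 / 2)       # mean of the log-normal
    poisson_counts = rng.poisson(mean_intensity * volume, n)

    # Voids (empty volumes) are more likely under the LGCP than under a plain
    # Poisson process with the same mean intensity (Jensen's inequality on exp(-x)).
    print("LGCP void prob   :", np.mean(counts == 0))
    print("Poisson void prob:", np.mean(poisson_counts == 0))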

    Deep clustering: Discriminative embeddings for segmentation and separation

    We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but it has previously been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function that trains embeddings to yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6 dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.

    Comment: Originally submitted on June 5, 201
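
    The "low-rank approximation to an ideal pairwise affinity matrix" can be made concrete. In the commonly cited form of the deep clustering objective the loss is ||VV^T - YY^T||_F^2 for embeddings V and one-hot partition labels Y; assuming that form here, the sketch below shows how it expands so that the N x N affinity matrices are never materialized.

    import numpy as np

    def deep_clustering_loss(V, Y):
        """||V V^T - Y Y^T||_F^2 without forming the N x N affinity matrices.

        V : (N, D) unit-norm embeddings for N time-frequency bins.
        Y : (N, C) one-hot partition labels (which source dominates each bin).
        """
        # Uses tr(V V^T V V^T) = ||V^T V||_F^2, so all products stay in the
        # small D x D, D x C and C x C shapes.
        return (np.sum((V.T @ V) ** 2)
                - 2.0 * np.sum((V.T @ Y) ** 2)
                + np.sum((Y.T @ Y) ** 2))

    rng = np.random.default_rng(0)
    N, D, C = 1000, 20, 2
    Y = np.eye(C)[rng.integers(0, C, N)]          # random one-hot labels
    V = Y @ rng.normal(size=(C, D))               # embeddings correlated with labels
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    print(deep_clustering_loss(V, Y))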

    Deterministic Clustering in High Dimensional Spaces: Sketches and Approximation

    In all state-of-the-art sketching and coreset techniques for clustering, as well as in the best known fixed-parameter tractable approximation algorithms, randomness plays a key role. For the classic k-median and k-means problems, there are no known deterministic dimensionality reduction procedures or coreset constructions that avoid an exponential dependency on the input dimension d, the precision parameter ε^{-1}, or k. Furthermore, there is no coreset construction that succeeds with probability 1 - 1/n and whose size does not depend on the number of input points, n. This has led researchers in the area to ask what is the power of randomness for clustering sketches [Feldman, WIREs Data Mining Knowl. Discov'20]. Similarly, the best approximation ratio achievable deterministically without a complexity exponential in the dimension is Ω(1) for both k-median and k-means, even when allowing a complexity FPT in the number of clusters k. This stands in sharp contrast with the (1+ε)-approximation achievable in that case when allowing randomization. In this paper, we provide deterministic sketch constructions for clustering whose size bounds are close to the best-known randomized ones. We also construct a deterministic algorithm for computing a (1+ε)-approximation to k-median and k-means in high-dimensional Euclidean spaces in time 2^{k^2/ε^{O(1)}} poly(nd), close to the best randomized complexity. Furthermore, our new insights on sketches also yield a randomized coreset construction that uses uniform sampling, which immediately improves over the recent results of [Braverman et al. FOCS '22] by a factor k.

    Comment: FOCS 2023. Abstract reduced for arxiv requirement
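
    For context, a coreset is a small weighted point set on which any candidate centers incur approximately the same clustering cost as on the full input. The sketch below illustrates that guarantee with plain uniform sampling (the randomized route the abstract improves on; the paper's deterministic constructions are substantially more involved and not reproduced here).

    import numpy as np

    def kmeans_cost(points, centers, weights=None):
        """Sum of (weighted) squared distances to each point's nearest center."""
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(1)
        return float(d2 @ weights) if weights is not None else float(d2.sum())

    rng = np.random.default_rng(0)
    n, m, k = 100_000, 2_000, 5
    points = rng.normal(size=(n, 10))

    # Uniform-sampling coreset: m sampled points, each reweighted by n/m.
    sample = points[rng.choice(n, size=m, replace=False)]
    weights = np.full(m, n / m)

    centers = rng.normal(size=(k, 10))   # any candidate set of centers
    full = kmeans_cost(points, centers)
    core = kmeans_cost(sample, centers, weights)
    print(f"full cost {full:.3e}  coreset estimate {core:.3e}  ratio {core/full:.3f}")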

    Resolving Conflicts for Lower-Bounded Clustering

    This paper considers the effect of non-metric distances for lower-bounded clustering, i.e., the problem of computing a partition for a given set of objects with pairwise distances, such that each set has a certain minimum cardinality (as required for anonymisation or balanced facility location problems). We discuss lower-bounded clustering with the objective to minimise the maximum radius or diameter of the clusters. For these problems a 2-approximation exists, but only if the pairwise distance on the objects satisfies the triangle inequality; without this property, no polynomial-time constant factor approximation is possible unless P=NP. We try to resolve, or at least soften, this effect of non-metric distances by devising particular strategies to deal with violations of the triangle inequality (conflicts). Using parameterised algorithmics, we find that if the number of such conflicts is not too large, constant factor approximations can still be computed efficiently. In particular, we introduce parameterised approximations with respect to not just the number of conflicts but also the vertex cover number of the conflict graph (the graph induced by the conflicts). Interestingly, we salvage the approximation ratio of 2 for diameter, while for radius it is only possible to show a ratio of 3. For the parameter vertex cover number of the conflict graph, this worsening in ratio is shown to be unavoidable unless FPT=W[2]. We further discuss improvements for diameter by choosing the (induced) P_3-cover number of the conflict graph as parameter, and complement these by showing that, unless FPT=W[1], there exists no constant factor parameterised approximation with respect to the parameter split vertex deletion set.
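
    To picture the conflict graph, one plausible formalization (an assumption, not necessarily the paper's exact definition) is to flag a pair (i, k) as a conflict whenever some witness j gives d(i,k) > d(i,j) + d(j,k), and then cover the conflicts greedily; the classic greedy algorithm 2-approximates the vertex cover number used as a parameter above.

    import itertools
    import numpy as np

    def conflict_graph(d):
        """Edges (i, k) whose distance violates the triangle inequality via some j."""
        n = len(d)
        edges = set()
        for i, j, k in itertools.permutations(range(n), 3):
            if d[i][k] > d[i][j] + d[j][k]:
                edges.add((min(i, k), max(i, k)))
        return edges

    def greedy_vertex_cover(edges):
        """Classic 2-approximation: repeatedly take both endpoints of some edge."""
        cover, remaining = set(), set(edges)
        while remaining:
            u, v = remaining.pop()
            cover |= {u, v}
            remaining = {e for e in remaining if u not in e and v not in e}
        return cover

    # Symmetric "distances" with one deliberate triangle-inequality violation.
    d = np.array([[0, 1, 1, 9],
                  [1, 0, 1, 1],
                  [1, 1, 0, 1],
                  [9, 1, 1, 0]], dtype=float)
    edges = conflict_graph(d)
    print("conflicts:", edges, "cover:", greedy_vertex_cover(edges))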

    Light-Based Sample Reduction Methods for Interactive Relighting of Scenes with Minute Geometric Scale

    Rendering production-quality cinematic scenes requires high computational and temporal costs. From an artist's perspective, one must wait several hours for feedback on even minute changes to light positions and parameters. Previous work approximates scenes so that adjustments to lights can be carried out with interactive feedback, so long as geometry and materials remain constant. We build on these methods by proposing means by which objects with high geometric complexity at the subpixel level, such as hair and foliage, can be approximated for real-time cinematic relighting. Our methods make no assumptions about the geometry or shaders in a scene, and as such are fully generalized. We show that clustering techniques can greatly reduce multisampling while still maintaining image fidelity, at an error significantly lower than sparse sampling without clustering, provided that no shadows are computed. Scenes that produce noise-like shadow patterns when sparse shadow samples are taken suffer from additional error introduced by those shadows. We present a viable solution to scalable scene approximation for lower sampling resolutions, provided a robust solution to shadow approximation for sub-pixel geometry can be provided in the future.
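
    The sample-reduction idea, replacing many similar light samples with a few weighted representatives, can be sketched with off-the-shelf clustering. The following is only a schematic analogue under assumed names; the paper's actual clustering of sub-pixel geometry samples is more specialized.

    import numpy as np

    def cluster_lights(positions, intensities, k, iters=20):
        """Reduce many point lights to k weighted representatives via k-means.

        Returns representative positions and each cluster's summed intensity,
        so the total emitted energy is preserved.
        """
        rng = np.random.default_rng(0)
        centers = positions[rng.choice(len(positions), k, replace=False)]
        for _ in range(iters):
            # Assign each light to its nearest representative, then recenter.
            d2 = ((positions[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
            assign = d2.argmin(1)
            for c in range(k):
                if np.any(assign == c):
                    centers[c] = positions[assign == c].mean(0)
        weights = np.array([intensities[assign == c].sum() for c in range(k)])
        return centers, weights

    rng = np.random.default_rng(2)
    pos = rng.uniform(0, 10, (5000, 3))       # thousands of virtual point lights
    inten = rng.uniform(0.1, 1.0, 5000)
    reps, w = cluster_lights(pos, inten, k=32)
    print(reps.shape, w.sum(), inten.sum())   # total intensity is preserved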