    Representation Learning for Clustering: A Statistical Framework

    We address the problem of communicating domain knowledge from a user to the designer of a clustering algorithm. We propose a protocol in which the user provides a clustering of a relatively small random sample of a data set. The algorithm designer then uses that sample to come up with a data representation under which $k$-means clustering results in a clustering (of the full data set) that is aligned with the user's clustering. We provide a formal statistical model for analyzing the sample complexity of learning a clustering representation with this paradigm. We then introduce a notion of capacity of a class of possible representations, in the spirit of the VC-dimension, showing that classes of representations that have finite such dimension can be successfully learned with sample size error bounds, and end our discussion with an analysis of that dimension for classes of representations induced by linear embeddings. Comment: To be published in Proceedings of UAI 201

    $\varepsilon$-Coresets for Clustering (with Outliers) in Doubling Metrics

    We study the problem of constructing $\varepsilon$-coresets for the $(k, z)$-clustering problem in a doubling metric $M(X, d)$. An $\varepsilon$-coreset is a weighted subset $S\subseteq X$ with weight function $w : S \rightarrow \mathbb{R}_{\geq 0}$, such that for any $k$-subset $C \in [X]^k$, it holds that $\sum_{x \in S}{w(x) \cdot d^z(x, C)} \in (1 \pm \varepsilon) \cdot \sum_{x \in X}{d^z(x, C)}$. We present an efficient algorithm that constructs an $\varepsilon$-coreset for the $(k, z)$-clustering problem in $M(X, d)$, where the size of the coreset only depends on the parameters $k, z, \varepsilon$ and the doubling dimension $\mathsf{ddim}(M)$. To the best of our knowledge, this is the first efficient $\varepsilon$-coreset construction of size independent of $|X|$ for general clustering problems in doubling metrics. To this end, we establish the first relation between the doubling dimension of $M(X, d)$ and the shattering dimension (or VC-dimension) of the range space induced by the distance $d$. Such a relation was not known before, since one can easily construct instances in which neither one can be bounded by (some function of) the other. Surprisingly, we show that if we allow a small $(1\pm\epsilon)$-distortion of the distance function $d$, and consider the notion of $\tau$-error probabilistic shattering dimension, we can prove an upper bound of $O( \mathsf{ddim}(M)\cdot \log(1/\varepsilon) +\log\log{\frac{1}{\tau}} )$ for the probabilistic shattering dimension for even weighted doubling metrics. We believe this new relation is of independent interest and may find other applications. We also study the robust coresets and centroid sets in doubling metrics. Our robust coreset construction leads to new results in clustering and property testing, and the centroid sets can be used to accelerate the local search algorithms for clustering problems. Comment: Appeared in FOCS 2018, this is the full versio
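
    To make the coreset condition in this abstract concrete, here is a minimal Python sketch that evaluates the weighted (k, z)-clustering cost and checks the (1 ± ε) bound for one given set of centers. It assumes Euclidean distance as a stand-in for the metric d, and the function names are hypothetical, not taken from the paper. A true ε-coreset must satisfy the bound for every k-subset C, so this only tests a single candidate and does not construct a coreset.

```python
import math

def clustering_cost(points, weights, centers, z):
    """Weighted (k, z)-clustering cost: sum over x of w(x) * d(x, C)^z,
    where d(x, C) is the distance from x to its nearest center.
    Euclidean distance is an assumption made for this sketch."""
    total = 0.0
    for p, w in zip(points, weights):
        nearest = min(math.dist(p, c) for c in centers)
        total += w * nearest ** z
    return total

def satisfies_coreset_bound(X, S, w, centers, z, eps):
    """Check cost(S, w, C) in (1 +/- eps) * cost(X, C) for ONE center set C."""
    full = clustering_cost(X, [1.0] * len(X), centers, z)
    core = clustering_cost(S, w, centers, z)
    return (1 - eps) * full <= core <= (1 + eps) * full

# Toy data: duplicated points collapse exactly into a weighted subset.
X = [(0.0, 0.0), (0.0, 0.0), (4.0, 0.0), (4.0, 0.0)]
S = [(0.0, 0.0), (4.0, 0.0)]
w = [2.0, 2.0]
ok = satisfies_coreset_bound(X, S, w, [(1.0, 0.0)], z=2, eps=0.1)
```

    In the toy data each distinct point appears twice, so giving each representative weight 2 reproduces the full cost exactly for any center set.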

    Importance of small earthquakes for stress transfers and earthquake triggering

    We estimate the relative importance of small and large earthquakes for static stress changes and for earthquake triggering, assuming that earthquakes are triggered by static stress changes and that earthquakes are located on a fractal network of dimension D. This model predicts that both the number of events triggered by an earthquake of magnitude m and the stress change induced by this earthquake at the location of other earthquakes increase with m as ~10^(Dm/2). The stronger the spatial clustering, the larger the influence of small earthquakes on stress changes at the location of a future event as well as earthquake triggering. If earthquake magnitudes follow the Gutenberg-Richter law with b>D/2, small earthquakes collectively dominate stress transfer and earthquake triggering, because their greater frequency overcomes their smaller individual triggering potential. Using a Southern-California catalog, we observe that the rate of seismicity triggered by an earthquake of magnitude m increases with m as 10^(alpha m), where alpha=1.00±0.05. We also find that the magnitude distribution of triggered earthquakes is independent of the triggering earthquake magnitude m. When alpha=b, small earthquakes are roughly as important to earthquake triggering as larger ones. We evaluate the fractal correlation dimension of hypocenters D=2 using two relocated catalogs for Southern California, and removing the effect of short-term clustering. Thus D=2alpha as predicted by assuming that earthquake triggering is due to static stress. The value D=2 implies that small earthquakes are as important as larger ones for stress transfers between earthquakes. Comment: 14 pages, 7 eps figures, latex. In press in J. Geophys. Re
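
    The balance between frequency and triggering potential described in this abstract can be illustrated with a short Python sketch. It assumes the Gutenberg-Richter form, in which the number of events at magnitude m scales as 10^(-b m), combined with the per-event triggering rate 10^(alpha m) quoted above, so the collective contribution at magnitude m scales as 10^((alpha - b) m). The function name and the value b = 1.2 are hypothetical illustrations, and constant prefactors are dropped.

```python
def collective_triggering(m, alpha, b):
    """Relative triggering contribution of all events of magnitude m.

    Assumes event count ~ 10^(-b*m) (Gutenberg-Richter) and per-event
    triggering ~ 10^(alpha*m); constant prefactors are omitted.
    """
    return 10 ** ((alpha - b) * m)

# alpha == b: every magnitude contributes equally.
equal_small = collective_triggering(3.0, alpha=1.0, b=1.0)
equal_large = collective_triggering(6.0, alpha=1.0, b=1.0)

# b > alpha (illustrative value): small events dominate collectively.
dominant_small = collective_triggering(3.0, alpha=1.0, b=1.2)
dominant_large = collective_triggering(6.0, alpha=1.0, b=1.2)
```

    With alpha equal to b the exponent vanishes, which matches the statement that small earthquakes are then roughly as important to triggering as larger ones.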

    Ewald Sums for One Dimension

    We derive analytic solutions for the potential and field in a one-dimensional system of masses or charges with periodic boundary conditions, in other words Ewald sums for one dimension. We also provide a set of tools for exploring the system evolution and show that it is possible to construct an efficient algorithm for carrying out simulations. In the cosmological setting we show that two approaches for satisfying periodic boundary conditions, one overly specified and the other completely general, provide a nearly identical clustering evolution until the number of clusters becomes small, at which time the influence of any size-dependent boundary cannot be ignored. Finally we compare the results with other recent work with the hope of providing clarification over differences these issues have induced. We explain that modern formulations of physics require a well defined potential which is not available if the forces are screened directly. Comment: 2 figures added references expanded discussion of algorithm corrected figures added discussion of screened forc