
    Efficient Classification for Metric Data

    Recent advances in large-margin classification of data residing in general metric spaces (rather than Hilbert spaces) enable classification under various natural metrics, such as string edit and earthmover distance. A general framework developed for this purpose by von Luxburg and Bousquet [JMLR, 2004] left open the questions of computational efficiency and of providing direct bounds on generalization error. We design a new algorithm for classification in general metric spaces, whose runtime and accuracy depend on the doubling dimension of the data points, and can thus achieve superior classification performance in many common scenarios. The algorithmic core of our approach is an approximate (rather than exact) solution to the classical problems of Lipschitz extension and of Nearest Neighbor Search. The algorithm's generalization performance is guaranteed via the fat-shattering dimension of Lipschitz classifiers, and we present experimental evidence of its superiority to some common kernel methods. As a by-product, we offer a new perspective on the nearest neighbor classifier, which yields significantly sharper risk asymptotics than the classic analysis of Cover and Hart [IEEE Trans. Info. Theory, 1967].

    Comment: This is the full version of an extended abstract that appeared in Proceedings of the 23rd COLT, 201
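As a rough illustration of classification in a metric space that is not a Hilbert space, the sketch below labels a string by the sign of d(x, S-) - d(x, S+), where S+ and S- are the positive and negative training points and d is string edit distance. This quantity is a Lipschitz function of x, and its sign agrees with the 1-nearest-neighbor label. It is an exact brute-force computation, not the approximate algorithm described in the abstract; the function names and the toy sample are assumptions made for illustration.

```python
def edit_distance(s, t):
    """Levenshtein distance between strings s and t (row-by-row dynamic program)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # delete cs
                           cur[j - 1] + 1,              # insert ct
                           prev[j - 1] + (cs != ct)))   # substitute or match
        prev = cur
    return prev[-1]


def lipschitz_classify(x, sample, dist):
    """Label x by the sign of d(x, S-) - d(x, S+); ties go to +1.

    `sample` is a list of (point, label) pairs with labels in {+1, -1}
    and must contain at least one point of each label.
    """
    d_pos = min(dist(x, p) for p, y in sample if y == +1)
    d_neg = min(dist(x, p) for p, y in sample if y == -1)
    return 1 if d_pos <= d_neg else -1


# Hypothetical toy sample of labeled strings.
sample = [("kitten", +1), ("sitting", +1), ("dog", -1), ("dig", -1)]
label_mitten = lipschitz_classify("mitten", sample, edit_distance)
label_dug = lipschitz_classify("dug", sample, edit_distance)
```

Each query here scans the whole sample, so the cost grows linearly with the number of training points; the abstract's point is that approximate nearest neighbor search can avoid this when the doubling dimension is low.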

    Faster Clustering via Preprocessing

    We examine the efficiency of clustering a set of points, when the encompassing metric space may be preprocessed in advance. In computational problems of this genre, there is a first stage of preprocessing, whose input is a collection of points $M$; the next stage receives as input a query set $Q\subset M$, and should report a clustering of $Q$ according to some objective, such as 1-median, in which case the answer is a point $a\in M$ minimizing $\sum_{q\in Q} d_M(a,q)$. We design fast algorithms that approximately solve such problems under standard clustering objectives like $p$-center and $p$-median, when the metric $M$ has low doubling dimension. By leveraging the preprocessing stage, our algorithms achieve query time that is near-linear in the query size $n=|Q|$, and is (almost) independent of the total number of points $m=|M|$.

    Comment: 24 page
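To make the 1-median objective concrete, here is a minimal brute-force sketch that returns the point a in M minimizing the sum of distances to the query set Q. It uses no preprocessing and takes time proportional to |M| times |Q|, which is exactly the dependence on |M| that the abstract's algorithms are designed to avoid. The function name and the example points on the real line are assumptions made for illustration.

```python
def one_median(points, query, dist):
    """Return the point a in `points` minimizing sum(dist(a, q) for q in query)."""
    return min(points, key=lambda a: sum(dist(a, q) for q in query))


# Hypothetical example: M is a set of points on the real line, Q a subset of M.
M = [0.0, 1.0, 2.0, 5.0, 10.0]
Q = [1.0, 2.0, 10.0]
center = one_median(M, Q, lambda a, b: abs(a - b))  # total cost at 2.0 is 1 + 0 + 8 = 9
```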