Search CORE

11,706 research outputs found

Far-Field Compression for Fast Kernel Summation Methods in High Dimensions

Author: Biros George
March William B.
Publication venue
Publication date: 12/02/2015
Field of study

We consider fast kernel summations in high dimensions: given a large set of points in

d

dimensions (with

d \gg 3

) and a pair-potential function (the {\em kernel} function), we compute a weighted sum of all pairwise kernel interactions for each point in the set. Direct summation is equivalent to a (dense) matrix-vector multiplication and scales quadratically with the number of points. Fast kernel summation algorithms reduce this cost to log-linear or linear complexity. Treecodes and Fast Multipole Methods (FMMs) deliver tremendous speedups by constructing approximate representations of interactions of points that are far from each other. In algebraic terms, these representations correspond to low-rank approximations of blocks of the overall interaction matrix. Existing approaches require an excessive number of kernel evaluations with increasing

d

and number of points in the dataset. To address this issue, we use a randomized algebraic approach in which we first sample the rows of a block and then construct its approximate, low-rank interpolative decomposition. We examine the feasibility of this approach theoretically and experimentally. We provide a new theoretical result showing a tighter bound on the reconstruction error from uniformly sampling rows than the existing state-of-the-art. We demonstrate that our sampling approach is competitive with existing (but prohibitively expensive) methods from the literature. We also construct kernel matrices for the Laplacian, Gaussian, and polynomial kernels -- all commonly used in physics and data analysis. We explore the numerical properties of blocks of these matrices, and show that they are amenable to our approach. Depending on the data set, our randomized algorithm can successfully compute low rank approximations in high dimensions. We report results for data sets with ambient dimensions from four to 1,000.Comment: 43 pages, 21 figure

arXiv.org e-Print Archive

CiteSeerX

Scaling-up Empirical Risk Minimization: Optimization of Incomplete U-statistics

Author: Bellet Aurélien
Clémençon Stéphan
Colin Igor
Publication venue
Publication date: 01/01/2016
Field of study

In a wide range of statistical learning problems such as ranking, clustering or metric learning among others, the risk is accurately estimated by

U

-statistics of degree

d\geq 1

, i.e. functionals of the training data with low variance that take the form of averages over

k

-tuples. From a computational perspective, the calculation of such statistics is highly expensive even for a moderate sample size

n

, as it requires averaging

O(n^d)

terms. This makes learning procedures relying on the optimization of such data functionals hardly feasible in practice. It is the major goal of this paper to show that, strikingly, such empirical risks can be replaced by drastically computationally simpler Monte-Carlo estimates based on

O(n)

terms only, usually referred to as incomplete

U

-statistics, without damaging the

O_{\mathbb{P}}(1/\sqrt{n})

learning rate of Empirical Risk Minimization (ERM) procedures. For this purpose, we establish uniform deviation results describing the error made when approximating a

U

-process by its incomplete version under appropriate complexity assumptions. Extensions to model selection, fast rate situations and various sampling techniques are also considered, as well as an application to stochastic gradient descent for ERM. Finally, numerical examples are displayed in order to provide strong empirical evidence that the approach we promote largely surpasses more naive subsampling techniques.Comment: To appear in Journal of Machine Learning Research. 34 pages. v2: minor correction to Theorem 4 and its proof, added 1 reference. v3: typo corrected in Proposition 3. v4: improved presentation, added experiments on model selection for clustering, fixed minor typo

arXiv.org e-Print Archive

HAL - Lille 3

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot

Tree-Independent Dual-Tree Algorithms

Author: Anderson David V.
Curtin Ryan R.
Gray Alexander G.
Isbell Jr Charles L.
March William B.
Ram Parikshit
Publication venue
Publication date: 16/04/2013
Field of study

Dual-tree algorithms are a widely used class of branch-and-bound algorithms. Unfortunately, developing dual-tree algorithms for use with different trees and problems is often complex and burdensome. We introduce a four-part logical split: the tree, the traversal, the point-to-point base case, and the pruning rule. We provide a meta-algorithm which allows development of dual-tree algorithms in a tree-independent manner and easy extension to entirely new types of trees. Representations are provided for five common algorithms; for k-nearest neighbor search, this leads to a novel, tighter pruning bound. The meta-algorithm also allows straightforward extensions to massively parallel settings.Comment: accepted in ICML 201

arXiv.org e-Print Archive

CiteSeerX

An ontology enhanced parallel SVM for scalable spam filter training

Author: Bauer
Blanco
Blanzieri
Blei
Breiman
Cao
Caruana
Chawla
Colas
Cristianini
Dean
Do
Gansterer
Godwin Caruana
Graf
Hall
Huang
Kearns
Kim
Maozhen Li
Mei
Platt
Suykens
Taura
Vapnik
Wang
Woodsend
Yang Liu
Zanghirati
Zhang
Publication venue: 'Elsevier BV'
Publication date: 01/05/2013
Field of study

This is the post-print version of the final paper published in Neurocomputing. The published article is available from the link below. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. Copyright @ 2013 Elsevier B.V.Spam, under a variety of shapes and forms, continues to inflict increased damage. Varying approaches including Support Vector Machine (SVM) techniques have been proposed for spam filter training and classification. However, SVM training is a computationally intensive process. This paper presents a MapReduce based parallel SVM algorithm for scalable spam filter training. By distributing, processing and optimizing the subsets of the training data across multiple participating computer nodes, the parallel SVM reduces the training time significantly. Ontology semantics are employed to minimize the impact of accuracy degradation when distributing the training data among a number of SVM classifiers. Experimental results show that ontology based augmentation improves the accuracy level of the parallel SVM beyond the original sequential counterpart

Crossref

Brunel University Research Archive