For supervised and unsupervised learning, positive definite kernels make it
possible to use large and potentially infinite-dimensional feature spaces with a
computational cost that only depends on the number of observations. This is
usually done through the penalization of predictor functions by Euclidean or
Hilbertian norms. In this paper, we explore penalizing by sparsity-inducing
norms such as the l1-norm or the block l1-norm. We assume that the kernel
decomposes into a large sum of individual basis kernels which can be embedded
in a directed acyclic graph; we show that it is then possible to perform kernel
selection through a hierarchical multiple kernel learning framework, in
polynomial time in the number of selected kernels. This framework is naturally
applied to nonlinear variable selection; our extensive simulations on
synthetic datasets and datasets from the UCI repository show that efficiently
exploring the large feature space through sparsity-inducing norms leads to
state-of-the-art predictive performance.
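
As an illustration of a kernel that decomposes into a large sum of basis kernels, the sketch below checks numerically that a product kernel over p variables, prod_j (1 + k_j(x_j, y_j)), equals the sum over all 2^p variable subsets J of prod_{j in J} k_j(x_j, y_j). The subsets, ordered by inclusion, form a directed acyclic graph. This is only one possible instance, chosen for illustration; the Gaussian basis kernel, the value of `gamma`, and the function names are assumptions that do not come from the text above.

```python
import itertools
import math


def rbf(a, b, gamma=1.0):
    """Gaussian basis kernel on one scalar variable (an illustrative choice)."""
    return math.exp(-gamma * (a - b) ** 2)


def product_kernel(x, y):
    """Closed form prod_j (1 + k_j(x_j, y_j)); cost is linear in the dimension."""
    return math.prod(1.0 + rbf(a, b) for a, b in zip(x, y))


def subset_sum_kernel(x, y):
    """The same kernel written as an explicit sum of 2**p basis kernels."""
    p = len(x)
    total = 0.0
    for size in range(p + 1):
        for subset in itertools.combinations(range(p), size):
            # Basis kernel for subset J: prod_{j in J} k_j(x_j, y_j).
            # The empty subset contributes the constant kernel 1.
            total += math.prod(rbf(x[j], y[j]) for j in subset)
    return total


x = [0.1, -0.4, 1.3]
y = [0.0, 0.5, 0.9]
closed = product_kernel(x, y)
expanded = subset_sum_kernel(x, y)
print(abs(closed - expanded) < 1e-12)
```

The explicit sum has a number of terms that grows exponentially with the number of variables, which is why selecting only a few basis kernels, rather than enumerating all of them, matters computationally.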