Domain adaptation for sequence labeling using hidden Markov models
Most natural language processing systems based on machine learning are not
robust to domain shift. For example, a state-of-the-art syntactic dependency
parser trained on Wall Street Journal sentences has an absolute drop in
performance of more than ten points when tested on textual data from the Web.
An efficient solution to make these methods more robust to domain shift is to
first learn a word representation using large amounts of unlabeled data from
both domains, and then use this representation as features in a supervised
learning algorithm. In this paper, we propose to use hidden Markov models to
learn word representations for part-of-speech tagging. In particular, we study
the influence of using data from the source, the target or both domains to
learn the representation and the different ways to represent words using an
HMM. Comment: New Directions in Transfer and Multi-Task: Learning Across Domains
and Tasks (NIPS Workshop) (2013).
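To make the feature-augmentation step concrete, here is a minimal sketch (not the authors' implementation): each word's Viterbi state under an HMM trained on unlabeled text from both domains is assumed to be precomputed in a hypothetical `word2state` mapping, and is appended to standard lexical features of a scikit-learn tagger.

```python
# Sketch: plug HMM-derived word representations into a supervised POS tagger.
# Assumes `word2state` (word -> Viterbi state id of an HMM trained on unlabeled
# source+target text) has been computed beforehand; names are illustrative.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

word2state = {"the": 3, "dog": 17, "barks": 5}          # hypothetical HMM states

def features(sent, i):
    """Standard lexical features plus the HMM state of the current word."""
    w = sent[i]
    return {
        "word": w.lower(),
        "suffix3": w[-3:],
        "prev": sent[i - 1].lower() if i > 0 else "<s>",
        "hmm_state": str(word2state.get(w.lower(), -1)),  # unseen words map to -1
    }

# Toy labeled data (token level): one (tokens, tags) pair.
sents = [(["the", "dog", "barks"], ["DET", "NOUN", "VERB"])]
X = [features(s, i) for s, tags in sents for i in range(len(s))]
y = [t for _, tags in sents for t in tags]

tagger = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
tagger.fit(X, y)
```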
On the Equivalence between Herding and Conditional Gradient Algorithms
We show that the herding procedure of Welling (2009) takes exactly the form
of a standard convex optimization algorithm--namely a conditional gradient
algorithm minimizing a quadratic moment discrepancy. This link enables us to
invoke convergence results from convex optimization and to consider faster
alternatives for the task of approximating integrals in a reproducing kernel
Hilbert space. We study the behavior of the different variants through
numerical simulations. The experiments indicate that while we can improve over
herding on the task of approximating integrals, the original herding algorithm
tends to approach the maximum entropy distribution more often, shedding more
light on the learning bias behind herding.
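As an illustration of this link, the following is a minimal numpy sketch of greedy kernel herding on a finite candidate set, where the target distribution is taken to be the empirical distribution of the candidates and an RBF kernel is assumed; it is a toy instance of the conditional-gradient view, not the authors' code.

```python
# Minimal sketch of kernel herding on a finite candidate set (illustrative only).
# Target distribution: the empirical distribution of the candidates themselves.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                     # candidate points

def rbf(A, B, gamma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

K = rbf(X, X)
mean_embedding = K.mean(axis=1)                   # E_p[k(x, .)] at each candidate

selected = []
for t in range(20):
    if selected:
        repulsion = K[:, selected].mean(axis=1)   # average kernel to chosen points
    else:
        repulsion = 0.0
    scores = mean_embedding - repulsion           # greedy herding / conditional-gradient step
    selected.append(int(np.argmax(scores)))

# The selected points approximate integrals under p: (1/20) sum_i f(x_i) ~ E_p[f].
print(X[selected].mean(axis=0))                   # should be close to the true mean
```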
Optimization with Sparsity-Inducing Penalties
Sparse estimation methods are aimed at using or obtaining parsimonious
representations of data or models. They were first dedicated to linear variable
selection but numerous extensions have now emerged such as structured sparsity
or kernel selection. It turns out that many of the related estimation problems
can be cast as convex optimization problems by regularizing the empirical risk
with appropriate non-smooth norms. The goal of this paper is to present from a
general perspective optimization tools and techniques dedicated to such
sparsity-inducing penalties. We cover proximal methods, block-coordinate
descent, reweighted l_2-penalized techniques, working-set and homotopy
methods, as well as non-convex formulations and extensions, and provide an
extensive set of experiments to compare various algorithms from a computational
point of view.
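Among the tools surveyed, the proximal method is the easiest to sketch. Below is a minimal, illustrative ISTA loop for l_1-penalized least squares, with soft-thresholding as the proximal operator and the step size taken from the Lipschitz constant of the smooth part; names and problem sizes are made up for the example.

```python
# Minimal sketch of proximal gradient (ISTA) for
#   min_w 0.5 * ||X w - y||^2 + lam * ||w||_1
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (element-wise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(X, y, lam, n_iter=500):
    L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y)           # gradient of the smooth part
        w = soft_threshold(w - grad / L, lam / L)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20); w_true[:3] = 1.0
y = X @ w_true + 0.01 * rng.normal(size=100)
print(ista(X, y, lam=1.0).round(2))        # recovers a sparse estimate
```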
Convex and Network Flow Optimization for Structured Sparsity
We consider a class of learning problems regularized by a structured
sparsity-inducing norm defined as the sum of l_2- or l_infinity-norms over
groups of variables. Whereas much effort has been put in developing fast
optimization techniques when the groups are disjoint or embedded in a
hierarchy, we address here the case of general overlapping groups. To this end,
we present two different strategies: On the one hand, we show that the proximal
operator associated with a sum of l_infinity-norms can be computed exactly in
polynomial time by solving a quadratic min-cost flow problem, allowing the use
of accelerated proximal gradient methods. On the other hand, we use proximal
splitting techniques, and address an equivalent formulation with
non-overlapping groups, but in higher dimension and with additional
constraints. We propose efficient and scalable algorithms exploiting these two
strategies, which are significantly faster than alternative approaches. We
illustrate these methods with several problems such as CUR matrix
factorization, multi-task learning of tree-structured dictionaries, background
subtraction in video sequences, image denoising with wavelets, and topographic
dictionary learning of natural image patches. Comment: to appear in the Journal of Machine Learning Research (JMLR).
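The exact proximal operator for overlapping l_infinity groups via min-cost flows is too involved for a snippet; as a point of reference, the sketch below covers only the much simpler disjoint-group case with l_2 norms, where the proximal operator decouples into block soft-thresholding. The group structure is hypothetical.

```python
# Sketch of the proximal operator of lam * sum_g ||w_g||_2 for DISJOINT groups
# (block soft-thresholding). The overlapping case treated in the paper requires
# the min-cost-flow machinery and is not reproduced here.
import numpy as np

def prox_group_l2(v, groups, lam):
    w = v.copy()
    for g in groups:                       # each g is a list of disjoint indices
        norm_g = np.linalg.norm(v[g])
        scale = max(0.0, 1.0 - lam / norm_g) if norm_g > 0 else 0.0
        w[g] = scale * v[g]
    return w

v = np.array([3.0, -1.0, 0.2, 0.1, 2.0, -2.0])
groups = [[0, 1], [2, 3], [4, 5]]          # hypothetical disjoint groups
print(prox_group_l2(v, groups, lam=1.0))   # the weak middle group is set to zero
```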
Network Flow Algorithms for Structured Sparsity
We consider a class of learning problems that involve a structured
sparsity-inducing norm defined as the sum of l_infinity-norms over groups of
variables. Whereas a lot of effort has been put in developing fast optimization
methods when the groups are disjoint or embedded in a specific hierarchical
structure, we address here the case of general overlapping groups. To this end,
we show that the corresponding optimization problem is related to network flow
optimization. More precisely, the proximal problem associated with the norm we
consider is dual to a quadratic min-cost flow problem. We propose an efficient
procedure which computes its solution exactly in polynomial time. Our algorithm
scales up to millions of variables, and opens up a whole new range of
applications for structured sparse models. We present several experiments on
image and video data, demonstrating the applicability and scalability of our
approach for various problems. Comment: accepted for publication in Adv. Neural Information Processing Systems, 2010.
Proximal Methods for Hierarchical Sparse Coding
Sparse coding consists in representing signals as sparse linear combinations
of atoms selected from a dictionary. We consider an extension of this framework
where the atoms are further assumed to be embedded in a tree. This is achieved
using a recently introduced tree-structured sparse regularization norm, which
has proven useful in several applications. This norm leads to regularized
problems that are difficult to optimize, and we propose in this paper efficient
algorithms for solving them. More precisely, we show that the proximal operator
associated with this norm is computable exactly via a dual approach that can be
viewed as the composition of elementary proximal operators. Our procedure has a
complexity linear, or close to linear, in the number of atoms, and allows the
use of accelerated gradient techniques to solve the tree-structured sparse
approximation problem at the same computational cost as traditional ones using
the L1-norm. Our method is efficient and scales gracefully to millions of
variables, which we illustrate in two types of applications: first, we consider
fixed hierarchical dictionaries of wavelets to denoise natural images. Then, we
apply our optimization tools in the context of dictionary learning, where
learned dictionary elements naturally organize in a prespecified arborescent
structure, leading to a better performance in reconstruction of natural image
patches. When applied to text documents, our method learns hierarchies of
topics, thus providing a competitive alternative to probabilistic topic models.
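The composition result can be sketched for the l_2 variant of the norm: applying the elementary group prox of each node of the tree, ordered from the leaves up to the root, yields the exact proximal operator. The toy tree below is hypothetical and the code only illustrates that ordering.

```python
# Sketch of the tree-structured proximal operator (l_2 version): it is computed
# exactly by composing the elementary group prox of each node, leaves to root.
import numpy as np

def group_shrink(w, idx, lam):
    n = np.linalg.norm(w[idx])
    w[idx] *= max(0.0, 1.0 - lam / n) if n > 0 else 0.0
    return w

def prox_tree(v, groups_leaves_to_root, lam):
    w = v.copy()
    for idx in groups_leaves_to_root:      # children must appear before parents
        w = group_shrink(w, idx, lam)
    return w

# Hypothetical tree over 4 variables: leaves {0}, {1}, {2,3}, root {0,1,2,3}.
groups = [[0], [1], [2, 3], [0, 1, 2, 3]]
v = np.array([2.0, 0.3, 1.5, -1.5])
print(prox_tree(v, groups, lam=0.5))
```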
Convex Relaxation for Combinatorial Penalties
In this paper, we propose a unifying view of several recently proposed
structured sparsity-inducing norms. We consider the situation of a model
simultaneously (a) penalized by a set-function defined on the support of the
unknown parameter vector which represents prior knowledge on supports, and (b)
regularized in Lp-norm. We show that the natural combinatorial optimization
problems obtained may be relaxed into convex optimization problems and
introduce a notion, the lower combinatorial envelope of a set-function, that
characterizes the tightness of our relaxations. We moreover establish links
with norms based on latent representations including the latent group Lasso and
block-coding, and with norms obtained from submodular functions. Comment: 35 pages.
Trace Lasso: a trace norm regularization for correlated designs
Using the l_1-norm to regularize the estimation of the parameter vector
of a linear model leads to an unstable estimator when covariates are highly
correlated. In this paper, we introduce a new penalty function which takes into
account the correlation of the design matrix to stabilize the estimation. This
norm, called the trace Lasso, uses the trace norm, which is a convex surrogate
of the rank, of the selected covariates as the criterion of model complexity.
We analyze the properties of our norm, describe an optimization algorithm based
on reweighted least-squares, and illustrate the behavior of this norm on
synthetic data, showing that it is more adapted to strong correlations than
competing methods such as the elastic net.
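Concretely, the trace Lasso penalty can be written as the nuclear norm of the design matrix with its columns rescaled by the coefficients, Omega(w) = ||X diag(w)||_*. The short sketch below (illustrative, with a made-up design) computes it and checks the two extreme cases: orthonormal columns recover the l_1 norm and identical columns recover the l_2 norm.

```python
# Sketch of the trace Lasso penalty Omega(w) = ||X diag(w)||_* (nuclear norm),
# assuming unit-norm columns of X. It interpolates between the l_1 norm
# (orthonormal design) and the l_2 norm (identical columns).
import numpy as np

def trace_lasso(X, w):
    return np.linalg.norm(X @ np.diag(w), ord="nuc")

rng = np.random.default_rng(0)
w = rng.normal(size=4)

X_orth = np.eye(6)[:, :4]                     # orthonormal columns
X_same = np.tile(rng.normal(size=(6, 1)), 4)  # identical columns
X_same /= np.linalg.norm(X_same, axis=0)

print(trace_lasso(X_orth, w), np.abs(w).sum())       # ~ l_1 norm
print(trace_lasso(X_same, w), np.linalg.norm(w))     # ~ l_2 norm
```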
Structured Sparse Principal Component Analysis
We present an extension of sparse PCA, or sparse dictionary learning, where
the sparsity patterns of all dictionary elements are structured and constrained
to belong to a prespecified set of shapes. This structured sparse PCA is
based on a structured regularization recently introduced by [1]. While
classical sparse priors only deal with cardinality, the regularization
we use encodes higher-order information about the data. We propose an efficient
and simple optimization procedure to solve this problem. Experiments with two
practical tasks, face recognition and the study of the dynamics of a protein
complex, demonstrate the benefits of the proposed structured approach over
unstructured approaches.
Learning Hierarchical and Topographic Dictionaries with Structured Sparsity
Recent work in signal processing and statistics has focused on defining new
regularization functions, which not only induce sparsity of the solution, but
also take into account the structure of the problem. We present in this paper a
class of convex penalties introduced in the machine learning community, which
take the form of a sum of l_2 and l_infinity-norms over groups of variables.
They extend the classical group-sparsity regularization in the sense that the
groups possibly overlap, allowing more flexibility in the group design. We
review efficient optimization methods to deal with the corresponding inverse
problems, and their application to the problem of learning dictionaries of
natural image patches: On the one hand, dictionary learning has indeed proven
effective for various signal processing tasks. On the other hand, structured
sparsity provides a natural framework for modeling dependencies between
dictionary elements. We thus consider a structured sparse regularization to
learn dictionaries embedded in a particular structure, for instance a tree or a
two-dimensional grid. In the latter case, the results we obtain are similar to
the dictionaries produced by topographic independent component analysis.
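For the two-dimensional grid mentioned above, the overlapping group design can be made explicit. The sketch below enumerates wrap-around 3x3 neighbourhood groups over a hypothetical 10x10 grid of dictionary elements; these are the kind of overlapping groups that would then enter the sum of l_2 or l_infinity norms.

```python
# Sketch of the overlapping group design used for topographic dictionaries:
# dictionary elements sit on a 2D grid and every 3x3 neighbourhood (with
# wrap-around) forms one group, so neighbouring groups overlap. Sizes illustrative.
def grid_groups(height=10, width=10, size=3):
    groups = []
    for i in range(height):
        for j in range(width):
            idx = [((i + di) % height) * width + ((j + dj) % width)
                   for di in range(size) for dj in range(size)]
            groups.append(sorted(idx))
    return groups

groups = grid_groups()
print(len(groups), len(groups[0]))   # 100 overlapping groups of 9 elements each
```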