499,210 research outputs found
Feature learning in feature-sample networks using multi-objective optimization
Data and knowledge representation are fundamental concepts in machine
learning. The quality of the representation impacts the performance of the
learning model directly. Feature learning transforms or enhances raw data to
structures that are effectively exploited by those models. In recent years,
several works have been using complex networks for data representation and
analysis. However, no feature learning method has been proposed for such
category of techniques. Here, we present an unsupervised feature learning
mechanism that works on datasets with binary features. First, the dataset is
mapped into a feature--sample network. Then, a multi-objective optimization
process selects a set of new vertices to produce an enhanced version of the
network. The new features depend on a nonlinear function of a combination of
preexisting features. Effectively, the process projects the input data into a
higher-dimensional space. To solve the optimization problem, we design two
metaheuristics based on the lexicographic genetic algorithm and the improved
strength Pareto evolutionary algorithm (SPEA2). We show that the enhanced
network contains more information and can be exploited to improve the
performance of machine learning methods. The advantages and disadvantages of
each optimization strategy are discussed.Comment: 7 pages, 4 figure
A Distributed Frank-Wolfe Algorithm for Communication-Efficient Sparse Learning
Learning sparse combinations is a frequent theme in machine learning. In this
paper, we study its associated optimization problem in the distributed setting
where the elements to be combined are not centrally located but spread over a
network. We address the key challenges of balancing communication costs and
optimization errors. To this end, we propose a distributed Frank-Wolfe (dFW)
algorithm. We obtain theoretical guarantees on the optimization error
and communication cost that do not depend on the total number of
combining elements. We further show that the communication cost of dFW is
optimal by deriving a lower-bound on the communication cost required to
construct an -approximate solution. We validate our theoretical
analysis with empirical studies on synthetic and real-world data, which
demonstrate that dFW outperforms both baselines and competing methods. We also
study the performance of dFW when the conditions of our analysis are relaxed,
and show that dFW is fairly robust.Comment: Extended version of the SIAM Data Mining 2015 pape
Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science
As the field of data science continues to grow, there will be an
ever-increasing demand for tools that make machine learning accessible to
non-experts. In this paper, we introduce the concept of tree-based pipeline
optimization for automating one of the most tedious parts of machine
learning---pipeline design. We implement an open source Tree-based Pipeline
Optimization Tool (TPOT) in Python and demonstrate its effectiveness on a
series of simulated and real-world benchmark data sets. In particular, we show
that TPOT can design machine learning pipelines that provide a significant
improvement over a basic machine learning analysis while requiring little to no
input nor prior knowledge from the user. We also address the tendency for TPOT
to design overly complex pipelines by integrating Pareto optimization, which
produces compact pipelines without sacrificing classification accuracy. As
such, this work represents an important step toward fully automating machine
learning pipeline design.Comment: 8 pages, 5 figures, preprint to appear in GECCO 2016, edits not yet
made from reviewer comment
- …