Automating biomedical data science through tree-based pipeline optimization
Over the past decade, data science and machine learning have grown from a
mysterious art form to a staple tool across a variety of fields in academia,
business, and government. In this paper, we introduce the concept of tree-based
pipeline optimization for automating one of the most tedious parts of machine
learning---pipeline design. We implement a Tree-based Pipeline Optimization
Tool (TPOT) and demonstrate its effectiveness on a series of simulated and
real-world genetic data sets. In particular, we show that TPOT can build
machine learning pipelines that achieve competitive classification accuracy and
discover novel pipeline operators---such as synthetic feature
constructors---that significantly improve classification accuracy on these data
sets. We also highlight the current challenges to pipeline optimization, such
as the tendency to produce pipelines that overfit the data, and suggest future
research paths to overcome these challenges. As such, this work represents an
early step toward fully automating machine learning pipeline design.
Comment: 16 pages, 5 figures, to appear in EvoBIO 2016 proceedings
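The tree-based pipeline search described above can be sketched as a toy evolutionary loop. This is only an illustration of the idea, not TPOT's actual implementation: TPOT uses genetic programming over real scikit-learn operators, whereas the operators, fitness function, and data here are all hypothetical.

```python
import random

# Toy stand-in for pipeline operators: each maps a feature list to a new one.
# (Hypothetical operators; TPOT composes real scikit-learn preprocessors/classifiers.)
OPERATORS = {
    "square": lambda xs: [x * x for x in xs],
    "shift": lambda xs: [x + 1 for x in xs],
    "negate": lambda xs: [-x for x in xs],
}

def score(pipeline, data, target):
    """Toy fitness: negative squared error of the transformed data's sum vs. target."""
    for op in pipeline:
        data = OPERATORS[op](data)
    return -(sum(data) - target) ** 2

def evolve(data, target, generations=30, pop_size=20, seed=0):
    """Evolve a population of operator sequences toward the best-scoring pipeline."""
    rng = random.Random(seed)
    names = list(OPERATORS)
    pop = [[rng.choice(names) for _ in range(rng.randint(1, 3))]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda p: score(p, data, target), reverse=True)
        survivors = pop[: pop_size // 2]
        # Mutate each survivor at one random position to refill the population.
        children = []
        for p in survivors:
            child = list(p)
            child[rng.randrange(len(child))] = rng.choice(names)
            children.append(child)
        pop = survivors + children
    return max(pop, key=lambda p: score(p, data, target))

best = evolve([1.0, 2.0, 3.0], target=14.0)
```

The key design point the abstract hints at is that pipelines are variable-length structures, so the search mutates their contents rather than tuning a fixed set of hyperparameters.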
Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science
As the field of data science continues to grow, there will be an
ever-increasing demand for tools that make machine learning accessible to
non-experts. In this paper, we introduce the concept of tree-based pipeline
optimization for automating one of the most tedious parts of machine
learning---pipeline design. We implement an open source Tree-based Pipeline
Optimization Tool (TPOT) in Python and demonstrate its effectiveness on a
series of simulated and real-world benchmark data sets. In particular, we show
that TPOT can design machine learning pipelines that provide a significant
improvement over a basic machine learning analysis while requiring little to no
input or prior knowledge from the user. We also address the tendency for TPOT
to design overly complex pipelines by integrating Pareto optimization, which
produces compact pipelines without sacrificing classification accuracy. As
such, this work represents an important step toward fully automating machine
learning pipeline design.
Comment: 8 pages, 5 figures, preprint to appear in GECCO 2016; edits from
reviewer comments not yet made
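The Pareto-optimization idea above, preferring pipelines that are both accurate and compact, can be illustrated with a minimal non-dominated filter over (accuracy, size) pairs. The candidate scores are hypothetical, and this is a naive sketch rather than the NSGA-II-style selection an evolutionary tool would actually use.

```python
def pareto_front(candidates):
    """Return candidates not dominated on (higher accuracy, fewer operators).

    A candidate dominates another if it is at least as good on both
    objectives and strictly better on at least one.
    """
    def dominates(a, b):
        acc_a, size_a = a
        acc_b, size_b = b
        return (acc_a >= acc_b and size_a <= size_b
                and (acc_a > acc_b or size_a < size_b))

    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates if other != c)]

# Hypothetical (accuracy, pipeline-size) scores for five candidate pipelines.
candidates = [(0.92, 7), (0.92, 3), (0.88, 2), (0.95, 9), (0.80, 1)]
front = pareto_front(candidates)
# The 7-operator pipeline is dominated by the equally accurate 3-operator one,
# so it drops out of the front.
```

Selecting from the front rather than by accuracy alone is what lets the search discard overly complex pipelines without sacrificing classification accuracy.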
Automatic Bayesian Density Analysis
Making sense of a dataset in an automatic and unsupervised fashion is a
challenging problem in statistics and AI. Classical approaches for exploratory
data analysis are usually not flexible enough to deal with the uncertainty
inherent to real-world data: they are often restricted to fixed latent
interaction models and homogeneous likelihoods; they are sensitive to missing,
corrupt and anomalous data; moreover, their expressiveness generally comes at
the price of intractable inference. As a result, supervision from statisticians
is usually needed to find the right model for the data. However, since domain
experts are not necessarily also experts in statistics, we propose Automatic
Bayesian Density Analysis (ABDA) to make exploratory data analysis accessible
at large. Specifically, ABDA allows for automatic and efficient missing value
estimation, statistical data type and likelihood discovery, anomaly detection
and dependency structure mining, on top of providing accurate density
estimation. Extensive empirical evidence shows that ABDA is a suitable tool for
automatic exploratory analysis of mixed continuous and discrete tabular data.
Comment: In proceedings of the Thirty-Third AAAI Conference on Artificial
Intelligence (AAAI-19)
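Two of the tasks the abstract lists, statistical data type discovery and missing-value estimation, can be sketched for a single column with a crude heuristic. ABDA itself infers types, likelihoods, and missing values jointly in a Bayesian model; this toy version merely labels a column discrete or continuous and imputes accordingly, with all data made up.

```python
import statistics

def analyze_column(values):
    """Heuristic sketch of type discovery plus missing-value imputation.

    None marks a missing entry.  A column counts as discrete when it has
    few distinct values relative to its length; discrete columns are
    imputed with the mode, continuous ones with the mean.
    """
    observed = [v for v in values if v is not None]
    is_discrete = len(set(observed)) <= max(2, len(observed) // 3)
    fill = statistics.mode(observed) if is_discrete else statistics.mean(observed)
    imputed = [fill if v is None else v for v in values]
    kind = "discrete" if is_discrete else "continuous"
    return kind, imputed

kind, col = analyze_column([1, 1, 2, None, 1, 2, 1, 2, 1])
```

The point of the contrast is that a principled model replaces such brittle thresholds with posterior inference, which also yields the anomaly detection and density estimates the abstract mentions.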
Exploration of User Groups in VEXUS
We introduce VEXUS, an interactive visualization framework for exploring user
data to fulfill tasks such as finding a set of experts, forming discussion
groups and analyzing collective behaviors. User data is characterized by a
combination of demographics like age and occupation, and actions such as rating
a movie, writing a paper, following a medical treatment or buying groceries.
The ubiquity of user data requires tools that help explorers, be they
specialists or novice users, acquire new insights. VEXUS lets explorers
interact with user data via visual primitives and builds an exploration profile
to recommend the next exploration steps. VEXUS combines state-of-the-art
visualization techniques with appropriate indexing of user data to provide fast
and relevant exploration.
Detecting domestic violence
Over 90% of the case data from police inquiries is stored as unstructured text in police databases. We use the combination of Formal Concept Analysis and Emergent Self-Organizing Maps to explore a dataset of unstructured police reports from the Amsterdam-Amstelland police region in the Netherlands. In this paper, we specifically aim to familiarize the reader with how we used these two tools for browsing the dataset and how we discovered useful patterns for labelling cases as domestic or non-domestic violence.
Keywords: Formal concept analysis (FCA); Emergent SOM; Domestic violence; Knowledge discovery in databases; Text mining; Exploratory data analysis
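The Formal Concept Analysis step used above, grouping reports by the terms they share, can be sketched with the standard derivation operators on a toy object-attribute context. The report IDs and terms below are made up; they only loosely mirror the domestic-violence labelling task.

```python
def derive_attrs(objects, context):
    """Attributes shared by every object in the given set."""
    attrs = None
    for obj in objects:
        attrs = context[obj] if attrs is None else attrs & context[obj]
    return attrs or set()

def derive_objects(attrs, context):
    """Objects possessing every attribute in the given set."""
    return {obj for obj, a in context.items() if attrs <= a}

# Toy incidence relation: police reports and the terms they contain.
context = {
    "report1": {"violence", "partner", "home"},
    "report2": {"violence", "stranger", "street"},
    "report3": {"violence", "partner", "street"},
}

# A formal concept is a pair (extent, intent) closed under both derivations.
extent = derive_objects({"partner"}, context)
intent = derive_attrs(extent, context)
assert derive_objects(intent, context) == extent  # closure check
```

Browsing the concept lattice built from such pairs is what lets an analyst spot attribute combinations, such as "violence" together with "partner", that separate domestic from non-domestic cases.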
Automating Data Science: Prospects and Challenges
Given the complexity of typical data science projects and the associated
demand for human expertise, automation has the potential to transform the data
science process.
Key insights:
* Automation in data science aims to facilitate and transform the work of
data scientists, not to replace them.
* Important parts of data science are already being automated, especially in
the modeling stages, where techniques such as automated machine learning
(AutoML) are gaining traction.
* Other aspects are harder to automate, not only because of technological
challenges, but because open-ended and context-dependent tasks require human
    interaction.
Comment: 19 pages, 3 figures. v1 accepted for publication (April 2021) in
Communications of the ACM