21 research outputs found
A Collaborative Approach to Computational Reproducibility
Although a standard in natural science, reproducibility has been only
episodically applied in experimental computer science. Scientific papers often
present a large number of tables, plots and pictures that summarize the
obtained results, but then loosely describe the steps taken to derive them. Not
only can the methods and the implementation be complex, but also their
configuration may require setting many parameters and/or depend on particular
system configurations. While many researchers recognize the importance of
reproducibility, the challenge of making it happen often outweigh the benefits.
Fortunately, a plethora of reproducibility solutions have been recently
designed and implemented by the community. In particular, packaging tools
(e.g., ReproZip) and virtualization tools (e.g., Docker) are promising
solutions towards facilitating reproducibility for both authors and reviewers.
To address the incentive problem, we have implemented a new publication model
for the Reproducibility Section of Information Systems Journal. In this
section, authors submit a reproducibility paper that explains in detail the
computational assets from a previous published manuscript in Information
Systems
Automatic Machine Learning by Pipeline Synthesis using Model-Based Reinforcement Learning and a Grammar
Automatic machine learning is an important problem in the forefront of
machine learning. The strongest AutoML systems are based on neural networks,
evolutionary algorithms, and Bayesian optimization. Recently AlphaD3M reached
state-of-the-art results with an order of magnitude speedup using reinforcement
learning with self-play. In this work we extend AlphaD3M by using a pipeline
grammar and a pre-trained model which generalizes from many different datasets
and similar tasks. Our results demonstrate improved performance compared with
our earlier work and existing methods on AutoML benchmark datasets for
classification and regression tasks. In the spirit of reproducible research we
make our data, models, and code publicly available.Comment: ICML Workshop on Automated Machine Learnin
AlphaD3M: An Open-Source AutoML Library for Multiple ML Tasks
peer reviewedWe present AlphaD3M, an open-source Python library that supports a wide range of machine learning tasks over different data types. We discuss the challenges involved in supporting multiple tasks and how AlphaD3M addresses them by combining deep reinforcement learning and meta-learning to construct pipelines over a large collection of primitives effectively. To better integrate the use of AutoML within the data science lifecycle, we have built an ecosystem of tools around AlphaD3M that support user-in-the-loop tasks, including selecting suitable pipelines and developing custom solutions for complex problems. We present use cases that demonstrate some of these features. We report the results of a detailed experimental evaluation showing that AlphaD3M is effective and derives highquality pipelines for a diverse set of problems with performance comparable or superior to state-of-the-art AutoML systems
AlphaD3M: Machine Learning Pipeline Synthesis
peer reviewedWe introduce AlphaD3M, an automatic machine learning (AutoML) system based on
meta reinforcement learning using sequence models with self play. AlphaD3M is
based on edit operations performed over machine learning pipeline primitives
providing explainability. We compare AlphaD3M with state-of-the-art AutoML
systems: Autosklearn, Autostacker, and TPOT, on OpenML datasets. AlphaD3M
achieves competitive performance while being an order of magnitude faster,
reducing computation time from hours to minutes, and is explainable by design
Using ReproZip for Reproducibility and Library Services
This is a pre-print of a manuscript pending publication.
Achieving research reproducibility is challenging in many ways: there are social and cultural obstacles as well as a constantly changing technical landscape that makes replicating and reproducing research difficult. Users face challenges in reproducing research across different operating systems, in using different versions of software across long projects and among collaborations, and in using publicly available work. The dependencies required to reproduce the computational environments in which research happens can be exceptionally hard to track – in many cases, these dependencies are hidden or nested too deeply to discover, and thus impossible to install on a new machine, which means adoption remains low.
In this paper, we present ReproZip, an open source tool to help overcome the technical difficulties involved in preserving and replicating research, applications, databases, software, and more. We examine the current use cases of ReproZip, ranging from digital humanities to machine learning. We also explore potential library use cases for ReproZip, particularly in digital libraries and archives, liaison librarianship, and other library services. We believe that libraries and archives can leverage ReproZip to deliver more robust reproducibility services, repository services, as well as enhanced discoverability and preservation of research materials, applications, software, and computational environments