How Developers Iterate on Machine Learning Workflows -- A Survey of the Applied Machine Learning Literature
Machine learning workflow development is anecdotally regarded as an
iterative process of trial-and-error with humans-in-the-loop. However, we are
not aware of quantitative evidence corroborating this popular belief. A
quantitative characterization of iteration can serve as a benchmark for machine
learning workflow development in practice, and can aid the development of
human-in-the-loop machine learning systems. To this end, we conduct a
small-scale survey of the applied machine learning literature from five
distinct application domains. We collect and distill statistics on the role of
iteration within machine learning workflow development, and report preliminary
trends and insights from our investigation, as a starting point towards this
benchmark. Based on our findings, we finally describe desiderata for effective
and versatile human-in-the-loop machine learning systems that can cater to
users in diverse domains.
Demystifying a Dark Art: Understanding Real-World Machine Learning Model Development
It is well known that the process of developing machine learning (ML)
workflows is a dark art; even experts struggle to find an optimal workflow
leading to a high-accuracy model. Users currently rely on empirical
trial-and-error to obtain their own set of battle-tested guidelines to inform
their modeling decisions. In this study, we aim to demystify this dark art by
understanding how people iterate on ML workflows in practice. We analyze over
475k user-generated workflows on OpenML, an open-source platform for tracking
and sharing ML workflows. We find that users often adopt a manual, automated,
or mixed approach when iterating on their workflows. We observe that manual
approaches result in fewer wasted iterations compared to automated approaches.
Yet, automated approaches often involve more preprocessing and hyperparameter
options explored, resulting in higher performance overall, suggesting potential
benefits for a human-in-the-loop ML system that appropriately recommends a
clever combination of the two strategies.
Helix: Holistic Optimization for Accelerating Iterative Machine Learning
Machine learning workflow development is a process of trial-and-error:
developers iterate on workflows by testing out small modifications until the
desired accuracy is achieved. Unfortunately, existing machine learning systems
focus narrowly on model training (a small fraction of the overall development
time) and neglect to address iterative development. We propose Helix, a
machine learning system that optimizes execution across
iterations by intelligently caching and reusing, or recomputing, intermediates as
appropriate. Helix captures a wide variety of application needs within its
Scala DSL, with succinct syntax defining unified processes for data
preprocessing, model specification, and learning. We demonstrate that the reuse
problem can be cast as a Max-Flow problem, while the caching problem is
NP-Hard. We develop effective lightweight heuristics for the latter. Empirical
evaluation shows that Helix not only handles a wide variety of use
cases in one unified workflow but is also much faster, providing run time
reductions of up to 19x over state-of-the-art systems, such as DeepDive or
KeystoneML, on four real-world applications in natural language processing,
computer vision, and the social and natural sciences.
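
The reuse decision described in the Helix abstract can be illustrated with a small sketch. It is not Helix's implementation: the abstract says the reuse problem is cast as Max-Flow and that workflows are written in a Scala DSL, whereas the Python below simply enumerates every plan by brute force, which is only practical for a handful of operators. The function name `plan_iteration`, the three states (`compute`, `load`, `prune`), the validity rules, and all costs are assumptions made for illustration.

```python
from itertools import product

STATES = ("compute", "load", "prune")


def plan_iteration(nodes, targets):
    """Pick the cheapest way to produce `targets` in one iteration.

    nodes: dict mapping operator name -> (compute_cost, load_cost, parents).
           load_cost is None when no valid cached result exists.
    Assumed rules (hypothetical, for illustration only):
      - a computed operator needs every parent to be available
        (either computed or loaded);
      - a loaded operator needs a cached result and no parents;
      - every target must be available.
    """
    names = list(nodes)
    best_cost, best_plan = float("inf"), None
    # Brute force: try every assignment of a state to every operator.
    for states in product(STATES, repeat=len(names)):
        plan = dict(zip(names, states))
        if any(plan[t] == "prune" for t in targets):
            continue
        cost, valid = 0, True
        for name, state in plan.items():
            compute_cost, load_cost, parents = nodes[name]
            if state == "load":
                if load_cost is None:
                    valid = False
                    break
                cost += load_cost
            elif state == "compute":
                if any(plan[p] == "prune" for p in parents):
                    valid = False
                    break
                cost += compute_cost
        if valid and cost < best_cost:
            best_cost, best_plan = cost, plan
    return best_cost, best_plan


# Made-up workflow: the developer changed a model hyperparameter, so the
# cached model and evaluation are stale, but cached features are still valid.
workflow = {
    "raw":      (10, None, []),
    "features": (20, 3,    ["raw"]),
    "model":    (15, None, ["features"]),
    "eval":     (1,  None, ["model"]),
}
cost, plan = plan_iteration(workflow, targets=["eval"])
print(cost, plan)
```

Under these assumed costs, the cheapest plan loads the cached features, skips re-reading the raw data, and recomputes only the model and its evaluation. The abstract also names a separate caching problem, choosing which intermediates to store for later iterations, which it states is NP-Hard and handled with heuristics; the sketch does not cover it.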