Community Organizations: Changing the Culture in Which Research Software Is Developed and Sustained
Software is the key crosscutting technology that enables advances in
mathematics, computer science, and domain-specific science and engineering to
achieve robust simulations and analysis for science, engineering, and other
research fields. However, software itself has not traditionally received
focused attention from research communities; rather, software has evolved
organically and inconsistently, with its development largely a by-product of
other initiatives. Moreover, challenges in scientific software are expanding
due to disruptive changes in computer hardware, increasing scale and complexity
of data, and demands for more complex simulations involving multiphysics,
multiscale modeling, and outer-loop analysis. In recent years, community members
have established a range of grass-roots organizations and projects to address
these growing technical and social challenges in software productivity,
quality, reproducibility, and sustainability. This article provides an overview
of such groups and discusses opportunities to leverage their synergistic
activities while nurturing work toward emerging software ecosystems.
The role of model implementation in neuroscientific applications of machine learning
In modern neuroscience, large-scale machine learning models are becoming increasingly critical components of data analysis. Despite the accelerating adoption of these tools, fundamental challenges to their use in scientific applications remain largely unaddressed. In this thesis, I focus on one such challenge: variability in the predictions of large-scale machine learning models in response to seemingly trivial differences in their implementation.
Existing research has shown that the performance of large-scale machine learning models (more so than that of traditional models such as linear regression) is meaningfully entangled with implementation details such as the hardware components, operating system, software dependencies, and random seed on which the model depends. Within the bounds of current practice, there are few ways of controlling this kind of implementation variability across the broad community of neuroscience researchers (making data analysis less reproducible), and little understanding of how data analyses might be designed to mitigate these issues (making data analysis unreliable). This dissertation presents two broad research directions that address these shortcomings.
First, I describe a novel, cloud-based platform for sharing data analysis tools reproducibly and at scale. This platform, called NeuroCAAS, enables developers of novel data analyses to precisely specify an implementation of their entire data analysis, which any other user can then run automatically on custom-built cloud resources. I show that this approach efficiently supports a wide variety of existing data analysis tools, as well as novel tools that would not be feasible to build and share outside of a platform like NeuroCAAS.
Second, I conduct two large-scale studies on the behavior of deep ensembles. Deep ensembles are a class of machine learning models that use implementation variability to improve the quality of model predictions, specifically by aggregating the predictions of deep networks over stochastic initialization and training. Deep ensembles provide a way both to control the impact of implementation variability (by aggregating predictions across random seeds) and to understand what kind of predictive diversity this particular form of implementation variability generates. I present a number of surprising results that contradict widely held intuitions about the performance of deep ensembles and the mechanisms behind their success, and show that in many respects the behavior of deep ensembles is similar to that of an appropriately chosen single neural network. As a whole, this dissertation presents novel methods and insights focused on the role of implementation variability in large-scale machine learning models, and more generally on the challenges of working with such large models in neuroscience data analysis. I conclude by discussing other ongoing efforts to improve the reproducibility and accessibility of large-scale machine learning in neuroscience, as well as long-term goals to speed the adoption and improve the reliability of such methods in a scientific context.
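The aggregation across random seeds described above can be illustrated with a minimal sketch. This is not the dissertation's implementation: the function names (`member_logits`, `ensemble_predict`) are hypothetical, and each "network" is an assumed stand-in (an untrained random linear map selected by its seed) so that the example stays self-contained. Only the aggregation step, averaging class probabilities over members, reflects the technique named in the abstract.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def member_logits(seed, inputs, n_classes=3):
    """Hypothetical stand-in for one ensemble member.

    A real member would be a deep network trained from a seed-dependent
    initialization; here the seed simply selects a random linear map.
    """
    rng = np.random.default_rng(seed)
    weights = rng.normal(size=(inputs.shape[1], n_classes))
    return inputs @ weights


def ensemble_predict(inputs, seeds):
    """Average class probabilities over members that differ only by seed.

    Returns the ensemble mean and the per-class variance across members,
    a simple measure of the predictive diversity induced by the seed.
    """
    probs = np.stack([softmax(member_logits(s, inputs)) for s in seeds])
    return probs.mean(axis=0), probs.var(axis=0)


# Toy inputs: 4 examples with 5 features each (assumed shapes).
inputs = np.random.default_rng(0).normal(size=(4, 5))
mean_probs, var_probs = ensemble_predict(inputs, seeds=[1, 2, 3, 4, 5])
```

Each row of `mean_probs` is still a valid probability distribution, because an average of distributions is a distribution. `var_probs` shows how much the members disagree on each example.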