How do users design scientific workflows? The Case of Snakemake
Scientific workflows automate the analysis of large-scale scientific data,
fostering the reuse of data processing operators as well as the reproducibility
and traceability of analysis results. In exploratory research, however,
workflows are continuously adapted, utilizing a wide range of tools and
software libraries, to test scientific hypotheses. Script-based workflow
engines cater to the required flexibility through direct integration of
programming primitives but lack abstractions for interactive exploration of the
workflow design by a user during workflow execution. To derive requirements for
such interactive workflows, we conduct an empirical study on the use of
Snakemake, a popular Python-based workflow engine. Based on workflows collected
from 1602 GitHub repositories, we present insights on common structures of
Snakemake workflows, as well as the language features typically adopted in
their specification.
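To make the object of study concrete: Snakemake workflows are declared as chained rules whose file dependencies form a directed acyclic graph. The following minimal Snakefile is a purely illustrative sketch (the file names and the sort step are hypothetical, not drawn from the surveyed repositories); it shows the linear two-rule structure that such declarative specifications build on.

```
# Hypothetical minimal Snakefile: a target rule plus one processing
# rule, connected implicitly through matching input/output file names.
rule all:
    input:
        "results/sorted.txt"

rule sort:
    input:
        "data/raw.txt"
    output:
        "results/sorted.txt"
    shell:
        "sort {input} > {output}"
```

Because rules are matched on file names rather than called explicitly, Snakemake infers the execution order itself, which is the flexibility (and the obstacle to interactive exploration) the study examines.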
Recommended from our members
BioEarth: Envisioning and developing a new regional earth system model to inform natural and agricultural resource management
As managers of agricultural and natural resources are confronted with uncertainties in global change impacts, the complexities associated with the interconnected cycling of nitrogen, carbon, and water present daunting management challenges. Existing models provide detailed information on specific sub-systems (e.g., land, air, water, and economics). An increasing awareness of the unintended consequences of management decisions resulting from interconnectedness of these sub-systems, however, necessitates coupled regional earth system models (EaSMs). Decision makers' needs and priorities can be integrated into the model design and development processes to enhance decision-making relevance and "usability" of EaSMs. BioEarth is a research initiative currently under development with a focus on the U.S. Pacific Northwest region that explores the coupling of multiple stand-alone EaSMs to generate usable information for resource decision-making. Direct engagement between model developers and non-academic stakeholders involved in resource and environmental management decisions throughout the model development process is a critical component of this effort. BioEarth utilizes a bottom-up approach for its land surface model that preserves fine spatial-scale sensitivities and lateral hydrologic connectivity, which makes it unique among many regional EaSMs. This paper describes the BioEarth initiative and highlights opportunities and challenges associated with coupling multiple stand-alone models to generate usable information for agricultural and natural resource decision-making.
Big Data Management Using Scientific Workflows
Humanity is rapidly approaching a new era, where every sphere of activity will be informed by the ever-increasing amount of data. Making use of big data has the potential to improve numerous avenues of human activity, including scientific research, healthcare, energy, education, transportation, environmental science, and urban planning, just to name a few. However, making such progress requires managing terabytes and even petabytes of data, generated by billions of devices, products, and events, often in real time, in different protocols, formats, and types. The volume, velocity, and variety of big data, known as the 3 Vs, present formidable challenges, unmet by traditional data management approaches. Traditionally, many data analyses have been performed using scientific workflows, tools for formalizing and structuring complex computational processes. While scientific workflows have been used extensively in structuring complex scientific data analysis processes, little work has been done to enable scientific workflows to cope with the three big data challenges on the one hand, and to leverage the dynamic resource provisioning capability of cloud computing to analyze big data on the other hand.
In this dissertation, to facilitate efficient composition, verification, and execution of distributed large-scale scientific workflows, we first propose a formal approach to scientific workflow verification, including a workflow model and the notion of a well-typed workflow. Our approach translates a scientific workflow into an equivalent typed lambda expression and typechecks the workflow. We then propose a type-theoretic approach to the shimming problem in scientific workflows, which occurs when connecting related but incompatible components. We reduce the shimming problem to a runtime coercion problem in the theory of type systems, and propose a fully automated and transparent solution. Our technique algorithmically inserts invisible shims into the workflow specification, thereby resolving the shimming problem for any well-typed workflow. Next, we identify a set of important challenges for running big data workflows in the cloud. We then propose a generic, implementation-independent system architecture that addresses many of these challenges. Finally, we develop a cloud-enabled big data workflow management system, called DATAVIEW, that delivers a specific implementation of our proposed architecture. To further validate our proposed architecture, we conduct a case study in which we design and run a big data workflow from the automotive domain using the Amazon EC2 cloud environment.
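The reduction of shimming to runtime coercion can be sketched in a few lines. The following Python fragment is an illustrative toy, not the dissertation's implementation: it typechecks a single workflow edge and, when the producer and consumer types are related but unequal, silently inserts a coercion function (a "shim") on that edge. The `COERCIONS` table and `connect` function are hypothetical names chosen for the example.

```python
# Toy model of shim insertion as runtime coercion: an edge between two
# workflow components either typechecks directly, gets an automatically
# inserted coercion shim, or is rejected as ill-typed.

# Coercions the system knows how to perform, keyed by (source, target).
COERCIONS = {
    (int, float): float,
    (float, str): str,
}

def connect(producer_type, consumer_type):
    """Typecheck one edge; return an identity or coercion shim, or fail."""
    if producer_type is consumer_type:
        return lambda x: x  # well-typed edge, no shim needed
    shim = COERCIONS.get((producer_type, consumer_type))
    if shim is None:
        raise TypeError(f"no shim from {producer_type} to {consumer_type}")
    return shim  # "invisible" shim inserted on the edge

edge = connect(int, float)
print(edge(3))  # 3.0
```

In a full system this check would run over every edge of the workflow graph, so a workflow passes verification exactly when each edge is either well-typed or shimmable, mirroring the well-typedness notion in the abstract.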
Design and implementation of Kepler workflows for BioEarth
BioEarth is an ongoing research initiative for the development of a regional-scale Earth System Model (EaSM) for the U.S. Pacific Northwest. In order to build such a model, we need to couple multiple stand-alone EaSMs, which were originally developed independently, for capturing processes within different realms of the biosphere. Given the complexity of such coupled modeling, and the need to manage numerous complex simulations, the design and deployment of automated workflows becomes essential. The goal of this thesis is to report on the design and development of automated scientific workflows for the Regional Hydro-Ecologic Simulation System (RHESSys) model, using the Kepler workflow development tool. RHESSys is a hydrological model that is at the core of BioEarth's model integration requirements. Design of these Kepler workflows is aimed at enabling the use of RHESSys in two different modes: i) in a standalone mode (both sequentially and in parallel), and ii) for calibration runs that involve exploring parametric space through iterative executions. Various Kepler features are utilized, including (but not limited to) its user-friendly interface design functions, and its support for parallel execution in cluster-based environments. Experimental results on a 16-core compute cluster demonstrate performance speedups ranging from 7x to 12x over the default standalone sequential runs, while also showing the general effectiveness of the newly designed workflows to streamline and manage processes efficiently. This study has shown the potential of Kepler to serve as the primary operational software platform for the BioEarth project, with implications for other data- and compute-intensive Earth systems modeling projects.
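The calibration mode described above follows a common pattern: run the model once per point in a parameter grid, score each run, and keep the best parameter set, with the independent runs dispatched in parallel. The sketch below illustrates only that pattern in plain Python; it is not the Kepler workflow itself, and `run_model` is a hypothetical stand-in for a scored RHESSys simulation.

```python
# Illustrative calibration sweep: evaluate a model over a parameter
# grid in parallel and return the best-scoring parameter set.
from concurrent.futures import ProcessPoolExecutor
from itertools import product

def run_model(params):
    """Stand-in objective (lower is better). A real workflow would
    launch a simulation and score its output against observations."""
    a, b = params
    return (a - 0.3) ** 2 + (b - 0.7) ** 2

def calibrate(grid_a, grid_b, workers=4):
    """Run all (a, b) combinations in parallel; return the best pair."""
    candidates = list(product(grid_a, grid_b))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(run_model, candidates))
    _, best_params = min(zip(scores, candidates))
    return best_params

if __name__ == "__main__":
    print(calibrate([0.1, 0.3, 0.5], [0.5, 0.7, 0.9]))  # (0.3, 0.7)
```

Because the runs are independent, the achievable speedup is bounded mainly by worker count and per-run cost, which is consistent with the 7x to 12x speedups reported on a 16-core cluster.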