Big data workflows: Locality-aware orchestration using software containers
The emergence of the Edge computing paradigm has shifted data processing from centralised infrastructures to heterogeneous and geographically distributed infrastructures. Therefore, data processing solutions must consider data locality to reduce the performance penalties from data transfers among remote data centres. Existing Big Data processing solutions provide limited support for handling data locality and are inefficient in processing small and frequent events specific to the Edge environments. This article proposes a novel architecture and a proof-of-concept implementation for software container-centric Big Data workflow orchestration that puts data locality at the forefront. The proposed solution considers the available data locality information, leverages long-lived containers to execute workflow steps, and handles the interaction with different data sources through containers. We compare the proposed solution with Argo Workflows and demonstrate a significant performance improvement in the execution speed for processing the same data units. Finally, we carry out experiments with the proposed solution under different configurations and analyze individual aspects affecting the performance of the overall solution.
StreamFlow: cross-breeding cloud with HPC
Workflows are among the most commonly used tools in a variety of execution
environments. Many of them target a specific environment; few of them make it
possible to execute an entire workflow in different environments, e.g.
Kubernetes and batch clusters. We present a novel approach to workflow
execution, called StreamFlow, that complements the workflow graph with the
declarative description of potentially complex execution environments, and that
makes it possible to execute a workflow on multiple sites that do not share a
common data space. StreamFlow is then exemplified on a novel bioinformatics
pipeline for single-cell transcriptomic data analysis.
Comment: 30 pages - 2020 IEEE Transactions on Emerging Topics in Computin