Smart Data Placement for Big Data Pipelines: An Approach based on the Storage-as-a-Service Model
The development of big data pipelines is a challenging task, especially when data storage is considered as part of the pipeline. Local storage is expensive, hard to maintain, and comes with several challenges (e.g., data availability, data security, and backup). The use of cloud storage, i.e., Storage-as-a-Service (StaaS), instead of local storage has the potential to provide more flexibility in terms of scalability, fault tolerance, and availability. In this paper, we propose a generic approach to integrate StaaS with data pipelines, i.e., performing computation on an on-premise server or on a specific cloud while integrating with StaaS for storage, and develop a ranking method for available storage options based on five key parameters: cost, proximity, network performance, the impact of server-side encryption, and user weights. The evaluation carried out demonstrates the effectiveness of the proposed approach in terms of data transfer performance and the feasibility of dynamically selecting a storage option based on four primary user scenarios.
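As a rough illustration of how such a ranking might work, the sketch below scores storage options as a weighted sum of min-max-normalised parameters, inverting lower-is-better parameters (cost, proximity, encryption penalty) so that a higher score is always better. The parameter fields, normalisation scheme, and example values are assumptions for demonstration, not the paper's exact evaluation method.

```python
# Illustrative sketch of ranking StaaS options by weighted, normalised scores.
# Field names, the min-max normalisation, and all numbers are assumptions.

from dataclasses import dataclass

@dataclass
class StorageOption:
    name: str
    cost: float                 # e.g. USD per GB-month (lower is better)
    proximity: float            # e.g. RTT in ms to the compute site (lower is better)
    throughput: float           # e.g. measured MB/s (higher is better)
    encryption_penalty: float   # relative slowdown from server-side encryption (lower is better)

def rank(options, weights):
    """Score each option as a weighted sum of normalised parameters."""
    def norm(values, lower_is_better):
        lo, hi = min(values), max(values)
        span = (hi - lo) or 1.0
        return [(hi - v) / span if lower_is_better else (v - lo) / span
                for v in values]

    cols = {
        "cost":       norm([o.cost for o in options], True),
        "proximity":  norm([o.proximity for o in options], True),
        "throughput": norm([o.throughput for o in options], False),
        "encryption": norm([o.encryption_penalty for o in options], True),
    }
    scores = [sum(weights[k] * cols[k][i] for k in weights)
              for i in range(len(options))]
    return sorted(zip(options, scores), key=lambda p: p[1], reverse=True)

options = [
    StorageOption("bucket-eu", cost=0.023, proximity=12, throughput=95, encryption_penalty=0.05),
    StorageOption("bucket-us", cost=0.021, proximity=110, throughput=60, encryption_penalty=0.08),
]
for option, score in rank(options, {"cost": 0.4, "proximity": 0.3, "throughput": 0.2, "encryption": 0.1}):
    print(f"{option.name}: {score:.3f}")
```

The user-weight dictionary passed to rank() corresponds to the fifth parameter in the abstract: changing the weights changes which storage option wins without re-measuring anything.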
Locality-Aware Workflow Orchestration for Big Data
The development of the Edge computing paradigm shifts data processing from centralised infrastructures to heterogeneous and geographically distributed infrastructures. Such a paradigm requires data processing solutions that consider data locality in order to reduce the performance penalties from data transfers between remote (in network terms) data centres. However, existing Big Data processing solutions have limited support for handling data locality and are inefficient in processing small and frequent events specific to Edge environments. This paper proposes a novel architecture and a proof-of-concept implementation for software container-centric Big Data workflow orchestration that puts data locality at the forefront. Our solution considers any available data locality information by default, leverages long-lived containers to execute workflow steps, and handles the interaction with different data sources through containers. We compare our system with Argo Workflows and show significant performance improvements in execution speed for processing units of data using our locality-aware Big Data workflow approach.
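The core placement idea can be pictured as choosing the execution site that minimises data movement. The snippet below is a hypothetical sketch; the worker layout, latency table, and function names are invented for illustration and are not the system's actual API.

```python
# Hypothetical sketch of locality-aware step placement: prefer the worker
# whose site already holds the input data, falling back to the
# lowest-latency alternative. All names and latency values are assumed.

LATENCY_MS = {  # RTT between sites (illustrative values)
    ("edge-a", "edge-a"): 0,  ("edge-a", "cloud-eu"): 25,
    ("cloud-eu", "edge-a"): 25, ("cloud-eu", "cloud-eu"): 0,
}

workers = [
    {"id": "w1", "site": "edge-a"},
    {"id": "w2", "site": "cloud-eu"},
]

def place_step(data_site, workers):
    """Choose the worker minimising data-transfer latency to the input data."""
    return min(workers, key=lambda w: LATENCY_MS[(data_site, w["site"])])

print(place_step("edge-a", workers)["id"])  # -> w1, the data-local worker
```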
Conceptualization and scalable execution of big data workflows using domain-specific languages and software containers
Big Data processing, especially with the increasing proliferation of Internet of Things (IoT) technologies and the convergence of IoT, edge, and cloud computing technologies, involves handling massive and complex data sets on heterogeneous resources and incorporating different tools, frameworks, and processes to help organizations make sense of the data collected from various sources. This set of operations, referred to as Big Data workflows, requires taking advantage of Cloud infrastructures’ elasticity for scalability. In this article, we present the design and prototype implementation of a Big Data workflow approach based on the use of software container technologies, message-oriented middleware (MOM), and a domain-specific language (DSL) to enable highly scalable workflow execution and abstract workflow definition. We demonstrate our system in a use case and a set of experiments that show the practical applicability of the proposed approach for the specification and scalable execution of Big Data workflows. Furthermore, we compare our proposed approach’s scalability with that of Argo Workflows – one of the most prominent tools in the area of Big Data workflows – and provide a qualitative evaluation of the proposed DSL and overall approach with respect to the existing literature.
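A toy rendering of the container-and-queue pattern the article describes: steps communicate through message queues rather than calling each other directly. In-process queue.Queue objects stand in for real message-oriented middleware, and the workflow spec format below is an invented stand-in, not the article's DSL.

```python
# Toy illustration: workflow steps connected by message queues, as in a
# MOM-based orchestration. queue.Queue replaces a real broker, and the
# spec format is invented for demonstration purposes only.

import queue

workflow = {  # each step reads from one queue and writes to the next
    "steps": [
        {"name": "ingest",  "out": "q1"},
        {"name": "clean",   "in": "q1", "out": "q2"},
        {"name": "publish", "in": "q2"},
    ]
}

queues = {q: queue.Queue()
          for s in workflow["steps"]
          for q in (s.get("in"), s.get("out")) if q}

def run_step(step, payload=None):
    """Consume one message, do the step's work, forward the result."""
    data = queues[step["in"]].get() if "in" in step else payload
    result = f"{data}->{step['name']}"      # stand-in for containerised work
    if "out" in step:
        queues[step["out"]].put(result)
    return result

run_step(workflow["steps"][0], payload="raw")
run_step(workflow["steps"][1])
print(run_step(workflow["steps"][2]))       # raw->ingest->clean->publish
```

Because each step only touches its input and output queues, steps can be scaled out independently by running more consumers on the same queue, which is the elasticity property the abstract refers to.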
Big data workflows: Locality-aware orchestration using software containers
The emergence of the Edge computing paradigm has shifted data processing from centralised infrastructures to heterogeneous and geographically distributed infrastructures. Therefore, data processing solutions must consider data locality to reduce the performance penalties from data transfers among remote data centres. Existing Big Data processing solutions provide limited support for handling data locality and are inefficient in processing small and frequent events specific to Edge environments. This article proposes a novel architecture and a proof-of-concept implementation for software container-centric Big Data workflow orchestration that puts data locality at the forefront. The proposed solution considers the available data locality information, leverages long-lived containers to execute workflow steps, and handles the interaction with different data sources through containers. We compare the proposed solution with Argo Workflows and demonstrate a significant performance improvement in execution speed for processing the same data units. Finally, we carry out experiments with the proposed solution under different configurations and analyse the individual aspects affecting the performance of the overall solution.
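One way to see why long-lived containers help with small, frequent events: the start-up cost is paid once and amortised over many events, instead of once per event. The timings below are simulated assumptions for illustration, not measurements from the article.

```python
# Illustrative contrast: a long-lived worker amortises one start-up cost
# over many small events, instead of paying it per event as short-lived
# containers would. STARTUP_S and WORK_S are assumed, not measured.

import queue
import threading
import time

STARTUP_S = 0.2    # assumed container cold-start cost
WORK_S = 0.001     # assumed per-event processing cost

events = queue.Queue()
for i in range(100):
    events.put(i)

def long_lived_worker():
    time.sleep(STARTUP_S)          # pay the start-up cost once
    while True:
        try:
            events.get_nowait()
        except queue.Empty:
            return
        time.sleep(WORK_S)         # process one small event

start = time.perf_counter()
t = threading.Thread(target=long_lived_worker)
t.start()
t.join()
amortised = time.perf_counter() - start
per_event = 100 * (STARTUP_S + WORK_S)   # cost if every event cold-started
print(f"long-lived: {amortised:.2f}s vs per-event containers: {per_event:.2f}s")
```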
Smart Data Placement for Big Data Pipelines with Storage-as-a-Service Integration
Big data pipelines are developed to process big data and turn it into useful information. They are designed to support one or more of the three big data characteristics commonly known as the three Vs (Volume, Velocity, and Variety). Unlike smaller data pipelines, big data pipelines may extract, transform, and load massive volumes of data. The implementation of a big data pipeline touches several aspects of the computing continuum, such as computing resources, data transmission channels, triggers, data transfer methods, integration of message queues, etc., making the implementation process difficult. The design gets even more complex if a data pipeline is coupled to data storage, such as a distributed file system, which brings additional challenges such as data maintenance, security, and scalability. In contrast, many cloud storage services, such as Amazon Simple Storage Service (S3), offer nearly infinite storage with strong fault tolerance, providing possible solutions to big data storage concerns. Thus, moving data to cloud storage, i.e., Storage-as-a-Service (StaaS), shifts the extra overhead of data redundancy, backup, scalability, security, etc. to the cloud service provider, which makes the implementation of a big data pipeline considerably easier.
The work presented in this thesis aims to 1) realize big data pipelines with hybrid infrastructure, i.e., computation on a local server combined with storage-as-a-service; 2) develop a ranking algorithm to find the most suitable storage facility in real time based on the user's requirements; and 3) develop and demonstrate the use of a domain-specific language to deploy a big data pipeline using StaaS. A novel architecture is proposed to realize the big data pipeline with hybrid infrastructure. In addition, an evaluation matrix is proposed to rank all available storage options based on five parameters: cost, physical distance, network performance, impact of server-side encryption, and user weights. Further, to simplify the deployment process, a new domain-specific language has been developed with an extensive vocabulary covering all major aspects of the cloud continuum in general, and cloud storage in particular. This thesis demonstrates the effectiveness of cloud storage by implementing a big data pipeline using the newly proposed architecture. Moreover, it justifies the importance of the individual parameters in the evaluation matrix, such as cost, physical distance, network performance, and the impact of server-side encryption, through a series of experiments. It also discusses different ways in which the proposed ranking algorithm and evaluation matrix can be validated.
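To give a flavour of what a DSL-driven deployment might look like, the sketch below declares a pipeline with a StaaS storage block and parses it into a plain structure. The keyword vocabulary and syntax are invented for illustration; the thesis's actual DSL differs.

```python
# Hypothetical flavour of a pipeline-with-StaaS declaration plus a trivial
# parser. The keywords (pipeline, step, storage) and attributes are invented
# for demonstration and are not the thesis's real DSL syntax.

SPEC = """\
pipeline: sensor-analytics
step: extract image=extractor:1.2
step: transform image=cleaner:0.9
storage: staas provider=auto criteria=cost,proximity,throughput,encryption
"""

def parse(spec):
    doc = {"steps": []}
    for line in spec.strip().splitlines():
        key, _, rest = line.partition(": ")
        if key == "pipeline":
            doc["name"] = rest
        elif key == "step":
            name, *attrs = rest.split()
            doc["steps"].append({"name": name, **dict(a.split("=") for a in attrs)})
        elif key == "storage":
            kind, *attrs = rest.split()
            doc["storage"] = {"kind": kind, **dict(a.split("=") for a in attrs)}
    return doc

doc = parse(SPEC)
print(doc["name"], [s["name"] for s in doc["steps"]], doc["storage"]["criteria"])
```

The point of such a declaration is that the storage block names ranking criteria rather than a concrete bucket, so the runtime can pick the best StaaS option at deployment time.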
Towards Graph-based Cloud Cost Modelling and Optimisation
Cloud computing has become an increasingly popular choice for businesses and individuals due to its flexibility, scalability, and convenience; however, the rising cost of cloud resources has become a significant concern for many. The pay-per-use model used in cloud computing means that costs can accumulate quickly, and the lack of visibility and control can result in unexpected expenses. The cost structure becomes even more complicated when dealing with hybrid or multi-cloud environments. For businesses, the cost of cloud computing can be a significant portion of their IT budget, and any savings can lead to better financial stability and competitiveness. It is therefore essential to manage cloud costs effectively, which requires a deep understanding of current resource utilisation, forecasting future needs, and optimising resource utilisation to control costs. To address this challenge, new tools and techniques are being developed to provide more visibility and control over cloud computing costs. In this respect, this paper explores a graph-based solution for modelling cost elements and cloud resources, and potential ways to solve the resulting constraint problem of cost optimisation. We primarily consider utilisation, cost, performance, and availability in this context. Such an approach will eventually help organisations make informed decisions about cloud resource placement and manage the costs of software applications and data workflows deployed in single, hybrid, or multi-cloud environments.
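As a speculative sketch of the graph-based idea, the snippet below models resources as nodes with unit prices and data flows as priced edges, then aggregates a monthly cost estimate. All prices, node attributes, and the aggregation rule are illustrative assumptions, not the paper's model.

```python
# Speculative sketch of a graph cost model: nodes are cloud resources with
# unit prices, edges are data flows priced by egress volume. Every number
# and attribute name here is an assumed example.

nodes = {
    "vm-eu":     {"hours": 720, "price_per_hour": 0.096},
    "bucket-eu": {"gb_stored": 500, "price_per_gb": 0.023},
}
edges = [  # (src, dst, GB transferred per month, egress price per GB)
    ("vm-eu", "bucket-eu", 0, 0.0),      # same region: free in this toy model
    ("bucket-eu", "vm-us", 200, 0.09),   # cross-region egress
]

def monthly_cost(nodes, edges):
    """Sum compute, storage, and egress costs over the resource graph."""
    compute = sum(n.get("hours", 0) * n.get("price_per_hour", 0) for n in nodes.values())
    storage = sum(n.get("gb_stored", 0) * n.get("price_per_gb", 0) for n in nodes.values())
    egress = sum(gb * price for _, _, gb, price in edges)
    return compute + storage + egress

print(f"estimated monthly cost: ${monthly_cost(nodes, edges):.2f}")
```

In the optimisation setting the paper points towards, such a graph would be searched over alternative placements (e.g., moving the bucket next to the consumer VM) subject to performance and availability constraints, with the aggregate cost as the objective.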