172 research outputs found

    Using simple PID-inspired controllers for online resilient resource management of distributed scientific workflows

    Get PDF
    Scientific workflows have become mainstream for conducting large-scale scientific research. As a result, many workflow applications and Workflow Management Systems (WMSs) have been developed as part of the cyberinfrastructure to allow scientists to execute their applications seamlessly on a range of distributed platforms. Although the scientific community has addressed this challenge from both theoretical and practical approaches, failure prediction, detection, and recovery still raise many research questions. In this paper, we propose an approach inspired by the control theory developed as part of autonomic computing to predict failures before they happen, and mitigated them when possible. The proposed approach is inspired on the proportional–integral–derivative controller (PID controller) control loop mechanism, which is widely used in industrial control systems, where the controller will react to adjust its output to mitigate faults. PID controllers aim to detect the possibility of a non-steady state far enough in advance so that an action can be performed to prevent it from happening. To demonstrate the feasibility of the approach, we tackle two common execution faults of large scale data-intensive workflows—data storage overload and memory overflow. We developed a simulator, which implements and evaluates simple standalone PID-inspired controllers to autonomously manage data and memory usage of a data-intensive bioinformatics workflow that consumes/produces over 4.4 TB of data, and requires over 24 TB of memory to run all tasks concurrently. Experimental results obtained via simulation indicate that workflow executions may significantly benefit from the controller-inspired approach, in particular under online and unknown conditions. Simulation results show that nearly-optimal executions (slowdown of 1.01) can be attained when using our proposed method, and faults are detected and mitigated far in advance of their occurrence

    Online Self-Healing Control Loop to Prevent and Mitigate Faults in Scientific Workflows

    Get PDF
    Scientific workflows have become mainstream for conducting large-scale scientific research. As a result, many workflow applications and Workflow Management Systems (WMSs) have been developed as part of the cyberinfrastructure to allow scientists to execute their applications seamlessly on a range of distributed platforms. In spite of many success stories, a key challenge for running workflow in distributed systems is failure prediction, detection, and recovery. In this paper, we present a novel online self-healing framework, where failures are predicted before they happen, and are mitigated when possible. The proposed approach is to use control theory developed as part of autonomic computing, and in particular apply the proportional-integral-derivative controller (PID controller) control loop mechanism, which is widely used in industrial control systems, to mitigate faults by adjusting the inputs of the mechanism. The PID controller aims at detecting the possibility of a fault far enough in advance so that an action can be performed to prevent it from happening. To demonstrate the feasibility of the approach, we tackle two common execution faults of the Big Data era—data footprint and memory usage. We define, implement, and evaluate PID controllers to autonomously manage data and memory usage of a bioinformatics workflow that consumes/produces over 4.4TB of data, and requires over 24TB of memory to run all tasks concurrently. Experimental results indicate that workflow executions may significantly benefit from PID controllers, in particular under online and unknown conditions. Simulation results show that nearly-optimal executions (slowdown of 1.01) can be attained when using our proposed control loop, and faults are detected and mitigated far in advance

    Trusted resource allocation in volunteer edge-cloud computing for scientific applications

    Get PDF
    Data-intensive science applications in fields such as e.g., bioinformatics, health sciences, and material discovery are becoming increasingly dynamic and demanding with resource requirements. Researchers using these applications which are based on advanced scientific workflows frequently require a diverse set of resources that are often not available within private servers or a single Cloud Service Provider (CSP). For example, a user working with Precision Medicine applications would prefer only those CSPs who follow guidelines from HIPAA (Health Insurance Portability and Accountability Act) for implementing their data services and might want services from other CSPs for economic viability. With the generation of more and more data these workflows often require deployment and dynamic scaling of multi-cloud resources in an efficient and high-performance manner (e.g., quick setup, reduced computation time, and increased application throughput). At the same time, users seek to minimize the costs of configuring the related multi-cloud resources. While performance and cost are among the key factors to decide upon CSP resource selection, the scientific workflows often process proprietary/confidential data that introduces additional constraints of security postures. Thus, users have to make an informed decision on the selection of resources that are most suited for their applications while trading off between the key factors of resource selection which are performance, agility, cost, and security (PACS). Furthermore, even with the most efficient resource allocation across multi-cloud, the cost to solution might not be economical for all users which have led to the development of new paradigms of computing such as volunteer computing where users utilize volunteered cyber resources to meet their computing requirements. For economical and readily available resources, it is essential that such volunteered resources can integrate well with cloud resources for providing the most efficient computing infrastructure for users. In this dissertation, individual stages such as user requirement collection, user's resource preferences, resource brokering and task scheduling, in lifecycle of resource brokering for users are tackled. For collection of user requirements, a novel approach through an iterative design interface is proposed. In addition, fuzzy interference-based approach is proposed to capture users' biases and expertise for guiding their resource selection for their applications. The results showed improvement in performance i.e. time to execute in 98 percent of the studied applications. The data collected on user's requirements and preferences is later used by optimizer engine and machine learning algorithms for resource brokering. For resource brokering, a new integer linear programming based solution (OnTimeURB) is proposed which creates multi-cloud template solutions for resource allocation while also optimizing performance, agility, cost, and security. The solution was further improved by the addition of a machine learning model based on naive bayes classifier which captures the true QoS of cloud resources for guiding template solution creation. The proposed solution was able to improve the time to execute for as much as 96 percent of the largest applications. As discussed above, to fulfill necessity of economical computing resources, a new paradigm of computing viz-a-viz Volunteer Edge Computing (VEC) is proposed which reduces cost and improves performance and security by creating edge clusters comprising of volunteered computing resources close to users. The initial results have shown improved time of execution for application workflows against state-of-the-art solutions while utilizing only the most secure VEC resources. Consequently, we have utilized reinforcement learning based solutions to characterize volunteered resources for their availability and flexibility towards implementation of security policies. The characterization of volunteered resources facilitates efficient allocation of resources and scheduling of workflows tasks which improves performance and throughput of workflow executions. VEC architecture is further validated with state-of-the-art bioinformatics workflows and manufacturing workflows.Includes bibliographical references

    GRID Portal Application Visualization

    Get PDF
    Parameter studies are useful applications for researchers; however, these programs, although helpful, tend to be computationally expensive and due to their long execution time become tedious to execute. In this project we explored a method of implementing a parameter study module for the P-GRADE Portal at MTA-SZTAKI; Budapest, Hungary, an existing parallel application that allows users to create and execute a parallel program in an efficient manner without knowledge of MPI or PVM programming

    Programming and parallelising applications for distributed infrastructures

    Get PDF
    The last decade has witnessed unprecedented changes in parallel and distributed infrastructures. Due to the diminished gains in processor performance from increasing clock frequency, manufacturers have moved from uniprocessor architectures to multicores; as a result, clusters of computers have incorporated such new CPU designs. Furthermore, the ever-growing need of scienti c applications for computing and storage capabilities has motivated the appearance of grids: geographically-distributed, multi-domain infrastructures based on sharing of resources to accomplish large and complex tasks. More recently, clouds have emerged by combining virtualisation technologies, service-orientation and business models to deliver IT resources on demand over the Internet. The size and complexity of these new infrastructures poses a challenge for programmers to exploit them. On the one hand, some of the di culties are inherent to concurrent and distributed programming themselves, e.g. dealing with thread creation and synchronisation, messaging, data partitioning and transfer, etc. On the other hand, other issues are related to the singularities of each scenario, like the heterogeneity of Grid middleware and resources or the risk of vendor lock-in when writing an application for a particular Cloud provider. In the face of such a challenge, programming productivity - understood as a tradeo between programmability and performance - has become crucial for software developers. There is a strong need for high-productivity programming models and languages, which should provide simple means for writing parallel and distributed applications that can run on current infrastructures without sacri cing performance. In that sense, this thesis contributes with Java StarSs, a programming model and runtime system for developing and parallelising Java applications on distributed infrastructures. The model has two key features: first, the user programs in a fully-sequential standard-Java fashion - no parallel construct, API call or pragma must be included in the application code; second, it is completely infrastructure-unaware, i.e. programs do not contain any details about deployment or resource management, so that the same application can run in di erent infrastructures with no changes. The only requirement for the user is to select the application tasks, which are the model's unit of parallelism. Tasks can be either regular Java methods or web service operations, and they can handle any data type supported by the Java language, namely les, objects, arrays and primitives. For the sake of simplicity of the model, Java StarSs shifts the burden of parallelisation from the programmer to the runtime system. The runtime is responsible from modifying the original application to make it create asynchronous tasks and synchronise data accesses from the main program. Moreover, the implicit inter-task concurrency is automatically found as the application executes, thanks to a data dependency detection mechanism that integrates all the Java data types. This thesis provides a fairly comprehensive evaluation of Java StarSs on three di erent distributed scenarios: Grid, Cluster and Cloud. For each of them, a runtime system was designed and implemented to exploit their particular characteristics as well as to address their issues, while keeping the infrastructure unawareness of the programming model. The evaluation compares Java StarSs against state-of-the-art solutions, both in terms of programmability and performance, and demonstrates how the model can bring remarkable productivity to programmers of parallel distributed applications

    Decentralized Resource Scheduling in Grid/Cloud Computing

    Get PDF
    In the Grid/Cloud environment, applications or services and resources belong to different organizations with different objectives. Entities in the Grid/Cloud are autonomous and self-interested; however, they are willing to share their resources and services to achieve their individual and collective goals. In such open environment, the scheduling decision is a challenge given the decentralized nature of the environment. Each entity has specific requirements and objectives that need to achieve. In this thesis, we review the Grid/Cloud computing technologies, environment characteristics and structure and indicate the challenges within the resource scheduling. We capture the Grid/Cloud scheduling model based on the complete requirement of the environment. We further create a mapping between the Grid/Cloud scheduling problem and the combinatorial allocation problem and propose an adequate economic-based optimization model based on the characteristic and the structure nature of the Grid/Cloud. By adequacy, we mean that a comprehensive view of required properties of the Grid/Cloud is captured. We utilize the captured properties and propose a bidding language that is expressive where entities have the ability to specify any set of preferences in the Grid/Cloud and simple as entities have the ability to express structured preferences directly. We propose a winner determination model and mechanism that utilizes the proposed bidding language and finds a scheduling solution. Our proposed approach integrates concepts and principles of mechanism design and classical scheduling theory. Furthermore, we argue that in such open environment privacy concerns by nature is part of the requirement in the Grid/Cloud. Hence, any scheduling decision within the Grid/Cloud computing environment is to incorporate the feasibility of privacy protection of an entity. Each entity has specific requirements in terms of scheduling and privacy preferences. We analyze the privacy problem in the Grid/Cloud computing environment and propose an economic based model and solution architecture that provides a scheduling solution given privacy concerns in the Grid/Cloud. Finally, as a demonstration of the applicability of the approach, we apply our solution by integrating with Globus toolkit (a well adopted tool to enable Grid/Cloud computing environment). We also, created simulation experimental results to capture the economic and time efficiency of the proposed solution

    Doctor of Philosophy

    Get PDF
    dissertationDataflow pipeline models are widely used in visualization systems. Despite recent advancements in parallel architecture, most systems still support only a single CPU or a small collection of CPUs such as a SMP workstation. Even for systems that are specifically tuned towards parallel visualization, their execution models only provide support for data-parallelism while ignoring taskparallelism and pipeline-parallelism. With the recent popularization of machines equipped with multicore CPUs and multi-GPU units, these visualization systems are undoubtedly falling further behind in reaching maximum efficiency. On the other hand, there exist several libraries that can schedule program executions on multiple CPUs and/or multiple GPUs. However, due to differences in executing a task graph and a pipeline along with their APIs being considerably low-level, it still remains a challenge to integrate these run-time libraries into current visualization systems. Thus, there is a need for a redesigned dataflow architecture to fully support and exploit the power of highly parallel machines in large-scale visualization. The new design must be able to schedule executions on heterogeneous platforms while at the same time supporting arbitrarily large datasets through the use of streaming data structures. The primary goal of this dissertation work is to develop a parallel dataflow architecture for streaming large-scale visualizations. The framework includes supports for platforms ranging from multicore processors to clusters consisting of thousands CPUs and GPUs. We achieve this in our system by introducing the notion of Virtual Processing Elements and Task-Oriented Modules along with a highly customizable scheduler that controls the assignment of tasks to elements dynamically. This creates an intuitive way to maintain multiple CPU/GPU kernels yet still provide coherency and synchronization across module executions. We have implemented these techniques into HyperFlow which is made of an API with all basic dataflow constructs described in the dissertation, and a distributed run-time library that can be used to deploy those pipelines on multicore, multi-GPU and cluster-based platforms

    Causal Consistency Verification in Restful Systems

    Get PDF
    Replicated systems cannot maintain both availability and (strong) consistency when exposed to network partitions. Strong consistency requires every read to return the last written value, which can lead clients to experience high latency or even timeout errors. Replicated applications usually rely on weak consistency, since clients can perform operations contacting a single replica, leading to decreased latency and increased availability. Causal consistency is a weak consistency model, however, it is the strongest one for highly available systems. Many applications are switching to this particular consistency model, since it ensures users never observe data items before they observe the ones that influenced their creation. Verifying if applications satisfy the consistency they claim to provide is no easy task. In this dissertation, we propose an algorithm to verify causal consistency in RESTful applications. Our approach adopts a black box testing, where multiple concurrent clients execute operations in a service and records the log of interactions. This log of interactions is then processed to verify if the results respect causal consistency. The key challenge is to infer causal dependencies among operations executed in different clients without adding any additional metadata to the data maintained by the service. When considering a particular operation, the algorithm builds a new dependency graph that considers one of the possible justifications the operation might have, but if this justification results in failure further ahead in the processing, it is necessary to build another graph considering another justification of that same operation. The algorithm relies on recursion in order to achieve this backtracking behaviour. If the algorithm is able to build a graph containing every operation present in the log, where the chosen justifications remain valid until the end of the processing, it outputs that the execution corresponding to that log satisfies causal consistency. The evaluation confirms that the algorithm is able to detect violations when feeding either small or large logs representing executions of RESTful applications that do not satisfy causal consistency.Os sistemas replicados não podem manter a disponibilidade e a consistência (forte) quando expostos a partições de rede. A consistência forte exige que cada leitura retorne o último valor escrito, o que pode levar os clientes a experienciar alta latência ou até mesmo erros de tempo limite. As aplicações replicados geralmente usam consistência fraca, pois os clientes podem realizar operações contactando uma única réplica, levando a latências baixas e maior disponibilidade. A consistência causal é um modelo de consistência fraco, mas é o mais forte para sistemas altamente disponíveis. Muitas aplicações usam este modelo, pois garante que os clientes nunca observem dados antes de observar os que influenciaram a sua criação. Verificar se as aplicações satisfazem a consistência que alegam fornecer não é fácil. Nesta dissertação, propomos um algoritmo para verificar a consistência causal em aplicações RESTful. A nossa abordagem adota um teste de caixa negra, onde vários clientes concurrentes executam operações num serviço, onde as interações são documentadas num ficheiro. Este ficheiro é processado para verificar se os resultados respeitam a consistência causal. O principal desafio é inferir as dependências causais entre as operações executadas em diferentes clientes sem adicionar metadados adicionais aos dados mantidos pelo serviço. Ao considerar uma determinada operação, o algoritmo constrói um novo grafo de dependências que considera uma das possíveis justificações que a operação possa ter, mas se esta justificação resultar em erro mais tarde no processamento, é necessário construir outro grafo considerando outra justificação dessa mesma operação. O algoritmo é recursivo de modo a alcançar esse comportamento de retrocesso. Se o algoritmo conseguir construir um grafo que contém todas as operações presentes no ficheiro, onde as justificações escolhidas permanecem válidas até o final do processamento, indica que a execução correspondente a este ficheiro satisfaz a consistência causal. Aavaliação confirma que o algoritmo é capaz de detectar violações ao fornecer ficheiros pequenos ou grandes representando execuções de aplicações RESTful que não satisfazem a consistência causal

    Situation-aware Edge Computing

    Get PDF
    Future wireless networks must cope with an increasing amount of data that needs to be transmitted to or from mobile devices. Furthermore, novel applications, e.g., augmented reality games or autonomous driving, require low latency and high bandwidth at the same time. To address these challenges, the paradigm of edge computing has been proposed. It brings computing closer to the users and takes advantage of the capabilities of telecommunication infrastructures, e.g., cellular base stations or wireless access points, but also of end user devices such as smartphones, wearables, and embedded systems. However, edge computing introduces its own challenges, e.g., economic and business-related questions or device mobility. Being aware of the current situation, i.e., the domain-specific interpretation of environmental information, makes it possible to develop approaches targeting these challenges. In this thesis, the novel concept of situation-aware edge computing is presented. It is divided into three areas: situation-aware infrastructure edge computing, situation-aware device edge computing, and situation-aware embedded edge computing. Therefore, the concepts of situation and situation-awareness are introduced. Furthermore, challenges are identified for each area, and corresponding solutions are presented. In the area of situation-aware infrastructure edge computing, economic and business-related challenges are addressed, since companies offering services and infrastructure edge computing facilities have to find agreements regarding the prices for allowing others to use them. In the area of situation-aware device edge computing, the main challenge is to find suitable nodes that can execute a service and to predict a node’s connection in the near future. Finally, to enable situation-aware embedded edge computing, two novel programming and data analysis approaches are presented that allow programmers to develop situation-aware applications. To show the feasibility, applicability, and importance of situation-aware edge computing, two case studies are presented. The first case study shows how situation-aware edge computing can provide services for emergency response applications, while the second case study presents an approach where network transitions can be implemented in a situation-aware manner

    Scientific Workflows: Moving Across Paradigms

    Get PDF
    Modern scientific collaborations have opened up the opportunity to solve complex problems that require both multidisciplinary expertise and large-scale computational experiments. These experiments typically consist of a sequence of processing steps that need to be executed on selected computing platforms. Execution poses a challenge, however, due to (1) the complexity and diversity of applications, (2) the diversity of analysis goals, (3) the heterogeneity of computing platforms, and (4) the volume and distribution of data. A common strategy to make these in silico experiments more manageable is to model them as workflows and to use a workflow management system to organize their execution. This article looks at the overall challenge posed by a new order of scientific experiments and the systems they need to be run on, and examines how this challenge can be addressed by workflows and workflow management systems. It proposes a taxonomy of workflow management system (WMS) characteristics, including aspects previously overlooked. This frames a review of prevalent WMSs used by the scientific community, elucidates their evolution to handle the challenges arising with the emergence of the “fourth paradigm,” and identifies research needed to maintain progress in this area
    corecore