
    Supporting Quality of Service in Scientific Workflows

    While workflow management systems have supported business processes in enterprises for almost two decades, the use of workflows in scientific environments was fairly uncommon until recently. Nowadays, scientists use workflow systems to conduct scientific experiments, simulations, and distributed computations. However, most scientific workflow management systems have not been built on existing workflow technology; rather, they have been designed and developed from scratch. Due to the lack of generality of early scientific workflow systems, many domain-specific workflow systems have been developed. Generally speaking, those domain-specific approaches lack common acceptance and tool support and offer lower robustness compared to business workflow systems. In this thesis, the use of the industry standard BPEL, a workflow language for modeling business processes, is proposed for the modeling and execution of scientific workflows. Due to the widespread use of BPEL in enterprises, a number of stable and mature software products exist. The language is expressive (Turing-complete) and not restricted to specific applications. BPEL is well suited for the modeling of scientific workflows, but existing implementations of the standard lack important features that are necessary for the execution of scientific workflows. This work presents components that extend an existing implementation of the BPEL standard and eliminate the identified weaknesses. The components thus provide the technical basis for the use of BPEL in academia. The particular focus is on so-called non-functional (Quality of Service) requirements: scalability, reliability (fault tolerance), data security, and the cost of executing a workflow. From a technical perspective, the workflow system must be able to interface with the middleware systems commonly used by the scientific workflow community to allow access to heterogeneous, distributed resources (especially Grid and Cloud resources). The major components cover exactly these requirements:
    - Cloud Resource Provisioner: Scalability of the workflow system is achieved by automatically adding additional (Cloud) resources to the workflow system's resource pool when the workflow system is heavily loaded.
    - Fault Tolerance Module: High reliability is achieved via continuous monitoring of workflow execution and corrective interventions, such as re-execution of a failed workflow step or replacement of the faulty resource.
    - Cost-Aware, Data-Flow-Aware Scheduler: Most scientific workflow systems consider only the performance and utilization of resources when scheduling workflow steps. The presented workflow system goes beyond that: by defining preference values that weight cost against the anticipated workflow execution time, workflow users can influence resource selection. The developed multi-objective scheduling algorithm respects the defined weighting and makes both efficient and advantageous decisions using a heuristic approach.
    - Security Extensions: Because it supports various encryption, signature, and authentication mechanisms (e.g., the Grid Security Infrastructure), the workflow system guarantees data security during the transfer of workflow data.
    Furthermore, this work identifies the need to equip workflow developers with workflow modeling tools that can be used intuitively. This dissertation presents two modeling tools that support users with different needs. The first tool, DAVO (Domain-Adaptable Visual BPEL Orchestrator), operates at a low level of abstraction and allows users with knowledge of BPEL to use the full extent of the language. DAVO is extensible and customizable for different application domains. These features are used in the implementation of the second tool, the SimpleBPEL Composer. SimpleBPEL is aimed at users with little or no background in computer science and allows quick and intuitive development of BPEL workflows from predefined components.
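
    To make the cost/time trade-off of the Cost-Aware, Data-Flow-Aware Scheduler concrete, the Python sketch below scores candidate resources with a single user-supplied preference weight. The resource attributes, the weight range, and the scoring formula are illustrative assumptions for this summary, not the heuristic actually developed in the thesis.

```python
from dataclasses import dataclass

@dataclass
class Resource:
    name: str
    est_runtime_h: float   # predicted execution time of the step, in hours (assumed input)
    cost_per_hour: float   # monetary cost of using the resource, per hour (assumed input)

def score(r: Resource, time_weight: float) -> float:
    """Weighted combination of estimated runtime and estimated cost (lower is better).
    time_weight in [0, 1]: 1.0 means only speed matters, 0.0 means only cost matters."""
    est_cost = r.cost_per_hour * r.est_runtime_h
    return time_weight * r.est_runtime_h + (1.0 - time_weight) * est_cost

def pick_resource(candidates: list[Resource], time_weight: float = 0.5) -> Resource:
    return min(candidates, key=lambda r: score(r, time_weight))

if __name__ == "__main__":
    pool = [Resource("local-cluster", 0.33, 0.00),
            Resource("cloud-large",   0.08, 2.40),
            Resource("cloud-small",   0.25, 0.60)]
    print(pick_resource(pool, time_weight=0.8).name)  # speed-oriented user -> cloud-large
    print(pick_resource(pool, time_weight=0.1).name)  # cost-oriented user  -> local-cluster
```

    A weight near 1.0 lets execution time dominate, while a weight near 0.0 lets cost dominate; the scheduler described in the abstract additionally takes data flow between workflow steps into account, which this toy example omits.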

    Many-Task Computing and Blue Waters

    This report discusses many-task computing (MTC) generically and in the context of the proposed Blue Waters system, which is planned to be the largest NSF-funded supercomputer when it enters production use in 2012. The aim of this report is to inform the BW project about MTC, including understanding aspects of MTC applications that can be used to characterize the domain and understanding the implications of these aspects for middleware and policies. Many MTC applications do not neatly fit the stereotypes of high-performance computing (HPC) or high-throughput computing (HTC) applications. Like HTC applications, MTC applications are by definition structured as graphs of discrete tasks, with explicit input and output dependencies forming the graph edges. However, MTC applications have significant features that distinguish them from typical HTC applications; in particular, different engineering constraints for hardware and software must be met in order to support these applications. HTC applications have traditionally run on platforms such as grids and clusters, through either workflow systems or parallel programming systems. MTC applications, in contrast, will often demand a short time to solution, may be communication or data intensive, and may comprise very short tasks. Therefore, hardware and software for MTC must be engineered to support the additional communication and I/O and must minimize task dispatch overheads. The hardware of large-scale HPC systems, with its high degree of parallelism and support for intensive communication, is well suited for MTC applications. However, HPC systems often lack dynamic resource provisioning, are not ideal for task communication via the file system, and have I/O systems that are not optimized for MTC-style applications. Hence, additional software support is likely to be required to gain full benefit from the HPC hardware.
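
    As a minimal illustration of the "graph of discrete tasks with explicit input/output dependencies" structure that the report uses to characterize MTC applications, the following Python sketch dispatches tasks in dependency order. The task names and the single-process dispatcher are assumptions made purely for illustration; a real MTC runtime would dispatch ready tasks in parallel to many compute nodes.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks whose outputs it consumes (its dependencies).
task_graph = {
    "preprocess": set(),
    "simulate_a": {"preprocess"},
    "simulate_b": {"preprocess"},
    "aggregate":  {"simulate_a", "simulate_b"},
}

def run(task: str) -> None:
    print(f"dispatching {task}")   # a real dispatcher would submit the task to a node

ts = TopologicalSorter(task_graph)
ts.prepare()
while ts.is_active():
    ready = ts.get_ready()         # all tasks whose dependencies are already satisfied
    for task in ready:             # in MTC these could run concurrently
        run(task)
        ts.done(task)
```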

    The HdpH DSLs for scalable reliable computation

    The statelessness of functional computations facilitates both parallelism and fault recovery. Faults and non-uniform communication topologies are key challenges for emergent large-scale parallel architectures. We report on HdpH and HdpH-RS, a pair of Haskell DSLs designed to address these challenges for irregular task-parallel computations on large distributed-memory architectures. Both DSLs share an API combining explicit task placement with sophisticated work stealing. HdpH focuses on scalability by making placement and stealing topology aware, whereas HdpH-RS delivers reliability by means of fault-tolerant work stealing. We present operational semantics for both DSLs and investigate conditions for semantic equivalence of HdpH and HdpH-RS programs, that is, conditions under which topology awareness can be transparently traded for fault tolerance. We detail how the DSL implementations realise topology awareness and fault tolerance. We report an initial evaluation of scalability and fault tolerance on a 256-core cluster and on up to 32K cores of an HPC platform.
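
    HdpH and HdpH-RS are Haskell DSLs; purely as a language-agnostic sketch of the fault-tolerant work-stealing idea (tasks held by a failed node are put back into the pool for re-execution), a supervisor might track stolen work roughly as follows. The Supervisor class, its method names, and the bookkeeping are hypothetical and are not taken from the HdpH-RS implementation.

```python
import collections

class Supervisor:
    """Tracks which node holds each task so work lost in a crash can be replayed."""
    def __init__(self) -> None:
        self.pending = collections.deque()          # tasks not yet stolen
        self.in_flight: dict[str, set[str]] = {}    # node -> tasks it is executing

    def submit(self, task: str) -> None:
        self.pending.append(task)

    def steal(self, node: str) -> str | None:
        """A node asks for work; record where the task went."""
        if not self.pending:
            return None
        task = self.pending.popleft()
        self.in_flight.setdefault(node, set()).add(task)
        return task

    def completed(self, node: str, task: str) -> None:
        self.in_flight[node].discard(task)

    def node_failed(self, node: str) -> None:
        """Fault tolerance: tasks held by a dead node go back into the pool."""
        for task in self.in_flight.pop(node, set()):
            self.pending.append(task)

if __name__ == "__main__":
    sup = Supervisor()
    for t in ("t1", "t2", "t3"):
        sup.submit(t)
    sup.steal("nodeA")
    sup.steal("nodeB")
    sup.node_failed("nodeA")     # whatever nodeA held is re-queued
    print(list(sup.pending))     # the lost task is available to steal again
```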

    A Case for Cooperative and Incentive-Based Coupling of Distributed Clusters

    Research interest in Grid computing has grown significantly over the past five years. Management of distributed resources is one of the key issues in Grid computing, and central to it is the effectiveness of resource allocation, as it determines the overall utility of the system. Current approaches to superscheduling in a Grid environment are non-coordinated, since application-level schedulers or brokers make scheduling decisions independently of the others in the system. Clearly, this can exacerbate the load-sharing and utilization problems of distributed resources due to the suboptimal schedules that are likely to occur. To overcome these limitations, we propose a mechanism for coordinated sharing of distributed clusters based on computational economy. The resulting environment, called Grid-Federation, allows the transparent use of resources from the federation when local resources are insufficient to meet its users' requirements. The use of a computational-economy methodology in coordinating resource allocation not only facilitates QoS-based scheduling but also enhances the utility delivered by resources. (Comment: 22 pages; extended version of the conference paper published at IEEE Cluster'05, Boston, MA.)
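
    The economy-based coordination described above can be pictured, very roughly, as clusters quoting price and capacity to a federation broker that prefers local resources and otherwise picks the cheapest feasible remote cluster. The quote fields and the selection rule in this Python sketch are assumptions for illustration, not the paper's actual Grid-Federation protocol.

```python
from dataclasses import dataclass

@dataclass
class ClusterQuote:
    name: str
    price_per_cpu_hour: float   # assumed "price" published by the cluster
    free_cpus: int              # assumed capacity currently on offer

def place_job(cpus_needed: int, local: ClusterQuote, federation: list[ClusterQuote]):
    """Prefer local resources; fall back to the cheapest federation cluster
    that can satisfy the request (a toy stand-in for economy-based coordination)."""
    if local.free_cpus >= cpus_needed:
        return local
    feasible = [c for c in federation if c.free_cpus >= cpus_needed]
    if not feasible:
        return None   # no cluster can currently host the job
    return min(feasible, key=lambda c: c.price_per_cpu_hour)

if __name__ == "__main__":
    local = ClusterQuote("home-cluster", 0.0, 4)
    remote = [ClusterQuote("uni-cluster", 0.05, 64), ClusterQuote("hpc-centre", 0.12, 512)]
    print(place_job(32, local, remote).name)   # -> uni-cluster (cheapest feasible)
```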

    Improving The Fault Tolerance of Ad Hoc Routing Protocols using Aspect-oriented Programming

    Ad hoc networks are distributed wireless networks consisting of mobile nodes that can freely and dynamically self-organize into arbitrary and temporary topologies through the operation of routing protocols. These networks allow people and devices to interconnect seamlessly and rapidly, at low cost, in areas with no pre-existing communication infrastructure. Many studies show that ad hoc routing protocols are threatened by a variety of accidental and malicious faults, such as neighbour saturation, which may affect any kind of ad hoc network, and ambient noise, which may impact all wireless networks in general. Therefore, developing and deploying fault tolerance strategies to mitigate the effect of such faults is essential for the practical use of this kind of network. However, those fault tolerance mechanisms are usually embedded into the source code of routing protocols, which means that i) they must be rewritten and redeployed whenever a new version of a protocol is released, and ii) they must be completely redeveloped and adapted for new routing protocols. This master's thesis explores the feasibility of using Aspect-Oriented Programming (AOP) to develop and deploy fault tolerance mechanisms suitable for a whole family of routing protocols, i.e., existing and future versions of a given protocol (OLSR in this case). Furthermore, a new methodology is proposed to extend these mechanisms to different families of proactive protocols (OLSR, B.A.T.M.A.N. and Babel) using a new AOP concept, the meta-aspect. The feasibility and effectiveness of the proposal are experimentally assessed, thus establishing a new method to improve the deployment, portability, and maintainability of fault tolerance mechanisms for ad hoc routing protocols and, therefore, the dependability of ad hoc networks. (Bustos Rodríguez, AJ. (2012). Improving The Fault Tolerance of Ad Hoc Routing Protocols using Aspect-oriented Programming. http://hdl.handle.net/10251/18421)
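
    To give a flavour of how AOP separates fault tolerance from protocol code, the Python sketch below weaves a saturation-mitigating "advice" around a message handler using a decorator. The handler, the neighbour-saturation check, and the threshold are hypothetical; the thesis itself applies AOP aspects to OLSR, B.A.T.M.A.N. and Babel implementations, not to Python code.

```python
import functools

MAX_NEIGHBOURS = 64   # hypothetical saturation threshold

def tolerate_neighbour_saturation(handler):
    """'Advice' wrapped around a HELLO-message handler: cap the neighbour list
    before the protocol code sees it, so the cross-cutting fault tolerance
    concern stays outside the protocol's own source (the AOP idea)."""
    @functools.wraps(handler)
    def wrapper(node, hello_msg):
        neighbours = hello_msg.get("neighbours", [])
        if len(neighbours) > MAX_NEIGHBOURS:
            hello_msg = {**hello_msg, "neighbours": neighbours[:MAX_NEIGHBOURS]}
        return handler(node, hello_msg)
    return wrapper

@tolerate_neighbour_saturation
def process_hello(node, hello_msg):
    # stand-in for the routing protocol's own HELLO processing
    node.setdefault("neighbour_table", set()).update(hello_msg["neighbours"])

if __name__ == "__main__":
    node = {}
    process_hello(node, {"neighbours": [f"n{i}" for i in range(200)]})
    print(len(node["neighbour_table"]))   # capped at 64 by the woven advice
```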

    Grid-centric scheduling strategies for workflow applications

    Grid computing faces a great challenge because the resources are not localized, but distributed, heterogeneous and dynamic. Thus, it is essential to provide a set of programming tools that execute an application on the Grid resources with as little input from the user as possible. The thesis of this work is that Grid-centric scheduling techniques for workflow applications can provide good usability of the Grid environment by reliably executing the application on a large-scale distributed system with good performance. We support our thesis with new and effective approaches in the following five aspects. First, we modeled the performance of existing scheduling approaches in a multi-cluster Grid environment. We implemented several widely used scheduling algorithms and identified the best candidate. The study further introduced a new measurement, based on our experiments, which can improve the schedule quality of some scheduling algorithms as much as 20-fold in a multi-cluster Grid environment. Second, we studied the scalability of existing Grid scheduling algorithms. To deal with Grid systems consisting of hundreds of thousands of resources, we designed and implemented a novel approach that performs explicit resource selection decoupled from scheduling. Our experimental evaluation confirmed that our decoupled approach can scale in such an environment without sacrificing the quality of the schedule by more than 10%. Third, we proposed solutions to address the dynamic nature of Grid computing with a new cluster-based hybrid scheduling mechanism. Our experimental results, collected from real executions on production clusters, demonstrated that this approach produces programs running 30% to 100% faster than the other scheduling approaches we implemented, on both reserved and shared resources. Fourth, we improved the reliability of Grid computing by incorporating fault tolerance and recovery mechanisms into the workflow application execution. Our experiments on a simulated multi-cluster Grid environment demonstrated the effectiveness of our approach and also characterized the three-way trade-off between reliability, performance and resource usage when executing a workflow application. Finally, we addressed the large batch-queue wait times often found on production Grid clusters. We developed a novel approach to partition the workflow application and submit the parts judiciously so as to reduce the total batch-queue wait time. The experimental results derived from production-site batch-queue logs show that our approach can reduce total wait time by as much as 70%. Combined, our approaches can greatly improve the usability of Grid computing while increasing the performance of workflow applications in a multi-cluster Grid environment.
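
    The decoupling of explicit resource selection from scheduling mentioned above can be sketched as a two-phase procedure: a cheap filter shortlists resources, and the scheduling heuristic only ever sees the shortlist. The ranking criterion, shortlist size, and load bookkeeping below are illustrative assumptions, not the algorithms evaluated in the thesis.

```python
def select_resources(all_resources, k=50):
    """Phase 1: cheap, coarse filter over a very large resource pool."""
    return sorted(all_resources, key=lambda r: r["load"])[:k]

def schedule(tasks, candidates):
    """Phase 2: the (potentially expensive) scheduling heuristic runs only on the shortlist."""
    assignment = {}
    for task in tasks:
        best = min(candidates, key=lambda r: r["load"])
        assignment[task] = best["name"]
        best["load"] += 1   # crude load update so tasks spread out
    return assignment

if __name__ == "__main__":
    resources = [{"name": f"node{i}", "load": i % 7} for i in range(100_000)]
    print(schedule(["t1", "t2", "t3"], select_resources(resources)))
```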

    Runtime Adaptation of Scientific Service Workflows

    Software landscapes are subject to continuous change rather than being complete once built. Changes may be caused by modified customer behavior, a shift to new hardware resources, or otherwise changed requirements. In such situations, several challenges arise: new architectural models have to be designed and implemented, existing software has to be integrated, and, finally, the new software has to be deployed, monitored, and, where appropriate, optimized at runtime under realistic usage scenarios. All of these situations often demand manual intervention, which makes them error-prone. This thesis addresses these types of runtime adaptation. Based on service-oriented architectures, an environment is developed that enables the integration of existing software (i.e., the wrapping of legacy software as web services). A workflow modeling tool is presented that aims at ease of use by separating the role of the workflow expert from the role of the domain expert. After the development of workflows, tools that observe the executing infrastructure and perform automatic scale-in and scale-out operations are presented. Infrastructure-as-a-Service providers are used to scale the infrastructure in a transparent and cost-efficient way, and the necessary middleware tools are deployed automatically. The use of a distributed infrastructure can lead to communication problems, and in order to keep workflows robust these exceptional cases need to be treated. Handling them inside the workflow, however, mixes the process logic with infrastructural details and bloats the workflow, increasing its complexity. In this work, a module is presented that deals with infrastructural faults automatically and thereby keeps these two layers separate. When services or their components are hosted in a distributed environment, some requirements need to be addressed at each service separately. Although techniques such as object-oriented programming or design patterns like the interceptor pattern ease the adaptation of service behavior or structure, these methods still require modifying the configuration or the implementation of each individual service. Aspect-oriented programming, on the other hand, allows functionality to be woven into existing code even without having its source. Since the functionality is woven into the code, it depends on the specific implementation; in a service-oriented architecture, where the implementation of a service is unknown, this approach clearly has its limitations. The request/response aspects presented in this thesis overcome this obstacle and provide new, SOA-compliant methods to weave functionality into the communication layer of web services. The main contributions of this thesis are the following:
    - Shifting towards a service-oriented architecture: The generic and extensible Legacy Code Description Language and the corresponding framework allow existing software to be wrapped, e.g., as web services, which can afterwards be composed into a workflow with SimpleBPEL without overburdening the domain expert with technical details, which are instead handled by a workflow expert.
    - Runtime adaptation: Based on the standardized Business Process Execution Language, an automatic scheduling approach is presented that monitors all used resources and is able to automatically provision new machines in case a scale-out becomes necessary. If the resources' load drops, e.g., because of fewer workflow executions, a scale-in is also performed automatically. The scheduling algorithm takes the data transfer between the services into account in order to prevent allocations that would increase the workflow's makespan due to unnecessary or disadvantageous data transfers. Furthermore, a multi-objective scheduling algorithm based on a genetic algorithm can additionally consider cost, so that users can define their own preferences for trading off optimized workflow execution time against minimized cost. Possible communication errors are automatically detected and, subject to certain constraints, corrected.
    - Adaptation of communication: The presented request/response aspects allow functionality to be woven into the communication of web services. By defining a pointcut language that relies only on the exchanged documents, the implementation of the services need neither be known nor be available. The weaving process itself is modeled using web services. In this way, the concept of request/response aspects is naturally embedded into a service-oriented architecture.
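
    The automatic scale-out and scale-in behaviour described in the runtime-adaptation contribution can be reduced, for illustration, to a threshold rule over monitored utilisation. The thresholds and the provider callbacks in this Python sketch are placeholders, not the thesis' actual monitoring or provisioning components.

```python
SCALE_OUT_THRESHOLD = 0.80   # average utilisation above which a machine is added (assumed)
SCALE_IN_THRESHOLD = 0.30    # average utilisation below which a machine is removed (assumed)
MIN_MACHINES = 1

def rebalance(avg_utilisation: float, machines: int, provision, decommission) -> int:
    """Decide on one scaling step; `provision`/`decommission` stand in for
    calls to an Infrastructure-as-a-Service provider."""
    if avg_utilisation > SCALE_OUT_THRESHOLD:
        provision()
        return machines + 1
    if avg_utilisation < SCALE_IN_THRESHOLD and machines > MIN_MACHINES:
        decommission()
        return machines - 1
    return machines

if __name__ == "__main__":
    # Example: 90% average load on 2 machines triggers a scale-out.
    count = rebalance(0.90, 2,
                      provision=lambda: print("start VM"),
                      decommission=lambda: print("stop VM"))
    print(count)   # -> 3
```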

    A fault tolerant, peer-to-peer based scheduler for home grids

    This thesis presents a fault-tolerant, Peer-to-Peer (P2P) based grid scheduling system for highly dynamic and highly heterogeneous environments, such as home networks, where we can find a variety of devices (laptops, PCs, game consoles, etc.) and networks. The number of devices found in a house that are capable of processing data has been increasing in the last few years. However, being able to process data does not mean that these devices are powerful, and, in a home environment, there will be a demand for applications that need significant computing resources, beyond the capabilities of a single domestic device such as a set-top box (examples of such applications are TV recommender systems, image processing and photo indexing systems). A computational grid is a possible solution to this problem, but the constrained environment in the home makes it difficult to use conventional grid scheduling technologies, which demand a powerful infrastructure. Our solution is based on distributing the matchmaking task among the providers, leaving the final allocation decision to a central scheduler that can run on a limited device without a significant loss in performance. We evaluate our solution by simulating different scenarios and configurations against the Opportunistic Load Balance (OLB) scheduling heuristic, which we found to be the best option for home grids among the existing solutions that we analysed. The results show that our solution performs similarly to or better than OLB. Furthermore, our solution also provides fault tolerance, which OLB does not, and we have formally verified the behaviour of our solution against two cases of network partition failure.
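
    The core idea, distributing matchmaking to the providers and leaving only the final ranking to a lightweight central scheduler, can be sketched in Python as follows. The capability fields, the fitness test, and the ranking by estimated finish time are assumptions for illustration, not the protocol developed in the thesis.

```python
from dataclasses import dataclass

@dataclass
class Offer:
    provider: str
    fits: bool              # the provider decided locally whether it can run the job
    est_finish_time: float  # provider's own estimate, in arbitrary units

def provider_matchmake(provider_name, capabilities, job) -> Offer:
    """Runs on each device (PC, console, set-top box): the heavy matching stays local."""
    fits = capabilities["mem_mb"] >= job["mem_mb"] and capabilities["cpu"] >= job["cpu"]
    est = job["work"] / capabilities["cpu"] if fits else float("inf")
    return Offer(provider_name, fits, est)

def central_schedule(offers: list[Offer]) -> str | None:
    """The resource-constrained central scheduler only ranks the offers."""
    usable = [o for o in offers if o.fits]
    return min(usable, key=lambda o: o.est_finish_time).provider if usable else None

if __name__ == "__main__":
    job = {"mem_mb": 1024, "cpu": 1, "work": 100}
    offers = [provider_matchmake("set-top-box", {"mem_mb": 512, "cpu": 1}, job),
              provider_matchmake("gaming-pc", {"mem_mb": 8192, "cpu": 8}, job)]
    print(central_schedule(offers))   # -> gaming-pc
```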

    The Family of MapReduce and Large Scale Data Processing Systems

    In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data, which has called for a paradigm shift in computing architectures and large-scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program, such as issues of data distribution, scheduling and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in several follow-up works since its introduction. This article provides a comprehensive survey of a family of approaches and mechanisms for large-scale data processing that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both the research and industrial communities. We also cover a set of systems that have been introduced to provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large-scale data processing systems that resemble some of the ideas of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some of the future research directions for implementing the next generation of MapReduce-like solutions. (Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other authors.)
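
    For readers unfamiliar with the programming model the survey builds on, the canonical word-count example below shows the user-supplied map and reduce functions together with an in-process "shuffle". It is a single-machine illustration of the model, not the API of any particular framework covered by the survey.

```python
from collections import defaultdict
from itertools import chain

def map_fn(document: str):
    """Map: emit (word, 1) for every word in one input record."""
    for word in document.split():
        yield word.lower(), 1

def reduce_fn(word: str, counts: list[int]) -> tuple[str, int]:
    """Reduce: sum all partial counts for one key."""
    return word, sum(counts)

def mapreduce(documents):
    # Shuffle phase: group the intermediate (key, value) pairs by key.
    groups = defaultdict(list)
    for key, value in chain.from_iterable(map_fn(d) for d in documents):
        groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

if __name__ == "__main__":
    print(mapreduce(["the quick brown fox", "the lazy dog"]))
```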