256 research outputs found

    Management of customizable software-as-a-service in cloud and network environments

    Get PDF

    Methodology for malleable applications on distributed memory systems

    Get PDF
    A la portada logo BSC(English) The dominant programming approach for scientific and industrial computing on clusters is MPI+X. While there are a variety of approaches within the node, denoted by the ``X'', Message Passing interface (MPI) is the standard for programming multiple nodes with distributed memory. This thesis argues that the OmpSs-2 tasking model can be extended beyond the node to naturally support distributed memory, with three benefits: First, at small to medium scale the tasking model is a simpler and more productive alternative to MPI. It eliminates the need to distribute the data explicitly and convert all dependencies into explicit message passing. It also avoids the complexity of hybrid programming using MPI+X. Second, the ability to offload parts of the computation among the nodes enables the runtime to automatically balance the loads in a full-scale MPI+X program. This approach does not require a cost model, and it is able to transparently balance the computational loads across the whole program, on all its nodes. Third, because the runtime handles all low-level aspects of data distribution and communication, it can change the resource allocation dynamically, in a way that is transparent to the application. This thesis describes the design, development and evaluation of OmpSs-2@Cluster, a programming model and runtime system that extends the OmpSs-2 model to allow a virtually unmodified OmpSs-2 program to run across multiple distributed memory nodes. For well-balanced applications it provides similar performance to MPI+OpenMP on up to 16 nodes, and it improves performance by up to 2x for irregular and unbalanced applications like Cholesky factorization. This work also extended OmpSs-2@Cluster for interoperability with MPI and Barcelona Supercomputing Center (BSC)'s state-of-the-art Dynamic Load Balance (DLB) library in order to dynamically balance MPI+OmpSs-2 applications by transparently offloading tasks among nodes. This approach reduces the execution time of a microscale solid mechanics application by 46% on 64 nodes and on a synthetic benchmark, it is within 10% of perfect load balancing on up to 8 nodes. Finally, the runtime was extended to transparently support malleability for pure OmpSs-2@Cluster programs and interoperate with the Resources Management System (RMS). The only change to the application is to explicitly call an API function to control the addition or removal of nodes. In this regard we additionally provide the runtime with the ability to semi-transparently save and recover part of the application status to perform checkpoint and restart. Such a feature hides the complexity of data redistribution and parallel IO from the user while allowing the program to recover and continue previous executions. Our work is a starting point for future research on fault tolerance. In summary, OmpSs-2@Cluster expands the OmpSs-2 programming model to encompass distributed memory clusters. It allows an existing OmpSs-2 program, with few if any changes, to run across multiple nodes. OmpSs-2@Cluster supports transparent multi-node dynamic load balancing for MPI+OmpSs-2 programs, and enables semi-transparent malleability for OmpSs-2@Cluster programs. The runtime system has a high level of stability and performance, and it opens several avenues for future work.(Español) El modelo de programación dominante para clusters tanto en ciencia como industria es actualmente MPI+X. A pesar de que hay alguna variedad de alternativas para programar dentro de un nodo (indicado por la "X"), el estandar para programar múltiples nodos con memoria distribuida sigue siendo Message Passing Interface (MPI). Esta tesis propone la extensión del modelo de programación basado en tareas OmpSs-2 para su funcionamiento en sistemas de memoria distribuida, destacando 3 beneficios principales: En primer lugar; a pequeña y mediana escala, un modelo basado en tareas es más simple y productivo que MPI y elimina la necesidad de distribuir los datos explícitamente y convertir todas las dependencias en mensajes. Además, evita la complejidad de la programacion híbrida MPI+X. En segundo lugar; la capacidad de enviar partes del cálculo entre los nodos permite a la librería balancear la carga de trabajo en programas MPI+X a gran escala. Este enfoque no necesita un modelo de coste y permite equilibrar cargas transversalmente en todo el programa y todos los nodos. En tercer lugar; teniendo en cuenta que es la librería quien maneja todos los aspectos relacionados con distribución y transferencia de datos, es posible la modificación dinámica y transparente de los recursos que utiliza la aplicación. Esta tesis describe el diseño, desarrollo y evaluación de OmpSs-2@Cluster; un modelo de programación y librería que extiende OmpSs-2 permitiendo la ejecución de programas OmpSs-2 existentes en múltiples nodos sin prácticamente necesidad de modificarlos. Para aplicaciones balanceadas, este modelo proporciona un rendimiento similar a MPI+OpenMP hasta 16 nodos y duplica el rendimiento en aplicaciones irregulares o desbalanceadas como la factorización de Cholesky. Este trabajo incluye la extensión de OmpSs-2@Cluster para interactuar con MPI y la librería de balanceo de carga Dynamic Load Balancing (DLB) desarrollada en el Barcelona Supercomputing Center (BSC). De este modo es posible equilibrar aplicaciones MPI+OmpSs-2 mediante la transferencia transparente de tareas entre nodos. Este enfoque reduce el tiempo de ejecución de una aplicación de mecánica de sólidos a micro-escala en un 46% en 64 nodos; en algunos experimentos hasta 8 nodos se pudo equilibrar perfectamente la carga con una diferencia inferior al 10% del equilibrio perfecto. Finalmente, se implementó otra extensión de la librería para realizar operaciones de maleabilidad en programas OmpSs-2@Cluster e interactuar con el Sistema de Manejo de Recursos (RMS). El único cambio requerido en la aplicación es la llamada explicita a una función de la interfaz que controla la adición o eliminación de nodos. Además, se agregó la funcionalidad de guardar y recuperar parte del estado de la aplicación de forma semitransparente con el objetivo de realizar operaciones de salva-reinicio. Dicha funcionalidad oculta al usuario la complejidad de la redistribución de datos y las operaciones de lectura-escritura en paralelo, mientras permite al programa recuperar y continuar ejecuciones previas. Este es un punto de partida para futuras investigaciones en tolerancia a fallos. En resumen, OmpSs-2@Cluster amplía el modelo de programación de OmpSs-2 para abarcar sistemas de memoria distribuida. El modelo permite la ejecución de programas OmpSs-2 en múltiples nodos prácticamente sin necesidad de modificarlos. OmpSs-2@Cluster permite además el balanceo dinámico de carga en aplicaciones híbridas MPI+OmpSs-2 ejecutadas en varios nodos y es capaz de realizar maleabilidad semi-transparente en programas OmpSs-2@Cluster puros. La librería tiene un niveles de rendimiento y estabilidad altos y abre varios caminos para trabajos futuro.Arquitectura de computador

    An Integrated Modeling Framework for Managing the Deployment and Operation of Cloud Applications

    Get PDF
    Cloud computing can help Software as a Service (SaaS) providers to take advantage of the sheer number of cloud benefits such as, agility, continuity, cost reduction, autonomy, and easy management of resources. To reap the benefits, SaaS providers should create their applications to utilize the cloud platform capabilities. However, this is a daunting task. First, it requires a full understanding of the service offerings from different providers, and the meta-data artifacts required by each provider to configure the platform to efficiently deploy, run and manage the application. Second, it involves complex decisions that are specified by different stakeholders. Examples include, financial decisions (e.g., selecting a platform to reduces costs), architectural decisions (e.g., partition the application to maximize scalability), and operational decisions (e.g., distributing modules to insure availability and porting the application to other platforms). Finally, while each stakeholder may conduct a certain type of change to address a specific concern, the impact of a change may span multiple models and influence the decisions of several stakeholders. These factors motivate the need for: (i) a new architectural view model that focuses on service operation and reflects the cloud stakeholder perspectives, and (ii) a novel framework that facilitates providing holistic as well as partial architectural views, and generating the required platform artifacts by fragmenting the model into artifacts that can be easily modified separately. This PhD research devises a novel architecture framework, "The 5+1 Architectural View Model", for cloud applications, in which each view corresponds to a different perspective on cloud application deployment. The architectural framework is realized as a cloud modeling framework, called "StratusML", which consists of a modeling language that uses layers to specify the cloud configuration space, and a transformation engine to generate the configuration space artifacts. The usefulness and practical applicability of StratusML to model multi-cloud and multi-tenant applications have been demonstrated though a representative domain example. Moreover, to automate the framework evolution as new concerns and cloud platforms emerge, this research work introduces also a novel schema matching technique, called "Liberate". Liberate supports the process of domain model creation, evolution, and transformations. Liberate helps solve the vendor lock-in problem by reducing the manual efforts required to map complex correspondences between cloud schemas whose domain concepts do not share linguistic similarities. The evaluation of Liberate shows its superiority in the cloud domain over existing schema matching approaches

    Performance optimization and energy efficiency of big-data computing workflows

    Get PDF
    Next-generation e-science is producing colossal amounts of data, now frequently termed as Big Data, on the order of terabyte at present and petabyte or even exabyte in the predictable future. These scientific applications typically feature data-intensive workflows comprised of moldable parallel computing jobs, such as MapReduce, with intricate inter-job dependencies. The granularity of task partitioning in each moldable job of such big data workflows has a significant impact on workflow completion time, energy consumption, and financial cost if executed in clouds, which remains largely unexplored. This dissertation conducts an in-depth investigation into the properties of moldable jobs and provides an experiment-based validation of the performance model where the total workload of a moldable job increases along with the degree of parallelism. Furthermore, this dissertation conducts rigorous research on workflow execution dynamics in resource sharing environments and explores the interactions between workflow mapping and task scheduling on various computing platforms. A workflow optimization architecture is developed to seamlessly integrate three interrelated technical components, i.e., resource allocation, job mapping, and task scheduling. Cloud computing provides a cost-effective computing platform for big data workflows where moldable parallel computing models are widely applied to meet stringent performance requirements. Based on the moldable parallel computing performance model, a big-data workflow mapping model is constructed and a workflow mapping problem is formulated to minimize workflow makespan under a budget constraint in public clouds. This dissertation shows this problem to be strongly NP-complete and designs i) a fully polynomial-time approximation scheme for a special case with a pipeline-structured workflow executed on virtual machines of a single class, and ii) a heuristic for a generalized problem with an arbitrary directed acyclic graph-structured workflow executed on virtual machines of multiple classes. The performance superiority of the proposed solution is illustrated by extensive simulation-based results in Hadoop/YARN in comparison with existing workflow mapping models and algorithms. Considering that large-scale workflows for big data analytics have become a main consumer of energy in data centers, this dissertation also delves into the problem of static workflow mapping to minimize the dynamic energy consumption of a workflow request under a deadline constraint in Hadoop clusters, which is shown to be strongly NP-hard. A fully polynomial-time approximation scheme is designed for a special case with a pipeline-structured workflow on a homogeneous cluster and a heuristic is designed for the generalized problem with an arbitrary directed acyclic graph-structured workflow on a heterogeneous cluster. This problem is further extended to a dynamic version with deadline-constrained MapReduce workflows to minimize dynamic energy consumption in Hadoop clusters. This dissertation proposes a semi-dynamic online scheduling algorithm based on adaptive task partitioning to reduce dynamic energy consumption while meeting performance requirements from a global perspective, and also develops corresponding system modules for algorithm implementation in the Hadoop ecosystem. The performance superiority of the proposed solutions in terms of dynamic energy saving and deadline missing rate is illustrated by extensive simulation results in comparison with existing algorithms, and further validated through real-life workflow implementation and experiments using the Oozie workflow engine in Hadoop/YARN systems

    Data Management Strategies for Relative Quality of Service in Virtualised Storage Systems

    No full text
    The amount of data managed by organisations continues to grow relentlessly. Driven by the high costs of maintaining multiple local storage systems, there is a well established trend towards storage consolidation using multi-tier Virtualised Storage Systems (VSSs). At the same time, storage infrastructures are increasingly subject to stringent Quality of Service (QoS) demands. Within a VSS, it is challenging to match desired QoS with delivered QoS, considering the latter can vary dramatically both across and within tiers. Manual efforts to achieve this match require extensive and ongoing human intervention. Automated efforts are based on workload analysis, which ignores the business importance of infrequently accessed data. This thesis presents our design, implementation and evaluation of data maintenance strategies in an enhanced version of the popular Linux Extended 3 Filesystem which features support for the elegant specification of QoS metadata while maintaining compatibility with stock kernels. Users and applications specify QoS requirements using a chmod-like interface. System administrators are provided with a character device kernel interface that allows for profiling of the QoS delivered by the underlying storage. We propose a novel score-based metric, together with associated visualisation resources, to evaluate the degree of QoS matching achieved by any given data layout. We also design and implement new inode and datablock allocation and migration strategies which exploit this metric in seeking to match the QoS attributes set by users and/or applications on files and directories with the QoS actually delivered by each of the filesystem’s block groups. To create realistic test filesystems we have included QoS metadata support in the Impressions benchmarking framework. The effectiveness of the resulting data layout in terms of QoS matching is evaluated using a special kernel module that is capable of inspecting detailed filesystem data on-the-fly. We show that our implementations of the proposed inode and datablock allocation strategies are capable of dramatically improving data placement with respect to QoS requirements when compared to the default allocators

    Optimizing energy-efficiency for multi-core packet processing systems in a compiler framework

    Get PDF
    Network applications become increasingly computation-intensive and the amount of traffic soars unprecedentedly nowadays. Multi-core and multi-threaded techniques are thus widely employed in packet processing system to meet the changing requirement. However, the processing power cannot be fully utilized without a suitable programming environment. The compilation procedure is decisive for the quality of the code. It can largely determine the overall system performance in terms of packet throughput, individual packet latency, core utilization and energy efficiency. The thesis investigated compilation issues in networking domain first, particularly on energy consumption. And as a cornerstone for any compiler optimizations, a code analysis module for collecting program dependency is presented and incorporated into a compiler framework. With that dependency information, a strategy based on graph bi-partitioning and mapping is proposed to search for an optimal configuration in a parallel-pipeline fashion. The energy-aware extension is specifically effective in enhancing the energy-efficiency of the whole system. Finally, a generic evaluation framework for simulating the performance and energy consumption of a packet processing system is given. It accepts flexible architectural configuration and is capable of performingarbitrary code mapping. The simulation time is extremely short compared to full-fledged simulators. A set of our optimization results is gathered using the framework

    Heterogeneity-aware scheduling and data partitioning for system performance acceleration

    Get PDF
    Over the past decade, heterogeneous processors and accelerators have become increasingly prevalent in modern computing systems. Compared with previous homogeneous parallel machines, the hardware heterogeneity in modern systems provides new opportunities and challenges for performance acceleration. Classic operating systems optimisation problems such as task scheduling, and application-specific optimisation techniques such as the adaptive data partitioning of parallel algorithms, are both required to work together to address hardware heterogeneity. Significant effort has been invested in this problem, but either focuses on a specific type of heterogeneous systems or algorithm, or a high-level framework without insight into the difference in heterogeneity between different types of system. A general software framework is required, which can not only be adapted to multiple types of systems and workloads, but is also equipped with the techniques to address a variety of hardware heterogeneity. This thesis presents approaches to design general heterogeneity-aware software frameworks for system performance acceleration. It covers a wide variety of systems, including an OS scheduler targeting on-chip asymmetric multi-core processors (AMPs) on mobile devices, a hierarchical many-core supercomputer and multi-FPGA systems for high performance computing (HPC) centers. Considering heterogeneity from on-chip AMPs, such as thread criticality, core sensitivity, and relative fairness, it suggests a collaborative based approach to co-design the task selector and core allocator on OS scheduler. Considering the typical sources of heterogeneity in HPC systems, such as the memory hierarchy, bandwidth limitations and asymmetric physical connection, it proposes an application-specific automatic data partitioning method for a modern supercomputer, and a topological-ranking heuristic based schedule for a multi-FPGA based reconfigurable cluster. Experiments on both a full system simulator (GEM5) and real systems (Sunway Taihulight Supercomputer and Xilinx Multi-FPGA based clusters) demonstrate the significant advantages of the suggested approaches compared against the state-of-the-art on variety of workloads."This work is supported by St Leonards 7th Century Scholarship and Computer Science PhD funding from University of St Andrews; by UK EPSRC grant Discovery: Pattern Discovery and Program Shaping for Manycore Systems (EP/P020631/1)." -- Acknowledgement

    Distributed services across the network from edge to core

    Get PDF
    The current internet architecture is evolving from a simple carrier of bits to a platform able to provide multiple complex services running across the entire Network Service Provider (NSP) infrastructure. This calls for increased flexibility in resource management and allocation to provide dedicated, on-demand network services, leveraging a distributed infrastructure consisting of heterogeneous devices. More specifically, NSPs rely on a plethora of low-cost Customer Premise Equipment (CPE), as well as more powerful appliances at the edge of the network and in dedicated data-centers. Currently a great research effort is spent to provide this flexibility through Fog computing, Network Functions Virtualization (NFV), and data plane programmability. Fog computing or Edge computing extends the compute and storage capabilities to the edge of the network, closer to the rapidly growing number of connected devices and applications that consume cloud services and generate massive amounts of data. A complementary technology is NFV, a network architecture concept targeting the execution of software Network Functions (NFs) in isolated Virtual Machines (VMs), potentially sharing a pool of general-purpose hosts, rather than running on dedicated hardware (i.e., appliances). Such a solution enables virtual network appliances (i.e., VMs executing network functions) to be provisioned, allocated a different amount of resources, and possibly moved across data centers in little time, which is key in ensuring that the network can keep up with the flexibility in the provisioning and deployment of virtual hosts in today’s virtualized data centers. Moreover, recent advances in networking hardware have introduced new programmable network devices that can efficiently execute complex operations at line rate. As a result, NFs can be (partially or entirely) folded into the network, speeding up the execution of distributed services. The work described in this Ph.D. thesis aims at showing how various network services can be deployed throughout the NSP infrastructure, accommodating to the different hardware capabilities of various appliances, by applying and extending the above-mentioned solutions. First, we consider a data center environment and the deployment of (virtualized) NFs. In this scenario, we introduce a novel methodology for the modelization of different NFs aimed at estimating their performance on different execution platforms. Moreover, we propose to extend the traditional NFV deployment outside of the data center to leverage the entire NSP infrastructure. This can be achieved by integrating native NFs, commonly available in low-cost CPEs, with an existing NFV framework. This facilitates the provision of services that require NFs close to the end user (e.g., IPsec terminator). On the other hand, resource-hungry virtualized NFs are run in the NSP data center, where they can take advantage of the superior computing and storage capabilities. As an application, we also present a novel technique to deploy a distributed service, specifically a web filter, to leverage both the low latency of a CPE and the computational power of a data center. We then show that also the core network, today dedicated solely to packet routing, can be exploited to provide useful services. In particular, we propose a novel method to provide distributed network services in core network devices by means of task distribution and a seamless coordination among the peers involved. The aim is to transform existing network nodes (e.g., routers, switches, access points) into a highly distributed data acquisition and processing platform, which will significantly reduce the storage requirements at the Network Operations Center and the packet duplication overhead. Finally, we propose to use new programmable network devices in data center networks to provide much needed services to distributed applications. By offloading part of the computation directly to the networking hardware, we show that it is possible to reduce both the network traffic and the overall job completion time
    corecore