182 research outputs found

    Scalable and Reliable Sparse Data Computation on Emergent High Performance Computing Systems

    Get PDF
    Heterogeneous systems with both CPUs and GPUs have become important system architectures in emergent High Performance Computing (HPC) systems. Heterogeneous systems must address both performance-scalability and power-scalability in the presence of failures. Aggressive power reduction pushes hardware to its operating limit and increases the failure rate. Resilience allows programs to progress when subjected to faults and is an integral component of large-scale systems, but incurs significant time and energy overhead. The future exascale systems are expected to have higher power consumption with higher fault rates. Sparse data computation is the fundamental kernel in many scientific applications. It is suitable for the studies of scalability and resilience on heterogeneous systems due to its computational characteristics. To deliver the promised performance within the given power budget, heterogeneous computing mandates a deep understanding of the interplay between scalability and resilience. Managing scalability and resilience is challenging in heterogeneous systems, due to the heterogeneous compute capability, power consumption, and varying failure rates between CPUs and GPUs. Scalability and resilience have been traditionally studied in isolation, and optimizing one typically detrimentally impacts the other. While prior works have been proved successful in optimizing scalability and resilience on CPU-based homogeneous systems, simply extending current approaches to heterogeneous systems results in suboptimal performance-scalability and/or power-scalability. To address the above multiple research challenges, we propose novel resilience and energy-efficiency technologies to optimize scalability and resilience for sparse data computation on heterogeneous systems with CPUs and GPUs. First, we present generalized analytical and experimental methods to analyze and quantify the time and energy costs of various recovery schemes, and develop and prototype performance optimization and power management strategies to improve scalability for sparse linear solvers. Our results quantitatively reveal that each resilience scheme has its own advantages depending on the fault rate, system size, and power budget, and the forward recovery can further benefit from our performance and power optimizations for large-scale computing. Second, we design a novel resilience technique that relaxes the requirement of synchronization and identicalness for processes, and allows them to run in heterogeneous resources with power reduction. Our results show a significant reduction in energy for unmodified programs in various fault situations compared to exact replication techniques. Third, we propose a novel distributed sparse tensor decomposition that utilizes an asynchronous RDMA-based approach with OpenSHMEM to improve scalability on large-scale systems and prove that our method works well in heterogeneous systems. Our results show our irregularity-aware workload partition and balanced-asynchronous algorithms are scalable and outperform the state-of-the-art distributed implementations. We demonstrate that understanding different bottlenecks for various types of tensors plays critical roles in improving scalability

    Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016)

    Get PDF
    Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016) Timisoara, Romania. February 8-11, 2016.The PhD Symposium was a very good opportunity for the young researchers to share information and knowledge, to present their current research, and to discuss topics with other students in order to look for synergies and common research topics. The idea was very successful and the assessment made by the PhD Student was very good. It also helped to achieve one of the major goals of the NESUS Action: to establish an open European research network targeting sustainable solutions for ultrascale computing aiming at cross fertilization among HPC, large scale distributed systems, and big data management, training, contributing to glue disparate researchers working across different areas and provide a meeting ground for researchers in these separate areas to exchange ideas, to identify synergies, and to pursue common activities in research topics such as sustainable software solutions (applications and system software stack), data management, energy efficiency, and resilience.European Cooperation in Science and Technology. COS

    Elastic, Interoperable and Container-based Cloud Infrastructures for High Performance Computing

    Full text link
    Tesis por compendio[ES] Las aplicaciones científicas implican generalmente una carga computacional variable y no predecible a la que las instituciones deben hacer frente variando dinámicamente la asignación de recursos en función de las distintas necesidades computacionales. Las aplicaciones científicas pueden necesitar grandes requisitos. Por ejemplo, una gran cantidad de recursos computacionales para el procesado de numerosos trabajos independientes (High Throughput Computing o HTC) o recursos de alto rendimiento para la resolución de un problema individual (High Performance Computing o HPC). Los recursos computacionales necesarios en este tipo de aplicaciones suelen acarrear un coste muy alto que puede exceder la disponibilidad de los recursos de la institución o estos pueden no adaptarse correctamente a las necesidades de las aplicaciones científicas, especialmente en el caso de infraestructuras preparadas para la ejecución de aplicaciones de HPC. De hecho, es posible que las diferentes partes de una aplicación necesiten distintos tipos de recursos computacionales. Actualmente las plataformas de servicios en la nube se han convertido en una solución eficiente para satisfacer la demanda de las aplicaciones HTC, ya que proporcionan un abanico de recursos computacionales accesibles bajo demanda. Por esta razón, se ha producido un incremento en la cantidad de clouds híbridos, los cuales son una combinación de infraestructuras alojadas en servicios en la nube y en las propias instituciones (on-premise). Dado que las aplicaciones pueden ser procesadas en distintas infraestructuras, actualmente la portabilidad de las aplicaciones se ha convertido en un aspecto clave. Probablemente, las tecnologías de contenedores son la tecnología más popular para la entrega de aplicaciones gracias a que permiten reproducibilidad, trazabilidad, versionado, aislamiento y portabilidad. El objetivo de la tesis es proporcionar una arquitectura y una serie de servicios para proveer infraestructuras elásticas híbridas de procesamiento que puedan dar respuesta a las diferentes cargas de trabajo. Para ello, se ha considerado la utilización de elasticidad vertical y horizontal desarrollando una prueba de concepto para proporcionar elasticidad vertical y se ha diseñado una arquitectura cloud elástica de procesamiento de Análisis de Datos. Después, se ha trabajo en una arquitectura cloud de recursos heterogéneos de procesamiento de imágenes médicas que proporciona distintas colas de procesamiento para trabajos con diferentes requisitos. Esta arquitectura ha estado enmarcada en una colaboración con la empresa QUIBIM. En la última parte de la tesis, se ha evolucionado esta arquitectura para diseñar e implementar un cloud elástico, multi-site y multi-tenant para el procesamiento de imágenes médicas en el marco del proyecto europeo PRIMAGE. Esta arquitectura utiliza un almacenamiento distribuido integrando servicios externos para la autenticación y la autorización basados en OpenID Connect (OIDC). Para ello, se ha desarrollado la herramienta kube-authorizer que, de manera automatizada y a partir de la información obtenida en el proceso de autenticación, proporciona el control de acceso a los recursos de la infraestructura de procesamiento mediante la creación de las políticas y roles. Finalmente, se ha desarrollado otra herramienta, hpc-connector, que permite la integración de infraestructuras de procesamiento HPC en infraestructuras cloud sin necesitar realizar cambios en la infraestructura HPC ni en la arquitectura cloud. Cabe destacar que, durante la realización de esta tesis, se han utilizado distintas tecnologías de gestión de trabajos y de contenedores de código abierto, se han desarrollado herramientas y componentes de código abierto y se han implementado recetas para la configuración automatizada de las distintas arquitecturas diseñadas desde la perspectiva DevOps.[CA] Les aplicacions científiques impliquen generalment una càrrega computacional variable i no predictible a què les institucions han de fer front variant dinàmicament l'assignació de recursos en funció de les diferents necessitats computacionals. Les aplicacions científiques poden necessitar grans requisits. Per exemple, una gran quantitat de recursos computacionals per al processament de nombrosos treballs independents (High Throughput Computing o HTC) o recursos d'alt rendiment per a la resolució d'un problema individual (High Performance Computing o HPC). Els recursos computacionals necessaris en aquest tipus d'aplicacions solen comportar un cost molt elevat que pot excedir la disponibilitat dels recursos de la institució o aquests poden no adaptar-se correctament a les necessitats de les aplicacions científiques, especialment en el cas d'infraestructures preparades per a l'avaluació d'aplicacions d'HPC. De fet, és possible que les diferents parts d'una aplicació necessiten diferents tipus de recursos computacionals. Actualment les plataformes de servicis al núvol han esdevingut una solució eficient per satisfer la demanda de les aplicacions HTC, ja que proporcionen un ventall de recursos computacionals accessibles a demanda. Per aquest motiu, s'ha produït un increment de la quantitat de clouds híbrids, els quals són una combinació d'infraestructures allotjades a servicis en el núvol i a les mateixes institucions (on-premise). Donat que les aplicacions poden ser processades en diferents infraestructures, actualment la portabilitat de les aplicacions s'ha convertit en un aspecte clau. Probablement, les tecnologies de contenidors són la tecnologia més popular per a l'entrega d'aplicacions gràcies al fet que permeten reproductibilitat, traçabilitat, versionat, aïllament i portabilitat. L'objectiu de la tesi és proporcionar una arquitectura i una sèrie de servicis per proveir infraestructures elàstiques híbrides de processament que puguen donar resposta a les diferents càrregues de treball. Per a això, s'ha considerat la utilització d'elasticitat vertical i horitzontal desenvolupant una prova de concepte per proporcionar elasticitat vertical i s'ha dissenyat una arquitectura cloud elàstica de processament d'Anàlisi de Dades. Després, s'ha treballat en una arquitectura cloud de recursos heterogenis de processament d'imatges mèdiques que proporciona distintes cues de processament per a treballs amb diferents requisits. Aquesta arquitectura ha estat emmarcada en una col·laboració amb l'empresa QUIBIM. En l'última part de la tesi, s'ha evolucionat aquesta arquitectura per dissenyar i implementar un cloud elàstic, multi-site i multi-tenant per al processament d'imatges mèdiques en el marc del projecte europeu PRIMAGE. Aquesta arquitectura utilitza un emmagatzemament integrant servicis externs per a l'autenticació i autorització basats en OpenID Connect (OIDC). Per a això, s'ha desenvolupat la ferramenta kube-authorizer que, de manera automatitzada i a partir de la informació obtinguda en el procés d'autenticació, proporciona el control d'accés als recursos de la infraestructura de processament mitjançant la creació de les polítiques i rols. Finalment, s'ha desenvolupat una altra ferramenta, hpc-connector, que permet la integració d'infraestructures de processament HPC en infraestructures cloud sense necessitat de realitzar canvis en la infraestructura HPC ni en l'arquitectura cloud. Es pot destacar que, durant la realització d'aquesta tesi, s'han utilitzat diferents tecnologies de gestió de treballs i de contenidors de codi obert, s'han desenvolupat ferramentes i components de codi obert, i s'han implementat receptes per a la configuració automatitzada de les distintes arquitectures dissenyades des de la perspectiva DevOps.[EN] Scientific applications generally imply a variable and an unpredictable computational workload that institutions must address by dynamically adjusting the allocation of resources to their different computational needs. Scientific applications could require a high capacity, e.g. the concurrent usage of computational resources for processing several independent jobs (High Throughput Computing or HTC) or a high capability by means of using high-performance resources for solving complex problems (High Performance Computing or HPC). The computational resources required in this type of applications usually have a very high cost that may exceed the availability of the institution's resources or they are may not be successfully adapted to the scientific applications, especially in the case of infrastructures prepared for the execution of HPC applications. Indeed, it is possible that the different parts that compose an application require different type of computational resources. Nowadays, cloud service platforms have become an efficient solution to meet the need of HTC applications as they provide a wide range of computing resources accessible on demand. For this reason, the number of hybrid computational infrastructures has increased during the last years. The hybrid computation infrastructures are the combination of infrastructures hosted in cloud platforms and the computation resources hosted in the institutions, which are named on-premise infrastructures. As scientific applications can be processed on different infrastructures, the application delivery has become a key issue. Nowadays, containers are probably the most popular technology for application delivery as they ease reproducibility, traceability, versioning, isolation, and portability. The main objective of this thesis is to provide an architecture and a set of services to build up hybrid processing infrastructures that fit the need of different workloads. Hence, the thesis considered aspects such as elasticity and federation. The use of vertical and horizontal elasticity by developing a proof of concept to provide vertical elasticity on top of an elastic cloud architecture for data analytics. Afterwards, an elastic cloud architecture comprising heterogeneous computational resources has been implemented for medical imaging processing using multiple processing queues for jobs with different requirements. The development of this architecture has been framed in a collaboration with a company called QUIBIM. In the last part of the thesis, the previous work has been evolved to design and implement an elastic, multi-site and multi-tenant cloud architecture for medical image processing has been designed in the framework of a European project PRIMAGE. This architecture uses a storage integrating external services for the authentication and authorization based on OpenID Connect (OIDC). The tool kube-authorizer has been developed to provide access control to the resources of the processing infrastructure in an automatic way from the information obtained in the authentication process, by creating policies and roles. Finally, another tool, hpc-connector, has been developed to enable the integration of HPC processing infrastructures into cloud infrastructures without requiring modifications in both infrastructures, cloud and HPC. It should be noted that, during the realization of this thesis, different contributions to open source container and job management technologies have been performed by developing open source tools and components and configuration recipes for the automated configuration of the different architectures designed from the DevOps perspective. The results obtained support the feasibility of the vertical elasticity combined with the horizontal elasticity to implement QoS policies based on a deadline, as well as the feasibility of the federated authentication model to combine public and on-premise clouds.López Huguet, S. (2021). Elastic, Interoperable and Container-based Cloud Infrastructures for High Performance Computing [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/172327TESISCompendi

    Protecting applications using trusted execution environments

    Get PDF
    While cloud computing has been broadly adopted, companies that deal with sensitive data are still reluctant to do so due to privacy concerns or legal restrictions. Vulnerabilities in complex cloud infrastructures, resource sharing among tenants, and malicious insiders pose a real threat to the confidentiality and integrity of sensitive customer data. In recent years trusted execution environments (TEEs), hardware-enforced isolated regions that can protect code and data from the rest of the system, have become available as part of commodity CPUs. However, designing applications for the execution within TEEs requires careful consideration of the elevated threats that come with running in a fully untrusted environment. Interaction with the environment should be minimised, but some cooperation with the untrusted host is required, e.g. for disk and network I/O, via a host interface. Implementing this interface while maintaining the security of sensitive application code and data is a fundamental challenge. This thesis addresses this challenge and discusses how TEEs can be leveraged to secure existing applications efficiently and effectively in untrusted environments. We explore this in the context of three systems that deal with the protection of TEE applications and their host interfaces: SGX-LKL is a library operating system that can run full unmodified applications within TEEs with a minimal general-purpose host interface. By providing broad system support inside the TEE, the reliance on the untrusted host can be reduced to a minimal set of low-level operations that cannot be performed inside the enclave. SGX-LKL provides transparent protection of the host interface and for both disk and network I/O. Glamdring is a framework for the semi-automated partitioning of TEE applications into an untrusted and a trusted compartment. Based on source-level annotations, it uses either dynamic or static code analysis to identify sensitive parts of an application. Taking into account the objectives of a small TCB size and low host interface complexity, it defines an application-specific host interface and generates partitioned application code. EnclaveDB is a secure database using Intel SGX based on a partitioned in-memory database engine. The core of EnclaveDB is its logging and recovery protocol for transaction durability. For this, it relies on the database log managed and persisted by the untrusted database server. EnclaveDB protects against advanced host interface attacks and ensures the confidentiality, integrity, and freshness of sensitive data.Open Acces

    Scalable and fault-tolerant data stream processing on multi-core architectures

    Get PDF
    With increasing data volumes and velocity, many applications are shifting from the classical “process-after-store” paradigm to a stream processing model: data is produced and consumed as continuous streams. Stream processing captures latency-sensitive applications as diverse as credit card fraud detection and high-frequency trading. These applications are expressed as queries of algebraic operations (e.g., aggregation) over the most recent data using windows, i.e., finite evolving views over the input streams. To guarantee correct results, streaming applications require precise window semantics (e.g., temporal ordering) for operations that maintain state. While high processing throughput and low latency are performance desiderata for stateful streaming applications, achieving both poses challenges. Computing the state of overlapping windows causes redundant aggregation operations: incremental execution (i.e., reusing previous results) reduces latency but prevents parallelization; at the same time, parallelizing window execution for stateful operations with precise semantics demands ordering guarantees and state access coordination. Finally, streams and state must be recovered to produce consistent and repeatable results in the event of failures. Given the rise of shared-memory multi-core CPU architectures and high-speed networking, we argue that it is possible to address these challenges in a single node without compromising window semantics, performance, or fault-tolerance. In this thesis, we analyze, design, and implement stream processing engines (SPEs) that achieve high performance on multi-core architectures. To this end, we introduce new approaches for in-memory processing that address the previous challenges: (i) for overlapping windows, we provide a family of window aggregation techniques that enable computation sharing based on the algebraic properties of aggregation functions; (ii) for parallel window execution, we balance parallelism and incremental execution by developing abstractions for both and combining them to a novel design; and (iii) for reliable single-node execution, we enable strong fault-tolerance guarantees without sacrificing performance by reducing the required disk I/O bandwidth using a novel persistence model. We combine the above to implement an SPE that processes hundreds of millions of tuples per second with sub-second latencies. These results reveal the opportunity to reduce resource and maintenance footprint by replacing cluster-based SPEs with single-node deployments.Open Acces

    Infrastructures and services for remote sensing data production management across multiple satellite data centers

    Get PDF
    With the number of satellite sensors and date centers being increased continuously, it is becoming a trend to manage and process massive remote sensing data from multiple distributed sources. However, the combination of multiple satellite data centers for massive remote sensing (RS) data collaborative processing still faces many challenges. In order to reduce the huge amounts of data migration and improve the efficiency of multi-datacenter collaborative process, this paper presents the infrastructures and services of the data management as well as workflow management for massive remote sensing data production. A dynamic data scheduling strategy was employed to reduce the duplication of data request and data processing. And by combining the remote sensing spatial metadata repositories and Gfarm grid file system, the unified management of the raw data, intermediate products and final products were achieved in the co-processing. In addition, multi-level task order repositories and workflow templates were used to construct the production workflow automatically. With the help of specific heuristic scheduling rules, the production tasks were executed quickly. Ultimately, the Multi-datacenter Collaborative Process System (MDCPS) were implemented for large-scale remote sensing data production based on the effective management of data and workflow. As a consequence, the performance of MDCPS in experiments environment showed that those strategies could significantly enhance the efficiency of co-processing across multiple data centers

    A Grid Based Distributed Cooperative Environment for Health Care Research

    Get PDF
    International audienceProviding a distributed cooperative environment is a challenging task, which requires a middleware infrastructure that provides, among others, management of distributed shared data, synchronization, consistency, recovery, security and privacy support. In this paper, we present the ECADeG project which proposes a layered architecture for developing distributed cooperative environments running on top of a desktop grid middleware that can encompass multiple organizations. We also present a particular cooperative environment for supporting scientific research focused at the health domain which uses the services supplied by the ECADeG architecture in order to allow researchers to share access to multiple institutions databases, visualize and analyze data by means of data mining techniques, edit research documents cooperatively, exchange information through forums and chats, etc.. Such a rich cooperative environment helps thus the establishment of partnerships between health care professionals and their institutions
    corecore