1,294 research outputs found

    Reliability for exascale computing : system modelling and error mitigation for task-parallel HPC applications

    Get PDF
    As high performance computing (HPC) systems continue to grow, their fault rate increases. Applications running on these systems have to deal with rates on the order of hours or days. Furthermore, some studies for future Exascale systems predict the rates to be on the order of minutes. As a result, efficient fault tolerance solutions are needed to be able to tolerate frequent failures. A fault tolerance solution for future HPC and Exascale systems must be low-cost, efficient and highly scalable. It should have low overhead in fault-free execution and provide fast restart because long-running applications are expected to experience many faults during the execution. Meanwhile task-based dataflow parallel programming models (PM) are becoming a popular paradigm in HPC applications at large scale. For instance, we see the adaptation of task-based dataflow parallelism in OpenMP 4.0, OmpSs PM, Argobots and Intel Threading Building Blocks. In this thesis we propose fault-tolerance solutions for task-parallel dataflow HPC applications. Specifically, first we design and implement a checkpoint/restart and message-logging framework to recover from errors. We then develop performance models to investigate the benefits of our task-level frameworks when integrated with system-wide checkpointing. Moreover, we design and implement selective task replication mechanisms to detect and recover from silent data corruptions in task-parallel dataflow HPC applications. Finally, we introduce a runtime-based coding scheme to detect and recover from memory errors in these applications. Considering the span of all of our schemes, we see that they provide a fairly high failure coverage where both computation and memory is protected against errors.A medida que los Sistemas de Cómputo de Alto rendimiento (HPC por sus siglas en inglés) siguen creciendo, también las tasas de fallos aumentan. Las aplicaciones que se ejecutan en estos sistemas tienen una tasa de fallos que pueden estar en el orden de horas o días. Además, algunos estudios predicen que los fallos estarán en el orden de minutos en los Sistemas Exascale. Por lo tanto, son necesarias soluciones eficientes para la tolerancia a fallos que puedan tolerar fallos frecuentes. Las soluciones para tolerancia a fallos en los Sistemas futuros de HPC y Exascale tienen que ser de bajo costo, eficientes y altamente escalable. El sobrecosto en la ejecución sin fallos debe ser bajo y también se debe proporcionar reinicio rápido, ya que se espera que las aplicaciones de larga duración experimenten muchos fallos durante la ejecución. Por otra parte, los modelos de programación paralelas basados en tareas ordenadas de acuerdo a sus dependencias de datos, se están convirtiendo en un paradigma popular en aplicaciones HPC a gran escala. Por ejemplo, los siguientes modelos de programación paralela incluyen este tipo de modelo de programación OpenMP 4.0, OmpSs, Argobots e Intel Threading Building Blocks. En esta tesis proponemos soluciones de tolerancia a fallos para aplicaciones de HPC programadas en un modelo de programación paralelo basado tareas. Específicamente, en primer lugar, diseñamos e implementamos mecanismos “checkpoint/restart” y “message-logging” para recuperarse de los errores. Para investigar los beneficios de nuestras herramientas a nivel de tarea cuando se integra con los “system-wide checkpointing” se han desarrollado modelos de rendimiento. Por otra parte, diseñamos e implementamos mecanismos de replicación selectiva de tareas que permiten detectar y recuperarse de daños de datos silenciosos en aplicaciones programadas siguiendo el modelo de programación paralela basadas en tareas. Por último, se introduce un esquema de codificación que funciona en tiempo de ejecución para detectar y recuperarse de los errores de la memoria en estas aplicaciones. Todos los esquemas propuestos, en conjunto, proporcionan una cobertura bastante alta a los fallos tanto si estos se producen el cálculo o en la memoria.Postprint (published version

    Composition and synchronization of real-time components upon one processor

    Get PDF
    Many industrial systems have various hardware and software functions for controlling mechanics. If these functions act independently, as they do in legacy situations, their overall performance is not optimal. There is a trend towards optimizing the overall system performance and creating a synergy between the different functions in a system, which is achieved by replacing more and more dedicated, single-function hardware by software components running on programmable platforms. This increases the re-usability of the functions, but their synergy requires also that (parts of) the multiple software functions share the same embedded platform. In this work, we look at the composition of inter-dependent software functions on a shared platform from a timing perspective. We consider platforms comprised of one preemptive processor resource and, optionally, multiple non-preemptive resources. Each function is implemented by a set of tasks; the group of tasks of a function that executes on the same processor, along with its scheduler, is called a component. The tasks of a component typically have hard timing constraints. Fulfilling these timing constraints of a component requires analysis. Looking at a single function, co-operative scheduling of the tasks within a component has already proven to be a powerful tool to make the implementation of a function more predictable. For example, co-operative scheduling can accelerate the execution of a task (making it easier to satisfy timing constraints), it can reduce the cost of arbitrary preemptions (leading to more realistic execution-time estimates) and it can guarantee access to other resources without the need for arbitration by other protocols. Since timeliness is an important functional requirement, (re-)use of a component for composition and integration on a platform must deal with timing. To enable us to analyze and specify the timing requirements of a particular component in isolation from other components, we reserve and enforce the availability of all its specified resources during run-time. The real-time systems community has proposed hierarchical scheduling frameworks (HSFs) to implement this isolation between components. After admitting a component on a shared platform, a component in an HSF keeps meeting its timing constraints as long as it behaves as specified. If it violates its specification, it may be penalized, but other components are temporally isolated from the malignant effects. A component in an HSF is said to execute on a virtual platform with a dedicated processor at a speed proportional to its reserved processor supply. Three effects disturb this point of view. Firstly, processor time is supplied discontinuously. Secondly, the actual processor is faster. Thirdly, the HSF no longer guarantees the isolation of an individual component when two arbitrary components violate their specification during access to non-preemptive resources, even when access is arbitrated via well-defined real-time protocols. The scientific contributions of this work focus on these three issues. Our solutions to these issues cover the system design from component requirements to run-time allocation. Firstly, we present a novel scheduling method that enables us to integrate the component into an HSF. It guarantees that each integrated component executes its tasks exactly in the same order regardless of a continuous or a discontinuous supply of processor time. Using our method, the component executes on a virtual platform and it only experiences that the processor speed is different from the actual processor speed. As a result, we can focus on the traditional scheduling problem of meeting deadline constraints of tasks on a uni-processor platform. For such platforms, we show how scheduling tasks co-operatively within a component helps to meet the deadlines of this component. We compare the strength of these cooperative scheduling techniques to theoretically optimal schedulers. Secondly, we standardize the way of computing the resource requirements of a component, even in the presence of non-preemptive resources. We can therefore apply the same timing analysis to the components in an HSF as to the tasks inside, regardless of their scheduling or their protocol being used for non-preemptive resources. This increases the re-usability of the timing analysis of components. We also make non-preemptive resources transparent during the development cycle of a component, i.e., the developer of a component can be unaware of the actual protocol being used in an HSF. Components can therefore be unaware that access to non-preemptive resources requires arbitration. Finally, we complement the existing real-time protocols for arbitrating access to non-preemptive resources with mechanisms to confine temporal faults to those components in the HSF that share the same non-preemptive resources. We compare the overheads of sharing non-preemptive resources between components with and without mechanisms for confinement of temporal faults. We do this by means of experiments within an HSF-enabled real-time operating system

    An energy-aware architecture : a practical implementation for autonomous underwater vehicles

    Get PDF
    Energy awareness, fault tolerance and performance estimation are important aspects for extending the autonomy levels of today’s autonomous vehicles. Those are related to the concepts of survivability and reliability, two important factors that often limit the trust of end users in conducting large-scale deployments of such vehicles. With the aim of preparing the way for persistent autonomous operations this work focuses its efforts on investigating those effects on underwater vehicles capable of long-term missions. A novel energy-aware architecture for autonomous underwater vehicles (AUVs) is presented. This, by monitoring at runtime the vehicle’s energy usage, is capable of detecting and mitigating failures in the propulsion subsystem, one of the most common sources of mission-time problems. Furthermore it estimates the vehicle’s performance when operating in unknown environments and in the presence of external disturbances. These capabilities are a great contribution for reducing the operational uncertainty that most underwater platforms face during their deployment. Using knowledge collected while conducting real missions the proposed architecture allows the optimisation of on-board resource usage. This improves the vehicle’s effectiveness when operating in unknown stochastic scenarios or when facing the problem of resource scarcity. The architecture has been implemented on a real vehicle, Nessie AUV, used for real sea experiments as part of multiple research projects. These gave the opportunity of evaluating the improvements of the proposed system when considering more complex autonomous tasks. Together with Nessie AUV, the commercial platform IVER3 AUV has been involved in the evaluating the feasibility of this approach. Results and operational experience, gathered both in real sea scenarios and in controlled environment experiments, are discussed in detail showing the benefits and the operational constraints of the introduced architecture, alongside suggestions for future research directions

    Re-use of tests and arguments for assesing dependable mixed-critically systems

    Get PDF
    The safety assessment of mixed-criticality systems (MCS) is a challenging activity due to system heterogeneity, design constraints and increasing complexity. The foundation for MCSs is the integrated architecture paradigm, where a compact hardware comprises multiple execution platforms and communication interfaces to implement concurrent functions with different safety requirements. Besides a computing platform providing adequate isolation and fault tolerance mechanism, the development of an MCS application shall also comply with the guidelines defined by the safety standards. A way to lower the overall MCS certification cost is to adopt a platform-based design (PBD) development approach. PBD is a model-based development (MBD) approach, where separate models of logic, hardware and deployment support the analysis of the resulting system properties and behaviour. The PBD development of MCSs benefits from a composition of modular safety properties (e.g. modular safety cases), which support the derivation of mixed-criticality product lines. The validation and verification (V&V) activities claim a substantial effort during the development of programmable electronics for safety-critical applications. As for the MCS dependability assessment, the purpose of the V&V is to provide evidences supporting the safety claims. The model-based development of MCSs adds more V&V tasks, because additional analysis (e.g., simulations) need to be carried out during the design phase. During the MCS integration phase, typically hardware-in-the-loop (HiL) plant simulators support the V&V campaigns, where test automation and fault-injection are the key to test repeatability and thorough exercise of the safety mechanisms. This dissertation proposes several V&V artefacts re-use strategies to perform an early verification at system level for a distributed MCS, artefacts that later would be reused up to the final stages in the development process: a test code re-use to verify the fault-tolerance mechanisms on a functional model of the system combined with a non-intrusive software fault-injection, a model to X-in-the-loop (XiL) and code-to-XiL re-use to provide models of the plant and distributed embedded nodes suited to the HiL simulator, and finally, an argumentation framework to support the automated composition and staged completion of modular safety-cases for dependability assessment, in the context of the platform-based development of mixed-criticality systems relying on the DREAMS harmonized platform.La dificultad para evaluar la seguridad de los sistemas de criticidad mixta (SCM) aumenta con la heterogeneidad del sistema, las restricciones de diseño y una complejidad creciente. Los SCM adoptan el paradigma de arquitectura integrada, donde un hardware embebido compacto comprende múltiples plataformas de ejecución e interfaces de comunicación para implementar funciones concurrentes y con diferentes requisitos de seguridad. Además de una plataforma de computación que provea un aislamiento y mecanismos de tolerancia a fallos adecuados, el desarrollo de una aplicación SCM además debe cumplir con las directrices definidas por las normas de seguridad. Una forma de reducir el coste global de la certificación de un SCM es adoptar un enfoque de desarrollo basado en plataforma (DBP). DBP es un enfoque de desarrollo basado en modelos (DBM), en el que modelos separados de lógica, hardware y despliegue soportan el análisis de las propiedades y el comportamiento emergente del sistema diseñado. El desarrollo DBP de SCMs se beneficia de una composición modular de propiedades de seguridad (por ejemplo, casos de seguridad modulares), que facilitan la definición de líneas de productos de criticidad mixta. Las actividades de verificación y validación (V&V) representan un esfuerzo sustancial durante el desarrollo de aplicaciones basadas en electrónica confiable. En la evaluación de la seguridad de un SCM el propósito de las actividades de V&V es obtener las evidencias que apoyen las aseveraciones de seguridad. El desarrollo basado en modelos de un SCM incrementa las tareas de V&V, porque permite realizar análisis adicionales (por ejemplo, simulaciones) durante la fase de diseño. En las campañas de pruebas de integración de un SCM habitualmente se emplean simuladores de planta hardware-in-the-loop (HiL), en donde la automatización de pruebas y la inyección de faltas son la clave para la repetitividad de las pruebas y para ejercitar completamente los mecanismos de tolerancia a fallos. Esta tesis propone diversas estrategias de reutilización de artefactos de V&V para la verificación temprana de un MCS distribuido, artefactos que se emplearán en ulteriores fases del desarrollo: la reutilización de código de prueba para verificar los mecanismos de tolerancia a fallos sobre un modelo funcional del sistema combinado con una inyección de fallos de software no intrusiva, la reutilización de modelo a X-in-the-loop (XiL) y código a XiL para obtener modelos de planta y nodos distribuidos aptos para el simulador HiL y, finalmente, un marco de argumentación para la composición automatizada y la compleción escalonada de casos de seguridad modulares, en el contexto del desarrollo basado en plataformas de sistemas de criticidad mixta empleando la plataforma armonizada DREAMS.Kritikotasun nahastuko sistemen segurtasun ebaluazioa jarduera neketsua da beraien heterogeneotasuna dela eta. Sistema hauen oinarria arkitektura integratuen paradigman datza, non hardware konpaktu batek exekuzio plataforma eta komunikazio interfaze ugari integratu ahal dituen segurtasun baldintza desberdineko funtzio konkurrenteak inplementatzeko. Konputazio plataformek isolamendu eta akatsen aurkako mekanismo egokiak emateaz gain, segurtasun arauek definituriko jarraibideak jarraitu behar dituzte kritikotasun mistodun aplikazioen garapenean. Sistema hauen zertifikazio prozesuaren kostua murrizteko aukera bat plataformetan oinarritutako garapenean (PBD) datza. Garapen planteamendu hau modeloetan oinarrituriko garapena da (MBD) non modeloaren logika, hardware eta garapen desberdinak sistemaren propietateen eta portaeraren aurka aztertzen diren. Kritikotasun mistodun sistemen PBD garapenak etekina ateratzen dio moduluetan oinarrituriko segurtasun propietateei, adibidez: segurtasun kasu modularrak (MSC). Modulu hauek kritikotasun mistodun produktu-lerroak ere hartzen dituzte kontutan. Berifikazio eta balioztatze (V&V) jarduerek esfortzu kontsideragarria eskatzen dute segurtasun-kiritikoetarako elektronika programagarrien garapenean. Kritikotasun mistodun sistemen konfiantzaren ebaluazioaren eta V&V jardueren helburua segurtasun eskariak jasotzen dituzten frogak proportzionatzea da. Kritikotasun mistodun sistemen modelo bidezko garapenek zeregin gehigarriak atxikitzen dizkio V&V jarduerari, fase honetan analisi gehigarriak (hots, simulazioak) zehazten direlako. Bestalde, kritikotasun mistodun sistemen integrazio fasean, hardware-in-the-loop (Hil) simulazio plantek V&V iniziatibak sostengatzen dituzte non testen automatizazioan eta akatsen txertaketan funtsezko jarduerak diren. Jarduera hauek frogen errepikapena eta segurtasun mekanismoak egiaztzea ahalbidetzen dute. Tesi honek V&V artefaktuen berrerabilpenerako estrategiak proposatzen ditu, kritikotasun mistodun sistemen egiaztatze azkarrerako sistema mailan eta garapen prozesuko azken faseetaraino erabili daitezkeenak. Esate baterako, test kodearen berrabilpena akats aurkako mekanismoak egiaztatzeko, modelotik X-in-the-loop (XiL)-ra eta kodetik XiL-rako konbertsioa HiL simulaziorako eta argumentazio egitura bat DREAMS Europear proiektuan definituriko arkitektura estiloan oinarrituriko segurtasun kasu modularrak automatikoki eta gradualki sortzeko

    Improved risk analysis for large projects: Bayesian networks approach

    Get PDF
    PhDGenerally risk is seen as an abstract concept which is difficult to measure. In this thesis, we consider quantification in the broader sense by measuring risk in the context of large projects. By improved risk measurement, it may be possible to identify and control risks in such a way that the project is completed successfully in spite of the risks. This thesis considers the trade-offs that may be made in project risk management, specifically time, cost and quality. The main objective is to provide a model which addresses the real problems and questions that project managers encounter, such as: • If I can afford only minimal resources, how much quality is it possible to achieve? • What resources do I need in order to achieve the highest quality possible? • If I have limited resources and I want the highest quality, how much functionality do I need to lose? We propose the use of a causal risk framework that is an improvement on the traditional modelling approaches, such as the risk register approach, and therefore contributes to better decision making. The approach is based on Bayesian Networks (BNs). BNs provide a framework for causal modelling and offer a potential solution to some of the classical modelling problems. Researchers have recently attempted to build BN models that incorporate relationships between time, cost, quality, functionality and various process variables. This thesis analyses such BN models and as part of a new validation study identifies their strengths and weaknesses. BNs have shown considerable promise in addressing the aforementioned problems, but previous BN models have not directly solved the trade-off problem. Major weaknesses are that they do not allow sensible risk event measurement and they do not allow full trade-off analysis. The main hypothesis is that it is possible to build BN models that overcome these limitations without compromising their basic philosophy

    Schedulability Analysis of Distributed Multi-core Avionics Systems with UPPAAL

    Get PDF

    Principles of Security and Trust: 7th International Conference, POST 2018, Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2018, Thessaloniki, Greece, April 14-20, 2018, Proceedings

    Get PDF
    authentication; computer science; computer software selection and evaluation; cryptography; data privacy; formal logic; formal methods; formal specification; internet; privacy; program compilers; programming languages; security analysis; security systems; semantics; separation logic; software engineering; specifications; verification; world wide we
    corecore